Genome annotations

Genome annotation is the process of attributing structural and functional information to sequences. Annotations range from sequence similarities and biological functions to the location of regulatory motifs, expression and interactions.

Recommendations

Target user(s): Data managers, Bioinformaticians

Summary

Recommended data format: GFF3 format.
Provide comprehensive content description for column 9 in the GFF3 file.
Consistent use of external database cross references (Dbxref).
Consistent use of ontologies.

1. Data format

We recommend GFF3 file format for the representation of genome annotations.

Description

The GFF3 file format is widely used by the community and is a good option for representing genome annotations. However, some columns need more precise description than others. Columns such as the chromosome and the position have a fixed, specific meaning, whereas column 9 (“attributes”) varies in the type of information it contains. Column 9 therefore needs guidelines: currently only the ID and Name of the feature are treated as mandatory information, and the other attributes are not specified, so adopters use them in different ways.

Specifics

Guidelines for describing content for Column 9 “attributes”:

ID: Indicates the ID of the feature. IDs for each feature must be unique within the scope of the GFF file. In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID collectively represent a single feature.
Name: Display name for the feature. This is the name to be displayed to the user. Unlike IDs, there is no requirement that the Name be unique within the file.
Alias: A secondary name for the feature. It is suggested that this tag be used whenever a secondary identifier for the feature is needed, such as locus names and accession numbers. Unlike ID, there is no requirement that Alias be unique within the file.
Parent: Indicates the parent of the feature. A parent ID can be used to group exons into transcripts, transcripts into genes, and so forth. A feature may have multiple parents. Parent can *only* be used to indicate a part-of relationship.
Target: Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The format of the value is “target_id start end [strand]”, where strand is optional and may be “+” or “-”. If the target_id contains spaces, they must be escaped as the hex escape %20.
Gap: The alignment of the feature to the target if the two are not collinear (e.g. the alignment contains gaps). The alignment format is taken from the CIGAR format described in the Exonerate documentation. See “THE GAP ATTRIBUTE” in the GFF3 specification for a description of this format.
Derives_from: Used to disambiguate the relationship between one feature and another when the relationship is a temporal one rather than a purely structural “part of” one. This is needed for polycistronic genes. See “PATHOLOGICAL CASES” in the GFF3 specification for further discussion.
Note: A free text note.
Dbxref: A database cross reference. See the section “Ontology Associations and Db Cross References” of the GFF3 specification for details on the format.
Ontology_term: A cross reference to an ontology term. See the section “Ontology Associations and Db Cross References” of the GFF3 specification for details.
Is_circular: A flag to indicate whether a feature is circular. See the extended discussion in the GFF3 specification.
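
The attribute guidelines above can be illustrated with a small parser. This is a minimal sketch, not a reference implementation: the function name parse_attributes and the constant MULTI_VALUE_TAGS are hypothetical, and the sketch assumes that only Parent, Alias, Note, Dbxref and Ontology_term may carry several comma-separated values.

```python
from urllib.parse import unquote

# Assumption: these are the tags allowed to hold several comma-separated values.
MULTI_VALUE_TAGS = {"Parent", "Alias", "Note", "Dbxref", "Ontology_term"}


def parse_attributes(column9):
    """Turn a GFF3 column 9 string into a dict of tag -> value (or list of values)."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        tag, _, value = field.partition("=")
        if tag in MULTI_VALUE_TAGS:
            # Split on commas first, then decode escapes such as %2C or %20.
            attributes[tag] = [unquote(v) for v in value.split(",")]
        else:
            attributes[tag] = unquote(value)
    return attributes


example = "ID=XwPt3327-3B.2;Name=XwPt3327-3B;marker=wPt3327;type=darts"
parsed = parse_attributes(example)
print(parsed["ID"], parsed["Name"])
```

Splitting on ";" and "," before decoding matters: an escaped separator such as %3B inside a value is then kept as part of that value instead of starting a new attribute.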

2. Good practices

Use consistent abbreviation tags for databases.

Use column 9 consistently: record cross references in the Dbxref attribute.

A Dbxref value is the ID of the cross-referenced object in the form DBTAG:ID. The DBTAG indicates which database the referenced object can be found in, and ID indicates the identifier of the object within that database. IDs can contain unescaped colons but DBTAGs cannot, so parsing code should split on the first colon encountered in the attribute value.
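
The first-colon rule can be sketched as follows. The helper name split_dbxref and the tag EXAMPLE_DB are hypothetical and used only for illustration.

```python
def split_dbxref(value):
    """Split one Dbxref value into (DBTAG, ID) at the first colon only."""
    dbtag, separator, identifier = value.partition(":")
    if not separator or not dbtag or not identifier:
        raise ValueError(f"not a DBTAG:ID pair: {value!r}")
    return dbtag, identifier


print(split_dbxref("UniProt:M0Y5H6"))
# The ID part keeps its own colon because only the first colon is a separator.
print(split_dbxref("EXAMPLE_DB:gene:MLOC_65880"))
```

Using str.partition rather than str.split(":") avoids cutting an identifier that itself contains colons.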

Here is a suggestion for a consistent GFF3 format.

Original GFF3 line for Hordeum_vulgare (EnsemblPlants):

1	ensembl	gene	3656	4845	.	-	.	ID=gene:MLOC_65880;assembly_name=082214v1;biotype=protein_coding;description=Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:M0Y5H6];logic_name=ibsc;version=1

Proposed format, with the cross references added as a Dbxref attribute:

1	ensembl	gene	3656	4845	.	-	.	ID=gene:MLOC_65880;assembly_name=082214v1;biotype=protein_coding;description=Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:M0Y5H6];logic_name=ibsc;version=1;Dbxref=UniProt:M0Y5H6,EnsemblPlants:MLOC_65880
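
The change from the original to the proposed line can be sketched as a small function that merges cross references into column 9. The name add_dbxref is hypothetical, and the sketch assumes that the references passed in are already valid, escaped DBTAG:ID strings.

```python
def add_dbxref(column9, new_refs):
    """Return column 9 with new_refs merged into a single Dbxref attribute."""
    existing = []
    kept = []
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        tag, _, value = field.partition("=")
        if tag == "Dbxref":
            existing.extend(value.split(","))
        else:
            kept.append(field)
    # dict.fromkeys removes duplicates while keeping the first-seen order.
    merged = list(dict.fromkeys(existing + list(new_refs)))
    if merged:
        kept.append("Dbxref=" + ",".join(merged))
    return ";".join(kept)


original = (
    "ID=gene:MLOC_65880;assembly_name=082214v1;biotype=protein_coding;"
    "description=Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:M0Y5H6];"
    "logic_name=ibsc;version=1"
)
proposed = add_dbxref(original, ["UniProt:M0Y5H6", "EnsemblPlants:MLOC_65880"])
print(proposed)
```

Running the function a second time on its own output leaves the line unchanged, so it can be applied safely to files that already contain some Dbxref values.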

3. Metadata and Vocabularies

We recommend the use of ontologies, such as the Gene Ontology and the Sequence Ontology, for functional annotation in column 9.

4. Tools

Convert data format
You can convert different formats to GFF3 using the Bioconvert tool.

GFF3 validator (GenomeTools):

http://genometools.org/cgi-bin/gff3validator.cgi

5. Examples

GFF3 sample from the 3B annotation browser:

traes3bPseudomoleculeV1	GDEC	marker	82454936	82455352	.	-	.	ID=XwPt1159-3B;Name=XwPt1159-3B;marker=wPt1159;type=darts
traes3bPseudomoleculeV1	GDEC	marker	771172313	771172855	.	-	.	ID=XwPt2416-3B;Name=XwPt2416-3B;marker=wPt2416;type=darts
traes3bPseudomoleculeV1	GDEC	marker	12174851	12175713	.	+	.	ID=XwPt2757-3B;Name=XwPt2757-3B;marker=wPt2757;type=darts
traes3bPseudomoleculeV1	GDEC	marker	586057169	586057670	.	-	.	ID=XwPt3327-3B;Name=XwPt3327-3B;marker=wPt3327;type=darts
traes3bPseudomoleculeV1	GDEC	marker	295038909	295039410	.	-	.	ID=XwPt3327-3B.2;Name=XwPt3327-3B;marker=wPt3327;type=darts
v443_0484	GDEC	marker	134945	135646	.	+	.	ID=XwPt4933-3B;Name=XwPt4933-3B;marker=wPt4933;type=darts
traes3bPseudomoleculeV1	GDEC	marker	755916365	755916938	.	-	.	ID=XwPt5295-3B;Name=XwPt5295-3B;marker=wPt5295;type=darts
traes3bPseudomoleculeV1	GDEC	marker	236794223	236794836	.	+	.	ID=XwPt5390-3B;Name=XwPt5390-3B;marker=wPt5390;type=darts
traes3bPseudomoleculeV1	GDEC	marker	749409255	749409819	.	+	.	ID=XwPt5947-3B;Name=XwPt5947-3B;marker=wPt5947;type=darts
traes3bPseudomoleculeV1	GDEC	marker	736342105	736342613	.	-	.	ID=XwPt7301-3B;Name=XwPt7301-3B;marker=wPt7301;type=darts
traes3bPseudomoleculeV1	GDEC	marker	614658212	614659360	.	+	.	ID=XwPt7502-3B;Name=XwPt7502-3B;marker=wPt7502;type=darts
traes3bPseudomoleculeV1	GDEC	marker	765686199	765687128	.	+	.	ID=XwPt7514-3B;Name=XwPt7514-3B;marker=wPt7514;type=darts
traes3bPseudomoleculeV1	GDEC	marker	765009795	765010398	.	+	.	ID=XwPt8845-3B;Name=XwPt8845-3B;marker=wPt8845;type=darts
traes3bPseudomoleculeV1	GDEC	marker	9584806	9585578	.	+	.	ID=XwPt8855-3B;Name=XwPt8855-3B;marker=wPt8855;type=darts
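
The sample also shows the difference between ID and Name: the two XwPt3327 markers have distinct IDs but share one Name. A minimal sketch of such a check is given below; the function name attribute_counts is hypothetical, and only two lines of the sample are embedded to keep the example short.

```python
from collections import Counter


def attribute_counts(gff_lines, tag):
    """Count how often each value of `tag` occurs in column 9 of the given lines."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
        for field in columns[8].split(";"):
            key, _, value = field.partition("=")
            if key == tag:
                counts[value] += 1
    return counts


sample = [
    "traes3bPseudomoleculeV1\tGDEC\tmarker\t586057169\t586057670\t.\t-\t.\t"
    "ID=XwPt3327-3B;Name=XwPt3327-3B;marker=wPt3327;type=darts",
    "traes3bPseudomoleculeV1\tGDEC\tmarker\t295038909\t295039410\t.\t-\t.\t"
    "ID=XwPt3327-3B.2;Name=XwPt3327-3B;marker=wPt3327;type=darts",
]
ids = attribute_counts(sample, "ID")
names = attribute_counts(sample, "Name")
repeated_ids = [value for value, n in ids.items() if n > 1]
print(repeated_ids, names["XwPt3327-3B"])
```

A repeated ID is not necessarily an error, since discontinuous features legitimately reuse one ID across several lines; the list of repeated IDs is best treated as something to review.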

Writing: WDI working group
Creation date: 02 October 2014
Update: 31 July 2015

Wheat Data Interoperability Guidelines