<div dir="ltr"><div class="gmail_default" style="font-family:verdana,sans-serif">Arnaud,</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">
There have been threads on this on the SO-devel list. Many people use the same ID for discontinuous features such as CDS, and my feeling is that this is tacitly accepted, but IDs should never be shared between different feature types (e.g. transcript and CDS). </div>
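<div class="gmail_default" style="font-family:verdana,sans-serif">A check for that rule could be sketched roughly as follows. It assumes tab-separated GFF3 feature lines with plain key=value attributes in the 9th column; the function name find_cross_type_id_clashes is made up for illustration. IDs repeated within a single feature type (the discontinuous-feature case) are deliberately not reported.</div>
<pre>
```python
from collections import defaultdict


def find_cross_type_id_clashes(gff3_lines):
    """Return {ID: sorted feature types} for IDs used by more than one feature type.

    Hypothetical helper: an ID repeated within one type (e.g. several CDS
    segments of a discontinuous feature) is not treated as a clash.
    """
    types_by_id = defaultdict(set)
    for line in gff3_lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives (##...) and comments (#...).
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        feature_type, attributes = columns[2], columns[8]
        for attribute in attributes.split(";"):
            key, _, value = attribute.partition("=")
            if key.strip() == "ID" and value:
                types_by_id[value.strip()].add(feature_type)
    return {
        feature_id: sorted(types)
        for feature_id, types in types_by_id.items()
        if len(types) != 1
    }


if __name__ == "__main__":
    example = [
        "scaff\tensembl\ttranscript\t1\t200\t.\t-\t.\tID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2",
        "scaff\t.\tCDS\t1\t198\t.\t-\t0\tID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1",
    ]
    print(find_cross_type_id_clashes(example))
```
</pre>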
<div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">See <a href="http://gmod.org/wiki/GFF3#Discontinuous_Features">http://gmod.org/wiki/GFF3#Discontinuous_Features</a></div>
<div class="gmail_default"><font face="verdana, sans-serif">and <a href="http://www.sequenceontology.org/gff3.shtml">http://www.sequenceontology.org/gff3.shtml</a></font><br></div><div class="gmail_default"><font face="verdana, sans-serif"><br>
</font></div><div class="gmail_default"><font face="verdana, sans-serif">Note in the 2nd link that the CDS are marked up as discontinuous features with shared IDs in the example.</font></div><div class="gmail_default"><font face="verdana, sans-serif"><br>
</font></div><div class="gmail_default" style="font-family:verdana,sans-serif">Are you planning to resolve just the clash of IDs between features, or to add suffixes to the CDS lines? I'm assuming that the latter will break some browser visualizations where features are linked based on their ID and not the parent ID. Of course, that is not necessarily the driver of GFF3 formatting, but it is useful to remember.</div>
<div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">cheers</div><div class="gmail_default"><div class="gmail_default"><font face="verdana, sans-serif">D</font></div>
</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">
On 24 February 2014 10:10, Arnaud Kerhornou <span dir="ltr"><<a href="mailto:arnaud@ebi.ac.uk" target="_blank">arnaud@ebi.ac.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<br>
<div>
<div>Hello Hans,<br>
<br>
          Sorry about that, it's something we missed. This will be
          corrected in the coming release of Ensembl Genomes, which will
          be out around mid-March, so feel free to correct it on your side
          in the meantime.<br>
Note that the next release of bread wheat will include an
updated gene set.<br>
<br>
Best regards,<br>
Arnaud<div><div class="h5"><br>
<br>
On 22/02/2014 01:34, Hans Vasquez-Gross wrote:<br>
</div></div></div>
<blockquote type="cite"><div><div class="h5">
<div dir="ltr">
<div>Hello,</div>
<div><br>
</div>
            <div>I recently downloaded the MIPS GFF3 annotation provided
              on your FTP site for Triticum_aestivum.</div>
<div><br>
</div>
<div><a href="ftp://ftp.ensemblgenomes.org/pub/plants/release-21/gff3/triticum_aestivum/Triticum_aestivum.IWGSP1.21.gff3.gz" target="_blank">ftp://ftp.ensemblgenomes.org/pub/plants/release-21/gff3/triticum_aestivum/Triticum_aestivum.IWGSP1.21.gff3.gz</a><br>
</div>
<div><br>
</div>
<div dir="ltr">
              <div>I tried loading this file for visualization in a genome
                browser, but it does not validate. There seems to be a
                problem with the way the ID= attribute in the 9th column is
                set up. According to SO (<a href="http://www.sequenceontology.org/gff3.shtml" target="_blank">http://www.sequenceontology.org/gff3.shtml</a>),
                the ID= in the 9th column MUST be unique. But currently,
                all transcript/CDS/exon relationships have the same ID
                collision issue, which I'll explain below using the first
                example problem.</div>
<div><br>
</div>
<div>If you take a look at lines 133-138 in the gff3 file,
you should see this:</div>
<div>
<div>##sequence-region IWGSC_CSS_3AS_scaff_369935 1 200</div>
<div>IWGSC_CSS_3AS_scaff_369935 ensembl
protein_coding_gene 1 200 . -
.
ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum</div>
<div>IWGSC_CSS_3AS_scaff_369935 ensembl transcript
1 200 . - .
ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum</div>
<div>IWGSC_CSS_3AS_scaff_369935 . CDS 1
198 . - 0
ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1</div>
<div>IWGSC_CSS_3AS_scaff_369935 . exon 1
200 . - .
ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;constitutive=1;ensembl_phase=-1;rank=1</div>
<div>IWGSC_CSS_3AS_scaff_369935 .
five_prime_UTR 199 200 . - .
Parent=Traes_3AS_775C097A2.1;</div>
</div>
<div><br>
</div>
              <div>The transcript and CDS definitions have exactly the same
                ID, "Traes_3AS_775C097A2.1", which is causing the
                naming collision. You will also notice that in the CDS
                definition line, the ID= and Parent= values are identical.
                The Parent in this case is meant to refer to the
                transcript ID, but the CDS has the same ID. </div>
<div><br>
</div>
              <div>Proposed solution:</div>
              <div>Any CDS ID could have a "C" inserted after the period.
                For example, Traes_3AS_775C097A2.1 would become
                Traes_3AS_775C097A2.C1. This is similar to what you are
                already doing for the exons. The exon line would then have to be
                updated with this new ID in its Parent= string. The
                new GFF3 block for this transcript definition would
                then be:</div>
<div><br>
<div>##sequence-region IWGSC_CSS_3AS_scaff_369935 1 200</div>
<div>IWGSC_CSS_3AS_scaff_369935 ensembl
protein_coding_gene 1 200 . -
.
ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum</div>
<div>IWGSC_CSS_3AS_scaff_369935 ensembl transcript
1 200 . - .
ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum</div>
<div>IWGSC_CSS_3AS_scaff_369935 . CDS 1
198 . - 0
ID=Traes_3AS_775C097A2.C1;Parent=Traes_3AS_775C097A2.1;rank=1</div>
<div>IWGSC_CSS_3AS_scaff_369935 . exon 1
200 . - .
ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.C1;constitutive=1;ensembl_phase=-1;rank=1</div>
<div>IWGSC_CSS_3AS_scaff_369935 .
five_prime_UTR 199 200 . - .
Parent=Traes_3AS_775C097A2.1;</div>
</div>
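              <div>The proposed CDS renaming could be scripted along these lines. This is only a sketch: it assumes tab-separated columns (the tabs are collapsed in the excerpt above) and that the clashing CDS and transcript IDs are identical strings, and the function name rename_clashing_cds_ids is made up for illustration. It rewrites only the CDS ID= value; Parent= attributes, including the exon Parent change shown above, are left untouched and would need separate handling.</div>
<pre>
```python
import re

# Matches the ID attribute at the start of column 9 or after a ';'.
ID_PATTERN = re.compile(r"(?:^|;)ID=([^;]+)")


def rename_clashing_cds_ids(gff3_lines):
    """Yield GFF3 lines, inserting 'C' into CDS IDs that equal a transcript ID.

    Hypothetical helper: X.1 becomes X.C1. CDS rows that share an ID all
    receive the same new ID, so discontinuous CDS stay grouped.
    """
    lines = [line.rstrip("\n") for line in gff3_lines]

    # First pass: collect every transcript ID.
    transcript_ids = set()
    for line in lines:
        columns = line.split("\t")
        if len(columns) == 9 and columns[2] == "transcript":
            match = ID_PATTERN.search(columns[8])
            if match:
                transcript_ids.add(match.group(1))

    # Second pass: rewrite only the ID value of clashing CDS rows.
    for line in lines:
        columns = line.split("\t")
        if len(columns) == 9 and columns[2] == "CDS":
            match = ID_PATTERN.search(columns[8])
            if match and match.group(1) in transcript_ids:
                stem, dot, suffix = match.group(1).rpartition(".")
                new_id = stem + dot + "C" + suffix
                attributes = columns[8]
                columns[8] = (
                    attributes[: match.start(1)] + new_id + attributes[match.end(1):]
                )
                line = "\t".join(columns)
        yield line
```
</pre>
              <div>Directive and comment lines pass through unchanged because they do not split into nine columns.</div>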
<div><br>
</div>
              <div>Would it be a quick fix on your side to regenerate the
                data so that it is valid? If not, I'll write my own script next
                week to fix the errors in the GFF3 file.</div>
<div><br>
</div>
<div>Cheers,</div>
<div>-Hans</div>
<div><br>
</div>
</div>
</div>
<br>
<br>
</div></div><pre>_______________________________________________
Dev mailing list <a href="mailto:Dev@ensembl.org" target="_blank">Dev@ensembl.org</a>
Posting guidelines and subscribe/unsubscribe info: <a href="http://lists.ensembl.org/mailman/listinfo/dev" target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a>
Ensembl Blog: <a href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a>
</pre>
</blockquote>
<br>
<br>
</div>
<br>
</div>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br>Ensembl Genomes | VectorBase | i5K insect genome initiative
</div>