<div dir="ltr">Hi Maria,<div><br></div><div>Thanks for your email. I'll try my best to answer your questions inline below. A couple of things to bear in mind:</div><div><br></div><div>1) The GTF/GFF parser is still in development. It originally supported only GTF, and a strict subset of GTF at that. We had multiple requests to support GFF as well, hence where we are now. As a result there may be lingering issues, and as you've noted the documentation is currently very sparse and incomplete.</div><div><br></div><div>2) GFF is a poorly defined specification and is not well adhered to in many cases, making it difficult to parse consistently and reliably</div><div><br></div><div>3) The GFF parser was written to support parsing GFF files as produced by Ensembl (for Ensembl genes) and NCBI (for RefSeq genes). The NCBI GFF in particular has many quirks that make it more complicated to parse; this explains the sometimes odd logic in the script.</div><div><br></div><div class="gmail_extra"><br><div class="gmail_quote">On 30 October 2015 at 15:33, Maria <span dir="ltr"><<a href="mailto:maria.bernard@jouy.inra.fr" target="_blank">maria.bernard@jouy.inra.fr</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<br>
<div>
Dear developpers,<br>
<br>
I am a bioinformatic engineer working on various species, and some
of them are not in EnsEMBL database (such as the trout genome, for
now).<br>
Biologists are interested in the annotation of their variant, and
we given them the possibility to use Variant_effect_predictor API
via our Galaxy instance.<br>
<br>
For non EnsEMBL species I use gtf2vep to construct the "annotation
database" before VEP. Here are the command lines I use:<br>
<font face="Courier New, Courier, monospace"><a href="http://gtf2vep.pl" target="_blank">gtf2vep.pl</a> -i
genome.gff -f genome.fa -s genomeName --dir database_directory<br>
<a href="http://variant_effect_predictor.pl" target="_blank">variant_effect_predictor.pl</a> --species genomeName -i genome.vcf
--format vcf -o genome.vep --offline --no_stats --dir_cache
database_directory --verbose</font><br>
<br>
I have 3 different questions/comments about the use of these two
programs together. I will take the trout and human genomes to
illustrate.<br>
<ol>
<li><u>GFF format specifications</u></li>
</ol>
<p>I clearly understand the GFF/GFT format asked. But I could not
find anywhere the minimum necessary informations and the way to
express them in each field of the annotation file. <br>
</p>
<p>Trout annotation file (that you could find here : <a href="http://www.genoscope.cns.fr/trout/data/" target="_blank"></a><a href="http://www.genoscope.cns.fr/trout/data/" target="_blank">http://www.genoscope.cns.fr/trout/data/</a>),
looks like this:</p>
<p><font face="Courier New, Courier, monospace">seq_name
source feature_type start stop score strand
frame attribute <br>
chr_1 Gaze Gene 94130 196267 36.8621 +
. Gene GSONMG00071408001<br>
chr_1 Gaze mRNA 94130 196267 36.8621 +
. mRNA GSONMT00071408001<br>
chr_1 Gaze CDS 94130 94318 6.0000 + .
CDS GSONMT00071408001<br>
chr_1 Gaze CDS 115360 115542 10.5000 +
. CDS GSONMT00071408001<br>
chr_1 Gaze CDS 116586 116839 11.0000 +
. CDS GSONMT00071408001<br>
chr_1 Gaze CDS 154563 154705 5.0000 +
. CDS GSONMT00071408001<br>
chr_1 Gaze CDS 195996 196267 6.0000 +
. CDS GSONMT00071408001</font><br>
</p>
<p>Here is what I founded in the gtf2vep code (version 81), maybe
I passed through someones.</p>
<ul>
<li>particular feature_type are read and the other are excluded:</li>
</ul>
<p>
</p>
<div style="color:rgb(0,0,0);font-family:'Lucida Grande',verdana,arial,helvetica,sans-serif;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:17.1429px;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)"><big>included : gene transcript mrna rrna trna ncrna
primary_transcript c_gene_segment d_gene_segment
vd_gene_segment j_gene_segment v_gene_segment lincrna_gene
mirna_gene rrna_gene snorna_gene snrna_gene lincrna mirna
snorna snrna nmd_transcript_variant pseudogene
processed_pseudogene pseudogenic_transcript
aberrant_processed_transcript nc_primary_transcript
processed_transcript gene_segment exon cds<br></big></div></div></div></blockquote><div><br></div><div>Correct. Note that some types are effectively duplicates of others to support the different GFF specifications produced by Ensembl and NCBI. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div><div style="color:rgb(0,0,0);font-family:'Lucida Grande',verdana,arial,helvetica,sans-serif;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:17.1429px;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)"><big>
</big>
<ul>
<li><big>for transcript lines, particular attributes are
necessery, and some are optionnal.</big></li>
<ul>
<li><big>necesary : gene_id , transcript_id,
transcript_biotype (if feature_type is not a particular
transcript type)</big></li></ul></ul></div></div></div></blockquote><div>gene_id is not required. If gene entries are present in the file (they may be used to store gene symbols, for example), they can be referred to by "gene_id" or by "parent_id"</div><div><br></div><div>transcript_id may also be "ID" or "name" on transcript lines.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div><div style="color:rgb(0,0,0);font-family:'Lucida Grande',verdana,arial,helvetica,sans-serif;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:17.1429px;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)"><ul><ul>
<li><big>optionnal : </big><big>transcript_biotype (if
feature type is a particular transcript type such as
mRNA), transcript_protein ...</big></li>
</ul>
</ul>
<ul>
<li><big>"protein_coding" transcript type (indicated as mRNA
in feature type column or whith "protein_coding" as
transcript_type attribute), must be reliated with exon/CDS
line(s).</big></li></ul></div></div></div></blockquote><div>Correct. The logic to infer biotype can be very complex, especially with the NCBI GFF which does not use a consistent way of describing them. If you have biotype data available then the script will work better if you enter it as a transcript_biotype attribute, though make sure to use biotypes from the Sequence Ontology.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div><div style="color:rgb(0,0,0);font-family:'Lucida Grande',verdana,arial,helvetica,sans-serif;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:17.1429px;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)"><ul>
</ul>
<ul>
<li><big>CDS lines must overlap exon lines and must have a
frame (0,1,or 2 and not "." in column 8), and (not shure)
must have a transcript_id attribute (other attribute are
optionnal)</big></li>
</ul>
<ul>
<li><big>exon lines must have at least a transcript_id
attribute.</big></li></ul></div></div></div></blockquote><div>Yes; the parent_id attribute may also be used to identify the relevant transcript for both exon and CDS lines.</div><div><br></div><div>Furthermore, exon and CDS lines must occur in the file _after_ their parent transcript entries. It looks from your example data like you are following this pragma.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div><div style="color:rgb(0,0,0);font-family:'Lucida Grande',verdana,arial,helvetica,sans-serif;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:17.1429px;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)"><ul>
</ul>
<p><big>So finally, my trout GFF file looks now like this:</big><br>
</p>
<p><font face="Courier New, Courier, monospace"><big>chr_1
Gaze Gene 94130 196267 36.8621 + .
gene_id GSONMG00071408001<br>
chr_1 Gaze mRNA 94130 196267 36.8621
+ . transcript_id GSONMT00071408001<br>
chr_1 Gaze exon 94130 94318 6.0000 +
. transcript_id GSONMT00071408001<br>
chr_1 Gaze CDS 94130 94318 6.0000 +
0 transcript_id GSONMT00071408001<br>
chr_1 Gaze exon 115360 115542 10.5000
+ . transcript_id GSONMT00071408001<br>
chr_1 Gaze CDS 115360 115542 10.5000
+ 0 transcript_id GSONMT00071408001<br>
chr_1 Gaze exon 116586 116839 11.0000
+ . transcript_id GSONMT00071408001<br>
chr_1 Gaze CDS 116586 116839 11.0000
+ 0 transcript_id GSONMT00071408001<br>
chr_1 Gaze exon 154563 154705 5.0000
+ . transcript_id GSONMT00071408001<br>
chr_1 Gaze CDS 154563 154705 5.0000 +
0 transcript_id GSONMT00071408001<br>
chr_1 Gaze exon 195996 196267 6.0000
+ . transcript_id GSONMT00071408001<br>
chr_1 Gaze CDS 195996 196267 6.0000 +
0 transcript_id GSONMT00071408001</big></font><br>
</p>
</div>
<p>As I said, I found those informations by reading <a href="http://gtf2vep.pl" target="_blank">gtf2vep.pl</a>
code version 81. Do I miss something (may be for gene lines, or
gene_id in transcript line ...)?<br>
So it would be very helpfull if you can publish those kind of
informations, because I am not shure to take time to read future
API versions code to find out. <br></p></div></div></blockquote><div>You've done a great job at interpreting it, and I do apologise for the lack of documentation. We will aim to rectify this for the next Ensembl release.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div><p>
</p>
<u>2. format of sequence name in the fasta file</u><br>
<br>
As you can see, the sequence of the trout genome are currently
named chr_1, chr_Un1, .... and chrUn (it will certainly change
when included in EnsEMBL).<br>
It seem's that having the presence of "chr" causes VEP to crash. I
made some tests (on version 81 and 82) with or without "chr" in
the gff, and/or fasta, and/or vcf files. <br>
<ul>
<li>GTF2VEP tests</li>
</ul>
<p>gtf2vep works fine if you give these combinations: <br>
- GFF and FASTA with "chr" ==> database called trout_chr,<br>
- GFF and FASTA without "chr" ==> database called trout,
<br>
- GFF with "chr" and FASTA without "chr" ==> database
called trout_gchr<br>
</p>
<p>The constructed database stores fasta name sequences and change
the gff sequence name (if it does not correspond to the fasta
sequence names) by deleting « chr » at the begining. <br>
But this is not working if fasta contains « chr » and gff not.<br>
</p>
<ul>
<li>VEP tests<br>
</li>
</ul>
For the 3 precedent databases, I tested VEP whith a VCF file
containing or not "chr" in the sequence names.<br>
<br>
VEP looks for sequence names whithout "chr" even if « chr » is
specified in the VCF. So the full pipeline works only if the fasta
file does not contain « chr » in the fasta sequence name (gff name
does not matter).<br>
<br>
Why sequence names are not simply stored as they are indicated,
and softwares crash if GFF FASTA and VCF are not corresponding to
each other? Maybe I missed something, but could you also indicate
those kind of information somewhere in the documentation. (shame
on me if you already have ...) <br></div></div></blockquote><div><br></div><div>Ensembl prefers to work in a system without "chr" prefixed to the names. This creates unfortunate side effects as you've seen here; if your FASTA and GFF have "chr" prefixes, the cache will be created with "chr" names but the VEP won't be able to use it as it trims "chr" from the chromosome names of your input variant data before looking them up.</div><div><br></div><div>We'll look into ways of making this easier to deal with.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div>
<br>
<u>3. </u><u>how VEP deals with rsid in the VCF file</u><br>
<br>
Last thing I wanted to comment is the way rsid are stored in VEP
output.<br>
<br>
If VCF file (for an EnsEMBL genomie) does not contain rsID,
variant_effect_predictor may add it in the output file in the
"Existing_variation" column.<br>
exemple on the human genome:<br>
<font face="Courier New, Courier, monospace"><br>
#CHROM POS <b><font color="#3333ff">ID</font></b>
REF ALT QUAL FILTER INFO FORMAT Pickrell
Sample1<br>
chr12 939302 <b><font color="#3333ff">.</font></b>
A G . PASS
AC=1;ADP=587;AF=0.500;AN=2;HET=1;HOM=0;NC=0;WT=0;set=Varscan
GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PVAL:RBQ:RD:RDF:RDR:SDP ./.
0/1:30:213:85:128:587:36,29%:99:3,2364E-75:27:370:149:221:651</font><br>
<br>
become after VEP (one line exemple among many others)<br>
<font face="Courier New, Courier, monospace"><br>
<font color="#3333ff"><b>#Uploaded_variation</b></font>
Location Allele Gene Feature Feature_type
Consequence cDNA_position CDS_position
Protein_position Amino_acids Codons <b><font color="#3333ff">Existing_variation</font></b> Extra<br>
<font color="#3333ff"><b>12_939302_A/G</b></font>
12:939302 G ENSG00000002016 ENST00000544742
Transcript intron_variant - - - - - <b><font color="#3333ff">rs139381800</font></b>
IMPACT=MODIFIER;STRAND=-1</font><br>
<br>
but if the variant caller used we already annotated variant with
existing rsID, rsID are managed differently:<br>
<br>
<font face="Courier New, Courier, monospace"> #CHROM POS
ID REF ALT QUAL FILTER INFO FORMAT
Pickrell Sample1<br>
chr12 406292 <font color="#3333ff"><b>rs2229351</b></font>
G A 1002.77 PASS
AC=2;ADP=163;AF=0.500;AN=4;BaseQRankSum=3.502;DB;DP=69;Dels=0.00;FS=4.853;HET=1;HOM=0;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=37.00;MQ0=0;MQRankSum=0.943;NC=0;QD=14.53;ReadPosRankSum=-1.219;WT=0;set=Intersection
GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PL:PVAL:RBQ:RD:RDF:RDR:SDP
0/1:.:33,36:.:.:66:.:99:1031,0,999
0/1:26:39:21:18:163:23,93%:99:.:1,3713E-13:30:124:66:58:169<br>
</font>become (only one line result)<br>
<br>
<font face="Courier New, Courier, monospace"> <font color="#3333ff"><b>#Uploaded_variation</b></font>
Location Allele Gene Feature Feature_type
Consequence cDNA_position CDS_position
Protein_position Amino_acids Codons <b><font color="#3333ff">Existing_variation</font></b> Extra<br>
<font color="#3333ff"><b> rs2229351</b></font>
12:406292 A ENSG00000120647 ENST00000535052
Transcript intron_variant - - - - - <b><font color="#3333ff"> -</font></b> IMPACT=MODIFIER;STRAND=1<br>
</font><br>
First case, uploaded variation id is construct with sequence name,
position and alleles, and rsID is added in Existing_variation
column. <br>
Second case, uploaded variation id is the known rsID, and
Existing_variation stays empty.<br>
Would it be possible to stay in case 1 even if the VCF file
includes the rsID ? <br></div></div></blockquote><div><br></div><div>This should work for your second example; I get the following:</div><div><br></div><div><div>> echo "chr12 406292 rs2229351 G A" | perl <a href="http://variant_effect_predictor.pl">variant_effect_predictor.pl</a> -force -cache -check_existing -port 3337 -o stdout | grep -v # | head -n1</div><div>rs2229351 12:406292 A ENSG00000073614 ENST00000382815 Transcript synonymous_variant 4512 4149 1383 S tcC/tcT rs2229351 IMPACT=LOW;STRAND=-1<br></div></div><div><br></div><div>Perhaps you have not enabled --check_existing, or you are using the wrong genome (this appears to be a location from GRCh37; the VEP default is to use GRCh38)?</div><div><br></div><div>The ID you enter in your input has no bearing on what gets looked up when using --check_existing; the two are treated independently.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div>
<br>
In any cases thank you for those API which evolved since the last
few versions. A little bit more documentation would be great (not
a funny task, I know), some behaviour corrections that facilitate
my life would be very very very much appreciated but I know that
we can't satisfy every one.<br></div></div></blockquote><div><br></div><div>Thanks for sticking with it, we are always working to improve and hopefully we will have some improved GFF-parsing documentation for the next Ensembl release.</div><div><br></div><div>Regards</div><div><br></div><div>Will McLaren</div><div>Ensembl Variation</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div>
<br>
Again thank you for these tools.<br>
<br>
Best Regards<span class=""><font color="#888888"><br>
<pre cols="72">--
Maria Bernard
Bioinformatics
Sigenae team
GABI Unity
INRA de Jouy en Josas - batiment 320
Domaine de Vilvert
78352 Jouy en josas
FRANCE
</pre>
<br>
</font></span></div>
<br>
</div>
<br>_______________________________________________<br>
Dev mailing list <a href="mailto:Dev@ensembl.org">Dev@ensembl.org</a><br>
Posting guidelines and subscribe/unsubscribe info: <a href="http://lists.ensembl.org/mailman/listinfo/dev" rel="noreferrer" target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>
Ensembl Blog: <a href="http://www.ensembl.info/" rel="noreferrer" target="_blank">http://www.ensembl.info/</a><br>
<br></blockquote></div><br></div></div>