<div dir="ltr">Hi Maria,<div><br></div><div>Thanks for your email. I'll try my best to answer your questions inline below. A couple of things to bear in mind:</div><div><br></div><div>1) The GTF/GFF parser is still in development. It originally supported only GTF, and a strict subset of GTF at that. We had multiple requests to support GFF as well, hence where we are now. As a result there may be lingering issues, and as you've noted the documentation is currently very sparse and incomplete.</div><div><br></div><div>2) GFF is a poorly defined specification and is not well adhered to in many cases, making it difficult to parse consistently and reliably</div><div><br></div><div>3) The GFF parser was written to support parsing GFF files as produced by Ensembl (for Ensembl genes) and NCBI (for RefSeq genes). The NCBI GFF in particular has many quirks that make it more complicated to parse; this explains the sometimes odd logic in the script.</div><div><br></div><div class="gmail_extra"><br><div class="gmail_quote">On 30 October 2015 at 15:33, Maria <span dir="ltr"><<a href="mailto:maria.bernard@jouy.inra.fr" target="_blank">maria.bernard@jouy.inra.fr</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

  
  <div bgcolor="#FFFFFF" text="#000000">

    <br>

    <div>

      
      Dear developpers,<br>

      <br>

      I am a bioinformatic engineer working on various species, and some

      of them are not in EnsEMBL database (such as the trout genome, for

      now).<br>

      Biologists are interested in the annotation of their variant, and

      we given them the possibility to use Variant_effect_predictor API

      via our Galaxy instance.<br>

      <br>

      For non EnsEMBL species I use gtf2vep to construct the "annotation

      database" before VEP. Here are the command lines I use:<br>

      <font face="Courier New, Courier, monospace"><a href="http://gtf2vep.pl" target="_blank">gtf2vep.pl</a> -i

        genome.gff -f genome.fa -s genomeName --dir database_directory<br>

        <a href="http://variant_effect_predictor.pl" target="_blank">variant_effect_predictor.pl</a> --species genomeName -i genome.vcf

        --format vcf -o genome.vep --offline --no_stats --dir_cache

        database_directory --verbose</font><br>

      <br>

      I have 3 different questions/comments about the use of these two

      programs together. I will take the trout and human genomes to

      illustrate.<br>

      <ol>

        <li><u>GFF format specifications</u></li>

      </ol>

      <p>I clearly understand the GFF/GFT format asked. But I could not

        find anywhere the minimum necessary informations and the way to

        express them in each field of the annotation file. <br>

      </p>

      <p>Trout annotation file (that you could find here : <a href="http://www.genoscope.cns.fr/trout/data/" target="_blank"></a><a href="http://www.genoscope.cns.fr/trout/data/" target="_blank">http://www.genoscope.cns.fr/trout/data/</a>),

        looks like this:</p>

      <p><font face="Courier New, Courier, monospace">seq_name   

          source    feature_type    start    stop    score    strand   

          frame attribute   <br>

          chr_1    Gaze    Gene    94130    196267    36.8621    +   

          .    Gene GSONMG00071408001<br>

          chr_1    Gaze    mRNA    94130    196267    36.8621    +   

          .    mRNA GSONMT00071408001<br>

          chr_1    Gaze    CDS    94130    94318    6.0000    +    .   

          CDS GSONMT00071408001<br>

          chr_1    Gaze    CDS    115360    115542    10.5000    +   

          .    CDS GSONMT00071408001<br>

          chr_1    Gaze    CDS    116586    116839    11.0000    +   

          .    CDS GSONMT00071408001<br>

          chr_1    Gaze    CDS    154563    154705    5.0000    +   

          .    CDS GSONMT00071408001<br>

          chr_1    Gaze    CDS    195996    196267    6.0000    +   

          .    CDS GSONMT00071408001</font><br>

      </p>

      <p>Here is what I founded in the gtf2vep code (version 81), maybe

        I passed through someones.</p>

      <ul>

        <li>particular feature_type are read and the other are excluded:</li>

      </ul>

      <p>

        
      </p>

      <div style="color:rgb(0,0,0);font-family:'Lucida Grande',verdana,arial,helvetica,sans-serif;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:17.1429px;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)"><big>included : gene transcript mrna rrna trna ncrna

          primary_transcript c_gene_segment d_gene_segment

          vd_gene_segment j_gene_segment v_gene_segment lincrna_gene

          mirna_gene rrna_gene snorna_gene  snrna_gene lincrna mirna

          snorna snrna nmd_transcript_variant pseudogene

          processed_pseudogene pseudogenic_transcript

          aberrant_processed_transcript nc_primary_transcript

          processed_transcript gene_segment exon cds<br></big></div></div></div></blockquote><div><br></div><div>Correct. Note that some types are effectively duplicates of others to support the different GFF specifications produced by Ensembl and NCBI. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div><div style="color:rgb(0,0,0);font-family:'Lucida Grande',verdana,arial,helvetica,sans-serif;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:17.1429px;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)"><big>

        </big>

        <ul>

          <li><big>for transcript lines, particular attributes are

              necessery, and some are optionnal.</big></li>

          <ul>

            <li><big>necesary : gene_id , transcript_id,

                transcript_biotype (if feature_type is not a particular

                transcript type)</big></li></ul></ul></div></div></div></blockquote><div>gene_id is not required. If gene entries are present in the file (they may be used to store gene symbols, for example), they can be referred to by "gene_id" or by "parent_id"</div><div><br></div><div>transcript_id may also be "ID" or "name" on transcript lines.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div><div style="color:rgb(0,0,0);font-family:'Lucida Grande',verdana,arial,helvetica,sans-serif;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:17.1429px;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)"><ul><ul>

            <li><big>optionnal : </big><big>transcript_biotype (if

                feature type is a particular transcript type such as

                mRNA), transcript_protein ...</big></li>

          </ul>

        </ul>

        <ul>

          <li><big>"protein_coding" transcript type (indicated as mRNA

              in feature type column or whith "protein_coding" as

              transcript_type attribute), must be reliated with exon/CDS

              line(s).</big></li></ul></div></div></div></blockquote><div>Correct. The logic to infer biotype can be very complex, especially with the NCBI GFF which does not use a consistent way of describing them. If you have biotype data available then the script will work better if you enter it as a transcript_biotype attribute, though make sure to use biotypes from the Sequence Ontology.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div><div style="color:rgb(0,0,0);font-family:'Lucida Grande',verdana,arial,helvetica,sans-serif;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:17.1429px;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)"><ul>

        </ul>

        <ul>

          <li><big>CDS lines must overlap exon lines and must have a

              frame (0,1,or 2 and not "." in column 8), and (not shure)

              must have a transcript_id attribute (other attribute are

              optionnal)</big></li>

        </ul>

        <ul>

          <li><big>exon lines must have at least a transcript_id

              attribute.</big></li></ul></div></div></div></blockquote><div>Yes; the parent_id attribute may also be used to identify the relevant transcript for both exon and CDS lines.</div><div><br></div><div>Furthermore, exon and CDS lines must occur in the file _after_ their parent transcript entries. It looks from your example data like you are following this pragma.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div><div style="color:rgb(0,0,0);font-family:'Lucida Grande',verdana,arial,helvetica,sans-serif;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:17.1429px;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)"><ul>

        </ul>

        <p><big>So finally, my trout GFF file looks now like this:</big><br>

        </p>

        <p><font face="Courier New, Courier, monospace"><big>chr_1   

              Gaze    Gene    94130    196267    36.8621    +    .   

              gene_id GSONMG00071408001<br>

              chr_1    Gaze    mRNA    94130    196267    36.8621   

              +    .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    exon    94130    94318    6.0000    +   

              .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    CDS    94130    94318    6.0000    +   

              0    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    exon    115360    115542    10.5000   

              +    .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    CDS    115360    115542    10.5000   

              +    0    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    exon    116586    116839    11.0000   

              +    .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    CDS    116586    116839    11.0000   

              +    0    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    exon    154563    154705    5.0000   

              +    .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    CDS    154563    154705    5.0000    +   

              0    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    exon    195996    196267    6.0000   

              +    .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    CDS    195996    196267    6.0000    +   

              0    transcript_id GSONMT00071408001</big></font><br>

        </p>

      </div>

      <p>As I said, I found those informations by reading <a href="http://gtf2vep.pl" target="_blank">gtf2vep.pl</a>

        code version 81. Do I miss something (may be for gene lines, or

        gene_id in transcript line ...)?<br>

        So it would be very helpfull if you can publish those kind of

        informations, because I am not shure to take time to read future

        API versions code to find out. <br></p></div></div></blockquote><div>You've done a great job at interpreting it, and I do apologise for the lack of documentation. We will aim to rectify this for the next Ensembl release.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div><p>

      </p>

          <u>2. format of sequence name in the fasta file</u><br>

      <br>

      As you can see, the sequence of the trout genome are currently

      named chr_1, chr_Un1, .... and chrUn (it will certainly change

      when included in EnsEMBL).<br>

      It seem's that having the presence of "chr" causes VEP to crash. I

      made some tests (on version 81 and 82) with or without "chr" in

      the gff, and/or fasta, and/or vcf files. <br>

      <ul>

        <li>GTF2VEP tests</li>

      </ul>

      <p>gtf2vep works fine if you give these combinations: <br>

            - GFF and FASTA with "chr" ==> database called trout_chr,<br>

            - GFF and FASTA without "chr" ==> database called trout,

        <br>

            - GFF with "chr" and FASTA without "chr" ==> database

        called trout_gchr<br>

      </p>

      <p>The constructed database stores fasta name sequences and change

        the gff sequence name (if it does not correspond to the fasta

        sequence names) by deleting « chr » at the begining. <br>

        But this is not working if fasta contains « chr » and gff not.<br>

      </p>

      <ul>

        <li>VEP tests<br>

        </li>

      </ul>

      For the 3 precedent databases, I tested VEP whith a VCF file

      containing or not "chr" in the sequence names.<br>

      <br>

      VEP looks for sequence names whithout "chr" even if « chr » is

      specified in the VCF. So the full pipeline works only if the fasta

      file does not contain « chr » in the fasta sequence name (gff name

      does not matter).<br>

      <br>

      Why sequence names are not simply stored as they are indicated,

      and softwares crash if GFF FASTA and VCF are not corresponding to

      each other? Maybe I missed something, but could you also indicate

      those kind of information somewhere in the documentation. (shame

      on me if you already have ...) <br></div></div></blockquote><div><br></div><div>Ensembl prefers to work in a system without "chr" prefixed to the names. This creates unfortunate side effects as you've seen here; if your FASTA and GFF have "chr" prefixes, the cache will be created with "chr" names but the VEP won't be able to use it as it trims "chr" from the chromosome names of your input variant data before looking them up.</div><div><br></div><div>We'll look into ways of making this easier to deal with.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div>

      <br>

          <u>3. </u><u>how VEP deals with rsid in the VCF file</u><br>

      <br>

      Last thing I wanted to comment is the way rsid are stored in VEP

      output.<br>

      <br>

      If VCF file (for an EnsEMBL genomie) does not contain rsID,

      variant_effect_predictor may add it in the output file in the

      "Existing_variation" column.<br>

      exemple on the human genome:<br>

      <font face="Courier New, Courier, monospace"><br>

            #CHROM    POS    <b><font color="#3333ff">ID</font></b>   

        REF    ALT    QUAL    FILTER    INFO    FORMAT    Pickrell   

        Sample1<br>

            chr12    939302    <b><font color="#3333ff">.</font></b>   

        A    G    .    PASS   

        AC=1;ADP=587;AF=0.500;AN=2;HET=1;HOM=0;NC=0;WT=0;set=Varscan   

        GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PVAL:RBQ:RD:RDF:RDR:SDP    ./.   

        0/1:30:213:85:128:587:36,29%:99:3,2364E-75:27:370:149:221:651</font><br>

      <br>

      become after VEP (one line exemple among many others)<br>

      <font face="Courier New, Courier, monospace"><br>

            <font color="#3333ff"><b>#Uploaded_variation</b></font>   

        Location    Allele    Gene    Feature    Feature_type   

        Consequence    cDNA_position    CDS_position   

        Protein_position    Amino_acids    Codons    <b><font color="#3333ff">Existing_variation</font></b>    Extra<br>

            <font color="#3333ff"><b>12_939302_A/G</b></font>   

        12:939302    G    ENSG00000002016    ENST00000544742   

        Transcript    intron_variant    -    -    -    -    -    <b><font color="#3333ff">rs139381800</font></b>   

        IMPACT=MODIFIER;STRAND=-1</font><br>

      <br>

      but if the variant caller used we already annotated variant with

      existing rsID, rsID are managed differently:<br>

      <br>

      <font face="Courier New, Courier, monospace">    #CHROM    POS   

        ID    REF    ALT    QUAL    FILTER    INFO    FORMAT   

        Pickrell    Sample1<br>

            chr12    406292    <font color="#3333ff"><b>rs2229351</b></font>   


        G    A    1002.77    PASS   

        AC=2;ADP=163;AF=0.500;AN=4;BaseQRankSum=3.502;DB;DP=69;Dels=0.00;FS=4.853;HET=1;HOM=0;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=37.00;MQ0=0;MQRankSum=0.943;NC=0;QD=14.53;ReadPosRankSum=-1.219;WT=0;set=Intersection   


        GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PL:PVAL:RBQ:RD:RDF:RDR:SDP   

        0/1:.:33,36:.:.:66:.:99:1031,0,999   

        0/1:26:39:21:18:163:23,93%:99:.:1,3713E-13:30:124:66:58:169<br>

      </font>become (only one line result)<br>

      <br>

      <font face="Courier New, Courier, monospace">    <font color="#3333ff"><b>#Uploaded_variation</b></font>   

        Location    Allele    Gene    Feature    Feature_type   

        Consequence    cDNA_position    CDS_position   

        Protein_position    Amino_acids    Codons    <b><font color="#3333ff">Existing_variation</font></b>    Extra<br>

        <font color="#3333ff"><b>    rs2229351</b></font>   

        12:406292    A    ENSG00000120647    ENST00000535052   

        Transcript    intron_variant    -    -    -    -    -   <b><font color="#3333ff"> -</font></b>    IMPACT=MODIFIER;STRAND=1<br>

      </font><br>

      First case, uploaded variation id is construct with sequence name,

      position and alleles, and rsID is added in Existing_variation

      column. <br>

      Second case, uploaded variation id is the known rsID, and

      Existing_variation stays empty.<br>

      Would it be possible to stay in case 1 even if the VCF file

      includes the rsID ? <br></div></div></blockquote><div><br></div><div>This should work for your second example; I get the following:</div><div><br></div><div><div>> echo "chr12    406292    rs2229351    G    A" | perl <a href="http://variant_effect_predictor.pl">variant_effect_predictor.pl</a> -force -cache -check_existing -port 3337 -o stdout | grep -v # | head -n1</div><div>rs2229351       12:406292       A       ENSG00000073614 ENST00000382815 Transcript      synonymous_variant      4512    4149    1383    S   tcC/tcT  rs2229351       IMPACT=LOW;STRAND=-1<br></div></div><div><br></div><div>Perhaps you have not enabled --check_existing, or you are using the wrong genome (this appears to be a location from GRCh37; the VEP default is to use GRCh38)?</div><div><br></div><div>The ID you enter in your input has no bearing on what gets looked up when using --check_existing; the two are treated independently.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div>

      <br>

      In any cases thank you for those API which evolved since the last

      few versions. A little bit more documentation would be great (not

      a funny task, I know), some behaviour corrections that facilitate

      my life would be very very very much appreciated but I know that

      we can't satisfy every one.<br></div></div></blockquote><div><br></div><div>Thanks for sticking with it, we are always working to improve and hopefully we will have some improved GFF-parsing documentation for the next Ensembl release.</div><div><br></div><div>Regards</div><div><br></div><div>Will McLaren</div><div>Ensembl Variation</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div>

      <br>

      Again thank you for these tools.<br>

      <br>

      Best Regards<span class=""><font color="#888888"><br>

      <pre cols="72">-- 

Maria Bernard

Bioinformatics


Sigenae team

GABI Unity


INRA de Jouy en Josas - batiment 320

Domaine de Vilvert

78352 Jouy en josas

FRANCE

</pre>

      <br>

    </font></span></div>

    <br>

  </div>


<br>_______________________________________________<br>

Dev mailing list    <a href="mailto:Dev@ensembl.org">Dev@ensembl.org</a><br>

Posting guidelines and subscribe/unsubscribe info: <a href="http://lists.ensembl.org/mailman/listinfo/dev" rel="noreferrer" target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>

Ensembl Blog: <a href="http://www.ensembl.info/" rel="noreferrer" target="_blank">http://www.ensembl.info/</a><br>

<br></blockquote></div><br></div></div>