<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=utf-8">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <br>

    <div class="moz-forward-container">

      <meta http-equiv="content-type" content="text/html; charset=utf-8">

      Dear developpers,<br>

      <br>

      I am a bioinformatic engineer working on various species, and some

      of them are not in EnsEMBL database (such as the trout genome, for

      now).<br>

      Biologists are interested in the annotation of their variant, and

      we given them the possibility to use Variant_effect_predictor API

      via our Galaxy instance.<br>

      <br>

      For non EnsEMBL species I use gtf2vep to construct the "annotation

      database" before VEP. Here are the command lines I use:<br>

      <font face="Courier New, Courier, monospace">gtf2vep.pl -i

        genome.gff -f genome.fa -s genomeName --dir database_directory<br>

        variant_effect_predictor.pl --species genomeName -i genome.vcf

        --format vcf -o genome.vep --offline --no_stats --dir_cache

        database_directory --verbose</font><br>

      <br>

      I have 3 different questions/comments about the use of these two

      programs together. I will take the trout and human genomes to

      illustrate.<br>

      <ol>

        <li><u>GFF format specifications</u></li>

      </ol>

      <p>I clearly understand the GFF/GFT format asked. But I could not

        find anywhere the minimum necessary informations and the way to

        express them in each field of the annotation file. <br>

      </p>

      <p>Trout annotation file (that you could find here : <a

          moz-do-not-send="true" class="moz-txt-link-freetext"

          href="http://www.genoscope.cns.fr/trout/data/"><a class="moz-txt-link-freetext" href="http://www.genoscope.cns.fr/trout/data/">http://www.genoscope.cns.fr/trout/data/</a></a>),

        looks like this:</p>

      <p><font face="Courier New, Courier, monospace">seq_name   

          source    feature_type    start    stop    score    strand   

          frame attribute   <br>

          chr_1    Gaze    Gene    94130    196267    36.8621    +   

          .    Gene GSONMG00071408001<br>

          chr_1    Gaze    mRNA    94130    196267    36.8621    +   

          .    mRNA GSONMT00071408001<br>

          chr_1    Gaze    CDS    94130    94318    6.0000    +    .   

          CDS GSONMT00071408001<br>

          chr_1    Gaze    CDS    115360    115542    10.5000    +   

          .    CDS GSONMT00071408001<br>

          chr_1    Gaze    CDS    116586    116839    11.0000    +   

          .    CDS GSONMT00071408001<br>

          chr_1    Gaze    CDS    154563    154705    5.0000    +   

          .    CDS GSONMT00071408001<br>

          chr_1    Gaze    CDS    195996    196267    6.0000    +   

          .    CDS GSONMT00071408001</font><br>

      </p>

      <p>Here is what I founded in the gtf2vep code (version 81), maybe

        I passed through someones.</p>

      <ul>

        <li>particular feature_type are read and the other are excluded:</li>

      </ul>

      <p>

        <meta http-equiv="content-type" content="text/html;

          charset=utf-8">

      </p>

      <div class="line" style="box-sizing: border-box; color: rgb(0, 0,

        0); font-family: 'Lucida Grande', verdana, arial, helvetica,

        sans-serif; font-size: 12px; font-style: normal; font-variant:

        normal; font-weight: normal; letter-spacing: normal;

        line-height: 17.142858505249023px; orphans: auto; text-align:

        start; text-indent: 0px; text-transform: none; white-space:

        normal; widows: auto; word-spacing: 0px;

        -webkit-text-stroke-width: 0px; background-color: rgb(255, 255,

        255);"><big>included : gene transcript mrna rrna trna ncrna

          primary_transcript c_gene_segment d_gene_segment

          vd_gene_segment j_gene_segment v_gene_segment lincrna_gene

          mirna_gene rrna_gene snorna_gene  snrna_gene lincrna mirna

          snorna snrna nmd_transcript_variant pseudogene

          processed_pseudogene pseudogenic_transcript

          aberrant_processed_transcript nc_primary_transcript

          processed_transcript gene_segment exon cds<br>

        </big>

        <ul>

          <li><big>for transcript lines, particular attributes are

              necessery, and some are optionnal.</big></li>

          <ul>

            <li><big>necesary : gene_id , transcript_id,

                transcript_biotype (if feature_type is not a particular

                transcript type)</big></li>

            <li><big>optionnal : </big><big>transcript_biotype (if

                feature type is a particular transcript type such as

                mRNA), transcript_protein ...</big></li>

          </ul>

        </ul>

        <ul>

          <li><big>"protein_coding" transcript type (indicated as mRNA

              in feature type column or whith "protein_coding" as

              transcript_type attribute), must be reliated with exon/CDS

              line(s).</big></li>

        </ul>

        <ul>

          <li><big>CDS lines must overlap exon lines and must have a

              frame (0,1,or 2 and not "." in column 8), and (not shure)

              must have a transcript_id attribute (other attribute are

              optionnal)</big></li>

        </ul>

        <ul>

          <li><big>exon lines must have at least a transcript_id

              attribute.</big></li>

        </ul>

        <p><big>So finally, my trout GFF file looks now like this:</big><br>

        </p>

        <p><font face="Courier New, Courier, monospace"><big>chr_1   

              Gaze    Gene    94130    196267    36.8621    +    .   

              gene_id GSONMG00071408001<br>

              chr_1    Gaze    mRNA    94130    196267    36.8621   

              +    .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    exon    94130    94318    6.0000    +   

              .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    CDS    94130    94318    6.0000    +   

              0    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    exon    115360    115542    10.5000   

              +    .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    CDS    115360    115542    10.5000   

              +    0    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    exon    116586    116839    11.0000   

              +    .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    CDS    116586    116839    11.0000   

              +    0    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    exon    154563    154705    5.0000   

              +    .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    CDS    154563    154705    5.0000    +   

              0    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    exon    195996    196267    6.0000   

              +    .    transcript_id GSONMT00071408001<br>

              chr_1    Gaze    CDS    195996    196267    6.0000    +   

              0    transcript_id GSONMT00071408001</big></font><br>

        </p>

      </div>

      <p>As I said, I found those informations by reading gtf2vep.pl

        code version 81. Do I miss something (may be for gene lines, or

        gene_id in transcript line ...)?<br>

        So it would be very helpfull if you can publish those kind of

        informations, because I am not shure to take time to read future

        API versions code to find out. <br>

      </p>

          <u>2. format of sequence name in the fasta file</u><br>

      <br>

      As you can see, the sequence of the trout genome are currently

      named chr_1, chr_Un1, .... and chrUn (it will certainly change

      when included in EnsEMBL).<br>

      It seem's that having the presence of "chr" causes VEP to crash. I

      made some tests (on version 81 and 82) with or without "chr" in

      the gff, and/or fasta, and/or vcf files. <br>

      <ul>

        <li>GTF2VEP tests</li>

      </ul>

      <p>gtf2vep works fine if you give these combinations: <br>

            - GFF and FASTA with "chr" ==> database called trout_chr,<br>

            - GFF and FASTA without "chr" ==> database called trout,

        <br>

            - GFF with "chr" and FASTA without "chr" ==> database

        called trout_gchr<br>

      </p>

      <p>The constructed database stores fasta name sequences and change

        the gff sequence name (if it does not correspond to the fasta

        sequence names) by deleting « chr » at the begining. <br>

        But this is not working if fasta contains « chr » and gff not.<br>

      </p>

      <ul>

        <li>VEP tests<br>

        </li>

      </ul>

      For the 3 precedent databases, I tested VEP whith a VCF file

      containing or not "chr" in the sequence names.<br>

      <br>

      VEP looks for sequence names whithout "chr" even if « chr » is

      specified in the VCF. So the full pipeline works only if the fasta

      file does not contain « chr » in the fasta sequence name (gff name

      does not matter).<br>

      <br>

      Why sequence names are not simply stored as they are indicated,

      and softwares crash if GFF FASTA and VCF are not corresponding to

      each other? Maybe I missed something, but could you also indicate

      those kind of information somewhere in the documentation. (shame

      on me if you already have ...) <br>

      <br>

          <u>3. </u><u>how VEP deals with rsid in the VCF file</u><br>

      <br>

      Last thing I wanted to comment is the way rsid are stored in VEP

      output.<br>

      <br>

      If VCF file (for an EnsEMBL genomie) does not contain rsID,

      variant_effect_predictor may add it in the output file in the

      "Existing_variation" column.<br>

      exemple on the human genome:<br>

      <font face="Courier New, Courier, monospace"><br>

            #CHROM    POS    <b><font color="#3333ff">ID</font></b>   

        REF    ALT    QUAL    FILTER    INFO    FORMAT    Pickrell   

        Sample1<br>

            chr12    939302    <b><font color="#3333ff">.</font></b>   

        A    G    .    PASS   

        AC=1;ADP=587;AF=0.500;AN=2;HET=1;HOM=0;NC=0;WT=0;set=Varscan   

        GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PVAL:RBQ:RD:RDF:RDR:SDP    ./.   

        0/1:30:213:85:128:587:36,29%:99:3,2364E-75:27:370:149:221:651</font><br>

      <br>

      become after VEP (one line exemple among many others)<br>

      <font face="Courier New, Courier, monospace"><br>

            <font color="#3333ff"><b>#Uploaded_variation</b></font>   

        Location    Allele    Gene    Feature    Feature_type   

        Consequence    cDNA_position    CDS_position   

        Protein_position    Amino_acids    Codons    <b><font

            color="#3333ff">Existing_variation</font></b>    Extra<br>

            <font color="#3333ff"><b>12_939302_A/G</b></font>   

        12:939302    G    ENSG00000002016    ENST00000544742   

        Transcript    intron_variant    -    -    -    -    -    <b><font

            color="#3333ff">rs139381800</font></b>   

        IMPACT=MODIFIER;STRAND=-1</font><br>

      <br>

      but if the variant caller used we already annotated variant with

      existing rsID, rsID are managed differently:<br>

      <br>

      <font face="Courier New, Courier, monospace">    #CHROM    POS   

        ID    REF    ALT    QUAL    FILTER    INFO    FORMAT   

        Pickrell    Sample1<br>

            chr12    406292    <font color="#3333ff"><b>rs2229351</b></font>   


        G    A    1002.77    PASS   

        AC=2;ADP=163;AF=0.500;AN=4;BaseQRankSum=3.502;DB;DP=69;Dels=0.00;FS=4.853;HET=1;HOM=0;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=37.00;MQ0=0;MQRankSum=0.943;NC=0;QD=14.53;ReadPosRankSum=-1.219;WT=0;set=Intersection   


        GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PL:PVAL:RBQ:RD:RDF:RDR:SDP   

        0/1:.:33,36:.:.:66:.:99:1031,0,999   

        0/1:26:39:21:18:163:23,93%:99:.:1,3713E-13:30:124:66:58:169<br>

      </font>become (only one line result)<br>

      <br>

      <font face="Courier New, Courier, monospace">    <font

          color="#3333ff"><b>#Uploaded_variation</b></font>   

        Location    Allele    Gene    Feature    Feature_type   

        Consequence    cDNA_position    CDS_position   

        Protein_position    Amino_acids    Codons    <b><font

            color="#3333ff">Existing_variation</font></b>    Extra<br>

        <font color="#3333ff"><b>    rs2229351</b></font>   

        12:406292    A    ENSG00000120647    ENST00000535052   

        Transcript    intron_variant    -    -    -    -    -   <b><font

            color="#3333ff"> -</font></b>    IMPACT=MODIFIER;STRAND=1<br>

      </font><br>

      First case, uploaded variation id is construct with sequence name,

      position and alleles, and rsID is added in Existing_variation

      column. <br>

      Second case, uploaded variation id is the known rsID, and

      Existing_variation stays empty.<br>

      Would it be possible to stay in case 1 even if the VCF file

      includes the rsID ? <br>

      <br>

      In any cases thank you for those API which evolved since the last

      few versions. A little bit more documentation would be great (not

      a funny task, I know), some behaviour corrections that facilitate

      my life would be very very very much appreciated but I know that

      we can't satisfy every one.<br>

      <br>

      Again thank you for these tools.<br>

      <br>

      Best Regards<br>

      <pre class="moz-signature" cols="72">-- 

Maria Bernard

Bioinformatics


Sigenae team

GABI Unity


INRA de Jouy en Josas - batiment 320

Domaine de Vilvert

78352 Jouy en josas

FRANCE

</pre>

      <br>

    </div>

    <br>

  </body>

</html>