<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<div class="moz-forward-container">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
Dear developpers,<br>
<br>
I am a bioinformatic engineer working on various species, and some
of them are not in EnsEMBL database (such as the trout genome, for
now).<br>
Biologists are interested in the annotation of their variant, and
we given them the possibility to use Variant_effect_predictor API
via our Galaxy instance.<br>
<br>
For non EnsEMBL species I use gtf2vep to construct the "annotation
database" before VEP. Here are the command lines I use:<br>
<font face="Courier New, Courier, monospace">gtf2vep.pl -i
genome.gff -f genome.fa -s genomeName --dir database_directory<br>
variant_effect_predictor.pl --species genomeName -i genome.vcf
--format vcf -o genome.vep --offline --no_stats --dir_cache
database_directory --verbose</font><br>
<br>
I have 3 different questions/comments about the use of these two
programs together. I will take the trout and human genomes to
illustrate.<br>
<ol>
<li><u>GFF format specifications</u></li>
</ol>
<p>I clearly understand the GFF/GFT format asked. But I could not
find anywhere the minimum necessary informations and the way to
express them in each field of the annotation file. <br>
</p>
<p>Trout annotation file (that you could find here : <a
moz-do-not-send="true" class="moz-txt-link-freetext"
href="http://www.genoscope.cns.fr/trout/data/"><a class="moz-txt-link-freetext" href="http://www.genoscope.cns.fr/trout/data/">http://www.genoscope.cns.fr/trout/data/</a></a>),
looks like this:</p>
<p><font face="Courier New, Courier, monospace">seq_name
source feature_type start stop score strand
frame attribute <br>
chr_1 Gaze Gene 94130 196267 36.8621 +
. Gene GSONMG00071408001<br>
chr_1 Gaze mRNA 94130 196267 36.8621 +
. mRNA GSONMT00071408001<br>
chr_1 Gaze CDS 94130 94318 6.0000 + .
CDS GSONMT00071408001<br>
chr_1 Gaze CDS 115360 115542 10.5000 +
. CDS GSONMT00071408001<br>
chr_1 Gaze CDS 116586 116839 11.0000 +
. CDS GSONMT00071408001<br>
chr_1 Gaze CDS 154563 154705 5.0000 +
. CDS GSONMT00071408001<br>
chr_1 Gaze CDS 195996 196267 6.0000 +
. CDS GSONMT00071408001</font><br>
</p>
<p>Here is what I founded in the gtf2vep code (version 81), maybe
I passed through someones.</p>
<ul>
<li>particular feature_type are read and the other are excluded:</li>
</ul>
<p>
<meta http-equiv="content-type" content="text/html;
charset=utf-8">
</p>
<div class="line" style="box-sizing: border-box; color: rgb(0, 0,
0); font-family: 'Lucida Grande', verdana, arial, helvetica,
sans-serif; font-size: 12px; font-style: normal; font-variant:
normal; font-weight: normal; letter-spacing: normal;
line-height: 17.142858505249023px; orphans: auto; text-align:
start; text-indent: 0px; text-transform: none; white-space:
normal; widows: auto; word-spacing: 0px;
-webkit-text-stroke-width: 0px; background-color: rgb(255, 255,
255);"><big>included : gene transcript mrna rrna trna ncrna
primary_transcript c_gene_segment d_gene_segment
vd_gene_segment j_gene_segment v_gene_segment lincrna_gene
mirna_gene rrna_gene snorna_gene snrna_gene lincrna mirna
snorna snrna nmd_transcript_variant pseudogene
processed_pseudogene pseudogenic_transcript
aberrant_processed_transcript nc_primary_transcript
processed_transcript gene_segment exon cds<br>
</big>
<ul>
<li><big>for transcript lines, particular attributes are
necessery, and some are optionnal.</big></li>
<ul>
<li><big>necesary : gene_id , transcript_id,
transcript_biotype (if feature_type is not a particular
transcript type)</big></li>
<li><big>optionnal : </big><big>transcript_biotype (if
feature type is a particular transcript type such as
mRNA), transcript_protein ...</big></li>
</ul>
</ul>
<ul>
<li><big>"protein_coding" transcript type (indicated as mRNA
in feature type column or whith "protein_coding" as
transcript_type attribute), must be reliated with exon/CDS
line(s).</big></li>
</ul>
<ul>
<li><big>CDS lines must overlap exon lines and must have a
frame (0,1,or 2 and not "." in column 8), and (not shure)
must have a transcript_id attribute (other attribute are
optionnal)</big></li>
</ul>
<ul>
<li><big>exon lines must have at least a transcript_id
attribute.</big></li>
</ul>
<p><big>So finally, my trout GFF file looks now like this:</big><br>
</p>
<p><font face="Courier New, Courier, monospace"><big>chr_1
Gaze Gene 94130 196267 36.8621 + .
gene_id GSONMG00071408001<br>
chr_1 Gaze mRNA 94130 196267 36.8621
+ . transcript_id GSONMT00071408001<br>
chr_1 Gaze exon 94130 94318 6.0000 +
. transcript_id GSONMT00071408001<br>
chr_1 Gaze CDS 94130 94318 6.0000 +
0 transcript_id GSONMT00071408001<br>
chr_1 Gaze exon 115360 115542 10.5000
+ . transcript_id GSONMT00071408001<br>
chr_1 Gaze CDS 115360 115542 10.5000
+ 0 transcript_id GSONMT00071408001<br>
chr_1 Gaze exon 116586 116839 11.0000
+ . transcript_id GSONMT00071408001<br>
chr_1 Gaze CDS 116586 116839 11.0000
+ 0 transcript_id GSONMT00071408001<br>
chr_1 Gaze exon 154563 154705 5.0000
+ . transcript_id GSONMT00071408001<br>
chr_1 Gaze CDS 154563 154705 5.0000 +
0 transcript_id GSONMT00071408001<br>
chr_1 Gaze exon 195996 196267 6.0000
+ . transcript_id GSONMT00071408001<br>
chr_1 Gaze CDS 195996 196267 6.0000 +
0 transcript_id GSONMT00071408001</big></font><br>
</p>
</div>
<p>As I said, I found those informations by reading gtf2vep.pl
code version 81. Do I miss something (may be for gene lines, or
gene_id in transcript line ...)?<br>
So it would be very helpfull if you can publish those kind of
informations, because I am not shure to take time to read future
API versions code to find out. <br>
</p>
<u>2. format of sequence name in the fasta file</u><br>
<br>
As you can see, the sequence of the trout genome are currently
named chr_1, chr_Un1, .... and chrUn (it will certainly change
when included in EnsEMBL).<br>
It seem's that having the presence of "chr" causes VEP to crash. I
made some tests (on version 81 and 82) with or without "chr" in
the gff, and/or fasta, and/or vcf files. <br>
<ul>
<li>GTF2VEP tests</li>
</ul>
<p>gtf2vep works fine if you give these combinations: <br>
- GFF and FASTA with "chr" ==> database called trout_chr,<br>
- GFF and FASTA without "chr" ==> database called trout,
<br>
- GFF with "chr" and FASTA without "chr" ==> database
called trout_gchr<br>
</p>
<p>The constructed database stores fasta name sequences and change
the gff sequence name (if it does not correspond to the fasta
sequence names) by deleting « chr » at the begining. <br>
But this is not working if fasta contains « chr » and gff not.<br>
</p>
<ul>
<li>VEP tests<br>
</li>
</ul>
For the 3 precedent databases, I tested VEP whith a VCF file
containing or not "chr" in the sequence names.<br>
<br>
VEP looks for sequence names whithout "chr" even if « chr » is
specified in the VCF. So the full pipeline works only if the fasta
file does not contain « chr » in the fasta sequence name (gff name
does not matter).<br>
<br>
Why sequence names are not simply stored as they are indicated,
and softwares crash if GFF FASTA and VCF are not corresponding to
each other? Maybe I missed something, but could you also indicate
those kind of information somewhere in the documentation. (shame
on me if you already have ...) <br>
<br>
<u>3. </u><u>how VEP deals with rsid in the VCF file</u><br>
<br>
Last thing I wanted to comment is the way rsid are stored in VEP
output.<br>
<br>
If VCF file (for an EnsEMBL genomie) does not contain rsID,
variant_effect_predictor may add it in the output file in the
"Existing_variation" column.<br>
exemple on the human genome:<br>
<font face="Courier New, Courier, monospace"><br>
#CHROM POS <b><font color="#3333ff">ID</font></b>
REF ALT QUAL FILTER INFO FORMAT Pickrell
Sample1<br>
chr12 939302 <b><font color="#3333ff">.</font></b>
A G . PASS
AC=1;ADP=587;AF=0.500;AN=2;HET=1;HOM=0;NC=0;WT=0;set=Varscan
GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PVAL:RBQ:RD:RDF:RDR:SDP ./.
0/1:30:213:85:128:587:36,29%:99:3,2364E-75:27:370:149:221:651</font><br>
<br>
become after VEP (one line exemple among many others)<br>
<font face="Courier New, Courier, monospace"><br>
<font color="#3333ff"><b>#Uploaded_variation</b></font>
Location Allele Gene Feature Feature_type
Consequence cDNA_position CDS_position
Protein_position Amino_acids Codons <b><font
color="#3333ff">Existing_variation</font></b> Extra<br>
<font color="#3333ff"><b>12_939302_A/G</b></font>
12:939302 G ENSG00000002016 ENST00000544742
Transcript intron_variant - - - - - <b><font
color="#3333ff">rs139381800</font></b>
IMPACT=MODIFIER;STRAND=-1</font><br>
<br>
but if the variant caller used we already annotated variant with
existing rsID, rsID are managed differently:<br>
<br>
<font face="Courier New, Courier, monospace"> #CHROM POS
ID REF ALT QUAL FILTER INFO FORMAT
Pickrell Sample1<br>
chr12 406292 <font color="#3333ff"><b>rs2229351</b></font>
G A 1002.77 PASS
AC=2;ADP=163;AF=0.500;AN=4;BaseQRankSum=3.502;DB;DP=69;Dels=0.00;FS=4.853;HET=1;HOM=0;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=37.00;MQ0=0;MQRankSum=0.943;NC=0;QD=14.53;ReadPosRankSum=-1.219;WT=0;set=Intersection
GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PL:PVAL:RBQ:RD:RDF:RDR:SDP
0/1:.:33,36:.:.:66:.:99:1031,0,999
0/1:26:39:21:18:163:23,93%:99:.:1,3713E-13:30:124:66:58:169<br>
</font>become (only one line result)<br>
<br>
<font face="Courier New, Courier, monospace"> <font
color="#3333ff"><b>#Uploaded_variation</b></font>
Location Allele Gene Feature Feature_type
Consequence cDNA_position CDS_position
Protein_position Amino_acids Codons <b><font
color="#3333ff">Existing_variation</font></b> Extra<br>
<font color="#3333ff"><b> rs2229351</b></font>
12:406292 A ENSG00000120647 ENST00000535052
Transcript intron_variant - - - - - <b><font
color="#3333ff"> -</font></b> IMPACT=MODIFIER;STRAND=1<br>
</font><br>
First case, uploaded variation id is construct with sequence name,
position and alleles, and rsID is added in Existing_variation
column. <br>
Second case, uploaded variation id is the known rsID, and
Existing_variation stays empty.<br>
Would it be possible to stay in case 1 even if the VCF file
includes the rsID ? <br>
<br>
In any cases thank you for those API which evolved since the last
few versions. A little bit more documentation would be great (not
a funny task, I know), some behaviour corrections that facilitate
my life would be very very very much appreciated but I know that
we can't satisfy every one.<br>
<br>
Again thank you for these tools.<br>
<br>
Best Regards<br>
<pre class="moz-signature" cols="72">--
Maria Bernard
Bioinformatics
Sigenae team
GABI Unity
INRA de Jouy en Josas - batiment 320
Domaine de Vilvert
78352 Jouy en josas
FRANCE
</pre>
<br>
</div>
<br>
</body>
</html>