<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Dear Derek,</p>
<p><br>
</p>
<p>Your GFF file is missing the "biotype" and "Parent"
attributes on the CDS lines.</p>
<p>For example, using your input:</p>
<p>NC_000962.3 Modlin et. al. 2018 CDS 1 1524 .
+ . ID=CDS1;<b>Parent=gene1;biotype=protein_coding;</b>locus_tag=Rv0001;product=Chromosomal
replication initiator protein DnaA;note=FunctionalCategory:
information pathways</p>
<p><br>
</p>
<p>Furthermore, you need to add one or more "exon" lines after each
CDS line, each with a "Parent" attribute pointing to the CDS ID, e.g.:</p>
<p>NC_000962.3 Modlin et. al. 2018 CDS 1 1524 .
+ . ID=CDS1;<b>Parent=gene1;biotype=protein_coding;</b>locus_tag=Rv0001;product=Chromosomal
replication initiator protein DnaA;note=FunctionalCategory:
information pathways</p>
<p>NC_000962.3 Modlin et. al. 2018 <b>exon</b> 1 1524
. + . ID=exon1;<b>Parent=CDS1;</b>locus_tag=Rv0001;product=Chromosomal
replication initiator protein DnaA;note=FunctionalCategory:
information pathways</p>
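<p>If the file has many CDS lines, the two changes above could be
scripted. The following is only a minimal sketch and not part of VEP:
the function names are hypothetical, it assumes tab-separated GFF3
columns, that each gene line appears before its CDS line, and that a
gene and its CDS share the same locus_tag value. The generated exon
lines carry only the ID and Parent attributes.</p>
<pre>
```python
def parse_attributes(column):
    """Split a GFF3 attribute column into a dict of tag/value pairs (order kept)."""
    pairs = (item.split("=", 1) for item in column.strip().split(";") if "=" in item)
    return {tag.strip(): value for tag, value in pairs}


def format_attributes(attributes):
    """Join tag/value pairs back into a GFF3 attribute column."""
    return ";".join(f"{tag}={value}" for tag, value in attributes.items())


def add_vep_attributes(lines, biotype="protein_coding"):
    """Add Parent/biotype to CDS rows and append one exon row after each CDS row."""
    gene_id_by_locus = {}  # assumption: gene and CDS rows share a locus_tag
    output = []
    exon_count = 0
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if line.startswith("#") or len(fields) != 9:
            output.append(line)  # comments and malformed rows pass through unchanged
            continue
        attributes = parse_attributes(fields[8])
        feature_type = fields[2]
        if feature_type == "gene":
            if "locus_tag" in attributes and "ID" in attributes:
                gene_id_by_locus[attributes["locus_tag"]] = attributes["ID"]
            output.append(line)
        elif feature_type == "CDS":
            gene_id = gene_id_by_locus.get(attributes.get("locus_tag"))
            if gene_id is not None:
                attributes.setdefault("Parent", gene_id)
            attributes.setdefault("biotype", biotype)
            output.append("\t".join(fields[:8] + [format_attributes(attributes)]))
            if "ID" in attributes:
                exon_count += 1
                exon_attributes = {"ID": f"exon{exon_count}", "Parent": attributes["ID"]}
                # Same coordinates, score and strand as the CDS; phase is "." for exons.
                exon_fields = (fields[:2] + ["exon"] + fields[3:7] + ["."]
                               + [format_attributes(exon_attributes)])
                output.append("\t".join(exon_fields))
        else:
            output.append(line)
    return output
```
</pre>
<p>One possible use is to read the uncompressed GFF with
open(...).readlines(), pass the list to add_vep_attributes, write the
returned lines to a new file, and then sort, bgzip and tabix that file
as before.</p>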
<br>
<p>We will try to improve our documentation regarding the GFF files
in VEP.</p>
<p><br>
</p>
<p>Best regards,<br>
</p>
<pre class="moz-signature" cols="72">Laurent
Ensembl Variation
</pre>
<div class="moz-cite-prefix">On 12/02/2018 19:10, Derek
Conkle-Gutierrez wrote:<br>
</div>
<blockquote type="cite"
cite="mid:BY1PR16MB002127AD7D225FFA9A0F87A4A0F70@BY1PR16MB0021.namprd16.prod.outlook.com">
<div id="divtagdefaultwrapper" dir="ltr" style="font-size: 12pt;
color: rgb(0, 0, 0); font-family:
Calibri,Helvetica,sans-serif,"EmojiFont","Apple
Color Emoji","Segoe UI
Emoji",NotoColorEmoji,"Segoe UI
Symbol","Android Emoji",EmojiSymbols;">
<div style="margin-top:0; margin-bottom:0">Hello,</div>
<div style="margin-top:0; margin-bottom:0"><br>
</div>
<div style="margin-top:0; margin-bottom:0">I work for Dr.
Faramarz Valafar at San Diego State University. Previously we
have used Ensembl's VEP program on our VCF files of
Mycobacterium tuberculosis sequences, using annotation from a
cache file downloaded from your website. However, we have
recently developed additional annotations (mostly from running
I-TASSER on ambiguously annotated genes) that we would like to
include. To that end I converted our custom annotation file to
GFF3 format and followed your website's instructions for
running VEP with that file as the annotation source. VEP ran, but
unfortunately it identified every variant as intergenic, even
those that fall within one of our annotated CDS features. I
assume this is due to a formatting error on my part in our
GFF file, though I've been following the specification
described here: <a previewremoved="true"
href="https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md"
target="_blank" rel="noopener noreferrer" id="LPlnk690491"
moz-do-not-send="true"><span id="LPlnk690491">https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md</span></a></div>
<div style="margin-top:0; margin-bottom:0"><br>
</div>
<div style="margin-top:0; margin-bottom:0">I'm using ensembl-vep
version 91.3.<br>
Here's a bit of our GFF:<br>
<div>NC_000962.3 Modlin et. al. 2018 gene 1 1524
. + .
ID=gene1;locus_tag=Rv0001;alias=dnaA;experiment=DESCRIPTION:Mutation
analysis, gene expression[PMID:
10375628];Dbxref=GeneID:885041<br>
NC_000962.3 Modlin et. al. 2018 CDS 1 1524
. + . ID=CDS1;locus_tag=Rv0001;product=Chromosomal
replication initiator protein DnaA;note=FunctionalCategory:
information pathways<br>
NC_000962.3 Modlin et. al. 2018 gene 2052 3260
. + .
ID=gene2;locus_tag=Rv0002;alias=dnaN;Dbxref=GeneID:887092<br>
NC_000962.3 Modlin et. al. 2018 CDS 2052 3260
. + . ID=CDS2;locus_tag=Rv0002;product=DNA
polymerase III (beta chain) DnaN (DNA
nucleotidyltransferase);note=FunctionalCategory: information
pathways</div>
<br>
Here's a bit of our test input VCF:<br>
<div>##fileformat=VCFv4.0<br>
##source=pbhooverV1.0.0a8<br>
##INFO=&lt;ID=RSR,Number=1,Type=Integer,Description="Reference-supporting
reads"&gt;<br>
##INFO=&lt;ID=VSR,Number=1,Type=Integer,Description="Variant-supporting
reads"&gt;<br>
##INFO=&lt;ID=VF,Number=1,Type=Float,Description="Variant
frequency"&gt;<br>
##INFO=&lt;ID=DP,Number=1,Type=Integer,Description="Read
Depth"&gt;<br>
##FILTER=&lt;ID=LOW,Description="Position with too low of
depth"&gt;<br>
##FILTER=&lt;ID=NO,Description="Does not meet criteria to
call variant"&gt;<br>
##FILTER=&lt;ID=HETERO,Description="Enough support to call
reference and variant (mixed population)"&gt;<br>
#CHROM POS ID REF ALT QUAL FILTER INFO<br>
1 8 . A AGT 7.22 NO
RSR=4;VSR=1;VF=0.2;DP=5<br>
etc.<br>
<div>1 2050 . C CC 11.36 NO
RSR=15;VSR=2;VF=0.117647058824;DP=21<br>
1 2051 . A AA 13.36 NO
RSR=18;VSR=2;VF=0.1;DP=20<br>
1 2051 . AA A 15.75 NO
RSR=20;VSR=1;VF=0.047619047619;DP=21<br>
1 2052 . AT A 15.22 NO
RSR=19;VSR=1;VF=0.05;DP=21<br>
1 2053 . TG T 33.02 NO
RSR=15;VSR=2;VF=0.117647058824;DP=18<br>
1 2054 . GG G 45.75 NO
RSR=17;VSR=3;VF=0.15;DP=21<br>
1 2056 . A ACG 12.46 NO
RSR=17;VSR=1;VF=0.0555555555556;DP=21<br>
1 2057 . C CAC 13.28 NO
RSR=20;VSR=1;VF=0.047619047619;DP=21</div>
<br>
</div>
<br>
</div>
<div style="margin-top:0; margin-bottom:0">I used these commands
on our GFF and VCF files:<br>
</div>
<div style="margin-top:0; margin-bottom:0">grep -v "#"
mannotation-with-computation-4vep.gff | sort -k1,1 -k4,4n
-k5,5n -t$'\t' | tabix/bgzip -c >
mannotation-with-computation-4vep.gff.gz</div>
<div style="margin-top:0; margin-bottom:0">tabix/tabix -p gff
mannotation-with-computation-4vep.gff.gz</div>
<div style="margin-top:0; margin-bottom:0">ensembl-vep/vep
--force_overwrite --synonyms synonyms-hyp.txt --format vcf
--vcf --species mycobacterium_tuberculosis --symbol
--variant_class --flag_pick --everything -i test1-0006.vcf
-gff mannotation-with-computation-4vep.gff.gz -fasta
H37Rv.fasta.gz -o test1-0006-annotated2.vcf</div>
<div style="margin-top:0; margin-bottom:0"><br>
</div>
<div style="margin-top:0; margin-bottom:0">And the output VCF
looks like this:<br>
<div>##fileformat=VCFv4.0<br>
##source=pbhooverV1.0.0a8<br>
##INFO=&lt;ID=RSR,Number=1,Type=Integer,Description="Reference-supporting
reads"&gt;<br>
##INFO=&lt;ID=VSR,Number=1,Type=Integer,Description="Variant-supporting
reads"&gt;<br>
##INFO=&lt;ID=VF,Number=1,Type=Float,Description="Variant
frequency"&gt;<br>
##INFO=&lt;ID=DP,Number=1,Type=Integer,Description="Read
Depth"&gt;<br>
##FILTER=&lt;ID=LOW,Description="Position with too low of
depth"&gt;<br>
##FILTER=&lt;ID=NO,Description="Does not meet criteria to
call variant"&gt;<br>
##FILTER=&lt;ID=HETERO,Description="Enough support to call
reference and variant (mixed population)"&gt;<br>
##VEP="v91" time="2018-02-12 10:57:23"
ensembl-variation=91.c78d8b4 ensembl-funcgen=91.4681d69
ensembl=91.18ee742 ensembl-io=91.923d668<br>
##INFO=&lt;ID=CSQ,Number=.,Type=String,Description="Consequence
annotations from Ensembl VEP. Format:
Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|PICK|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|SOURCE|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|mannotation-with-computation.gff.gz"&gt;<br>
##INFO=&lt;ID=mannotation-with-computation.gff.gz,Number=.,Type=String,Description="mannotation-with-computation.gff.gz
(overlap)"&gt;<br>
#CHROM POS ID REF ALT QUAL FILTER INFO<br>
1 8 . A AGT 7.22 NO
RSR=4;VSR=1;VF=0.2;DP=5;CSQ=GT|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||</div>
etc.<br>
<div>1 2050 . C CC 11.36 NO
RSR=15;VSR=2;VF=0.117647058824;DP=21;CSQ=C|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||<br>
1 2051 . A AA 13.36 NO
RSR=18;VSR=2;VF=0.1;DP=20;CSQ=A|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||<br>
1 2051 . AA A 15.75 NO
RSR=20;VSR=1;VF=0.047619047619;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||<br>
1 2052 . AT A 15.22 NO
RSR=19;VSR=1;VF=0.05;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||<br>
1 2053 . TG T 33.02 NO
RSR=15;VSR=2;VF=0.117647058824;DP=18;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||<br>
1 2054 . GG G 45.75 NO
RSR=17;VSR=3;VF=0.15;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||<br>
1 2056 . A ACG 12.46 NO
RSR=17;VSR=1;VF=0.0555555555556;DP=21;CSQ=CG|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||<br>
1 2057 . C CAC 13.28 NO
RSR=20;VSR=1;VF=0.047619047619;DP=21;CSQ=AC|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||</div>
<br>
I've also tried adding '<span>Is_circular=true</span>' to the
attribute column of the first entry in the GFF, replacing
'locus_tag' with 'Name', and capitalizing 'alias', in case
those deviations from the format described in the GFF
documentation were the problem. I also added 'biotype'
attributes to the GFF, after seeing this discussion in the
forums:
<a previewremoved="true" id="LPlnk609331"
href="http://lists.ensembl.org/pipermail/dev/2018-January/012867.html"
class="OWAAutoLink" moz-do-not-send="true">
http://lists.ensembl.org/pipermail/dev/2018-January/012867.html</a>,
though I was unsure if that advice was meant for the GFF or
VCF, or whether it was applicable to whole genome reads vs
transcripts. None of this changed the resulting output.
<br>
<br>
</div>
<div style="margin-top:0; margin-bottom:0">Do you have an
example of a GFF annotation file that's worked with VEP, so I
can compare it with ours to see what I've done wrong? Or is
there a tool we can use to create our own cache files?<br>
<br>
Thank you for your assistance. <br>
</div>
<br>
</div>
<br>
</blockquote>
<br>
</body>
</html>