<div dir="ltr">Hello,<div class="gmail_extra"><br><div class="gmail_quote">On 29 May 2015 at 15:42, Svein Tore Koksrud Seljebotn <span dir="ltr"><<a href="mailto:s.t.seljebotn@medisin.uio.no" target="_blank">s.t.seljebotn@medisin.uio.no</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi again,<br>

<br>

thanks for your reply. That clears it up!<br>

<br>

Calling the subpopulations *_MAF is a bit misleading in that case?<br></blockquote><div><br></div><div>Yes, strictly speaking they should be called *_AF, but the *_MAF names have stuck.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Anyway, in either case, why not always include both REF and ALT(s), e.g. G:0.002&C:0.998, and call the frequencies *_AF? Sometimes one or more will not be available in 1000g, but at least you would provide all the data the user is likely to need.<br></blockquote><div><br></div><div>This is one solution, yes.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Another question: in some regions, especially intronic ones, I have quite a lot of variants where the GMAF allele is the REF allele. Does this sound plausible? Shouldn't the reference genome normally contain major alleles? The reference genome I use is from the GATK bundle (v37).<br></blockquote><div><br></div><div>It's definitely plausible, yes. For GRCh38, the reference genome has been corrected at many loci using 1000 Genomes frequencies as a guide. I believe that, for various reasons, the original reference genome ended up not being especially representative of the most common alleles at many loci.</div><div><br></div><div>If you are finding an abnormally large number, we'd be interested in taking a look at some of these cases to see if there's any systematic error anywhere.</div><div><br></div><div>Will</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Thanks again!<span class="HOEnZb"><font color="#888888"><br>

Svein Tore Koksrud Seljebotn</font></span><span class="im HOEnZb"><br>

<br>

>Hello,<br>

><br>

>As you've seen, this data can be somewhat confusing.<br>

><br>

>The GMAF field always reports the minor allele frequency, whereas the other<br>

>frequency fields report the frequencies of the ALT (or ALTs if there is<br>

>more than one).<br>

><br>

>Ideally the VEP would report the frequency of the ALT allele that you input<br>

>in your VCF, but this raises further problems if the ALT allele you report<br>

>does not match either the REF or ALT alleles from the 1000 Genomes VCF. It<br>

>is something we are hoping to improve in a future VEP release.<br>

><br>

>For your second question, it looks like frequencies have been mistakenly<br>

>assigned to the two reported co-located variants (rs3902057 and<br>

>RISN_CRB1:c.1410A>G),<br>

>so the frequencies appear twice. We'll look into a fix for this.<br>

><br>

>Regards<br>

><br>

>Will McLaren<br>

>Ensembl Variation<br>

><br>

>On 29 May 2015 at 14:29, Svein Tore Koksrud Seljebotn <<br></span><div class="HOEnZb"><div class="h5">

>s.t.seljebotn at <a href="http://medisin.uio.no" target="_blank">medisin.uio.no</a>> wrote:<br>

><br>

>> Hi,<br>

>><br>

>> I am trying to figure out some of the output I get from VEP (version 79)<br>

>> when annotating vcf files. See end of email for input and command. Please<br>

>> note, I am new to this field, so I might misunderstand a few concepts...<br>

>><br>

>> For the variant (1   197390368   rs3902057   A   G) I get the following<br>

>> output:<br>

>><br>

>> CSQ=G|upstream_gene_variant|MODIFIER|CRB1|ENSG00000134376|Transcript|ENST00000480086|processed_transcript||||||||||rs3902057&RISN_CRB1:c.1410A>G|1|1573|1|HGNC|2343||||||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||<br>

>> {rest of transcripts omitted...}<br>

>><br>

>> - This might be a silly question, but why is GMAF given for REF, while the<br>

>> subpopulations are given for ALT? In my case I'm interested in the<br>

>> frequency for the ALT, not the REF. I assume it's giving the minor allele<br>

>> frequency always? But why is there a difference in the allele given for<br>

>> GMAF vs e.g. AFR_MAF?<br>

>><br>

>> Looking at a later transcript for same variant, I see the following:<br>

>><br>

>><br>

>> G|synonymous_variant|LOW|CRB1|23418|Transcript|NM_001193640.1|protein_coding|4/10||NM_001193640.1:c.1074A>G|NM_001193640.1:c.1074A>G(p.=)|1283|1074|358|L|ctA/ctG|rs3902057&RISN_CRB1:c.1410A>G|1||1|||||NP_001180569.1|rseq_mrna_nonmatch&rseq_cds_mismatch&rseq_ens_match_cds||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||,G|5_prime_UTR_variant|MODIFIER|CRB1|ENSG00000134376|Transcript|ENST00000367397|protein_coding|2/6||ENST00000367397.1:c.-448A>G||411|||||rs3902057&RISN_CRB1:c.1410A>G|1||1|HGNC|2343|||ENSP00000356367|||||A:0.0803|G:0.7065&G:0.7065|G:0.9813&G:0.9813||G:1&G:1|G:0.999&G:0.999|G:1&G:1|G:0.7696&G:0.7696|G:0.9986&G:0.9986|||19339744||||<br>

>><br>

>> - Why is the frequency for the subpopulation alleles repeated twice with<br>

>> same value? Why not always give the frequency for all alleles?<br>

>><br>

>><br>

>> Best regards,<br>

>> Svein Tore Koksrud Seljebotn<br>

>><br>

>><br>

>><br>

>><br>

>> **** Example VCF: *****<br>

>><br>

>> ##fileformat=VCFv4.1<br>

>> ##INFO=<ID=class,Number=.,Type=String,Description="class"><br>

>> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"><br>

>> #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO FORMAT    H02<br>

>> 1   197390368   rs3902057   A   G   7128.77 .<br>

>> AC=2;AF=1.00;AN=2;DB;DP=193;Dels=0.00;FS=0.000;HaplotypeScore=4.6974;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;QD=29.21<br>

>> GT:AD:DP:GQ:PL  1/1:0,192:193:99:7157,518,0<br>

>><br>

>> ***** Command: *****<br>

>> vep --cache --dir_cache=/work/VEP/cache/<br>

>> --fasta=/work/human_g1k_v37_decoy.fasta --offline --sift=b --polyphen=b<br>

>> --ccds --hgvs --numbers --domains --regulatory --canonical --protein<br>

>> --biotype --gmaf --maf_1kg --maf_esp --pubmed --allow_non_variant --fork=4<br>

>> --vcf --allele_number --no_escape --failed=1 --no_stats --merged --symbol<br>

>> -i testfile.vcf -o testfile.annotated.vcf<br>

>><br>

>><br>

>><br>

<br>


</div></div></blockquote></div><br></div></div>
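<div>To illustrate the point made in this thread (GMAF reports the minor allele, while the subpopulation fields report the ALT allele, sometimes repeated), here is a minimal Python sketch for reading such fields. The function names <code>parse_allele_freqs</code> and <code>alt_freq_from_gmaf</code> are hypothetical and not part of VEP. Deriving the ALT frequency as 1 minus the GMAF value is only an assumption that holds for strictly biallelic sites; it is not something VEP does itself.</div>

```python
from typing import Dict, Optional


def parse_allele_freqs(field: str) -> Dict[str, float]:
    """Parse a frequency field such as 'G:0.7065&G:0.7065'.

    Entries are '&'-separated 'ALLELE:FREQ' pairs. Exact repeats (the
    duplication discussed in this thread) collapse to a single entry.
    """
    freqs: Dict[str, float] = {}
    for entry in field.split("&"):
        if not entry:
            continue  # empty field: no frequency was reported
        allele, _, value = entry.rpartition(":")
        freq = float(value)
        if allele in freqs and freqs[allele] != freq:
            raise ValueError(f"conflicting frequencies for allele {allele!r}")
        freqs[allele] = freq
    return freqs


def alt_freq_from_gmaf(gmaf_field: str, ref: str, alt: str) -> Optional[float]:
    """Estimate the global ALT frequency from a GMAF field.

    ASSUMPTION: the site is biallelic, so when GMAF names the REF allele
    the ALT frequency is taken as 1 - GMAF. Returns None if GMAF names
    neither allele or is empty.
    """
    freqs = parse_allele_freqs(gmaf_field)
    if alt in freqs:
        return freqs[alt]
    if ref in freqs:
        return round(1.0 - freqs[ref], 4)
    return None
```

<div>For the variant in this thread (REF A, ALT G, GMAF A:0.0803), <code>alt_freq_from_gmaf("A:0.0803", "A", "G")</code> gives 0.9197 under the biallelic assumption, and <code>parse_allele_freqs("G:0.7065&G:0.7065")</code> collapses the repeated entry to a single G frequency.</div>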