<div dir="ltr">Hi Laura/Will,<div><br></div><div>Thanks for the response. While waiting for that to be fixed, I have a related question. I have two non-overlapping sets of variants, one from phase 1 EUR and one from phase 3 EUR. I want to combine them into one VCF so I can do LD calculations using PLINK. I can't find a way to combine them because only 364 samples are shared, with 15 samples unique to phase 1 and 139 unique to phase 3. Any suggestions on how I could do that? BCFtools is no good because it renames overlapping samples. Is there a tool to transpose a VCF, concatenate all variants, and then convert back to VCF?</div><div><br></div><div>Thanks.</div><div><br></div><div>G.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 17 November 2014 11:39, Laura Clarke <span dir="ltr"><<a href="mailto:laura@ebi.ac.uk" target="_blank">laura@ebi.ac.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello G<br>

<br>

The 1000 Genomes Project is currently undertaking a QC of P3.<br>

<br>

We aim to provide a list to the community categorizing the different<br>

reasons variants aren't part of the P3 sites list.  This will allow<br>

everyone to see if we no longer called a site, filtered it out or<br>

missed it from our final build. We hope to release the list as soon as<br>

possible.<br>

<br>

We will ensure that Ensembl gets that list and it is available for<br>

display as soon as is feasible.<br>

<br>

thanks<br>

<span class="HOEnZb"><font color="#888888"><br>

Laura<br>

</font></span><span class=""><br>

On 17 November 2014 10:50, Genomeo Dev <<a href="mailto:genomeodev@gmail.com">genomeodev@gmail.com</a>> wrote:<br>

> Hi Will,<br>

><br>

> As you might be aware phase 3 has some known issues at the moment which<br>

> include a considerable fraction of variants missing compared to phase 1 v3<br>

> for no apparent reason.<br>

><br>

> For example, for the EUR super population, there are about 1.6 million variants<br>

> missing.<br>

><br>

>>>bcftools isec -p dir -n-1 -w1 1000GENOMES-phase_1_EUR.vcf.gz<br>

>>> EUR.all.phase3_shapeit2_mvncall_integrated_v5.20130502.vcf.gz<br>

><br>

>>>gawk '{print $5}' dir/sites.txt  | sort -V | uniq -c<br>

><br>

> 1,619,556 10<br>

> 68,792,800 01<br>
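For readers without gawk to hand, the same tally can be sketched in Python. This is only an illustration, assuming sites.txt is whitespace-delimited with the presence flag in the fifth column (as the print $5 above implies). Each character of the flag corresponds to one input file in order, so 10 would mean "only in the first file". The function name and the sample rows are made up for the example.<br>

```python
from collections import Counter
from typing import Iterable


def tally_presence_flags(lines: Iterable[str]) -> Counter:
    """Count how often each presence flag (5th column) occurs."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 5:  # skip blank or truncated lines
            counts[fields[4]] += 1
    return counts


# Tiny made-up input, not real 1000 Genomes records.
example = [
    "1\t10177\tA\tAC\t10",
    "1\t10235\tT\tTA\t01",
    "1\t10352\tT\tTA\t01",
]
for flag, n in sorted(tally_presence_flags(example).items()):
    print(n, flag)
```

For a real run, the function could be handed an open file object for dir/sites.txt instead of the in-memory list.<br>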

><br>

> See:<br>

> <a href="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/README_known_issues_20141030" target="_blank">ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/README_known_issues_20141030</a><br>

><br>

> See also a recent thread on the Ensembl developers email list.<br>

><br>

> I wonder whether Ensembl could do some kind of merge of phase 1 and phase 3 while<br>

> waiting for a fix from 1000 genomes.<br>

><br>

> G.<br>

><br>

> On 17 November 2014 09:27, Will McLaren <<a href="mailto:wm2@ebi.ac.uk">wm2@ebi.ac.uk</a>> wrote:<br>

>><br>

</span><div><div class="h5">>> Hi Drew,<br>

>><br>

>> We hope to have 1000 genomes phase 3 data available in release 79 of<br>

>> Ensembl, which is currently due for release early next year.<br>

>><br>

>> Thanks for the valuable feedback.<br>

>><br>

>> Regards<br>

>><br>

>> Will<br>

>><br>

>> On 15 November 2014 02:58, Roberts, Drew (RY) <<a href="mailto:andrew_roberts@merck.com">andrew_roberts@merck.com</a>><br>

>> wrote:<br>

>>><br>

>>> Will-<br>

>>><br>

>>><br>

>>><br>

>>> This is very useful information, thank you very much.  If nothing else, it<br>

>>> makes me feel better about having problems loading 1000GENOMES!<br>

>>><br>

>>><br>

>>><br>

>>> We will most likely follow your suggestions and try to implement the new<br>

>>> features you mention, but in the immediate short term I think we'll move our<br>

>>> focus to some simpler tasks.  When we're ready to move forward I'll post to<br>

>>> this list again and take you up on your offer for configuration help.<br>

>>><br>

>>><br>

>>><br>

>>> For now, I will add 1 question and 1 comment:<br>

>>><br>

>>><br>

>>><br>

>>> Question:  Do you have any idea when 1000GENOMES phase 3 data will be<br>

>>> included in the standard EnsEMBL release?<br>

>>><br>

>>><br>

>>><br>

>>> Comment:  I would request that a small update be made to the<br>

>>> vcf_import.pl documentation online, at:<br>

>>><br>

>>> <a href="http://www.ensembl.org/info/genome/variation/import_vcf.html#tables" target="_blank">http://www.ensembl.org/info/genome/variation/import_vcf.html#tables</a><br>

>>> The current text implies that loading transcript_variation after the<br>

>>> import_vcf is a simple and supported pipeline, which it sounds like it isn't<br>

>>> quite yet.  If the wording was changed to warn people that it isn't so easy,<br>

>>> they will know up front that they need to do an "add_tables<br>

>>> transcript_variation" during their initial import_vcf.pl run if that data is<br>

>>> required.<br>

>>><br>

>>><br>

>>><br>

>>> Again, thanks for the enlightening response, and I expect that I will be<br>

>>> following up on your ideas at some point in the not-too-horribly-distant<br>

>>> future.<br>

>>><br>

>>><br>

>>><br>

>>> Cheers!<br>

>>><br>

>>> Drew<br>

>>><br>

>>><br>

>>><br>

>>> From: <a href="mailto:dev-bounces@ensembl.org">dev-bounces@ensembl.org</a> [mailto:<a href="mailto:dev-bounces@ensembl.org">dev-bounces@ensembl.org</a>] On Behalf<br>

>>> Of Will McLaren<br>

>>> Sent: Friday, November 14, 2014 12:52 AM<br>

>>> To: Ensembl developers list<br>

>>> Cc: Hughes, Jason<br>

>>> Subject: Re: [ensembl-dev] vcf_import of 1000GENOMES phase 3 data<br>

>>><br>

>>><br>

>>><br>

>>> Hi Drew,<br>

>>><br>

>>><br>

>>><br>

>>> The Ensembl Variation team created the variation DB for the 1000 genomes<br>

>>> browser site, so we do have relevant experience here.<br>

>>><br>

>>><br>

>>><br>

>>> We too found that the volume of data stretches a lot of resources in ways<br>

>>> they don't want to be stretched. To this end we have a couple of solutions<br>

>>> in the API that mitigate these issues to some extent.<br>

>>><br>

>>><br>

>>><br>

>>> 1) We don't load the allele or population_genotype table with 1000<br>

>>> genomes data; in fact we started doing this with phase 1. There exists code<br>

>>> in the API to generate allele and population_genotype objects on the fly<br>

>>> from just the data in the compressed_genotype_var table; in order to enable<br>

>>> this, you simply need to set the freqs_from_gts column to 1 in the<br>

>>> population table for the relevant populations.<br>

>>><br>

>>><br>

>>><br>

>>> Note that this data is then unavailable via direct SQL query as the<br>

>>> genotype data in compressed_genotype_var is readable only by the API.<br>
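As a rough illustration of the freqs_from_gts step in point 1, the sketch below runs the kind of UPDATE involved. It is not the Ensembl team's procedure: an in-memory SQLite table stands in for the MySQL variation database, the cut-down table layout (including the name column) is an assumption, and the population names are made up.<br>

```python
import sqlite3

# In-memory SQLite stands in for the MySQL variation database here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population ("
    "population_id INTEGER PRIMARY KEY, name TEXT, freqs_from_gts INTEGER)"
)
conn.executemany(
    "INSERT INTO population (name, freqs_from_gts) VALUES (?, 0)",
    [("1000GENOMES:phase_1_EUR",), ("HAPMAP:CEU",)],  # made-up rows
)

# Flag only the populations whose frequencies should come from genotypes.
conn.execute(
    "UPDATE population SET freqs_from_gts = 1 WHERE name LIKE ?",
    ("1000GENOMES:%",),
)
conn.commit()

flags = dict(conn.execute("SELECT name, freqs_from_gts FROM population"))
print(flags)
```

Against a real database the same UPDATE would be issued through a MySQL client, with the WHERE clause adjusted to whichever populations are relevant.<br>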

>>><br>

>>><br>

>>><br>

>>> 2) For the phase 3 data, we realised that populating<br>

>>> compressed_genotype_region and compressed_genotype_var would also become<br>

>>> impractical; tests indicated that each table would be around 1.5TB on disk!<br>

>>><br>

>>><br>

>>><br>

>>> To this end we developed some more API trickery to load genotype data on<br>

>>> the fly from VCF files. The genotype objects from this are then also fed<br>

>>> into the features mentioned above, such that all genotype and frequency data<br>

>>> that you see at e.g.<br>

>>> <a href="http://browser.1000genomes.org/Homo_sapiens/Variation/Population?v=rs699" target="_blank">http://browser.1000genomes.org/Homo_sapiens/Variation/Population?v=rs699</a><br>

>>> come from the same VCFs you are using to load your DB. The same code also<br>

>>> powers the LD views.<br>

>>><br>

>>><br>

>>><br>

>>> The API code for this is currently available on release/76 of the<br>

>>> ensembl-variation Git repo. If you'd like some help getting this set up, let<br>

>>> me know and I can give you more details on what to do; the code introduces a<br>

>>> couple of extra dependencies and requires some (very simple) configuration.<br>

>>> It has not yet made it into the master branch as it's still somewhat under<br>

>>> development, but it is stable enough for use for a number of applications.<br>

>>><br>

>>><br>

>>><br>

>>> I realise you said you were on release/75, but it should be possible for<br>

>>> you to simply patch your DB to release/76 using the<br>

>>> ensembl/misc-scripts/schema_patcher.pl script. If not, it may also be<br>

>>> possible to push the code onto the release/75 branch too.<br>

>>><br>

>>><br>

>>><br>

>>> Regarding transcript variation, the pipeline we use to populate the table<br>

>>> is not currently documented for users outside the project. It may be<br>

>>> possible to prepare some documentation suitable for external use. There is<br>

>>> also a way to have import_vcf.pl use the same code and optimised caches that<br>

>>> the Variant Effect Predictor uses, though this is alpha code at best.<br>

>>><br>

>>><br>

>>><br>

>>> I hope this helps, though I realise it doesn't necessarily address all of<br>

>>> the issues you've been having!<br>

>>><br>

>>><br>

>>><br>

>>> Regards<br>

>>><br>

>>><br>

>>><br>

>>> Will McLaren<br>

>>><br>

>>> Ensembl Variation<br>

>>><br>

>>><br>

>>><br>

>>><br>

>>><br>

>>> On 14 November 2014 01:47, Roberts, Drew (RY) <<a href="mailto:andrew_roberts@merck.com">andrew_roberts@merck.com</a>><br>

>>> wrote:<br>

>>><br>

>>> Hello all-<br>

>>><br>

>>> I am new to EnsEMBL software.  I am trying to use the vcf_import.pl<br>

>>> script to load a set of VCF files produced by phase 3 of the 1000Genomes<br>

>>> project into a custom EnsEMBL variation database I created in-house.  I<br>

>>> confirmed that this process works for a smaller dataset, but am having<br>

>>> issues with 1000Genomes.  The sheer size of the dataset is proving<br>

>>> difficult.<br>

>>><br>

>>> We are using an in-house EnsEMBL database and API tools with version<br>

>>> 75_37...a little behind the latest, I know, but we wanted to stay with the<br>

>>> human genome build 37 for now as most of our in-house data uses it.<br>

>>><br>

>>> I ran a number of vcf_import.pl jobs in parallel -- one for each<br>

>>> chromosome -- with the following config file:<br>

>>><br>

>>> registry     /tmp60days/robertsa/vcf_file_import/vcf_import_ensembl_registry.pl<br>

>>> input_file   /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/reduced.ALL.chr15.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz<br>

>>> source       1000Genomes<br>

>>> population   1000Genomes:phase_3:MERCK_PRELOAD<br>

>>> panel        /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/sample_population_panel_file.txt<br>

>>> species      homo_sapiens<br>

>>> tmpdir       /tmp60days/robertsa/vcf_file_import/vcf_load_tmpdir<br>

>>> add_tables   compressed_genotype_region<br>

>>><br>

>>> The notable part of this is that we are loading all standard tables plus<br>

>>> compressed_genotype_region.  We would also like to load<br>

>>> transcript_variation, but tests on small data sets showed that this slowed<br>

>>> down the import a lot, and the vcf_import web page claimed it was faster to<br>

>>> populate this table later "using the standard transcript_variation<br>

>>> pipeline", see:<br>

>>> <a href="http://www.ensembl.org/info/genome/variation/import_vcf.html#tables" target="_blank">http://www.ensembl.org/info/genome/variation/import_vcf.html#tables</a><br>

>>> However I have not been able to find any documentation for this "standard<br>

>>> pipeline", and I found an exchange on this mailing list where a user was<br>

>>> told not to try to use it.  So:<br>

>>><br>

>>> Question 1:  Is there still not a standard transcript_variation pipeline?<br>

>>> If it exists, can somebody point me to it?<br>

>>><br>

>>> This upload using the config above runs, but still quite slowly...a<br>

>>> little over 200 variants per minute.  It looked like it would take at least<br>

>>> a week and a half to finish, running 10 or 12 jobs in parallel.  The MySQL<br>

>>> database seemed to be holding up OK, while each individual perl script<br>

>>> consumed close to 100% of the CPU time for the processor it was running on.<br>

>>><br>

>>> About halfway through the process everything halted.  It turned out that<br>

>>> the auto-increment "allele_id" column in the "allele" table had run out of<br>

>>> integers!  It hit the maximum integer for the 'signed int' datatype which<br>

>>> EnsEMBL uses for this column.  I have been converting the table to use<br>

>>> "bigint" instead of "int" for this column.  However I wondered:<br>

>>><br>

>>> Question 2:  Has anybody else run into this problem?  In particular, have<br>

>>> the EnsEMBL folks tried to load phase 3 1000GENOMES data yet?<br>

>>> It feels like I must be doing something wrong, but I've been using the<br>

>>> standard script.<br>
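To put numbers on the allele_id overflow described above, here is a small arithmetic sketch. The two limits are the standard ranges of MySQL's signed INT and BIGINT types. The row count is hypothetical and chosen only to show the effect, and the helper name is made up.<br>

```python
# Upper bounds of MySQL's signed integer column types.
INT_MAX = 2**31 - 1      # signed INT, 4 bytes
BIGINT_MAX = 2**63 - 1   # signed BIGINT, 8 bytes


def fits(row_count: int, type_max: int) -> bool:
    """Return True if an auto-increment id can reach row_count."""
    return row_count <= type_max


# Hypothetical row count, chosen only to illustrate the overflow.
allele_rows = 3_000_000_000

print(INT_MAX)
print(fits(allele_rows, INT_MAX))
print(fits(allele_rows, BIGINT_MAX))
```

A signed INT tops out just above 2.1 billion ids, so any load that needs more rows than that in one table will stall in the way described, whereas BIGINT leaves ample headroom.<br>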

>>><br>

>>> In general I notice that the standard EnsEMBL variation database has far<br>

>>> fewer allele entries per variation than my custom vcf_import loaded database<br>

>>> does.  More generally, my custom database (over a terabyte and only<br>

>>> halfway through the load) is already larger than the standard database,<br>

>>> which includes earlier 1000GENOMES data plus plenty of other<br>

>>> stuff.  And so:<br>

>>><br>

>>> Question 3:  Am I totally going about this the wrong way?  The website I<br>

>>> linked above says the EnsEMBL team uses vcf_import.pl to load 1000GENOMES<br>

>>> data.  If that is true, can they tell me what table options they use?<br>

>>> Perhaps they are skipping tables I am keeping, or doing something else that<br>

>>> I should know about.  Any suggestions would be welcome.<br>

>>><br>

>>> Hope this makes sense -- I am new to all this, so I might have forgotten<br>

>>> to provide information you need.<br>

>>><br>

>>> In any event, thanks for any suggestions!<br>

>>> Drew<br>

>>> Notice:  This e-mail message, together with any attachments, contains<br>

>>> information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,<br>

>>> New Jersey, USA 08889), and/or its affiliates Direct contact information<br>

>>> for affiliates is available at<br>

>>> <a href="http://www.merck.com/contact/contacts.html" target="_blank">http://www.merck.com/contact/contacts.html</a>) that may be confidential,<br>

>>> proprietary copyrighted and/or legally privileged. It is intended solely<br>

>>> for the use of the individual or entity named on this message. If you are<br>

>>> not the intended recipient, and have received this message in error,<br>

>>> please notify us immediately by reply e-mail and then delete it from<br>

>>> your system.<br>

>>><br>

>>><br>

>>> _______________________________________________<br>

>>> Dev mailing list    <a href="mailto:Dev@ensembl.org">Dev@ensembl.org</a><br>

>>> Posting guidelines and subscribe/unsubscribe info:<br>

>>> <a href="http://lists.ensembl.org/mailman/listinfo/dev" target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>

>>> Ensembl Blog: <a href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a><br>

>>><br>

>>><br>

>>><br>


>>><br>

>><br>

>><br>


>><br>

><br>

><br>

><br>

</div></div>> --<br>

<div class="HOEnZb"><div class="h5">> G.<br>

><br>


><br>

<br>


</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr">G.</div></div>

</div>