<div dir="ltr">Hi Drew,<div><br></div><div>The Ensembl Variation team created the variation DB for the 1000 genomes browser site, so we do have relevant experience here.</div><div><br></div><div>We too found that the volume of data stretches a lot of resources in ways they don't want to be stretched. To this end we have a couple of solutions in the API that mitigate these issues to some extent.</div><div><br></div><div>1) We don't load the allele or population_genotype table with 1000 genomes data; in fact we started doing this with phase 1. There exists code in the API to generate allele and population_genotype objects on the fly from just the data in the compressed_genotype_var table; in order to enable this, you simply need to set the freqs_from_gts column to 1 in the population table for the relevant populations.</div><div><br></div><div>Note that this data is then unavailable via direct SQL query as the genotype data in compressed_genotype_var is readable only by the API.</div><div><br></div><div>2) For the phase 3 data, we realised that populating compressed_genotype_region and compressed_genotype_var would also become impractical; tests indicated that each table would be around 1.5TB on disk!</div><div><br></div><div>To this end we developed some more API trickery to load genotype data on the fly from VCF files. The genotype objects from this are then also fed into the features mentioned above, such that all genotype and frequency data that you see at e.g. <a href="http://browser.1000genomes.org/Homo_sapiens/Variation/Population?v=rs699">http://browser.1000genomes.org/Homo_sapiens/Variation/Population?v=rs699</a> come from the same VCFs you are using to load your DB. The same code also powers the LD views.</div><div><br></div><div>The API code for this is currently available on release/76 of the ensembl-variation Git repo. If you'd like some help getting this set up, let me know and I can give you more details on what to do; the code introduces a couple of extra dependencies and requires some (very simple) configuration. It has not yet made it into the master branch as it's still somewhat under development, but it is stable enough for use for a number of applications.</div><div><br></div><div>I realise you said you were on release/75, but it should be possible for you to simply patch your DB to release/76 using the ensembl/misc-scripts/<a href="http://schema_patcher.pl">schema_patcher.pl</a> script. If not it may also be possible to push the code onto the release/75 branch too.</div><div><br></div><div>Regarding transcript variation, the pipeline we use to populate the table is not currently documented for users outside the project. It may be possible to prepare some documentation suitable for external use. There is also a way to have <a href="http://import_vcf.pl">import_vcf.pl</a> use the same code and optimised caches that the Variant Effect Predictor uses, though this is alpha code at best.</div><div><br></div><div>I hope this helps, though I realise it doesn't necessarily address all of the issues you've been having!</div><div><br></div><div>Regards</div><div><br></div><div>Will McLaren</div><div>Ensembl Variation</div><div><br></div><div><div class="gmail_extra"><br><div class="gmail_quote">On 14 November 2014 01:47, Roberts, Drew (RY) <span dir="ltr"><<a href="mailto:andrew_roberts@merck.com" target="_blank">andrew_roberts@merck.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Hello all-<br>

<br>

I am new to EnsEMBL software.  I am trying to use the <a href="http://vcf_import.pl" target="_blank">vcf_import.pl</a> script to load a set of VCF files produced by phase 3 of the 1000Genomes project into a custom EnsEMBL variation database I created in-house.  I confirmed that this process works for a smaller dataset, but am having issues with 1000Genomes.  The sheer size of the dataset is proving difficult.<br>

<br>

We are using an in-house EnsEMBL database and API tools with version 75_37...a little behind the latest, I know, but we wanted to stay with the human genome build 37 for now as most of our in-house data uses it.<br>

<br>

I ran a number of <a href="http://vcf_import.pl" target="_blank">vcf_import.pl</a> jobs in parallel -- one for each chromosome -- with the following config file:<br>

<br>

registry      /tmp60days/robertsa/vcf_file_import/<a href="http://vcf_import_ensembl_registry.pl" target="_blank">vcf_import_ensembl_registry.pl</a><br>

input_file    /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/reduced.ALL.chr15.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz<br>

source        1000Genomes<br>

population    1000Genomes:phase_3:MERCK_PRELOAD<br>

panel         /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/sample_population_panel_file.txt<br>

species       homo_sapiens<br>

tmpdir       /tmp60days/robertsa/vcf_file_import/vcf_load_tmpdir<br>

add_tables   compressed_genotype_region<br>

<br>

The notable part of this is that we are loading all standard tables plus compressed_genotype_region.  We would also like to load transcript_variation, but tests on small data sets showed that this slowed down the import a lot, and the vcf_import web page claimed it was faster to populate this table later "using the standard transcript_variation pipeline", see:<br>

<a href="http://www.ensembl.org/info/genome/variation/import_vcf.html#tables" target="_blank">http://www.ensembl.org/info/genome/variation/import_vcf.html#tables</a><br>

However I have not been able to find any documentation for this "standard pipeline", and I found an exchange on this mailing list where a user was told not to try to use it.  So:<br>

<br>

Question 1:  Is there still not a standard transcript_variation pipeline?  If it exists, can somebody point me to it?<br>

<br>

This upload using the config above runs, but still quite slowly...a little over 200 variants per minute.  It looked like it would take at least a week and a half to finish, running 10 or 12 jobs in parallel.  The MySQL database seemed to be holding up OK, while each individual perl script consumed close to 100% of the CPU time for the processor it was running on.<br>

<br>

About halfway through the process everything halted.  It turned out that the auto-increment "allele_id" column in the "allele" table had run out of integers!  It hit the maximum integer for the 'signed int' datatype which EnsEMBL uses for this column.  I have been converting the table to use "bigint" instead of "int" for this column.  However I wondered:<br>

<br>

Question 2:  Has anybody else run into this problem?  In particular, have the EnsEMBL folks encountered tried to load phase 3 1000GENOMES data yet?  It feels like I must be doing something wrong, but I've been using the standard script.<br>

<br>

In general I notice that the standard EnsEMBL variation database has far fewer allele entries per variation than my custom vcf_import loaded database does.  In other ways it seems that the sheer size of my custom database (over a terabyte and only halfway through the load) is larger than the standard database, which manages to include earlier 1000GENOMES data plus plenty of other stuff despite its smaller size.  And so:<br>

<br>

Question 3:  Am I totally going about this the wrong way?  The website I linked above says the EnsEMBL team uses <a href="http://vcf_import.pl" target="_blank">vcf_import.pl</a> to load 1000GENOMES data.  If that is true, can they tell me what table options they use?  Perhaps they are skipping tables I am keeping, or doing something else that I should know about.  Any suggestions would be welcome.<br>

<br>

Hope this makes sense -- I am new to all this, so I might have forgotten to provide information you need.<br>

<br>

In any event, thanks for any suggestions!<br>

Drew<br>

Notice:  This e-mail message, together with any attachments, contains<br>

information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,<br>

New Jersey, USA 08889), and/or its affiliates Direct contact information<br>

for affiliates is available at<br>

<a href="http://www.merck.com/contact/contacts.html" target="_blank">http://www.merck.com/contact/contacts.html</a>) that may be confidential,<br>

proprietary copyrighted and/or legally privileged. It is intended solely<br>

for the use of the individual or entity named on this message. If you are<br>

not the intended recipient, and have received this message in error,<br>

please notify us immediately by reply e-mail and then delete it from<br>

your system.<br>

<br>

<br>

_______________________________________________<br>

Dev mailing list    <a href="mailto:Dev@ensembl.org">Dev@ensembl.org</a><br>

Posting guidelines and subscribe/unsubscribe info: <a href="http://lists.ensembl.org/mailman/listinfo/dev" target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>

Ensembl Blog: <a href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a><br>

</blockquote></div><br></div></div></div>