<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
<title></title>
</head>
<body text="#000000" bgcolor="#ffffff">
Pablo<br>
<br>
Thanks for your very helpful reply.<br>
<br>
How do you know all of this information about the database schema?
My plan was to install the database and work out the schema by
'playing' with the data, but as yet I can't install the database.<br>
<br>
How do you know which tables contain the data you need and the
relationships between them? Did you just get it from these docs?<br>
<a class="moz-txt-link-freetext" href="http://www.ensembl.org/info/docs/api/core/core_schema.html">http://www.ensembl.org/info/docs/api/core/core_schema.html</a><br>
<br>
I'm on a much faster machine now, and the core database import is
still taking over 3 days; it's at the ditag table. I think I'll stop
the process, comment out the indexes, and start again!<br>
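<br>
(From reading the MySQL docs, an alternative to editing the SQL might be to
disable the non-unique keys around each big table load and rebuild them at the
end. I believe this only applies to MyISAM tables, and I haven't tried it yet;
the database name and file path below are just my guesses:)
<br>
<pre>
mysql -u root -p homo_sapiens_core_60_37e -e 'alter table ditag disable keys;'
mysqlimport -u root -p --local --fields-terminated-by='\t' \
    homo_sapiens_core_60_37e /path/to/ditag.txt
mysql -u root -p homo_sapiens_core_60_37e -e 'alter table ditag enable keys;'
</pre>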
<br>
I'm happy not to download the whole database, but I don't know which
tables I need. I have posted the script I was going to run against
the remote ensembl database (it would take 2 weeks to run
remotely!) in case you can offer some advice on which tables to
get. It's a very simple script, even if it looks longish. It
basically creates one big table with a row for each Watson snp
consequence. Each consequence is a unique gene/exon/transcript
combination that the snp affects (e.g. if the snp is in a gene
called gene1, and is in exon1 of that gene, and that exon is in 2
splice variants, transcript1 and transcript2, then there are 2 rows
in the table: gene1/exon1/transcript1 and gene1/exon1/transcript2).
I used this denormalised database design just as a shortcut while I
was trying to actually get the data. I'm going to model it
differently if/when I get it :)<br>
<br>
I really appreciate all the help you have already given me. I just
assumed I could download the whole database and then query it with
the API. Perhaps I need to get more familiar with the schema so I
can just get the subsets of data I need.<br>
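<br>
(I suppose I could also ask the server itself which tables are the big ones,
to work out what I can safely skip. This is just a guess at the query, and
I've guessed the database name too:)
<br>
<pre>
mysql --port 5306 -u anonymous -h ensembldb.ensembl.org -e \
  'select table_name, round((data_length + index_length)/1024/1024) as MB
   from information_schema.tables
   where table_schema = "homo_sapiens_core_60_37e"
   order by (data_length + index_length) desc
   limit 15;'
</pre>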
<br>
<pre>
#!/usr/bin/perl -w

use lib '/home/andrea/ensembl-src/ensembl/modules';
use lib '/home/andrea/ensembl-src/ensembl-variation/modules';

use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::Slice qw(split_Slices);
use DBI;
use DBD::mysql;

my $species = 'homo_sapiens';
my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(-host => 'ensembldb.ensembl.org', -user => 'anonymous');

my $variation_set_adaptor     = $reg->get_adaptor('human', 'variation', 'variationset');
my $watson_set                = $variation_set_adaptor->fetch_by_name("ENSEMBL:Watson");
my $variation_feature_adaptor = $reg->get_adaptor('human', 'variation', 'variationfeature');
my $variation_adaptor         = $reg->get_adaptor('human', 'variation', 'variation');
my $slice_adaptor             = $reg->get_adaptor($species, 'core', 'slice');
my $gene_adaptor              = $reg->get_adaptor($species, 'core', 'gene');
my $transcript_adaptor        = $reg->get_adaptor($species, 'core', 'Transcript');

###################################
# get the database schema version
###################################

my @db_adaptors = @{ $reg->get_all_DBAdaptors(-group => 'variation') };
my $ens_schema;
foreach my $db_adaptor (@db_adaptors) {
    $ens_schema = $db_adaptor->_get_schema_build;
}

my $snp = 0;

# fetch all chromosomes, then break them into 100k component slices
my $slices   = $slice_adaptor->fetch_all('chromosome', undef, 0, 1);
my $overlap  = 0;        # base pair overlap between returned slices
my $max_size = 100_000;  # maximum size of returned slices
my @slices   = @{ split_Slices($slices, $max_size, $overlap) };

foreach my $slice (@slices) {
    next if $slice->seq_region_name() =~ /Un/;
    my @vfs = @{ $watson_set->get_all_VariationFeatures_by_Slice($slice) };
    foreach my $vf (@vfs) {
        next if $vf->var_class ne 'snp';
        $snp++;
        print "variation feature found: $snp\n";

        my ($chromosome, $locus, $gene_id, $biotype, $description, $gene_name,
            $transcript_id, $exon_id, $exon_start, $exon_end, $dbsnp_ref,
            $dbsnp_synonyms, $reference, $allele, $genotype, $main_consequence,
            $all_consequences, $peptide_alleles, $unique_mapping);

        $chromosome = $slice->seq_region_name();
        $locus      = $vf->start();
        $allele     = $vf->allele_string();
        $genotype   = $vf->ambig_code();  # quick hack; the workflow converts it back again

        $dbsnp_ref        = $vf->variation_name();
        $dbsnp_synonyms   = join ',', @{ $vf->variation->get_all_synonyms('dbSNP') };
        $reference        = substr $allele, 0, 1;
        $main_consequence = $vf->display_consequence();
        $all_consequences = join ',', @{ $vf->get_consequence_type() };

        # fetch all genome hits for this variation to test for unique mapping
        my $variation = $variation_adaptor->fetch_by_name($dbsnp_ref);
        my @vars = @{ $variation_feature_adaptor->fetch_all_by_Variation($variation) };
        $unique_mapping = @vars > 1 ? 'FALSE' : 'TRUE';

        my @tvs = @{ $vf->get_all_TranscriptVariations };
        if (@tvs) {
            print "vf has transcript variants\n";
            foreach my $tv (@tvs) {
                my $transcript = $tv->transcript();
                if ($transcript) {
                    $transcript_id = $transcript->stable_id();
                    my $gene = $gene_adaptor->fetch_by_transcript_stable_id($transcript_id);
                    ($gene_id, $biotype, $description, $gene_name) =
                        ($gene->stable_id, $gene->biotype, $gene->description, $gene->external_name);
                    $main_consequence = $tv->display_consequence();
                    $all_consequences = join ',', @{ $tv->consequence_type() };
                    $peptide_alleles  = $tv->pep_allele_string();

                    foreach my $exon (@{ $tv->transcript->get_all_Exons() }) {
                        if (    $vf->seq_region_start >= $exon->seq_region_start
                            and $vf->seq_region_start &lt;= $exon->seq_region_end) {
                            ($exon_start, $exon_end, $exon_id) =
                                ($exon->seq_region_start, $exon->seq_region_end, $exon->stable_id);
                            last;
                        }
                    } # end foreach exon
                } # end if transcript

                # insert the row here; columns stay NULL if no transcript is available
                insert_snp($chromosome, $locus, $gene_id, $biotype, $description,
                    $gene_name, $transcript_id, $exon_id, $exon_start, $exon_end,
                    $dbsnp_ref, $dbsnp_synonyms, $reference, $allele, $genotype,
                    $main_consequence, $all_consequences, $peptide_alleles,
                    $ens_schema, $unique_mapping);
            } # end foreach transcript variation
        }
        else {
            print "vf does not have transcript variants\n";
            # insert a row for a snp that is not in a gene
            insert_snp($chromosome, $locus, $gene_id, $biotype, $description,
                $gene_name, $transcript_id, $exon_id, $exon_start, $exon_end,
                $dbsnp_ref, $dbsnp_synonyms, $reference, $allele, $genotype,
                $main_consequence, $all_consequences, $peptide_alleles,
                $ens_schema, $unique_mapping);
        } # end if any transcript variants
    } # end foreach variation feature
} # end foreach slice

sub insert_snp {
    my @params = @_;
    my $db = DBI->connect(
        "DBI:mysql:database=annotationDB;host=localhost;port=3306",
        "root", "and1jon1", { 'RaiseError' => 0 })
        or die "Cannot connect: " . $DBI::errstr;
    my $sth = $db->prepare(
        "insert into watson(chromosome, locus, gene_id, biotype, description,
         gene_name, transcript_id, exon_id, exon_start, exon_end, dbsnp_ref,
         dbsnp_synonyms, reference, allele, genotype, main_consequence,
         all_consequences, peptide_alleles, ens_schema, unique_mapping)
         values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)")
        or die "Cannot prepare: " . $db->errstr();
    $sth->execute(@params) or die "Cannot execute: " . $db->errstr();
}
</pre>
<br>
On 14/12/2010 13:26, Pablo Marin-Garcia wrote:
<blockquote
cite="mid:alpine.DEB.1.10.1012141140370.1073@hgs3b.internal.sanger.ac.uk"
type="cite">
<br>
Hello Andrea
<br>
<br>
there are several things that can help you (hints rather than a full
solution :-):
<br>
<br>
A) First, when you are uploading big data sets to MySQL, you
should turn off the indexes or it will take ages. For a few
tables you can do it manually by commenting out the index creation
in the SQL, and later use the 'create index' command. I don't know
whether in MySQL you can load everything skipping the indexes with
mysqladmin and then rebuild all the indexes automatically.
<br>
<br>
<br>
B) The approach that I follow when working with full-genome snps
is to download the tables that I need once, directly with mysql, and
then use ensembl-like adaptors to work with them:
<br>
<br>
<pre>
mysql -B --port 5306 -u anonymous -h ensembldb.ensembl.org \
  -e 'use homo_sapiens_variation_60_37e;
      select variation_id as var_id, variation_name as snp, sr.name as chr,
             seq_region_start as start, seq_region_end as end,
             seq_region_strand as strand, allele_string, map_weight,
             flags, validation_status, consequence_type
      from variation_feature vf, seq_region sr
      where vf.seq_region_id = sr.seq_region_id' \
  > variations_build_37_ens-60.tab
</pre>
<br>
<br>
note:
<br>
This makes a 2 GB file, variations_build_37_ens-60.tab (you can make
more specific queries, joining source-variation_synonym-variation_feature,
in order to reduce the data). Also, the memory report tells me that mysql
was taking 4.5 GB during the download, so you should make more specific
queries on small computers. You can always download these three tables
(plus seq_region for the chr names) from the Ensembl db dumps and load
them into your mysql locally (it is not necessary to download the full
variation database).
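<br>
The per-table dumps live on the Ensembl FTP site; for release 60 the paths
should be something like this (from memory, so double-check the file names):
<br>
<pre>
wget ftp://ftp.ensembl.org/pub/release-60/mysql/homo_sapiens_variation_60_37e/variation_feature.txt.gz
wget ftp://ftp.ensembl.org/pub/release-60/mysql/homo_sapiens_variation_60_37e/variation_synonym.txt.gz
wget ftp://ftp.ensembl.org/pub/release-60/mysql/homo_sapiens_variation_60_37e/source.txt.gz
wget ftp://ftp.ensembl.org/pub/release-60/mysql/homo_sapiens_core_60_37e/seq_region.txt.gz
</pre>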
<br>
<br>
You can load the query-downloaded data into a local mysql with:
<br>
<br>
<pre>
============ loading script: load_ens_vf_b37.sql

-- load with:
-- mysql --host=variation_local --port=3636 --user=pmg -p &lt; load_ens_vf_b37.sql
--

use pmGWAS;

drop table if exists ens_vf_b37;
create table ens_vf_b37 (
  var_id            int(20),
  snp               varchar(20),
  chr               varchar(20),
  start             int(10) UNSIGNED NOT NULL,
  end               int(10) UNSIGNED NOT NULL,
  strand            enum('-1','1'),
  allele_string     varchar(10),
  map_weight        tinyint(2),
  flags             varchar(100),
  validation_status varchar(100),
  consequence_type  varchar(100)
);

load data local infile '/homes/pmg/ens_dump/variations_build_37_ens-60.tab'
into table ens_vf_b37
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(var_id, snp, chr, start, end, strand, allele_string, map_weight,
 flags, validation_status, consequence_type);

-- add the indexes after the insertion, to avoid slowing down the inserts
-- alter table ens_vf_b37 add index snp(snp);
-- alter table ens_vf_b37 add index pos(chr, start);

=============
</pre>
<br>
<br>
<br>
If I need specific data sets (Watson, Venter, dbSNP, ...), I make a
second table from the output of a join between the 'source' table and
'variation_synonym', to know which snps belong to a specific
source. That way I can make any genomic query faster locally,
joining my variation_feature table with my source-synonym table by
variation_id to extract the SNPs wanted.
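<br>
For example, something along these lines (from memory and untested; I think
the build-36 database was homo_sapiens_variation_54_36p, and the source name
is the build-36 one):
<br>
<pre>
mysql -B --port 5306 -u anonymous -h ensembldb.ensembl.org -e '
    use homo_sapiens_variation_54_36p;
    select vs.variation_id as var_id, vs.name as snp
    from variation_synonym vs join source s on vs.source_id = s.source_id
    where s.name = "ENSEMBL:Watson"' > watson_snps.tab
</pre>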
<br>
<br>
<br>
For your case, in build 36 it was straightforward because one of the
sources was "ENSEMBL:Watson", but in the new build 37 they have
added a lot more sources and there is no explicit Watson source any
more, so I don't know how to define it (now you have only
ENSEMBL, whereas in 36 there was ENSEMBL:celera/watson/venter):
<br>
<br>
<pre>
$ mysql_ensembl_variation_b37 'select name from source'
+----------------------------+
| name                       |
+----------------------------+
| dbSNP                      |
| Affy GeneChip 100K Array   |
| Affy GeneChip 500K Array   |
| Affy GenomeWideSNP_6.0     |
| NHGRI_GWAS_catalog         |
| EGA                        |
| Illumina_CytoSNP12v1       |
| Illumina_Human660W-quad    |
| Illumina_Human1M-duoV3     |
| Uniprot                    |
| COSMIC                     |
| ENSEMBL                    |
| DGVa:estd1                 |
| DGVa:estd3                 |
| Open Access GWAS Database  |
| HGMD-PUBLIC                |
| DGVa:nstd1                 |
| DGVa:nstd2                 |
| DGVa:nstd4                 |
| DGVa:nstd9                 |
| DGVa:nstd14                |
| DGVa:nstd16                |
| DGVa:nstd17                |
| DGVa:nstd20                |
| DGVa:nstd11                |
| DGVa:nstd22                |
| DGVa:nstd23                |
| DGVa:nstd27                |
| DGVa:nstd28                |
| DGVa:nstd29                |
| DGVa:nstd30                |
| DGVa:nstd31                |
| DGVa:nstd32                |
| DGVa:nstd34                |
| DGVa:nstd35                |
| Affy GenomeWideSNP_6.0 CNV |
| DGVa:nstd8                 |
| DGVa:estd19                |
| DGVa:estd20                |
| DGVa:estd21                |
| DGVa:estd22                |
| DGVa:estd24                |
| DGVa:nstd36                |
| DGVa:nstd39                |
| DGVa:estd48                |
| DGVa:estd49                |
| DGVa:estd50                |
+----------------------------+
</pre>
<br>
<br>
<br>
<br>
On Sun, 12 Dec 2010, Andrea Edwards wrote:
<br>
<br>
<blockquote type="cite">Hi
<br>
<br>
The dump files would be great, but I am also retrieving lots of
other information along with the snps, and that might not
necessarily be in your dump file, so I think I have to try other
options too.
<br>
<br>
This is what I have tried so far to get the Watson snps, and I'm not
getting anywhere fast :)
<br>
<br>
1. Written a perl script to download them from the ensembl human
variation database. This works, but at the rate it seems to be
running it will take over a month to get all the snps, and I
imagine you'll block my IP address if I leave it running :)
Plus I can't wait a month anyway.
<br>
<br>
2. I've tried to install the human variation database locally,
but that also seems to be having problems. It has been
installing the allele table for 3 days now, I think. It is
running on a very slow machine, but there are far bigger tables
than the allele table, so I dread to think how long they will
take. I tried to get access to a better machine but wasn't
given enough hard disk space; perhaps that will solve the
problem! How long (roughly) should it take to install the human
variation database on a 64-bit linux machine with 2 GB of RAM
and an Intel Xeon @ 2.27 GHz? Will it take hours or days?
<br>
<br>
Is there anything else I can try? I do appreciate that the
dataset is vast and these things will be slow. Perhaps the
answer is simply a faster machine to install the local database,
and I am looking into this.
<br>
<br>
I have already looked at getting the snps from dbSNP or directly
from the source, but I need the information associated with the
snps, so I think I will have the same problems retrieving the
associated data even if I get the 'raw snps' by other means.
<br>
<br>
Many thanks
<br>
<br>
On 09/12/2010 16:53, Fiona Cunningham wrote:
<br>
<blockquote type="cite"> Dear Andrea,
<br>
<br>
We will look into producing the dump file of all SNPs in
Watson for
<br>
the next release which should make your life easier. Biomart
is really
<br>
best suited to specific queries and so we should provide dump
files
<br>
where large amounts of information across the entire genome are
<br>
required.
<br>
<br>
Fiona
<br>
<br>
------------------------------------------------------
<br>
Fiona Cunningham
<br>
Ensembl Variation Project Leader, EBI
<br>
<a class="moz-txt-link-abbreviated" href="http://www.ensembl.org">www.ensembl.org</a>
<br>
<a class="moz-txt-link-abbreviated" href="http://www.lrg-sequence.org">www.lrg-sequence.org</a>
<br>
t: 01223 494612 || e: <a class="moz-txt-link-abbreviated" href="mailto:fiona@ebi.ac.uk">fiona@ebi.ac.uk</a>
<br>
<br>
<br>
<br>
On 9 December 2010 13:46, Andrea
Edwards<a class="moz-txt-link-rfc2396E" href="mailto:edwardsa@cs.man.ac.uk"><edwardsa@cs.man.ac.uk></a> wrote:
<br>
<blockquote type="cite">Dear all
<br>
<br>
I've tried downloading the Watson snps from BioMart, both a) as the
whole set and b) chromosome by chromosome, and I can't get the data.
I have tried requesting the data by email (no email received) and
direct download (the download starts, but at a rate of 1 KB per
second, and times out after about 12 hours / 10 MB downloaded).
<br>
<br>
I have written a script to get the Watson snps via the perl API,
but it is running and taking hours, so I am scared I will get my
IP blocked! There are 3 million snps and it took an hour to get
3000, I think.
<br>
<br>
I was thinking of getting the human databases directly, but I am
awaiting a new machine and am totally out of disk space. Does anyone
know how big the human core and variation databases are when
installed?
<br>
<br>
thanks a lot
<br>
<br>
_______________________________________________
<br>
Dev mailing list
<br>
<a class="moz-txt-link-abbreviated" href="mailto:Dev@ensembl.org">Dev@ensembl.org</a>
<br>
<a class="moz-txt-link-freetext" href="http://lists.ensembl.org/mailman/listinfo/dev">http://lists.ensembl.org/mailman/listinfo/dev</a>
<br>
<br>
</blockquote>
</blockquote>
<br>
<br>
<br>
<br>
</blockquote>
<br>
<br>
-----
<br>
<br>
Pablo Marin-Garcia
<br>
<br>
<br>
<br>
</blockquote>
<br>
</body>
</html>