<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
<title></title>
</head>
<body text="#000000" bgcolor="#ffffff">
Pablo<br>
<br>
Thanks for your very helpful reply.<br>
<br>
How do you know all of this information about the database schema?
My plan was to install the database and work out the schema by
'playing' with the data, but as yet I can't install the database.<br>
<br>
How do you know which tables contain the data you need and the
relationships between them? Did you just get it from these docs?<br>
<a class="moz-txt-link-freetext" href="http://www.ensembl.org/info/docs/api/core/core_schema.html">http://www.ensembl.org/info/docs/api/core/core_schema.html</a><br>
<br>
I'm on a much faster machine now, and the core database import is
still taking over 3 days; it's at the ditag table. I think I'll stop
the process, comment out the indexes, and start again!<br>
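<br>
(From reading the MySQL docs, an alternative to editing the SQL might be to
disable the non-unique keys around each big table load and rebuild them at the
end. I believe this only applies to MyISAM tables, and I haven't tried it yet;
the database name and file path below are just my guesses:)
<br>
<pre>
mysql -u root -p homo_sapiens_core_60_37e -e 'alter table ditag disable keys;'
mysqlimport -u root -p --local --fields-terminated-by='\t' \
    homo_sapiens_core_60_37e /path/to/ditag.txt
mysql -u root -p homo_sapiens_core_60_37e -e 'alter table ditag enable keys;'
</pre>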
<br>
I'm happy not to download the whole database, but I don't know which
tables I need. I have posted the script I was going to run against
the remote ensembl database (it would take 2 weeks to run
remotely!) in case you can offer some advice on which tables to
get. It's a very simple script, even if it looks longish. It
basically creates one big table with a row for each Watson snp
consequence. Each consequence is a unique gene/exon/transcript
combination that the snp affects (e.g. if the snp is in a gene
called gene1, and is in exon1 of that gene, and that exon is in 2
splice variants, transcript1 and transcript2, then there are 2 rows
in the table: gene1/exon1/transcript1 and gene1/exon1/transcript2).
I used this denormalised database design just as a shortcut while I
was trying to actually get the data. I'm going to model it
differently if/when I get it :)<br>
<br>
I really appreciate all the help you have already given me. I just
assumed I could download the whole database and then query it with
the API. Perhaps I need to get more familiar with the schema so I
can just get the subsets of data I need.<br>
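<br>
(I suppose I could also ask the server itself which tables are the big ones,
to work out what I can safely skip. This is just a guess at the query, and
I've guessed the database name too:)
<br>
<pre>
mysql --port 5306 -u anonymous -h ensembldb.ensembl.org -e \
  'select table_name, round((data_length + index_length)/1024/1024) as MB
   from information_schema.tables
   where table_schema = "homo_sapiens_core_60_37e"
   order by (data_length + index_length) desc
   limit 15;'
</pre>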
<br>
<pre>
#!/usr/bin/perl -w

use lib '/home/andrea/ensembl-src/ensembl/modules';
use lib '/home/andrea/ensembl-src/ensembl-variation/modules';

use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::Slice qw(split_Slices);
use DBI;
use DBD::mysql;

my $species = 'homo_sapiens';
my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(-host => 'ensembldb.ensembl.org', -user => 'anonymous');

my $variation_set_adaptor     = $reg->get_adaptor('human', 'variation', 'variationset');
my $watson_set                = $variation_set_adaptor->fetch_by_name("ENSEMBL:Watson");
my $variation_feature_adaptor = $reg->get_adaptor('human', 'variation', 'variationfeature');
my $variation_adaptor         = $reg->get_adaptor('human', 'variation', 'variation');
my $slice_adaptor             = $reg->get_adaptor($species, 'core', 'slice');
my $gene_adaptor              = $reg->get_adaptor($species, 'core', 'gene');
my $transcript_adaptor        = $reg->get_adaptor($species, 'core', 'Transcript');

###################################
# get the database schema version
###################################

my @db_adaptors = @{ $reg->get_all_DBAdaptors(-group => 'variation') };
my $ens_schema;
foreach my $db_adaptor (@db_adaptors) {
    $ens_schema = $db_adaptor->_get_schema_build;
}

my $snp = 0;

# fetch all chromosomes, then break them into 100k component slices
my $slices   = $slice_adaptor->fetch_all('chromosome', undef, 0, 1);
my $overlap  = 0;        # base pair overlap between returned slices
my $max_size = 100_000;  # maximum size of returned slices
my @slices   = @{ split_Slices($slices, $max_size, $overlap) };

foreach my $slice (@slices) {
    next if $slice->seq_region_name() =~ /Un/;
    my @vfs = @{ $watson_set->get_all_VariationFeatures_by_Slice($slice) };
    foreach my $vf (@vfs) {
        next if $vf->var_class ne 'snp';
        $snp++;
        print "variation feature found: $snp\n";

        my ($chromosome, $locus, $gene_id, $biotype, $description, $gene_name,
            $transcript_id, $exon_id, $exon_start, $exon_end, $dbsnp_ref,
            $dbsnp_synonyms, $reference, $allele, $genotype, $main_consequence,
            $all_consequences, $peptide_alleles, $unique_mapping);

        $chromosome = $slice->seq_region_name();
        $locus      = $vf->start();
        $allele     = $vf->allele_string();
        $genotype   = $vf->ambig_code();  # quick hack; the workflow converts it back again

        $dbsnp_ref        = $vf->variation_name();
        $dbsnp_synonyms   = join ',', @{ $vf->variation->get_all_synonyms('dbSNP') };
        $reference        = substr $allele, 0, 1;
        $main_consequence = $vf->display_consequence();
        $all_consequences = join ',', @{ $vf->get_consequence_type() };

        # fetch all genome hits for this variation to test for unique mapping
        my $variation = $variation_adaptor->fetch_by_name($dbsnp_ref);
        my @vars = @{ $variation_feature_adaptor->fetch_all_by_Variation($variation) };
        $unique_mapping = @vars > 1 ? 'FALSE' : 'TRUE';

        my @tvs = @{ $vf->get_all_TranscriptVariations };
        if (@tvs) {
            print "vf has transcript variants\n";
            foreach my $tv (@tvs) {
                my $transcript = $tv->transcript();
                if ($transcript) {
                    $transcript_id = $transcript->stable_id();
                    my $gene = $gene_adaptor->fetch_by_transcript_stable_id($transcript_id);
                    ($gene_id, $biotype, $description, $gene_name) =
                        ($gene->stable_id, $gene->biotype, $gene->description, $gene->external_name);
                    $main_consequence = $tv->display_consequence();
                    $all_consequences = join ',', @{ $tv->consequence_type() };
                    $peptide_alleles  = $tv->pep_allele_string();

                    foreach my $exon (@{ $tv->transcript->get_all_Exons() }) {
                        if (    $vf->seq_region_start >= $exon->seq_region_start
                            and $vf->seq_region_start &lt;= $exon->seq_region_end) {
                            ($exon_start, $exon_end, $exon_id) =
                                ($exon->seq_region_start, $exon->seq_region_end, $exon->stable_id);
                            last;
                        }
                    } # end foreach exon
                } # end if transcript

                # insert the row here; columns stay NULL if no transcript is available
                insert_snp($chromosome, $locus, $gene_id, $biotype, $description,
                    $gene_name, $transcript_id, $exon_id, $exon_start, $exon_end,
                    $dbsnp_ref, $dbsnp_synonyms, $reference, $allele, $genotype,
                    $main_consequence, $all_consequences, $peptide_alleles,
                    $ens_schema, $unique_mapping);
            } # end foreach transcript variation
        }
        else {
            print "vf does not have transcript variants\n";
            # insert a row for a snp that is not in a gene
            insert_snp($chromosome, $locus, $gene_id, $biotype, $description,
                $gene_name, $transcript_id, $exon_id, $exon_start, $exon_end,
                $dbsnp_ref, $dbsnp_synonyms, $reference, $allele, $genotype,
                $main_consequence, $all_consequences, $peptide_alleles,
                $ens_schema, $unique_mapping);
        } # end if any transcript variants
    } # end foreach variation feature
} # end foreach slice

sub insert_snp {
    my @params = @_;
    my $db = DBI->connect(
        "DBI:mysql:database=annotationDB;host=localhost;port=3306",
        "root", "and1jon1", { 'RaiseError' => 0 })
        or die "Cannot connect: " . $DBI::errstr;
    my $sth = $db->prepare(
        "insert into watson(chromosome, locus, gene_id, biotype, description,
         gene_name, transcript_id, exon_id, exon_start, exon_end, dbsnp_ref,
         dbsnp_synonyms, reference, allele, genotype, main_consequence,
         all_consequences, peptide_alleles, ens_schema, unique_mapping)
         values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)")
        or die "Cannot prepare: " . $db->errstr();
    $sth->execute(@params) or die "Cannot execute: " . $db->errstr();
}
</pre>
<br>
On 14/12/2010 13:26, Pablo Marin-Garcia wrote:
<blockquote
cite="mid:alpine.DEB.1.10.1012141140370.1073@hgs3b.internal.sanger.ac.uk"
type="cite">
<br>
Hello Andrea
<br>
<br>
there are several things that can help you (hints rather than a full
solution :-):
<br>
<br>
A) First, when you are uploading big data sets to MySQL, you
should turn off the indexes or it will take ages. For a few
tables you can do it manually by commenting out the index creation
in the SQL, and later use the 'create index' command. I don't know
whether in MySQL you can load everything skipping the indexes with
mysqladmin and then rebuild all the indexes automatically.
<br>
<br>
<br>
B) The approach that I follow when working with full-genome snps
is to download the tables that I need once, directly with mysql, and
then use ensembl-like adaptors to work with them:
<br>
<br>
<pre>
mysql -B --port 5306 -u anonymous -h ensembldb.ensembl.org \
  -e 'use homo_sapiens_variation_60_37e;
      select variation_id as var_id, variation_name as snp, sr.name as chr,
             seq_region_start as start, seq_region_end as end,
             seq_region_strand as strand, allele_string, map_weight,
             flags, validation_status, consequence_type
      from variation_feature vf, seq_region sr
      where vf.seq_region_id = sr.seq_region_id' \
  > variations_build_37_ens-60.tab
</pre>
<br>
<br>
note:
<br>
This makes a 2 GB file, variations_build_37_ens-60.tab (you can make
more specific queries, joining source-variation_synonym-variation_feature,
in order to reduce the data). Also, the memory report tells me that mysql
was taking 4.5 GB during the download, so you should make more specific
queries on small computers. You can always download these three tables
(plus seq_region for the chr names) from the Ensembl db dumps and load
them into your mysql locally (it is not necessary to download the full
variation database).
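<br>
The per-table dumps live on the Ensembl FTP site; for release 60 the paths
should be something like this (from memory, so double-check the file names):
<br>
<pre>
wget ftp://ftp.ensembl.org/pub/release-60/mysql/homo_sapiens_variation_60_37e/variation_feature.txt.gz
wget ftp://ftp.ensembl.org/pub/release-60/mysql/homo_sapiens_variation_60_37e/variation_synonym.txt.gz
wget ftp://ftp.ensembl.org/pub/release-60/mysql/homo_sapiens_variation_60_37e/source.txt.gz
wget ftp://ftp.ensembl.org/pub/release-60/mysql/homo_sapiens_core_60_37e/seq_region.txt.gz
</pre>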
<br>
<br>
You can load the query-downloaded data into a local mysql with:
<br>
<br>
<pre>
============ loading script: load_ens_vf_b37.sql

-- load with:
-- mysql --host=variation_local --port=3636 --user=pmg -p &lt; load_ens_vf_b37.sql
--

use pmGWAS;

drop table if exists ens_vf_b37;
create table ens_vf_b37 (
  var_id            int(20),
  snp               varchar(20),
  chr               varchar(20),
  start             int(10) UNSIGNED NOT NULL,
  end               int(10) UNSIGNED NOT NULL,
  strand            enum('-1','1'),
  allele_string     varchar(10),
  map_weight        tinyint(2),
  flags             varchar(100),
  validation_status varchar(100),
  consequence_type  varchar(100)
);

load data local infile '/homes/pmg/ens_dump/variations_build_37_ens-60.tab'
into table ens_vf_b37
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(var_id, snp, chr, start, end, strand, allele_string, map_weight,
 flags, validation_status, consequence_type);

-- add the indexes after the insertion, to avoid slowing down the inserts
-- alter table ens_vf_b37 add index snp(snp);
-- alter table ens_vf_b37 add index pos(chr, start);

=============
</pre>
<br>
<br>
<br>
If I need specific data sets (Watson, Venter, dbSNP, ...), I make a
second table from the output of a join between the 'source' table and
'variation_synonym', to know which snps belong to a specific
source. That way I can make any genomic query faster locally,
joining my variation_feature table with my source-synonym table by
variation_id to extract the SNPs wanted.
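<br>
For example, something along these lines (from memory and untested; I think
the build-36 database was homo_sapiens_variation_54_36p, and the source name
is the build-36 one):
<br>
<pre>
mysql -B --port 5306 -u anonymous -h ensembldb.ensembl.org -e '
    use homo_sapiens_variation_54_36p;
    select vs.variation_id as var_id, vs.name as snp
    from variation_synonym vs join source s on vs.source_id = s.source_id
    where s.name = "ENSEMBL:Watson"' > watson_snps.tab
</pre>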
<br>
<br>
<br>
For your case, in build 36 it was straightforward because one of the
sources was "ENSEMBL:Watson", but in the new build 37 they have
added a lot more sources and there is no explicit Watson source any
more, so I don't know how to define it (now you have only
ENSEMBL, whereas in 36 there was ENSEMBL:celera/watson/venter):
<br>
<br>
<pre>
$ mysql_ensembl_variation_b37 'select name from source'
+----------------------------+
| name                       |
+----------------------------+
| dbSNP                      |
| Affy GeneChip 100K Array   |
| Affy GeneChip 500K Array   |
| Affy GenomeWideSNP_6.0     |
| NHGRI_GWAS_catalog         |
| EGA                        |
| Illumina_CytoSNP12v1       |
| Illumina_Human660W-quad    |
| Illumina_Human1M-duoV3     |
| Uniprot                    |
| COSMIC                     |
| ENSEMBL                    |
| DGVa:estd1                 |
| DGVa:estd3                 |
| Open Access GWAS Database  |
| HGMD-PUBLIC                |
| DGVa:nstd1                 |
| DGVa:nstd2                 |
| DGVa:nstd4                 |
| DGVa:nstd9                 |
| DGVa:nstd14                |
| DGVa:nstd16                |
| DGVa:nstd17                |
| DGVa:nstd20                |
| DGVa:nstd11                |
| DGVa:nstd22                |
| DGVa:nstd23                |
| DGVa:nstd27                |
| DGVa:nstd28                |
| DGVa:nstd29                |
| DGVa:nstd30                |
| DGVa:nstd31                |
| DGVa:nstd32                |
| DGVa:nstd34                |
| DGVa:nstd35                |
| Affy GenomeWideSNP_6.0 CNV |
| DGVa:nstd8                 |
| DGVa:estd19                |
| DGVa:estd20                |
| DGVa:estd21                |
| DGVa:estd22                |
| DGVa:estd24                |
| DGVa:nstd36                |
| DGVa:nstd39                |
| DGVa:estd48                |
| DGVa:estd49                |
| DGVa:estd50                |
+----------------------------+
</pre>
<br>
<br>
<br>
<br>
On Sun, 12 Dec 2010, Andrea Edwards wrote:
<br>
<br>
<blockquote type="cite">Hi
<br>
<br>
The dump files would be great, but I am also retrieving lots of
other information along with the snps, and that might not
necessarily be in your dump file, so I think I have to try other
options too.
<br>
<br>
This is what I have tried so far to get the Watson snps, and I'm not
getting anywhere fast :)
<br>
<br>
1. Written a perl script to download them from the ensembl human
variation database. This works, but at the rate it seems to be
running it will take over a month to get all the snps, and I
imagine you'll block my IP address if I leave it running :)
Plus I can't wait a month anyway.
<br>
<br>
2. I've tried to install the human variation database locally,
but that also seems to be having problems. It has been
installing the allele table for 3 days now, I think. It is
running on a very slow machine, but there are far bigger tables
than the allele table, so I dread to think how long they will
take. I tried to get access to a better machine but wasn't
given enough hard disk space; perhaps that will solve the
problem! How long (roughly) should it take to install the human
variation database on a 64-bit linux machine with 2 GB of RAM
and an Intel Xeon @ 2.27 GHz? Will it take hours or days?
<br>
<br>
Is there anything else I can try? I do appreciate that the
dataset is vast and these things will be slow. Perhaps the
answer is simply a faster machine to install the local database,
and I am looking into this.
<br>
<br>
I have already looked at getting the snps from dbSNP or directly
from the source, but I need the information associated with the
snps, so I think I will have the same problems retrieving the
associated data even if I get the 'raw snps' by other means.
<br>
<br>
Many thanks
<br>
<br>
On 09/12/2010 16:53, Fiona Cunningham wrote:
<br>
<blockquote type="cite"> Dear Andrea,
<br>
<br>
We will look into producing the dump file of all SNPs in
Watson for
<br>
the next release which should make your life easier. Biomart
is really
<br>
best suited to specific queries and so we should provide dump
files
<br>
where large amounts of information across the entire genome are
<br>
required.
<br>
<br>
Fiona
<br>
<br>
------------------------------------------------------
<br>
Fiona Cunningham
<br>
Ensembl Variation Project Leader, EBI
<br>
<a class="moz-txt-link-abbreviated" href="http://www.ensembl.org">www.ensembl.org</a>
<br>
<a class="moz-txt-link-abbreviated" href="http://www.lrg-sequence.org">www.lrg-sequence.org</a>
<br>
t: 01223 494612 || e: <a class="moz-txt-link-abbreviated" href="mailto:fiona@ebi.ac.uk">fiona@ebi.ac.uk</a>
<br>
<br>
<br>
<br>
On 9 December 2010 13:46, Andrea
Edwards<a class="moz-txt-link-rfc2396E" href="mailto:edwardsa@cs.man.ac.uk"><edwardsa@cs.man.ac.uk></a> wrote:
<br>
<blockquote type="cite">Dear all
<br>
<br>
I've tried downloading the Watson snps from BioMart, both a) as the
whole set and b) chromosome by chromosome, and I can't get the data.
I have tried requesting the data by email (no email received) and
direct download (the download starts, but at a rate of 1 KB per
second, and times out after about 12 hours / 10 MB downloaded).
<br>
<br>
I have written a script to get the Watson snps via the perl API,
but it is running and taking hours, so I am scared I will get my
IP blocked! There are 3 million snps and it took an hour to get
3000, I think.
<br>
<br>
I was thinking of getting the human databases directly, but I am
awaiting a new machine and am totally out of disk space. Does anyone
know how big the human core and variation databases are when
installed?
<br>
<br>
thanks a lot
<br>
<br>
_______________________________________________
<br>
Dev mailing list
<br>
<a class="moz-txt-link-abbreviated" href="mailto:Dev@ensembl.org">Dev@ensembl.org</a>
<br>
<a class="moz-txt-link-freetext" href="http://lists.ensembl.org/mailman/listinfo/dev">http://lists.ensembl.org/mailman/listinfo/dev</a>
<br>
<br>
</blockquote>
</blockquote>
<br>
<br>
<br>
<br>
</blockquote>
<br>
<br>
-----
<br>
<br>
Pablo Marin-Garcia
<br>
<br>
<br>
<br>
</blockquote>
<br>
</body>
</html>