<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hello,<br><br>I would like to bulk download the microarray probe annotations described in BMC Genomics 2010, "Consistent annotation of gene expression arrays".  I initially explored using the API to access the data, but found that this method would likely be too slow to be useful. Below is the main section of my test Perl script, which pulls only the human arrays and retrieves only the probes, without yet worrying about what they map to.<div><br></div><div>API version 72<br><div><br></div><div>my $array_adaptor = $registry->get_adaptor('homo_sapiens','funcgen','array');</div><div><div>my @array = @{$array_adaptor->fetch_all}; # fetch all human arrays</div><div><div>my $probe_adaptor = $registry->get_adaptor('homo_sapiens','funcgen','probe');</div></div><div><div><div>foreach my $array ( @array ){</div><div><span class="Apple-tab-span" style="white-space: pre; ">   </span>print "\nArray:\t".$array->name ."\t" . "Vendor:\t".$array->vendor ."\t" . "ProbeCount: ".$array->probe_count(). "\t" . "time: ". time . "\n";</div><div><br></div><div><span class="Apple-tab-span" style="white-space: pre; ">  </span>print "Getting all probes...\n"; </div><div><span class="Apple-tab-span" style="white-space: pre; ">  </span>my @probes = @{$probe_adaptor->fetch_all_by_Array($array)};</div><div><span class="Apple-tab-span" style="white-space: pre; ">    </span>print "Gotten!\t". time . "\n";</div><div><span class="Apple-tab-span" style="white-space: pre; ">       </span>print scalar(@probes) . "\n";</div></div></div><div>}</div><div><br></div><div>When I run the script, it successfully retrieves the probes for the first 2 arrays (HumanWG_6_V2 and HumanWG_6_V3), each with ~48,000 probes, taking approximately 12.5 minutes per array.  
However, the next array HuEx-1_0-st-v2 contains 5,431,924 probes, and I became frustrated and canceled it after several hours.  </div></div><div><br></div><div>Output:</div><div><div>Array:<span class="Apple-tab-span" style="white-space: pre; "> </span>HumanWG_6_V2<span class="Apple-tab-span" style="white-space: pre; ">     </span>Vendor:<span class="Apple-tab-span" style="white-space: pre; ">  </span>ILLUMINA<span class="Apple-tab-span" style="white-space: pre; "> </span>ProbeCount: 48701<span class="Apple-tab-span" style="white-space: pre; ">        </span>time: 1373385951</div><div>Getting all probes...</div><div>Gotten!<span class="Apple-tab-span" style="white-space: pre; ">       </span>1373386705</div><div>48701</div><div><br></div><div>Array:<span class="Apple-tab-span" style="white-space: pre; ">     </span>HumanWG_6_V3<span class="Apple-tab-span" style="white-space: pre; ">     </span>Vendor:<span class="Apple-tab-span" style="white-space: pre; ">  </span>ILLUMINA<span class="Apple-tab-span" style="white-space: pre; "> </span>ProbeCount: 48802<span class="Apple-tab-span" style="white-space: pre; ">        </span>time: 1373386708</div><div>Getting all probes...</div><div>Gotten!<span class="Apple-tab-span" style="white-space: pre; ">       </span>1373387464</div><div>48802</div><div><br></div><div>Array:<span class="Apple-tab-span" style="white-space: pre; ">     </span>HuEx-1_0-st-v2<span class="Apple-tab-span" style="white-space: pre; ">   </span>Vendor:<span class="Apple-tab-span" style="white-space: pre; ">  </span>AFFY<span class="Apple-tab-span" style="white-space: pre; ">     </span>ProbeCount: 5431924<span class="Apple-tab-span" style="white-space: pre; ">      </span>time: 1373387502</div><div>Getting all probes...</div><div>^C</div></div><div><br></div><div>Based on the above, I think that the best way to accomplish a bulk download would be to do a direct MySQL call against the database, probably a private instance of the database spun up on an 
Amazon machine image.  I am confident that I can spin up an Amazon image by following the instructions here: <a href="http://useast.ensembl.org/info/data/amazon_aws.html">http://useast.ensembl.org/info/data/amazon_aws.html</a></div><div><br></div><div>However, I need some help working out where the probe annotations are stored within the MySQL database.  I have identified the array, array_chip, probe, and probe_set tables, and joined them to the probe_feature, analysis, and seq_region tables.  However, I'm not clear on exactly how the next jump to genes should be made.  Additionally, the original paper described logic for mapping Affy probesets to genes based on the individual probes. How is this encoded within the database?  My MySQL code so far is below.  </div><div><br></div><div><div>USE homo_sapiens_funcgen_72_37;</div><div><br></div><div>SELECT </div><div>array.name,</div><div>#array.*, array_chip.*,</div><div>IF ( !ISNULL(probe_set.name), probe_set.name, probe.name ) AS probe_or_probeSet_name,</div><div>probe_feature.*</div><div><br></div><div>FROM array</div><div>JOIN array_chip ON array_chip.array_id = array.array_id</div><div>JOIN probe ON probe.array_chip_id = array_chip.array_chip_id</div><div>LEFT JOIN probe_set ON probe.probe_set_id = probe_set.probe_set_id</div><div>JOIN probe_feature ON probe_feature.probe_id = probe.probe_id</div><div>JOIN analysis ON probe_feature.analysis_id = analysis.analysis_id</div><div>JOIN analysis_description ON analysis_description.analysis_id = analysis.analysis_id</div><div>JOIN seq_region ON seq_region.seq_region_id = probe_feature.seq_region_id</div><div><br></div><div>WHERE analysis.module = "ProbeAlign"</div><div>AND array.array_id=28  LIMIT 30;</div></div><div><br></div><div>Alternatively, I saw a reference suggesting that the probe mapping information is available through BioMart and that the BioMart SQL tables might contain the data in a de-normalized format.  
Would I be better off accessing <a href="http://martdb.ensembl.org/">martdb.ensembl.org</a>, and if so, which tables should I be looking in?</div><div><br></div><div>Thank you,</div><div>Alex Holman</div></div><div><br></div></body></html>