<div class="gmail_quote">On Wed, Mar 9, 2011 at 9:06 AM, Will McLaren <span dir="ltr">&lt;<a href="mailto:wm2@ebi.ac.uk">wm2@ebi.ac.uk</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div id=":30">In many cases this will be true; however, when we build a new Ensembl<br>

Variation database from a new dbSNP release, there will be basically<br>

no rows in common between the new and previous DBs. </div></blockquote><div><br></div><div>I wasn't sufficiently clear. The proposal is to hash rows in release 61, let insert/update/delete changes happen during the course of our use, then, when 62 is released, repeat the hash process *in the same database*. The point is to identify changes that were made to the database instance since installation. Propagating those changes would depend on how much the schema changed, but would generally be done semi-automatically.</div>
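<div><br></div><div>As a rough illustration of the workflow described above, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL purely so that it runs anywhere; the cut-down table, the toy rows, and the helper names (row_digest, snapshot, diff_snapshots, demo) are all hypothetical and not part of Ensembl.</div>

```python
import hashlib
import sqlite3


def row_digest(row):
    """SHA-1 over a row's columns; NULL is encoded as the text 'NULL'."""
    text = "".join("NULL" if value is None else str(value) for value in row)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def snapshot(conn):
    """Map each variation_id to a digest of its remaining columns."""
    cursor = conn.execute(
        "SELECT variation_id, source_id, name, validation_status FROM variation"
    )
    return {row[0]: row_digest(row[1:]) for row in cursor}


def diff_snapshots(before, after):
    """Return (inserted, updated, deleted) primary keys between two snapshots."""
    inserted = sorted(after.keys() - before.keys())
    deleted = sorted(before.keys() - after.keys())
    updated = sorted(k for k in before.keys() & after.keys() if before[k] != after[k])
    return inserted, updated, deleted


def demo():
    # Hypothetical, cut-down variation table with three toy rows.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variation (variation_id INTEGER PRIMARY KEY, "
        "source_id INTEGER, name TEXT, validation_status TEXT)"
    )
    conn.executemany(
        "INSERT INTO variation VALUES (?, ?, ?, ?)",
        [(1, 1, "rs1", None), (2, 1, "rs2", "cluster"), (3, 1, "rs3", None)],
    )
    before = snapshot(conn)  # taken right after installing the release

    # Local changes made during normal use of the database.
    conn.execute("UPDATE variation SET validation_status = 'freq' WHERE variation_id = 1")
    conn.execute("DELETE FROM variation WHERE variation_id = 2")
    conn.execute("INSERT INTO variation VALUES (4, 2, 'local1', NULL)")

    after = snapshot(conn)  # taken just before moving to the next release
    conn.close()
    return diff_snapshots(before, after)


if __name__ == "__main__":
    print(demo())
```

<div>The three returned lists are the locally inserted, updated, and deleted primary keys, which is the set of changes that would then need to be carried over to the new release.</div>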

<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div id=":30">

Have you factored in the size of our tables? In human, for example,<br>

our variation database is 76gb in total, with many tables containing<br>

hundreds of millions of rows.</div></blockquote><div> </div></div>I just ran a test that was even faster than I expected. Here's a sample:<div><br><div><div><font class="Apple-style-span" face="'courier new', monospace">mysql> create table reece.variation_sha1 as select variation_id,sha1(concat(coalesce(source_id,'NULL'),coalesce(name,'NULL'),coalesce(validation_status,'NULL'),coalesce(ancestral_allele,'NULL'),coalesce(flipped,'NULL'),coalesce(class_so_id,'NULL'))) as sha1 from variation ;</font></div>

<div><font class="Apple-style-span" face="'courier new', monospace">Query OK, 30443264 rows affected (2 min 9.36 sec)</font></div><div><font class="Apple-style-span" face="'courier new', monospace">Records: 30443264  Duplicates: 0  Warnings: 0</font></div>

<div><br></div><div>So, that's ~2 minutes to checksum ~30M rows, or roughly 235,000 rows per second. That completely allays my concern about timing. If you remain concerned, I'd love to know what I'm not seeing. (BTW, this is on an m1.large instance with the MySQL data directory on an EBS volume.)</div>
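<div><br></div><div>One caveat that may be worth checking in the query above: plain concatenation does not record where one column ends and the next begins, and a NULL becomes indistinguishable from the literal string 'NULL'. Two different rows can therefore feed the same text to sha1(). The sketch below reproduces this in Python with made-up values. The helper names are hypothetical, and the length-prefix encoding is just one possible remedy, not something proposed in this thread.</div>

```python
import hashlib


def digest_concat(row):
    """Mirror sha1(concat(coalesce(col, 'NULL'), ...)): no field boundaries."""
    text = "".join("NULL" if v is None else str(v) for v in row)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def digest_delimited(row):
    """Length-prefix each field so boundaries and NULLs are unambiguous."""
    parts = []
    for v in row:
        if v is None:
            parts.append("-1:")  # NULL marker; no real field has length -1
        else:
            s = str(v)
            parts.append(f"{len(s)}:{s}")
    return hashlib.sha1("".join(parts).encode("utf-8")).hexdigest()


# Two made-up rows that differ in every column.
row_a = (1, "23", None)
row_b = (12, "3", "NULL")

if __name__ == "__main__":
    print(digest_concat(row_a) == digest_concat(row_b))
    print(digest_delimited(row_a) == digest_delimited(row_b))
```

<div>Both rows concatenate to the same string under the first scheme, so a change from one to the other would go undetected. Whether this matters in practice depends on the column types and values in the real table.</div>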

<div><br></div></div></div><div><br></div><div>Thanks,</div><div>Reece</div><div><br></div>