<div class="gmail_quote">Hi Glenn-</div><div class="gmail_quote"><br></div><div class="gmail_quote">Thanks for your replies. </div><div class="gmail_quote"><br></div><div class="gmail_quote">On Wed, Mar 9, 2011 at 1:28 AM, Glenn Proctor <span dir="ltr"><<a href="mailto:glenn@ebi.ac.uk">glenn@ebi.ac.uk</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im">I'm a little unsure about this, some of our tables have a *lot* of</div>

rows, and the overhead of computing and comparing hashes will be<br>

non-trivial, plus storing the data in hashed form would, presumably,<br>

make the database's intrinsic optimisations less useful, and maybe<br>

even prevent indexes from working properly (or at all), which would<br>

cause things to grind to a halt very quickly.</blockquote></div><br><div>Hashes would be computed only twice: once just after installation and once just before migration. Furthermore, they'd be stored in a separate schema rather than in the Ensembl tables themselves. Think tripwire (the old security tool) for Ensembl.</div>

<div><br></div><div>Let's use the variation table as an example. Approach 1 would create a new schema and table, say hs6137fsha1.variation, to store &lt;variation_id,sha1&gt; immediately after installation. Time would pass and we'd make changes. At upgrade time, and only at upgrade time, we'd identify:</div>

<div><ul><li>new rows (keys in homo_sapiens_variation_61_37f.variation and not in hs6137fsha1.variation)</li><li>deleted rows (in hs6137fsha1.variation, not in homo_sapiens_variation_61_37f.variation)</li><li>changed rows (keys in both, sha1 differs)</li>

</ul><div>The only computational burden would be after installation and during migration.</div></div><div>The most common change is likely to be an insert. In fact, I have no existing use case for update or delete; I merely point out that this approach would also identify such rows.</div>
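<div><br></div><div>To make the three cases concrete, here is a minimal sketch of the comparison in Python. It assumes the rows have already been fetched into dicts keyed by variation_id; the function names (row_sha1, snapshot, diff) and the sample columns are hypothetical and not part of Ensembl. In practice the baseline would live in the side schema rather than in memory.</div>

```python
import hashlib


def row_sha1(row):
    """Return a SHA-1 hex digest of one row's column values."""
    # repr() keeps None distinct from the string 'None'; the separator
    # keeps column boundaries unambiguous.
    payload = "\x1f".join(repr(value) for value in row)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def snapshot(rows):
    """Map primary key to sha1 for rows given as {key: tuple_of_columns}."""
    return {key: row_sha1(cols) for key, cols in rows.items()}


def diff(baseline, current_rows):
    """Classify keys as new, deleted, or changed relative to the baseline."""
    current = snapshot(current_rows)
    new = sorted(k for k in current if k not in baseline)
    deleted = sorted(k for k in baseline if k not in current)
    changed = sorted(
        k for k in current if k in baseline and baseline[k] != current[k]
    )
    return new, deleted, changed


# Hypothetical data: variation_id mapped to (name, source_id).
installed = {1: ("rs1", 1), 2: ("rs2", 1), 3: ("rs3", 1)}
baseline = snapshot(installed)  # taken once, just after installation

# State at upgrade time: row 2 edited, row 3 deleted, row 4 added in-house.
at_upgrade = {1: ("rs1", 1), 2: ("rs2-edited", 1), 4: ("inhouse_0001", 2)}
new, deleted, changed = diff(baseline, at_upgrade)
print(new, deleted, changed)  # prints: [4] [3] [2]
```

<div>Only the keys in the first list would need to be carried over for the insert-only case; the other two lists are there to flag surprises.</div>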

<div><br></div><div><div>The context for all of this is that we will need to store novel variation and associated data. The Ensembl structure should work well and allow us to use existing tools. The only rub is how to transfer in-house data between releases. Perhaps this context will trigger new ideas. </div>

</div><div><br></div><div>Thanks for your time.</div><div><br></div><div>-Reece</div><div><br></div>