Hi,<div><br></div><div>A variation set is not by design associated with a population; it is just a generic way to group together variations. Variants from one set may be observed in more than one population; by their very nature variants can of course occur in the same position in multiple individuals and populations.</div>
<div><br></div><div>There is no fail-safe way to join up sets and populations as they were not designed that way; it so happens that some sets denote groups of variants observed in a population in a particular study (1000 genomes, for example), but the way the data are constructed is different for the sample table and the variation_set tables.</div>
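<div><br></div><div>As a rough illustration only, one heuristic is to count, for each population, how many variations in a set have allele rows for it. The sketch below runs against a toy in-memory SQLite database; the table and column names (variation_set_variation, allele, sample and their ID columns) and all rows are assumptions made up for the example, not the real schema or data.</div>

```python
import sqlite3

# Toy in-memory stand-in for the variation database. Table names, column
# names and every row below are assumptions made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variation_set_variation (variation_id INTEGER, variation_set_id INTEGER);
CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE allele (variation_id INTEGER, sample_id INTEGER, allele TEXT, frequency REAL);

INSERT INTO variation_set_variation VALUES (101, 1), (102, 1), (103, 1);
INSERT INTO sample VALUES (10, 'POP_A'), (20, 'POP_B');
INSERT INTO allele VALUES
  (101, 10, 'A', 0.6), (101, 10, 'G', 0.4),
  (102, 10, 'C', 1.0),
  (103, 10, 'T', 0.5), (103, 10, 'C', 0.5),
  (101, 20, 'A', 1.0);
""")


def populations_for_set(conn, set_id):
    """Count, per population, the set's variations that have allele rows there.

    Returns (population name, variation count) tuples, best-covered first.
    """
    return conn.execute(
        """
        SELECT s.name, COUNT(DISTINCT a.variation_id) AS n_variations
        FROM variation_set_variation vsv
        JOIN allele a ON a.variation_id = vsv.variation_id
        JOIN sample s ON s.sample_id = a.sample_id
        WHERE vsv.variation_set_id = ?
        GROUP BY s.sample_id, s.name
        ORDER BY n_variations DESC, s.name
        """,
        (set_id,),
    ).fetchall()


for name, n_variations in populations_for_set(conn, 1):
    print(name, n_variations)
```

<div>High overlap only shows that a set's variants have been observed in a population; it does not show that the set was defined from that population, so treat the result as a hint and not as a join key.</div>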
<div><br></div><div>Will</div><div><br><div class="gmail_quote">On 4 January 2011 15:35, Andrea Edwards <span dir="ltr"><<a href="mailto:edwardsa@cs.man.ac.uk">edwardsa@cs.man.ac.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div text="#000000" bgcolor="#ffffff">
Hello<br>
<br>
Thanks for your reply. However, I still don't quite see how you know
<i>programmatically</i> which population a variation set is associated
with:<br>
<br>
<br>
-if the variations in the variation set belong to only one
population then you can <i>assume</i> that the variation set
relates to that population<br>
<br>
but what about when the variation belongs to multiple
populations? There are rows for each population the variation
belongs to in the allele table<br>
-the name of the variation set (e.g. Ensembl Watson) is not the same
as the name of the population (ENSEMBL:ENSEMBL_Watson) so you can't
do a join on the variation_set_name to the sample_name to filter the
appropriate records by population from the allele table<br>
<br>
Also, is it possible that a variation set could pertain to multiple
populations?<br>
<br>
==============================================<br>
Example of difficulty finding population for a variation set<br>
===============================================<br>
the variation sets for 1000 genomes are:<br>
<br>
mysql> select variation_set_id, name from variation_set;<br>
+-------------------------------------<br>
| id, name<br>
+-------------------------------------<br>
| 1, 1000 genomes<br>
| 8, 1000 genomes - Low coverage<br>
| 3, 1000 genomes - Trios - CEU<br>
| 4, 1000 genomes - Trios - YRI<br>
<br>
Looking at the variation_set_structure table, the last 3 variation
sets are subsets of the first "1000 genomes" <br>
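<br>
The subset relationship described above can be sketched as a query over variation_set_structure. This is a minimal illustration on a toy in-memory SQLite database loaded with the four sets listed above; the column names variation_set_super and variation_set_sub are assumptions and should be checked against the real schema.<br>

```python
import sqlite3

# Toy in-memory copy of the set tables, using the four sets listed above.
# The column names variation_set_super and variation_set_sub are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variation_set (variation_set_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE variation_set_structure (variation_set_super INTEGER, variation_set_sub INTEGER);

INSERT INTO variation_set VALUES
  (1, '1000 genomes'),
  (8, '1000 genomes - Low coverage'),
  (3, '1000 genomes - Trios - CEU'),
  (4, '1000 genomes - Trios - YRI');
INSERT INTO variation_set_structure VALUES (1, 8), (1, 3), (1, 4);
""")


def direct_subsets(conn, parent_name):
    """Return (id, name) for each set that is a direct subset of parent_name."""
    return conn.execute(
        """
        SELECT sub.variation_set_id, sub.name
        FROM variation_set parent
        JOIN variation_set_structure vss
          ON vss.variation_set_super = parent.variation_set_id
        JOIN variation_set sub
          ON sub.variation_set_id = vss.variation_set_sub
        WHERE parent.name = ?
        ORDER BY sub.variation_set_id
        """,
        (parent_name,),
    ).fetchall()


for set_id, name in direct_subsets(conn, "1000 genomes"):
    print(set_id, name)
```

This only walks one level of the hierarchy; deeper nesting would need a repeated or recursive lookup.<br>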
<br>
I don't know which populations these 4 variation sets pertain to.
There are 3 possibilities in the population table<br>
<br>
mysql> select id, name from sample where name like "%1000%";<br>
+-----------------------------+<br>
| id, name |<br>
+-----------------------------+<br>
| 11273, 1000GENOMES:pilot.1.CEU |<br>
| 11274, 1000GENOMES:pilot.1.CHB+JPT |<br>
| 11275, 1000GENOMES:pilot.1.YRI |<br>
.....<br>
56 rows in set (0.06 sec)<br>
<br>
All population names contain the digit 1, suggesting they belong to
the first low coverage pilot, which in turn suggests variation set 8
corresponds to populations 11273, 11274 and 11275 <br>
<br>
But I know this can't be right: which populations do the
individuals in variation sets 3 and 4 belong to, and which
populations do the individuals in variation set 1 (but not 8, 3 and
4) belong to?<br>
<br>
thanks a lot<div><div></div><div class="h5"><br>
<br>
<br>
<br>
On 22/12/2010 09:44, Will McLaren wrote:
<blockquote type="cite">Hi Andrea,
<div><br>
</div>
<div>Apologies, but the schema document has not been updated to
include the variation set tables - your understanding of them is
correct. Sets are a generic and catch-all way of grouping
variations - it allows us to group, for example, all variants
from the HapMap project, or all variants with phenotypic
associations, or in this case all variants called in a
particular individual.</div>
<div><br>
</div>
<div>Alleles are linked to populations, and there is a population
representing Watson (the population is named
"ENSEMBL:ENSEMBL_Watson" and is of size one, and has an
individual named "Watson"). Thus if a variation belongs to the
Watson set, it should have a pair of alleles linked to the
Watson population.</div>
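<div><br></div><div>A minimal sketch of that lookup, assuming tables and columns named as below (they are assumptions for illustration, as are the toy rows): restrict the alleles of a set's variations to one population, so that a variation that is triallelic overall shows only the alleles seen in Watson.</div>

```python
import sqlite3

# Toy in-memory database. The set and population names come from this thread;
# the table layout, column names, IDs and allele rows are made-up assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variation_set (variation_set_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE variation_set_variation (variation_id INTEGER, variation_set_id INTEGER);
CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE allele (variation_id INTEGER, sample_id INTEGER, allele TEXT);

INSERT INTO variation_set VALUES (1, 'Ensembl Watson');
INSERT INTO variation_set_variation VALUES (1, 1);
INSERT INTO sample VALUES (10, 'ENSEMBL:ENSEMBL_Watson'), (20, 'TOY:other_population');
INSERT INTO allele VALUES
  (1, 10, 'A'), (1, 10, 'G'),
  (1, 20, 'A'), (1, 20, 'G'), (1, 20, 'T');
""")


def alleles_in_population(conn, set_name, population_name):
    """Return (variation_id, allele) rows for a set, limited to one population."""
    return conn.execute(
        """
        SELECT vsv.variation_id, a.allele
        FROM variation_set vs
        JOIN variation_set_variation vsv
          ON vsv.variation_set_id = vs.variation_set_id
        JOIN allele a ON a.variation_id = vsv.variation_id
        JOIN sample s ON s.sample_id = a.sample_id
        WHERE vs.name = ? AND s.name = ?
        ORDER BY vsv.variation_id, a.allele
        """,
        (set_name, population_name),
    ).fetchall()


for variation_id, allele in alleles_in_population(
    conn, "Ensembl Watson", "ENSEMBL:ENSEMBL_Watson"
):
    print(variation_id, allele)
```

<div>Note that both names have to be supplied by hand: nothing in these tables ties the set name to the population name.</div>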
<div><br>
</div>
<div>Cheers</div>
<div><br>
</div>
<div>Will<br>
<br>
<div class="gmail_quote">On 21 December 2010 21:06, Andrea
Edwards <span dir="ltr"><<a href="mailto:edwardsa@cs.man.ac.uk" target="_blank">edwardsa@cs.man.ac.uk</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">Hi<br>
<br>
I have been reading about the variation database schema here<br>
<br>
<a href="http://www.ensembl.org/info/docs/api/variation/variation_schema.html" target="_blank">http://www.ensembl.org/info/docs/api/variation/variation_schema.html</a><br>
<br>
but there is no information in this document about the
database tables that, based on their name, look like they
deal with variation sets, namely<br>
<br>
*variation_set<br>
*variation_set_structure<br>
*variation_set_variation<br>
<br>
These tables aren't on the pdf schema diagram either.<br>
<br>
I was hoping I could get an explanation of these tables.<br>
<br>
It looks as though variation_set is simply a variation set
with a name and description.<br>
<br>
It looks then as if variation_set_variation is a simple link
table to resolve the many to many relationship between a
variation and a variation set. But if that is the case I
don't know how you model the alleles in a variation set such
as the Watson set.<br>
<br>
For example, a particular variation might be triallelic
overall (e.g. in every individual looked at) but variations
in the Watson variation set can only be diploid at most. The
table that normally describes the alleles of a variation and
their frequencies is allele. The allele table links to a
sample id so you know which alleles occur for a variation in a
population and you know the frequency of a particular allele
in that population. The allele table doesn't seem to have
any link to a variation set.<br>
<br>
It looks like there should be a link somewhere between a
variation set and a population/sample so that the allele
table can still represent the alleles/frequencies of a
variation set<br>
<br>
Or I could be guessing this all wrong. Either way, I would
really benefit from some information about the schema that models
variation sets. And I think I need Ensembl's definition of
a variation set (the POD simply says "This is a class
representing a set of variations that are grouped by e.g.
study, method, quality measure etc.")<br>
<br>
Kind regards<br>
<br>
_______________________________________________<br>
Dev mailing list<br>
<a href="mailto:Dev@ensembl.org" target="_blank">Dev@ensembl.org</a><br>
<a href="http://lists.ensembl.org/mailman/listinfo/dev" target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<br>
</div></div></div>
</blockquote></div><br></div>