<div dir="ltr">Dear Jose, <div><br></div><div>Thank you for your reply. Your clear explanation has totally resolved my concerns.</div><div><br></div><div>Best regards,</div><div>Hiep</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jul 15, 2023 at 2:18 AM Jose Gonzalez <<a href="mailto:jmgonzalez@ebi.ac.uk">jmgonzalez@ebi.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi Hiep,</p>
<p>The version change from 6 to 1 was not intentional. As you would
expect, versions should always increase and older versions must
not be reused, so this looks like an anomalous behaviour of our
stable id mapping. <br>
</p>
<p>The annotation of this gene changed in an unusual way between
releases 107 and 108. To understand this, two aspects of our
internal workflow must be considered:</p>
<p>1) The human gene annotation is being manually edited constantly
in our internal database, from which snapshots or freezes are
taken at regular intervals for the Ensembl releases.</p>
<p>2) Our internal database generates stable ids that must be
honoured by the stable id mapping that assigns identifiers and
versions in the Ensembl release. Versions, however, are assigned
by the stable id mapping process by comparison with the previous
release. It is done this way because the human gene annotation can
undergo multiple changes between releases (hence multiple version
increments in our internal database) but only single increments
are expected between Ensembl releases.<br>
</p>
      <p>The gene ENSG00000250765 was a lncRNA gene until release 107.
      Then, our manual annotation added a pseudogene transcript, so the
      gene became a transcribed unprocessed pseudogene that also
      included the lncRNA transcript. Before the freeze for release 108,
      the lncRNA was separated from the pseudogene (as explained in my
      previous reply) and, to follow the standard procedure, the
      pseudogene kept the original gene id (ENSG00000250765), perhaps
      against common sense. When the stable id mapping process for
      release 108 compared this annotation with 107, it found that
      ENSG00000250765 was now a different gene that shared no transcripts
      with the release 107 gene, so it determined that this was a new gene
      and gave it version 1. However, it was forced to keep the gene id
      that came from our internal database. <br>
</p>
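      <p>A minimal sketch of the behaviour described above. It is a toy
      model, not taken from the Ensembl stable id mapping code: the
      <code>GeneRecord</code> class, both functions and the transcript
      labels are hypothetical, and the rules are only an assumption about
      how a comparison with the previous release could reset a version.</p>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """Hypothetical record of one gene in one release."""
    stable_id: str
    version: int
    transcripts: frozenset


def assign_version(previous_release, stable_id, transcripts):
    """Toy rule: choose the version by comparing with the previous release only."""
    old = previous_release.get(stable_id)
    if old is None:
        return 1  # identifier not present in the previous release
    if old.transcripts == transcripts:
        return old.version  # annotation unchanged, keep the version
    if old.transcripts.isdisjoint(transcripts):
        return 1  # no shared transcripts: treated as a new gene (the anomaly)
    return old.version + 1  # changed gene: single increment


def assign_version_monotonic(previous_release, stable_id, transcripts):
    """One possible guard: a reused identifier never gets a lower version."""
    old = previous_release.get(stable_id)
    if old is None:
        return 1
    if old.transcripts == transcripts:
        return old.version
    return old.version + 1


# Release 107 in this toy model: the gene holds only a lncRNA transcript.
release_107 = {
    "ENSG00000250765": GeneRecord("ENSG00000250765", 6, frozenset({"lncRNA_tx"}))
}
# Release 108: same identifier, but only the pseudogene transcript remains.
release_108_transcripts = frozenset({"pseudogene_tx"})

print(assign_version(release_107, "ENSG00000250765", release_108_transcripts))
print(assign_version_monotonic(release_107, "ENSG00000250765", release_108_transcripts))
```
</pre>
      <p>In this toy model the first rule returns version 1 for the
      reused identifier, while the guarded rule returns 7. Whether a
      guard of this kind suits the real pipeline is an open question.</p>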
<p>This looks like a possible bug in our code that wasn't able to
handle an unexpected situation. We will need to investigate this
further. <br>
</p>
<p>Thank you for bringing this to our attention.</p>
<p>Jose<br>
</p>
<p><br>
</p>
<div>On 13/07/2023 03:42, Hiep Dang wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hi Jose,
<div><br>
</div>
<div>Thank you for your response, it helps me better understand
            Ensembl's annotation pipeline. I look forward to the release
of Ensembl 110.</div>
<div><br>
</div>
<div>There is another unrelated question: </div>
          <div>- As I understand it, the stable ID version is
            expected to always increase, and older versions are
            retired. However, I came across a case involving
            ENSG00000250765. In release 108, ENSG00000250765.1 was mapped
            to ENSG00000250765.6 from release 85. Is this intentional, or
            is there something wrong with it?</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Hiep</div>
<div><br>
</div>
<div><br>
</div>
<img src="cid:189626aabf9cb971f161" alt="image.png" width="562" height="254"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jun 29, 2023 at
10:27 PM Jose Gonzalez <<a href="mailto:jmgonzalez@ebi.ac.uk" target="_blank">jmgonzalez@ebi.ac.uk</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi Hiep,</p>
Thank you for bringing this issue to our attention. Please
find below my replies to your questions.
<div>On 22/06/2023 03:33, Hiep Dang wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Dear Ensembl Team,
<div><br>
</div>
                <div>I am working on a project that needs to convert stable
                  IDs to gene symbols. When I checked the HGNC
                  database, one symbol corresponded to only one stable
                  ID. However, in the Ensembl database, one symbol can
                  correspond to many stable IDs. I worry that using the
                  HGNC reference will lose some gene information. To
                  clarify this problem, I investigated why some stable
                  IDs share their gene names. For the human genes, I
                  found that they fall into three cases:</div>
<div><br>
</div>
<div>1. Stable IDs from non-primary assemblies: </div>
<div>- These stable IDs will not be in the released GTF
file (which contains chromosomes 1-22, X, Y, and MT).
I can only retrieve these IDs from BioMart. This
confuses me because, for a regular use case such as
transcriptomic alignment and quantification, the input
file is only the GTF file. So when should I consider
using these IDs from other assemblies?</div>
</div>
</blockquote>
Please note that a GTF file with the gene annotation of the
alternate regions (patches and haplotypes) is also part of
the Ensembl release files:<br>
<a href="http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz" target="_blank">http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz</a>
<p>The use of primary assembly sequences is normally
sufficient for transcriptomic alignment and
quantification. The inclusion of alternate regions could
lead to multi-mapping issues, since they are very similar
to the corresponding sequences in the primary assembly.
However, some users may be interested in the annotation on
alternate regions. For instance, some gene annotations are
known to be inaccurate because of underlying errors in the
primary assembly, and the corrected annotations can be
found in fix patches that have the corrected genomic
sequences. Or genetics researchers may be interested in
the variation of gene annotations in different haplotypes.</p>
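              <p>A minimal sketch of restricting a GTF to the primary
              assembly before alignment and quantification. It assumes the
              first tab-separated column holds the sequence name and that
              primary chromosomes are named 1-22, X, Y and MT without a
              "chr" prefix. The function name and the sample records are
              made up for illustration.</p>
<pre>
```python
# Sequence names treated as the primary assembly in this sketch (an assumption).
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def primary_assembly_lines(lines):
    """Yield header lines and records whose sequence name is a primary chromosome."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep comment and header lines untouched
            continue
        seqname = line.split("\t", 1)[0]
        if seqname in PRIMARY:
            yield line


# Made-up records: one header, two primary records, one alternate-region record.
sample = [
    "#!genome-build GRCh38",
    "5\tsource\tgene\t100\t200\t.\t+\t.\tgene_id \"GENE_A\";",
    "MT\tsource\tgene\t10\t90\t.\t+\t.\tgene_id \"GENE_B\";",
    "CHR_EXAMPLE_PATCH\tsource\tgene\t100\t200\t.\t+\t.\tgene_id \"GENE_C\";",
]

kept = list(primary_assembly_lines(sample))
print(len(kept))
```
</pre>
              <p>Records on patches and haplotypes are dropped here, which
              avoids the multi-mapping issue mentioned above at the cost of
              losing the corrected or alternative annotations.</p>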
<blockquote type="cite">
<div dir="ltr">
<div><br>
</div>
<div>- After dropping the stable IDs from non-primary
assemblies, there are still about 1700 IDs that share
the external gene name. Considering only the genes
with their sources from HGNC or NCBI, they will fall
into the following 2 cases.</div>
<div><br>
</div>
<div><br>
</div>
<div>2. Stable IDs with similar chromosomal positions:</div>
<div>- For example: ENSG00000291019 (chr5: 178764861 -
178818435) and ENSG00000250420 (chr5: 178767204 -
178797611). They are both assigned to AACSP1 with a
source from HGNC. However, the HGNC database only
references ENSG00000250420. </div>
<div><br>
</div>
                <div>- Why do these two stable IDs exist at the same
                  time? They seem to be essentially one gene. Will one
                  of them be retired in a future release?</div>
<div><br>
</div>
</div>
</blockquote>
<p>Most of these cases have their origin in a recent change
in the way that we annotate transcribed pseudogenes. The
pseudogene model, containing the homology with a coding
gene, has been dissociated from the transcriptional
evidence, which is now grouped in one or more lncRNA
genes. An undesired side effect of this change is that
both genes still share the same gene name. The pseudogene
keeps the same stable ID, so it gets its name directly
from HGNC, whereas the lncRNA gene gets the same name via
NCBI based on its genomic overlap with the transcribed
pseudogene annotated by RefSeq. <br>
</p>
<p>Please note that this change only involves
"transcribed_unprocessed_pseudogene" genes in release 109,
but it will be extended to the remaining transcribed
pseudogene biotypes ("transcribed_processed_pseudogene"
and "transcribed_unitary_pseudogene") in release 110.</p>
<p>In future releases, transcribed pseudogenes and lncRNAs
will still be separate genes with their own IDs. On the
other hand, HGNC are reluctant to assign the same gene
symbol to more than one Ensembl stable ID. To fix these
gene name duplicate issues, we will simply remove the
current gene names from the lncRNA genes that overlap
transcribed pseudogenes without giving them a new name.
However, due to the Ensembl release cycle timing, this
will not happen before release 112.<br>
</p>
<p> Within the remaining set of duplicates, a few genes such
as SPATA13 and SCARNA4 could certainly be merged to remove
the duplication and we will look into fixing the
annotation. <br>
</p>
<p><br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div>3. Stable IDs with different chromosomal positions:</div>
<div>- For example: ENSG00000240356 (chr2: 113610502 -
113627090 - HGNC referenced) and ENSG00000291064
(chr22: 50756948 - 50801309 - NCBI referenced<span>:</span><a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=118433" rel="external" target="_blank">118433</a>). They are both
assigned to RPL23AP7. The former is currently in the
HGNC database. When I go to the NCBI website, the
                  current position is chr2: 113611239-113627138,
                  which is more similar to ENSG00000240356. </div>
<div><br>
</div>
<div>- Will these NCBI-referenced genes be fixed in
future releases?</div>
</div>
</blockquote>
These cases seem to have been caused by the inaccurate name
assignment to one of the genes because of their high
sequence similarity to other genes in the same family. For
instance, gene ENSG00000291064 should have been called
RPL23AP82 instead of RPL23AP7. The pipeline seems to have
taken into account the sequence identity with the NCBI genes
of this family but not the genomic overlap with the NCBI
gene RPL23AP82. We still need more time to investigate why
this happened and will try to fix it in future releases.<br>
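              <p>A minimal sketch for sorting duplicated gene names into
              the two situations discussed above, using the coordinates
              quoted in this thread. The function names and the tuple
              layout are hypothetical, and strand is ignored.</p>
<pre>
```python
from collections import defaultdict
from itertools import combinations

# (stable_id, gene_name, chromosome, start, end), as quoted in this thread.
GENES = [
    ("ENSG00000291019", "AACSP1", "5", 178764861, 178818435),
    ("ENSG00000250420", "AACSP1", "5", 178767204, 178797611),
    ("ENSG00000240356", "RPL23AP7", "2", 113610502, 113627090),
    ("ENSG00000291064", "RPL23AP7", "22", 50756948, 50801309),
]


def overlaps(a, b):
    """True when two genes are on the same chromosome and their ranges intersect."""
    return a[2] == b[2] and min(a[4], b[4]) >= max(a[3], b[3])


def classify_duplicates(genes):
    """Map each shared gene name to 'overlapping' or 'separate loci'."""
    by_name = defaultdict(list)
    for gene in genes:
        by_name[gene[1]].append(gene)
    result = {}
    for name, group in by_name.items():
        if len(group) == 1:
            continue  # name used by a single stable ID, nothing to report
        if any(overlaps(a, b) for a, b in combinations(group, 2)):
            result[name] = "overlapping"
        else:
            result[name] = "separate loci"
    return result


print(classify_duplicates(GENES))
```
</pre>
              <p>With these four genes, AACSP1 is reported as overlapping
              (case 2) and RPL23AP7 as separate loci (case 3).</p>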
              <p>Just a heads-up that there will be another source of
                duplicate gene names from release 110 onwards, as chromosome Y PAR
                genes are now annotated separately, i.e. they will have
                their own stable IDs but keep the same gene names as their
                chromosome X counterparts.</p>
<blockquote type="cite">
<div dir="ltr">
<div><br>
</div>
<div>I have attached the duplicated stable IDs for case
2 and case 3 that I retrieved from BioMart release
109.</div>
<div>Thank you and I look forward to your response.</div>
<div><br>
</div>
<div>Best, </div>
<font color="#888888">
<div>Hiep</div>
</font></div>
<br>
</blockquote>
<p>Please let me know if you have any further questions
about this.</p>
Thanks,<br>
Jose
<p><br>
</p>
<blockquote type="cite">
<fieldset></fieldset>
<pre>_______________________________________________
Dev mailing list <a href="mailto:Dev@ensembl.org" target="_blank">Dev@ensembl.org</a>
Posting guidelines and subscribe/unsubscribe info: <a href="https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org" target="_blank">https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org</a>
Ensembl Blog: <a href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a>
</pre>
</blockquote>
<pre cols="72">--
Dr Jose M. Gonzalez
GENCODE Bioinformatician (Genome Interpretation Team)
European Bioinformatics Institute (EMBL-EBI)
Wellcome Genome Campus
Hinxton, CB10 1SD, UK</pre>
</div>
</blockquote>
</div>
<br>
</blockquote>
<pre cols="72"></pre>
</div>
</blockquote></div>