<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hi Hiep,</p>
<p>Thank you for bringing this issue to our attention. Please find
below my replies to your questions.<br>
</p>
<div class="moz-cite-prefix">On 22/06/2023 03:33, Hiep Dang wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAOBwYajLJd2Hbau+wCU8wuCdRkoZ7mLDGx1HmyL_v1QgC0DUjw@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">Dear Ensembl Team,
<div><br>
</div>
<div>I am doing a project that needs to convert stable IDs to
gene symbols. When I referenced the HGNC database, one symbol
corresponds to only one stable ID. However, in the Ensembl
database, one symbol can correspond to many stable IDs. I
worry that using the HGNC reference will drop out some gene
information. To clarify this problem, I investigate why some
stable IDs share their gene names. For the human genes, I
found that they will belong to 3 cases:</div>
<div><br>
</div>
<div>1. Stable IDs from non-primary assemblies: </div>
<div>- These stable IDs will not be in the released GTF file
(which contains chromosomes 1-22, X, Y, and MT). I can only
retrieve these IDs from BioMart. This confuses me because, for
a regular use case such as transcriptomic alignment and
quantification, the input file is only the GTF file. So when
should I consider using these IDs from other assemblies?</div>
</div>
</blockquote>
Please note that a GTF file with the gene annotation of the
alternate regions (patches and haplotypes) is also part of the
Ensembl release files:<br>
<a class="moz-txt-link-freetext"
href="http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz">http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz</a>
<p>The use of primary assembly sequences is normally sufficient for
transcriptomic alignment and quantification. The inclusion of
alternate regions could lead to multi-mapping issues, since they
are very similar to the corresponding sequences in the primary
assembly. However, some users may be interested in the annotation
on alternate regions. For instance, some gene annotations are
known to be inaccurate because of underlying errors in the primary
assembly, and the corrected annotations can be found in the fix
patches that carry the corrected genomic sequences. Similarly,
genetics researchers may be interested in how gene annotations
vary between haplotypes.</p>
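<p>For anyone who wants to restrict an analysis to the primary
assembly, a minimal Python sketch of such a filter is given below. It
assumes that the primary chromosomes are named 1-22, X, Y and MT in
the first GTF column. <code>PRIMARY_SEQS</code>,
<code>keep_primary</code> and the example records are made up for
illustration and are not part of any Ensembl tool.</p>
<pre>
```python
# Sketch only: keep GTF records located on primary assembly chromosomes.
# Assumption: primary chromosomes are named 1-22, X, Y and MT in column 1.
PRIMARY_SEQS = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def keep_primary(gtf_lines):
    """Yield comment lines and records whose seqname is a primary chromosome."""
    for line in gtf_lines:
        if line.startswith("#"):
            yield line  # header/comment lines are passed through unchanged
            continue
        seqname = line.split("\t", 1)[0]
        if seqname in PRIMARY_SEQS:
            yield line


# Invented placeholder records (not real annotation).
example = [
    "#!example header line\n",
    '1\tsource\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_A";\n',
    'CHR_EXAMPLE_PATCH\tsource\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_B";\n',
]
kept = list(keep_primary(example))
```
</pre>
<p>With the placeholder input above, <code>kept</code> holds the header
line and the record on chromosome 1, while the record on the made-up
patch sequence is dropped.</p>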
<blockquote type="cite"
cite="mid:CAOBwYajLJd2Hbau+wCU8wuCdRkoZ7mLDGx1HmyL_v1QgC0DUjw@mail.gmail.com">
<div dir="ltr">
<div><br>
</div>
<div>- After dropping the stable IDs from non-primary
assemblies, there are still about 1700 IDs that share the
external gene name. Considering only the genes with their
sources from HGNC or NCBI, they will fall into the following 2
cases.</div>
<div><br>
</div>
<div><br>
</div>
<div>2. Stable IDs with similar chromosomal positions:</div>
<div>- For example: ENSG00000291019 (chr5: 178764861 -
178818435) and ENSG00000250420 (chr5: 178767204 - 178797611).
They are both assigned to AACSP1 with a source from HGNC.
However, the HGNC database only references ENSG00000250420. </div>
<div><br>
</div>
<div>- Why do these two stable IDs exist at the same time? It
seems like they are essentially one gene. In the future
version, will one of them be retired?</div>
<div><br>
</div>
</div>
</blockquote>
<p>Most of these cases originate from a recent change in the
way that we annotate transcribed pseudogenes. The pseudogene
model, containing the homology with a coding gene, has been
dissociated from the transcriptional evidence, which is now
grouped in one or more lncRNA genes. An undesired side effect of
this change is that both genes still share the same gene name. The
pseudogene keeps the same stable ID, so it gets its name directly
from HGNC, whereas the lncRNA gene gets the same name via NCBI
based on its genomic overlap with the transcribed pseudogene
annotated by RefSeq. <br>
</p>
<p>Please note that this change only involves
"transcribed_unprocessed_pseudogene" genes in release 109, but it
will be extended to the remaining transcribed pseudogene biotypes
("transcribed_processed_pseudogene" and
"transcribed_unitary_pseudogene") in release 110.</p>
<p>In future releases, transcribed pseudogenes and lncRNAs will
still be separate genes with their own IDs. On the other hand,
HGNC are reluctant to assign the same gene symbol to more than one
Ensembl stable ID. To resolve these duplicate gene names, we
will simply remove the current gene names from the lncRNA genes
that overlap transcribed pseudogenes, without giving them a new
name. However, due to the timing of the Ensembl release cycle, this
will not happen before release 112.<br>
</p>
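<p>As a rough illustration of how these pseudogene/lncRNA name
clashes could be picked out of a gene table (for example a BioMart
export), here is a hedged Python sketch. The function name, the tuple
layout and the records are hypothetical, and the use of
<code>lncRNA</code> as the biotype string is an assumption. Only the
three transcribed pseudogene biotypes come from the text above.</p>
<pre>
```python
from collections import defaultdict

# Biotypes named in this thread; "lncRNA" as a biotype string is an assumption.
TRANSCRIBED_PSEUDOGENE_BIOTYPES = {
    "transcribed_unprocessed_pseudogene",
    "transcribed_processed_pseudogene",
    "transcribed_unitary_pseudogene",
}


def pseudogene_lncrna_name_clashes(genes):
    """Return {gene_name: [stable_ids]} for names shared by a transcribed
    pseudogene and a lncRNA gene. `genes` holds (stable_id, name, biotype)."""
    by_name = defaultdict(list)
    for stable_id, name, biotype in genes:
        if name:  # unnamed genes cannot clash
            by_name[name].append((stable_id, biotype))
    clashes = {}
    for name, members in by_name.items():
        biotypes = {biotype for _, biotype in members}
        has_pseudogene = not biotypes.isdisjoint(TRANSCRIBED_PSEUDOGENE_BIOTYPES)
        if has_pseudogene and "lncRNA" in biotypes:
            clashes[name] = sorted(stable_id for stable_id, _ in members)
    return clashes


# Invented placeholder records (not real genes).
example_genes = [
    ("GENE_ID_1", "NAME_A", "transcribed_unprocessed_pseudogene"),
    ("GENE_ID_2", "NAME_A", "lncRNA"),
    ("GENE_ID_3", "NAME_B", "protein_coding"),
    ("GENE_ID_4", "", "lncRNA"),
]
clashes = pseudogene_lncrna_name_clashes(example_genes)
```
</pre>
<p>Here only <code>NAME_A</code> is reported, because it is the only
name carried by both a transcribed pseudogene and a lncRNA gene.</p>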
Within the remaining set of duplicates, a few genes such as SPATA13
and SCARNA4 could certainly be merged to remove the duplication and
we will look into fixing the annotation.
<blockquote type="cite"
cite="mid:CAOBwYajLJd2Hbau+wCU8wuCdRkoZ7mLDGx1HmyL_v1QgC0DUjw@mail.gmail.com">
<div dir="ltr">
<div> </div>
<div><br>
</div>
<div>3. Stable IDs with different chromosomal positions:</div>
<div>- For example: ENSG00000240356 (chr2: 113610502 - 113627090
- HGNC referenced) and ENSG00000291064 (chr22: 50756948 -
50801309 - NCBI referenced: <a
href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=118433"
rel="external" target="_blank"
moz-do-not-send="true">118433</a>). They are both assigned
to RPL23AP7. The former is currently in the HGNC database.
When I go to the NCBI website, the current position is chr2:
113611239-113627138, which is more similar to ENSG00000240356. </div>
<div><br>
</div>
<div>- Will these NCBI-referenced genes be fixed in future
releases?</div>
</div>
</blockquote>
<p>These cases seem to have been caused by an inaccurate name
assignment to one of the genes, owing to its high sequence
similarity to other genes in the same family. For instance, gene
ENSG00000291064 should have been called RPL23AP82 instead of
RPL23AP7. The pipeline seems to have taken into account the
sequence identity with the NCBI genes of this family but not the
genomic overlap with the NCBI gene RPL23AP82. We still need more
time to investigate why this happened and will try to fix it in
future releases.<br>
</p>
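<p>To make the genomic overlap criterion concrete, here is a small
Python sketch that compares the coordinates quoted in this thread. It
is only an illustration, not the logic of the Ensembl naming pipeline.
The function name is hypothetical, and coordinates are assumed to be
1-based and inclusive.</p>
<pre>
```python
def overlap_length(region_a, region_b):
    """Return the number of bases shared by two (chrom, start, end) regions.

    Coordinates are assumed to be 1-based and inclusive.
    """
    chrom_a, start_a, end_a = region_a
    chrom_b, start_b, end_b = region_b
    if chrom_a != chrom_b:
        return 0  # different chromosomes never overlap
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


# Coordinates as quoted in this thread.
ncbi_rpl23ap7 = ("2", 113611239, 113627138)
ensg00000240356 = ("2", 113610502, 113627090)
ensg00000291064 = ("22", 50756948, 50801309)

shared_with_chr2_gene = overlap_length(ncbi_rpl23ap7, ensg00000240356)
shared_with_chr22_gene = overlap_length(ncbi_rpl23ap7, ensg00000291064)
```
</pre>
<p>The NCBI location of RPL23AP7 shares bases only with
ENSG00000240356. It has no overlap with ENSG00000291064, which is
consistent with the name having been assigned to the chromosome 22
gene on sequence identity alone.</p>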
<p>Just a heads-up that there will be another source of duplicate
gene names from release 110 onwards, as chromosome Y PAR genes will
be annotated separately, i.e. they will have their own stable IDs
but keep the same gene names as their chromosome X counterparts.</p>
<blockquote type="cite"
cite="mid:CAOBwYajLJd2Hbau+wCU8wuCdRkoZ7mLDGx1HmyL_v1QgC0DUjw@mail.gmail.com">
<div dir="ltr">
<div><br>
</div>
<div>I have attached the duplicated stable IDs for case 2 and
case 3 that I retrieved from BioMart release 109.</div>
<div>Thank you and I look forward to your response.</div>
<div><br>
</div>
<div>Best, </div>
<font color="#888888">
<div>Hiep</div>
</font></div>
<br>
</blockquote>
<p>Please let me know if you have any further questions about this.</p>
<p>Thanks,<br>
Jose<br>
</p>
<p><br>
</p>
<blockquote type="cite"
cite="mid:CAOBwYajLJd2Hbau+wCU8wuCdRkoZ7mLDGx1HmyL_v1QgC0DUjw@mail.gmail.com">
<fieldset class="moz-mime-attachment-header"></fieldset>
<pre class="moz-quote-pre" wrap="">_______________________________________________
Dev mailing list <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:Dev@ensembl.org">Dev@ensembl.org</a>
Posting guidelines and subscribe/unsubscribe info: <a class="moz-txt-link-freetext" href="https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org">https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org</a>
Ensembl Blog: <a class="moz-txt-link-freetext" href="http://www.ensembl.info/">http://www.ensembl.info/</a>
</pre>
</blockquote>
<pre class="moz-signature" cols="72">--
Dr Jose M. Gonzalez
GENCODE Bioinformatician (Genome Interpretation Team)
European Bioinformatics Institute (EMBL-EBI)
Wellcome Genome Campus
Hinxton, CB10 1SD, UK</pre>
</body>
</html>