<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Hi Hiep,</p>

    Thank you for bringing this issue to our attention. Please find

    below my replies to your questions.

    <p></p>

    <div class="moz-cite-prefix">On 22/06/2023 03:33, Hiep Dang wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAOBwYajLJd2Hbau+wCU8wuCdRkoZ7mLDGx1HmyL_v1QgC0DUjw@mail.gmail.com">

      <meta http-equiv="content-type" content="text/html; charset=UTF-8">

      <div dir="ltr">Dear Ensembl Team,

        <div><br>

        </div>

        <div>I am doing a project that needs to convert stable IDs to

          gene symbols. When I referenced the HGNC database, one symbol

          corresponds to only one stable ID. However, in the Ensembl

          database, one symbol can correspond to many stable IDs. I

          worry that using the HGNC reference will drop out some gene

          information. To clarify this problem, I investigate why some

          stable IDs share their gene names. For the human genes, I

          found that they will belong to 3 cases:</div>

        <div><br>

        </div>

        <div>1. Stable IDs from non-primary assemblies: </div>

        <div>- These stable IDs will not be in the released GTF file

          (which contains chromosomes 1-22, X, Y, and MT). I can only

          retrieve these IDs from BioMart. This confuses me because, for

          a regular use case such as transcriptomic alignment and

          quantification, the input file is only the GTF file. So when

          should I consider using these IDs from other assemblies?</div>

      </div>

    </blockquote>

    Please note that a GTF file with the gene annotation of the

    alternate regions (patches and haplotypes) is also part of the

    Ensembl release files:<br>

    <a class="moz-txt-link-freetext"

href="http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz">http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz</a>

    <p>The use of primary assembly sequences is normally sufficient for

      transcriptomic alignment and quantification. The inclusion of

      alternate regions could lead to multi-mapping issues, since they

      are very similar to the corresponding sequences in the primary

      assembly. However, some users may be interested in the annotation

      on alternate regions. For instance, some gene annotations are

      known to be inaccurate because of underlying errors in the primary

      assembly, and the corrected annotations can be found in the fix
      patches that carry the corrected genomic sequences. Similarly,
      genetics researchers may be interested in how gene annotations
      vary between haplotypes.</p>
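    <p>As a rough illustration of how shared names can be found directly
      in a GTF file, the sketch below groups gene stable IDs by gene
      name. It assumes the standard nine-column, tab-separated GTF
      layout with <code>gene_id</code> and <code>gene_name</code>
      attributes. The function name <code>shared_gene_names</code> and
      the sample lines are made up for this example (the source column
      and the third gene are placeholders); only the AACSP1 IDs and
      coordinates come from this thread.</p>

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def shared_gene_names(gtf_lines):
    """Map each gene name to its stable IDs, keeping names used by more than one gene."""
    ids_by_name = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # only gene features are needed
        attrs = dict(ATTR_RE.findall(fields[8]))
        name = attrs.get("gene_name")
        if name and "gene_id" in attrs:  # some genes have no name at all
            ids_by_name[name].add(attrs["gene_id"])
    return {n: sorted(ids) for n, ids in ids_by_name.items() if len(ids) > 1}


# Illustrative input only: placeholder source, strand and third gene.
example = [
    '5\texample\tgene\t178764861\t178818435\t.\t.\t.\tgene_id "ENSG00000291019"; gene_name "AACSP1";',
    '5\texample\tgene\t178767204\t178797611\t.\t.\t.\tgene_id "ENSG00000250420"; gene_name "AACSP1";',
    '5\texample\tgene\t1\t100\t.\t.\t.\tgene_id "GENE_WITHOUT_NAME";',
]
print(shared_gene_names(example))
```

    <p>For a real release file, the lines could be read with
      <code>gzip.open(path, "rt")</code> and passed to the same
      function.</p>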

    <blockquote type="cite"

cite="mid:CAOBwYajLJd2Hbau+wCU8wuCdRkoZ7mLDGx1HmyL_v1QgC0DUjw@mail.gmail.com">

      <div dir="ltr">

        <div><br>

        </div>

        <div>- After dropping the stable IDs from non-primary

          assemblies, there are still about 1700 IDs that share the

          external gene name. Considering only the genes with their

          sources from HGNC or NCBI, they will fall into the following 2

          cases.</div>

        <div><br>

        </div>

        <div><br>

        </div>

        <div>2. Stable IDs with similar chromosomal positions:</div>

        <div>- For example: ENSG00000291019 (chr5: 178764861 -

          178818435) and ENSG00000250420 (chr5: 178767204 - 178797611).

          They are both assigned to AACSP1 with a source from HGNC.

          However, the HGNC database only references ENSG00000250420. </div>

        <div><br>

        </div>

        <div>- Why do these two stable IDs exist at the same time? It

          seems like they are essentially one gene. In the future

          version, will one of them be retired?</div>

        <div><br>

        </div>

      </div>

    </blockquote>

    <p>Most of these cases have their origin in a recent change in the

      way that we annotate transcribed pseudogenes. The pseudogene

      model, containing the homology with a coding gene, has been

      dissociated from the transcriptional evidence, which is now

      grouped in one or more lncRNA genes. An undesired side effect of

      this change is that both genes still share the same gene name. The

      pseudogene keeps the same stable ID, so it gets its name directly

      from HGNC, whereas the lncRNA gene gets the same name via NCBI

      based on its genomic overlap with the transcribed pseudogene

      annotated by RefSeq. <br>

    </p>

    <p>Please note that this change only involves

      "transcribed_unprocessed_pseudogene" genes in release 109, but it

      will be extended to the remaining transcribed pseudogene biotypes

      ("transcribed_processed_pseudogene" and

      "transcribed_unitary_pseudogene") in release 110.</p>

    <p>In future releases, transcribed pseudogenes and lncRNAs will

      still be separate genes with their own IDs. On the other hand,

      HGNC are reluctant to assign the same gene symbol to more than one

      Ensembl stable ID. To resolve these duplicate gene names, we
      will simply remove the current gene names from the lncRNA genes
      that overlap transcribed pseudogenes, without giving them a new
      name. However, because of the timing of the Ensembl release
      cycle, this will not happen before release 112.<br>

    </p>
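    <p>Until the names are removed, one possible way to recognise this
      pattern in a list of duplicates is to look at the biotypes of the
      genes that share a name. The sketch below is only an illustration:
      the pseudogene biotype strings are the ones mentioned above, while
      the <code>lncRNA</code> label and the function name
      <code>is_pseudogene_lncrna_group</code> are assumptions made for
      the example.</p>

```python
TRANSCRIBED_PSEUDOGENE_BIOTYPES = {
    "transcribed_unprocessed_pseudogene",  # affected in release 109
    "transcribed_processed_pseudogene",    # to follow in release 110
    "transcribed_unitary_pseudogene",      # to follow in release 110
}
LNCRNA_BIOTYPE = "lncRNA"  # assumed biotype label for lncRNA genes


def is_pseudogene_lncrna_group(biotypes):
    """True if the genes sharing a name are transcribed pseudogene(s) plus lncRNA gene(s) only."""
    pseudogenes = [b for b in biotypes if b in TRANSCRIBED_PSEUDOGENE_BIOTYPES]
    lncrnas = [b for b in biotypes if b == LNCRNA_BIOTYPE]
    only_these = len(pseudogenes) + len(lncrnas) == len(biotypes)
    return bool(pseudogenes) and bool(lncrnas) and only_these


print(is_pseudogene_lncrna_group(["transcribed_unprocessed_pseudogene", "lncRNA"]))
print(is_pseudogene_lncrna_group(["protein_coding", "lncRNA"]))
```

    <p>Groups that do not match this pattern would still need to be
      checked individually.</p>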

    <p> Within the remaining set of duplicates, a few genes such as

      SPATA13 and SCARNA4 could certainly be merged to remove the

      duplication and we will look into fixing the annotation. <br>

    </p>

    <p><br>

    </p>

    <blockquote type="cite"

cite="mid:CAOBwYajLJd2Hbau+wCU8wuCdRkoZ7mLDGx1HmyL_v1QgC0DUjw@mail.gmail.com">

      <div dir="ltr">

        <div>3. Stable IDs with different chromosomal positions:</div>

        <div>- For example: ENSG00000240356 (chr2: 113610502 - 113627090
          - HGNC referenced) and ENSG00000291064 (chr22: 50756948 -
          50801309 - NCBI referenced: <a
href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&amp;cmd=Retrieve&amp;dopt=Graphics&amp;list_uids=118433"
            rel="external" target="_blank"
            moz-do-not-send="true">118433</a>). They are both assigned
          to RPL23AP7. The former is currently in the HGNC database.
          When I go to the NCBI website, the current position is chr2:
          113611239-113627138, which is more similar to
          ENSG00000240356. </div>

        <div><br>

        </div>

        <div>- Will these NCBI-referenced genes be fixed in future

          releases?</div>

      </div>

    </blockquote>

    These cases seem to have been caused by an inaccurate name being
    assigned to one of the genes because of its high sequence
    similarity to other genes in the same family. For instance, gene

    ENSG00000291064 should have been called RPL23AP82 instead of

    RPL23AP7. The pipeline seems to have taken into account the sequence

    identity with the NCBI genes of this family but not the genomic

    overlap with the NCBI gene RPL23AP82. We still need more time to

    investigate why this happened and will try to fix it in future

    releases.<br>

    <p>Just a heads-up that there will be another source of duplicate
      gene names from release 110 onwards, as chromosome Y PAR genes
      are now annotated separately, i.e. they will have their own
      stable IDs but keep the same gene names as their chromosome X
      counterparts.</p>
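    <p>To separate the different kinds of duplicates discussed in this
      thread, a simple check on the gene locations may already help.
      The sketch below is a minimal example under stated assumptions:
      the <code>Locus</code> type, the <code>classify_pair</code>
      function and its labels are hypothetical, and the X/Y label is
      only a hint because the code does not check whether the genes
      actually lie within a PAR. The AACSP1 and RPL23AP7 coordinates are
      the ones quoted above; the X/Y coordinates are invented.</p>

```python
from typing import NamedTuple


class Locus(NamedTuple):
    chrom: str
    start: int
    end: int


def classify_pair(a, b):
    """Give a rough label for two genes that share the same name."""
    if a.chrom == b.chrom:
        # Closed intervals overlap when the later start is not past the earlier end.
        overlap = min(a.end, b.end) >= max(a.start, b.start)
        return "overlapping" if overlap else "same chromosome, no overlap"
    if {a.chrom, b.chrom} == {"X", "Y"}:
        return "X/Y pair (possible PAR gene)"
    return "different chromosomes"


# ENSG00000291019 vs ENSG00000250420 (AACSP1)
print(classify_pair(Locus("5", 178764861, 178818435), Locus("5", 178767204, 178797611)))
# ENSG00000240356 vs ENSG00000291064 (RPL23AP7)
print(classify_pair(Locus("2", 113610502, 113627090), Locus("22", 50756948, 50801309)))
# Invented coordinates, for illustration only
print(classify_pair(Locus("X", 1000, 2000), Locus("Y", 1000, 2000)))
```

    <p>Overlapping pairs would then be candidates for the pseudogene and
      lncRNA case, and pairs on different chromosomes for the name
      assignment issue.</p>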

    <blockquote type="cite"

cite="mid:CAOBwYajLJd2Hbau+wCU8wuCdRkoZ7mLDGx1HmyL_v1QgC0DUjw@mail.gmail.com">

      <div dir="ltr">

        <div><br>

        </div>

        <div>I have attached the duplicated stable IDs for case 2 and

          case 3 that I retrieved from BioMart release 109.</div>

        <div>Thank you and I look forward to your response.</div>

        <div><br>

        </div>

        <div>Best, </div>

        <font color="#888888">

          <div>Hiep</div>

        </font></div>

      <br>

    </blockquote>

    <p>Please let me know if you have any further questions about this.</p>

    Thanks,<br>

    Jose

    <p><br>

    </p>


    <pre class="moz-signature" cols="72">-- 

Dr Jose M. Gonzalez

GENCODE Bioinformatician (Genome Interpretation Team)

European Bioinformatics Institute (EMBL-EBI)

Wellcome Genome Campus

Hinxton, CB10 1SD, UK</pre>

  </body>

</html>