<div dir="ltr">Dear Jose, <div><br></div><div>Thank you for your reply. Your clear explanation has totally resolved my concerns.</div><div><br></div><div>Best regards,</div><div>Hiep</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jul 15, 2023 at 2:18 AM Jose Gonzalez &lt;<a href="mailto:jmgonzalez@ebi.ac.uk">jmgonzalez@ebi.ac.uk</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  <div>

    <p>Hi Hiep,</p>

    <p>The version change from 6 to 1 was not intentional. As you would

      expect, versions should always increase and older versions must

      not be reused, so this looks like anomalous behaviour of our

      stable id mapping. <br>

    </p>

    <p>The annotation of this gene changed in an unusual way between

      releases 107 and 108. To understand this, two aspects of our

      internal workflow must be considered:</p>

    <p>1) The human gene annotation is being manually edited constantly

      in our internal database, from which snapshots or freezes are

      taken at regular intervals for the Ensembl releases.</p>

    <p>2) Our internal database generates stable ids that must be

      honoured by the stable id mapping that assigns identifiers and

      versions in the Ensembl release. Versions, however, are assigned

      by the stable id mapping process by comparison with the previous

      release. It is done this way because the human gene annotation can

      undergo multiple changes between releases (hence multiple version

      increments in our internal database) but only single increments

      are expected between Ensembl releases.<br>

    </p>

    <p>The gene ENSG00000250765 was a lncRNA gene until release 107.

      Then, our manual annotation added a pseudogene transcript, so the

      gene became a transcribed unprocessed pseudogene that also

      included the lncRNA transcript. Before the freeze for release 108,

      the lncRNA was separated from the pseudogene (as explained in my

      previous reply) and, to follow the standard procedure, the

      pseudogene kept the original gene id (ENSG00000250765), perhaps

      against common sense. When the stable id mapping process for

      release 108 compared this annotation with 107, it found that

      ENSG00000250765 was now a different gene with no common transcripts

      in 107, so it determined that this was a new gene and gave it

      version 1. However, it was forced to keep the gene id that came

      from our internal database. <br>

    </p>
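    <p>As a rough illustration of the comparison described above, here
      is a minimal Python sketch. The Gene class, the assign_version
      function, the transcript labels and the exact rules are
      assumptions made for illustration only; this is not the actual
      stable id mapping code, and the version 6 in the example data is
      taken from the question rather than checked against release 107.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    stable_id: str          # id assigned by the internal annotation database
    transcripts: frozenset  # transcript labels in this release (placeholders)
    version: int = 0        # filled in by the mapping step


def assign_version(current, previous_release):
    """Pick a version for `current` by comparison with the previous release.

    Hypothetical rules, inferred from the description in this thread:
    - id absent from the previous release: version 1
    - same id but no shared transcripts: treated as a new gene, version 1
    - same id, transcript set changed: previous version + 1
    - same id, identical transcript set: version unchanged
    """
    old = previous_release.get(current.stable_id)
    if old is None:
        return 1
    if not (current.transcripts & old.transcripts):
        return 1  # the anomalous branch: the id is kept but the version restarts
    if current.transcripts != old.transcripts:
        return old.version + 1
    return old.version


# Illustrative data only: one lncRNA transcript in 107, one pseudogene transcript in 108.
release_107 = {
    "ENSG00000250765": Gene("ENSG00000250765", frozenset({"lncRNA_tx"}), version=6),
}
release_108_gene = Gene("ENSG00000250765", frozenset({"pseudogene_tx"}))
new_version = assign_version(release_108_gene, release_107)
print(new_version)
```

    <p>In this sketch the id is never questioned because it comes from
      the internal database, so a gene with no transcripts in common
      with its predecessor ends up with an old id and a restarted
      version.</p>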

    <p>This looks like a possible bug in our code that wasn't able to

      handle an unexpected situation. We will need to investigate this

      further. <br>

    </p>

    <p>Thank you for bringing this to our attention.</p>

    <p>Jose<br>

    </p>

    <p><br>

    </p>

    <div>On 13/07/2023 03:42, Hiep Dang wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">Hi Jose,

        <div><br>

        </div>

        <div>Thank you for your response; it helps me better understand

          Ensembl's annotation pipeline. I look forward to the release

          of Ensembl 110.</div>

        <div><br>

        </div>

        <div>I have another, unrelated question: </div>

        <div>- As per my understanding, the stable ID version is

          expected to always increase, and the older version will be

          retired. However, I came across a case involving

          ENSG00000250765. In release 108, ENSG00000250765.1 was mapped

          to ENSG00000250765.6 from release 85. Is this intentional, or

          is something wrong?</div>

        <div><br>

        </div>

        <div>Thanks,</div>

        <div>Hiep</div>

        <div><br>

        </div>

        <div><br>

        </div>

        <img src="cid:189626aabf9cb971f161" alt="image.png" width="562" height="254"><br>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Thu, Jun 29, 2023 at

          10:27 PM Jose Gonzalez &lt;<a href="mailto:jmgonzalez@ebi.ac.uk" target="_blank">jmgonzalez@ebi.ac.uk</a>&gt;

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

          <div>

            <p>Hi Hiep,</p>

            Thank you for bringing this issue to our attention. Please

            find below my replies to your questions.

            <div>On 22/06/2023 03:33, Hiep Dang wrote:<br>

            </div>

            <blockquote type="cite">

              <div dir="ltr">Dear Ensembl Team,

                <div><br>

                </div>

                <div>I am working on a project that needs to convert stable

                  IDs to gene symbols. In the HGNC database, one symbol

                  corresponds to only one stable ID. However, in the

                  Ensembl database, one symbol can correspond to many

                  stable IDs. I worry that using the HGNC reference will

                  drop some gene information. To clarify this problem, I

                  investigated why some stable IDs share their gene names.

                  For the human genes, I found that they fall into 3

                  cases:</div>

                <div><br>

                </div>

                <div>1. Stable IDs from non-primary assemblies: </div>

                <div>- These stable IDs are not in the released GTF

                  file (which contains chromosomes 1-22, X, Y, and MT).

                  I can only retrieve these IDs from BioMart. This

                  confuses me because, for a regular use case such as

                  transcriptomic alignment and quantification, the only

                  annotation input is the GTF file. So when should I

                  consider using these IDs from other assemblies?</div>

              </div>

            </blockquote>

            Please note that a GTF file with the gene annotation of the

            alternate regions (patches and haplotypes) is also part of

            the Ensembl release files:<br>

            <a href="http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz" target="_blank">http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz</a>

            <p>The use of primary assembly sequences is normally

              sufficient for transcriptomic alignment and

              quantification. The inclusion of alternate regions could

              lead to multi-mapping issues, since they are very similar

              to the corresponding sequences in the primary assembly.

              However, some users may be interested in the annotation on

              alternate regions. For instance, some gene annotations are

              known to be inaccurate because of underlying errors in the

              primary assembly, and the corrected annotations can be

              found in fix patches that have the corrected genomic

              sequences. Or genetics researchers may be interested in

              the variation of gene annotations in different haplotypes.</p>
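            <p>For readers who start from the combined file and want
              only the primary assembly, a minimal Python sketch of the
              filtering step follows. It assumes that primary
              chromosomes are named 1-22, X, Y and MT in the first GTF
              column, as in the question above; the function name and
              the example records (including the patch name) are
              made up for illustration.</p>

```python
# Sequence names treated as primary assembly (assumption: no "chr" prefix).
PRIMARY_SEQS = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def primary_assembly_lines(gtf_lines):
    """Yield GTF lines whose sequence name (column 1) is a primary chromosome.

    Header lines starting with '#' are passed through unchanged.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            yield line
            continue
        seqname = line.split("\t", 1)[0]
        if seqname in PRIMARY_SEQS:
            yield line


# Made-up records: one on chromosome 5, one on a hypothetical patch sequence.
example = [
    "#!genome-build GRCh38\n",
    '5\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_A";\n',
    'CHR_HYPOTHETICAL_PATCH\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_B";\n',
]
kept = list(primary_assembly_lines(example))
print(len(kept))
```

            <p>The same generator can be fed lines read from a
              decompressed GTF file, so the whole file never has to be
              held in memory.</p>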

            <blockquote type="cite">

              <div dir="ltr">

                <div><br>

                </div>

                <div>- After dropping the stable IDs from non-primary

                  assemblies, there are still about 1700 IDs that share

                  an external gene name. Considering only the genes

                  whose names are sourced from HGNC or NCBI, they fall

                  into the following 2 cases.</div>

                <div><br>

                </div>

                <div><br>

                </div>

                <div>2. Stable IDs with similar chromosomal positions:</div>

                <div>- For example: ENSG00000291019 (chr5: 178764861 -

                  178818435) and ENSG00000250420 (chr5: 178767204 -

                  178797611). They are both assigned to AACSP1 with a

                  source from HGNC. However, the HGNC database only

                  references ENSG00000250420. </div>

                <div><br>

                </div>

                <div>- Why do these two stable IDs exist at the same

                  time? They seem to be essentially one gene. Will one

                  of them be retired in a future release?</div>

                <div><br>

                </div>

              </div>

            </blockquote>

            <p>Most of these cases have their origin in a recent change

              in the way that we annotate transcribed pseudogenes. The

              pseudogene model, containing the homology with a coding

              gene, has been dissociated from the transcriptional

              evidence, which is now grouped in one or more lncRNA

              genes. An undesired side effect of this change is that

              both genes still share the same gene name. The pseudogene

              keeps the same stable ID, so it gets its name directly

              from HGNC, whereas the lncRNA gene gets the same name via

              NCBI based on its genomic overlap with the transcribed

              pseudogene annotated by RefSeq. <br>

            </p>
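            <p>To list the names affected by this side effect, one
              possible approach is to group stable IDs by gene name and
              keep the names that occur more than once. The Python
              sketch below assumes a two-column input of (stable ID,
              gene name) pairs, such as a BioMart export; the function
              name and the last example row are hypothetical.</p>

```python
from collections import defaultdict


def shared_names(rows):
    """Map each gene name to its stable IDs, keeping only names used more than once.

    `rows` is an iterable of (stable_id, gene_name) pairs. Empty names are ignored.
    """
    ids_by_name = defaultdict(set)
    for stable_id, gene_name in rows:
        if gene_name:
            ids_by_name[gene_name].add(stable_id)
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


rows = [
    ("ENSG00000291019", "AACSP1"),
    ("ENSG00000250420", "AACSP1"),
    ("ENSG_HYPOTHETICAL", "UNIQUE_NAME"),  # placeholder row, not a real gene
]
duplicates = shared_names(rows)
print(duplicates)
```

            <p>Using a set per name means that a stable ID listed twice
              in the export is not mistaken for a duplicate.</p>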

            <p>Please note that this change only involves

              "transcribed_unprocessed_pseudogene" genes in release 109,

              but it will be extended to the remaining transcribed

              pseudogene biotypes ("transcribed_processed_pseudogene"

              and "transcribed_unitary_pseudogene") in release 110.</p>

            <p>In future releases, transcribed pseudogenes and lncRNAs

              will still be separate genes with their own IDs. On the

              other hand, HGNC are reluctant to assign the same gene

              symbol to more than one Ensembl stable ID. To fix these

              duplicate gene name issues, we will simply remove the

              current gene names from the lncRNA genes that overlap

              transcribed pseudogenes without giving them a new name.

              However, due to the Ensembl release cycle timing, this

              will not happen before release 112.<br>

            </p>

            <p> Within the remaining set of duplicates, a few genes such

              as SPATA13 and SCARNA4 could certainly be merged to remove

              the duplication and we will look into fixing the

              annotation. <br>

            </p>

            <p><br>

            </p>

            <blockquote type="cite">

              <div dir="ltr">

                <div>3. Stable IDs with different chromosomal positions:</div>

                <div>- For example: ENSG00000240356 (chr2: 113610502 -

                  113627090 - HGNC referenced) and ENSG00000291064

                  (chr22: 50756948 - 50801309 - NCBI referenced<span>:</span><a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=118433" rel="external" target="_blank">118433</a>). They are both

                  assigned to RPL23AP7. The former is currently in the

                  HGNC database. On the NCBI website, the

                  current position is chr2: 113611239-113627138,

                    which is closer to that of ENSG00000240356. </div>

                <div><br>

                </div>

                <div>- Will these NCBI-referenced genes be fixed in

                  future releases?</div>

              </div>

            </blockquote>

            These cases seem to have been caused by an inaccurate name

            assignment to one of the genes because of its high

            sequence similarity to other genes in the same family. For

            instance, gene ENSG00000291064 should have been called

            RPL23AP82 instead of RPL23AP7. The pipeline seems to have

            taken into account the sequence identity with the NCBI genes

            of this family but not the genomic overlap with the NCBI

            gene RPL23AP82. We still need more time to investigate why

            this happened and will try to fix it in future releases.<br>
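            <p>The distinction between case 2 (overlapping loci) and
              case 3 (different loci) can be checked with a simple
              interval test. The Python sketch below uses the
              coordinates quoted in this thread; the Locus type and the
              overlaps function are illustrative names, not part of any
              Ensembl tool.</p>

```python
from typing import NamedTuple


class Locus(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # inclusive


def overlaps(a, b):
    """True when two loci are on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and min(a.end, b.end) >= max(a.start, b.start)


# Coordinates as quoted in the question (release 109).
loci = {
    "ENSG00000291019": Locus("5", 178764861, 178818435),
    "ENSG00000250420": Locus("5", 178767204, 178797611),
    "ENSG00000240356": Locus("2", 113610502, 113627090),
    "ENSG00000291064": Locus("22", 50756948, 50801309),
}

aacsp1_overlap = overlaps(loci["ENSG00000291019"], loci["ENSG00000250420"])
rpl23ap7_overlap = overlaps(loci["ENSG00000240356"], loci["ENSG00000291064"])
print(aacsp1_overlap, rpl23ap7_overlap)
```

            <p>Pairs that share a name and overlap are likely to be the
              pseudogene and lncRNA split described earlier, whereas
              pairs on different chromosomes point to a naming error.</p>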

            <p>Just a heads-up that there will be another source of

              duplicate gene names from release 110 onwards, as chromosome

              Y PAR genes are now annotated separately, i.e. they will

              have their own stable IDs but keep the same gene names as

              their chromosome X counterparts.</p>

            <blockquote type="cite">

              <div dir="ltr">

                <div><br>

                </div>

                <div>I have attached the duplicated stable IDs for case

                  2 and case 3 that I retrieved from BioMart release

                  109.</div>

                <div>Thank you and I look forward to your response.</div>

                <div><br>

                </div>

                <div>Best, </div>

                <font color="#888888">

                  <div>Hiep</div>

                </font></div>

              <br>

            </blockquote>

            <p>Please let me know if you have any further questions

              about this.</p>

            Thanks,<br>

            Jose

            <p><br>

            </p>

            <blockquote type="cite">

              <fieldset></fieldset>

              <pre>_______________________________________________

Dev mailing list    <a href="mailto:Dev@ensembl.org" target="_blank">Dev@ensembl.org</a>

Posting guidelines and subscribe/unsubscribe info: <a href="https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org" target="_blank">https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org</a>

Ensembl Blog: <a href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a>

</pre>

            </blockquote>

            <pre cols="72">-- 

Dr Jose M. Gonzalez

GENCODE Bioinformatician (Genome Interpretation Team)

European Bioinformatics Institute (EMBL-EBI)

Wellcome Genome Campus

Hinxton, CB10 1SD, UK</pre>

          </div>


        </blockquote>

      </div>

      <br>


    </blockquote>


  </div>


</blockquote></div>