<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi Julien</p>
<p>I checked and can confirm that those transcripts are missing from the GTF file
in the current release as well. <br>
</p>
<p>One thing all 9 missing transcripts have in common is that
they are "trans-spliced" transcripts.</p>
<p><a class="moz-txt-link-freetext" href="http://www.ensembl.org/Drosophila_melanogaster/Transcript/Summary?db=core;g=FBgn0002781;r=3R:21375060-21377399;t=FBtr0307759">http://www.ensembl.org/Drosophila_melanogaster/Transcript/Summary?db=core;g=FBgn0002781;r=3R:21375060-21377399;t=FBtr0307759</a></p>
<p>Scroll down to the end of the page.<br>
</p>
<p>“Trans-spliced: This is a trans-spliced transcript” (a single
RNA transcript derived from multiple precursor mRNAs).</p>
<p>One possible explanation is that they were skipped because
trans-spliced transcripts (a single transcript with multiple parents) are
difficult to represent in the standard GTF file format.</p>
<p>However, I have added a Jira ticket to look into them in detail.<br>
</p>
Thanks<br>
Prem<br>
<br>
<div class="moz-cite-prefix">On 04/10/2018 13:20, Julien Wollbrett
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:5e6eab4878204b879bbf193f14552cf3@unil.ch">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<div class="moz-cite-prefix">Hi Premanand,<br>
<br>
By "where the annotation of these transcripts comes from" I
meant to ask why these transcripts are present in
cdna.all.fa but not in the GTF.<br>
I expected all transcripts in the FASTA file to be
present in the GTF file as well.<br>
<br>
Julien<br>
<br>
<br>
On 04.10.18 at 13:19, Premanand Achuthan wrote:<br>
</div>
<blockquote type="cite"
cite="mid:20181004111845.85B0922E197_BB5F715B@hx-mx1.ebi.ac.uk">
<p>Thanks Julien,</p>
<p>To get more information about the biotypes, use our REST endpoints
and adjust the parameters to filter the results.<br>
</p>
<p><a class="moz-txt-link-freetext"
href="http://rest.ensembl.org/info/biotypes/groups/?content-type=application/json"
moz-do-not-send="true">http://rest.ensembl.org/info/biotypes/groups/?content-type=application/json</a>
(list of available biotype groups)</p>
<p><a class="moz-txt-link-freetext"
href="http://rest.ensembl.org/info/biotypes/groups/coding?content-type=application/json"
moz-do-not-send="true">http://rest.ensembl.org/info/biotypes/groups/coding?content-type=application/json</a>
(list within the coding group)</p>
<p><a class="moz-txt-link-freetext"
href="http://rest.ensembl.org/info/biotypes/groups/coding/gene?content-type=application/json"
moz-do-not-send="true">http://rest.ensembl.org/info/biotypes/groups/coding/gene?content-type=application/json</a>
(list within the coding group for gene object type)</p>
<p>To get more info about the transcripts, use our lookup
endpoint.</p>
<p>For example:<br>
</p>
<p><a class="moz-txt-link-freetext"
href="http://rest.ensembl.org/lookup/id/FBtr0307760?content-type=application/json"
moz-do-not-send="true">http://rest.ensembl.org/lookup/id/<b>FBtr0307760</b>?content-type=application/json</a></p>
<pre>{
  "Parent": "FBgn0002781",
  "display_name": "mod(mdg4)-RAE",
  "db_type": "core",
  "id": "FBtr0307760",
  "is_canonical": 0,
  "assembly_name": "BDGP6",
  "end": 21377399,
  "object_type": "Transcript",
  "species": "drosophila_melanogaster",
  "biotype": "protein_coding",
  "strand": -1,
  "seq_region_name": "3R",
  "start": 21375060,
  "source": "FlyBase",
  "logic_name": "flybase"
}</pre>
<p>You can see that the source is 'FlyBase'.</p>
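<p>As a rough illustration of the lookup above, here is a minimal Python
sketch that builds the same lookup URL and reads the source field from a
response. The function names (lookup_url, fetch_lookup, annotation_source)
are made up for this example, and the sample record only copies a few
fields from the response shown above; fetch_lookup needs network access
and is not called here.</p>
<pre>
```python
import json
from urllib.parse import quote
from urllib.request import urlopen

REST_SERVER = "http://rest.ensembl.org"  # server used in the links above


def lookup_url(stable_id):
    """Build the lookup URL for one stable ID (same shape as the link above)."""
    return REST_SERVER + "/lookup/id/" + quote(stable_id) + "?content-type=application/json"


def fetch_lookup(stable_id, timeout=30):
    """Fetch and decode the lookup record; requires network access."""
    with urlopen(lookup_url(stable_id), timeout=timeout) as response:
        return json.load(response)


def annotation_source(record):
    """Return the 'source' field of a lookup record, or None if it is absent."""
    return record.get("source")


# Offline example: a few fields copied from the response shown above.
sample = json.loads('{"id": "FBtr0307760", "Parent": "FBgn0002781", "source": "FlyBase"}')
print(lookup_url(sample["id"]))
print(annotation_source(sample))
```
</pre>
<p>To query the live service instead, something like
annotation_source(fetch_lookup("FBtr0307760")) should work, assuming the
endpoint still returns the fields shown above.</p>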
<p>Please note that our REST service is currently running on
Ensembl release 94. (<a class="moz-txt-link-freetext"
href="http://rest.ensembl.org/info/software?content-type=application/json"
moz-do-not-send="true">http://rest.ensembl.org/info/software?content-type=application/json</a>)<br>
</p>
<p>Hope it helps.</p>
Thanks<br>
Prem<br>
<br>
<div class="moz-cite-prefix">On 04/10/2018 11:54, Julien
Wollbrett wrote:<br>
</div>
<blockquote type="cite"
cite="mid:5ec2a3c426414ab19f6d8dcc9b1d5d47@unil.ch">
<div class="moz-cite-prefix">Hello,<br>
<br>
Thank you, Premanand, for this extremely useful answer.<br>
<br>
If I understand correctly, one FASTA file is created using the <b>_patch_hapl_scaff.gtf.gz</b>
file for mouse and human, or the <b>gtf.gz</b> file for
other species. This file is then split by gene
biotype to create the cdna, cds, ncrna, and pep FASTA files.<br>
Do you know where I can find a list of all the biotypes used to
assign transcripts to each FASTA file?<br>
<br>
In my previous email I reported finding 9 transcripts
(FBtr0307759, FBtr0084079, FBtr0307760, FBtr0084081,
FBtr0084085, FBtr0084083, FBtr0084084, FBtr0084080,
FBtr0084082) that are present in the
<b>cdna.all.fa.gz</b> file but not in the <b>gtf.gz</b>
file for D. melanogaster (release 93). All these transcripts
are from the same gene (FBgn0002781).<br>
Do you know where the annotation of these transcripts comes
from?<br>
<br>
Thanks,<br>
<br>
Julien<br>
<br>
On 03.10.18 at 16:39, Premanand Achuthan wrote:<br>
</div>
<blockquote type="cite"
cite="mid:20181003143900.9C35F230047_BB4D484B@hx-mx1.ebi.ac.uk">
<p>Hi Julien</p>
Apologies for the delay. This may answer both your
previous and current questions.<br>
<p>The split in the GTF files is based on the chromosomal
regions. Please also note that the “<b>Homo_sapiens.GRCh38.84.gtf</b>”
file includes all the top-level chromosomal regions
(1..22, X, Y, MT) as well as the scaffolds, as you can see
below.<br>
</p>
cut -f1 Homo_sapiens.GRCh38.84.gtf | sort | uniq<br>
<p>1<br>
10<br>
11<br>
12<br>
13<br>
14<br>
15<br>
16<br>
17<br>
18<br>
19<br>
2<br>
20<br>
21<br>
22<br>
3<br>
4<br>
5<br>
6<br>
7<br>
8<br>
9<br>
GL000008.2<br>
GL000009.2<br>
GL000194.1<br>
GL000195.1<br>
GL000205.2<br>
GL000213.1<br>
GL000216.2<br>
GL000218.1<br>
GL000219.1<br>
GL000220.1<br>
GL000224.1<br>
GL000225.1<br>
KI270442.1<br>
KI270706.1<br>
KI270707.1<br>
KI270708.1<br>
KI270711.1<br>
KI270713.1<br>
KI270714.1<br>
KI270721.1<br>
KI270722.1<br>
KI270723.1<br>
KI270724.1<br>
KI270726.1<br>
KI270727.1<br>
KI270728.1<br>
KI270731.1<br>
KI270733.1<br>
KI270734.1<br>
KI270741.1<br>
KI270743.1<br>
KI270744.1<br>
KI270750.1<br>
KI270752.1<br>
MT<br>
X<br>
Y</p>
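<p>For anyone who prefers Python, here is a small sketch of roughly the
same check as the cut/sort/uniq command above: it collects the distinct
values of the first GTF column. The helper names (open_text,
seq_region_names) and the tiny GTF excerpt are invented for illustration.
Unlike the shell command, this version skips comment lines starting with
"#".</p>
<pre>
```python
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def seq_region_names(lines):
    """Collect the distinct values of the first GTF column, skipping comments."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return sorted(names)


# Tiny made-up GTF excerpt, for illustration only.
example = io.StringIO(
    "#!genome-build GRCh38\n"
    "1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id \"G1\";\n"
    "MT\tensembl\tgene\t10\t80\t.\t+\t.\tgene_id \"G2\";\n"
    "1\thavana\texon\t100\t300\t.\t+\t.\tgene_id \"G1\";\n"
)
print(seq_region_names(example))
```
</pre>
<p>On a real file, the call would look like: with
open_text("Homo_sapiens.GRCh38.84.gtf") as handle:
print(seq_region_names(handle)).</p>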
<p>Also, please note that '<b>Homo_sapiens.GRCh38.84.chr.gtf.gz</b>'
contains only the features from the primary assembly.<br>
</p>
<p>The '<b>Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz</b>'
also contains the features from haplotypes and patches
(for human and mouse only).
<br>
</p>
<p>More info about haplotype and patches:<br>
<a class="moz-txt-link-freetext"
href="https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html"
moz-do-not-send="true">https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html</a></p>
<p>- The split in the FASTA files is based on the <b>biotype</b>,
so we have cdna, cds, ncrna, pep, etc. (<a
class="moz-txt-link-freetext"
href="http://ftp.ensemblorg.ebi.ac.uk/pub/release-84/fasta/homo_sapiens/"
moz-do-not-send="true">http://ftp.ensemblorg.ebi.ac.uk/pub/release-84/fasta/homo_sapiens/</a>)<br>
If you are interested only in the primary assembly, you
should use the chr.gtf file and ignore any
accessions that are in the cdna FASTA but not in the GTF, because the cdna FASTA
includes haplotypes.<br>
</p>
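<p>One way to ignore the extra accessions is to filter the cdna FASTA
down to the transcript IDs found in the chr.gtf file. The sketch below is
only an illustration: filter_fasta and fasta_id are hypothetical helper
names, the records are made up, and the strip_version option assumes that
a version suffix, where present, follows the last dot in the identifier,
which should be checked against the actual files.</p>
<pre>
```python
import io


def fasta_id(header, strip_version=False):
    """Return the first word of a FASTA header line, without the leading '>'."""
    identifier = header[1:].split()[0]
    if strip_version:
        # Assumption: a version suffix, if any, follows the last dot.
        identifier = identifier.rsplit(".", 1)[0]
    return identifier


def filter_fasta(lines, keep_ids, strip_version=False):
    """Yield only the lines of records whose ID is in keep_ids."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            keep = fasta_id(line, strip_version) in keep_ids
        if keep:
            yield line


# Made-up records, for illustration only.
example = io.StringIO(">T1 cdna\nACGT\n>T2 cdna\nGGCC\nTTAA\n>T3.2 cdna\nCCCC\n")
kept = list(filter_fasta(example, {"T1", "T3"}, strip_version=True))
print("".join(kept), end="")
```
</pre>
<p>Here keep_ids would be the set of transcript IDs read from the GTF;
records for any other accession are dropped.</p>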
<p>Hope it helps. <br>
</p>
Thanks<br>
Prem<br>
<br>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 03/10/2018 15:07, Julien
Wollbrett wrote:<br>
</div>
<blockquote type="cite"
cite="mid:7dfe3bbd3326412396f32010538eb771@unil.ch">
<p>Hello,</p>
<p>As I did not receive an answer to my previous email, I will
try to describe my goals and questions in more detail.</p>
<p>I generated my own transcriptome FASTA file using the GTF
from Ensembl, the genome FASTA file from Ensembl (same
release) and a tool such as gtf_to_fasta (TopHat) or
gffread (Cufflinks). I then compared the
generated file with the cdna transcriptome FASTA file
available on the Ensembl FTP site. Unfortunately, I found some
differences between my transcriptome FASTA file and the
one provided by Ensembl, so I tried to
determine the origin of these differences.<br>
I ran all my tests on several species (human,
D. melanogaster, ...) and two releases (84 and 93).<br>
<br>
I used the following approach to identify the
differences:<br>
- take all transcript_ids from the Ensembl transcriptome FASTA file<br>
- take all transcript_ids from the Ensembl GTF file<br>
- count the transcripts common to both files
and the transcripts specific to each file<br>
- map each transcript_id to the GTF annotation and determine
the gene_biotype associated with the transcripts of each file.<br>
<br>
<b>results for human release 84:</b></p>
<p>gtf file : <a class="moz-txt-link-freetext"
href="http://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz"
moz-do-not-send="true">
http://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz</a><br>
fasta file : <a class="moz-txt-link-freetext"
href="http://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz"
moz-do-not-send="true">
http://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz</a><br>
</p>
<p>transcripts common in both files : 161150<br>
transcripts present only in gtf : 38034<br>
transcripts present only in fasta file : 15091<br>
number of different gene biotypes for transcripts
present in gtf: 44<br>
number of different gene biotypes for transcripts in
fasta file : 23<br>
list of biotypes present only in gtf and their count :<br>
<br>
gene_biotype freq<br>
3prime_overlapping_ncrna 32<br>
antisense 10183<br>
bidirectional_promoter_lncrna 5<br>
lincRNA 12648<br>
macro_lncRNA 1<br>
miRNA 4198<br>
misc_RNA 2306<br>
Mt_rRNA 2<br>
Mt_tRNA 22<br>
non_coding 3<br>
processed_transcript 2760<br>
ribozyme 8<br>
rRNA 549<br>
scaRNA 49<br>
sense_intronic 978<br>
sense_overlapping 334<br>
snoRNA 961<br>
snRNA 1905<br>
sRNA 20<br>
TEC 1069<br>
vaultRNA 1<br>
<br>
All 38034 transcripts present only in the GTF have a
gene_biotype that is absent from the Ensembl
transcriptome.
<br>
</p>
<p><b>results for D. melanogaster release 93:</b></p>
<p>gtf file : <a class="moz-txt-link-freetext"
href="http://ftp.ensembl.org/pub/release-93/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.93.gtf.gz"
moz-do-not-send="true">
http://ftp.ensembl.org/pub/release-93/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.93.gtf.gz</a><br>
fasta file : <a class="moz-txt-link-freetext"
href="http://ftp.ensembl.org/pub/release-93/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz"
moz-do-not-send="true">
http://ftp.ensembl.org/pub/release-93/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz</a><br>
</p>
<p>transcripts in both files : 30819<br>
transcripts present only in gtf : 3948<br>
transcripts present only in fasta : 9<br>
number of different gene biotypes for transcripts
present in gtf: 8<br>
number of different gene biotypes for transcripts
present in fasta file : 2<br>
list of biotypes present only in gtf and their count :</p>
<p>gene_biotype freq<br>
ncRNA 2941<br>
pre_miRNA 259<br>
rRNA 115<br>
snoRNA 289<br>
snRNA 32<br>
tRNA 312<br>
</p>
<p>All 3948 transcripts present only in the GTF have a
gene_biotype that is absent from the Ensembl
transcriptome.
</p>
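<p>The comparison approach described above (collect the transcript_ids
from both files, compute the overlap, then look up the gene_biotype of
the GTF-only transcripts) can be sketched in Python roughly as follows.
This is not the exact script used for the numbers above; the function
names and the two-record inputs are invented for illustration, and it
assumes the FASTA identifiers match the GTF transcript_id values
exactly.</p>
<pre>
```python
import io
import re
from collections import Counter

# Matches GTF attributes of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def gtf_transcript_biotypes(lines):
    """Map each transcript_id in a GTF to its gene_biotype (None if missing)."""
    biotypes = {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        attributes = dict(ATTRIBUTE.findall(fields[8]))
        transcript_id = attributes.get("transcript_id")
        if transcript_id:
            biotypes[transcript_id] = attributes.get("gene_biotype")
    return biotypes


def fasta_ids(lines):
    """Collect the first word of every FASTA header line."""
    return {line[1:].split()[0] for line in lines if line.startswith(">")}


def compare(gtf_biotypes, fasta_id_set):
    """Summarise the overlap and count gene biotypes of GTF-only transcripts."""
    gtf_ids = set(gtf_biotypes)
    only_gtf = gtf_ids.difference(fasta_id_set)
    return {
        "common": len(gtf_ids.intersection(fasta_id_set)),
        "only_gtf": len(only_gtf),
        "only_fasta": len(fasta_id_set.difference(gtf_ids)),
        "only_gtf_biotypes": Counter(gtf_biotypes[t] for t in only_gtf),
    }


# Made-up inputs, for illustration only.
gtf = io.StringIO(
    '3R\tFlyBase\ttranscript\t1\t100\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";\n'
    '3R\tFlyBase\ttranscript\t200\t300\t.\t-\t.\t'
    'gene_id "G2"; transcript_id "T2"; gene_biotype "snoRNA";\n'
)
fasta = io.StringIO(">T1 cdna\nACGT\n>T9 cdna\nGGCC\n")
summary = compare(gtf_transcript_biotypes(gtf), fasta_ids(fasta))
print(summary)
```
</pre>
<p>Using sets keeps each step of the approach to a single operation, and
the Counter gives the per-biotype counts for the transcripts found only
in the GTF.</p>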
<p><br>
</p>
<p>Could someone please explain to me :</p>
<p> 1. Why are all the transcripts with these gene biotypes
removed during the creation of the transcriptome?<br>
2. Where do the transcripts present in the
transcriptome FASTA file but not in the GTF file (15091
in human, 9 in D. melanogaster) come from?<br>
3. How is the cdna transcriptome FASTA file
generated?<br>
4. Should I generate my own transcriptome FASTA file
or use the Ensembl cdna FASTA file?</p>
<p>Sorry for such a long email.... and thank you for your
answers.</p>
<p>Best Regards,</p>
<p><br>
</p>
<p>Julien Wollbrett<br>
</p>
<pre wrap="">_______________________________________________
Dev mailing list <a class="moz-txt-link-abbreviated" href="mailto:Dev@ensembl.org" moz-do-not-send="true">Dev@ensembl.org</a>
Posting guidelines and subscribe/unsubscribe info: <a class="moz-txt-link-freetext" href="http://lists.ensembl.org/mailman/listinfo/dev" moz-do-not-send="true">http://lists.ensembl.org/mailman/listinfo/dev</a>
Ensembl Blog: <a class="moz-txt-link-freetext" href="http://www.ensembl.info/" moz-do-not-send="true">http://www.ensembl.info/</a>
</pre>
</blockquote>
<br>
</blockquote>
<p><br>
</p>
</blockquote>
<br>
</blockquote>
<p><br>
</p>
</blockquote>
<br>
</body>
</html>