<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi Julien</p>
Apologies for the delay. I hope this answers both your previous
and current questions.<br>
<p>The GTF files are split by chromosomal region.
Please note that the “<b>Homo_sapiens.GRCh38.84.gtf</b>” file
includes all the top-level chromosomal regions (1..22, X, Y, MT)
as well as the scaffolds, as you can see below.<br>
</p>
grep -v '^#' Homo_sapiens.GRCh38.84.gtf | cut -f1 | sort -u<br>
<p>
1<br>
10<br>
11<br>
12<br>
13<br>
14<br>
15<br>
16<br>
17<br>
18<br>
19<br>
2<br>
20<br>
21<br>
22<br>
3<br>
4<br>
5<br>
6<br>
7<br>
8<br>
9<br>
GL000008.2<br>
GL000009.2<br>
GL000194.1<br>
GL000195.1<br>
GL000205.2<br>
GL000213.1<br>
GL000216.2<br>
GL000218.1<br>
GL000219.1<br>
GL000220.1<br>
GL000224.1<br>
GL000225.1<br>
KI270442.1<br>
KI270706.1<br>
KI270707.1<br>
KI270708.1<br>
KI270711.1<br>
KI270713.1<br>
KI270714.1<br>
KI270721.1<br>
KI270722.1<br>
KI270723.1<br>
KI270724.1<br>
KI270726.1<br>
KI270727.1<br>
KI270728.1<br>
KI270731.1<br>
KI270733.1<br>
KI270734.1<br>
KI270741.1<br>
KI270743.1<br>
KI270744.1<br>
KI270750.1<br>
KI270752.1<br>
MT<br>
X<br>
Y</p>
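<p>As a cross-check of the listing above, here is a minimal Python sketch that collects the distinct values of the first GTF column while skipping the '#' header lines. The function names, the GL/KI prefix test for scaffolds (inferred only from the names in the listing), and the demo records are illustrative assumptions, not part of any Ensembl tool.</p>
<pre>
```python
def seqnames(gtf_lines):
    """Return the sorted distinct values of GTF column 1, skipping '#' header lines."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return sorted(names)


def is_scaffold(name):
    # Assumption: scaffolds carry GL*/KI* accessions, as in the listing above.
    return name.startswith(("GL", "KI"))


# Made-up records for illustration only (not taken from the real file).
DEMO_GTF = [
    "#!genome-build toy\n",
    '1\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "G1";\n',
    'X\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "G2";\n',
    'GL000194.1\ttoy\tgene\t1\t100\t.\t-\t.\tgene_id "G3";\n',
    '1\ttoy\tgene\t500\t900\t.\t+\t.\tgene_id "G4";\n',
]

names = seqnames(DEMO_GTF)
scaffolds = [n for n in names if is_scaffold(n)]
print(names)      # ['1', 'GL000194.1', 'X']
print(scaffolds)  # ['GL000194.1']
```
</pre>
<p>For the real file, an open file handle can be passed instead of the demo list, for example seqnames(open("Homo_sapiens.GRCh38.84.gtf")).</p>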
<p>Also, please note that '<b>Homo_sapiens.GRCh38.84.chr.gtf.gz</b>'
contains features from the primary assembly only.<br>
</p>
<p>The '<b>Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz</b>'
file additionally contains the features from haplotypes and patches
(available for human and mouse only). <br>
</p>
<p>More info about haplotypes and patches:<br>
<a class="moz-txt-link-freetext" href="https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html">https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html</a></p>
<p>- The split in the fasta files is based on the <b>biotype</b>, so
there are separate cdna, cds, ncrna, pep, etc. files
(<a class="moz-txt-link-freetext" href="http://ftp.ensemblorg.ebi.ac.uk/pub/release-84/fasta/homo_sapiens/">http://ftp.ensemblorg.ebi.ac.uk/pub/release-84/fasta/homo_sapiens/</a>).<br>
If you are interested only in the primary assembly, you should
use the chr.gtf file and ignore any accessions in the cdna fasta that
are not in the gtf, because the cdna fasta includes haplotypes.<br>
</p>
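<p>The advice above (keep only cdna records whose accession appears in the chr.gtf) could be sketched in Python roughly as follows. This is only an illustration: the function names and demo records are made up, and it assumes that FASTA headers carry a versioned ID such as "ID.1" while the GTF transcript_id attribute holds the bare ID, which should be checked against the actual files.</p>
<pre>
```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(gtf_lines):
    """Collect every transcript_id found in the attribute column of a GTF."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        match = TRANSCRIPT_ID.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Yield only the lines of FASTA records whose ID is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            # Assumption: FASTA IDs look like "ID.version" while the GTF
            # stores the bare ID, so also try the ID without its last suffix.
            bare_id = record_id.rsplit(".", 1)[0]
            keep = record_id in keep_ids or bare_id in keep_ids
        if keep:
            yield line


# Made-up records for illustration only.
DEMO_GTF = [
    "#!genome-build toy\n",
    '1\ttoy\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
DEMO_FASTA = [">T1.1 cdna\n", "ACGT\n", ">T2.1 cdna\n", "GGGG\n"]

kept = list(filter_fasta(DEMO_FASTA, gtf_transcript_ids(DEMO_GTF)))
print("".join(kept), end="")
```
</pre>
<p>Here only the T1 record survives, because T2 has no matching transcript_id in the demo GTF.</p>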
<p>Hope it helps. <br>
</p>
Thanks<br>
Prem<br>
<div class="moz-cite-prefix">On 03/10/2018 15:07, Julien Wollbrett
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:7dfe3bbd3326412396f32010538eb771@unil.ch">
<p>Hello,</p>
<p>As I did not receive an answer to my previous email, I will try to
describe my goals and questions in more detail.</p>
<p>I generated my own transcriptome fasta file using gtf from
ensembl, genome fasta file from ensembl (same release) and a
tool like gtf_to_fasta (TopHat) or gffread (Cufflinks). Once I
did that I compared the generated file with the cdna
transcriptome fasta file available at ensembl ftp. Unfortunately
I found some differences between my transcriptome fasta file and
the one provided by ensembl. That is why I tried to determine
the origin of these differences.<br>
All my tests have been run on different species (human, D.
melanogaster, ...) and different releases (84 and 93)<br>
<br>
I used the approach described below to identify the differences:<br>
- take all transcript_ids from the transcriptome fasta file of
ensembl<br>
- take all transcript_ids from the gtf file of ensembl<br>
- count the transcripts common to both files and the
transcripts specific to each file<br>
- map each transcript_id to the gtf annotation and determine the
gene_biotype associated with the transcripts of each file.<br>
<br>
<b>results for human release 84:</b></p>
<p>gtf file : <a class="moz-txt-link-freetext"
href="http://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz"
moz-do-not-send="true">
http://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz</a><br>
fasta file : <a class="moz-txt-link-freetext"
href="http://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz"
moz-do-not-send="true">
http://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz</a><br>
</p>
<p>transcripts common in both files : 161150<br>
transcripts present only in gtf : 38034<br>
transcripts present only in fasta file : 15091<br>
number of different gene biotypes for transcripts present in
gtf: 44<br>
number of different gene biotypes for transcripts in fasta file
: 23<br>
list of biotypes present only in gtf and their count :<br>
<br>
gene_biotype freq<br>
3prime_overlapping_ncrna 32<br>
antisense 10183<br>
bidirectional_promoter_lncrna 5<br>
lincRNA 12648<br>
macro_lncRNA 1<br>
miRNA 4198<br>
misc_RNA 2306<br>
Mt_rRNA 2<br>
Mt_tRNA 22<br>
non_coding 3<br>
processed_transcript 2760<br>
ribozyme 8<br>
rRNA 549<br>
scaRNA 49<br>
sense_intronic 978<br>
sense_overlapping 334<br>
snoRNA 961<br>
snRNA 1905<br>
sRNA 20<br>
TEC 1069<br>
vaultRNA 1<br>
<br>
All of the 38034 transcripts present only in the gtf have a
gene_biotype that no longer appears in the ensembl transcriptome.
<br>
</p>
<p><b>results for D. melanogaster release 93:</b></p>
<p>gtf file : <a class="moz-txt-link-freetext"
href="http://ftp.ensembl.org/pub/release-93/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.93.gtf.gz"
moz-do-not-send="true">
http://ftp.ensembl.org/pub/release-93/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.93.gtf.gz</a><br>
fasta file : <a class="moz-txt-link-freetext"
href="http://ftp.ensembl.org/pub/release-93/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz"
moz-do-not-send="true">
http://ftp.ensembl.org/pub/release-93/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz</a><br>
</p>
<p>transcripts in both files : 30819<br>
transcripts present only in gtf : 3948<br>
transcripts present only in fasta : 9<br>
number of different gene biotypes for transcripts present in
gtf: 8<br>
number of different gene biotypes for transcripts present in
fasta file : 2<br>
list of biotypes present only in gtf and their count :</p>
<p>gene_biotype freq<br>
ncRNA 2941<br>
pre_miRNA 259<br>
rRNA 115<br>
snoRNA 289<br>
snRNA 32<br>
tRNA 312<br>
</p>
<p>All of the 3948 transcripts present only in the gtf have a
gene_biotype that no longer appears in the ensembl transcriptome.
</p>
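<p>The comparison procedure listed earlier in this message (collect the transcript IDs from both files, count shared and file-specific IDs, then tally the gene_biotype of the gtf-only transcripts) might look like the following Python sketch. It is not the script that produced the numbers above; the function names and demo records are made up, and stripping a trailing ".version" from FASTA IDs is an assumption that may not suit every species.</p>
<pre>
```python
import re
from collections import Counter

ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def transcript_biotypes(gtf_lines):
    """Map transcript_id to gene_biotype using the 'transcript' rows of a GTF."""
    biotypes = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTRIBUTE.findall(cols[8]))
        if "transcript_id" in attrs:
            biotypes[attrs["transcript_id"]] = attrs.get("gene_biotype", "unknown")
    return biotypes


def fasta_ids(fasta_lines):
    """Return record IDs from FASTA headers, without a trailing '.version'."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # Assumption: the part after the last '.' is a version number.
                ids.add(fields[0].rsplit(".", 1)[0])
    return ids


def compare(gtf_lines, fasta_lines):
    """Count shared and file-specific transcripts; tally biotypes of gtf-only ones."""
    biotypes = transcript_biotypes(gtf_lines)
    gtf_ids = set(biotypes)
    fa_ids = fasta_ids(fasta_lines)
    only_gtf = gtf_ids - fa_ids
    return {
        "common": len(gtf_ids.intersection(fa_ids)),
        "only_gtf": len(only_gtf),
        "only_fasta": len(fa_ids - gtf_ids),
        "only_gtf_biotypes": Counter(biotypes[t] for t in only_gtf),
    }


# Made-up records for illustration only.
DEMO_GTF = [
    "#!genome-build toy\n",
    '1\ttoy\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";\n',
    '1\ttoy\ttranscript\t200\t300\t.\t+\t.\tgene_id "G2"; transcript_id "T2"; gene_biotype "lincRNA";\n',
    '1\ttoy\texon\t200\t300\t.\t+\t.\tgene_id "G2"; transcript_id "T2"; gene_biotype "lincRNA";\n',
]
DEMO_FASTA = [">T1.1 cdna\n", "ACGT\n", ">T3.1 cdna\n", "GGGG\n"]

result = compare(DEMO_GTF, DEMO_FASTA)
print(result)
```
</pre>
<p>In the demo, T1 is shared, T2 (lincRNA) is gtf-only and T3 is fasta-only, which mirrors the three categories reported above.</p>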
<p><br>
</p>
<p>Could someone please explain to me :</p>
<p> 1. Why are all the transcripts with these gene biotypes
removed during the creation of the transcriptome?<br>
2. Where do the transcripts present in the transcriptome
fasta file but not in the gtf file (15091 in human, 9 in D.
melanogaster) come from?<br>
3. How is the cdna transcriptome fasta file generated?<br>
4. Should I generate my own transcriptome fasta file or use
the ensembl cdna fasta file?</p>
<p>Sorry for such a long email, and thank you for your answers.</p>
<p>Best Regards,</p>
<p><br>
</p>
<p>Julien Wollbrett<br>
</p>
<pre wrap="">_______________________________________________
Dev mailing list <a class="moz-txt-link-abbreviated" href="mailto:Dev@ensembl.org">Dev@ensembl.org</a>
Posting guidelines and subscribe/unsubscribe info: <a class="moz-txt-link-freetext" href="http://lists.ensembl.org/mailman/listinfo/dev">http://lists.ensembl.org/mailman/listinfo/dev</a>
Ensembl Blog: <a class="moz-txt-link-freetext" href="http://www.ensembl.info/">http://www.ensembl.info/</a>
</pre>
</blockquote>
<br>
</body>
</html>