<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div><div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On 10 Apr 2018, at 17:23, David Mathog <<a href="mailto:mathog@caltech.edu" class="">mathog@caltech.edu</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">On 10-Apr-2018 03:10, Thibaut Hourlier wrote:<br class=""><blockquote type="cite" class="">Yes all the code used by Ensembl is free to use and can be found on<br class=""><a href="http://github.com/Ensembl" class="">github.com/Ensembl</a> <<a href="http://github.com/Ensembl" class="">http://github.com/Ensembl</a>>. Unfortunately we do<br class="">not have a proper documentation on how to install the pipelines and<br class="">how to use them but we are working on it.<br class=""></blockquote><br class="">OK<br class=""><br class=""><blockquote type="cite" class="">If by locally you mean on your laptop, it might take some time,<br class="">probably more than a month but it is hard to predict. Our pipeline is<br class="">made to be run on a cluster with hundreds of job running in parallel.<br class=""></blockquote><br class="">This would be on a ~40 thread large Dell server.<br class=""><br class=""><blockquote type="cite" class="">All our pipelines are made to use MySQL databases which are created<br class="">when the pipeline needs them. You need to have a database with the<br class="">Ensembl schema containing your dna.<br class=""></blockquote><br class="">Why?  The input dna consists of a fasta header (with completely arbitrary information, might as well just be the numbers 1->N) and the sequence.  That's it.  
Other than the read mappings, what other information would there be in a pre-annotated genome?<br class=""></div></div></blockquote><div><br class=""></div>Yes technically you only need your sequences. We store all the data we produce in databases. The first step of our annotation pipeline is to store the sequences in the database that will be later used for the website, the public MySQL instance and anyone in Ensembl who needs the sequences for a species. It makes more sense for us to have this database ready at the beginning rather than the end of our production cycle.</div><div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class=""><blockquote type="cite" class="">If the assembly is available at<br class="">NCBI the pipeline will do the right thing. Otherwise you will need to<br class="">manually load your assembly into the database.<br class=""></blockquote><br class="">Nope, not there.<br class=""><br class=""><blockquote type="cite" class="">We are using linuxbrew to install all the software we need:<br class=""><a href="https://github.com/Ensembl/homebrew-ensembl" class="">https://github.com/Ensembl/homebrew-ensembl</a><br class=""><https://github.com/Ensembl/homebrew-ensembl><br class="">https://github.com/Ensembl/homebrew-cask<br class=""><https://github.com/Ensembl/homebrew-cask><br class="">https://github.com/Ensembl/homebrew-external<br class=""><https://github.com/Ensembl/homebrew-external><br class="">https://github.com/Ensembl/homebrew-moonshine<br class=""><https://github.com/Ensembl/homebrew-moonshine> (you will need to get<br class="">the license and archive for software like genscan)<br class=""></blockquote><br class="">This genscan? <a href="http://genes.mit.edu/license.html" class="">http://genes.mit.edu/license.html</a><br class=""></div></div></blockquote><div><br class=""></div>Yes this Genscan. I understand that in your case you will not want to run it so asking for the license will be useless. 
As it is a dependency of the pipeline I wanted to make you aware that some of the software might need a license.</div><div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class="">Is there a list somewhere in the github repository of the dependencies?  Does one of the scripts check for these and report when it starts up?<br class=""></div></div></blockquote><div><br class=""></div>Linuxbrew is a package manager which doesn’t require admin rights. So by installing the software with the commands below you should have all the software and dependencies required</div><div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class=""><blockquote type="cite" class="">brew tap ensembl/ensembl<br class="">brew tap ensembl/cask<br class="">brew tap ensembl/external<br class="">brew tap ensembl/moonshine<br class="">brew install genebuild-annotation<br class="">brew install rnaseq-pipeline<br class="">Once all the softwares are installed, you will need these repositories<br class="">to run the pipeline:<br class=""><a href="https://github.com/Ensembl/ensembl" class="">https://github.com/Ensembl/ensembl</a> <<a href="https://github.com/Ensembl/ensembl" class="">https://github.com/Ensembl/ensembl</a>><br class=""><a href="https://github.com/Ensembl/ensembl-analysis" class="">https://github.com/Ensembl/ensembl-analysis</a><br class=""><https://github.com/Ensembl/ensembl-analysis> dev/hive_master (branch)<br class="">https://github.com/Ensembl/ensembl-hive<br class=""><https://github.com/Ensembl/ensembl-hive><br class="">https://github.com/Ensembl/ensembl-compara<br class=""><https://github.com/Ensembl/ensembl-compara><br class="">https://github.com/Ensembl/ensembl-io <https://github.com/Ensembl/ensembl-io><br class="">https://github.com/Ensembl/ensembl-killlist<br class=""><https://github.com/Ensembl/ensembl-killlist><br class="">https://github.com/Ensembl/ensembl-production<br 
class=""><https://github.com/Ensembl/ensembl-production><br class="">https://github.com/bioperl/bioperl-live<br class=""><https://github.com/bioperl/bioperl-live> release-1-6-924 (tag)<br class="">ensembl-hive is our job manager which we use with LSF, SGE is<br class="">supported and some others job scheduler too. If you want to run jobs<br class="">locally a bit more tuning might be required.<br class="">The configuration of the pipeline will need some tweaking but we will<br class="">be happy to help.<br class=""></blockquote><br class="">Before going through all of that, is there a way I could manually run a few tests through just the mapping and gene prediction phases?   As noted in an earlier post the biggest problem seems to be when protein and mRNAs are mapped onto the DNA, and the DNA typically has some rough spots.  The NCBI's code notes and works around those rough spots, Maker by and large does not.  It would be good to put through a few test sets of known genomic DNA, corresponding mRNA and protein to see if the results are "NCBI like" or "Maker like".<br class=""><br class="">Basically this would just be:<br class=""><br class="">0.  mask (by whichever method is preferred, repeats are known)<br class="">1.  map corresponding mRNA to genome<br class="">2.  map corresponding protein to genome<br class="">3.  run gene prediction on raw genome + mapping<br class=""><br class="">Ideally the predicted gene's mRNA/protein will match the input fairly closely.<br class=""></div></div></blockquote><div><br class=""></div>We do not use gene prediction to generate the annotation. We base our annotation only on cDNA/transcriptomic data and protein data, which we align to the genome. What we do is:</div><div>• Mask the genome using RepeatMasker and Repbase. 
In some cases we use RepeatModeler to create a repeat library</div><div>• Align species-specific data with exonerate/genewise</div><div>• Align proteins from other species with genBlast</div><div>• Select the best gene model at each location using Perl code in ensembl-analysis</div><div><br class=""></div><div>Thanks</div><div>Thibaut</div><div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class="">If you could tell me the names of the programs used at each of these steps it would be a big help in finding the corresponding commands in all of the code you cited.<br class=""><br class="">Thanks,<br class=""><br class="">David Mathog<br class=""><a href="mailto:mathog@caltech.edu" class="">mathog@caltech.edu</a><br class="">Manager, Sequence Analysis Facility, Biology Division, Caltech<br class=""><br class=""><blockquote type="cite" class="">Thanks<br class="">Thibaut<br class=""><blockquote type="cite" class="">On 9 Apr 2018, at 17:46, David Mathog <mathog@caltech.edu> wrote:<br class="">On 06-Apr-2018 14:12, David Mathog wrote:<br class=""><blockquote type="cite" class="">Greetings all,<br class="">Is the software used for this<br class="">  http://uswest.ensembl.org/info/genome/genebuild/automatic_coding.html<br class="">publicly available?  That is, can it be downloaded and run locally?<br class=""></blockquote>Found these:<br class="">https://github.com/Ensembl/ensembl-analysis<br class=""> Modules to interface with tools used in Ensembl Gene Annotation<br class=""> Process and scripts to run pipelines<br class="">https://github.com/Ensembl/ensembl<br class=""> The Ensembl Core Perl API and SQL schema<br class="">https://github.com/Ensembl/ensembl-annotation<br class=""> The Ensembl gene annotation pipeline (a work in progress)<br class="">and dozens of others.  Have not located any documentation about how to install and run the pipeline though.  
Anybody know where that might be, or who to ask???<br class="">I only need the parts to work from data in (genome, proteins, RNA) to gff output.  Anything having to do with checking data into or out of EMBL databases is not required.<br class="">Thanks,<br class="">David Mathog<br class="">mathog@caltech.edu<br class="">Manager, Sequence Analysis Facility, Biology Division, Caltech<br class="">_______________________________________________<br class="">Dev mailing list    Dev@ensembl.org<br class="">Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev<br class="">Ensembl Blog: http://www.ensembl.info/<br class=""></blockquote></blockquote><br class=""></div></div></blockquote></div><br class=""></div></div></body></html>