<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Tue, Dec 23, 2014 at 4:28 AM, Will McLaren <span dir="ltr"><<a href="mailto:wm2@ebi.ac.uk" target="_blank">wm2@ebi.ac.uk</a>></span> wrote:<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">Thanks again Rocky, your work on this is really appreciated, and great to see such an improvement for such a minor change!</div></blockquote><div><br></div><div>And thank you for the kind words.  </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><br></div><div>If there's any other code you'd like to share, or any changes to ours, please feel free to send us more details or put in a pull request on GitHub.</div></div></blockquote><div><br></div><div>Ok. There is one other avenue that I think is worthwhile to pursue and that I mentioned before: a Map/Reduce variation of this code.  On a Human Genome that might give a speedup of anywhere from 10-20 times with little thought as to how to proceed.</div><div><br></div><div>Of course the actual unparallelized processor time would be the same -- or a little more since there is map/reduce overhead. (That's why I don't think it is possible to obtain the theoretical maximum of 24 times speed improvement.)</div><div><br></div><div>I have in mind someone who might be able to provide processor time on an existing scalable cloud platform for such a task. I don't otherwise have such resources at my disposal.</div><div><br></div><div>But to convince him to donate the service for free so that I can try this, it would be helpful to demonstrate that there is a desire or need for it in the community. 
</div><div><br></div><div>The flame graph shows the other areas that afford the most opportunity for speed improvement without changing the overall algorithm: </div><div><br></div><div><font face="monospace"><span style="color:rgb(128,128,128);font-size:medium;white-space:pre">16.00s Bio::EnsEMBL::Variation::BaseVariationFeatureOverlapAllele::</span><a href="Bio-BaseVariationFeatureOverlapAllele-pm-169-line.html#199" style="color:rgb(109,0,230);font-size:medium;white-space:pre">get_all_OverlapConsequences</a>  </font></div><div><font face="monospace"><span style="font-size:medium;color:rgb(128,128,128);white-space:pre">08.85s Bio::EnsEMBL::Feature::</span><a href="Bio-EnsEMBL-Feature-pm-92-line.html#665" style="font-size:medium;color:rgb(109,0,230);white-space:pre">transfer</a></font></div><div><font face="monospace"><span style="color:rgb(128,128,128);font-size:medium;white-space:pre">08.72s Bio::EnsEMBL::Variation::BaseVariationFeatureOverlapAllele::</span><a href="Bio-BaseVariationFeatureOverlapAllele-pm-169-line.html#154" style="color:rgb(109,0,230);font-size:medium;white-space:pre">base_variation_feature</a><br></font></div><div><font face="monospace"><span style="font-size:medium;color:rgb(128,128,128);white-space:pre">05.56s Bio::EnsEMBL::Variation::BaseVariationFeatureOverlapAllele::</span><a href="Bio-BaseVariationFeatureOverlapAllele-pm-169-line.html#168" style="font-size:medium;color:rgb(109,0,230);white-space:pre">feature</a><br></font></div><div><span style="font-family:monospace">0</span><span style="font-size:medium;color:rgb(128,128,128);font-family:monospace;white-space:pre">5.00s Bio::EnsEMBL::Variation::BaseTranscriptVariation::</span><a href="Bio-EnsEMBL-Variation-BaseTranscriptVariation-pm-218-line.html#532" style="font-size:medium;color:rgb(109,0,230);font-family:monospace;white-space:pre">_intron_effects</a></div><div><br></div><div>The above is for a (smallish) run that took 195 seconds overall; the time given at the beginning of each line is 
time spent exclusively in the named routines. That is, it excludes time spent in the routines that each subroutine called.  Even if you were to reduce these times to zero, clearly you're not going to see even a 10-fold improvement, as you probably would with map/reduce.</div><div><br></div><div>And again to be clear: even though improvements are planned to reduce the time spent in the routines that get_all_OverlapConsequences calls, that doesn't necessarily mean the exclusive time listed above will be greatly affected. I personally do think it will have some impact. But there are other lines above, like the second one, transfer, that are not even in that phase of the code; they deal only with formatting the results once you have the basic data.  And it is disturbing that this phase, vfa_to_line, should take more than 44% of the overall time.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><br></div><div>Thanks</div><div><br></div><div>Will</div></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">On 23 December 2014 at 03:26, Rocky Bernstein <span dir="ltr"><<a href="mailto:rocky.bernstein@gmail.com" target="_blank">rocky.bernstein@gmail.com</a>></span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div class="h5"><div dir="ltr">Just a follow-up to my earlier post. <div><br></div><div>I ran a Variant Effect Prediction  run on a VCF file of 5000 entries (which is what fits in one buffer read)  with one small change. With that, I was able to significantly reduce the time bottleneck in the Fasta code. 
The time spent here went from 7.76 seconds to 2.32 seconds.</div><div><br></div><div>Compare the top line of: <br><div class="gmail_extra"><a href="http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html" target="_blank">http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html</a> </div><div class="gmail_extra">with:</div><div class="gmail_extra"><a href="http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html" target="_blank">http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html</a></div><div class="gmail_extra"><br></div><div class="gmail_extra">You get a 50% reduction just from needing one transformation to remove both \n and \r rather than two. But even beyond this, the C code for one run is still faster than the corresponding Perl s///. </div><div class="gmail_extra"><br></div><div class="gmail_extra">The specific change that I made can be found at <a href="https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff" target="_blank">https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff</a></div><div class="gmail_extra">You'll also see benchmarks for other variations of that code. </div><div class="gmail_extra"><br></div><div class="gmail_extra">But.... in order to see the effect in a run you need to have the Perl module Inline::C installed. Otherwise you get the lesser improvement outlined in my original posting, which speeds things up by compiling just once the Perl regular expressions used to match \n and \r. </div><div class="gmail_extra"><br></div><div class="gmail_extra">In the spirit of open scientific review, I am curious to learn whether others see the same kind of improvement I saw. </div><div class="gmail_extra"><br></div><div class="gmail_extra">I have a pull request for this change to the bioperl-live repository. 
See <a href="https://github.com/bioperl/bioperl-live/issues/95" target="_blank">https://github.com/bioperl/bioperl-live/issues/95</a> . However I note that the Bio::DB code used by  Variant Effect Predictor is a different (back-level) version from the code in that git repository. The diff file in the gist cited above is for the Fasta.pm code that is in Ensembl ; of course, the pull request uses the current Bio::DB code. </div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Lastly <a href="http://dustyfeet.com:8001" target="_blank">http://dustyfeet.com:8001</a> has the profile results for other kinds of runs, which I hope will clarify my other remarks about where things are slow. </div><div class="gmail_extra"><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Dec 18, 2014 at 12:48 AM, Rocky Bernstein <span dir="ltr"><<a href="mailto:rocky.bernstein@gmail.com" target="_blank">rocky.bernstein@gmail.com</a>></span> wrote:<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">Running the Variant Effect Predictor on a Human Genome VCF file (130780 lines)  with a local Fasta cache (--offline) takes about 50 minutes on a quad-core Ubuntu box. <div><br></div><div>I could give more details, but I don't think they are that important.<div><br></div><div>In looking at how to speed this up, it looks like VEP goes through the VCF file, which is sorted by chromosome, and processes each</div><div>chromosome independently. The first obvious way to speed this up would be to do some sort of 24-way map/reduce.</div><div>There is of course the --fork option on the <a href="http://variant_effect_predictor.pl" target="_blank">variant_effect_predictor.pl</a> program which is roughly the same idea, but it parallelizes only across the cores of a single computer rather than making use of multiple ones. 
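<div><br></div><div>A minimal sketch of the "map" half of that idea, assuming the VCF data lines carry the chromosome name in their first tab-separated column. split_by_chromosome is a hypothetical helper, not part of VEP, and the four input lines are made up purely to show the shape of the data. Each bucket could then be handed to a separate worker or machine, with the per-chromosome outputs concatenated afterwards as the "reduce" step.</div>
<pre>
```perl
use strict;
use warnings;

# Hypothetical helper (not part of VEP): bucket VCF data lines by the
# chromosome named in their first tab-separated column.
sub split_by_chromosome {
    my @lines = @_;
    my %bucket;
    for my $line (@lines) {
        next if $line =~ /^#/;                 # skip VCF header lines
        my ($chrom) = split /\t/, $line, 2;    # first column only
        push @{ $bucket{$chrom} }, $line;
    }
    return \%bucket;
}

# Made-up input, only to illustrate the data shape.
my @vcf = (
    "##fileformat=VCFv4.1",
    "1\t100\t.\tA\tG",
    "2\t200\t.\tC\tT",
    "1\t300\t.\tG\tA",
);

my $buckets = split_by_chromosome(@vcf);

# Each bucket is independent, so it could be processed by its own worker.
for my $chrom (sort keys %$buckets) {
    printf "%s: %d variant(s)\n", $chrom, scalar @{ $buckets->{$chrom} };
}
```
</pre>
<div>Since the chromosomes differ a lot in size, the largest bucket would bound the wall-clock time, which is one more reason not to expect the full 24-fold speedup.</div>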
</div><div><br></div><div>To pinpoint the slowness better, I used Devel::NYTProf. For those of you who haven't used it recently, it now has flame graphs, which make it very easy to see what's going on.</div><div><br></div><div>The first thing that came out was slowness in the code that removes carriage returns and line feeds. This is in Bio::DB::Fasta::subseq: <br></div><div><br></div><div><div>     $data =~ s/\n//g;</div><div>     $data =~ s/\r//g;</div></div><div><br></div><div>Compiling the regexps once, e.g.: </div><div><br></div><div>     my $nl = qr/\n/;</div><div>     my $cr = qr/\r/;</div><div><br></div><div>     sub subseq {</div><div>         ....</div><div>        $data =~ s/$nl//g;</div><div>        $data =~ s/$cr//g;</div><div>     }</div><div><br></div><div>speeds up the subseq method by about 15%. I can elaborate more or describe the other methods I tried and how they fared, if there's interest. But since this portion is really part of BioPerl and not Bio::EnsEMBL, I'll try to work up a git pull request against that repository.</div><div><br></div><div>So now I come to the meat of what I have to say. I should have put this at the top -- I hope some of you are still with me. </div><div><br></div><div>The NYTProf graphs seem to say that there is a *lot* of overhead in object lookup and type testing. I think some of this is already known as there already are calls to "weaken" and "new_fast" object creators. And there is this comment in  Bio::EnsEMBL::Variation::BaseTranscriptVariation::_intron_effects:</div><div><br></div><div><br></div><div><div>    # this method is a major bottle neck in the effect calculation code so</div><div>    # we cache results and use local variables instead of method calls where</div><div>    # possible to speed things up - caveat bug-fixer!</div></div><div><br></div><div>In the few cases guided by NYTProf, I've been able to make reasonable speed ups at the expense of eliminating the tests</div><div>and object overhead. 
<br></div><div><br></div><div>For example, in EnsEMBL::Variation::BaseTranscriptVariation changing: </div><div><br></div><div><br></div><div><div> sub transcript {</div><div>     my ($self, $transcript) = @_;</div><div>     assert_ref($transcript, 'Bio::EnsEMBL::Transcript') if $transcript;</div><div>     return $self->SUPER::feature($transcript, 'Transcript');</div><div>}<br></div></div><div><br></div><div>to: </div><div><br></div><div><div>     sub transcript {</div><div>         my ($self, $transcript) = @_;</div><div>        return $self->{feature};<br></div><div>     }</div><div><br></div></div><div>gives a noticeable speed up. But you may ask: if we do that, don't we lose type safety and introduce a potential for bugs? </div><div>Here's my take on how to address these valid concerns. First, I think there could be two sets of the Perl modules, such as for</div><div>EnsEMBL::Variation::BaseTranscriptVariation - those with all of the checks and those that are fast.  A configuration parameter might specify which version to use. In development or by default, one might use the ones that check types. </div><div><br></div><div>Second and perhaps more important, there are the tests! If more need to be added, then let's add them. And one can always add a test to make sure the two versions give the same result.<br></div><div><br></div><div>One last avenue of optimization that I'd like to explore is using, say, Inline::C, or basically coding hot spots in C. In particular, consider</div><div>Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap which looks like this: <br></div><div><br></div><div>         my ( $f1_start, $f1_end, $f2_start, $f2_end ) = @_;</div><div>         return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) );</div><div><br></div><div>I haven't tried it on this hot spot, but this is something that might benefit from getting coded in C. Again the trade off for speed here is a dependency on compiling C. 
In my view anyone installing this locally or installing CPAN modules probably already has that dependency, but it does add complexity.</div><div><br></div><div>Typically, this is handled in Perl by providing both versions, perhaps as separate modules.</div><div><br></div><div>Thoughts or comments? </div><div><br></div><div>Thanks,</div><div>   rocky </div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><div><div><div><br></div></div></div></div></div></div>

</blockquote></div></div></div></div>

<br></div></div>_______________________________________________<br>

Dev mailing list    <a href="mailto:Dev@ensembl.org" target="_blank">Dev@ensembl.org</a><br>

Posting guidelines and subscribe/unsubscribe info: <a href="http://lists.ensembl.org/mailman/listinfo/dev" target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>

Ensembl Blog: <a href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a><br>

<br></blockquote></div><br></div>


<br></blockquote></div></div></div>