<div dir="ltr">Thanks again Rocky, your work on this is really appreciated, and great to see such an improvement for such a minor change!<div><br></div><div>If there's any other code you'd like to share, or any changes to ours, please feel free to send us more details or put in a pull request on GitHub.</div><div><br></div><div>Thanks</div><div><br></div><div>Will</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 23 December 2014 at 03:26, Rocky Bernstein <span dir="ltr"><<a href="mailto:rocky.bernstein@gmail.com" target="_blank">rocky.bernstein@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Just a follow-up to my earlier post. <div><br></div><div>I ran a Variant Effect Prediction  run on a VCF file of 5000 entries (which is what fits in one buffer read)  with one small change. With that, I was able to significantly significantly reduce the time bottleneck in the Fasta code. The time spent here went from 7.76 seconds to 2.32 seconds.</div><div><br></div><div>Compare the top line of: <br><div class="gmail_extra"><a href="http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html" target="_blank">http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html</a> </div><div class="gmail_extra">with:</div><div class="gmail_extra"><a href="http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html" target="_blank">http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html</a></div><div class="gmail_extra"><br></div><div class="gmail_extra">You get a 50% reduction just by the fact that one transformation is needed to remove both \n and \r rather than two transformations. But even beyond this, the C code for one run is still faster than the corresponding Perl s///. </div><div class="gmail_extra"><br></div><div class="gmail_extra">The specific change that I made can be found at <a href="https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff" target="_blank">https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff</a></div><div class="gmail_extra">You'll also see benchmarks for other variations of that code. </div><div class="gmail_extra"><br></div><div class="gmail_extra">But.... in order to see the effect in a run you need to have Perl module Inline::C installed. Otherwise you get a lesser improvement outlined in my original posting.  Again this speeds things up by compiling once Perl regular expressions used to match \n and \r. </div><div class="gmail_extra"><br></div><div class="gmail_extra">In the spirit of open scientific review, I am curious to learn of others experience the same kind of improvement I saw. </div><div class="gmail_extra"><br></div><div class="gmail_extra">I have a pull request for this change to the bioperl-live repository. See <a href="https://github.com/bioperl/bioperl-live/issues/95" target="_blank">https://github.com/bioperl/bioperl-live/issues/95</a> . However I note that the Bio::DB code used by  Variant Effect Predictor is a different (back-level) from the code in that git repository. The diff file in the gist cited above is for the Fasta.pm code that is in Ensembl ; of course, the pull request uses the current Bio::DB code. </div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Lastly <a href="http://dustyfeet.com:8001" target="_blank">http://dustyfeet.com:8001</a> has the profile results other kinds of runs which I hope will clarify my other remarks about where things are slow. </div><div class="gmail_extra"><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Dec 18, 2014 at 12:48 AM, Rocky Bernstein <span dir="ltr"><<a href="mailto:rocky.bernstein@gmail.com" target="_blank">rocky.bernstein@gmail.com</a>></span> wrote:<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">Running the Variant Effect Predictor on a Human Genome VCF file (130780 lines)  with a local Fasta cache (--offline) takes about 50 minutes on a quad-core Ubuntu box. <div><br></div><div>I could give more details, but I don't think they are that important.<div><br></div><div>In looking at how to speed this up, it looks like VEP goes through the VCF file,  is sorted by chromosome, and processes each</div><div>Chromosome independently. The first obvious way to speed this up would be to do some sort of 24-way map/reduce.</div><div>There is of course the --fork option on the <a href="http://variant_effect_predictor.pl" target="_blank">variant_effect_predictor.pl</a> program which is roughly the same idea, but it parallelizes only across the cores of a single computer rather than make use of multiple ones. </div><div><br></div><div>To pinpoint the slowness better, I used Devel::NYTProf. For those of you who haven't used it recently, it now has flame graphs and it makes it very easy to see what's going on.</div><div><br></div><div>The first thing that came out was a slowness in code to remove carriage returns and line feeds. This is in Bio::DB::Fasta ::subseq: <br></div><div><br></div><div><div>     $data =~ s/\n//g;</div><div>     $data =~ s/\r//g;</div></div><div><br></div><div>Compiling the regexp, e.g: </div><div><br></div><div>     my $nl = qr/\n/;</div><div>     my $cr = qr/\r/;</div><div><br></div><div>     sub subseq {</div><div>         ....</div><div>        $data =~ s/$nl//g;</div><div>        $data =~ s/$cr//g;</div><div>     }</div><div><br></div><div>Speeds up the subseq method by about 15%. I can elaborate more or describe the other methods I tried and how they fared, if there's interest. But since this portion is really part of BioPerl and not Bio::EnsEMBL, I'll try to work up a git pull request ont that repository.</div><div><br></div><div>So now I come to the meat of what I have to say. I should have put this at the top -- I hope some of you are still with me. </div><div><br></div><div>The NYTProf graphs seem to say that there is a *lot* of overhead in object lookup and type testing. I think some of this is already known as there already are calls to "weaken" and "new_fast" object creators. And there is this comment in  Bio::EnsEMBL::Variation::BaseTranscriptVariation:_intron_effects:</div><div><br></div><div><br></div><div><div>    # this method is a major bottle neck in the effect calculation code so</div><div>    # we cache results and use local variables instead of method calls where</div><div>    # possible to speed things up - caveat bug-fixer!</div></div><div><br></div><div>In the few cases guided by NYTProf, I've been able to make reasonable speed ups at the expense of eliminating the tests</div><div>and object overhead. <br></div><div><br></div><div>For example, in EnsEMBL::Variation::BaseTranscriptVariation changing: </div><div><br></div><div><br></div><div><div> sub transcript {</div><div>     my ($self, $transcript) = @_;</div><div>     assert_ref($transcript, 'Bio::EnsEMBL::Transcript') if $transcript;</div><div>     return $self->SUPER::feature($transcript, 'Transcript');</div><div>}<br></div></div><div><br></div><div>to: </div><div><br></div><div><div>     sub transcript {</div><div>         my ($self, $transcript) = @_;</div><div>        return $self->{feature};<br></div><div><br></div></div><div>Gives a noticeable speed up. But you may ask: if that happens, then we lose type safety and there is a potential for bugs? </div><div>And here's my take on how to address these valid concerns. First, I think there could be two sets of the Perl modules, such as for</div><div>EnsEMBL::Variation::BaseTranscriptVariation - those with all of the checks and those that are fast.  A configuration parameter might specify which version to use. In development or by default, one might use the ones that check types. </div><div><br></div><div>Second and perhaps more import, there are the tests! If more need to be added, then let's add them. And one can always add a test to make sure the results of the two versions gives the same result.<br></div><div><br></div><div>One last avenue of optimization that I'd like to explore is using say Inline::C or basically coding in C hot spots. In particular, consider</div><div>Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap which looks like this: <br></div><div><br></div><div>         my ( $f1_start, $f1_end, $f2_start, $f2_end ) = @_;</div><div>         return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) );</div><div><br></div><div>I haven't tried it on this hot spot, but this is something that might benefit from getting coded in C. Again the trade off for speed here is a dependency on compiling C. In my view anyone installing this locally or installing CPAN modules probably already does, but it does add complexity.</div><div><br></div><div>Typically, this is handled in Perl by providing both versions, perhaps as separate modules.</div><div><br></div><div>Thought or comments? </div><div><br></div><div>Thanks,</div><div>   rocky </div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><div><div><div><br></div></div></div></div></div></div>

</blockquote></div></div></div></div>

<br>_______________________________________________<br>

Dev mailing list    <a href="mailto:Dev@ensembl.org">Dev@ensembl.org</a><br>

Posting guidelines and subscribe/unsubscribe info: <a href="http://lists.ensembl.org/mailman/listinfo/dev" target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>

Ensembl Blog: <a href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a><br>

<br></blockquote></div><br></div>