<div dir="ltr">Ok, good to know you all are looking at ways to optimize VEP run time.<div><br></div><div>To sum up, 50 minutes of elapsed time would probably go to about 3-5 with map/reduce (but with 20 or so times the number of processors). It probably wouldn't take that much thought.  This is an area I could explore if there is interest.<div><br></div><div>Any comments regarding recoding hot spots in C? </div><div><br></div><div>Other comments in line...</div><div><br></div><div><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Dec 18, 2014 at 11:34 AM, Will McLaren <span dir="ltr"><<a href="mailto:wm2@ebi.ac.uk" target="_blank">wm2@ebi.ac.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">Hi Rocky,<div><br></div><div>Thanks very much for the detailed feedback.</div><div><br></div><div>We are currently looking into ways to further optimise VEP run time, and we are already considering a number of the solutions you propose, along with some others. We'll definitely take a look at the things we haven't thought of too.</div><div><br></div><div>Further comments inline below.</div><div><br></div><div>Regards</div><div><br></div><div>Will McLaren</div><div>Ensembl Variation</div><div><br></div><div class="gmail_extra"><br><div class="gmail_quote"><span>On 18 December 2014 at 15:51, Rocky Bernstein <span dir="ltr"><<a href="mailto:rocky.bernstein@gmail.com" target="_blank">rocky.bernstein@gmail.com</a>></span> wrote:<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><span style="font-size:13px">Running the Variant Effect Predictor on a Human Genome VCF file (130780 lines)  with a local Fasta cache (--offline) takes about 50 minutes on a quad-core Ubuntu box. </span><div style="font-size:13px"><br></div><div style="font-size:13px">I could give more details, but I don't think they are that important.<div><br></div><div>In looking at how to speed this up, it looks like VEP goes through the VCF file,  is sorted by chromosome, and processes each</div><div>Chromosome independently. A simple and obvious way to speed this up would be to do some sort of 24-way map/reduce.</div><div>There is of course the --fork option on the <a href="http://variant_effect_predictor.pl/" target="_blank">variant_effect_predictor.pl</a> program which is roughly the same idea, but it parallelizes only across the cores of a single computer rather than make use of multiple ones. </div><div><br></div><div>To pinpoint the slowness better, I used Devel::NYTProf. For those of you who haven't used it recently, it now has flame graphs and it makes it very easy to see what's going on.</div></div></div></blockquote><div><br></div></span><div>We regularly run Devel::NYTProf on the code, very useful it is too!</div><span><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div style="font-size:13px"><div><br></div><div>The first thing that came out was a slowness in code to remove carriage returns and line feeds. This is in Bio::DB::Fasta ::subseq: <br></div><div><br></div><div><div>     $data =~ s/\n//g;</div><div>     $data =~ s/\r//g;</div></div><div><br></div><div>Compiling the regexp, e.g: </div><div><br></div><div>     my $nl = qr/\n/;</div><div>     my $cr = qr/\r/;</div><div><br></div><div>     sub subseq {</div><div>         ....</div><div>        $data =~ s/$nl//g;</div><div>        $data =~ s/$cr//g;</div><div>     }</div><div><br></div><div>Speeds up the subseq method by about 15%. I can elaborate more or describe the other methods I tried and how they fared, if there's interest. But since this portion is really part of BioPerl and not Bio::EnsEMBL, I'll try to work up a git pull request on that repository.</div></div></div></blockquote><div><br></div></span><div>Thanks, that would be useful. This regex came up in our NYTProf runs too.</div><span><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div style="font-size:13px"><div><br></div><div>So now I come to the meat of what I have to say. I should have put this at the top -- I hope some of you are still with me. </div><div><br></div><div>The NYTProf graphs seem to say that there is a *lot* of overhead in object lookup and type testing. I think some of this is already known as there already are calls to "weaken" and "new_fast" object creators. And there is this comment in  Bio::EnsEMBL::Variation::BaseTranscriptVariation:_intron_effects:</div><div><br></div><div><br></div><div><div>    # this method is a major bottle neck in the effect calculation code so</div><div>    # we cache results and use local variables instead of method calls where</div><div>    # possible to speed things up - caveat bug-fixer!</div></div></div></div></blockquote><div><br></div></span><div>One major avenue we will be investigating is subs such as this.</div><div><br></div><div>For each TranscriptVariationAllele object created (this represents the overlap of a variant allele and a transcript), the VEP evaluates a number of predicate statements. All predicates are evaluated for all objects, without any filtering with prior knowledge (for example if we know a variant falls entirely in the coding sequence there's no point checking if it's intronic).</div><div><br></div><div>This should be fairly low-hanging fruit to pick, so this will probably be one of the first optimisations to reach production code.</div></div></div></div></blockquote><div><br></div><div>Ok. This too is good to know.  I can think of several ways this can be done, so I will be eager to see how this works. </div><div><br></div><div>If I have this right (and I may not) the predicate tests are somewhere under Bio::EnsEMBL::Variation-BaseVariationFeature::display_consequence and that constitutes less than half of the time.</div><div><br></div><div>An equal portion of time is in part that formats the results for output, Bio::EnsEMBL::Variation::Utils::VEP::vfa_to_line.  It is in this portion where "transcript" mentioned below lies.</div><div><br></div><div><br></div><div>In the graphs I have, vfa_to_line is in the right-hand portion while display_consequences is in the left-hand portion. If you or others want to see my graphs, I can try to make them available online somewhere. <br></div><div><br></div><div>Since you've looked at NYTProf output, you should see this too, right?  Again I have this right, this is big area to look for improvements. Reformatting objects doesn't feel like it should be that time consuming. </div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div style="font-size:13px"><div><br></div><div>In the few cases guided by NYTProf that I have looked at, I've been able to make reasonable speed ups at the expense of eliminating the tests</div><div>and object overhead. <br></div><div><br></div><div>For example, in EnsEMBL::Variation::BaseTranscriptVariation changing: </div><div><br></div><div><br></div><div><div> sub transcript {</div><div>     my ($self, $transcript) = @_;</div><div>     assert_ref($transcript, 'Bio::EnsEMBL::Transcript') if $transcript;</div><div>     return $self->SUPER::feature($transcript, 'Transcript');</div><div>}<br></div></div><div><br></div><div>to: </div><div><br></div><div><div>     sub transcript {</div><div>         my ($self, $transcript) = @_;</div><div>        return $self->{feature};<br></div><div><br></div></div><div>Gives a noticeable speed up. But you may ask: if that happens, then we lose type safety and there is a potential for bugs? </div><div>Here ist how to address these valid concerns.</div></div></div></blockquote><div><br></div></span><div>Type safety is of course important, but there are places (such as this one) where the setter is only ever called internally. Testing should find cases where this is done incorrectly, but it may be possible to remove the assert_ref() call here completely.</div></div></div></div></blockquote><div><br></div><div>I just want to be clear about something here. It is not just removing the assert_ref that helps. It is </div><div>also removing the call to $self->SUPER::feature which also includes another assert_ref call before it</div><div>returns $self->{feature}. </div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div style="font-size:13px"><div><br></div><div>First, I think there could be two sets of the Perl modules, such as for</div><div>EnsEMBL::Variation::BaseTranscriptVariation. One set with all of the checks and another without that are fast.  A configuration parameter might specify which version to use. In development or by default, one might use the ones that check types. </div></div></div></blockquote><div><br></div></span><div>I think having two sets of modules would be a bit clumsy and could lead to inconsistencies between the two, even if more tests were added.</div></div></div></div></blockquote><div><br></div><div>The model I was thinking of is like Scalar::Util or at least the way it was which had a pure Perl version and an XS version. In Ruby I there used to be (and there still may be) there were some database drivers that had both C-compiled versions as well as pure Ruby ones.</div><div><br></div><div>As far as tests go, if one specified an optimization build, then that would be used for the tests. Travis or some other Continuous Integration would run with both sets of options.</div><div><br></div><div>But, yes, I can see that it is more complex. If it is of interest I might be able to work up a little example in a branch of a fork to show how it might work. I think the key though is to limit in just those few places it is needed, drop into C occasionally.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div style="font-size:13px"><div><br></div><div>Second and perhaps more import, there are the tests! If more need to be added, then let's add them. And one can always add a test to make sure the results of the two versions gives the same result.<br></div><div><br></div><div>One last avenue of optimization that I'd like to explore is using say Inline::C or basically coding in C hot spots. In particular, consider</div><div>Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap which looks like this: <br></div><div><br></div><div>         my ( $f1_start, $f1_end, $f2_start, $f2_end ) = @_;</div><div>         return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) );</div></div></div></blockquote><div><br></div></span><div>Again, something we've thought of, but I'd be surprised if you saw that much speedup from changing some simple int comparisons.</div><div><br></div><div>I believe there may be better ways to optimise this overlap code, but it's not something we've got to yet.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span><div dir="ltr"><div style="font-size:13px"><div><br></div><div>I haven't tried it on this hot spot, but this is something that might benefit from getting coded in C. Again the trade off for speed here is a dependency on compiling C. In my view anyone installing this locally or installing CPAN modules probably already does, but it does add complexity.</div><div><br></div><div>Typically, this is handled in Perl by providing both versions, perhaps as separate modules.</div><div><br></div><div>Thoughts or comments? </div><div><br></div><div>Thanks,</div><div>   rocky </div></div></div>

<br></span>_______________________________________________<br>

Dev mailing list    <a href="mailto:Dev@ensembl.org" target="_blank">Dev@ensembl.org</a><br>

Posting guidelines and subscribe/unsubscribe info: <a href="http://lists.ensembl.org/mailman/listinfo/dev" target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>

Ensembl Blog: <a href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a><br>

<br></blockquote></div></div></div>

<br>_______________________________________________<br>

Dev mailing list    <a href="mailto:Dev@ensembl.org" target="_blank">Dev@ensembl.org</a><br>

Posting guidelines and subscribe/unsubscribe info: <a href="http://lists.ensembl.org/mailman/listinfo/dev" target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>

Ensembl Blog: <a href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a><br>

<br></blockquote></div><br></div></div></div></div>