<div dir="ltr"><span style="font-size:13px">Running the Variant Effect Predictor on a Human Genome VCF file (130780 lines) with a local Fasta cache (--offline) takes about 50 minutes on a quad-core Ubuntu box. </span><div style="font-size:13px"><br></div><div style="font-size:13px">I could give more details, but I don't think they are that important.<div><br></div><div>In looking at how to speed this up, it looks like VEP goes through the VCF file, which is sorted by chromosome, and processes each</div><div>chromosome independently. A simple and obvious way to speed this up would be to do some sort of 24-way map/reduce.</div><div>There is of course the --fork option on the <a href="http://variant_effect_predictor.pl/" target="_blank">variant_effect_predictor.pl</a> program, which is roughly the same idea, but it parallelizes only across the cores of a single computer rather than making use of multiple machines. </div><div><br></div><div>To pinpoint the slowness better, I used Devel::NYTProf. For those of you who haven't used it recently, it now has flame graphs, which make it very easy to see what's going on.</div><div><br></div><div>The first thing that stood out was slowness in the code that removes carriage returns and line feeds. This is in Bio::DB::Fasta::subseq: <br></div><div><br></div><div><div>     $data =~ s/\n//g;</div><div>     $data =~ s/\r//g;</div></div><div><br></div><div>Precompiling the regexps, e.g.: </div><div><br></div><div>     my $nl = qr/\n/;</div><div>     my $cr = qr/\r/;</div><div><br></div><div>     sub subseq {</div><div>         ....</div><div>        $data =~ s/$nl//g;</div><div>        $data =~ s/$cr//g;</div><div>     }</div><div><br></div><div>speeds up the subseq method by about 15%. I can elaborate, or describe the other methods I tried and how they fared, if there's interest. 
But since this portion is really part of BioPerl and not Bio::EnsEMBL, I'll try to work up a git pull request on that repository.</div><div><br></div><div>So now I come to the meat of what I have to say. I should have put this at the top -- I hope some of you are still with me. </div><div><br></div><div>The NYTProf graphs seem to say that there is a *lot* of overhead in object lookup and type testing. I think some of this is already known, since there are already calls to "weaken" and to the "new_fast" object constructors. And there is this comment in Bio::EnsEMBL::Variation::BaseTranscriptVariation::_intron_effects:</div><div><br></div><div><br></div><div><div>    # this method is a major bottle neck in the effect calculation code so</div><div>    # we cache results and use local variables instead of method calls where</div><div>    # possible to speed things up - caveat bug-fixer!</div></div><div><br></div><div>In the few NYTProf-guided cases that I have looked at, I've been able to get reasonable speedups, at the expense of eliminating the type tests</div><div>and object overhead. <br></div><div><br></div><div>For example, in Bio::EnsEMBL::Variation::BaseTranscriptVariation, changing: </div><div><br></div><div><br></div><div><div> sub transcript {</div><div>     my ($self, $transcript) = @_;</div><div>     assert_ref($transcript, 'Bio::EnsEMBL::Transcript') if $transcript;</div><div>     return $self->SUPER::feature($transcript, 'Transcript');</div><div>}<br></div></div><div><br></div><div>to: </div><div><br></div><div><div>     sub transcript {</div><div>         my ($self, $transcript) = @_;</div><div>        return $self->{feature};<br></div><div>     }</div><div><br></div></div><div>gives a noticeable speedup. But you may ask: if we do that, don't we lose type safety and open the door to bugs? </div><div>Here is how I would address these valid concerns. 
</div><div><br></div><div>First, I think there could be two sets of the Perl modules, such as for</div><div>Bio::EnsEMBL::Variation::BaseTranscriptVariation: one set with all of the checks, and a faster set without them. A configuration parameter might specify which version to use. In development, or by default, one might use the set that checks types. </div><div><br></div><div>Second, and perhaps more important, there are the tests! If more need to be added, then let's add them. And one can always add a test to make sure that the two versions give the same results.<br></div><div><br></div><div>One last avenue of optimization that I'd like to explore is using, say, Inline::C, or more generally coding hot spots in C. In particular, consider</div><div>Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap, which looks like this: <br></div><div><br></div><div>         my ( $f1_start, $f1_end, $f2_start, $f2_end ) = @_;</div><div>         return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) );</div><div><br></div><div>I haven't tried it on this hot spot, but this is something that might benefit from being coded in C. Again, the trade-off for speed here is a dependency on a C compiler. In my view, anyone installing this locally or installing CPAN modules probably already has one, but it does add complexity.</div><div><br></div><div>Typically, this is handled in Perl by providing both a pure-Perl version and a C version, perhaps as separate modules.</div><div><br></div><div>Thoughts or comments? </div><div><br></div><div>Thanks,</div><div>   rocky </div></div></div>