<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
Hi Rocky,<br>
<br>
Thanks for this - it's good to see you are noticing significant
improvements when switching these functions. As Will said, we have
started discussing possible VEP optimisation strategies recently, so
are very keen to hear your ideas. That said, half of our team -
including Will who is the lead on VEP - are away at the moment, so
we won't be planning any big changes until next month.<br>
<br>
All the best,<br>
<br>
Sarah<br>
<br>
<div class="moz-cite-prefix">On 07/01/2015 21:29, Rocky Bernstein
wrote:<br>
</div>
<blockquote
cite="mid:CANCp2gYpqRaJAFrwhTyYDLsboj9H7ruL6Qfd_7U6Lar6RiEnzw@mail.gmail.com"
type="cite">
<div dir="ltr">Some corrections. Coding overlap in C is about a
100% improvement. (And the code looks about the same in C if you
remove '$'s and change "and" to "&&").
<div><br>
</div>
<div>If I have this right, (and I may not so please double check
with your tests), this reduced the time taken
in Bio::EnsEMBL::Variation::Utils::VariationEffect from
executing 8574029 statements in 19.6s to executing 5631868
statements in 12.5s. The specific time in _intron_effects went
from 7.87s to 5.11s, while the time in overlap went from 1.45s
to .638ms. I don't understand how the speedup in overlap
appears to cause a bigger speedup overall.<br>
</div>
<div><br>
</div>
<div>Details of the benchmark changes are at <a
moz-do-not-send="true"
href="https://gist.github.com/rocky/6083dedc752c197875ca">https://gist.github.com/rocky/6083dedc752c197875ca</a></div>
<div>while the overall run is <a moz-do-not-send="true"
href="http://dustyfeet.com:8001/VEP-prof-5000-Inline-Overlap-C/">http://dustyfeet.com:8001/VEP-prof-5000-Inline-Overlap-C/</a></div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Fri, Jan 2, 2015 at 11:39 PM, Rocky
Bernstein <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:rocky.bernstein@gmail.com" target="_blank">rocky.bernstein@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>A big-picture question before some small nuts and
bolts.</div>
<div><br>
</div>
<div>SnpEff <a moz-do-not-send="true"
href="http://snpeff.sourceforge.net/" target="_blank">http://snpeff.sourceforge.net/</a>
is about an order of magnitude faster than VEP. Yes, I
realize they work at different levels, but isn't the
level of difficulty and size data sizes that they work
on roughly equivalent? I've heard people express the
feeling that because the problems "feel" about the same
in size and complexity they VEP should be running at
about competitive speed. Or with in a factor or so.</div>
<div><br>
</div>
<div>I honestly don't know, and I'd like to understand
this better. So I'd appreciate thoughts and comments on
this. </div>
<div><br>
</div>
<div><br>
</div>
<div>Okay. now to nuts and bolts. Occasionally I'll look
at VEP performance data mentioned before. And this has
led me to look
at Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap
. See: <a moz-do-not-send="true"
href="http://dustyfeet.com:8001/VEP-prof-chrom1/Bio-EnsEMBL-Variation-BaseTranscriptVariation-pm-218-line.html#531"
target="_blank">http://dustyfeet.com:8001/VEP-prof-chrom1/Bio-EnsEMBL-Variation-BaseTranscriptVariation-pm-218-line.html#531</a></div>
<div><br>
</div>
<div>The overlap code is basically returning the "and" of
two integer comparisons. I tried coding overalp in C and
got a 6% speedup - not that great. But now consider </div>
<div>this code in
Bio::EnsEMBL::Variation::BaseTranscriptVariation::_intron_effects
that calls overlap: </div>
<div><br>
</div>
<div> if ( overlap($vf_start, $vf_end,
$intron_start-3, $intron_start-1) or</div>
<div> overlap($vf_start, $vf_end,
$intron_start+2, $intron_start+7) or</div>
<div> overlap($vf_start, $vf_end,
$intron_end-7, $intron_end-2 ) or</div>
<div> overlap($vf_start, $vf_end,
$intron_end+1, $intron_end+3 ) or</div>
<div> ($insertion && (</div>
<div> $vf_start == $intron_start ||</div>
<div> $vf_end == $intron_end ||</div>
<div> $vf_start == $intron_start+2 ||</div>
<div> $vf_end == $intron_end-2</div>
<div> ) )) {</div>
<div><br>
</div>
<div>If you inline the overlap code here, you'd get
basically </div>
<div>* 4 comparisons of $vf_start with $intron_start</div>
<div>* 3 comparisons of $vf_end with $intron_end </div>
<div>* 2 comparisons of $vf_start with $intron_end</div>
<div><br>
</div>
<div>And since $intron_end is not less than $intron_start
, it is possible that if $vf_start is greater than
$intron_end, it will also have to be greater than
$intron_start as well, eliminating possibly 4
comparisons.</div>
<div><br>
</div>
<div>So the logic could be rewritten here. Having good
tests of that replaced logic is in my opinion crucial,
as is keeping the old code above around in case one
wants to change or experiment with things.</div>
<div><br>
</div>
<div>But what I might consider doing is coding it all in C
and combine the 4 overlap calls into one. My guess is
that C will also benefit from keeping the values
referred to above in registers. </div>
<div><br>
</div>
<div>Again all of this is messy so I invite thoughts on
this before undertaking something this messy. </div>
<div><br>
</div>
<div>In closing though I'll mention that the Inline C code
has been merged into bioperl_live. <a
moz-do-not-send="true"
href="https://github.com/bioperl/bioperl-live/commit/01ec10dda23b6c5ed7592808cff4ae0d34278611"
target="_blank">https://github.com/bioperl/bioperl-live/commit/01ec10dda23b6c5ed7592808cff4ae0d34278611</a><br>
</div>
<div><br>
</div>
<div>The way that works is that there is a recommended
dependency on C::Inline. If C::Inline is around, a Perl
subroutine gets overwritten with a routine of the same
name implemented in C. I imagine that if this goes
forward, it would do likewise. </div>
<div><br>
</div>
<div>Okay, enough babble. Time to hear from you all...</div>
<div><br>
</div>
<div>But </div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote"><span class="">On Tue, Dec 23,
2014 at 4:28 AM, Will McLaren <span dir="ltr"><<a
moz-do-not-send="true" href="mailto:wm2@ebi.ac.uk"
target="_blank">wm2@ebi.ac.uk</a>></span>
wrote:<br>
</span>
<div>
<div class="h5">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">Thanks again Rocky, your work on
this is really appreciated, and great to see
such an improvement for such a minor change!
<div><br>
</div>
<div>If there's any other code you'd like to
share, or any changes to ours, please feel
free to send us more details or put in a pull
request on GitHub.</div>
<div><br>
</div>
<div>Thanks</div>
<div><br>
</div>
<div>Will</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">
<div>
<div>On 23 December 2014 at 03:26, Rocky
Bernstein <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:rocky.bernstein@gmail.com"
target="_blank">rocky.bernstein@gmail.com</a>></span>
wrote:<br>
</div>
</div>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<div>
<div>
<div dir="ltr">Just a follow-up to my
earlier post.
<div><br>
</div>
<div>I ran a Variant Effect Prediction
run on a VCF file of 5000 entries
(which is what fits in one buffer
read) with one small change. With
that, I was able to significantly
significantly reduce the time
bottleneck in the Fasta code. The
time spent here went from 7.76
seconds to 2.32 seconds.</div>
<div><br>
</div>
<div>Compare the top line of: <br>
<div class="gmail_extra"><a
moz-do-not-send="true"
href="http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html"
target="_blank">http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html</a> </div>
<div class="gmail_extra">with:</div>
<div class="gmail_extra"><a
moz-do-not-send="true"
href="http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html"
target="_blank">http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html</a></div>
<div class="gmail_extra"><br>
</div>
<div class="gmail_extra">You get a
50% reduction just by the fact
that one transformation is needed
to remove both \n and \r rather
than two transformations. But even
beyond this, the C code for one
run is still faster than the
corresponding Perl s///. </div>
<div class="gmail_extra"><br>
</div>
<div class="gmail_extra">The
specific change that I made can be
found at <a moz-do-not-send="true"
href="https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff"
target="_blank">https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff</a></div>
<div class="gmail_extra">You'll also
see benchmarks for other
variations of that code. </div>
<div class="gmail_extra"><br>
</div>
<div class="gmail_extra">But.... in
order to see the effect in a run
you need to have Perl module
Inline::C installed. Otherwise you
get a lesser improvement outlined
in my original posting. Again
this speeds things up by compiling
once Perl regular expressions used
to match \n and \r. </div>
<div class="gmail_extra"><br>
</div>
<div class="gmail_extra">In the
spirit of open scientific review,
I am curious to learn of others
experience the same kind of
improvement I saw. </div>
<div class="gmail_extra"><br>
</div>
<div class="gmail_extra">I have a
pull request for this change to
the bioperl-live repository. See <a
moz-do-not-send="true"
href="https://github.com/bioperl/bioperl-live/issues/95"
target="_blank">https://github.com/bioperl/bioperl-live/issues/95</a>
. However I note that the Bio::DB
code used by Variant Effect
Predictor is a different
(back-level) from the code in that
git repository. The diff file in
the gist cited above is for the
Fasta.pm code that is in Ensembl ;
of course, the pull request uses
the current Bio::DB code. </div>
<div class="gmail_extra"><br>
</div>
<div class="gmail_extra"><br>
</div>
<div class="gmail_extra">Lastly <a
moz-do-not-send="true"
href="http://dustyfeet.com:8001"
target="_blank">http://dustyfeet.com:8001</a>
has the profile results other
kinds of runs which I hope will
clarify my other remarks about
where things are slow. </div>
<div class="gmail_extra"><br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu,
Dec 18, 2014 at 12:48 AM, Rocky
Bernstein <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:rocky.bernstein@gmail.com"
target="_blank">rocky.bernstein@gmail.com</a>></span>
wrote:
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div dir="ltr">Running the
Variant Effect Predictor on
a Human Genome VCF file
(130780 lines) with a local
Fasta cache (--offline)
takes about 50 minutes on a
quad-core Ubuntu box.
<div><br>
</div>
<div>I could give more
details, but I don't think
they are that important.
<div><br>
</div>
<div>In looking at how to
speed this up, it looks
like VEP goes through
the VCF file, is sorted
by chromosome, and
processes each</div>
<div>Chromosome
independently. The first
obvious way to speed
this up would be to do
some sort of 24-way
map/reduce.</div>
<div>There is of course
the --fork option on the
<a
moz-do-not-send="true"
href="http://variant_effect_predictor.pl" target="_blank">variant_effect_predictor.pl</a>
program which is roughly
the same idea, but it
parallelizes only across
the cores of a single
computer rather than
make use of multiple
ones. </div>
<div><br>
</div>
<div>To pinpoint the
slowness better, I used
Devel::NYTProf. For
those of you who haven't
used it recently, it now
has flame graphs and it
makes it very easy to
see what's going on.</div>
<div><br>
</div>
<div>The first thing that
came out was a slowness
in code to remove
carriage returns and
line feeds. This is in
Bio::DB::Fasta
::subseq: <br>
</div>
<div><br>
</div>
<div>
<div> $data =~
s/\n//g;</div>
<div> $data =~
s/\r//g;</div>
</div>
<div><br>
</div>
<div>Compiling the regexp,
e.g: </div>
<div><br>
</div>
<div> my $nl = qr/\n/;</div>
<div> my $cr = qr/\r/;</div>
<div><br>
</div>
<div> sub subseq {</div>
<div> ....</div>
<div> $data =~
s/$nl//g;</div>
<div> $data =~
s/$cr//g;</div>
<div> }</div>
<div><br>
</div>
<div>Speeds up the subseq
method by about 15%. I
can elaborate more or
describe the other
methods I tried and how
they fared, if there's
interest. But since this
portion is really part
of BioPerl and not
Bio::EnsEMBL, I'll try
to work up a git pull
request ont that
repository.</div>
<div><br>
</div>
<div>So now I come to the
meat of what I have to
say. I should have put
this at the top -- I
hope some of you are
still with me. </div>
<div><br>
</div>
<div>The NYTProf graphs
seem to say that there
is a *lot* of overhead
in object lookup and
type testing. I think
some of this is already
known as there already
are calls to "weaken"
and "new_fast" object
creators. And there is
this comment in
Bio::EnsEMBL::Variation::BaseTranscriptVariation:_intron_effects:</div>
<div><br>
</div>
<div><br>
</div>
<div>
<div> # this method
is a major bottle neck
in the effect
calculation code so</div>
<div> # we cache
results and use local
variables instead of
method calls where</div>
<div> # possible to
speed things up -
caveat bug-fixer!</div>
</div>
<div><br>
</div>
<div>In the few cases
guided by NYTProf, I've
been able to make
reasonable speed ups at
the expense of
eliminating the tests</div>
<div>and object overhead. <br>
</div>
<div><br>
</div>
<div>For example, in
EnsEMBL::Variation::BaseTranscriptVariation
changing: </div>
<div><br>
</div>
<div><br>
</div>
<div>
<div> sub transcript {</div>
<div> my ($self,
$transcript) = @_;</div>
<div>
assert_ref($transcript,
'Bio::EnsEMBL::Transcript')
if $transcript;</div>
<div> return
$self->SUPER::feature($transcript,
'Transcript');</div>
<div>}<br>
</div>
</div>
<div><br>
</div>
<div>to: </div>
<div><br>
</div>
<div>
<div> sub transcript
{</div>
<div> my ($self,
$transcript) = @_;</div>
<div> return
$self->{feature};<br>
</div>
<div><br>
</div>
</div>
<div>Gives a noticeable
speed up. But you may
ask: if that happens,
then we lose type safety
and there is a potential
for bugs? </div>
<div>And here's my take on
how to address these
valid concerns. First, I
think there could be two
sets of the Perl
modules, such as for</div>
<div>EnsEMBL::Variation::BaseTranscriptVariation
- those with all of the
checks and those that
are fast. A
configuration parameter
might specify which
version to use. In
development or by
default, one might use
the ones that check
types. </div>
<div><br>
</div>
<div>Second and perhaps
more import, there are
the tests! If more need
to be added, then let's
add them. And one can
always add a test to
make sure the results of
the two versions gives
the same result.<br>
</div>
<div><br>
</div>
<div>One last avenue of
optimization that I'd
like to explore is using
say Inline::C or
basically coding in C
hot spots. In
particular, consider</div>
<div>Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap
which looks like this: <br>
</div>
<div><br>
</div>
<div> my (
$f1_start, $f1_end,
$f2_start, $f2_end ) =
@_;</div>
<div> return (
($f1_end >=
$f2_start) and
($f1_start <=
$f2_end) );</div>
<div><br>
</div>
<div>I haven't tried it on
this hot spot, but this
is something that might
benefit from getting
coded in C. Again the
trade off for speed here
is a dependency on
compiling C. In my view
anyone installing this
locally or installing
CPAN modules probably
already does, but it
does add complexity.</div>
<div><br>
</div>
<div>Typically, this is
handled in Perl by
providing both versions,
perhaps as separate
modules.</div>
<div><br>
</div>
<div>Thought or comments? </div>
<div><br>
</div>
<div>Thanks,</div>
<div> rocky </div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div>
<div>
<div>
<div><br>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
_______________________________________________<br>
Dev mailing list <a
moz-do-not-send="true"
href="mailto:Dev@ensembl.org"
target="_blank">Dev@ensembl.org</a><br>
Posting guidelines and subscribe/unsubscribe
info: <a moz-do-not-send="true"
href="http://lists.ensembl.org/mailman/listinfo/dev"
target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>
Ensembl Blog: <a moz-do-not-send="true"
href="http://www.ensembl.info/"
target="_blank">http://www.ensembl.info/</a><br>
<br>
</blockquote>
</div>
<br>
</div>
<br>
_______________________________________________<br>
Dev mailing list <a moz-do-not-send="true"
href="mailto:Dev@ensembl.org" target="_blank">Dev@ensembl.org</a><br>
Posting guidelines and subscribe/unsubscribe info:
<a moz-do-not-send="true"
href="http://lists.ensembl.org/mailman/listinfo/dev"
target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>
Ensembl Blog: <a moz-do-not-send="true"
href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a><br>
<br>
</blockquote>
</div>
</div>
</div>
<br>
</div>
</blockquote>
</div>
<br>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Dev mailing list <a class="moz-txt-link-abbreviated" href="mailto:Dev@ensembl.org">Dev@ensembl.org</a>
Posting guidelines and subscribe/unsubscribe info: <a class="moz-txt-link-freetext" href="http://lists.ensembl.org/mailman/listinfo/dev">http://lists.ensembl.org/mailman/listinfo/dev</a>
Ensembl Blog: <a class="moz-txt-link-freetext" href="http://www.ensembl.info/">http://www.ensembl.info/</a>
</pre>
</blockquote>
<br>
</body>
</html>