<html>
  <head>
    <meta content="text/html; charset=windows-1252"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <br>
    Hi Rocky,<br>
    <br>
    Thanks for this - it's good to see you are noticing significant
    improvements when switching these functions. As Will said, we have
    started discussing possible VEP optimisation strategies recently, so
    are very keen to hear your ideas. That said, half of our team -
    including Will who is the lead on VEP - are away at the moment, so
    we won't be planning any big changes until next month.<br>
    <br>
    All the best,<br>
    <br>
    Sarah<br>
    <br>
    <div class="moz-cite-prefix">On 07/01/2015 21:29, Rocky Bernstein
      wrote:<br>
    </div>
    <blockquote
cite="mid:CANCp2gYpqRaJAFrwhTyYDLsboj9H7ruL6Qfd_7U6Lar6RiEnzw@mail.gmail.com"
      type="cite">
      <div dir="ltr">Some corrections. Coding overlap in C is about a
        100% improvement. (And the code looks about the same in C if you
        remove '$'s and change "and" to "&&"). 
        <div><br>
        </div>
        <div>If I have this right, (and I may not so please double check
          with your tests), this reduced the time taken
          in Bio::EnsEMBL::Variation::Utils::VariationEffect from
           executing 8574029 statements in 19.6s to executing 5631868
          statements in 12.5s. The specific time in _intron_effects went
          from 7.87s to 5.11s, while the time in overlap went from 1.45s
          to .638ms. I don't understand how the speedup in overlap
          appears to cause a bigger speedup overall.<br>
        </div>
        <div><br>
        </div>
        <div>Details of the benchmark changes are at <a
            moz-do-not-send="true"
            href="https://gist.github.com/rocky/6083dedc752c197875ca">https://gist.github.com/rocky/6083dedc752c197875ca</a></div>
        <div>while the overall run is <a moz-do-not-send="true"
            href="http://dustyfeet.com:8001/VEP-prof-5000-Inline-Overlap-C/">http://dustyfeet.com:8001/VEP-prof-5000-Inline-Overlap-C/</a></div>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Fri, Jan 2, 2015 at 11:39 PM, Rocky
          Bernstein <span dir="ltr"><<a moz-do-not-send="true"
              href="mailto:rocky.bernstein@gmail.com" target="_blank">rocky.bernstein@gmail.com</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div dir="ltr">
              <div>A big-picture question before some small nuts and
                bolts.</div>
              <div><br>
              </div>
              <div>SnpEff  <a moz-do-not-send="true"
                  href="http://snpeff.sourceforge.net/" target="_blank">http://snpeff.sourceforge.net/</a>
                is about an order of magnitude faster than VEP. Yes, I
                realize they work at different levels, but isn't the
                level of difficulty and size data sizes that they work
                on roughly equivalent? I've heard people express the
                feeling that because the problems "feel" about the same
                in size and complexity they VEP should be running at
                about competitive speed. Or with in a factor or so.</div>
              <div><br>
              </div>
              <div>I honestly don't know, and I'd like to understand
                this better. So I'd appreciate thoughts and comments on
                this. </div>
              <div><br>
              </div>
              <div><br>
              </div>
              <div>Okay. now to nuts and bolts. Occasionally I'll look
                at VEP performance data mentioned before. And this has
                led me to look
                at Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap
                . See: <a moz-do-not-send="true"
href="http://dustyfeet.com:8001/VEP-prof-chrom1/Bio-EnsEMBL-Variation-BaseTranscriptVariation-pm-218-line.html#531"
                  target="_blank">http://dustyfeet.com:8001/VEP-prof-chrom1/Bio-EnsEMBL-Variation-BaseTranscriptVariation-pm-218-line.html#531</a></div>
              <div><br>
              </div>
              <div>The overlap code is basically returning the "and" of
                two integer comparisons. I tried coding overalp in C and
                got a 6% speedup - not that great. But now consider </div>
              <div>this code in
                Bio::EnsEMBL::Variation::BaseTranscriptVariation::_intron_effects
                that calls overlap: </div>
              <div><br>
              </div>
              <div>        if ( overlap($vf_start, $vf_end,
                $intron_start-3, $intron_start-1) or</div>
              <div>                 overlap($vf_start, $vf_end,
                $intron_start+2, $intron_start+7) or</div>
              <div>                 overlap($vf_start, $vf_end,
                $intron_end-7,   $intron_end-2  ) or</div>
              <div>                 overlap($vf_start, $vf_end,
                $intron_end+1,   $intron_end+3  ) or</div>
              <div>                 ($insertion && (</div>
              <div>                     $vf_start == $intron_start ||</div>
              <div>                     $vf_end == $intron_end ||</div>
              <div>                     $vf_start == $intron_start+2 ||</div>
              <div>                     $vf_end == $intron_end-2</div>
              <div>                    ) )) {</div>
              <div><br>
              </div>
              <div>If you inline the overlap code here, you'd get
                basically </div>
              <div>* 4 comparisons of $vf_start with $intron_start</div>
              <div>* 3 comparisons of $vf_end with $intron_end </div>
              <div>* 2 comparisons of $vf_start with $intron_end</div>
              <div><br>
              </div>
              <div>And since $intron_end is not less than $intron_start
                , it is possible that if $vf_start is greater than
                $intron_end, it will also have to be greater than
                $intron_start as well, eliminating possibly 4
                comparisons.</div>
              <div><br>
              </div>
              <div>So the logic could be rewritten here. Having good
                tests of that replaced logic is in my opinion crucial,
                as is keeping the old code above around in case one
                wants to change or experiment with things.</div>
              <div><br>
              </div>
              <div>But what I might consider doing is coding it all in C
                and combine the 4 overlap calls into one. My guess is
                that C will also benefit from keeping the values
                referred to above in registers. </div>
              <div><br>
              </div>
              <div>Again all of this is messy so I invite thoughts on
                this before undertaking something this messy. </div>
              <div><br>
              </div>
              <div>In closing though I'll mention that the Inline C code
                has been merged into bioperl_live. <a
                  moz-do-not-send="true"
href="https://github.com/bioperl/bioperl-live/commit/01ec10dda23b6c5ed7592808cff4ae0d34278611"
                  target="_blank">https://github.com/bioperl/bioperl-live/commit/01ec10dda23b6c5ed7592808cff4ae0d34278611</a><br>
              </div>
              <div><br>
              </div>
              <div>The way that works is that there is a recommended
                dependency on C::Inline. If C::Inline is around, a Perl
                subroutine gets overwritten with a routine of the same
                name implemented in C. I imagine that if this goes
                forward, it would do likewise. </div>
              <div><br>
              </div>
              <div>Okay, enough babble. Time to hear from you all...</div>
              <div><br>
              </div>
              <div>But </div>
              <div><br>
              </div>
            </div>
            <div class="gmail_extra"><br>
              <div class="gmail_quote"><span class="">On Tue, Dec 23,
                  2014 at 4:28 AM, Will McLaren <span dir="ltr"><<a
                      moz-do-not-send="true" href="mailto:wm2@ebi.ac.uk"
                      target="_blank">wm2@ebi.ac.uk</a>></span>
                  wrote:<br>
                </span>
                <div>
                  <div class="h5">
                    <blockquote class="gmail_quote" style="margin:0 0 0
                      .8ex;border-left:1px #ccc solid;padding-left:1ex">
                      <div dir="ltr">Thanks again Rocky, your work on
                        this is really appreciated, and great to see
                        such an improvement for such a minor change!
                        <div><br>
                        </div>
                        <div>If there's any other code you'd like to
                          share, or any changes to ours, please feel
                          free to send us more details or put in a pull
                          request on GitHub.</div>
                        <div><br>
                        </div>
                        <div>Thanks</div>
                        <div><br>
                        </div>
                        <div>Will</div>
                      </div>
                      <div class="gmail_extra"><br>
                        <div class="gmail_quote">
                          <div>
                            <div>On 23 December 2014 at 03:26, Rocky
                              Bernstein <span dir="ltr"><<a
                                  moz-do-not-send="true"
                                  href="mailto:rocky.bernstein@gmail.com"
                                  target="_blank">rocky.bernstein@gmail.com</a>></span>
                              wrote:<br>
                            </div>
                          </div>
                          <blockquote class="gmail_quote"
                            style="margin:0 0 0 .8ex;border-left:1px
                            #ccc solid;padding-left:1ex">
                            <div>
                              <div>
                                <div dir="ltr">Just a follow-up to my
                                  earlier post. 
                                  <div><br>
                                  </div>
                                  <div>I ran a Variant Effect Prediction
                                     run on a VCF file of 5000 entries
                                    (which is what fits in one buffer
                                    read)  with one small change. With
                                    that, I was able to significantly
                                    significantly reduce the time
                                    bottleneck in the Fasta code. The
                                    time spent here went from 7.76
                                    seconds to 2.32 seconds.</div>
                                  <div><br>
                                  </div>
                                  <div>Compare the top line of: <br>
                                    <div class="gmail_extra"><a
                                        moz-do-not-send="true"
href="http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html"
                                        target="_blank">http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html</a> </div>
                                    <div class="gmail_extra">with:</div>
                                    <div class="gmail_extra"><a
                                        moz-do-not-send="true"
href="http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html"
                                        target="_blank">http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html</a></div>
                                    <div class="gmail_extra"><br>
                                    </div>
                                    <div class="gmail_extra">You get a
                                      50% reduction just by the fact
                                      that one transformation is needed
                                      to remove both \n and \r rather
                                      than two transformations. But even
                                      beyond this, the C code for one
                                      run is still faster than the
                                      corresponding Perl s///. </div>
                                    <div class="gmail_extra"><br>
                                    </div>
                                    <div class="gmail_extra">The
                                      specific change that I made can be
                                      found at <a moz-do-not-send="true"
href="https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff"
                                        target="_blank">https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff</a></div>
                                    <div class="gmail_extra">You'll also
                                      see benchmarks for other
                                      variations of that code. </div>
                                    <div class="gmail_extra"><br>
                                    </div>
                                    <div class="gmail_extra">But.... in
                                      order to see the effect in a run
                                      you need to have Perl module
                                      Inline::C installed. Otherwise you
                                      get a lesser improvement outlined
                                      in my original posting.  Again
                                      this speeds things up by compiling
                                      once Perl regular expressions used
                                      to match \n and \r. </div>
                                    <div class="gmail_extra"><br>
                                    </div>
                                    <div class="gmail_extra">In the
                                      spirit of open scientific review,
                                      I am curious to learn of others
                                      experience the same kind of
                                      improvement I saw. </div>
                                    <div class="gmail_extra"><br>
                                    </div>
                                    <div class="gmail_extra">I have a
                                      pull request for this change to
                                      the bioperl-live repository. See <a
                                        moz-do-not-send="true"
                                        href="https://github.com/bioperl/bioperl-live/issues/95"
                                        target="_blank">https://github.com/bioperl/bioperl-live/issues/95</a>
                                      . However I note that the Bio::DB
                                      code used by  Variant Effect
                                      Predictor is a different
                                      (back-level) from the code in that
                                      git repository. The diff file in
                                      the gist cited above is for the
                                      Fasta.pm code that is in Ensembl ;
                                      of course, the pull request uses
                                      the current Bio::DB code. </div>
                                    <div class="gmail_extra"><br>
                                    </div>
                                    <div class="gmail_extra"><br>
                                    </div>
                                    <div class="gmail_extra">Lastly <a
                                        moz-do-not-send="true"
                                        href="http://dustyfeet.com:8001"
                                        target="_blank">http://dustyfeet.com:8001</a>
                                      has the profile results other
                                      kinds of runs which I hope will
                                      clarify my other remarks about
                                      where things are slow. </div>
                                    <div class="gmail_extra"><br>
                                    </div>
                                    <div class="gmail_extra"><br>
                                      <div class="gmail_quote">On Thu,
                                        Dec 18, 2014 at 12:48 AM, Rocky
                                        Bernstein <span dir="ltr"><<a
                                            moz-do-not-send="true"
                                            href="mailto:rocky.bernstein@gmail.com"
                                            target="_blank">rocky.bernstein@gmail.com</a>></span>
                                        wrote:
                                        <blockquote class="gmail_quote"
                                          style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
                                          <div dir="ltr">Running the
                                            Variant Effect Predictor on
                                            a Human Genome VCF file
                                            (130780 lines)  with a local
                                            Fasta cache (--offline)
                                            takes about 50 minutes on a
                                            quad-core Ubuntu box. 
                                            <div><br>
                                            </div>
                                            <div>I could give more
                                              details, but I don't think
                                              they are that important.
                                              <div><br>
                                              </div>
                                              <div>In looking at how to
                                                speed this up, it looks
                                                like VEP goes through
                                                the VCF file,  is sorted
                                                by chromosome, and
                                                processes each</div>
                                              <div>Chromosome
                                                independently. The first
                                                obvious way to speed
                                                this up would be to do
                                                some sort of 24-way
                                                map/reduce.</div>
                                              <div>There is of course
                                                the --fork option on the
                                                <a
                                                  moz-do-not-send="true"
href="http://variant_effect_predictor.pl" target="_blank">variant_effect_predictor.pl</a>
                                                program which is roughly
                                                the same idea, but it
                                                parallelizes only across
                                                the cores of a single
                                                computer rather than
                                                make use of multiple
                                                ones. </div>
                                              <div><br>
                                              </div>
                                              <div>To pinpoint the
                                                slowness better, I used
                                                Devel::NYTProf. For
                                                those of you who haven't
                                                used it recently, it now
                                                has flame graphs and it
                                                makes it very easy to
                                                see what's going on.</div>
                                              <div><br>
                                              </div>
                                              <div>The first thing that
                                                came out was a slowness
                                                in code to remove
                                                carriage returns and
                                                line feeds. This is in
                                                Bio::DB::Fasta
                                                ::subseq: <br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>
                                                <div>     $data =~
                                                  s/\n//g;</div>
                                                <div>     $data =~
                                                  s/\r//g;</div>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>Compiling the regexp,
                                                e.g: </div>
                                              <div><br>
                                              </div>
                                              <div>     my $nl = qr/\n/;</div>
                                              <div>     my $cr = qr/\r/;</div>
                                              <div><br>
                                              </div>
                                              <div>     sub subseq {</div>
                                              <div>         ....</div>
                                              <div>        $data =~
                                                s/$nl//g;</div>
                                              <div>        $data =~
                                                s/$cr//g;</div>
                                              <div>     }</div>
                                              <div><br>
                                              </div>
                                              <div>Speeds up the subseq
                                                method by about 15%. I
                                                can elaborate more or
                                                describe the other
                                                methods I tried and how
                                                they fared, if there's
                                                interest. But since this
                                                portion is really part
                                                of BioPerl and not
                                                Bio::EnsEMBL, I'll try
                                                to work up a git pull
                                                request ont that
                                                repository.</div>
                                              <div><br>
                                              </div>
                                              <div>So now I come to the
                                                meat of what I have to
                                                say. I should have put
                                                this at the top -- I
                                                hope some of you are
                                                still with me. </div>
                                              <div><br>
                                              </div>
                                              <div>The NYTProf graphs
                                                seem to say that there
                                                is a *lot* of overhead
                                                in object lookup and
                                                type testing. I think
                                                some of this is already
                                                known as there already
                                                are calls to "weaken"
                                                and "new_fast" object
                                                creators. And there is
                                                this comment in
                                                 Bio::EnsEMBL::Variation::BaseTranscriptVariation:_intron_effects:</div>
                                              <div><br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>
                                                <div>    # this method
                                                  is a major bottle neck
                                                  in the effect
                                                  calculation code so</div>
                                                <div>    # we cache
                                                  results and use local
                                                  variables instead of
                                                  method calls where</div>
                                                <div>    # possible to
                                                  speed things up -
                                                  caveat bug-fixer!</div>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>In the few cases
                                                guided by NYTProf, I've
                                                been able to make
                                                reasonable speed ups at
                                                the expense of
                                                eliminating the tests</div>
                                              <div>and object overhead. <br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>For example, in
                                                EnsEMBL::Variation::BaseTranscriptVariation
                                                changing: </div>
                                              <div><br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>
                                                <div> sub transcript {</div>
                                                <div>     my ($self,
                                                  $transcript) = @_;</div>
                                                <div>   
                                                   assert_ref($transcript,
                                                  'Bio::EnsEMBL::Transcript')
                                                  if $transcript;</div>
                                                <div>     return
                                                  $self->SUPER::feature($transcript,
                                                  'Transcript');</div>
                                                <div>}<br>
                                                </div>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>to: </div>
                                              <div><br>
                                              </div>
                                              <div>
                                                <div>     sub transcript
                                                  {</div>
                                                <div>         my ($self,
                                                  $transcript) = @_;</div>
                                                <div>        return
                                                  $self->{feature};<br>
                                                </div>
                                                <div><br>
                                                </div>
                                              </div>
                                              <div>Gives a noticeable
                                                speed up. But you may
                                                ask: if that happens,
                                                then we lose type safety
                                                and there is a potential
                                                for bugs? </div>
                                              <div>And here's my take on
                                                how to address these
                                                valid concerns. First, I
                                                think there could be two
                                                sets of the Perl
                                                modules, such as for</div>
                                              <div>EnsEMBL::Variation::BaseTranscriptVariation
                                                - those with all of the
                                                checks and those that
                                                are fast.  A
                                                configuration parameter
                                                might specify which
                                                version to use. In
                                                development or by
                                                default, one might use
                                                the ones that check
                                                types. </div>
                                              <div><br>
                                              </div>
                                              <div>Second and perhaps
                                                more import, there are
                                                the tests! If more need
                                                to be added, then let's
                                                add them. And one can
                                                always add a test to
                                                make sure the results of
                                                the two versions gives
                                                the same result.<br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>One last avenue of
                                                optimization that I'd
                                                like to explore is using
                                                say Inline::C or
                                                basically coding in C
                                                hot spots. In
                                                particular, consider</div>
                                              <div>Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap
                                                which looks like this: <br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>         my (
                                                $f1_start, $f1_end,
                                                $f2_start, $f2_end ) =
                                                @_;</div>
                                              <div>         return (
                                                ($f1_end >=
                                                $f2_start) and
                                                ($f1_start <=
                                                $f2_end) );</div>
                                              <div><br>
                                              </div>
                                              <div>I haven't tried it on
                                                this hot spot, but this
                                                is something that might
                                                benefit from getting
                                                coded in C. Again the
                                                trade off for speed here
                                                is a dependency on
                                                compiling C. In my view
                                                anyone installing this
                                                locally or installing
                                                CPAN modules probably
                                                already does, but it
                                                does add complexity.</div>
                                              <div><br>
                                              </div>
                                              <div>Typically, this is
                                                handled in Perl by
                                                providing both versions,
                                                perhaps as separate
                                                modules.</div>
                                              <div><br>
                                              </div>
                                              <div>Thought or comments? </div>
                                              <div><br>
                                              </div>
                                              <div>Thanks,</div>
                                              <div>   rocky </div>
                                              <div><br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>
                                                <div>
                                                  <div>
                                                    <div><br>
                                                    </div>
                                                  </div>
                                                </div>
                                              </div>
                                            </div>
                                          </div>
                                        </blockquote>
                                      </div>
                                    </div>
                                  </div>
                                </div>
                                <br>
                              </div>
                            </div>
_______________________________________________<br>
                            Dev mailing list    <a
                              moz-do-not-send="true"
                              href="mailto:Dev@ensembl.org"
                              target="_blank">Dev@ensembl.org</a><br>
                            Posting guidelines and subscribe/unsubscribe
                            info: <a moz-do-not-send="true"
                              href="http://lists.ensembl.org/mailman/listinfo/dev"
                              target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>
                            Ensembl Blog: <a moz-do-not-send="true"
                              href="http://www.ensembl.info/"
                              target="_blank">http://www.ensembl.info/</a><br>
                            <br>
                          </blockquote>
                        </div>
                        <br>
                      </div>
                      <br>
                      _______________________________________________<br>
                      Dev mailing list    <a moz-do-not-send="true"
                        href="mailto:Dev@ensembl.org" target="_blank">Dev@ensembl.org</a><br>
                      Posting guidelines and subscribe/unsubscribe info:
                      <a moz-do-not-send="true"
                        href="http://lists.ensembl.org/mailman/listinfo/dev"
                        target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>
                      Ensembl Blog: <a moz-do-not-send="true"
                        href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a><br>
                      <br>
                    </blockquote>
                  </div>
                </div>
              </div>
              <br>
            </div>
          </blockquote>
        </div>
        <br>
      </div>
      <br>
      <fieldset class="mimeAttachmentHeader"></fieldset>
      <br>
      <pre wrap="">_______________________________________________
Dev mailing list    <a class="moz-txt-link-abbreviated" href="mailto:Dev@ensembl.org">Dev@ensembl.org</a>
Posting guidelines and subscribe/unsubscribe info: <a class="moz-txt-link-freetext" href="http://lists.ensembl.org/mailman/listinfo/dev">http://lists.ensembl.org/mailman/listinfo/dev</a>
Ensembl Blog: <a class="moz-txt-link-freetext" href="http://www.ensembl.info/">http://www.ensembl.info/</a>
</pre>
    </blockquote>
    <br>
  </body>
</html>