<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <br>

    Hi Rocky,<br>

    <br>

    Thanks for this - it's good to see you are noticing significant

    improvements when switching these functions. As Will said, we have

    started discussing possible VEP optimisation strategies recently, so

    are very keen to hear your ideas. That said, half of our team -

    including Will who is the lead on VEP - are away at the moment, so

    we won't be planning any big changes until next month.<br>

    <br>

    All the best,<br>

    <br>

    Sarah<br>

    <br>

    <div class="moz-cite-prefix">On 07/01/2015 21:29, Rocky Bernstein

      wrote:<br>

    </div>

    <blockquote

cite="mid:CANCp2gYpqRaJAFrwhTyYDLsboj9H7ruL6Qfd_7U6Lar6RiEnzw@mail.gmail.com"

      type="cite">

      <div dir="ltr">Some corrections. Coding overlap in C is about a

        100% improvement. (And the code looks about the same in C if you

        remove '$'s and change "and" to "&&"). 

        <div><br>

        </div>

        <div>If I have this right, (and I may not so please double check

          with your tests), this reduced the time taken

          in Bio::EnsEMBL::Variation::Utils::VariationEffect from

           executing 8574029 statements in 19.6s to executing 5631868

          statements in 12.5s. The specific time in _intron_effects went

          from 7.87s to 5.11s, while the time in overlap went from 1.45s

          to .638ms. I don't understand how the speedup in overlap

          appears to cause a bigger speedup overall.<br>

        </div>

        <div><br>

        </div>

        <div>Details of the benchmark changes are at <a

            moz-do-not-send="true"

            href="https://gist.github.com/rocky/6083dedc752c197875ca">https://gist.github.com/rocky/6083dedc752c197875ca</a></div>

        <div>while the overall run is <a moz-do-not-send="true"

            href="http://dustyfeet.com:8001/VEP-prof-5000-Inline-Overlap-C/">http://dustyfeet.com:8001/VEP-prof-5000-Inline-Overlap-C/</a></div>

      </div>

      <div class="gmail_extra"><br>

        <div class="gmail_quote">On Fri, Jan 2, 2015 at 11:39 PM, Rocky

          Bernstein <span dir="ltr"><<a moz-do-not-send="true"

              href="mailto:rocky.bernstein@gmail.com" target="_blank">rocky.bernstein@gmail.com</a>></span>

          wrote:<br>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div dir="ltr">

              <div>A big-picture question before some small nuts and

                bolts.</div>

              <div><br>

              </div>

              <div>SnpEff  <a moz-do-not-send="true"

                  href="http://snpeff.sourceforge.net/" target="_blank">http://snpeff.sourceforge.net/</a>

                is about an order of magnitude faster than VEP. Yes, I

                realize they work at different levels, but isn't the

                level of difficulty and size data sizes that they work

                on roughly equivalent? I've heard people express the

                feeling that because the problems "feel" about the same

                in size and complexity they VEP should be running at

                about competitive speed. Or with in a factor or so.</div>

              <div><br>

              </div>

              <div>I honestly don't know, and I'd like to understand

                this better. So I'd appreciate thoughts and comments on

                this. </div>

              <div><br>

              </div>

              <div><br>

              </div>

              <div>Okay. now to nuts and bolts. Occasionally I'll look

                at VEP performance data mentioned before. And this has

                led me to look

                at Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap

                . See: <a moz-do-not-send="true"

href="http://dustyfeet.com:8001/VEP-prof-chrom1/Bio-EnsEMBL-Variation-BaseTranscriptVariation-pm-218-line.html#531"

                  target="_blank">http://dustyfeet.com:8001/VEP-prof-chrom1/Bio-EnsEMBL-Variation-BaseTranscriptVariation-pm-218-line.html#531</a></div>

              <div><br>

              </div>

              <div>The overlap code is basically returning the "and" of

                two integer comparisons. I tried coding overalp in C and

                got a 6% speedup - not that great. But now consider </div>

              <div>this code in

                Bio::EnsEMBL::Variation::BaseTranscriptVariation::_intron_effects

                that calls overlap: </div>

              <div><br>

              </div>

              <div>        if ( overlap($vf_start, $vf_end,

                $intron_start-3, $intron_start-1) or</div>

              <div>                 overlap($vf_start, $vf_end,

                $intron_start+2, $intron_start+7) or</div>

              <div>                 overlap($vf_start, $vf_end,

                $intron_end-7,   $intron_end-2  ) or</div>

              <div>                 overlap($vf_start, $vf_end,

                $intron_end+1,   $intron_end+3  ) or</div>

              <div>                 ($insertion && (</div>

              <div>                     $vf_start == $intron_start ||</div>

              <div>                     $vf_end == $intron_end ||</div>

              <div>                     $vf_start == $intron_start+2 ||</div>

              <div>                     $vf_end == $intron_end-2</div>

              <div>                    ) )) {</div>

              <div><br>

              </div>

              <div>If you inline the overlap code here, you'd get

                basically </div>

              <div>* 4 comparisons of $vf_start with $intron_start</div>

              <div>* 3 comparisons of $vf_end with $intron_end </div>

              <div>* 2 comparisons of $vf_start with $intron_end</div>

              <div><br>

              </div>

              <div>And since $intron_end is not less than $intron_start

                , it is possible that if $vf_start is greater than

                $intron_end, it will also have to be greater than

                $intron_start as well, eliminating possibly 4

                comparisons.</div>

              <div><br>

              </div>

              <div>So the logic could be rewritten here. Having good

                tests of that replaced logic is in my opinion crucial,

                as is keeping the old code above around in case one

                wants to change or experiment with things.</div>

              <div><br>

              </div>

              <div>But what I might consider doing is coding it all in C

                and combine the 4 overlap calls into one. My guess is

                that C will also benefit from keeping the values

                referred to above in registers. </div>

              <div><br>

              </div>

              <div>Again all of this is messy so I invite thoughts on

                this before undertaking something this messy. </div>

              <div><br>

              </div>

              <div>In closing though I'll mention that the Inline C code

                has been merged into bioperl_live. <a

                  moz-do-not-send="true"

href="https://github.com/bioperl/bioperl-live/commit/01ec10dda23b6c5ed7592808cff4ae0d34278611"

                  target="_blank">https://github.com/bioperl/bioperl-live/commit/01ec10dda23b6c5ed7592808cff4ae0d34278611</a><br>

              </div>

              <div><br>

              </div>

              <div>The way that works is that there is a recommended

                dependency on C::Inline. If C::Inline is around, a Perl

                subroutine gets overwritten with a routine of the same

                name implemented in C. I imagine that if this goes

                forward, it would do likewise. </div>

              <div><br>

              </div>

              <div>Okay, enough babble. Time to hear from you all...</div>

              <div><br>

              </div>

              <div>But </div>

              <div><br>

              </div>

            </div>

            <div class="gmail_extra"><br>

              <div class="gmail_quote"><span class="">On Tue, Dec 23,

                  2014 at 4:28 AM, Will McLaren <span dir="ltr"><<a

                      moz-do-not-send="true" href="mailto:wm2@ebi.ac.uk"

                      target="_blank">wm2@ebi.ac.uk</a>></span>

                  wrote:<br>

                </span>

                <div>

                  <div class="h5">

                    <blockquote class="gmail_quote" style="margin:0 0 0

                      .8ex;border-left:1px #ccc solid;padding-left:1ex">

                      <div dir="ltr">Thanks again Rocky, your work on

                        this is really appreciated, and great to see

                        such an improvement for such a minor change!

                        <div><br>

                        </div>

                        <div>If there's any other code you'd like to

                          share, or any changes to ours, please feel

                          free to send us more details or put in a pull

                          request on GitHub.</div>

                        <div><br>

                        </div>

                        <div>Thanks</div>

                        <div><br>

                        </div>

                        <div>Will</div>

                      </div>

                      <div class="gmail_extra"><br>

                        <div class="gmail_quote">

                          <div>

                            <div>On 23 December 2014 at 03:26, Rocky

                              Bernstein <span dir="ltr"><<a

                                  moz-do-not-send="true"

                                  href="mailto:rocky.bernstein@gmail.com"

                                  target="_blank">rocky.bernstein@gmail.com</a>></span>

                              wrote:<br>

                            </div>

                          </div>

                          <blockquote class="gmail_quote"

                            style="margin:0 0 0 .8ex;border-left:1px

                            #ccc solid;padding-left:1ex">

                            <div>

                              <div>

                                <div dir="ltr">Just a follow-up to my

                                  earlier post. 

                                  <div><br>

                                  </div>

                                  <div>I ran a Variant Effect Prediction

                                     run on a VCF file of 5000 entries

                                    (which is what fits in one buffer

                                    read)  with one small change. With

                                    that, I was able to significantly

                                    significantly reduce the time

                                    bottleneck in the Fasta code. The

                                    time spent here went from 7.76

                                    seconds to 2.32 seconds.</div>

                                  <div><br>

                                  </div>

                                  <div>Compare the top line of: <br>

                                    <div class="gmail_extra"><a

                                        moz-do-not-send="true"

href="http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html"

                                        target="_blank">http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html</a> </div>

                                    <div class="gmail_extra">with:</div>

                                    <div class="gmail_extra"><a

                                        moz-do-not-send="true"

href="http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html"

                                        target="_blank">http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html</a></div>

                                    <div class="gmail_extra"><br>

                                    </div>

                                    <div class="gmail_extra">You get a

                                      50% reduction just by the fact

                                      that one transformation is needed

                                      to remove both \n and \r rather

                                      than two transformations. But even

                                      beyond this, the C code for one

                                      run is still faster than the

                                      corresponding Perl s///. </div>

                                    <div class="gmail_extra"><br>

                                    </div>

                                    <div class="gmail_extra">The

                                      specific change that I made can be

                                      found at <a moz-do-not-send="true"

href="https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff"

                                        target="_blank">https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff</a></div>

                                    <div class="gmail_extra">You'll also

                                      see benchmarks for other

                                      variations of that code. </div>

                                    <div class="gmail_extra"><br>

                                    </div>

                                    <div class="gmail_extra">But.... in

                                      order to see the effect in a run

                                      you need to have Perl module

                                      Inline::C installed. Otherwise you

                                      get a lesser improvement outlined

                                      in my original posting.  Again

                                      this speeds things up by compiling

                                      once Perl regular expressions used

                                      to match \n and \r. </div>

                                    <div class="gmail_extra"><br>

                                    </div>

                                    <div class="gmail_extra">In the

                                      spirit of open scientific review,

                                      I am curious to learn of others

                                      experience the same kind of

                                      improvement I saw. </div>

                                    <div class="gmail_extra"><br>

                                    </div>

                                    <div class="gmail_extra">I have a

                                      pull request for this change to

                                      the bioperl-live repository. See <a

                                        moz-do-not-send="true"

                                        href="https://github.com/bioperl/bioperl-live/issues/95"

                                        target="_blank">https://github.com/bioperl/bioperl-live/issues/95</a>

                                      . However I note that the Bio::DB

                                      code used by  Variant Effect

                                      Predictor is a different

                                      (back-level) from the code in that

                                      git repository. The diff file in

                                      the gist cited above is for the

                                      Fasta.pm code that is in Ensembl ;

                                      of course, the pull request uses

                                      the current Bio::DB code. </div>

                                    <div class="gmail_extra"><br>

                                    </div>

                                    <div class="gmail_extra"><br>

                                    </div>

                                    <div class="gmail_extra">Lastly <a

                                        moz-do-not-send="true"

                                        href="http://dustyfeet.com:8001"

                                        target="_blank">http://dustyfeet.com:8001</a>

                                      has the profile results other

                                      kinds of runs which I hope will

                                      clarify my other remarks about

                                      where things are slow. </div>

                                    <div class="gmail_extra"><br>

                                    </div>

                                    <div class="gmail_extra"><br>

                                      <div class="gmail_quote">On Thu,

                                        Dec 18, 2014 at 12:48 AM, Rocky

                                        Bernstein <span dir="ltr"><<a

                                            moz-do-not-send="true"

                                            href="mailto:rocky.bernstein@gmail.com"

                                            target="_blank">rocky.bernstein@gmail.com</a>></span>

                                        wrote:

                                        <blockquote class="gmail_quote"

                                          style="margin:0px 0px 0px

0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

                                          <div dir="ltr">Running the

                                            Variant Effect Predictor on

                                            a Human Genome VCF file

                                            (130780 lines)  with a local

                                            Fasta cache (--offline)

                                            takes about 50 minutes on a

                                            quad-core Ubuntu box. 

                                            <div><br>

                                            </div>

                                            <div>I could give more

                                              details, but I don't think

                                              they are that important.

                                              <div><br>

                                              </div>

                                              <div>In looking at how to

                                                speed this up, it looks

                                                like VEP goes through

                                                the VCF file,  is sorted

                                                by chromosome, and

                                                processes each</div>

                                              <div>Chromosome

                                                independently. The first

                                                obvious way to speed

                                                this up would be to do

                                                some sort of 24-way

                                                map/reduce.</div>

                                              <div>There is of course

                                                the --fork option on the

                                                <a

                                                  moz-do-not-send="true"

href="http://variant_effect_predictor.pl" target="_blank">variant_effect_predictor.pl</a>

                                                program which is roughly

                                                the same idea, but it

                                                parallelizes only across

                                                the cores of a single

                                                computer rather than

                                                make use of multiple

                                                ones. </div>

                                              <div><br>

                                              </div>

                                              <div>To pinpoint the

                                                slowness better, I used

                                                Devel::NYTProf. For

                                                those of you who haven't

                                                used it recently, it now

                                                has flame graphs and it

                                                makes it very easy to

                                                see what's going on.</div>

                                              <div><br>

                                              </div>

                                              <div>The first thing that

                                                came out was a slowness

                                                in code to remove

                                                carriage returns and

                                                line feeds. This is in

                                                Bio::DB::Fasta

                                                ::subseq: <br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div>

                                                <div>     $data =~

                                                  s/\n//g;</div>

                                                <div>     $data =~

                                                  s/\r//g;</div>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div>Compiling the regexp,

                                                e.g: </div>

                                              <div><br>

                                              </div>

                                              <div>     my $nl = qr/\n/;</div>

                                              <div>     my $cr = qr/\r/;</div>

                                              <div><br>

                                              </div>

                                              <div>     sub subseq {</div>

                                              <div>         ....</div>

                                              <div>        $data =~

                                                s/$nl//g;</div>

                                              <div>        $data =~

                                                s/$cr//g;</div>

                                              <div>     }</div>

                                              <div><br>

                                              </div>

                                              <div>Speeds up the subseq

                                                method by about 15%. I

                                                can elaborate more or

                                                describe the other

                                                methods I tried and how

                                                they fared, if there's

                                                interest. But since this

                                                portion is really part

                                                of BioPerl and not

                                                Bio::EnsEMBL, I'll try

                                                to work up a git pull

                                                request ont that

                                                repository.</div>

                                              <div><br>

                                              </div>

                                              <div>So now I come to the

                                                meat of what I have to

                                                say. I should have put

                                                this at the top -- I

                                                hope some of you are

                                                still with me. </div>

                                              <div><br>

                                              </div>

                                              <div>The NYTProf graphs

                                                seem to say that there

                                                is a *lot* of overhead

                                                in object lookup and

                                                type testing. I think

                                                some of this is already

                                                known as there already

                                                are calls to "weaken"

                                                and "new_fast" object

                                                creators. And there is

                                                this comment in

                                                 Bio::EnsEMBL::Variation::BaseTranscriptVariation:_intron_effects:</div>

                                              <div><br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div>

                                                <div>    # this method

                                                  is a major bottle neck

                                                  in the effect

                                                  calculation code so</div>

                                                <div>    # we cache

                                                  results and use local

                                                  variables instead of

                                                  method calls where</div>

                                                <div>    # possible to

                                                  speed things up -

                                                  caveat bug-fixer!</div>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div>In the few cases

                                                guided by NYTProf, I've

                                                been able to make

                                                reasonable speed ups at

                                                the expense of

                                                eliminating the tests</div>

                                              <div>and object overhead. <br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div>For example, in

                                                EnsEMBL::Variation::BaseTranscriptVariation

                                                changing: </div>

                                              <div><br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div>

                                                <div> sub transcript {</div>

                                                <div>     my ($self,

                                                  $transcript) = @_;</div>

                                                <div>   

                                                   assert_ref($transcript,

                                                  'Bio::EnsEMBL::Transcript')

                                                  if $transcript;</div>

                                                <div>     return

                                                  $self->SUPER::feature($transcript,

                                                  'Transcript');</div>

                                                <div>}<br>

                                                </div>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div>to: </div>

                                              <div><br>

                                              </div>

                                              <div>

                                                <div>     sub transcript

                                                  {</div>

                                                <div>         my ($self,

                                                  $transcript) = @_;</div>

                                                <div>        return

                                                  $self->{feature};<br>

                                                </div>

                                                <div><br>

                                                </div>

                                              </div>

                                              <div>Gives a noticeable

                                                speed up. But you may

                                                ask: if that happens,

                                                then we lose type safety

                                                and there is a potential

                                                for bugs? </div>

                                              <div>And here's my take on

                                                how to address these

                                                valid concerns. First, I

                                                think there could be two

                                                sets of the Perl

                                                modules, such as for</div>

                                              <div>EnsEMBL::Variation::BaseTranscriptVariation

                                                - those with all of the

                                                checks and those that

                                                are fast.  A

                                                configuration parameter

                                                might specify which

                                                version to use. In

                                                development or by

                                                default, one might use

                                                the ones that check

                                                types. </div>

                                              <div><br>

                                              </div>

                                              <div>Second and perhaps

                                                more import, there are

                                                the tests! If more need

                                                to be added, then let's

                                                add them. And one can

                                                always add a test to

                                                make sure the results of

                                                the two versions gives

                                                the same result.<br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div>One last avenue of

                                                optimization that I'd

                                                like to explore is using

                                                say Inline::C or

                                                basically coding in C

                                                hot spots. In

                                                particular, consider</div>

                                              <div>Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap

                                                which looks like this: <br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div>         my (

                                                $f1_start, $f1_end,

                                                $f2_start, $f2_end ) =

                                                @_;</div>

                                              <div>         return (

                                                ($f1_end >=

                                                $f2_start) and

                                                ($f1_start <=

                                                $f2_end) );</div>

                                              <div><br>

                                              </div>

                                              <div>I haven't tried it on

                                                this hot spot, but this

                                                is something that might

                                                benefit from getting

                                                coded in C. Again the

                                                trade off for speed here

                                                is a dependency on

                                                compiling C. In my view

                                                anyone installing this

                                                locally or installing

                                                CPAN modules probably

                                                already does, but it

                                                does add complexity.</div>

                                              <div><br>

                                              </div>

                                              <div>Typically, this is

                                                handled in Perl by

                                                providing both versions,

                                                perhaps as separate

                                                modules.</div>

                                              <div><br>

                                              </div>

                                              <div>Thought or comments? </div>

                                              <div><br>

                                              </div>

                                              <div>Thanks,</div>

                                              <div>   rocky </div>

                                              <div><br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div><br>

                                              </div>

                                              <div>

                                                <div>

                                                  <div>

                                                    <div><br>

                                                    </div>

                                                  </div>

                                                </div>

                                              </div>

                                            </div>

                                          </div>

                                        </blockquote>

                                      </div>

                                    </div>

                                  </div>

                                </div>

                                <br>

                              </div>

                            </div>

_______________________________________________<br>

                            Dev mailing list    <a

                              moz-do-not-send="true"

                              href="mailto:Dev@ensembl.org"

                              target="_blank">Dev@ensembl.org</a><br>

                            Posting guidelines and subscribe/unsubscribe

                            info: <a moz-do-not-send="true"

                              href="http://lists.ensembl.org/mailman/listinfo/dev"

                              target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>

                            Ensembl Blog: <a moz-do-not-send="true"

                              href="http://www.ensembl.info/"

                              target="_blank">http://www.ensembl.info/</a><br>

                            <br>

                          </blockquote>

                        </div>

                        <br>

                      </div>

                      <br>

                      _______________________________________________<br>

                      Dev mailing list    <a moz-do-not-send="true"

                        href="mailto:Dev@ensembl.org" target="_blank">Dev@ensembl.org</a><br>

                      Posting guidelines and subscribe/unsubscribe info:

                      <a moz-do-not-send="true"

                        href="http://lists.ensembl.org/mailman/listinfo/dev"

                        target="_blank">http://lists.ensembl.org/mailman/listinfo/dev</a><br>

                      Ensembl Blog: <a moz-do-not-send="true"

                        href="http://www.ensembl.info/" target="_blank">http://www.ensembl.info/</a><br>

                      <br>

                    </blockquote>

                  </div>

                </div>

              </div>

              <br>

            </div>

          </blockquote>

        </div>

        <br>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <br>

      <pre wrap="">_______________________________________________

Dev mailing list    <a class="moz-txt-link-abbreviated" href="mailto:Dev@ensembl.org">Dev@ensembl.org</a>

Posting guidelines and subscribe/unsubscribe info: <a class="moz-txt-link-freetext" href="http://lists.ensembl.org/mailman/listinfo/dev">http://lists.ensembl.org/mailman/listinfo/dev</a>

Ensembl Blog: <a class="moz-txt-link-freetext" href="http://www.ensembl.info/">http://www.ensembl.info/</a>

</pre>

    </blockquote>

    <br>

  </body>

</html>