
[Perl/perl5] dd09cd: regcomp.c - Resolve issues clearing buffers inCUR...

From: Yves Orton via perl5-changes
Date: March 29, 2023 14:03
Subject: [Perl/perl5] dd09cd: regcomp.c - Resolve issues clearing buffers inCUR...
Message ID: Perl/perl5/push/refs/heads/yves/curlyx_curlym/f5f4af-dd09cd@github.com
  Branch: refs/heads/yves/curlyx_curlym
  Home:   https://github.com/Perl/perl5
  Commit: dd09cdb57d10f904b31baa5499a2048198a1b58b
      https://github.com/Perl/perl5/commit/dd09cdb57d10f904b31baa5499a2048198a1b58b
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-29 (Wed, 29 Mar 2023)

  Changed paths:
    M pod/perldelta.pod
    M pp_ctl.c
    M regexec.c
    M regexp.h
    M t/re/pat.t
    M t/re/pat_rt_report.t
    M t/re/re_tests

  Log Message:
  -----------
  regcomp.c - Resolve issues clearing buffers in CURLYX (MAJOR-CHANGE)

CURLYX doesn't reset capture buffers properly. It is possible
for multiple buffers to be defined at once with values from
different iterations of the loop, which doesn't make sense really.

An example is this:

  "foobarfoo"=~/((foo)|(bar))+/

After this matches, either $1 should equal $2 and $3 should be
undefined, or $1 should equal $3 and $2 should be undefined. Prior
to this patch that was not the case.

The solution that this patch uses is to introduce a form of
"layered transactional storage" for paren data. The existing
pair of start/end data for capture data is extended with a
start_new/end_new pair. When the vast majority of our code wants
to check if a given capture buffer is defined, it first checks
start_new/end_new; if either is -1, it falls back to whatever is
in start/end.
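
The lookup rule described above can be sketched roughly as follows.
This is only an illustration: the field names start, end, start_new
and end_new come from this message, but the type name paren_pair and
the helper paren_lookup are hypothetical and are not taken from
regexp.h or regexec.c.

```c
#include <stddef.h>

/* Hypothetical stand-in for one capture buffer's offsets. The field
 * names follow the commit message; the type name is an assumption. */
typedef struct {
    ptrdiff_t start, end;          /* committed layer */
    ptrdiff_t start_new, end_new;  /* pending layer */
} paren_pair;

/* Hypothetical helper: report the offsets a reader should see.
 * Returns 1 and fills *s and *e if the buffer is defined, else 0. */
static int paren_lookup(const paren_pair *p, ptrdiff_t *s, ptrdiff_t *e)
{
    /* Prefer the pending layer when both of its offsets are set. */
    if (p->start_new != -1 && p->end_new != -1) {
        *s = p->start_new;
        *e = p->end_new;
        return 1;
    }
    /* Either pending offset is -1: fall back to start/end. */
    if (p->start != -1 && p->end != -1) {
        *s = p->start;
        *e = p->end;
        return 1;
    }
    return 0; /* undefined in both layers */
}
```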

When a capture buffer is CLOSEd the data is written into the
start_new/end_new pair instead of the start/end pair. When a CURLYX
loop is executing and has matched something (at least one "A" in
/A*B/ -- thus actually in WHILEM) it "commits" the start_new/end_new
data by writing it into start/end. When we begin a new iteration of
the loop we clear the start_new/end_new pairs that are contained by
the loop, by setting them to -1. If the loop fails then we roll back
as we used to. If the loop succeeds we continue. When we hit an END
block we commit everything.
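
As a rough sketch of the three operations just described (write on
CLOSE, commit, and clear), assuming a plain array of buffers indexed
by paren number. The names paren_pair, paren_close, paren_commit and
paren_clear_new are hypothetical and do not come from the Perl
source. The sketch also assumes that a commit copies the pending pair
unconditionally, including a -1 pair, which is how the worked example
in this message appears to behave.

```c
#include <stddef.h>

/* Hypothetical stand-in for one capture buffer's offsets. */
typedef struct {
    ptrdiff_t start, end;          /* committed layer */
    ptrdiff_t start_new, end_new;  /* pending layer */
} paren_pair;

/* CLOSE: record a finished capture in the pending layer only. */
static void paren_close(paren_pair *p, ptrdiff_t s, ptrdiff_t e)
{
    p->start_new = s;
    p->end_new = e;
}

/* Commit: copy the pending layer into start/end for the buffers
 * first..last (inclusive) contained by the loop. Assumption: the
 * copy is unconditional, so a buffer that did not take part in this
 * iteration becomes undefined. */
static void paren_commit(paren_pair *pp, size_t first, size_t last)
{
    for (size_t i = first; i <= last; i++) {
        pp[i].start = pp[i].start_new;
        pp[i].end = pp[i].end_new;
    }
}

/* New iteration: reset the pending layer of first..last to -1. */
static void paren_clear_new(paren_pair *pp, size_t first, size_t last)
{
    for (size_t i = first; i <= last; i++) {
        pp[i].start_new = -1;
        pp[i].end_new = -1;
    }
}
```

Rolling back on failure is not shown; the message says that part is
unchanged and still goes through the existing save and restore of
paren state.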

Consider the example above. We start off with everything set to -1.

 $1 = (-1,-1):(-1,-1)
 $2 = (-1,-1):(-1,-1)
 $3 = (-1,-1):(-1,-1)

In the first iteration we have matched "foo" and end up with this:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):( 0, 3)
 $3 = (-1,-1):(-1,-1)

We commit the results of $2 and $3, and then clear the new data at
the beginning of the next iteration:

 $1 = (-1,-1):( 0, 3)
 $2 = ( 0, 3):(-1,-1)
 $3 = (-1,-1):(-1,-1)

We then match "bar":

 $1 = (-1,-1):( 0, 3)
 $2 = ( 0, 3):(-1,-1)
 $3 = (-1,-1):( 3, 6)

and then commit the result and clear the new data:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):(-1,-1)
 $3 = ( 3, 6):(-1,-1)

and then we match "foo" again:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):( 6, 9)
 $3 = ( 3, 6):(-1,-1)

And we then commit. We do a regcppush here as normal.

 $1 = (-1,-1):( 0, 3)
 $2 = ( 6, 9):( 6, 9)
 $3 = (-1,-1):(-1,-1)

We then clear it again at the start of the next iteration, but since
that iteration does not match, the regcppop restores the buffers to
the layout above. When we finally hit the END block we also commit
all buffers, including the 0th (for the full match).
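
The walkthrough can be replayed with a small model. This is a sketch
under stated assumptions, not the code in regexec.c: the names
paren_pair, commit, clear_new and replay_walkthrough are
hypothetical, only $2 and $3 are tracked, the commit is assumed to
copy the pending pair unconditionally, and the offsets are zero-based
positions in "foobarfoo" (foo at 0..3, bar at 3..6, foo at 6..9). The
failed fourth iteration and its regcppop are left out, since they
restore the state that the model ends in.

```c
#include <stddef.h>

/* Hypothetical stand-in for one capture buffer's offsets. */
typedef struct {
    ptrdiff_t start, end;          /* committed layer */
    ptrdiff_t start_new, end_new;  /* pending layer */
} paren_pair;

/* Copy the pending layer into start/end for buffers first..last. */
static void commit(paren_pair *pp, size_t first, size_t last)
{
    for (size_t i = first; i <= last; i++) {
        pp[i].start = pp[i].start_new;
        pp[i].end = pp[i].end_new;
    }
}

/* Reset the pending layer of buffers first..last to -1. */
static void clear_new(paren_pair *pp, size_t first, size_t last)
{
    for (size_t i = first; i <= last; i++)
        pp[i].start_new = pp[i].end_new = -1;
}

/* Replay the three successful iterations of /((foo)|(bar))+/ on
 * "foobarfoo". pp[2] models $2 and pp[3] models $3; pp[0] and pp[1]
 * are initialised but otherwise left alone. */
static void replay_walkthrough(paren_pair pp[4])
{
    /* Which buffer closes in each iteration, and at which offsets. */
    static const struct { size_t paren; ptrdiff_t s, e; } iter[3] = {
        { 2, 0, 3 },  /* "foo" */
        { 3, 3, 6 },  /* "bar" */
        { 2, 6, 9 },  /* "foo" */
    };

    for (size_t i = 0; i < 4; i++)
        pp[i] = (paren_pair){ -1, -1, -1, -1 };

    for (size_t k = 0; k < 3; k++) {
        clear_new(pp, 2, 3);                    /* begin iteration */
        pp[iter[k].paren].start_new = iter[k].s; /* CLOSE writes new */
        pp[iter[k].paren].end_new = iter[k].e;
        commit(pp, 2, 3);                       /* iteration matched */
    }
}
```

After the replay only $2 holds offsets and $3 is -1 in both layers,
so the two buffers no longer carry values from different iterations.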

Fixes GH Issue #18865, and adds tests for it and other things.




