[Perl/perl5] 2b5d6c: regexp.h - fixup mistake in comment

From: Yves Orton via perl5-changes
Date: March 13, 2023 08:32
Subject: [Perl/perl5] 2b5d6c: regexp.h - fixup mistake in comment
Message ID: Perl/perl5/push/refs/heads/yves/redo_curlyx_curlym/8cd2da-98a9b3@github.com
  Branch: refs/heads/yves/redo_curlyx_curlym
  Home:   https://github.com/Perl/perl5
  Commit: 2b5d6c69e10cf5b63e7027ae280a50f7bef0326c
      https://github.com/Perl/perl5/commit/2b5d6c69e10cf5b63e7027ae280a50f7bef0326c
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regexp.h

  Log Message:
  -----------
  regexp.h - fixup mistake in comment

rex->maxlen holds the maximum length the pattern can match, not the
minimum. The comment was obviously copied from the rex->minlen case,
so fix it to be correct.


  Commit: fc3bf600b683356ff57de4c786cb9314d9fc661c
      https://github.com/Perl/perl5/commit/fc3bf600b683356ff57de4c786cb9314d9fc661c
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M t/re/regexp.t

  Log Message:
  -----------
  t/re/regexp.t - in skip_amp tests (via _noamp.t) do not TODO tests with ampersand

Tests with $& always pass in regexp_noamp.t (wrapper around regexp.t),
so when they are TODO tests it looks like a TODO pass when in fact it is
just an artifact of how we handle ampersand tests in this file. For these
cases we simply do not mark them as TODO anymore.


  Commit: 34f87433cdd494f279661b2d0752e9623871684c
      https://github.com/Perl/perl5/commit/34f87433cdd494f279661b2d0752e9623871684c
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M pod/perldebguts.pod
    M pp_ctl.c
    M regcomp.c
    M regcomp.h
    M regcomp.sym
    M regcomp_debug.c
    M regexec.c
    M regexp.h
    M regnodes.h
    M t/re/pat.t
    M t/re/re_tests

  Log Message:
  -----------
  regcomp.c - track parens related to CURLYX and CURLYM

This was originally a patch which made somewhat drastic changes to how
we represent capture buffers, which Dave M and I are still
discussing offline and which has a larger impact than is acceptable to
address at the current time. As such I have reverted the controversial
parts of this patch for now, while keeping most of it intact even if in
some cases the changes are unused except for debugging purposes.

This patch still contains valuable changes, for instance teaching CURLYX
and CURLYM about how many parens there are before the curly[1] (which
will be useful in follow-up patches even if strictly speaking they are
not directly used yet), tests and other cleanups. Also this patch is
sufficiently large that reverting it out would have a large effect on
the patches that were made on top of it.

Thus keeping most of this patch while eliminating the controversial
parts of it for now seemed the best approach, especially as some of the
changes it introduces and the follow up patches based on it are very
useful in cleaning up the structures we use to represent regops.

[1] Curly is the regexp internals term for quantifiers, named after
x{min,max} "curly brace" quantifiers.


  Commit: 4c12e5457bee887ffea9ef5c8371d45a04687ede
      https://github.com/Perl/perl5/commit/4c12e5457bee887ffea9ef5c8371d45a04687ede
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M pod/perldebguts.pod
    M regcomp.c
    M regcomp.h
    M regcomp.sym
    M regcomp_debug.c
    M regcomp_trie.c
    M regexec.c
    M regexp.h
    M regnodes.h
    M t/re/pat.t
    M t/re/re_tests

  Log Message:
  -----------
  regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers

In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at
the same time. When a branch fails it should reset any capture buffers
that might be touched by its branch.

We change BRANCH and BRANCHJ to store the number of parens before the
branch, and the number of parens after the branch was completed. When
a BRANCH operation fails, we clear the buffers it contains before we
continue on.

It is a bit more complex than it should be because we have BRANCHJ
and BRANCH. (One of these days we should merge them together.)

This is also made somewhat more complex because TRIE nodes are actually
branches, and may need to track capture buffers also, at two levels:
the overall TRIE op, and, for jump tries especially, the places where we
emulate the behavior of branches. So we have to do the same clearing
logic if a trie branch fails as well.
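
As a rough illustration of the clearing step described above, here is a
minimal sketch. It is not the code in regexec.c; the types and function
names (demo_captures, demo_branch, demo_branch_failed) are hypothetical,
and it assumes the groups opened inside a branch are exactly those numbered
parens_before+1 through parens_after.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PARENS 8

/* Hypothetical capture state: offsets into the subject, -1 meaning "unset". */
typedef struct {
    ptrdiff_t start[DEMO_MAX_PARENS + 1];
    ptrdiff_t end[DEMO_MAX_PARENS + 1];
} demo_captures;

/* Hypothetical branch node: groups parens_before+1 .. parens_after
 * are assumed to be the ones opened inside this alternative. */
typedef struct {
    unsigned parens_before;
    unsigned parens_after;
} demo_branch;

/* Mark every capture buffer as unset. */
void demo_captures_init(demo_captures *cs)
{
    assert(cs != NULL);
    for (unsigned n = 0; n <= DEMO_MAX_PARENS; n++) {
        cs->start[n] = -1;
        cs->end[n]   = -1;
    }
}

/* On branch failure, unset every buffer the branch may have touched. */
void demo_branch_failed(demo_captures *cs, const demo_branch *br)
{
    assert(cs != NULL && br != NULL);
    assert(br->parens_after <= DEMO_MAX_PARENS);
    for (unsigned n = br->parens_before + 1; n <= br->parens_after; n++) {
        cs->start[n] = -1;
        cs->end[n]   = -1;
    }
}
```

For /((a)(b)|(a))+/ the first alternative would carry parens_before = 1 and
parens_after = 3 in this model, so a failure there unsets groups 2 and 3
before the second alternative gets a chance to set group 4.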


  Commit: d93b43ca7f9bd583049f5c43219898d3d2adc2e2
      https://github.com/Perl/perl5/commit/d93b43ca7f9bd583049f5c43219898d3d2adc2e2
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regexec.c
    M regexp.h
    M t/re/re_tests

  Log Message:
  -----------
  regexec.c - incredibly inefficient solution to backref problem

Backrefs to unclosed parens inside of a quantified group were not being
properly handled, which revealed we are not unrolling the paren state properly
on failure and backtracking.

Much of the code assumes that when we execute a "conditional" operation (where
more than one thing could match) we need not concern ourselves with the
paren state unless the conditional operation itself represents a paren, and
that generally opcodes only need to concern themselves with parens to their
right. When you exclude backrefs from the equation this is broadly reasonable
(I think), as on failure we typically don't care about the state of the paren
buffers. They either get reset as we find a new different accepting pathway,
or their state is irrelevant if the overall match is rejected (i.e. it fails).

However backreferences are different. Consider the following pattern
from the tests

    "xa=xaaa" =~ /^(xa|=?\1a){2}\z/

In the first iteration through this the first branch matches, and in fact
because the \1 is in the second branch it can't match on the first iteration
at all. After this $1 = "xa". We then perform the second iteration. "xa" does
not match "=xaaa" so we fall to the second branch. The '=?' matches, but sets
up a backtracking action to not match if the rest of the pattern does not
match. \1 matches 'xa', and then the 'a' matches, leaving an unmatched 'a' in
the string. We exit the quantifier loop with $1 = "=xaa", match \z against
the remaining "a" in the string, and fail.

Here is where things go wrong in the old code: we unwind to the outer loop,
but we do not unwind the paren state. We then unwind further into the second
iteration of the loop, to the '=?' where we then try to match the tail with
the quantifier matching the empty string. We then match the old $1 (which was
not unwound) as "=xaa", and then the "a" matches, and we are at the end of the
string and we have incorrectly accepted this string as matching the pattern.

What should have happened is that when the \1 was resolved the second time it
should have returned the same string as it did when the =? matched '=', which
then would have resulted in the tail matching again, and so on, eventually
unwinding the entire pattern when the second iteration failed entirely.

This patch is very crude. It simply pushes the state of the parens and creates
an unwind point for every case where we do a transition to a B or _next
operation, and we make the corresponding _next_fail do the appropriate
unwinding. The objective was to achieve correctness and then work towards
making it more efficient. We almost certainly overstore items on the stack.

In a future patch we can perhaps keep track of the unclosed parens before the
relevant operators and make sure that they are properly pushed and unwound at
the correct times.
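
To make the push/unwind idea concrete, here is a minimal sketch of saving
and restoring paren state around a backtracking point. It is an
illustration under simplifying assumptions, not the engine's actual stack
handling: the names (demo_paren_stack, demo_paren_push, demo_paren_unwind)
are hypothetical, and it copies all of the state every time, which mirrors
the "overstore" tradeoff mentioned above.

```c
#include <assert.h>
#include <string.h>

#define DEMO_NPARENS   2    /* slot 0 unused, slot 1 stands in for $1 */
#define DEMO_MAX_DEPTH 16

typedef struct { int start, end; } demo_span;

typedef struct {
    demo_span cur[DEMO_NPARENS];                    /* live paren state */
    demo_span saved[DEMO_MAX_DEPTH][DEMO_NPARENS];  /* unwind points    */
    int depth;
} demo_paren_stack;

/* Create an unwind point by copying the live state.
 * Returns 0 on success, or -1 if the stack is full. */
int demo_paren_push(demo_paren_stack *ps)
{
    assert(ps != NULL);
    if (ps->depth >= DEMO_MAX_DEPTH)
        return -1;
    memcpy(ps->saved[ps->depth], ps->cur, sizeof ps->cur);
    ps->depth++;
    return 0;
}

/* On failure, restore the most recent unwind point.
 * Returns 0 on success, or -1 if there is nothing to unwind. */
int demo_paren_unwind(demo_paren_stack *ps)
{
    assert(ps != NULL);
    if (ps->depth <= 0)
        return -1;
    ps->depth--;
    memcpy(ps->cur, ps->saved[ps->depth], sizeof ps->cur);
    return 0;
}
```

In terms of the "xa=xaaa" example: push while $1 covers offsets 0..2 ("xa"),
let the second iteration overwrite it with 2..6 ("=xaa"), and when \z fails
the unwind puts $1 back to "xa" before the '=?' alternative is retried.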


  Commit: a5a3529d1c79eee62c4952fb2b0c4701f7bca880
      https://github.com/Perl/perl5/commit/a5a3529d1c79eee62c4952fb2b0c4701f7bca880
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regcomp.c
    M regcomp.h
    M regcomp.sym
    M regcomp_internal.h
    M regexec.c
    M regexp.h
    M regnodes.h

  Log Message:
  -----------
  regexec.c - make REF into a backtracking state

This way we can do the required paren restoration only when it is in use. When
we match a REF type node which is potentially a reference to an unclosed paren
we push the match context information, currently for "everything", but in a
future patch we can teach it to be more efficient by adding a new parameter to
the REF regop to track which parens it should save.

This converts the backtracking changes from the previous commit, so that it is
run only when specifically enabled via the define RE_PESSIMISTIC_PARENS which
is by default 0. We don't make the new fields in the struct conditional as the
stack frames are large and our changes don't make any real difference and it
keeps things simpler to not have conditional members, especially since some of
the structures have to line up with each other.

If enabling RE_PESSIMISTIC_PARENS fixes a backtracking bug then it means
something is sensitive to us not necessarily restoring the parens properly on
failure. We make some assumptions that the paren state after a failing state
will be corrected by a future successful state, or that the state of the
parens is irrelevant as we will fail anyway. This can be made not true by
EVAL, backrefs, and potentially some other scenarios. Thus I have left this
inefficient logic in place but guarded by the flag.
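
The guard could look roughly like the sketch below. Only the name
RE_PESSIMISTIC_PARENS and its default of 0 come from this commit message;
the node type and the helper demo_should_save_parens() are hypothetical and
exist only to show a compile-time switch between "save everywhere" and
"save only at REF nodes that may refer to an unclosed paren".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef RE_PESSIMISTIC_PARENS
#define RE_PESSIMISTIC_PARENS 0   /* off by default, per the commit message */
#endif

/* Hypothetical node description, for illustration only. */
typedef struct {
    bool is_ref;               /* node is a backreference (REF type)     */
    bool may_ref_open_paren;   /* ...to a paren that may still be open   */
} demo_node;

/* Decide whether to save the paren state before trying this node. */
bool demo_should_save_parens(const demo_node *node)
{
    assert(node != NULL);
#if RE_PESSIMISTIC_PARENS
    return true;    /* crude mode: create an unwind point everywhere */
#else
    return node->is_ref && node->may_ref_open_paren;
#endif
}
```

Building with -DRE_PESSIMISTIC_PARENS=1 would switch this sketch to the
crude mode, which is the kind of experiment the paragraph above suggests
when hunting a backtracking bug.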


  Commit: 74edc41564c2c82e217f4d57128ef97c8ced73c8
      https://github.com/Perl/perl5/commit/74edc41564c2c82e217f4d57128ef97c8ced73c8
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M embed.fnc
    M embed.h
    M pod/perldebguts.pod
    M proto.h
    M regcomp.c
    M regcomp.h
    M regcomp.sym
    M regcomp_debug.c
    M regcomp_study.c
    M regcomp_trie.c
    M regexec.c
    M reginline.h
    M regnodes.h

  Log Message:
  -----------
  regex engine - simplify regnode structures and make them consistent

This eliminates the regnode_2L data structure, and merges it with the older
regnode_2 data structure. At the same time it makes each "arg" property of the
various regnode types that have one be consistently structured as an anonymous
union like this:

    union {
        U32 arg1u;
        I32 arg1i;
        struct {
            U16 arg1a;
            U16 arg1b;
        };
    };

We then expose four macros for accessing each slot: ARG1u(), ARG1i(),
ARG1a() and ARG1b(). Code then explicitly designates which one it wants. The
old logic used ARG() to access a U32 arg1, and ARG1() to access an I32 arg1,
which was confusing to say the least. The regnode_2L structure had a U32 arg1
and an I32 arg2, and the regnode_2 data structure had two I32 args. With the
new set of macros we use the regnode_2 for both, and use the appropriate
macros to show whether we want signed or unsigned values.

This also renames the regnode_4 to regnode_3. The 3 stands for "three 32-bit
args". However as each slot can also store two U16s, a regnode_3 can hold up
to 6 U16s, or 3 I32s, or a combination. For instance the CURLY style nodes
use regnode_3 to store 4 values: ARG1i() for min count, ARG2i() for max count,
and ARG3a() and ARG3b() for parens before and inside the quantifier.

It also changes the functions reganode() to reg1node() and changes reg2Lanode()
to reg2node(). The 2L thing was just confusing.


  Commit: 4e80d460710a249ca53e0ddd2713ecc0a119f4e0
      https://github.com/Perl/perl5/commit/4e80d460710a249ca53e0ddd2713ecc0a119f4e0
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M pod/perldebguts.pod
    M regcomp.c
    M regcomp.sym
    M regcomp_debug.c
    M regexec.c
    M regnodes.h
    M t/re/pat_advanced.t

  Log Message:
  -----------
  regcomp.c - extend REF to hold the paren it needs to regcppush

This way we can avoid pushing every buffer; we only need to push
the nestroot of the ref.


  Commit: 0eed5c876385adcd6c41c2d2a1d17dd9cc19261a
      https://github.com/Perl/perl5/commit/0eed5c876385adcd6c41c2d2a1d17dd9cc19261a
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M dump.c
    M regcomp.c
    M regexec.c
    M regexp.h

  Log Message:
  -----------
  regcomp.c - Use RXp_OFFSp() to access offset data

This insulates access to the regexp match offset data so we can
fix the define later and move the offset structure into a new struct.

RXp_OFFSp() was introduced in a recent commit to deliberately
break anything using RXp_OFFS() directly. It is deliberately hard to
type; nothing but the internals should use it. Everything
else should use one of the wrappers around it.
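
The layering can be pictured with the sketch below: one awkwardly named raw
accessor, plus friendlier wrappers built on top of it. All names here
(demo_regexp, DEMO_OFFSp, DEMO_OFFS_START and so on) are hypothetical and
only model the idea; they are not the definitions in regexp.h.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { ptrdiff_t start, end; } demo_offs_pair;

typedef struct {
    demo_offs_pair *offs;    /* offs[n] covers group n; -1 means unset */
    unsigned nparens;
} demo_regexp;

/* Raw accessor with a deliberately awkward name: internals only. */
#define DEMO_OFFSp(rx)          ((rx)->offs)

/* Hypothetical wrappers that everything else would use. */
#define DEMO_OFFS_START(rx, n)  (DEMO_OFFSp(rx)[(n)].start)
#define DEMO_OFFS_END(rx, n)    (DEMO_OFFSp(rx)[(n)].end)
#define DEMO_OFFS_VALID(rx, n)  \
    (DEMO_OFFS_START(rx, n) != -1 && DEMO_OFFS_END(rx, n) != -1)

/* Length of capture group n, or -1 if it is out of range or unset. */
ptrdiff_t demo_capture_len(const demo_regexp *rx, unsigned n)
{
    assert(rx != NULL);
    if (n > rx->nparens || !DEMO_OFFS_VALID(rx, n))
        return -1;
    return DEMO_OFFS_END(rx, n) - DEMO_OFFS_START(rx, n);
}
```

If the offsets later move into another struct, only the DEMO_OFFSp()
definition has to change in this model; demo_capture_len() and any other
caller of the wrappers stay as they are.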


  Commit: 3373de0e13504dec9ee24629ac21d19558997caf
      https://github.com/Perl/perl5/commit/3373de0e13504dec9ee24629ac21d19558997caf
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regexp.h

  Log Message:
  -----------
  regexp.h - standardize macros, and parenthesize parameters

Obviously this isn't required as we build fine. But doing this
future-proofs us against other changes.
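
For readers unfamiliar with why parenthesizing matters even when the build
is currently fine, here is a small generic sketch. The macros and the
demo_rx type are made up for the example and are not taken from regexp.h.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int minlen; int maxlen; } demo_rx;

/* Unparenthesized: DEMO_BAD_TWICE(a + 1) expands to a + 1 * 2, i.e. a + 2. */
#define DEMO_BAD_TWICE(x)   x * 2
/* Parenthesized parameter and result: always (x) * 2. */
#define DEMO_GOOD_TWICE(x)  ((x) * 2)

/* Member access benefits too: with a bare "rx->minlen" body,
 * DEMO_MINLEN(arr + 1) would expand to arr + 1->minlen and not compile. */
#define DEMO_MINLEN(rx)     ((rx)->minlen)

int demo_bad_twice(int a)  { return DEMO_BAD_TWICE(a + 1); }
int demo_good_twice(int a) { return DEMO_GOOD_TWICE(a + 1); }

/* minlen of the second element of an array of at least two demo_rx. */
int demo_second_minlen(const demo_rx *arr)
{
    assert(arr != NULL);
    return DEMO_MINLEN(arr + 1);
}
```

Both forms behave the same while every caller passes a plain identifier,
which is why nothing breaks today; the difference only shows up once someone
passes an expression.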


  Commit: 2fa9ffd77d1f5452ecf0ab535757608c6341841f
      https://github.com/Perl/perl5/commit/2fa9ffd77d1f5452ecf0ab535757608c6341841f
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M dump.c
    M regcomp_debug.c
    M regexec.c

  Log Message:
  -----------
  regexec.c - use RXp_LASTPAREN(rex) to access rex->lastparen

This field will be moving to a new struct. Converting this to a macro
will make that move easier.


  Commit: 15722433a292bae60b8ca4a85d5ef9bff530db2d
      https://github.com/Perl/perl5/commit/15722433a292bae60b8ca4a85d5ef9bff530db2d
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regexp.h

  Log Message:
  -----------
  regexp.h - add missing defines

We were missing various RXp_XXXX() and RX_XXXX() macros. This adds
them so we can use them in places where we are unreasonably intimate
with the regexp struct internals.


  Commit: 2aa0c8c5fc1ff1d7a551211b9ed1d4ceb80a9acd
      https://github.com/Perl/perl5/commit/2aa0c8c5fc1ff1d7a551211b9ed1d4ceb80a9acd
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M dump.c

  Log Message:
  -----------
  dump.c - use RXp_ macros to access regexp struct members

We will move some of these members out of the regexp structure
into a new substructure. This isolates those changes to the
macro definitions.


  Commit: 9288307e699737d0c13d8f0f097ac8732145a4d1
      https://github.com/Perl/perl5/commit/9288307e699737d0c13d8f0f097ac8732145a4d1
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regexec.c

  Log Message:
  -----------
  regexec.c - use RXp_LASTCLOSEPAREN(r) to access r->lastcloseparen

We will move this struct member into a new struct in a future patch,
and using the macros means we can reduce the number of places that
need to be explicitly aware of the new structure.


  Commit: ecdd6ee82904bf8fccc18ff97f091300a4edd6e5
      https://github.com/Perl/perl5/commit/ecdd6ee82904bf8fccc18ff97f091300a4edd6e5
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regexec.c

  Log Message:
  -----------
  regexec.c - use macro to access rex->subbeg

We will move this member to a new struct in the near future;
converting all uses to a macro isolates that change.


  Commit: 5c7c1ba6e90fe18dd0ea1805ea89242236c9dd01
      https://github.com/Perl/perl5/commit/5c7c1ba6e90fe18dd0ea1805ea89242236c9dd01
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regexec.c

  Log Message:
  -----------
  regexec.c - use RXp_SUBLEN(ret) for ret->sublen

This member of the regexp structure will be moved to a new
structure in the near future. Converting to use the macro
will make this change easier to manage.


  Commit: 59d7ca9e4de3414704e42725215aa11738f54bde
      https://github.com/Perl/perl5/commit/59d7ca9e4de3414704e42725215aa11738f54bde
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regexec.c

  Log Message:
  -----------
  regexec.c - use RXp_SUBOFFSET(rx) instead of rx->suboffset

We will migrate this struct member to a new struct in the near future;
this change will make that patch more minimal and hide the gory details.


  Commit: f296ae459fa7f25fc1ac484b725a1508a12cbe68
      https://github.com/Perl/perl5/commit/f296ae459fa7f25fc1ac484b725a1508a12cbe68
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regexec.c

  Log Message:
  -----------
  regexec.c - use RXp_SUBCOFFSET instead of rx->subcoffset

This member of the regexp struct will soon be migrated to a new
independent structure. This change ensures that when we do the migration
the changes are restricted to the least code possible.


  Commit: 5b5c29e8d7ebb2ca539374cb9d5ae0a65b2f5b61
      https://github.com/Perl/perl5/commit/5b5c29e8d7ebb2ca539374cb9d5ae0a65b2f5b61
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regexec.c

  Log Message:
  -----------
  regexec.c - use RXp_SAVED_COPY(rex) instead of rex->saved_copy

We will migrate this member to a new structure in the near future;
wrapping with a macro makes that migration simpler and less invasive.


  Commit: 98a9b349f010061047a377bf1282a776cd30fa09
      https://github.com/Perl/perl5/commit/98a9b349f010061047a377bf1282a776cd30fa09
  Author: Yves Orton <demerphq@gmail.com>
  Date:   2023-03-13 (Mon, 13 Mar 2023)

  Changed paths:
    M regcomp.c

  Log Message:
  -----------
  regcomp.c - use macro wrappers to minimize impact of struct split

We will move various members of the regexp structure to a new
structure which just contains information about the match. Wrapping
the members in the standard macros means that change can be made
less invasive. We already did all of this in regexec.c.
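
The effect of the planned split on calling code can be sketched as follows.
This is a model only: the struct names, the member selection, and the
DEMO_SPLIT_LAYOUT switch are assumptions made for the example, and the
eventual layout in regexp.h may differ.

```c
#include <assert.h>
#include <stddef.h>

#ifndef DEMO_SPLIT_LAYOUT
#define DEMO_SPLIT_LAYOUT 1   /* 1: match data lives in a nested struct */
#endif

#if DEMO_SPLIT_LAYOUT
/* "After" layout: match-related members grouped in their own struct. */
typedef struct {
    const char *subbeg;
    size_t sublen;
    unsigned lastparen;
} demo_match_info;
typedef struct { int minlen; demo_match_info match; } demo_rx_split;
#define DEMO_SUBBEG(rx)    ((rx)->match.subbeg)
#define DEMO_SUBLEN(rx)    ((rx)->match.sublen)
#define DEMO_LASTPAREN(rx) ((rx)->match.lastparen)
#else
/* "Before" layout: everything directly in the regexp struct. */
typedef struct {
    int minlen;
    const char *subbeg;
    size_t sublen;
    unsigned lastparen;
} demo_rx_split;
#define DEMO_SUBBEG(rx)    ((rx)->subbeg)
#define DEMO_SUBLEN(rx)    ((rx)->sublen)
#define DEMO_LASTPAREN(rx) ((rx)->lastparen)
#endif

/* Caller code is identical under either layout. */
void demo_set_subject(demo_rx_split *rx, const char *s, size_t len)
{
    assert(rx != NULL);
    DEMO_SUBBEG(rx)    = s;
    DEMO_SUBLEN(rx)    = len;
    DEMO_LASTPAREN(rx) = 0;
}

size_t demo_subject_len(const demo_rx_split *rx)
{
    assert(rx != NULL);
    return DEMO_SUBLEN(rx);
}
```

Compiling with -DDEMO_SPLIT_LAYOUT=0 swaps the layout without touching
demo_set_subject() or demo_subject_len(), which is the property this series
of macro conversions is after.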


Compare: https://github.com/Perl/perl5/compare/8cd2da2fe41a...98a9b349f010


