Re: Proposal for \v and \V, the small- and large- cut regex operators.

From: Jeffrey Friedl
Date: August 6, 2000 23:47
Subject: Re: Proposal for \v and \V, the small- and large- cut regex operators.
Message ID: 200008070647.XAA18312@ventrue.yahoo.com


Rick Delaney <rick.delaney@home.com> wrote:
|> Jeffrey Friedl wrote:
|> > 
|> > For that example, yes, it is. But there are situations where \v could
|> > not be described with (?>...).
|> 
|> Such as?  And can you explain them using the better/worse language of
|> "Combining pieces together", or can they only be understood if you use
|> terms like "backtracking"?

Well, the better/worse language merely describes which of the possible
results will be chosen, not what the possible results are in the first
place. A \v wedge could very well remove from consideration some results
which would otherwise be possible. So no, I don't believe that I could
describe it in that respect.

But that's okay, since I don't believe that \1, \2, etc. could be described
in that respect, either, since exactly what they even mean (what substrings
they'll match) can change over the course of the regular expression
application.
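
To make the point about backreferences concrete, here is a minimal sketch. It uses Python's `re` module, a backtracking engine in the same tradition, as a stand-in for Perl; the pattern and the sample string are illustrative assumptions, not taken from the thread.

```python
import re

# \1 has no fixed meaning up front: it matches whatever group 1
# currently holds, and that changes as the engine backs off.
pattern = re.compile(r'(\w+)\1')

m = pattern.search('abcabc')

# Group 1 first grabs all of 'abcabc', which leaves nothing for \1.
# The engine then shortens group 1 one character at a time until,
# at 'abc', \1 can match the remaining 'abc'.
whole, captured = m.group(0), m.group(1)
print(whole, captured)
```

During that one application, \1 stood for four different substrings before the match succeeded, which is why a description that only ranks finished results has trouble accounting for it.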

The whole better/worse thing is just another way of describing the order in
which branches are taken. Even if you don't use the word "branches",
"backtracking", or "stack", the user still has to mentally employ those
concepts when trying to actually put the description to work to understand
how regex /this(t[ha]+t|other)*\b/ matches a string.
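
As a rough illustration of that regex at work, the sketch below runs it against a few strings. It uses Python's `re` module as a stand-in for Perl, and the sample strings and the `trace` helper are assumptions made up for this example.

```python
import re

pattern = re.compile(r'this(t[ha]+t|other)*\b')

def trace(s):
    """Return (matched text, last capture of group 1), or None on failure."""
    m = pattern.search(s)
    return (m.group(0), m.group(1)) if m else None

samples = ('thisthatother stuff', 'thisthat other', 'thisthatx', 'this is')
results = {s: trace(s) for s in samples}

for s, r in results.items():
    print(repr(s), '->', r)
```

The third string is the interesting one: the engine matches "this", then "that", finds no word boundary before the "x", falls back to zero iterations of the group, finds no word boundary after "this" either, and only then gives up. Following that requires thinking about which fallbacks are still available, whatever vocabulary is used for them.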

I think it's good to offer these other descriptions, since different people
learn in different ways, but the better/worse description covers only one
facet of a larger mechanism, and so it alone isn't sufficient to explain
things.

Frankly, I don't really care for the better/worse description as it's
currently written, but I do like the concept as one tool in the arsenal in
the war for understanding. So, along those lines, I'll offer a better/worse
type of description of \v:

\v  The "for better but not for worse" cut operator.

    As you apply the better/worse rules of a regex to a particular string,
    you'll be keeping tabs on the various non-better choices you've still
    got to fall back on. When you come to a \v, it has the effect of wiping
    that slate clean, removing from consideration the fallbacks you'd been
    accumulating up to that point.

You'll see that this brings into the description elements not currently
found in perlre, but that's because it's bringing into the description
additional facets of the greater mechanism. In an earlier note, Ilya
claimed (jokingly?) to have no idea what I was talking about when I
mentioned some of these, such as "application" and "when you come to...",
but these very same elements would be required to explain backreferences,
or a lot of things related to (?{...}) [since random variables can be
accessed], and $^R, and (??{...}) too. These things are what make perl
"regular" expressions non-regular, and are what make them so very much more
powerful than, say, egrep's.

That perl goes about its matches as it does may have originally been a side
effect of the implementation. Perl versions 0 and 1 used the regex routines
from Jim Gosling's Emacs, which were very simple (even more so than ed's).
Wanting more power, Larry considered writing his own DFA engine à la egrep,
and one can only wonder how Perl regular expressions would be now if back
then AT&T had allowed Al Aho to release his egrep source. But Henry Spencer
did release his regex package, and it was far superior to what was in Perl
at the time. Larry availed himself of it, and Perl 2 had regular expressions
that are the root of all we have now.

Anyway, that perl goes about its matches as it does may have originally
been a side effect of the implementation, but now it's a specific feature
that's well documented and well used.

Its power is also a double-edged sword, as the non-regular stuff is also what
makes them more difficult to understand than egrep's regular expressions,
and makes efficiency a non-trivial concern. Standing alone, the
better/worse description doesn't even touch on these, but its not touching
on them doesn't remove them from what the user needs to know to use them. 
That's why I say it's only one facet of a larger mechanism.

But Rick, I still haven't answered your first question about the
situations in which \v couldn't be replaced by (?>...). I wish I had a great killer
example to give you, but I don't have one. Since I don't have \v at my
disposal, I'm not used to thinking in those terms.

I believe \v would mostly be used to raise the efficiency of failures, but
it could also be useful deep in a regex to say "oh, I've found that special
case I was watching out for, and now all other bets are off -- go only with
this flow, and nothing else".

Consider wanting a regular expression that will match the last number on a
line, but do so only if there's not an XXX somewhere following it. One way
would be:

    /^.*(\b\d+|XXX\v)/

Of course, there are other ways to write the same thing, which is a failing
of my simple example, but hopefully it's illustrative enough to give a
glimpse of its power.
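
One of those other ways, sketched here without \v: let the greedy .* find the rightmost number-or-XXX, then reject the result in code if it turned out to be XXX. This uses Python's `re` module as a stand-in for Perl; the function name and the test strings are assumptions for illustration.

```python
import re

def last_number_unless_xxx_follows(line):
    """Return the last number on the line, or None if XXX comes after it."""
    # Greedy .* runs to the end of the line and backs off until either
    # alternative matches, so group 1 is whichever occurs last.
    m = re.search(r'^.*(\b\d+|XXX)', line)
    if m is None or m.group(1) == 'XXX':
        return None
    return m.group(1)

for line in ('a 1 b 22', 'a 1 XXX', 'a 1 XXX 22', 'nothing here'):
    print(repr(line), '->', last_number_unless_xxx_follows(line))
```

The check after the match does the job that \v would do inside the regex, which is only possible when there is surrounding code to do it.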

Now, consider wanting to match the first number on a line, but only if it's
not after XXX. That would be:

    /\d+|XXX\V/

Again, a simple example, but I think a powerful one. Yes, you could
currently do something like /(\d+|XXX)/ and then check $1, but not if the
regex is in a configuration file, or is a subexpression of a larger regex. 
(Well, I guess with (?()...|...) you could do that last one :-) You could
also do something like /^(?![^\d]+XXX)[^\d]+(\d+)/, but again, the ease with
which this is rewritten is a result of the example's simplicity. Like I
said, I can imagine situations where \v or \V would be buried deep in an
expression as a kind of overall abort, but lack of experience has limited
my ability to come up with a compelling example of its use that way.
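
The two workarounds mentioned above can be sketched as follows, again using Python's `re` module as a stand-in for Perl. The function names and test strings are assumptions for illustration. The lookahead pattern here uses \D* rather than a + quantifier so that a number at the very start of the line also qualifies; that choice belongs to this sketch, not to the message.

```python
import re

def first_number_check_capture(line):
    """Match the first number-or-XXX, then inspect the capture in code."""
    m = re.search(r'(\d+|XXX)', line)
    if m is None or m.group(1) == 'XXX':
        return None
    return m.group(1)

def first_number_lookahead(line):
    """Do it all in the regex: refuse lines where XXX precedes any digit."""
    m = re.search(r'^(?!\D*XXX)\D*(\d+)', line)
    return m.group(1) if m else None

for line in ('abc 42 XXX 7', 'abc XXX 42', 'no digits', '42 first'):
    print(repr(line), '->',
          first_number_check_capture(line),
          first_number_lookahead(line))
```

The first version needs code outside the regex, which is exactly what is unavailable when the pattern lives in a configuration file. The second stays inside the regex but has to restate the "no XXX before the number" condition by hand.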

I hope this note has been more interesting than boring.
	Jeffrey
------------------------------------------------------------------------------
Jeffrey Friedl <jfriedl@yahoo-inc.com> Yahoo! Finance http://finance.yahoo.com