[perl #114282] Regex optimizer fails with mixture of EXACT, and EXACTFU

Thread Next

From:

Karl Williamson via RT

Date:

July 28, 2012 09:47

Subject:

[perl #114282] Regex optimizer fails with mixture of EXACT, and EXACTFU

Message ID:

rt-3.6.HEAD-11172-1343494043-1295.114282-14-0@perl.org

I have done some more research into this, as follows. A simpler example that fails is this one: blead -Mre=Debug,All -E 'say "\N{LATIN SMALL LIGATURE FF}_" =~ /(?i:ff)_/' That yields: Compiling REx "(?i:ff)_" Final program: 1: EXACTFU <ff> (3) 3: EXACT <_> (5) 5: END (0) anchored "_" at 2 (checking anchored) stclass EXACTFU <ff> minlen 3 Matching REx "(?i:ff)_" against "%x{fb00}_" UTF-8 string... Did not find anchored substr "_"... Match failed The regex optimizer, split between regcomp.c and regexec.c, does not take into account the possibility of multi-character folds in these situations. I was wrong to say that the original compiled correctly. The problem is that the pattern can match either 2 or 3 characters, and the regcomp optimizer doesn't tell this to the regexec optimiser. This part: anchored "_" at 2 (checking anchored) stclass EXACTFU <ff> minlen 3 is wrong because the "2" could either be 1 or 2 characters, so should be range. In the example in the original report, the "2..3" should have been "1..3". If it is changed by hand in the debugger to 1..3, the match properly succeeds. The obvious solution would be to change the regcomp.c portion of the optimizer to detect multi-character folds in EXACTFish nodes, and to increase the range passed to regexec accordingly. However, I don't see a real efficient way to do this. The best method I've come up with so far along those lines is to go through every EXACTFish node looking for multi-char folded sequences in regcomp.c's join_exact(). A hash would be constructed at initialization whose keys are the UTF-8 of all the 49 characters (currently) which begin multi-character folds. Every character in the node would be looked up in that hash. The hash values would form a chain to find all the multi-char fold sequences that begin with the key. -- Karl Williamson

Thread Next

[perl #114282] Regex optimizer fails with mixture of EXACT, and EXACTFU by Karl Williamson via RT
[perl #114282] Regex optimizer fails with mixture of EXACT, and EXACTFU by karl williamson

nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About