fix ambiguous submatches

Kyryl Melekhin
2021-08-10 10:33:04 +00:00
parent e44e6a9517
commit 6a3be3927e
3 changed files with 57 additions and 12 deletions

43
README

@@ -47,16 +47,39 @@ no distinction, it makes the implementation potentially even uglier.
 
 NOTES
 =====
-Currently submatch tracking allocates a fix constant, however it is theoretically
-possible to compute the worst case memory usage at compile time and get rid of
-the constant. If this note remains, that means I did not find an efficient way
-to calculate the worst case.
-
-TODO
-====
-
-* Support for matching flags like case-insensitive
-* maybe add lookaround, ahead, behind
+The problem described in this paper has been fixed. Ambiguous matching is correct.
+HISTORY:
+https://re2c.org/2019_borsotti_trofimovich_efficient_posix_submatch_extraction_on_nfa.pdf
+Cox, 2009 (incorrect). Cox came up with the idea of backward POSIX matching,
+which is based on the observation that reversing the longest-match rule
+simplifies the handling of iteration subexpressions: instead of maximizing
+submatch from the first to the last iteration, one needs to maximize the
+iterations in reverse order. This means that the disambiguation is always
+based on the most recent iteration, removing the need to remember all previous
+iterations (except for the backwards-first, i.e. the last one, which contains
+submatch result). The algorithm tracks two pairs of offsets per each submatch
+group: the active pair (used for disambiguation) and the result pair. It gives
+incorrect results under two conditions: (1) ambiguous matches have equal
+offsets on some iteration, and (2) disambiguation happens too late, when
+the active offsets have already been updated and the difference between
+ambiguous matches is erased. We found that such situations may occur for two
+reasons. First, the ε-closure algorithm may compare ambiguous paths after
+their join point, when both paths have a common suffix with tagged
+transitions. This is the case with the Cox prototype implementation; for
+example, it gives incorrect results for (aa|a)* and string aaaaa. Most of such
+failures can be repaired by exploring states in topological order, but a
+topological order does not exist in the presence of ε-loops. The second reason
+is bounded repetition: ambiguous paths may not have an intermediate join point
+at all. For example, in the case of (aaaa|aaa|a){3,4} and string aaaaaaaaaa we
+have matches (aaaa)(aaaa)(a)(a) and (aaaa)(aaa)(aaa) with a different number
+of iterations. Assuming that the bounded repetition is unrolled by chaining
+three sub-automata for (aaaa|aaa|a) and an optional fourth one, by the time
+ambiguous paths meet both have active offsets (0,4). Despite the flaw, Cox
+algorithm is interesting: if somehow the delayed comparison problem was fixed,
+it would work. The algorithm requires O(mt) memory and O(nm^2t) time
+(assuming a worst-case optimal closure algorithm), where n is the
+length of input, m it the size of RE and t is the number of submatch groups
+and subexpressions that contain them.
 
 Author and License
 ==================
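
The ambiguity described in the README text can be reproduced with a small backtracking sketch. It is an illustration only, not the Pike VM in pike.c: the helper name `last_iter` is hypothetical, and the priority rule it uses (alternatives tried in the order written, another iteration preferred over stopping) is an assumption inferred from the expected values this commit adds to test.sh, not something the commit states.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/*
 * Illustration only; this is not code from pike.c. It matches
 * (alts[0]|alts[1]|...){min,max} at s[pos] by plain backtracking:
 * alternatives are tried in the order written and another iteration is
 * preferred over stopping (assumed priority rule, see the text above).
 * Use max = INT_MAX to model an unbounded repetition such as "*".
 *
 * Returns the end offset of the match, or -1 if there is none. On
 * success span[0] and span[1] hold the offsets of the last iteration;
 * span is left untouched when the match uses zero iterations.
 */
static int last_iter(const char *s, int pos, const char *const *alts,
                     int nalts, int done, int min, int max, int span[2])
{
	assert(min >= 0 && min <= max);
	if (done < max) {
		for (int i = 0; i < nalts; i++) {
			int len = (int)strlen(alts[i]);
			/* Skip empty alternatives (they would never advance)
			 * and alternatives that do not match at this offset. */
			if (len == 0 || strncmp(s + pos, alts[i], (size_t)len) != 0)
				continue;
			int sub[2] = { pos, pos + len };
			int end = last_iter(s, pos + len, alts, nalts,
			                    done + 1, min, max, sub);
			if (end >= 0) {
				/* sub was overwritten if a deeper iteration ran. */
				span[0] = sub[0];
				span[1] = sub[1];
				return end;
			}
		}
	}
	/* No further iteration possible: accept only if enough were done. */
	return done >= min ? pos : -1;
}
```

Called with the alternatives {"a", "aa"}, min 0 and max INT_MAX on aaaaa, the sketch returns 5 with span (4,5). Called with {"aaaa", "aaa", "a"}, min 3 and max 4 on aaaaaaaaaa, it returns 10 with span (9,10), i.e. the (aaaa)(aaaa)(a)(a) match. Both agree with the (0,5)(4,5) and (0,10)(9,10) lines added to expect in test.sh.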

5
pike.c

@@ -576,10 +576,11 @@ int re_pikevm(rcode *prog, const char *s, const char **subp, int nsubp)
 				npc += *(npc+1) * 2 + 2;
 				goto addthread;
 			case MATCH:
-				if (matched)
+				if (matched) {
 					decref(matched)
+					subidx = 0;
+				}
 				matched = nsub;
-				subidx = 0;
 				goto break_for;
 			}
 			decref(nsub)

21
test.sh
View File

@@ -96,6 +96,13 @@ qwerty.*$
 ([a-zA-Z0-9_][^1]*[a-zA-Z0-9_])|(\\\\\$([^\$]+)\\\\\$)
 (h[^1]*b)|(\\\\\$([^\$]+)\\\\\$)
 (h[^1]*b)|(\\\\\$([^\$]+)\\\\\$)
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(aaaa|aaa|a){3,4}
 "
 input="\
 abcdef
@@ -193,6 +200,13 @@ $\"}, /* email */
 $\"}, /* email */$
 $ hbbbb
 $ hsdhs $
+a
+aa
+aaa
+aaaa
+aaaaa
+aaaaaa
+aaaaaaaaaa
 "
 expect="\
 (0,3)
@@ -290,6 +304,13 @@ expect="\
 (0,18)(?,?)(0,18)(1,17)
 (3,8)(3,8)(?,?)(?,?)
 (0,9)(?,?)(0,9)(1,8)
+(0,1)(0,1)
+(0,2)(1,2)
+(0,3)(2,3)
+(0,4)(3,4)
+(0,5)(4,5)
+(0,6)(5,6)
+(0,10)(9,10)
 (0,0)
 "