fix ambiguous submatches

README | 43
@@ -47,16 +47,39 @@ no distinction, it makes the implementation potentially even uglier.
 
 NOTES
 =====
-Currently submatch tracking allocates a fix constant, however it is theoretically
-possible to compute the worst case memory usage at compile time and get rid of
-the constant. If this note remains, that means I did not find an efficient way
-to calculate the worst case.
-
-TODO
-====
-
-* Support for matching flags like case-insensitive
-* maybe add lookaround, ahead, behind
+The problem described in the paper below has been fixed. Ambiguous matching is correct.
+
+HISTORY:
+
+https://re2c.org/2019_borsotti_trofimovich_efficient_posix_submatch_extraction_on_nfa.pdf
+
+Cox, 2009 (incorrect). Cox came up with the idea of backward POSIX matching,
+which is based on the observation that reversing the longest-match rule
+simplifies the handling of iteration subexpressions: instead of maximizing
+submatch from the first to the last iteration, one needs to maximize the
+iterations in reverse order. This means that the disambiguation is always
+based on the most recent iteration, removing the need to remember all previous
+iterations (except for the backwards-first, i.e. the last one, which contains
+the submatch result). The algorithm tracks two pairs of offsets per submatch
+group: the active pair (used for disambiguation) and the result pair. It gives
+incorrect results under two conditions: (1) ambiguous matches have equal
+offsets on some iteration, and (2) disambiguation happens too late, when
+the active offsets have already been updated and the difference between
+ambiguous matches is erased. We found that such situations may occur for two
+reasons. First, the ε-closure algorithm may compare ambiguous paths after
+their join point, when both paths have a common suffix with tagged
+transitions. This is the case with the Cox prototype implementation; for
+example, it gives incorrect results for (aa|a)* and string aaaaa. Most such
+failures can be repaired by exploring states in topological order, but a
+topological order does not exist in the presence of ε-loops. The second reason
+is bounded repetition: ambiguous paths may not have an intermediate join point
+at all. For example, in the case of (aaaa|aaa|a){3,4} and string aaaaaaaaaa we
+have matches (aaaa)(aaaa)(a)(a) and (aaaa)(aaa)(aaa) with a different number
+of iterations. Assuming that the bounded repetition is unrolled by chaining
+three sub-automata for (aaaa|aaa|a) and an optional fourth one, by the time
+the ambiguous paths meet both have active offsets (0,4). Despite the flaw, Cox's
+algorithm is interesting: if the delayed comparison problem were somehow fixed,
+it would work. The algorithm requires O(mt) memory and O(nm^2t) time
+(assuming a worst-case optimal closure algorithm), where n is the
+length of the input, m is the size of the RE, and t is the number of submatch
+groups and subexpressions that contain them.
 
 Author and License
 ==================
pike.c | 5

@@ -576,10 +576,11 @@ int re_pikevm(rcode *prog, const char *s, const char **subp, int nsubp)
 			npc += *(npc+1) * 2 + 2;
 			goto addthread;
 		case MATCH:
-			if (matched)
+			if (matched) {
 				decref(matched)
+				subidx = 0;
+			}
 			matched = nsub;
-			subidx = 0;
 			goto break_for;
 		}
 		decref(nsub)
test.sh | 21

@@ -96,6 +96,13 @@ qwerty.*$
 ([a-zA-Z0-9_][^1]*[a-zA-Z0-9_])|(\\\\\$([^\$]+)\\\\\$)
 (h[^1]*b)|(\\\\\$([^\$]+)\\\\\$)
 (h[^1]*b)|(\\\\\$([^\$]+)\\\\\$)
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(aaaa|aaa|a){3,4}
 "
 input="\
 abcdef
@@ -193,6 +200,13 @@ $\"}, /* email */
 $\"}, /* email */$
 $ hbbbb
 $ hsdhs $
+a
+aa
+aaa
+aaaa
+aaaaa
+aaaaaa
+aaaaaaaaaa
 "
 expect="\
 (0,3)
@@ -290,6 +304,13 @@ expect="\
 (0,18)(?,?)(0,18)(1,17)
 (3,8)(3,8)(?,?)(?,?)
 (0,9)(?,?)(0,9)(1,8)
+(0,1)(0,1)
+(0,2)(1,2)
+(0,3)(2,3)
+(0,4)(3,4)
+(0,5)(4,5)
+(0,6)(5,6)
+(0,10)(9,10)
 (0,0)
 "
 