From 6a3be3927e6f93846f7188dfa6704934f1cc3554 Mon Sep 17 00:00:00 2001
From: Kyryl Melekhin
Date: Tue, 10 Aug 2021 10:33:04 +0000
Subject: [PATCH] fix ambiguous submatches

---
 README  | 43 +++++++++++++++++++++++++++++++++----------
 pike.c  |  5 +++--
 test.sh | 21 +++++++++++++++++++++
 3 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/README b/README
index 5bb4e51..1d46f83 100644
--- a/README
+++ b/README
@@ -47,16 +47,39 @@ no distinction, it makes the implementation potentially even uglier.
 
 NOTES
 =====
-Currently submatch tracking allocates a fix constant, however it is theoretically
-possible to compute the worst case memory usage at compile time and get rid of
-the constant. If this note remains, that means I did not find an efficient way
-to calculate the worst case.
-
-TODO
-====
-
-* Support for matching flags like case-insensitive
-* maybe add lookaround, ahead, behind
+The problem described in this paper has been fixed. Ambiguous matching is correct.
+HISTORY:
+https://re2c.org/2019_borsotti_trofimovich_efficient_posix_submatch_extraction_on_nfa.pdf
+Cox, 2009 (incorrect). Cox came up with the idea of backward POSIX matching,
+which is based on the observation that reversing the longest-match rule
+simplifies the handling of iteration subexpressions: instead of maximizing
+submatch from the first to the last iteration, one needs to maximize the
+iterations in reverse order. This means that the disambiguation is always
+based on the most recent iteration, removing the need to remember all previous
+iterations (except for the backwards-first, i.e. the last one, which contains
+submatch result). The algorithm tracks two pairs of offsets per each submatch
+group: the active pair (used for disambiguation) and the result pair. It gives
+incorrect results under two conditions: (1) ambiguous matches have equal
+offsets on some iteration, and (2) disambiguation happens too late, when
+the active offsets have already been updated and the difference between
+ambiguous matches is erased. We found that such situations may occur for two
+reasons. First, the ε-closure algorithm may compare ambiguous paths after
+their join point, when both paths have a common suffix with tagged
+transitions. This is the case with the Cox prototype implementation; for
+example, it gives incorrect results for (aa|a)* and string aaaaa. Most of such
+failures can be repaired by exploring states in topological order, but a
+topological order does not exist in the presence of ε-loops. The second reason
+is bounded repetition: ambiguous paths may not have an intermediate join point
+at all. For example, in the case of (aaaa|aaa|a){3,4} and string aaaaaaaaaa we
+have matches (aaaa)(aaaa)(a)(a) and (aaaa)(aaa)(aaa) with a different number
+of iterations. Assuming that the bounded repetition is unrolled by chaining
+three sub-automata for (aaaa|aaa|a) and an optional fourth one, by the time
+ambiguous paths meet both have active offsets (0,4). Despite the flaw, Cox
+algorithm is interesting: if somehow the delayed comparison problem was fixed,
+it would work. The algorithm requires O(mt) memory and O(nm^2t) time
+(assuming a worst-case optimal closure algorithm), where n is the
+length of input, m is the size of RE and t is the number of submatch groups
+and subexpressions that contain them.
 
 Author and License
 ==================
diff --git a/pike.c b/pike.c
index e50265d..ded28a0 100644
--- a/pike.c
+++ b/pike.c
@@ -576,10 +576,11 @@ int re_pikevm(rcode *prog, const char *s, const char **subp, int nsubp)
 				npc += *(npc+1) * 2 + 2;
 				goto addthread;
 			case MATCH:
-				if (matched)
+				if (matched) {
 					decref(matched)
+					subidx = 0;
+				}
 				matched = nsub;
-				subidx = 0;
 				goto break_for;
 			}
 			decref(nsub)
diff --git a/test.sh b/test.sh
index afc0667..0e6eadb 100755
--- a/test.sh
+++ b/test.sh
@@ -96,6 +96,13 @@ qwerty.*$
 ([a-zA-Z0-9_][^1]*[a-zA-Z0-9_])|(\\\\\$([^\$]+)\\\\\$)
 (h[^1]*b)|(\\\\\$([^\$]+)\\\\\$)
 (h[^1]*b)|(\\\\\$([^\$]+)\\\\\$)
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(a|aa)*
+(aaaa|aaa|a){3,4}
 "
 input="\
 abcdef
@@ -193,6 +200,13 @@ $\"}, /* email */
 $\"}, /* email */$
 $ hbbbb $
 $ hsdhs $
+a
+aa
+aaa
+aaaa
+aaaaa
+aaaaaa
+aaaaaaaaaa
 "
 expect="\
 (0,3)
@@ -290,6 +304,13 @@ expect="\
 (0,18)(?,?)(0,18)(1,17)
 (3,8)(3,8)(?,?)(?,?)
 (0,9)(?,?)(0,9)(1,8)
+(0,1)(0,1)
+(0,2)(1,2)
+(0,3)(2,3)
+(0,4)(3,4)
+(0,5)(4,5)
+(0,6)(5,6)
+(0,10)(9,10)
 (0,0)
 "