readme: explain what's going on with ambiguity

Kyryl Melekhin
2021-10-06 12:44:45 +00:00
parent 2fe1deca0b
commit 3bb28cd1f8

README

@@ -53,7 +53,7 @@ NOTES
The problem described in this paper has been fixed. Ambiguous matching is correct.
HISTORY:
https://re2c.org/2019_borsotti_trofimovich_efficient_posix_submatch_extraction_on_nfa.pdf
"Cox, 2009 (incorrect). Cox came up with the idea of backward POSIX matching,
which is based on the observation that reversing the longest-match rule
simplifies the handling of iteration subexpressions: instead of maximizing
submatch from the first to the last iteration, one needs to maximize the
@@ -82,7 +82,16 @@ algorithm is interesting: if somehow the delayed comparison problem was fixed,
it would work. The algorithm requires O(mt) memory and O(nm^2t) time
(assuming a worst-case optimal closure algorithm), where n is the
length of input, m is the size of the RE and t is the number of submatch groups
and subexpressions that contain them."
This worst-case scenario can only happen on ambiguous input, which is why the
nsubs size is set to half a MB just in case; this can match 5000000
ambiguous consumers (char, class, any) assuming t is 1. In practice there
is almost never a situation where someone wants to search with a regex this
large. Using alloca() instead of a VLA could remove this limit; I just wish
it were standardized. If you ever wondered about a situation where alloca
is a must, this is the algorithm.
Research has shown that it is possible to disambiguate an NFA in polynomial
time, but doing so brings serious performance issues on non-ambiguous inputs.
Author and License
==================