What is pikevm?
==============

re1 (http://code.google.com/p/re1/) is a "toy regular expression
implementation" by Russ Cox, featuring simplicity and minimal code size
unheard of in other implementations. re2 (http://code.google.com/p/re2/) is
"an efficient, principled regular expression library" by the same author. It
is robust, full-featured, and ... bloated, compared to re1.

This is an implementation of pikevm based on re1.5, which adds the features
required for minimalistic real-world use while sticking to minimal code size
and memory use.

https://github.com/pfalcon/re1.5

Why?
====

Pikevm guarantees that matching with any input regex scales as O(n) with the
size of the string, which makes it a reliably fast regex implementation.
There is no backtracking, which can explode to O(n^k) time and space, where
k is some constant, or worse.

Features
========

* Unlike re1.5, there is only pikevm here, in one easy-to-use file.
* Unlike re1.5, regexes are compiled to type-sized code rather than bytecode,
  alleviating the problem of byte overflow in splits/jmps on large regexes.
  Currently the type used is int, and every atom in the compiled code is
  aligned to that.
* The matcher does not take the size of the string as a parameter; it checks
  for '\0' instead, so the user does not need to waste time calling strlen().
* Highly optimized source code, probably 2x faster than re1.5.
* Support for quoted chars in regex. Escapes in brackets.
* Support for ^, $ assertions in regex.
* Support for repetition operator {n} and {n,m}.
* Support for Unicode (UTF-8).
* Unlike other engines, the output is a byte-level offset (which is more
  useful).
* Support for non capture group ?:
* Support for wordend & wordbeg assertions
  - One limitation of word assertions shows up when a meta char such as a
    space is used in the expression itself. For example, "\< abc" should
    match " abc" exactly at that space word boundary, but it won't.
    It's possible to fix this, but it would require an rsplit before the word
    assert and some dirty logic to check that the character or class is a
    space we want to match, not assert at. The code for it was too dirty, so
    I scrapped it. The syntax for word assertions follows the POSIX C
    library, not the PCRE "\b", which can be used at either the front or the
    back of a word. Because "\b" makes no distinction between the two, it
    makes the implementation potentially even uglier.
* Assert flags like REG_ICASE, REG_NOTEOL, REG_NOTBOL and lookahead inside a
  negated bracket are implemented here (this also shows a use case in a
  real-world project):
  https://github.com/kyx0r/nextvi/blob/master/regex.c

NOTES
=====

The problem described in the paper quoted below has been fixed. Ambiguous
matching is correct.

HISTORY:
https://re2c.org/2019_borsotti_trofimovich_efficient_posix_submatch_extraction_on_nfa.pdf

"Cox, 2009 (incorrect). Cox came up with the idea of backward POSIX matching,
which is based on the observation that reversing the longest-match rule
simplifies the handling of iteration subexpressions: instead of maximizing
submatch from the first to the last iteration, one needs to maximize the
iterations in reverse order. This means that the disambiguation is always
based on the most recent iteration, removing the need to remember all
previous iterations (except for the backwards-first, i.e. the last one, which
contains submatch result). The algorithm tracks two pairs of offsets per each
submatch group: the active pair (used for disambiguation) and the result
pair. It gives incorrect results under two conditions: (1) ambiguous matches
have equal offsets on some iteration, and (2) disambiguation happens too
late, when the active offsets have already been updated and the difference
between ambiguous matches is erased. We found that such situations may occur
for two reasons. First, the ε-closure algorithm may compare ambiguous paths
after their join point, when both paths have a common suffix with tagged
transitions.
This is the case with the Cox prototype implementation; for example, it gives
incorrect results for (aa|a)* and string aaaaa. Most of such failures can be
repaired by exploring states in topological order, but a topological order
does not exist in the presence of ε-loops. The second reason is bounded
repetition: ambiguous paths may not have an intermediate join point at all.
For example, in the case of (aaaa|aaa|a){3,4} and string aaaaaaaaaa we have
matches (aaaa)(aaaa)(a)(a) and (aaaa)(aaa)(aaa) with a different number of
iterations. Assuming that the bounded repetition is unrolled by chaining
three sub-automata for (aaaa|aaa|a) and an optional fourth one, by the time
ambiguous paths meet both have active offsets (0,4). Despite the flaw, Cox
algorithm is interesting: if somehow the delayed comparison problem was
fixed, it would work. The algorithm requires O(mt) memory and O(nm^2t) time
(assuming a worst-case optimal closure algorithm), where n is the length of
input, m is the size of RE and t is the number of submatch groups and
subexpressions that contain them."

This worst-case scenario can only happen on ambiguous input. That is why the
nsubs size is set to half a MB, just in case; this can match 5000000
ambiguous consumers (char, class, any), assuming t is 1. In practice there is
almost never a situation where someone wants to search using a regex this
large. Using alloca() instead of a VLA could remove this limit; I just wish
it was standardized. If you ever wondered about a situation where alloca is a
must, this is the algorithm.

Research has shown that it is possible to disambiguate an NFA in polynomial
time, but it brings serious performance issues on non-ambiguous inputs.

Author and License
==================

Licensed under the BSD license, just like the original re1.
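
Sketch: lockstep matching
=========================

The O(n) claim in "Why?" rests on running every live thread over the input in
lockstep, one byte at a time, and never keeping the same instruction on a
thread list twice. The following is a minimal sketch of that idea, not the
code of this project: the opcodes (CHAR, ANY, SPLIT, JMP, MATCH), the Inst
struct, the hand-compiled program prog_aplusb for the regex a+b, and the name
pike_match() are all made up for illustration. Captures are left out and the
match is anchored at both ends. As in the real matcher, the loop stops at
'\0' instead of taking a length.

```c
/* Hypothetical opcodes for illustration; not pikevm's real encoding. */
enum { CHAR, ANY, SPLIT, JMP, MATCH };

typedef struct {
	int op;		/* one of the opcodes above */
	int c;		/* CHAR: byte to match */
	int x, y;	/* JMP: target x; SPLIT: targets x and y */
} Inst;

#define MAXPROG 64	/* upper bound on program length in this sketch */

typedef struct {
	int n;
	int pc[MAXPROG];
} List;

static int mark[MAXPROG];	/* step in which each pc was last added */
static int gen;			/* current step number */

/* Hand-compiled program for the regex a+b (assumed layout). */
const Inst prog_aplusb[] = {
	{ CHAR, 'a', 0, 0 },	/* 0: match 'a' */
	{ SPLIT, 0, 0, 2 },	/* 1: loop back to 0 or go on to 2 */
	{ CHAR, 'b', 0, 0 },	/* 2: match 'b' */
	{ MATCH, 0, 0, 0 },	/* 3: success */
};

/* Add pc to the list, following JMP and SPLIT right away, so the list
   only ever holds instructions that consume a byte or report a match. */
void addthread(const Inst *prog, List *l, int pc)
{
	if (mark[pc] == gen)
		return;		/* already added during this step */
	mark[pc] = gen;
	switch (prog[pc].op) {
	case JMP:
		addthread(prog, l, prog[pc].x);
		break;
	case SPLIT:
		addthread(prog, l, prog[pc].x);
		addthread(prog, l, prog[pc].y);
		break;
	default:
		l->pc[l->n++] = pc;
	}
}

/* Return 1 if prog matches the whole NUL-terminated string s, else 0. */
int pike_match(const Inst *prog, const char *s)
{
	List a = {0}, b = {0}, *clist = &a, *nlist = &b, *tmp;
	int i;

	gen++;
	addthread(prog, clist, 0);
	for (;; s++) {
		gen++;		/* new step: marks from older steps expire */
		nlist->n = 0;
		for (i = 0; i < clist->n; i++) {
			int pc = clist->pc[i];
			switch (prog[pc].op) {
			case CHAR:
				if (*s && *s == prog[pc].c)
					addthread(prog, nlist, pc + 1);
				break;
			case ANY:
				if (*s)
					addthread(prog, nlist, pc + 1);
				break;
			case MATCH:
				if (!*s)
					return 1;
				break;
			}
		}
		if (!*s)
			return 0;	/* input exhausted, no thread matched */
		tmp = clist; clist = nlist; nlist = tmp;
	}
}
```

With this sketch, pike_match(prog_aplusb, "aaab") returns 1 and
pike_match(prog_aplusb, "aaba") returns 0. Because mark[] keeps each
instruction on a list at most once per step, one input byte costs at most
O(m) work for a program of m instructions, however ambiguous the regex is.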