What is pikevm?
===============

re1 (http://code.google.com/p/re1/) is a "toy regular expression implementation"
by Russ Cox, featuring simplicity and minimal code size unheard of in other
implementations. re2 (http://code.google.com/p/re2/) is "an efficient,
principled regular expression library" by the same author. It is robust,
full-featured, and ... bloated, compared to re1.

This is an implementation of pikevm based on re1.5, which adds the features
required for minimalistic real-world use while sticking to minimal code size
and memory use.
https://github.com/pfalcon/re1.5

Why?
====
Pikevm guarantees that matching with any input regex scales as O(n) with the
size of the string, which puts it among the fastest regex implementations in
the worst case. There is no backtracking, which on pathological inputs can
explode to O(n^k) time and space, where k is some constant, or worse.

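The lockstep idea behind that guarantee can be sketched in a few lines of C.
This is a minimal illustration, not the code of this project: the instruction
layout, MAXPROG, addthread(), pike_match() and the hand-assembled program for
a*b are all assumptions made for the example. Each input byte is looked at
once, and the work done per byte is bounded by the program length.

```c
#include <string.h>

enum { CHAR, SPLIT, JMP, MATCH };

struct inst {
	int op;   /* one of the opcodes above */
	int c;    /* CHAR: the byte to consume */
	int x, y; /* SPLIT: both targets; JMP: x is the target */
};

#define MAXPROG 16 /* hypothetical cap on program length for this sketch */

/* Follow SPLIT/JMP from pc and append every consuming state to list.
 * mark[] ensures each pc is visited at most once per input position. */
static void addthread(const struct inst *prog, int *list, int *n,
		char *mark, int pc)
{
	if (mark[pc])
		return;
	mark[pc] = 1;
	switch (prog[pc].op) {
	case JMP:
		addthread(prog, list, n, mark, prog[pc].x);
		break;
	case SPLIT:
		addthread(prog, list, n, mark, prog[pc].x);
		addthread(prog, list, n, mark, prog[pc].y);
		break;
	default:
		list[(*n)++] = pc;
	}
}

/* Returns 1 if some prefix of s matches prog (anchored at the start). */
int pike_match(const struct inst *prog, const char *s)
{
	int clist[MAXPROG], nlist[MAXPROG];
	char mark[MAXPROG];
	int cn = 0, nn, i;

	memset(mark, 0, sizeof mark);
	addthread(prog, clist, &cn, mark, 0);
	for (;; s++) {
		nn = 0;
		memset(mark, 0, sizeof mark);
		for (i = 0; i < cn; i++) {
			const struct inst *in = &prog[clist[i]];
			if (in->op == MATCH)
				return 1;
			/* only CHAR remains; stop at the '\0' terminator */
			if (*s && in->c == *s)
				addthread(prog, nlist, &nn, mark, clist[i] + 1);
		}
		if (nn == 0)
			return 0; /* no thread survived this byte */
		memcpy(clist, nlist, nn * sizeof nlist[0]);
		cn = nn;
	}
}

/* a*b, assembled by hand */
const struct inst prog_astar_b[] = {
	{ SPLIT, 0, 1, 3 },
	{ CHAR, 'a', 0, 0 },
	{ JMP, 0, 0, 0 },
	{ CHAR, 'b', 0, 0 },
	{ MATCH, 0, 0, 0 },
};
```

Calling pike_match(prog_astar_b, "aaab") walks the string once; no thread is
ever re-run from an earlier position, which is what a backtracking matcher
would do on failure.
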
Features
========

* Unlike re1.5, this contains only the pikevm, in one easy-to-use file.
* Unlike re1.5, regexes are compiled to type-sized code rather than bytecode,
  alleviating the problem of byte overflow in splits/jmps on large regexes.
  Currently the type used is int, and every atom in the compiled code is
  aligned to that.
* The matcher does not take the size of the string as a parameter; it checks
  for '\0' instead, so the user does not need to spend time calling strlen().
* Highly optimized source code, probably 2x faster than re1.5.
* Support for quoted chars in regex. Escapes in brackets.
* Support for ^, $ assertions in regex.
* Support for the repetition operators {n}, {n,m} and {n,}.
  - Note: cases with 0 are not handled; avoid them, they can easily be
    replaced (for example, a{0,} can be written as a* and a{0,1} as a?).
* Support for Unicode (UTF-8).
* Unlike other engines, the output is a byte-level offset (which is more useful).
* Support for non-capturing groups (?:).
* Support for wordend & wordbeg assertions.
  - One limitation of word assertions concerns meta chars such as spaces being
    used in the expression itself. For example, "\< abc" should match " abc"
    exactly at that space word boundary, but it won't. It is possible to fix
    this, but it would require an rsplit before the word assert and some dirty
    logic to check whether the character or class is a space we want to match
    rather than assert at. The code for it was too dirty and I scrapped it.
    The syntax for word assertions follows the POSIX C library, not the PCRE
    "\b", which can be used both at the front and at the back of a word;
    because "\b" makes no distinction between the two, it would make the
    implementation potentially even uglier.
* Assert flags like REG_ICASE, REG_NOTEOL, REG_NOTBOL and lookahead inside a
  negated bracket are implemented here (this also shows a use case in a
  real-world project):
  https://github.com/kyx0r/nextvi/blob/master/regex.c

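Since the reported offsets are in bytes, a caller that needs a character
(code point) position in UTF-8 text has to convert. A minimal sketch of such a
conversion follows; utf8_chars_before() is a hypothetical helper written for
this example, not part of the engine, and it assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Count the code points that start within the first nbytes bytes of s.
 * Continuation bytes have the form 10xxxxxx and are skipped. */
size_t utf8_chars_before(const char *s, size_t nbytes)
{
	size_t i, n = 0;

	for (i = 0; i < nbytes && s[i]; i++)
		if (((unsigned char)s[i] & 0xC0) != 0x80)
			n++;
	return n;
}
```

Given a match that starts at byte offset 3 in "h\xc3\xa9llo", the helper
reports that two characters precede it. Going the other way is not needed:
byte offsets can be used directly for slicing the original buffer.
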
NOTES
=====
The problem described in this paper has been fixed. Ambiguous matching is correct.
HISTORY:
https://re2c.org/2019_borsotti_trofimovich_efficient_posix_submatch_extraction_on_nfa.pdf
"Cox, 2009 (incorrect). Cox came up with the idea of backward POSIX matching,
which is based on the observation that reversing the longest-match rule
simplifies the handling of iteration subexpressions: instead of maximizing
submatch from the first to the last iteration, one needs to maximize the
iterations in reverse order. This means that the disambiguation is always
based on the most recent iteration, removing the need to remember all previous
iterations (except for the backwards-first, i.e. the last one, which contains
submatch result). The algorithm tracks two pairs of offsets per each submatch
group: the active pair (used for disambiguation) and the result pair. It gives
incorrect results under two conditions: (1) ambiguous matches have equal
offsets on some iteration, and (2) disambiguation happens too late, when
the active offsets have already been updated and the difference between
ambiguous matches is erased. We found that such situations may occur for two
reasons. First, the ε-closure algorithm may compare ambiguous paths after
their join point, when both paths have a common suffix with tagged
transitions. This is the case with the Cox prototype implementation; for
example, it gives incorrect results for (aa|a)* and string aaaaa. Most of such
failures can be repaired by exploring states in topological order, but a
topological order does not exist in the presence of ε-loops. The second reason
is bounded repetition: ambiguous paths may not have an intermediate join point
at all. For example, in the case of (aaaa|aaa|a){3,4} and string aaaaaaaaaa we
have matches (aaaa)(aaaa)(a)(a) and (aaaa)(aaa)(aaa) with a different number
of iterations. Assuming that the bounded repetition is unrolled by chaining
three sub-automata for (aaaa|aaa|a) and an optional fourth one, by the time
ambiguous paths meet both have active offsets (0,4). Despite the flaw, Cox
algorithm is interesting: if somehow the delayed comparison problem was fixed,
it would work. The algorithm requires O(mt) memory and O(nm^2t) time
(assuming a worst-case optimal closure algorithm), where n is the
length of input, m it the size of RE and t is the number of submatch groups
and subexpressions that contain them."
This worst-case scenario can only happen on ambiguous input. That is why the
nsubs size is set to half a MB, just in case; this can match 5000000
ambiguous consumers (char, class, any), assuming t is 1. In practice there
is almost never a situation where someone wants to search using a regex this
large. Using alloca() instead of a VLA could remove this limit; I just wish
it was standardized. If you ever wondered about a situation where alloca
is a must, this is the algorithm.
Research has shown that it is possible to disambiguate an NFA in polynomial
time, but it brings serious performance issues on non-ambiguous inputs.

This pikevm features an improved submatch extraction
algorithm based on Russ Cox's original design.
I, Kyryl Melekhin, have found a way to properly optimize the
tracking of the 1st number in the submatch pair. Based on a simple
observation of how the NFA is constructed, I noticed that
there is no way for addthread1() to ever reach inner SAVE
instructions in the regex, which makes tracking 2nd pairs by
addthread1() irrelevant to the final results (except for the need to
initialize the sub after allocation). This improved the overall
performance by 25%, which is massive considering that at the
time there was nothing else left that could be done to make it faster.

What are on##list macros?
Redundant states inside nlist can arise in a couple of
ways, and they have to do with the closure operations a* (star)
and also +. Due to the design of the automaton, split happens
to sit above the next consuming instruction, and if that
state gets added onto the list we may segfault or give a
wrong submatch result. Rsplit does not have this problem
because it is generated below the consumer instruction, but
it can still add redundant states. Overall this is extremely
difficult to understand or explain, but it is just something
we have to check for. We used to check for this with an extra int
inside the split instructions, which left some global state inside
the machine insts. Most of the time we just added to the next
gen number and kept incrementing it forever. This leaves a small
chance of overflowing the int and getting a run on a false state
left from a previous use of the regex, though if the overflow never
happens there is no chance of getting a false state. Overflows
like this pose a high security threat if an attacker knows
how many cycles are needed to overflow the gen variable and get
an inconsistent result. It is possible to reset the marks when we
near the overflow, but as you may guess, that does not come
for free.

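The wraparound hazard described above can be sketched as follows. This is only
an illustration, not the engine's former code: gen_t, struct split, visited()
and stale_mark_after_wrap() are hypothetical names, and the counter is
deliberately an unsigned char so that the wrap happens after 256 steps rather
than after exhausting an int.

```c
/* Deliberately tiny counter type so the wrap is easy to reach (assumption
 * for this sketch; the text above describes an int). */
typedef unsigned char gen_t;

struct split {
	gen_t gen; /* generation in which this instruction was last visited */
};

/* Returns 1 if sp was already visited in generation gen,
 * otherwise marks it and returns 0. */
int visited(struct split *sp, gen_t gen)
{
	if (sp->gen == gen)
		return 1;
	sp->gen = gen;
	return 0;
}

/* Returns 1 when a stale mark is mistaken for a current one. */
int stale_mark_after_wrap(void)
{
	struct split sp = { 0 };
	gen_t gen = 5;
	int i;

	visited(&sp, gen);        /* sp is marked with generation 5 */
	for (i = 0; i < 256; i++) /* 256 steps in which sp is never reached */
		gen++;            /* unsigned wrap: 255 -> 0, ends at 5 again */
	return visited(&sp, gen); /* the old mark now looks fresh */
}
```

With a wider counter the same collision needs far more steps, but it is still
reachable, which is why the marks either have to be reset before the wrap or
removed from the instructions altogether.
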
Currently I have removed all dynamic global state from the instructions,
fixing any overflow issue at the cost of the slight overhead of needing
to look through the nlist states to prevent their re-addition. This
solution is still fast because it affects only the nlist + split runs,
so most other uses of regex don't suffer a big performance penalty.
This does not solve the ambiguity problem with multiple
continuous states, though. Finding a fast solution for continuous
ambiguity is the last thing preventing me from calling this regex engine
PERFECT and limitation free. This is yet to be invented; it
takes a great deal of genius and creativity to make new algorithms
or find improvements in what we already know.

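The stateless check described above amounts to a linear scan of the list
before adding to it. A minimal sketch, assuming a simplified list of
instruction pointers: struct thread_list, LIST_CAP, on_list() and add_once()
are hypothetical names for this example and are not the actual on##list macros.

```c
#define LIST_CAP 64 /* hypothetical fixed capacity for this sketch */

struct thread_list {
	const int *pc[LIST_CAP]; /* queued instruction pointers */
	int n;                   /* number of entries in use */
};

/* Linear scan: returns 1 if pc is already queued on the list. */
static int on_list(const struct thread_list *l, const int *pc)
{
	int i;

	for (i = 0; i < l->n; i++)
		if (l->pc[i] == pc)
			return 1;
	return 0;
}

/* Appends pc unless it is already present or the list is full.
 * Returns 1 if pc was appended, 0 otherwise. */
int add_once(struct thread_list *l, const int *pc)
{
	if (on_list(l, pc) || l->n == LIST_CAP)
		return 0;
	l->pc[l->n++] = pc;
	return 1;
}
```

The scan costs time proportional to the current list length, but it keeps no
marks inside the compiled program, so there is no counter that could overflow
and no state left over between uses of the same regex.
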
Author and License
==================
Licensed under the BSD license, just like the original re1.