# lexvm: A Lexical Analysis Virtual Machine

lexvm is a specialized virtual machine for lexical analysis (tokenization),
derived from Russ Cox's PikeVM implementation. Unlike general-purpose regex
engines, lexvm is optimized specifically for scanner/lexer workloads, with
deterministic, linear-time matching semantics and a streamlined instruction set.

## Negative Factors

Traditional regular expression engines struggle with lexical constructs that
must exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as
possible but offer no native way to express "match anything except if it
contains X". Non-greedy quantifiers (`*?`, `+?`) and negative lookahead
(`(?!...)`) attempt to address this, but they:

- Break linear-time guarantees.
- Are not regular operators.
- Introduce fragile rule ordering in lexers.
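
For illustration, here is how the two workarounds look in a conventional backtracking engine (Python's `re`, used here only as a stand-in; lexvm expresses the same idea with its negative factor operator instead):

```python
import re

# "Match a C comment, i.e. everything up to the first */" -- two
# conventional workarounds, neither of which is a regular operator:

# 1. Non-greedy quantifier: relies on backtracking, so the engine
#    loses its linear-time guarantee.
lazy = re.compile(r'/\*.*?\*/', re.S)

# 2. Negative lookahead: forbids */ from starting inside the body.
look = re.compile(r'/\*(?:(?!\*/).)*\*/', re.S)

s = '/* first */ code /* second */'
assert lazy.match(s).group() == '/* first */'
assert look.match(s).group() == '/* first */'
```

Both patterns produce the right answer here, but only by stepping outside the regular-operator model that guarantees linear-time matching.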

## Apostrophe `'`

The apostrophe `'` is syntactic sugar with no standalone meaning. Only when
followed by `*`, forming `E'*`, does it activate the negative factor operator:
match the longest token starting at the current position that does not contain
any substring matching `E`.
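
In conventional regex terms, `E'*` behaves roughly like the lookahead idiom `(?:(?!E).)*`, except that lexvm implements it as a genuine linear-time operator. A sketch of the semantics, using Python's `re` purely as executable notation (the `negative_factor` helper is ours, not lexvm API):

```python
import re

# Rough translation of the negative factor E'*: consume characters
# as long as no substring matching E begins at the current position.
# (Illustration only; lexvm does this without backtracking.)
def negative_factor(E, text):
    return re.match(r'(?:(?!%s).)*' % E, text, re.S).group()

# Longest prefix that does not contain the closing delimiter:
assert negative_factor(r'\*/', 'body */ tail') == 'body '
# Longest prefix that does not contain a quote:
assert negative_factor('"', 'abc"def') == 'abc'
```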

## Examples

### Escaped Strings

```
"(("|\\)'|\\.)*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"

"(\\.|("|\\)')*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
```
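
For comparison, roughly the same token written with standard regular operators (no negative factor) is the classic character-class form; a Python sketch, illustration only:

```python
import re

# Classic escaped-string token: an opening quote, then any run of
# ordinary characters or backslash escapes, then a closing quote.
string_tok = re.compile(r'"(?:[^"\\]|\\.)*"', re.S)

assert string_tok.match('""').group() == '""'
assert string_tok.match(r'"\\"').group() == r'"\\"'
assert string_tok.match(r'"lsk\"lsdk"').group() == r'"lsk\"lsdk"'
assert string_tok.match(r'"\"') is None  # unterminated string
```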

### C-Style Comments

```
\\\*(\*\\)'*\*\\ (120)
[1 == 1] \*lskd*\
[1 == 1] \****\
[1 == 1] \*\\*\
[0 == 0] \*ls*\ lsdk *\
```
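
Without a negative factor, the standard pure-regular encoding of this token is the well-known unrolled pattern; a Python sketch for ordinary `/* ... */` delimiters (the README's examples use `\* ... *\`):

```python
import re

# Pure-regular C-comment token (no lookahead, no lazy quantifier):
# body = a run of non-stars, then stars, repeated until the stars
# are finally followed by the closing slash.
comment_tok = re.compile(r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/')

s = '/* first */ rest /* second */'
assert comment_tok.match(s).group() == '/* first */'
assert comment_tok.match('/* a ** b */').group() == '/* a ** b */'
```

The unrolled form stays linear-time but is famously hard to read and write, which is the gap the negative factor operator is meant to close.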
## Removed Features

### Lazy Quantifiers

Superseded by the negative factor operator `E'*`, which provides stronger
exclusion semantics.
### Capture Groups

Lexers only need token boundaries, not submatch extraction. Removing capture
infrastructure simplifies the VM and eliminates bookkeeping overhead.
### Explicit Anchors

All patterns implicitly start with BOL, a natural fit for lexer rules that
always match from the current input position.
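
A minimal sketch of why implicit BOL fits lexing (Python illustration; the rule table and `lex` helper are hypothetical, not lexvm API): a scanner always matches at the current offset and never searches ahead.

```python
import re

# Hypothetical scanner loop: every rule is anchored at the current
# position (re.match with an explicit offset), mirroring implicit BOL.
RULES = [('NUM', re.compile(r'[0-9]+')),
         ('ID', re.compile(r'[a-z]+')),
         ('WS', re.compile(r'[ \t]+'))]

def lex(src):
    pos, out = 0, []
    while pos < len(src):
        for name, pat in RULES:
            m = pat.match(src, pos)
            if m:
                out.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise SyntaxError('no token at offset %d' % pos)
    return out

assert lex('abc 42') == [('ID', 'abc'), ('WS', ' '), ('NUM', '42')]
```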
### Word Boundaries

Lexical analysis relies on explicit character classes and negative factors for
token separation.
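
For instance, an identifier rule built from explicit classes terminates by itself wherever the class stops matching, so no `\b`-style assertion is needed (Python sketch, illustration only):

```python
import re

# Identifier token from explicit character classes; the match ends
# on its own at the first character outside the class.
ident = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

assert ident.match('foo+bar').group() == 'foo'
assert ident.match('x1 y2').group() == 'x1'
assert ident.match('42') is None  # digits cannot start an identifier
```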
### Syntax Check for Epsilon Loops

All inputs either compile to a valid NFA or fail with a semantic error.
## Further Reading

- https://research.swtch.com/sparse
- https://swtch.com/~rsc/regexp/regexp1.html

## Author and License

Licensed under the BSD license, just as the original re1.