From 9e12aaff2684734eb0492f329d24614d9242bc2e Mon Sep 17 00:00:00 2001
From: PedroEdiaz
Date: Mon, 9 Feb 2026 07:13:17 -0600
Subject: [PATCH] Add Readme

---

# lexvm: A Lexical Analysis Virtual Machine

lexvm is a specialized virtual machine for lexical analysis (tokenization), derived
from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm
is optimized specifically for scanner/lexer workloads, with deterministic,
linear-time matching semantics and a streamlined instruction set.

## Negative Factors

Traditional regular expression engines struggle with lexical constructs that must
exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as possible
but offer no native way to express "match anything that does not contain X".
Non-greedy quantifiers (`*?`, `+?`) and negative lookahead (`(?!...)`) attempt to
address this, but they:

- Break linear-time guarantees.
- Are not regular operators.
- Introduce fragile rule ordering in lexers.
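To make the problem concrete, here is a small demonstration using Python's backtracking `re` module (not lexvm) of how mainstream engines can only express "a body that does not contain `*/`" through non-regular lookahead:

```python
import re

# Goal: match a /* ... */ comment whose body must not contain "*/".
text = "/* first */ x = 1; /* second */"

# Greedy: ".*" overshoots to the LAST "*/", swallowing both comments.
greedy = re.match(r"/\*.*\*/", text)

# The common workaround is negative lookahead, a non-regular operator
# that can trigger backtracking: "((?!\*/).)*".
tempered = re.match(r"/\*((?!\*/).)*\*/", text)

print(greedy.group(0))    # "/* first */ x = 1; /* second */"
print(tempered.group(0))  # "/* first */"
```

The lookahead workaround gives the right answer here, but it leaves the regular-operator subset that makes linear-time matching possible.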
## Apostrophe `'`

The apostrophe `'` is syntactic sugar with no standalone meaning. Only when
followed by `*`, forming `E'*`, does it activate the negative factor operator:
match the longest token starting at the current position that does not contain
any substring matching `E`.

## Examples

### Escaped Strings

```
"(("|\\)'|\\.)*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"

"(\\.|("|\\)')*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
```

### C-Style Comments

```
\\\*(\*\\)'*\*\\ (120)
[1 == 1] \*lskd*\
[1 == 1] \****\
[1 == 1] \*\\*\
[0 == 0] \*ls*\ lsdk *\
```
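The semantics of `E'*` can be modeled in a few lines of ordinary code. This is a reference sketch only: the function name is made up, Python's `re` stands in for the inner match, and the quadratic scan is the naive spelling of what lexvm computes in a single linear pass.

```python
import re

def negative_factor_star(e: str, text: str) -> int:
    """Length of the longest prefix of `text` containing no
    substring that matches the regex `e` (a model of E'*)."""
    for end in range(len(text), -1, -1):
        if not re.search(e, text[:end]):
            return end
    return 0

# Body of a C-style comment: everything up to the first "*/".
assert negative_factor_star(r"\*/", "abc*/def") == 4   # "abc*" has no "*/"
assert negative_factor_star(r"\*/", "abcdef") == 6     # whole string is clean
```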
## Removed Features

### Lazy Quantifiers

Superseded by the negative factor operator `E'*`, which provides stronger
exclusion semantics.
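Lazy quantifiers minimize length; they do not exclude content. A short demonstration (Python's `re`, illustration only) that `.*?` will still swallow the "forbidden" substring whenever the rest of the pattern forces it:

```python
import re

# ".*?" is merely reluctant: if the overall match requires it,
# it will happily consume a "B" even though we meant to exclude it.
m = re.match(r"A(.*?)B!", "AxBxB!")
print(m.group(1))  # "xBx" -- contains the very "B" we tried to avoid
```

A negative factor, by contrast, rules the excluded substring out of the body entirely rather than merely preferring shorter matches.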
### Capture Groups

Lexers only need token boundaries, not submatch extraction. Removing capture
infrastructure simplifies the VM and eliminates bookkeeping overhead.
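With captures removed, a VM thread reduces to a bare program counter: no per-thread offset arrays and no reference-count bookkeeping. The following is an illustrative Python sketch of a capture-free Pike-style step loop, not lexvm's actual C implementation or instruction encoding:

```python
# Opcodes for a tiny capture-free PikeVM sketch.
CHAR, SPLIT, JMP, MATCH = range(4)

def run(prog, text):
    """Return True if the program matches the entire text."""
    def add(threads, pc, seen):
        # Follow epsilon transitions; a thread is just an int pc.
        if pc in seen:
            return
        seen.add(pc)
        op = prog[pc]
        if op[0] == JMP:
            add(threads, op[1], seen)
        elif op[0] == SPLIT:
            add(threads, op[1], seen)
            add(threads, op[2], seen)
        else:
            threads.append(pc)

    clist, seen = [], set()
    add(clist, 0, seen)
    for ch in text:
        nlist, seen = [], set()
        for pc in clist:
            op = prog[pc]
            if op[0] == CHAR and op[1] == ch:
                add(nlist, pc + 1, seen)
        clist = nlist
    return any(prog[pc][0] == MATCH for pc in clist)

# a*b  =>  0: SPLIT 1 3 ; 1: CHAR a ; 2: JMP 0 ; 3: CHAR b ; 4: MATCH
prog = [(SPLIT, 1, 3), (CHAR, "a"), (JMP, 0), (CHAR, "b"), (MATCH,)]
assert run(prog, "aaab")
assert not run(prog, "aac")
```

The `seen` set plays the role of the generation marks discussed for PikeVM: each program counter enters a step's thread list at most once, which is what keeps the whole match linear in the input length.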
### Explicit Anchors

All patterns implicitly start with BOL, a natural fit for lexer rules that always
match from the current input position.

### Word Boundaries

Lexical analysis relies on explicit character classes and negative factors for
token separation.
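Without `\b`, token separation falls out of maximal munch over explicit character classes. A sketch with a hypothetical rule set, using Python's `re` rather than lexvm:

```python
import re

# Hypothetical lexer rules: the longest match at the current position
# wins, so identifiers and numbers separate themselves without \b.
RULES = [("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
         ("NUM",   r"[0-9]+"),
         ("WS",    r"[ \t]+")]

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        kind, m = max(((k, re.compile(p).match(text, pos)) for k, p in RULES),
                      key=lambda km: km[1].end() if km[1] else -1)
        if m is None or m.end() == pos:
            raise SyntaxError(f"stuck at {pos}")
        if kind != "WS":
            out.append((kind, m.group(0)))
        pos = m.end()
    return out

assert tokenize("foo 42bar") == [("IDENT", "foo"), ("NUM", "42"), ("IDENT", "bar")]
```

Because the longest match wins at every position, `42bar` splits into a number and an identifier without any boundary assertion.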
### Syntax Check for Epsilon Loops

All inputs either compile to a valid NFA or fail with a semantic error.

## Further reading

https://research.swtch.com/sparse
https://swtch.com/~rsc/regexp/regexp1.html

## Author and License

Licensed under the BSD license, just as the original re1.