# lexvm: A Lexical Analysis Virtual Machine

lexvm is a specialized virtual machine for lexical analysis (tokenization),
derived from Russ Cox's PikeVM implementation. Unlike general-purpose regex
engines, lexvm is optimized specifically for scanner/lexer workloads, with
deterministic, linear-time matching semantics and a streamlined instruction set.

## Negative Factors

Traditional regular expression engines struggle with lexical constructs that
must exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as
possible but offer no native way to express "match anything except if it
contains X". Non-greedy quantifiers (`*?`, `+?`) and negative lookahead
(`(?!...)`) attempt to address this, but they:

- Break linear-time guarantees.
- Are not regular operators.
- Introduce fragile rule ordering in lexers.
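
For illustration, here is how the two workarounds look in a conventional backtracking engine (Python's `re`, used here only as a stand-in; lexvm expresses the same idea with its negative factor operator instead):

```python
import re

# "Match a C comment, i.e. everything up to the first */" -- two
# conventional workarounds, neither of which is a regular operator:

# 1. Non-greedy quantifier: relies on backtracking, so the engine
#    loses its linear-time guarantee.
lazy = re.compile(r'/\*.*?\*/', re.S)

# 2. Negative lookahead: forbids */ from starting inside the body.
look = re.compile(r'/\*(?:(?!\*/).)*\*/', re.S)

s = '/* first */ code /* second */'
assert lazy.match(s).group() == '/* first */'
assert look.match(s).group() == '/* first */'
```

Both patterns produce the right answer here, but only by stepping outside the regular-operator model that guarantees linear-time matching.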

## Apostrophe `'`

The apostrophe `'` is syntactic sugar with no standalone meaning. Only when
followed by `*`, forming `E'*`, does it activate the negative factor operator:
match the longest token starting at the current position that does not contain
any substring matching `E`.
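
In conventional regex terms, `E'*` behaves roughly like the lookahead idiom `(?:(?!E).)*`, except that lexvm implements it as a genuine linear-time operator. A sketch of the semantics, using Python's `re` purely as executable notation (the `negative_factor` helper is ours, not lexvm API):

```python
import re

# Rough translation of the negative factor E'*: consume characters
# as long as no substring matching E begins at the current position.
# (Illustration only; lexvm does this without backtracking.)
def negative_factor(E, text):
    return re.match(r'(?:(?!%s).)*' % E, text, re.S).group()

# Longest prefix that does not contain the closing delimiter:
assert negative_factor(r'\*/', 'body */ tail') == 'body '
# Longest prefix that does not contain a quote:
assert negative_factor('"', 'abc"def') == 'abc'
```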

## Examples

### Escaped Strings

```
"(("|\\)'|\\.)*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"

"(\\.|("|\\)')*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
```
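
For comparison, roughly the same token written with standard regular operators (no negative factor) is the classic character-class form; a Python sketch, illustration only:

```python
import re

# Classic escaped-string token: an opening quote, then any run of
# ordinary characters or backslash escapes, then a closing quote.
string_tok = re.compile(r'"(?:[^"\\]|\\.)*"', re.S)

assert string_tok.match('""').group() == '""'
assert string_tok.match(r'"\\"').group() == r'"\\"'
assert string_tok.match(r'"lsk\"lsdk"').group() == r'"lsk\"lsdk"'
assert string_tok.match(r'"\"') is None  # unterminated string
```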

### C-Style Comments

```
\\\*(\*\\)'*\*\\ (120)
[1 == 1] \*lskd*\
[1 == 1] \****\
[1 == 1] \*\\*\
[0 == 0] \*ls*\ lsdk *\
```
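
Without a negative factor, the standard pure-regular encoding of this token is the well-known unrolled pattern; a Python sketch for ordinary `/* ... */` delimiters (the README's examples use `\* ... *\`):

```python
import re

# Pure-regular C-comment token (no lookahead, no lazy quantifier):
# body = a run of non-stars, then stars, repeated until the stars
# are finally followed by the closing slash.
comment_tok = re.compile(r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/')

s = '/* first */ rest /* second */'
assert comment_tok.match(s).group() == '/* first */'
assert comment_tok.match('/* a ** b */').group() == '/* a ** b */'
```

The unrolled form stays linear-time but is famously hard to read and write, which is the gap the negative factor operator is meant to close.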
## Removed Features

### Lazy Quantifiers

Superseded by the negative factor operator `E'*`, which provides stronger
exclusion semantics.
### Capture Groups

Lexers only need token boundaries, not submatch extraction. Removing capture
infrastructure simplifies the VM and eliminates bookkeeping overhead.
### Explicit Anchors

All patterns implicitly start with BOL, a natural fit for lexer rules that
always match from the current input position.
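
A minimal sketch of why implicit BOL fits lexing (Python illustration; the rule table and `lex` helper are hypothetical, not lexvm API): a scanner always matches at the current offset and never searches ahead.

```python
import re

# Hypothetical scanner loop: every rule is anchored at the current
# position (re.match with an explicit offset), mirroring implicit BOL.
RULES = [('NUM', re.compile(r'[0-9]+')),
         ('ID', re.compile(r'[a-z]+')),
         ('WS', re.compile(r'[ \t]+'))]

def lex(src):
    pos, out = 0, []
    while pos < len(src):
        for name, pat in RULES:
            m = pat.match(src, pos)
            if m:
                out.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise SyntaxError('no token at offset %d' % pos)
    return out

assert lex('abc 42') == [('ID', 'abc'), ('WS', ' '), ('NUM', '42')]
```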
### Word Boundaries

Lexical analysis relies on explicit character classes and negative factors for
token separation.
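
For instance, an identifier rule built from explicit classes terminates by itself wherever the class stops matching, so no `\b`-style assertion is needed (Python sketch, illustration only):

```python
import re

# Identifier token from explicit character classes; the match ends
# on its own at the first character outside the class.
ident = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

assert ident.match('foo+bar').group() == 'foo'
assert ident.match('x1 y2').group() == 'x1'
assert ident.match('42') is None  # digits cannot start an identifier
```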
### Syntax Check for Epsilon Loops

All inputs either compile to a valid NFA or fail with a semantic error.
## Further Reading

- https://research.swtch.com/sparse
- https://swtch.com/~rsc/regexp/regexp1.html

## Author and License

Licensed under the BSD license, just as the original re1.