lexvm: A Lexical Analysis Virtual Machine
lexvm is a specialized virtual machine for lexical analysis (tokenization), derived from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm is optimized specifically for scanner/lexer workloads with deterministic, linear-time matching semantics and a streamlined instruction set.
Negative Factors
Traditional regular expression engines struggle with lexical constructs that must
exclude certain substrings. Greedy quantifiers (*, +) match as much as possible
but offer no native way to express "match anything except if it contains X".
Non-greedy quantifiers (*?, +?) and negative lookahead ((?!...)) attempt to
address this but:
- Break linear-time guarantees.
- Are not regular operators.
- Introduce fragile rule ordering in lexers
Apostrophe '
The apostrophe ' is syntactic sugar with no standalone meaning. Only when followed
by *, forming E'* does it activate the negative factor operator: Match the
longest token starting at the current position that does not
contain any substring matching E.
Examples
Escaped Strings
"(("|\\)'|\\.)*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
"(\\.|("|\\)')*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
C-Style Comments
\\\*(\*\\)'*\*\\ (120)
[1 == 1] \*lskd*\
[1 == 1] \****\
[1 == 1] \*\\*\
[0 == 0] \*ls*\ lsdk *\
Removed Features
Lazy Quantifiers
Superseded by the negative factor operator E'*, which provides stronger
exclusion semantics
Capture Groups
Lexers only need token boundaries—not submatch extraction. Removing capture infrastructure simplifies the VM and eliminates bookkeeping overhead.
Explicit anchors
All patterns implicitly start with BOL—a natural fit for lexer rules that always match from the current input position.
Word boundries:
Lexical analysis relies on explicit character classes and negative factors for token separation.
Syntax check for epsilon loops
All inputs either compile to a valid NFA or fail with a semantic error.
Further reading
https://research.swtch.com/sparse https://swtch.com/~rsc/regexp/regexp1.html
Author and License
licensed under BSD license, just as the original re1.