Files
lexvm/README
2026-02-09 07:15:35 -06:00

76 lines
2.3 KiB
Plaintext

# lexvm: A Lexical Analysis Virtual Machine
lexvm is a specialized virtual machine for lexical analysis (tokenization), derived
from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm
is optimized specifically for scanner/lexer workloads with deterministic,
linear-time matching semantics and a streamlined instruction set.
## Negative Factors
Traditional regular expression engines struggle with lexical constructs that must
exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as possible
but offer no native way to express "match anything except if it contains X".
Non-greedy quantifiers (`*?`, `+?`) and negative lookahead (`(?!...)`) attempt to
address this but:
- Break linear-time guarantees.
- Are not regular operators.
- Introduce fragile rule ordering in lexers
## Apostrophe `'`
The apostrophe ' is syntactic sugar with no standalone meaning. Only when followed
by *, forming `E'*` does it activate the negative factor operator: Match the
longest token starting at the current position that does not
contain any substring matching `E`.
## Examples
### Escaped Strings
```
"(("|\\)'|\\.)*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
"(\\.|("|\\)')*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
```
### C-Style Comments
```
\\\*(\*\\)'*\*\\ (120)
[1 == 1] \*lskd*\
[1 == 1] \****\
[1 == 1] \*\\*\
[0 == 0] \*ls*\ lsdk *\
```
## Removed Features
### Lazy Quantifiers
Superseded by the negative factor operator `E'*`, which provides stronger
exclusion semantics
### Capture Groups
Lexers only need token boundaries—not submatch extraction. Removing capture
infrastructure simplifies the VM and eliminates bookkeeping overhead.
### Explicit anchors
All patterns implicitly start with BOL—a natural fit for lexer rules that always
match from the current input position.
### Word boundries:
Lexical analysis relies on explicit character classes and negative factors for
token separation.
### Syntax check for epsilon loops
All inputs either compile to a valid NFA or fail with a semantic error.
## Further reading
https://research.swtch.com/sparse
https://swtch.com/~rsc/regexp/regexp1.html
## Author and License
licensed under BSD license, just as the original re1.