# lexvm: A Lexical Analysis Virtual Machine

lexvm is a specialized virtual machine for lexical analysis (tokenization), derived
from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm
is optimized specifically for scanner/lexer workloads with deterministic,
linear-time matching semantics and a streamlined instruction set.

## Negative Factors

Traditional regular expression engines struggle with lexical constructs that must
exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as possible
but offer no native way to express "match anything that does not contain X".
Non-greedy quantifiers (`*?`, `+?`) and negative lookahead (`(?!...)`) attempt to
address this, but they:

- Break linear-time guarantees.
- Are not regular operators.
- Introduce fragile rule ordering in lexers.

## Apostrophe `'`

The apostrophe `'` is syntactic sugar with no standalone meaning. Only when
followed by `*`, forming `E'*`, does it activate the negative factor operator:
match the longest token starting at the current position that does not contain
any substring matching `E`.

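The operator's meaning can be stated as executable reference code. The sketch below is an illustration in Python's `re`, not the VM's linear-time implementation, and the function name is hypothetical: it returns the longest prefix of the input that contains no substring matching `E`.

```python
import re

def negative_factor(E, text):
    """Reference semantics for E'* (a quadratic sketch, not the VM's
    linear-time implementation): return the longest prefix of `text`
    containing no substring that matches the regex E."""
    pat = re.compile(E)
    # Scan prefixes from longest to shortest; the first one with no
    # embedded match of E is the answer (the empty prefix always works).
    for end in range(len(text), -1, -1):
        if pat.search(text, 0, end) is None:
            return text[:end]

# E matches " or \ : stop before the first quote or backslash.
print(negative_factor(r'["\\]', 'abc"def'))     # -> abc
# E matches the comment closer *\ : stop just before it completes.
print(negative_factor(r'\*\\', 'lskd*\\ tail')) # -> lskd*
```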
## Examples

### Escaped Strings

```
"(("|\\)'|\\.)*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"

"(\\.|("|\\)')*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
```

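For this particular token the same language can also be written with a negated character class in a conventional engine. The Python check below is an independent cross-check, assuming the bracketed flags denote expected/actual whole-input match results; lexvm itself uses the negative factor form.

```python
import re

# Classic escaped-string pattern with a negated character class, used
# here only to reproduce the five expected rows above.
pat = re.compile(r'"(\\.|[^"\\])*"')

cases = [                    # (input text, expected whole-input match)
    ('""', True),
    ('"""', False),
    ('"\\"', False),         # "\"   -> dangling escape
    ('"\\\\"', True),        # "\\"
    ('"lsk\\"lsdk"', True),  # "lsk\"lsdk"
]
for text, expected in cases:
    assert (pat.fullmatch(text) is not None) == expected
print("all 5 rows reproduced")
```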
### C-Style Comments

```
\\\*(\*\\)'*\*\\ (120)
[1 == 1] \*lskd*\
[1 == 1] \****\
[1 == 1] \*\\*\
[0 == 0] \*ls*\ lsdk *\
```

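The comment language likewise has a classic workaround in conventional engines (`[^*]` alternated with star runs not followed by the closer). The Python sketch below cross-checks the four rows using that form with the document's `\*` / `*\` delimiters, again assuming the flags denote whole-input match results.

```python
import re

# Conventional (negative-factor-free) pattern for \* ... *\ comments,
# used only to cross-check the expected rows above.
pat = re.compile(r'\\\*([^*]|\*+[^*\\])*\*+\\')

cases = [                          # (input text, expected whole-input match)
    ('\\*lskd*\\', True),          # \*lskd*\
    ('\\****\\', True),            # \****\
    ('\\*\\\\*\\', True),          # \*\\*\
    ('\\*ls*\\ lsdk *\\', False),  # \*ls*\ lsdk *\  (trailing text)
]
for text, expected in cases:
    assert (pat.fullmatch(text) is not None) == expected
print("all 4 rows reproduced")
```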
## Removed Features

### Lazy Quantifiers

Superseded by the negative factor operator `E'*`, which provides stronger
exclusion semantics.

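For contrast, the dropped behaviour can be reproduced in a backtracking engine: a non-greedy quantifier also stops at the first terminator, but only by trial-and-error expansion rather than in guaranteed linear time. A small Python illustration, using the comment example from above:

```python
import re

# What lexvm removed: a lazy quantifier in a backtracking engine.
# It finds the shortest \* ... *\ comment, agreeing with the
# negative-factor result, but without a linear-time guarantee.
lazy = re.compile(r'\\\*.*?\*\\')
m = lazy.match('\\*ls*\\ lsdk *\\')
print(m.group())  # -> \*ls*\
```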
### Capture Groups

Lexers only need token boundaries, not submatch extraction. Removing capture
infrastructure simplifies the VM and eliminates bookkeeping overhead.

### Explicit Anchors

All patterns implicitly start with BOL, a natural fit for lexer rules that
always match from the current input position.

### Word Boundaries

Lexical analysis relies on explicit character classes and negative factors for
token separation.

### Syntax Check for Epsilon Loops

Patterns containing epsilon loops (repetition over a subexpression that can
match the empty string) are rejected up front, so all inputs either compile to
a valid NFA or fail with a semantic error.

## Further Reading

- https://research.swtch.com/sparse
- https://swtch.com/~rsc/regexp/regexp1.html

## Author and License

Licensed under the BSD license, like the original re1.