76 lines
2.3 KiB
Plaintext
76 lines
2.3 KiB
Plaintext
# lexvm: A Lexical Analysis Virtual Machine
|
||
lexvm is a specialized virtual machine for lexical analysis (tokenization), derived
|
||
from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm
|
||
is optimized specifically for scanner/lexer workloads with deterministic,
|
||
linear-time matching semantics and a streamlined instruction set.
|
||
|
||
## Negative Factors
|
||
Traditional regular expression engines struggle with lexical constructs that must
|
||
exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as possible
|
||
but offer no native way to express "match anything except if it contains X".
|
||
Non-greedy quantifiers (`*?`, `+?`) and negative lookahead (`(?!...)`) attempt to
|
||
address this but:
|
||
|
||
- Break linear-time guarantees.
|
||
- Are not regular operators.
|
||
- Introduce fragile rule ordering in lexers
|
||
|
||
## Apostrophe ´'´
|
||
The apostrophe ' is syntactic sugar with no standalone meaning. Only when followed
|
||
by *, forming `E'*` does it activate the negative factor operator: Match the
|
||
longest token starting at the current position that does not
|
||
contain any substring matching `E`.
|
||
|
||
## Examples
|
||
### Escaped Strings
|
||
```
|
||
"(("|\\)'|\\.)*" (164)
|
||
[1 == 1] ""
|
||
[0 == 0] """
|
||
[0 == 0] "\"
|
||
[1 == 1] "\\"
|
||
[1 == 1] "lsk\"lsdk"
|
||
|
||
"(\\.|("|\\)')*" (164)
|
||
[1 == 1] ""
|
||
[0 == 0] """
|
||
[0 == 0] "\"
|
||
[1 == 1] "\\"
|
||
[1 == 1] "lsk\"lsdk"
|
||
```
|
||
### C-Style Comments
|
||
```
|
||
\\\*(\*\\)'*\*\\ (120)
|
||
[1 == 1] \*lskd*\
|
||
[1 == 1] \****\
|
||
[1 == 1] \*\\*\
|
||
[0 == 0] \*ls*\ lsdk *\
|
||
```
|
||
|
||
## Removed Features
|
||
### Lazy Quantifiers
|
||
Superseded by the negative factor operator `E'*`, which provides stronger
|
||
exclusion semantics
|
||
|
||
### Capture Groups
|
||
Lexers only need token boundaries—not submatch extraction. Removing capture
|
||
infrastructure simplifies the VM and eliminates bookkeeping overhead.
|
||
|
||
### Explicit anchors
|
||
All patterns implicitly start with BOL—a natural fit for lexer rules that always
|
||
match from the current input position.
|
||
|
||
### Word boundries:
|
||
Lexical analysis relies on explicit character classes and negative factors for
|
||
token separation.
|
||
|
||
### Syntax check for epsilon loops
|
||
All inputs either compile to a valid NFA or fail with a semantic error.
|
||
|
||
## Further reading
|
||
https://research.swtch.com/sparse
|
||
https://swtch.com/~rsc/regexp/regexp1.html
|
||
|
||
## Author and License
|
||
licensed under BSD license, just as the original re1.
|