# lexvm: A Lexical Analysis Virtual Machine
lexvm is a specialized virtual machine for lexical analysis (tokenization), derived
from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm
is optimized specifically for scanner/lexer workloads with deterministic, 
linear-time matching semantics and a streamlined instruction set.

## Negative Factors
Traditional regular expression engines struggle with lexical constructs that must
exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as possible
but offer no native way to express "match anything except if it contains X". 
Non-greedy quantifiers (`*?`, `+?`) and negative lookahead (`(?!...)`) attempt to 
address this but:

- Break linear-time guarantees.
- Are not regular operators.
- Introduce fragile rule ordering in lexers.

## Apostrophe `'`
The apostrophe `'` is syntactic sugar with no standalone meaning. Only when followed
by `*`, forming `E'*`, does it activate the negative factor operator: match the
longest token starting at the current position that does not contain any substring
matching `E`.
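
For the special case where `E` is a literal string, the rule can be sketched in plain C.
This only illustrates the semantics and is not how lexvm implements the operator. The
function name `negfactor_len` is hypothetical, and returning 0 for an empty `E` is an
assumed convention.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the longest prefix of `s` that does not
 * contain the literal `e` as a substring. Models E'* for literal E only. */
size_t negfactor_len(const char *s, const char *e)
{
    size_t elen = strlen(e);
    const char *hit;

    if (elen == 0)
        return 0;               /* assumed convention: no prefix avoids "" */
    hit = strstr(s, e);
    if (hit == NULL)
        return strlen(s);       /* e never occurs: the whole input qualifies */
    /* Stop one character short of completing the first occurrence of e. */
    return (size_t)(hit - s) + elen - 1;
}
```

With `s` set to `ab*\cd` and `e` set to `*\`, the function returns 3, the length of `ab*`.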

## Examples
### Escaped Strings
```
"(("|\\)'|\\.)*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"

"(\\.|("|\\)')*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
```
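
As a cross-check of the expected results above, the same token can be recognized by a
hand-written C function. This is a sketch that assumes the whole input must match, as
the test results suggest. The name `is_escaped_string` is hypothetical and the code is
not part of lexvm.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference check: true if `s` is exactly one double-quoted
 * string in which a backslash escapes the character that follows it. */
bool is_escaped_string(const char *s)
{
    size_t i = 0;

    if (s[i++] != '"')
        return false;               /* must open with a quote */
    while (s[i] != '\0' && s[i] != '"') {
        if (s[i] == '\\') {
            if (s[i + 1] == '\0')
                return false;       /* dangling backslash at end of input */
            i += 2;                 /* skip the backslash and the escaped character */
        } else {
            i++;
        }
    }
    if (s[i] != '"')
        return false;               /* input ended before the closing quote */
    return s[i + 1] == '\0';        /* nothing may follow the closing quote */
}
```

The function accepts `""`, `"\\"` and `"lsk\"lsdk"`, and rejects `"""` and `"\"`, in line
with the results listed above.
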
### C-Style Comments
```
\\\*(\*\\)'*\*\\ (120)
[1 == 1] \*lskd*\
[1 == 1] \****\
[1 == 1] \*\\*\
[0 == 0] \*ls*\ lsdk *\
```

## Removed Features
### Lazy Quantifiers
Superseded by the negative factor operator `E'*`, which provides stronger
exclusion semantics.

### Capture Groups
Lexers only need token boundaries—not submatch extraction. Removing capture 
infrastructure simplifies the VM and eliminates bookkeeping overhead.

### Explicit anchors
All patterns implicitly start with BOL—a natural fit for lexer rules that always
match from the current input position.
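
A lexer driver built on anchored rules might look like the following sketch. Everything
here is hypothetical: the `rule_fn` type, the two toy rules, and `count_tokens` are
illustrative stand-ins, not lexvm's API, and choosing the longest match among rules is
an assumption.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical rule type: returns the length of the match that starts
 * exactly at `s`, or 0 if the rule does not match there. */
typedef size_t (*rule_fn)(const char *s);

static size_t match_digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_spaces(const char *s)
{
    size_t n = 0;
    while (s[n] == ' ')
        n++;
    return n;
}

/* Count tokens in `input`; returns -1 if no rule matches at some position. */
long count_tokens(const char *input)
{
    static const rule_fn rules[] = { match_digits, match_spaces };
    long count = 0;

    while (*input != '\0') {
        size_t best = 0;
        for (size_t r = 0; r < sizeof rules / sizeof rules[0]; r++) {
            size_t len = rules[r](input);   /* anchored at the current position */
            if (len > best)
                best = len;                 /* assumed policy: longest match wins */
        }
        if (best == 0)
            return -1;                      /* lexical error: nothing matches here */
        input += best;                      /* the next token starts right after */
        count++;
    }
    return count;
}
```

Because every rule is tried only at the current position, no rule needs a leading anchor,
and the driver never searches ahead for a match.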

### Word boundaries
Lexical analysis relies on explicit character classes and negative factors for 
token separation.

### Syntax check for epsilon loops
All inputs either compile to a valid NFA or fail with a semantic error.

## Further reading
- https://research.swtch.com/sparse
- https://swtch.com/~rsc/regexp/regexp1.html

## Author and License
Licensed under the BSD license, like the original re1.