# lexvm: A Lexical Analysis Virtual Machine
lexvm is a specialized virtual machine for lexical analysis (tokenization), derived
from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm
is optimized specifically for scanner/lexer workloads with deterministic, 
linear-time matching semantics and a streamlined instruction set.

## Negative Factors
Traditional regular expression engines struggle with lexical constructs that must
exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as possible
but offer no native way to say "match anything that does not contain X".
Non-greedy quantifiers (`*?`, `+?`) and negative lookahead (`(?!...)`) attempt to
address this, but they:

- Break linear-time guarantees.
- Are not regular operators.
- Introduce fragile rule ordering in lexers.

## Apostrophe `'`
The apostrophe `'` is syntactic sugar with no standalone meaning. Only when followed
by `*`, forming `E'*`, does it activate the negative factor operator: match the
longest token starting at the current position that does not contain any substring
matching `E`.
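
The rule above can be sketched in C for the simplest case, where `E` is a fixed literal string rather than a full pattern. This is only an illustration of the semantics, not lexvm's implementation: the function name `negative_factor_len` is hypothetical, and it relies on `strstr` instead of an NFA simulation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of lexvm.
 * Returns the length of the longest prefix of `input` that contains no
 * occurrence of the literal string `excluded`.
 * Assumption: `excluded` is a non-empty literal; for an empty `excluded`
 * the function returns 0. */
size_t negative_factor_len(const char *input, const char *excluded)
{
    size_t elen = strlen(excluded);
    if (elen == 0)
        return 0;

    const char *hit = strstr(input, excluded);
    if (hit == NULL)
        return strlen(input); /* no occurrence: the whole input qualifies */

    /* The first occurrence would be complete after (hit - input) + elen
     * characters, so the longest clean prefix stops one character earlier. */
    return (size_t)(hit - input) + elen - 1;
}
```

For example, `negative_factor_len("abc*/def", "*/")` returns 4: the prefix `abc*` is the longest one that does not yet contain `*/`.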

## Examples
### Escaped Strings
```
"(("|\\)'|\\.)*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"

"(\\.|("|\\)')*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
```
### C-Style Comments
```
\\\*(\*\\)'*\*\\ (120)
[1 == 1] \*lskd*\
[1 == 1] \****\
[1 == 1] \*\\*\
[0 == 0] \*ls*\ lsdk *\
```
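
The comment pattern can be read as three parts: the opener, a body that contains no closer, and the closer. The hand-written C check below mirrors that reading. It is a sketch under two assumptions that the listing does not state explicitly: each test line is matched as a whole, and the number in brackets is 1 for a match and 0 for a non-match. The function name `is_whole_comment` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of lexvm.
 * Returns true when `s` is: backslash-star, a body with no star-backslash
 * pair inside it, then star-backslash. */
bool is_whole_comment(const char *s)
{
    size_t n = strlen(s);

    if (n < 4)
        return false; /* opener and closer need two characters each */
    if (s[0] != '\\' || s[1] != '*')
        return false; /* missing opener */
    if (s[n - 2] != '*' || s[n - 1] != '\\')
        return false; /* missing closer */

    /* The body is s[2] .. s[n-3]; reject it if it contains the closer. */
    for (size_t i = 2; i + 1 < n - 2; i++) {
        if (s[i] == '*' && s[i + 1] == '\\')
            return false;
    }
    return true;
}
```

Under those assumptions the function accepts the first three test inputs and rejects the fourth, whose body already contains a closer.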

## Removed Features
### Lazy Quantifiers
Superseded by the negative factor operator `E'*`, which provides stronger
exclusion semantics.

### Capture Groups
Lexers only need token boundaries—not submatch extraction. Removing capture 
infrastructure simplifies the VM and eliminates bookkeeping overhead.

### Explicit Anchors
All patterns implicitly start with BOL—a natural fit for lexer rules that always
match from the current input position.
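
A minimal sketch of why implicit anchoring suits a lexer: every rule is tried at the current position only, and the scanner then advances by the matched length. Everything here is hypothetical and not taken from lexvm: the `rule_fn` type, the three hand-written rules standing in for compiled patterns, and the choice of the longest match among rules.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical rule type: length matched at the start of `s`, or 0. */
typedef size_t (*rule_fn)(const char *s);

static size_t match_ident(const char *s)
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_number(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_space(const char *s)
{
    size_t n = 0;
    while (s[n] == ' ')
        n++;
    return n;
}

/* Counts the tokens in `input`, or returns -1 if no rule matches at some
 * position. Each rule is only ever asked about the current position. */
long count_tokens(const char *input)
{
    static const rule_fn rules[] = { match_ident, match_number, match_space };
    size_t pos = 0;
    long count = 0;

    while (input[pos] != '\0') {
        size_t best = 0;
        for (size_t r = 0; r < sizeof rules / sizeof rules[0]; r++) {
            size_t len = rules[r](input + pos);
            if (len > best)
                best = len; /* assumption: prefer the longest match */
        }
        if (best == 0)
            return -1; /* no rule matches here */
        pos += best;
        count++;
    }
    return count;
}
```

Because no rule ever searches ahead of `pos`, an explicit start anchor would add nothing to any of the rules.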

### Word Boundaries
Lexical analysis relies on explicit character classes and negative factors for 
token separation.

### Syntax Check for Epsilon Loops
All inputs either compile to a valid NFA or fail with a semantic error.

## Further Reading
- https://research.swtch.com/sparse
- https://swtch.com/~rsc/regexp/regexp1.html

## Author and License
Licensed under the BSD license, like the original re1.