lexvm: A Lexical Analysis Virtual Machine

lexvm is a specialized virtual machine for lexical analysis (tokenization), derived from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm is optimized specifically for scanner/lexer workloads with deterministic, linear-time matching semantics and a streamlined instruction set.

Negative Factors

Traditional regular expression engines struggle with lexical constructs that must exclude certain substrings. Greedy quantifiers (*, +) match as much as possible but offer no native way to express "match anything unless it contains X". Non-greedy quantifiers (*?, +?) and negative lookahead ((?!...)) attempt to address this, but they:

  • Break linear-time guarantees.
  • Are not regular operators.
  • Introduce fragile rule ordering in lexers.

Apostrophe '

The apostrophe ' is syntactic sugar with no standalone meaning. Only when followed by *, forming E'*, does it activate the negative factor operator: match the longest token starting at the current position that does not contain any substring matching E.
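
The semantics above can be sketched in a few lines of Python. This is an illustration only, not lexvm's implementation: the function name longest_without_factor is hypothetical, and E is restricted to a set of non-empty literal strings so that the earliest occurrence can be located with str.find.

```python
def longest_without_factor(text, pos, factors):
    """Return the end index of the longest slice text[pos:end] that
    contains none of the literal strings in `factors` as a substring.

    Assumption for this sketch: E is a finite set of non-empty literals.
    lexvm itself accepts a full sub-expression for E.
    """
    limit = len(text)
    for factor in factors:
        start = text.find(factor, pos)
        if start != -1:
            # The slice may cover all but the last character of the
            # earliest occurrence, so it stops one short of that end.
            limit = min(limit, start + len(factor) - 1)
    return limit


# The comment closer *\ first occurs at offset 2, so the scan stops at 3.
sample = "ls*\\ lsdk"
end = longest_without_factor(sample, 0, ["*\\"])
token = sample[:end]
```

Every shorter prefix of that slice is also free of E, which is what allows a following sub-pattern, such as a closing delimiter, to match right after E'*.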

Examples

Escaped Strings

"(("|\\)'|\\.)*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"

"(\\.|("|\\)')*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"

C-Style Comments

\\\*(\*\\)'*\*\\ (120)
[1 == 1] \*lskd*\
[1 == 1] \****\
[1 == 1] \*\\*\
[0 == 0] \*ls*\ lsdk *\
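
The four results above can be reproduced with a small whole-string check. This sketch assumes that [1 == 1] marks an input expected to match and [0 == 0] one expected not to match. The helper name matches_comment is hypothetical, and the check is hard-wired to this one pattern rather than being a general matcher.

```python
def matches_comment(text, opener="\\*", closer="*\\"):
    """Whole-string check for: opener, then (closer)'*, then closer."""
    if len(text) < len(opener) + len(closer):
        return False
    if not (text.startswith(opener) and text.endswith(closer)):
        return False
    body = text[len(opener):len(text) - len(closer)]
    # Negative factor: the body must not contain the closing delimiter.
    return closer not in body


cases = [
    "\\*lskd*\\",
    "\\****\\",
    "\\*\\\\*\\",
    "\\*ls*\\ lsdk *\\",
]
results = [int(matches_comment(case)) for case in cases]
```

The last input is rejected because the text between the delimiters already contains *\, which is exactly the substring the negative factor excludes.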

Removed Features

Lazy Quantifiers

Superseded by the negative factor operator E'*, which provides stronger exclusion semantics.

Capture Groups

Lexers only need token boundaries—not submatch extraction. Removing capture infrastructure simplifies the VM and eliminates bookkeeping overhead.

Explicit Anchors

All patterns are implicitly anchored at the start (BOL), a natural fit for lexer rules, which always match from the current input position.
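
To illustrate why implicit anchoring suits a lexer, here is a minimal longest-match scanner loop. It uses Python's re module as a stand-in for the matching engine, which is an assumption for demonstration purposes: re is a backtracking engine, not lexvm, and the rule names and patterns below are hypothetical.

```python
import re

# Hypothetical token rules for illustration; these are not lexvm patterns.
RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("SPACE", re.compile(r"[ \t\n]+")),
    ("OP", re.compile(r"[-+*/=()]")),
]


def tokenize(text):
    """Split text into (rule name, lexeme) pairs, skipping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            # Pattern.match(text, pos) succeeds only if the match starts
            # exactly at pos, mirroring an implicit start anchor.
            m = pattern.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, end = best
        if name != "SPACE":
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens


tokens = tokenize("x1 = 42+y")
```

Because every rule is tried at the current offset only, no rule ever needs an explicit anchor, and the scanner advances by the length of the longest match.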

Word Boundaries

Lexical analysis relies on explicit character classes and negative factors for token separation.

Syntax Check for Epsilon Loops

All inputs either compile to a valid NFA or fail with a semantic error.

Further Reading

https://research.swtch.com/sparse
https://swtch.com/~rsc/regexp/regexp1.html

Author and License

Licensed under the BSD license, like the original re1.
