# lexvm: A Lexical Analysis Virtual Machine

lexvm is a specialized virtual machine for lexical analysis (tokenization), derived from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm is optimized specifically for scanner/lexer workloads, with deterministic, linear-time matching semantics and a streamlined instruction set.

## Negative Factors

Traditional regular expression engines struggle with lexical constructs that must exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as possible but offer no native way to express "match anything unless it contains X". Non-greedy quantifiers (`*?`, `+?`) and negative lookahead (`(?!...)`) attempt to address this, but they:

- Break linear-time guarantees.
- Are not regular operators.
- Introduce fragile rule ordering in lexers.

## Apostrophe `'`

The apostrophe `'` is syntactic sugar with no standalone meaning. Only when followed by `*`, forming `E'*`, does it activate the negative factor operator:

Match the longest token starting at the current position that does not contain any substring matching `E`.

## Examples

### Escaped Strings

```
"(("|\\)'|\\.)*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"

"(\\.|("|\\)')*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
```

### C-Style Comments

```
\\\*(\*\\)'*\*\\ (120)
[1 == 1] \*lskd*\
[1 == 1] \****\
[1 == 1] \*\\*\
[0 == 0] \*ls*\ lsdk *\
```

## Removed Features

### Lazy Quantifiers

Superseded by the negative factor operator `E'*`, which provides stronger exclusion semantics.

### Capture Groups

Lexers only need token boundaries, not submatch extraction. Removing capture infrastructure simplifies the VM and eliminates bookkeeping overhead.

### Explicit Anchors

All patterns implicitly start with BOL (beginning of line), a natural fit for lexer rules that always match from the current input position.

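To make the idea of matching from the current position concrete, here is a minimal C sketch of the negative factor semantics for the special case where `E` is a literal string. It is an illustration only and is not taken from lexvm: the names `neg_factor_len` and `match_comment` are hypothetical, and the sketch uses `strstr` rather than an NFA. `match_comment` mirrors the comment pattern from the examples, which uses `\*` and `*\` as delimiters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of lexvm): length of the longest prefix of
 * s that does not contain the literal e as a substring.
 * Assumption: e is non-empty. */
size_t neg_factor_len(const char *s, const char *e)
{
    assert(*e != '\0');
    const char *hit = strstr(s, e);
    if (hit == NULL)
        return strlen(s);   /* e never occurs: the whole rest qualifies */
    /* The first occurrence of e ends at offset (hit - s) + strlen(e), so
     * the longest prefix that still excludes it stops one char earlier. */
    return (size_t)(hit - s) + strlen(e) - 1;
}

/* Hypothetical helper: match one comment delimited by "\*" and "*\",
 * anchored at s. Returns the token length, or 0 if no comment starts at s.
 * Because "*\" cannot overlap with itself, the body that excludes the
 * closer always ends at the closer's first occurrence. */
size_t match_comment(const char *s)
{
    const char *open = "\\*";   /* backslash, star */
    const char *close = "*\\";  /* star, backslash */
    if (strncmp(s, open, strlen(open)) != 0)
        return 0;               /* anchored: the opener must be at s itself */
    const char *hit = strstr(s + strlen(open), close);
    if (hit == NULL)
        return 0;               /* unterminated comment */
    return (size_t)(hit - s) + strlen(close);
}
```

Under these assumptions, `match_comment` returns 8 for `\*lskd*\` and 6 for `\*ls*\ lsdk *\`, because the token ends at the first `*\`. A caller that needs the whole input to be a single comment would compare the result with `strlen` of the input.
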
### Word Boundaries

Lexical analysis relies on explicit character classes and negative factors for token separation.

### Syntax Check for Epsilon Loops

All inputs either compile to a valid NFA or fail with a semantic error.

## Further Reading

- https://research.swtch.com/sparse
- https://swtch.com/~rsc/regexp/regexp1.html

## Author and License

Licensed under the BSD license, just like the original re1.