# lexvm: A Lexical Analysis Virtual Machine lexvm is a specialized virtual machine for lexical analysis (tokenization), derived from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm is optimized specifically for scanner/lexer workloads with deterministic, linear-time matching semantics and a streamlined instruction set. ## Negative Factors Traditional regular expression engines struggle with lexical constructs that must exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as possible but offer no native way to express "match anything except if it contains X". Non-greedy quantifiers (`*?`, `+?`) and negative lookahead (`(?!...)`) attempt to address this but: - Break linear-time guarantees. - Are not regular operators. - Introduce fragile rule ordering in lexers ## Apostrophe ´'´ The apostrophe ' is syntactic sugar with no standalone meaning. Only when followed by *, forming `E'*` does it activate the negative factor operator: Match the longest token starting at the current position that does not contain any substring matching `E`. ## Examples ### Escaped Strings ``` "(("|\\)'|\\.)*" (164) [1 == 1] "" [0 == 0] """ [0 == 0] "\" [1 == 1] "\\" [1 == 1] "lsk\"lsdk" "(\\.|("|\\)')*" (164) [1 == 1] "" [0 == 0] """ [0 == 0] "\" [1 == 1] "\\" [1 == 1] "lsk\"lsdk" ``` ### C-Style Comments ``` \\\*(\*\\)'*\*\\ (120) [1 == 1] \*lskd*\ [1 == 1] \****\ [1 == 1] \*\\*\ [0 == 0] \*ls*\ lsdk *\ ``` ## Removed Features ### Lazy Quantifiers Superseded by the negative factor operator `E'*`, which provides stronger exclusion semantics ### Capture Groups Lexers only need token boundaries—not submatch extraction. Removing capture infrastructure simplifies the VM and eliminates bookkeeping overhead. ### Explicit anchors All patterns implicitly start with BOL—a natural fit for lexer rules that always match from the current input position. ### Word boundries: Lexical analysis relies on explicit character classes and negative factors for token separation. ### Syntax check for epsilon loops All inputs either compile to a valid NFA or fail with a semantic error. ## Further reading https://research.swtch.com/sparse https://swtch.com/~rsc/regexp/regexp1.html ## Author and License licensed under BSD license, just as the original re1.