Update Readme

2026-02-09 07:16:45 -06:00
parent 9e12aaff26
commit bfba8fb3c5
1 changed files with 1 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,75 @@
+# lexvm: A Lexical Analysis Virtual Machine
+lexvm is a specialized virtual machine for lexical analysis (tokenization), derived
+from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm
+is optimized specifically for scanner/lexer workloads with deterministic, 
+linear-time matching semantics and a streamlined instruction set.
+
+## Negative Factors
+Traditional regular expression engines struggle with lexical constructs that must
+exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as possible
+but offer no native way to express "match anything except if it contains X". 
+Non-greedy quantifiers (`*?`, `+?`) and negative lookahead (`(?!...)`) attempt to 
+address this but:
+
+- Break linear-time guarantees.
+- Are not regular operators.
+- Introduce fragile rule ordering in lexers
+
+## Apostrophe `'`
+The apostrophe ' is syntactic sugar with no standalone meaning. Only when followed
+by *, forming `E'*` does it activate the negative factor operator: Match the 
+longest token starting at the current position that does not 
+contain any substring matching `E`.
+
+## Examples
+### Escaped Strings
+```
+"(("|\\)'|\\.)*" (164)
+[1 == 1] ""
+[0 == 0] """
+[0 == 0] "\"
+[1 == 1] "\\"
+[1 == 1] "lsk\"lsdk"
+
+"(\\.|("|\\)')*" (164)
+[1 == 1] ""
+[0 == 0] """
+[0 == 0] "\"
+[1 == 1] "\\"
+[1 == 1] "lsk\"lsdk"
+```
+### C-Style Comments
+```
+\\\*(\*\\)'*\*\\ (120)
+[1 == 1] \*lskd*\
+[1 == 1] \****\
+[1 == 1] \*\\*\
+[0 == 0] \*ls*\ lsdk *\
+```
+
+## Removed Features
+### Lazy Quantifiers
+Superseded by the negative factor operator `E'*`, which provides stronger 
+exclusion semantics 
+
+### Capture Groups
+Lexers only need token boundaries—not submatch extraction. Removing capture 
+infrastructure simplifies the VM and eliminates bookkeeping overhead.
+
+### Explicit anchors
+All patterns implicitly start with BOL—a natural fit for lexer rules that always
+match from the current input position.
+
+### Word boundries:
+Lexical analysis relies on explicit character classes and negative factors for 
+token separation.
+
+### Syntax check for epsilon loops
+All inputs either compile to a valid NFA or fail with a semantic error.
+
+## Further reading
+https://research.swtch.com/sparse
+https://swtch.com/~rsc/regexp/regexp1.html
+
+## Author and License
+licensed under BSD license, just as the original re1.