Update Readme
75
README.md
Normal file
@@ -0,0 +1,75 @@
# lexvm: A Lexical Analysis Virtual Machine

lexvm is a specialized virtual machine for lexical analysis (tokenization), derived
from Russ Cox's PikeVM implementation. Unlike general-purpose regex engines, lexvm
is optimized specifically for scanner/lexer workloads with deterministic,
linear-time matching semantics and a streamlined instruction set.

## Negative Factors

Traditional regular expression engines struggle with lexical constructs that must
exclude certain substrings. Greedy quantifiers (`*`, `+`) match as much as possible
but offer no native way to express "match anything that does not contain X".
Non-greedy quantifiers (`*?`, `+?`) and negative lookahead (`(?!...)`) attempt to
address this, but they:

- Break linear-time guarantees.
- Are not regular operators.
- Introduce fragile rule ordering in lexers.

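The gap between "lazy" and "excludes X" is easy to reproduce in a backtracking engine. The sketch below uses Python's standard `re` module (not lexvm) and ordinary `/* ... */` delimiters purely as an illustration: a lazy quantifier prefers the shortest match, but it still runs past the terminator whenever the rest of the pattern demands it.

```python
import re

# Lazy body: "as little as possible", which is not "must not contain */".
LAZY_COMMENT = re.compile(r"/\*.*?\*/")

text = "/* a */ b */"

# With nothing required after the comment, the lazy body stops at the first "*/".
first = LAZY_COMMENT.match(text).group()

# Forced to consume the whole input, the same pattern lets the body
# swallow an inner "*/", so laziness does not express exclusion.
whole = LAZY_COMMENT.fullmatch(text)

print(repr(first))        # '/* a */'
print(whole is not None)  # True
```

Whether the lazy pattern "works" therefore depends on what follows it in the rule, which is one way the fragile rule ordering mentioned above can arise.
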
## Apostrophe `'`

The apostrophe `'` is syntactic sugar with no standalone meaning. Only when followed
by `*`, forming `E'*`, does it activate the negative factor operator: match the
longest token starting at the current position that does not
contain any substring matching `E`.

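The definition above can be read as a small reference procedure. The following is a brute-force sketch of that reading in Python, not lexvm's implementation: `negative_factor_end` is a hypothetical helper, it uses Python's `re` syntax for `E`, it ignores empty matches of `E` (an assumption), and it makes no attempt at linear time.

```python
import re


def negative_factor_end(excluded, text, pos=0):
    """Return the end index of the longest slice text[pos:end] that
    contains no non-empty substring matching `excluded`.

    Hypothetical reference helper; brute force, for illustration only.
    """
    pattern = re.compile(excluded)
    for end in range(pos + 1, len(text) + 1):
        # Does any substring that ends exactly at `end` match `excluded`?
        for start in range(pos, end):
            if pattern.fullmatch(text, start, end):
                # text[pos:end] would contain a match; stop one character short.
                return end - 1
    # No match anywhere: the whole remaining input qualifies.
    return len(text)


source = "ab*/cd"
end = negative_factor_end(r"\*/", source)
print(repr(source[:end]))  # 'ab*'
```

Here `E` is the two-character terminator `*/`: the longest prefix that does not contain it is `ab*`, one character short of completing the terminator.
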
## Examples

### Escaped Strings

```
"(("|\\)'|\\.)*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"

"(\\.|("|\\)')*" (164)
[1 == 1] ""
[0 == 0] """
[0 == 0] "\"
[1 == 1] "\\"
[1 == 1] "lsk\"lsdk"
```

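For readers more familiar with conventional regex syntax, the same inputs can be checked against the usual negated-character-class formulation of an escaped string. This sketch uses Python's `re` module rather than lexvm; the pattern is an assumed equivalent written for this comparison, and reading the leading `1`/`0` in the listing as "whole input matches" / "does not match" is also an assumption.

```python
import re

# Conventional formulation: any character except quote or backslash,
# or a backslash followed by any character.
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')

# (input, expected) pairs mirroring the listing above; raw strings keep
# the backslashes exactly as written there.
CASES = [
    (r'""', True),
    (r'"""', False),
    (r'"\"', False),
    (r'"\\"', True),
    (r'"lsk\"lsdk"', True),
]

results = [(text, STRING.fullmatch(text) is not None) for text, _ in CASES]

for (text, got), (_, expected) in zip(results, CASES):
    print(f"[{int(expected)} == {int(got)}] {text}")
```
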
### C-Style Comments

```
\\\*(\*\\)'*\*\\ (120)
[1 == 1] \*lskd*\
[1 == 1] \****\
[1 == 1] \*\\*\
[0 == 0] \*ls*\ lsdk *\
```

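Without a negative factor, the comment body has to be expanded by hand so that it can never complete the terminator. The sketch below shows that classical expansion in Python's `re` syntax, keeping the `\*` ... `*\` delimiters used in the listing above; the pattern is an assumed equivalent for comparison, not something taken from lexvm.

```python
import re

# Opening "\*", then a body made of non-star characters or runs of stars
# that are not followed by a backslash, then the closing stars and "\".
COMMENT = re.compile(r"\\\*(?:[^*]|\*+[^*\\])*\*+\\")

# (input, expected) pairs mirroring the listing above.
CASES = [
    (r"\*lskd*" + "\\", True),
    (r"\****" + "\\", True),
    (r"\*\\*" + "\\", True),
    (r"\*ls*\ lsdk *" + "\\", False),
]

outcomes = [COMMENT.fullmatch(text) is not None for text, _ in CASES]

for (text, expected), got in zip(CASES, outcomes):
    print(f"[{int(expected)} == {int(got)}] {text}")
```

The hand-expanded body is already hard to read for a two-character terminator, which is the kind of pattern `E'*` is meant to replace.
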
## Removed Features

### Lazy Quantifiers

Superseded by the negative factor operator `E'*`, which provides stronger
exclusion semantics.

### Capture Groups

Lexers only need token boundaries, not submatch extraction. Removing capture
infrastructure simplifies the VM and eliminates bookkeeping overhead.

### Explicit Anchors

All patterns implicitly start with BOL, a natural fit for lexer rules that always
match from the current input position.

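Why implicit anchoring suits a lexer can be seen in a minimal scanning loop. The sketch below is written against Python's `re` module, where `Pattern.match(text, pos)` only succeeds if the match begins exactly at `pos`; the rule names, the longest-match tie-breaking, and the `tokenize` helper are all assumptions made for illustration and are not part of lexvm.

```python
import re

# Hypothetical token rules for illustration only.
RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("SPACE", re.compile(r"[ \t]+")),
]


def tokenize(text):
    """Split `text` into (rule name, lexeme) pairs, longest match first."""
    pos = 0
    tokens = []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            # Anchored at `pos`: a rule never skips ahead to find a match.
            m = pattern.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"unexpected character at offset {pos}")
        name, end = best
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens


print(tokenize("x1 42"))
# [('IDENT', 'x1'), ('SPACE', ' '), ('NUMBER', '42')]
```

Every rule is tried from the current offset and nowhere else, so an explicit start anchor in the pattern would add nothing.
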
### Word Boundaries

Lexical analysis relies on explicit character classes and negative factors for
token separation, so word-boundary assertions are not needed.

### Syntax Check for Epsilon Loops

All inputs either compile to a valid NFA or fail with a semantic error.

## Further Reading

- https://research.swtch.com/sparse
- https://swtch.com/~rsc/regexp/regexp1.html

## Author and License

Licensed under the BSD license, like the original re1.