Files
lx/README.md
2026-01-09 12:37:19 -06:00

122 lines
4.0 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# C Regex-Based Lexer (`lx`)
A minimal, educational regular expression-based lexer implemented in pure C.
It supports basic regex constructs and can be used to tokenize input strings based on patterntoken rules.
---
## Features
- **Regex syntax support**:
- Literal characters: `a`, `b`, etc.
- Wildcard: `.`
- Quantifiers: `*` (zero or more), `+` (one or more), `?` (zero or one)
- Alternation: `|`
- Grouping: `( ... )`
- Character classes via escape sequences:
- `\d` → digits
- `\w` → word characters (letters)
- `\s` → whitespace
- Escaping special characters: `\\`, `\(`, `\)`, `\*`, etc.
- Negated patterns using trailing `'` (e.g., `a'` matches a rune **not** matched by `a`)
- **Tokenization**: Assign integer tokens to regex patterns.
- **Streaming input**: Processes input incrementally via `lx_state`.
---
## Partial Negation (`'`)
The lexer supports a custom operator: a trailing apostrophe (`'`) applied to a pattern `a`, written as `a'`. This implements *partial negation*—it matches and consumes exactly one character if that character is **not** matched by `a` at the current position. If `a` would match the next character, `a'` fails without consuming input.
Because the lexer does not use regex modifiers (like greedy/non-greedy flags), combinations such as `*?` and `+?` are treated as valid sequences of independent operators: `*` followed by `?`, or `+` followed by `?`. These are parsed as `(a*)?` and `(a+)?` respectively, not as non-greedy quantifiers. This design choice simplifies the parser and avoids ambiguity with the `'` operator, which always applies to the immediately preceding pattern fragment.
---
## Usage
### Types
```c
typedef struct {
struct {
char *p;
unsigned int line_number, line_offset;
} start, end;
int token;
} lx_state;
typedef void *lx_lexer;
```
### Functions
- `lx_state lx_init_str(char *s)`
Initialize a lexer state from a null-terminated string `s`.
- `void lx_append(lx_lexer *l, char *pattern, int token)`
Add a regex `pattern` that produces `token` when matched. Patterns are combined with earlier ones using union (`|`).
- `int lx_lex(lx_lexer p, lx_state *s)`
Attempt to match input starting at `s->end`. On success, sets `s->token` and updates position; returns `0`. Returns non-zero on error.
- `int lx_free(lx_lexer l)`
Free all memory associated with the lexer. **Note**: Currently leaks memory for patterns containing `++` due to unhandled cyclic fragment references.
---
## Example
```c
#include "main.h"
int main() {
lx_lexer lexer = NULL;
lx_append(&lexer, "abc", 100);
lx_append(&lexer, "\\d+", 200);
lx_state state = lx_init_str("123 abc");
while (*state.end.p) {
lx_lex(lexer, &state);
if (state.token) {
printf("Token %d: '%.*s'\n",
state.token,
(int)(state.end.p - state.start.p),
state.start.p);
} else {
state.end.p++; // skip unmatched char
}
}
lx_free(lexer);
return 0;
}
```
---
## Known Issues
- **Memory leak**: The implementation leaks memory when parsing regexes containing the `++` operator (e.g., `ab++c`). This occurs because the fragment graph creates cyclic references that are not properly tracked during deallocation.
- No built-in line/column tracking beyond initial state setup.
- Limited error reporting.
---
## Design Notes
- Built around **Thompsons NFA construction** using a stack-based fragment system.
- Uses **nullable transitions** and **patching** to wire together regex components.
- Supports **negation** via trailing `'` — a custom extension not found in standard regex.
---
## Testing
The behavior is validated against `test.txt`, which contains regex patterns paired with test strings and expected match results in the format:
```
[expected == actual] input_string
```
All tests pass except those involving pathological quantifier combinations like `++`, which expose the memory leak.