Add README.md
This commit is contained in:
121
README.md
Normal file
121
README.md
Normal file
@@ -0,0 +1,121 @@
|
|||||||
|
# C Regex-Based Lexer (`lx`)
|
||||||
|
|
||||||
|
A minimal, educational regular expression-based lexer implemented in pure C.
|
||||||
|
It supports basic regex constructs and can be used to tokenize input strings based on pattern–token rules.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Features
|
||||||
|
|
||||||
|
- **Regex syntax support**:
|
||||||
|
- Literal characters: `a`, `b`, etc.
|
||||||
|
- Wildcard: `.`
|
||||||
|
- Quantifiers: `*` (zero or more), `+` (one or more), `?` (zero or one)
|
||||||
|
- Alternation: `|`
|
||||||
|
- Grouping: `( ... )`
|
||||||
|
- Character classes via escape sequences:
|
||||||
|
- `\d` → digits
|
||||||
|
- `\w` → word characters (letters)
|
||||||
|
- `\s` → whitespace
|
||||||
|
- Escaping special characters: `\\`, `\(`, `\)`, `\*`, etc.
|
||||||
|
- Negated patterns using trailing `'` (e.g., `a'` matches a rune **not** matched by `a`)
|
||||||
|
- **Tokenization**: Assign integer tokens to regex patterns.
|
||||||
|
- **Streaming input**: Processes input incrementally via `lx_state`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Partial Negation (`'`)
|
||||||
|
|
||||||
|
The lexer supports a custom operator: a trailing apostrophe (`'`) applied to a pattern `a`, written as `a'`. This implements *partial negation*—it matches and consumes exactly one character if that character is **not** matched by `a` at the current position. If `a` would match the next character, `a'` fails without consuming input.
|
||||||
|
|
||||||
|
Because the lexer does not use regex modifiers (like greedy/non-greedy flags), combinations such as `*?` and `+?` are treated as valid sequences of independent operators: `*` followed by `?`, or `+` followed by `?`. These are parsed as `(a*)?` and `(a+)?` respectively, not as non-greedy quantifiers. This design choice simplifies the parser and avoids ambiguity with the `'` operator, which always applies to the immediately preceding pattern fragment.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
### Types
|
||||||
|
|
||||||
|
```c
|
||||||
|
typedef struct {
|
||||||
|
struct {
|
||||||
|
char *p;
|
||||||
|
unsigned int line_number, line_offset;
|
||||||
|
} start, end;
|
||||||
|
int token;
|
||||||
|
} lx_state;
|
||||||
|
|
||||||
|
typedef void *lx_lexer;
|
||||||
|
```
|
||||||
|
|
||||||
|
### Functions
|
||||||
|
|
||||||
|
- `lx_state lx_init_str(char *s)`
|
||||||
|
Initialize a lexer state from a null-terminated string `s`.
|
||||||
|
|
||||||
|
- `void lx_append(lx_lexer *l, char *pattern, int token)`
|
||||||
|
Add a regex `pattern` that produces `token` when matched. Patterns are combined with earlier ones using union (`|`).
|
||||||
|
|
||||||
|
- `int lx_lex(lx_lexer p, lx_state *s)`
|
||||||
|
Attempt to match input starting at `s->end`. On success, sets `s->token` and updates position; returns `0`. Returns non-zero on error.
|
||||||
|
|
||||||
|
- `int lx_free(lx_lexer l)`
|
||||||
|
Free all memory associated with the lexer. **Note**: Currently leaks memory for patterns containing `++` due to unhandled cyclic fragment references.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Example
|
||||||
|
|
||||||
|
```c
|
||||||
|
#include "main.h"
|
||||||
|
|
||||||
|
int main() {
|
||||||
|
lx_lexer lexer = NULL;
|
||||||
|
lx_append(&lexer, "abc", 100);
|
||||||
|
lx_append(&lexer, "\\d+", 200);
|
||||||
|
|
||||||
|
lx_state state = lx_init_str("123 abc");
|
||||||
|
while (*state.end.p) {
|
||||||
|
lx_lex(lexer, &state);
|
||||||
|
if (state.token) {
|
||||||
|
printf("Token %d: '%.*s'\n",
|
||||||
|
state.token,
|
||||||
|
(int)(state.end.p - state.start.p),
|
||||||
|
state.start.p);
|
||||||
|
} else {
|
||||||
|
state.end.p++; // skip unmatched char
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
lx_free(lexer);
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Known Issues
|
||||||
|
|
||||||
|
- **Memory leak**: The implementation leaks memory when parsing regexes containing the `++` operator (e.g., `ab++c`). This occurs because the fragment graph creates cyclic references that are not properly tracked during deallocation.
|
||||||
|
- No built-in line/column tracking beyond initial state setup.
|
||||||
|
- Limited error reporting.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Design Notes
|
||||||
|
|
||||||
|
- Built around **Thompson’s NFA construction** using a stack-based fragment system.
|
||||||
|
- Uses **nullable transitions** and **patching** to wire together regex components.
|
||||||
|
- Supports **negation** via trailing `'` — a custom extension not found in standard regex.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
The behavior is validated against `test.txt`, which contains regex patterns paired with test strings and expected match results in the format:
|
||||||
|
|
||||||
|
```
|
||||||
|
[expected == actual] input_string
|
||||||
|
```
|
||||||
|
|
||||||
|
All tests pass except those involving pathological quantifier combinations like `++`, which expose the memory leak.
|
||||||
Reference in New Issue
Block a user