From 581da6491463b8e449b4a068fa19c0fd4a41f6ca Mon Sep 17 00:00:00 2001 From: PedroEdiaz Date: Fri, 9 Jan 2026 12:37:19 -0600 Subject: [PATCH] Add README.md --- README.md | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 121 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..01e74cf --- /dev/null +++ b/README.md @@ -0,0 +1,121 @@ +# C Regex-Based Lexer (`lx`) + +A minimal, educational regular expression-based lexer implemented in pure C. +It supports basic regex constructs and can be used to tokenize input strings based on pattern–token rules. + +--- + +## Features + +- **Regex syntax support**: + - Literal characters: `a`, `b`, etc. + - Wildcard: `.` + - Quantifiers: `*` (zero or more), `+` (one or more), `?` (zero or one) + - Alternation: `|` + - Grouping: `( ... )` + - Character classes via escape sequences: + - `\d` → digits + - `\w` → word characters (letters) + - `\s` → whitespace + - Escaping special characters: `\\`, `\(`, `\)`, `\*`, etc. + - Negated patterns using trailing `'` (e.g., `a'` matches a rune **not** matched by `a`) +- **Tokenization**: Assign integer tokens to regex patterns. +- **Streaming input**: Processes input incrementally via `lx_state`. + +--- + +## Partial Negation (`'`) + +The lexer supports a custom operator: a trailing apostrophe (`'`) applied to a pattern `a`, written as `a'`. This implements *partial negation*—it matches and consumes exactly one character if that character is **not** matched by `a` at the current position. If `a` would match the next character, `a'` fails without consuming input. + +Because the lexer does not use regex modifiers (like greedy/non-greedy flags), combinations such as `*?` and `+?` are treated as valid sequences of independent operators: `*` followed by `?`, or `+` followed by `?`. These are parsed as `(a*)?` and `(a+)?` respectively, not as non-greedy quantifiers. This design choice simplifies the parser and avoids ambiguity with the `'` operator, which always applies to the immediately preceding pattern fragment. + +--- + +## Usage + +### Types + +```c +typedef struct { + struct { + char *p; + unsigned int line_number, line_offset; + } start, end; + int token; +} lx_state; + +typedef void *lx_lexer; +``` + +### Functions + +- `lx_state lx_init_str(char *s)` + Initialize a lexer state from a null-terminated string `s`. + +- `void lx_append(lx_lexer *l, char *pattern, int token)` + Add a regex `pattern` that produces `token` when matched. Patterns are combined with earlier ones using union (`|`). + +- `int lx_lex(lx_lexer p, lx_state *s)` + Attempt to match input starting at `s->end`. On success, sets `s->token` and updates position; returns `0`. Returns non-zero on error. + +- `int lx_free(lx_lexer l)` + Free all memory associated with the lexer. **Note**: Currently leaks memory for patterns containing `++` due to unhandled cyclic fragment references. + +--- + +## Example + +```c +#include "main.h" + +int main() { + lx_lexer lexer = NULL; + lx_append(&lexer, "abc", 100); + lx_append(&lexer, "\\d+", 200); + + lx_state state = lx_init_str("123 abc"); + while (*state.end.p) { + lx_lex(lexer, &state); + if (state.token) { + printf("Token %d: '%.*s'\n", + state.token, + (int)(state.end.p - state.start.p), + state.start.p); + } else { + state.end.p++; // skip unmatched char + } + } + + lx_free(lexer); + return 0; +} +``` + +--- + +## Known Issues + +- **Memory leak**: The implementation leaks memory when parsing regexes containing the `++` operator (e.g., `ab++c`). This occurs because the fragment graph creates cyclic references that are not properly tracked during deallocation. +- No built-in line/column tracking beyond initial state setup. +- Limited error reporting. + +--- + +## Design Notes + +- Built around **Thompson’s NFA construction** using a stack-based fragment system. +- Uses **nullable transitions** and **patching** to wire together regex components. +- Supports **negation** via trailing `'` — a custom extension not found in standard regex. + +--- + +## Testing + +The behavior is validated against `test.txt`, which contains regex patterns paired with test strings and expected match results in the format: + +``` +[expected == actual] input_string +``` + +All tests pass except those involving pathological quantifier combinations like `++`, which expose the memory leak.