Add README.md

2026-01-09 12:37:19 -06:00
parent 9c4acf9519
commit 581da64914
1 changed files with 121 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,121 @@
+# C Regex-Based Lexer (`lx`)
+
+A minimal, educational regular expression-based lexer implemented in pure C.  
+It supports basic regex constructs and can be used to tokenize input strings based on pattern–token rules.
+
+---
+
+## Features
+
+- **Regex syntax support**:
+  - Literal characters: `a`, `b`, etc.
+  - Wildcard: `.`
+  - Quantifiers: `*` (zero or more), `+` (one or more), `?` (zero or one)
+  - Alternation: `|`
+  - Grouping: `( ... )`
+  - Character classes via escape sequences:
+    - `\d` → digits
+    - `\w` → word characters (letters)
+    - `\s` → whitespace
+  - Escaping special characters: `\\`, `\(`, `\)`, `\*`, etc.
+  - Negated patterns using trailing `'` (e.g., `a'` matches a rune **not** matched by `a`)
+- **Tokenization**: Assign integer tokens to regex patterns.
+- **Streaming input**: Processes input incrementally via `lx_state`.
+
+---
+
+## Partial Negation (`'`)
+
+The lexer supports a custom operator: a trailing apostrophe (`'`) applied to a pattern `a`, written as `a'`. This implements *partial negation*—it matches and consumes exactly one character if that character is **not** matched by `a` at the current position. If `a` would match the next character, `a'` fails without consuming input.
+
+Because the lexer does not use regex modifiers (like greedy/non-greedy flags), combinations such as `*?` and `+?` are treated as valid sequences of independent operators: `*` followed by `?`, or `+` followed by `?`. These are parsed as `(a*)?` and `(a+)?` respectively, not as non-greedy quantifiers. This design choice simplifies the parser and avoids ambiguity with the `'` operator, which always applies to the immediately preceding pattern fragment.
+
+---
+
+## Usage
+
+### Types
+
+```c
+typedef struct {
+    struct {
+        char *p;
+        unsigned int line_number, line_offset;
+    } start, end;
+    int token;
+} lx_state;
+
+typedef void *lx_lexer;
+```
+
+### Functions
+
+- `lx_state lx_init_str(char *s)`  
+  Initialize a lexer state from a null-terminated string `s`.
+
+- `void lx_append(lx_lexer *l, char *pattern, int token)`  
+  Add a regex `pattern` that produces `token` when matched. Patterns are combined with earlier ones using union (`|`).
+
+- `int lx_lex(lx_lexer p, lx_state *s)`  
+  Attempt to match input starting at `s->end`. On success, sets `s->token` and updates position; returns `0`. Returns non-zero on error.
+
+- `int lx_free(lx_lexer l)`  
+  Free all memory associated with the lexer. **Note**: Currently leaks memory for patterns containing `++` due to unhandled cyclic fragment references.
+
+---
+
+## Example
+
+```c
+#include "main.h"
+
+int main() {
+    lx_lexer lexer = NULL;
+    lx_append(&lexer, "abc", 100);
+    lx_append(&lexer, "\\d+", 200);
+
+    lx_state state = lx_init_str("123 abc");
+    while (*state.end.p) {
+        lx_lex(lexer, &state);
+        if (state.token) {
+            printf("Token %d: '%.*s'\n",
+                   state.token,
+                   (int)(state.end.p - state.start.p),
+                   state.start.p);
+        } else {
+            state.end.p++; // skip unmatched char
+        }
+    }
+
+    lx_free(lexer);
+    return 0;
+}
+```
+
+---
+
+## Known Issues
+
+- **Memory leak**: The implementation leaks memory when parsing regexes containing the `++` operator (e.g., `ab++c`). This occurs because the fragment graph creates cyclic references that are not properly tracked during deallocation.
+- No built-in line/column tracking beyond initial state setup.
+- Limited error reporting.
+
+---
+
+## Design Notes
+
+- Built around **Thompson’s NFA construction** using a stack-based fragment system.
+- Uses **nullable transitions** and **patching** to wire together regex components.
+- Supports **negation** via trailing `'` — a custom extension not found in standard regex.
+
+---
+
+## Testing
+
+The behavior is validated against `test.txt`, which contains regex patterns paired with test strings and expected match results in the format:
+
+```
+[expected == actual] input_string
+```
+
+All tests pass except those involving pathological quantifier combinations like `++`, which expose the memory leak.