C Regex-Based Lexer (lx)
A minimal, educational regular expression-based lexer implemented in pure C.
It supports basic regex constructs and can be used to tokenize input strings based on pattern–token rules.
Features
- Regex syntax support:
- Literal characters:
a,b, etc. - Wildcard:
. - Quantifiers:
*(zero or more),+(one or more),?(zero or one) - Alternation:
| - Grouping:
( ... ) - Character classes via escape sequences:
\d→ digits\w→ word characters (letters)\s→ whitespace
- Escaping special characters:
\\,\(,\),\*, etc. - Negated patterns using trailing
'(e.g.,a'matches a rune not matched bya)
- Literal characters:
- Tokenization: Assign integer tokens to regex patterns.
- Streaming input: Processes input incrementally via
lx_state.
Partial Negation (')
The lexer supports a custom operator: a trailing apostrophe (') applied to a pattern a, written as a'. This implements partial negation—it matches and consumes exactly one character if that character is not matched by a at the current position. If a would match the next character, a' fails without consuming input.
Because the lexer does not use regex modifiers (like greedy/non-greedy flags), combinations such as *? and +? are treated as valid sequences of independent operators: * followed by ?, or + followed by ?. These are parsed as (a*)? and (a+)? respectively, not as non-greedy quantifiers. This design choice simplifies the parser and avoids ambiguity with the ' operator, which always applies to the immediately preceding pattern fragment.
Usage
Types
typedef struct {
struct {
char *p;
unsigned int line_number, line_offset;
} start, end;
int token;
} lx_state;
typedef void *lx_lexer;
Functions
-
lx_state lx_init_str(char *s)
Initialize a lexer state from a null-terminated strings. -
void lx_append(lx_lexer *l, char *pattern, int token)
Add a regexpatternthat producestokenwhen matched. Patterns are combined with earlier ones using union (|). -
int lx_lex(lx_lexer p, lx_state *s)
Attempt to match input starting ats->end. On success, setss->tokenand updates position; returns0. Returns non-zero on error. -
int lx_free(lx_lexer l)
Free all memory associated with the lexer. Note: Currently leaks memory for patterns containing++due to unhandled cyclic fragment references.
Example
#include "main.h"
int main() {
lx_lexer lexer = NULL;
lx_append(&lexer, "abc", 100);
lx_append(&lexer, "\\d+", 200);
lx_state state = lx_init_str("123 abc");
while (*state.end.p) {
lx_lex(lexer, &state);
if (state.token) {
printf("Token %d: '%.*s'\n",
state.token,
(int)(state.end.p - state.start.p),
state.start.p);
} else {
state.end.p++; // skip unmatched char
}
}
lx_free(lexer);
return 0;
}
Known Issues
- Memory leak: The implementation leaks memory when parsing regexes containing the
++operator (e.g.,ab++c). This occurs because the fragment graph creates cyclic references that are not properly tracked during deallocation. - No built-in line/column tracking beyond initial state setup.
- Limited error reporting.
Design Notes
- Built around Thompson’s NFA construction using a stack-based fragment system.
- Uses nullable transitions and patching to wire together regex components.
- Supports negation via trailing
'— a custom extension not found in standard regex.
Testing
The behavior is validated against test.txt, which contains regex patterns paired with test strings and expected match results in the format:
[expected == actual] input_string
All tests pass except those involving pathological quantifier combinations like ++, which expose the memory leak.