Files
lx/README.md
2026-01-09 12:37:19 -06:00

4.0 KiB
Raw Permalink Blame History

C Regex-Based Lexer (lx)

A minimal, educational regular expression-based lexer implemented in pure C.
It supports basic regex constructs and can be used to tokenize input strings based on patterntoken rules.


Features

  • Regex syntax support:
    • Literal characters: a, b, etc.
    • Wildcard: .
    • Quantifiers: * (zero or more), + (one or more), ? (zero or one)
    • Alternation: |
    • Grouping: ( ... )
    • Character classes via escape sequences:
      • \d → digits
      • \w → word characters (letters)
      • \s → whitespace
    • Escaping special characters: \\, \(, \), \*, etc.
    • Negated patterns using trailing ' (e.g., a' matches a rune not matched by a)
  • Tokenization: Assign integer tokens to regex patterns.
  • Streaming input: Processes input incrementally via lx_state.

Partial Negation (')

The lexer supports a custom operator: a trailing apostrophe (') applied to a pattern a, written as a'. This implements partial negation—it matches and consumes exactly one character if that character is not matched by a at the current position. If a would match the next character, a' fails without consuming input.

Because the lexer does not use regex modifiers (like greedy/non-greedy flags), combinations such as *? and +? are treated as valid sequences of independent operators: * followed by ?, or + followed by ?. These are parsed as (a*)? and (a+)? respectively, not as non-greedy quantifiers. This design choice simplifies the parser and avoids ambiguity with the ' operator, which always applies to the immediately preceding pattern fragment.


Usage

Types

typedef struct {
    struct {
        char *p;
        unsigned int line_number, line_offset;
    } start, end;
    int token;
} lx_state;

typedef void *lx_lexer;

Functions

  • lx_state lx_init_str(char *s)
    Initialize a lexer state from a null-terminated string s.

  • void lx_append(lx_lexer *l, char *pattern, int token)
    Add a regex pattern that produces token when matched. Patterns are combined with earlier ones using union (|).

  • int lx_lex(lx_lexer p, lx_state *s)
    Attempt to match input starting at s->end. On success, sets s->token and updates position; returns 0. Returns non-zero on error.

  • int lx_free(lx_lexer l)
    Free all memory associated with the lexer. Note: Currently leaks memory for patterns containing ++ due to unhandled cyclic fragment references.


Example

#include "main.h"

int main() {
    lx_lexer lexer = NULL;
    lx_append(&lexer, "abc", 100);
    lx_append(&lexer, "\\d+", 200);

    lx_state state = lx_init_str("123 abc");
    while (*state.end.p) {
        lx_lex(lexer, &state);
        if (state.token) {
            printf("Token %d: '%.*s'\n",
                   state.token,
                   (int)(state.end.p - state.start.p),
                   state.start.p);
        } else {
            state.end.p++; // skip unmatched char
        }
    }

    lx_free(lexer);
    return 0;
}

Known Issues

  • Memory leak: The implementation leaks memory when parsing regexes containing the ++ operator (e.g., ab++c). This occurs because the fragment graph creates cyclic references that are not properly tracked during deallocation.
  • No built-in line/column tracking beyond initial state setup.
  • Limited error reporting.

Design Notes

  • Built around Thompsons NFA construction using a stack-based fragment system.
  • Uses nullable transitions and patching to wire together regex components.
  • Supports negation via trailing ' — a custom extension not found in standard regex.

Testing

The behavior is validated against test.txt, which contains regex patterns paired with test strings and expected match results in the format:

[expected == actual] input_string

All tests pass except those involving pathological quantifier combinations like ++, which expose the memory leak.