Kyryl Melekhin e87fdef734 more killer tests
2021-08-03 18:58:18 +00:00
2021-07-13 21:07:53 +00:00
2021-08-03 18:33:41 +00:00
2021-08-03 18:58:18 +00:00

What is pikevm?
==============

re1 (http://code.google.com/p/re1/) is "toy regular expression implementation"
by Russel Cox, featuring simplicity and minimal code size unheard of in other
implementations. re2 (http://code.google.com/p/re2/) is "an efficient,
principled regular expression library" by the same author. It is robust,
full-featured, and ... bloated, comparing to re1.

This is implementation of pikevm based on re1.5 which adds features required for
minimalistic real-world use, while sticking to the minimal code size and
memory use.
https://github.com/pfalcon/re1.5

Why?
====
Pikevm guarantees that any input regex will scale O(n) with the size of the
string, thus making it the fastest regex implementation. There is no backtracking
that usually expodes to O(n^k) time and space where k is some constant.

Features
========

* UnLike re1.5, here is only pikevm, one file easy to use.
* Unlike re1.5, regexes is compiled to type sized code rather than bytecode,
eliviating the problem of byte overflow in splits/jmps on large regexes. 
Currently the type used is int, and every atom in compiled code is aligned
to that.
* Matcher does not take size of string as param, it checks for '\0' instead,
so that the user does not need to waste time taking strlen()
* Support for quoted chars in regex.
* Support for ^, $ assertions in regex.
* Support for repetition operator {n} and {n,m}.
* Support for Unicode (UTF-8).
* Unlike other engines, the output is byte level offset. (Which is more useful)
* Support for wordend & wordbeg assertions
- Some limitations for word assertions are meta chars like spaces being used
in for expression itself, for example "\< abc" should match " abc" exactly at
that space word boundary but it won't. It's possible to fix this, but it would
require rsplit before word assert, and some dirty logic to check that the character
or class is a space we want to match not assert at. But the code for it was too
dirty and I scrapped it. Syntax for word assertions are like posix C library, not
the pcre "\b" which can be used both in front or back of the word, because there is
no distinction, it makes the implementation potentially even uglier.


TODO
====

* Support for matching flags like case-insensitive
* maybe add lookaround, ahead, behind

Author and License
==================
licensed under BSD license, just as the original re1.
Description
A specialized virtual machine for lexical analysis (tokenization), derived from Russ Cox's PikeVM implementation.
Readme 152 KiB
Languages
C 53.2%
Shell 46.8%