What is pikevm?
===============

re1 (http://code.google.com/p/re1/) is a "toy regular expression implementation"
by Russ Cox, featuring simplicity and minimal code size unheard of in other
implementations. re2 (http://code.google.com/p/re2/) is "an efficient,
principled regular expression library" by the same author. It is robust,
full-featured, and ... bloated, compared to re1.

This is an implementation of pikevm based on re1.5, which adds the features
required for minimalistic real-world use while sticking to minimal code size
and memory use.
https://github.com/pfalcon/re1.5

Why?
====
Pikevm guarantees that, for any given regex, matching time scales O(n) with
the size of the input string, which makes it one of the fastest approaches to
regex matching in the worst case. There is no backtracking, which in other
engines can explode to exponential time on pathological patterns.

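To illustrate the linear bound, here is a minimal sketch of a Pike-style
lockstep simulation. It is not pikevm's actual code: the Inst layout, the
opcode names, the MAXPROG limit, the steps counter and the hand-assembled
program for (a*)*b are all assumptions made for this example. Each input
character is examined once, and the seen[] array ensures that an instruction
is added at most once per character, so the work is bounded by
(program length) x (string length).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical instruction set for this sketch (not pikevm's encoding). */
enum { CHAR, SPLIT, JMP, MATCH };
typedef struct { int op; int c; int x; int y; } Inst;

#define MAXPROG 32

static long steps;  /* number of instructions visited, for illustration */

/* Hand-assembled program for (a*)*b, a classic backtracking worst case. */
static const Inst prog_demo[] = {
    { SPLIT, 0, 1, 5 },    /* 0: enter the outer loop or skip to 'b' */
    { SPLIT, 0, 2, 4 },    /* 1: enter the inner loop or leave it    */
    { CHAR, 'a', 0, 0 },   /* 2                                      */
    { JMP, 0, 1, 0 },      /* 3: back to the inner loop              */
    { JMP, 0, 0, 0 },      /* 4: back to the outer loop              */
    { CHAR, 'b', 0, 0 },   /* 5                                      */
    { MATCH, 0, 0, 0 },    /* 6                                      */
};

/* Add pc to the list, following JMP/SPLIT so that the list only holds
 * CHAR and MATCH. seen[] keeps every pc unique within one step, which
 * also stops empty loops such as (a*)* from recursing forever. */
static void addthread(const Inst *prog, int *list, int *n,
                      unsigned char *seen, int pc)
{
    if (seen[pc])
        return;
    seen[pc] = 1;
    steps++;
    switch (prog[pc].op) {
    case JMP:
        addthread(prog, list, n, seen, prog[pc].x);
        break;
    case SPLIT:
        addthread(prog, list, n, seen, prog[pc].x);
        addthread(prog, list, n, seen, prog[pc].y);
        break;
    default:
        list[(*n)++] = pc;
        break;
    }
}

/* Returns 1 if prog matches some prefix of the NUL-terminated string s. */
static int pike_match(const Inst *prog, int len, const char *s)
{
    int a[MAXPROG], b[MAXPROG];
    int *clist = a, *nlist = b, *tmp;
    int cn = 0, nn, i;
    unsigned char seen[MAXPROG] = { 0 };

    assert(len <= MAXPROG);
    addthread(prog, clist, &cn, seen, 0);
    for (;; s++) {
        memset(seen, 0, sizeof seen);
        nn = 0;
        for (i = 0; i < cn; i++) {
            const Inst *in = &prog[clist[i]];
            if (in->op == MATCH)
                return 1;
            /* CHAR: advance this thread only if the character matches */
            if (*s != '\0' && in->c == (unsigned char)*s)
                addthread(prog, nlist, &nn, seen, clist[i] + 1);
        }
        if (*s == '\0' || nn == 0)
            return 0;
        tmp = clist; clist = nlist; nlist = tmp;
        cn = nn;
    }
}
```

Calling pike_match(prog_demo, 7, s) on a long run of 'a' characters with no
trailing 'b' visits only a handful of instructions per character, whereas a
naive backtracking matcher would retry every way of splitting the run between
the two nested loops.
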
Features
========

* Unlike re1.5, this is only pikevm, in one file that is easy to use.
* Unlike re1.5, regexes are compiled to type-sized code rather than bytecode,
  alleviating the problem of byte overflow in splits/jmps on large regexes.
  Currently the type used is int, and every atom in compiled code is aligned
  to that.
* The matcher does not take the size of the string as a parameter; it checks
  for '\0' instead, so that the user does not need to waste time calling
  strlen().
* Highly optimized source code, probably 2x faster than re1.5.
* Support for the named character classes \d \D \s \S \w \W was dropped
  because it is redundant.
* Support for quoted chars in regex.
* Support for ^, $ assertions in regex.
* Support for the repetition operators {n} and {n,m}.
* Support for Unicode (UTF-8).
* Unlike other engines, the output is a byte-level offset, which is more
  useful.
* Support for wordend & wordbeg assertions.
  - A limitation of word assertions concerns meta chars such as spaces used
    in the expression itself. For example, "\< abc" should match " abc"
    exactly at that space word boundary, but it won't. It is possible to fix
    this, but it would require an rsplit before the word assert and some
    dirty logic to check that the character or class is a space we want to
    match, not assert at. The code for it was too dirty, so I scrapped it.
  - The syntax for word assertions follows the POSIX C library, not the PCRE
    "\b". Because "\b" can be used at both the front and the back of a word,
    with no distinction between the two, it makes the implementation
    potentially even uglier.

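The byte-overflow problem behind the type-sized code can be sketched as
follows. This is an illustration only, not pikevm's real encoding: the opcode
names, the cell layout (an opcode followed by one operand) and the helper
names are assumptions. The point is that a relative jump stored in one signed
byte cannot exceed 127, while the same operand stored in an int cell can.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical opcodes for this sketch; pikevm's real set may differ. */
enum { OP_CHAR, OP_JMP, OP_MATCH };

/* Would this relative offset fit in a one-byte (signed char) operand? */
static int fits_in_byte(int rel)
{
    return rel >= SCHAR_MIN && rel <= SCHAR_MAX;
}

/*
 * Emit "JMP over `count` CHAR instructions, then MATCH" into int cells.
 * Every atom (opcode or operand) occupies exactly one int, so the jump
 * operand can hold any offset an int can represent.
 * Returns the number of cells written.
 */
static int emit_skip(int *code, int cap, int count, int c)
{
    int pc = 0, i;

    assert(count >= 0 && cap >= 2 * count + 3);
    code[pc++] = OP_JMP;
    code[pc++] = 2 * count;  /* offset in cells, relative to the next instruction */
    for (i = 0; i < count; i++) {
        code[pc++] = OP_CHAR;
        code[pc++] = c;
    }
    code[pc++] = OP_MATCH;
    return pc;
}

/* Follow the JMP at cell 0 and return the index of the cell it lands on. */
static int jump_target(const int *code)
{
    assert(code[0] == OP_JMP);
    return 2 + code[1];
}
```

With 100 skipped CHAR instructions, emit_skip() stores an offset of 200
cells. fits_in_byte(200) reports that this would not fit in a signed byte,
yet jump_target() still lands on the MATCH cell because the operand lives in
a full int.
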
NOTES
=====
Currently submatch tracking allocates a fixed constant amount of memory;
however, it is theoretically possible to compute the worst-case memory usage
at compile time and get rid of the constant. If this note remains, that means
I did not find an efficient way to calculate the worst case.

TODO
====

* Support for matching flags like case-insensitive.
* Maybe add lookaround (lookahead, lookbehind).

Author and License
==================
Licensed under the BSD license, just as the original re1.