readme: more info
@@ -84,6 +84,38 @@ it would work. The algorithm requires O(mt) memory and O(nm^2t) time
(assuming a worst-case optimal closure algorithm), where n is the
length of the input, m is the size of the RE and t is the number of submatch
groups and subexpressions that contain them."

Research has shown that it is possible to disambiguate an NFA in polynomial
time, but it brings serious performance issues on non-ambiguous inputs.

The branch "disambiguate_paths" on this repo shows what is being done to
solve it and the potential performance costs. In short, it requires tracking
the parent of every state added to nlist from clist. If the state from nlist
matches the consumer, the alternative clist state related to that nlist
state gets discarded and the nsub ref can be decremented (freed). The reason
this problem does not exist for non-ambiguous regexes is that the
alternative clist state will never match, due to the next state having a
different consumer; no extra handling is needed and it gets freed normally.
I decided not to apply this solution here because I think most use cases
for regex are not ambiguous, take for example the regex "a{10000}". If you
try matching 10000 'a' characters in a row like that, the stack usage will
jump up to 10000*(subsize), though it will never exceed the size of the
regex. The number of NFA states will also increase by the same amount, so
at character 9999 you will find 9999 redundant nlist states. That degrades
performance linearly, and it will be very slow compared to an unlimited
regex like "a+". The cost of this solution is somewhere around a 2% general
performance decrease (broadly), but a magnitude of complexity decrease for
ambiguous cases; for example, matching 64 characters went down from 30 to 9
microseconds.
Another solution to this problem would be to determine the ambiguous paths
at compile time and flag the inner states as ambiguous ahead of time.
Still, this can't avoid having a loop through the alt states, as their
positioning in clist can't be precomputed due to the dynamic changes.

(Comment about O(mt) memory complexity)
This worst case scenario can only happen on ambiguous input, which is why
the nsubs size is set to half a MB just in case; this can match 5000000
ambiguous consumers (char, class, any) assuming t is 1. In practice there
@@ -91,8 +123,10 @@ is almost never a situation where someone wants to search using regex this
large. Use of alloca() instead of VLA could remove this limit, I just wish
it was standardized. If you ever wondered about a situation where alloca
is a must, this is the algorithm.
Most of the time memory usage is very low, and the space complexity for a
non-ambiguous regex is O(nt), where n is the number of alternate paths in
the regex and t is the number of submatch groups.

This pikevm features an improved submatch extraction algorithm based on
Russ Cox's original design.