From d0187d01be490cb857f7e91c9663064c10e644c2 Mon Sep 17 00:00:00 2001
From: Kyryl Melekhin
Date: Sun, 17 Oct 2021 22:31:00 +0000
Subject: [PATCH] readme: more info

---
 README | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/README b/README
index 157337d..f14952b 100644
--- a/README
+++ b/README
@@ -84,6 +84,38 @@ it would work.
 The algorithm requires O(mt) memory and O(nm^2t) time (assuming a
 worst-case optimal closure algorithm), where n is the length of
 input, m it the size of RE and t is the number of submatch groups
 and subexpressions that contain them."
+
+Research has shown that it is possible to disambiguate an NFA in
+polynomial time, but it brings serious performance issues on
+non-ambiguous inputs. The branch "disambiguate_paths" in this repo
+shows what is being done to solve it and the potential performance
+costs. In short, it requires tracking the parent of every state
+added to nlist from clist. If the state from nlist matches the
+consumer, the alternative clist state related to that nlist state
+gets discarded and the nsub ref can be decremented (freed). The
+reason this problem does not exist for non-ambiguous regexes is
+that the alternative clist state will never match, since the next
+state has a different consumer; it gets freed normally without any
+extra handling. I decided not to apply this solution here because
+I think most use cases for regex are not ambiguous, unlike, say,
+the regex "a{10000}". If you try matching 10000 'a' characters in
+a row with it, the stack usage will jump up to 10000*(subsize),
+though it will never exceed the size of the regex. The number of
+NFA states will also increase by the same amount, so at character
+9999 you will find 9999 redundant nlist states. That degrades
+performance only linearly, but it is still very slow compared to
+an unlimited regex like "a+".
+The cost of this solution is roughly a 2% general performance
+decrease, but an order-of-magnitude complexity decrease for
+ambiguous cases; for example, matching 64 characters went down
+from 30 to 9 microseconds. Another solution to this problem would
+be to determine the ambiguous paths at compile time and flag the
+inner states as ambiguous ahead of time. Still, this cannot avoid
+a loop through the alt states, as their positions in clist cannot
+be precomputed due to the dynamic changes.
+
+(Comment about O(mt) memory complexity)
 This worst case scenario can only happen on ambiguous input, that
 is why nsubs size is set to half a MB just in case, this can match
 5000000 ambiguous consumers (char, class, any) assuming t is 1.
 In practice there
@@ -91,8 +123,10 @@
 is almost never a situation where someone wants to search using
 regex this large. Use of alloca() instead of VLA, could remove
 this limit, I just wish it was standardized. If you ever wondered
 about a situation where alloca is a must, this is the algorithm.
-Research has shown that it is possible to disambiguate NFA in polynomial time
-but it brings serious performance issues on non ambiguous inputs.
+Most of the time memory usage is very low and the space
+complexity for a non-ambiguous regex is O(nt), where n is the
+number of alternate paths in the regex and t is the number of
+submatch groups.
 This pikevm features an improved submatch extraction algorithm
 based on Russ Cox's original design.