update README

This commit is contained in:
Pascal Kuthe 2023-08-03 02:34:49 +02:00
parent 7a432aa051
commit 6dcfb41545
No known key found for this signature in database
GPG Key ID: D715E8655AE166A6
2 changed files with 19 additions and 12 deletions


@@ -1,6 +1,12 @@
# Nucleo
An optimized rust port of the fzf fuzzy matching algorithm
Nucleo is a fast fuzzy matcher written in Rust, aimed at performance.
It's not intended to function as a standalone tool; instead it primarily targets library use cases (most notably embedding within the helix editor). One of its primary goals (and a distinguishing factor compared to `skim`/the fuzzy-matcher crate) is its performance: the matching algorithm can be
10 times faster than the fuzzy-matcher crate in many cases.
TODO show results
Furthermore, nucleo aims to offer a simple high-level API that makes both streaming and matching on a background thread (pool) easy.
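As a rough illustration of what "fuzzy matching" means here (this is a minimal sketch, not nucleo's actual matcher, which additionally computes an optimal score with an affine-gap dynamic program):

```rust
/// Returns true if `needle` appears in `haystack` as a case-insensitive
/// subsequence -- the basic acceptance test every fuzzy matcher performs
/// before any scoring. Purely illustrative.
fn is_fuzzy_match(needle: &str, haystack: &str) -> bool {
    let mut needle_chars = needle.chars().map(|c| c.to_ascii_lowercase());
    let mut next = match needle_chars.next() {
        Some(c) => c,
        None => return true, // an empty needle matches everything
    };
    for h in haystack.chars().map(|c| c.to_ascii_lowercase()) {
        if h == next {
            match needle_chars.next() {
                Some(c) => next = c,
                None => return true, // all needle chars were found in order
            }
        }
    }
    false
}

fn main() {
    assert!(is_fuzzy_match("fbb", "foo_bar_baz"));
    assert!(!is_fuzzy_match("fzf", "foo_bar_baz"));
    println!("ok");
}
```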
## Notes:
@@ -18,17 +24,18 @@ An optimized rust port of the fzf fuzzy matching algorithm
* we aggressively prefilter (especially for ASCII, but to a lesser extent also for Unicode) to ensure we reject non-matching haystacks as fast as possible. When fuzzy matching large lists, most haystacks usually will not match, so having a very quick reject path is good
* for very long haystacks we fall back to a greedy matcher which runs in `O(N)` time (and `O(1)` space) to avoid the `O(mn)` blowup. This is fzf's old algorithm and yields decent (but not great) results.
* There is a misunderstanding in both skim and fzf. Basically, what they do is give a bonus to each character (like word boundaries). That makes sense and is reasonable, but the problem is that they use the **maximum bonus** when multiple chars match in sequence. That means that the bonus of a character depends on which characters exactly matched around it. But the fundamental assumption of this algorithm (and why it doesn't require backtracking) is that the score of each character is independent of what other chars matched (this is the difference between the affine gap and the generic gap case shown in the paper, too). During fuzzing I found many cases where this mechanism leads to a non-optimal match being reported (so the sort order and fuzzy indices would be wrong). In my testing, removing this mechanism and slightly tweaking the bonus calculation resulted in similar match quality but ensured the algorithm always works correctly (and removed a bunch of weird edge cases).
* [ ] it seems this makes us overemphasize word boundaries for small search strings; this is likely okay, as the consecutive bonus wins fairly quickly. Maybe we should just do a greedy search for the first 2 chars to reduce visual noise?
* [x] substring/prefix/postfix/exact matcher
* [ ] case mismatch penalty. This doesn't seem like a good idea to me. Fzf doesn't do this (only skim does), and smart case should cover most cases. It would be nice for fully case-insensitive matching without smart case, like in autocompletion, though. Realistically there won't be more than 3 items that are identical apart from casing, so I don't think it matters too much. It is also a bit annoying to implement, since you can no longer pre-normalize queries (or need two queries) :/
* [ ] high level API (worker thread, query parsing, sorting), in progress
* apparently sorting is super fast (at most 5% of match time for the nucleo matcher with a highly selective query; otherwise it's completely negligible compared to fuzzy matching). All the bending over backwards fzf does (and skim copied, but way worse) seems a little silly. I think fzf does it because Go doesn't have a good parallel sort: fzf divides the matches into a couple of fairly large chunks, sorts those on each worker thread, and then lazily merges the results. That makes the sorting (without the merging) `N log(N/M)`, which is basically equivalent for large `N` and small `M`, as is the case here. At least it's parallel, though. In Rust we have a great parallel pattern-defeating quicksort (rayon), which is way easier.
* [x] basic implementation (workers, streaming, invalidation)
* [ ] verify it actually works
* [ ] query parsing
* [ ] hook up to helix
* [ ] currently I simply use a tick system (called on every redraw)
together with a redraw/tick notification (ideally debounced); is that enough?
* [ ] for streaming, callers should buffer their data. Can we provide a better API for that beyond what is currently there?
* [x] verify it actually works
* [x] query parsing
* [x] hook up to helix
* [x] currently I simply use a tick system (called on every redraw)
together with a redraw/tick notification (ideally debounced); is that enough? yes, works nicely
* [x] for streaming, callers should buffer their data. Can we provide a better API for that beyond what is currently there? yes, a lock-free stream
* [ ] cleanup code, improve API
* [ ] write docs
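The greedy `O(N)` fallback mentioned in the notes above can be sketched roughly like this (a simplified illustration with made-up score constants, not nucleo's actual implementation):

```rust
// Hypothetical score constants, for illustration only.
const SCORE_MATCH: u32 = 16;
const BONUS_CONSECUTIVE: u32 = 4;

/// Greedy fuzzy score: take the first occurrence of each needle char,
/// left to right, in a single pass over the haystack with O(1) extra
/// space. Fast, but not optimal -- it can miss a higher-scoring
/// alignment further right, which is why it only serves as a fallback
/// for very long haystacks.
fn greedy_score(needle: &str, haystack: &str) -> Option<u32> {
    let mut needle_chars = needle.chars();
    let mut next = needle_chars.next()?;
    let mut score = 0;
    let mut prev_matched = false;
    for h in haystack.chars() {
        if h == next {
            score += SCORE_MATCH;
            if prev_matched {
                score += BONUS_CONSECUTIVE; // reward adjacent matches
            }
            prev_matched = true;
            match needle_chars.next() {
                Some(c) => next = c,
                None => return Some(score), // needle fully matched
            }
        } else {
            prev_matched = false;
        }
    }
    None // not every needle char was found
}

fn main() {
    assert_eq!(greedy_score("ab", "ab_"), Some(36)); // 16 + 16 + 4
    assert_eq!(greedy_score("abc", "a_b_c"), Some(48)); // no consecutive bonus
    assert_eq!(greedy_score("x", "abc"), None);
    println!("ok");
}
```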
@@ -37,5 +44,5 @@ An optimized rust port of the fzf fuzzy matching algorithm
* [x] port the full fzf test suite for fuzzy matching
* [ ] port the full skim test suite for fuzzy matching
* [ ] high-level API
* [ ] test bustring/exact/prefix/postfix match
* [~] test substring/exact/prefix/postfix match
* [ ] coverage report (fuzzy matcher was at 86%)
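To illustrate the sorting note above: sorting scored matches best-first with the standard library's pattern-defeating quicksort is a one-liner, and rayon's `par_sort_unstable_by_key` is the parallel drop-in for the same call. The `Match` type here is hypothetical, not nucleo's actual API:

```rust
use std::cmp::Reverse;

/// Hypothetical match result: which haystack matched and how well.
#[derive(Debug, PartialEq)]
struct Match {
    index: u32,
    score: u32,
}

/// Sort matches by descending score, ties broken by haystack index for a
/// stable display order. `sort_unstable_by_key` is std's pdqsort; with
/// rayon this becomes `matches.par_sort_unstable_by_key(..)`.
fn sort_matches(matches: &mut [Match]) {
    matches.sort_unstable_by_key(|m| (Reverse(m.score), m.index));
}

fn main() {
    let mut matches = vec![
        Match { index: 2, score: 10 },
        Match { index: 0, score: 30 },
        Match { index: 1, score: 30 },
    ];
    sort_matches(&mut matches);
    assert_eq!(matches[0], Match { index: 0, score: 30 });
    assert_eq!(matches[2], Match { index: 2, score: 10 });
    println!("{matches:?}");
}
```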


@@ -19,7 +19,7 @@ pub(crate) const BONUS_BOUNDARY: u16 = SCORE_MATCH / 2;
// Their value should be BONUS_BOUNDARY - PENALTY_GAP_EXTENSION = 7.
// However, this prioritizes camel case over non-camel case.
// In fzf/skim this is not a problem since they score off the max
// consecutive bounus. However, we don't do that (because its incorrect)
// consecutive bonus. However, we don't do that (because it's incorrect)
// so to avoid prioritizing camel case we use a lower bonus. I think that's fine;
// usually camel case is a weaker boundary than actual word boundaries anyway.
// This also has the nice side effect of perfectly balancing out