mirror of
https://github.com/solaeus/nucleo.git
synced 2024-12-22 01:47:49 +00:00
small spelling and grammar fixes (#1)
This commit is contained in:
parent
f4e19b4567
commit
e499e2b601
25
README.md
25
README.md
@ -2,21 +2,21 @@
|
||||
|
||||
`nucleo` is a highly performant fuzzy matcher written in rust. It aims to fill the same use case as `fzf` and `skim`. Compared to `fzf` `nucleo` has a significantly faster matching algorithm. This mainly makes a difference when matching patterns with low selectivity on many items. An (unscientific) comparison is shown in the benchmark section below.
|
||||
|
||||
Nucleo uses the exact **same scoring system as fzf**. That means you should get the same ranking quality (or better) as you are used to from fzf. However, `nucleo` has a more faithful implementation of the Smith-Waterman algorithm which is normally used in DNA sequence alignment (see https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/gaps.pdf) with two separate matrices (instead of one like fzf). This means that `nucleo` finds the optimal match more often. For example if you match `foo` in `xf foo` `nucleo` will match `x__foo` but `fzf` will match `xf_oo` (you can increase the word length the result will stay the same).The former is the more intuitive match and has a higher score according to the ranking system that both `nucleo` and fzf.
|
||||
Nucleo uses the exact **same scoring system as fzf**. That means you should get the same ranking quality (or better) as you are used to from fzf. However, `nucleo` has a more faithful implementation of the Smith-Waterman algorithm which is normally used in DNA sequence alignment (see https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/gaps.pdf) with two separate matrices (instead of one like fzf). This means that `nucleo` finds the optimal match more often. For example if you match `foo` in `xf foo` `nucleo` will match `x__foo` but `fzf` will match `xf_oo` (you can increase the word length the result will stay the same). The former is the more intuitive match and has a higher score according to the ranking system that both `nucleo` and fzf.
|
||||
|
||||
**Compared to `skim`** (and the `fuzzy-matcher` crate) `nucleo` has an even larger performance advantage and is often around **six times faster** (see benchmarks below). Furthermore, the bonus system used by nucleo and fzf is (in my opinion)more consistent/superior. `nulceo` also handles non-ascii text much better. (`skim`s bonus system and even case insensitivity only work for ASCII).
|
||||
|
||||
Nucleo also handles Unicode graphemes more correctly. `Fzf` and `skim` both operate on Unicode code points (chars). That means that multi codepoint graphemes can have weird effects (match multiple times, weirdly change the score, ..). `nucleo` will always use the first codepoint of the grapheme for matching instead (and reports grapheme indices, so they can be highlighted correctly).
|
||||
Nucleo also handles Unicode graphemes more correctly. `Fzf` and `skim` both operate on Unicode code points (chars). That means that multi codepoint graphemes can have weird effects (match multiple times, weirdly change the score, ...). `nucleo` will always use the first codepoint of the grapheme for matching instead (and reports grapheme indices, so they can be highlighted correctly).
|
||||
|
||||
## Benchmarks
|
||||
|
||||
> WIP currently more of a demonstration than a comprehensive benchmark suit
|
||||
> most notably scientific comparisons with `fzf` are missing (a pain because it can't be called as a library
|
||||
> most notably scientific comparisons with `fzf` are missing (a pain because it can't be called as a library)
|
||||
|
||||
|
||||
### Matcher micro benchmarks
|
||||
|
||||
Benchmark comparing the runtime of various pattterns mathec against all files in the source thee of the linux kernel. Repeat on your system with `BENCHMARK_DIR=<path_to_linux> cargo run -p benches --release` (you can specify an empty directory and the kernel is cloned automatically).
|
||||
Benchmark comparing the runtime of various patterns matched against all files in the source of the linux kernel. Repeat on your system with `BENCHMARK_DIR=<path_to_linux> cargo run -p benches --release` (you can specify an empty directory and the kernel is cloned automatically).
|
||||
|
||||
Method | Mean | Samples
|
||||
-----------------------|-----------|-----------
|
||||
@ -51,9 +51,9 @@ For example in the following two screencasts the pattern `///.` is pasted into `
|
||||
|
||||
# Naming
|
||||
|
||||
The name `nucleo` plays at the fact that the `Smith-Waterman` algorithm (that it's based on) war originally developed for matching DNA/RNA sequences. The elements of DNA/RNA that are matched are called *nucleotides* which was shortened to `nucleo` here.
|
||||
The name `nucleo` plays on the fact that the `Smith-Waterman` algorithm (that it's based on) was originally developed for matching DNA/RNA sequences. The elements of DNA/RNA that are matched are called *nucleotides* which was shortened to `nucleo` here.
|
||||
|
||||
The name also indicates its close relationship with the *helix* editor (sticking with the DNA theme).# Implementation Details
|
||||
The name also indicates its close relationship with the *helix* editor (sticking with the DNA theme).
|
||||
|
||||
# Implementation Details
|
||||
|
||||
@ -64,17 +64,17 @@ The name also indicates its close relationship with the *helix* editor (sticking
|
||||
<!-- Furthermore, `nucleo` also features fully lock-free multithreaded streaming so if used as a library its possible to performantly scale streaming to a practically unlimited number of producer threads (for example running `ignore` or `jwalk` across all cores) without any buffering or other additional logic. -->
|
||||
|
||||
|
||||
The fuzzy matching algorithm is based on the `Smith-Waterman` (with affine gaps) as described in https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/gaps.pdf (TODO: explain). `Nulceo` faithfully implements this algorithm and therefore has two separate matrices. However, by precomputing the next `m-matrix` row we can avoid store the p-matrix at all and instead just store the value in a variable as we iterate the row.
|
||||
The fuzzy matching algorithm is based on the `Smith-Waterman` (with affine gaps) as described in https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/gaps.pdf (TODO: explain). `Nulceo` faithfully implements this algorithm and therefore has two separate matrices. However, by precomputing the next `m-matrix` row we can avoid storing the p-matrix at all and instead just store the value in a variable as we iterate the row.
|
||||
|
||||
Nucleo also never really stores the `m-matrix` instead we only ever store the current row (which simultaneously serves as the next row). During index calculation a full matrix is however required to barcktrack which indices were actually matched. We only store two bools here (to indicates where we came from in the matrix).
|
||||
Nucleo also never really stores the `m-matrix` instead we only ever store the current row (which simultaneously serves as the next row). During index calculation a full matrix is however required to backtrack which indices were actually matched. We only store two bools here (to indicate where we came from in the matrix).
|
||||
|
||||
By comparison `skim` stores the full p and m matrix in that case. `Fzf` always allocates a full `mn` matrix (even during matching!).
|
||||
By comparison `skim` stores the full p and m matrix in that case. `fzf` always allocates a full `mn` matrix (even during matching!).
|
||||
|
||||
`nucleo`s' matrix only was width `n-m+1` instead of width `n`. This comes from the observation that the `p.` char requires `p-1` chars before it and `m-p` chars after it, so there are always `p-1 + m-p = m+1` chars that can never match the current char. This works especially well with only using a single row because the first relevant char is always at the same position even tough it'rs technically further to the right. This is particularly nice because we precalculate the m-matrix row. The m-matrix is computed from diagonal elements, so the precalculated values stay in the same matrix cell.
|
||||
`nucleo`s' matrix is only width `n-m+1` instead of width `n`. This comes from the observation that the `p` char requires `p-1` chars before it and `m-p` chars after it, so there are always `p-1 + m-p = m+1` chars that can never match the current char. This works especially well with only using a single row because the first relevant char is always at the same position even though it's technically further to the right. This is particularly nice because we precalculate the m-matrix row. The m-matrix is computed from diagonal elements, so the precalculated values stay in the same matrix cell.
|
||||
|
||||
Compared to `skim` nucleo does couple simpler (but arguably even more impactful) optimizations:
|
||||
* *Presegment Unicode*: Unicode segmentation is somewhat slow and matcher will filter the same elements quite often so only doing it once is nice. It also prevents a very common source of bugs (mixing of char indices which we use here and utf8 indices) and makes the code a lot simpler as a result. Fzf does the same.
|
||||
* *Aggressive prefiltering*: Especially for ASCII this works very well but we also do this for Unicode to a lesser extend. This ensures we reject non-matching haystacks as fast as possible. Usually most haystacks will not match when fuzzy matching large lists so having fast path for that case is a huge win.
|
||||
* *Aggressive prefiltering*: Especially for ASCII this works very well, but we also do this for Unicode to a lesser extent. This ensures we reject non-matching haystacks as fast as possible. Usually most haystacks will not match when fuzzy matching large lists so having fast path for that case is a huge win.
|
||||
* *Special-case ASCII*: 90% of practical text is ASCII. ASCII can be stored as bytes instead of `chars`, so cache locality is improved a lot, and we can use `memchar` for superfast prefilters (even case-insensitive prefilter are possible that way)
|
||||
* *Fallback for very long matches*: We fall back to greedy matcher which runs in `O(N)` (and `O(1)` space complexity) to avoid the `O(mn)` blowup for large matches. This is fzfs old algorithm and yields decent (but not great) results.
|
||||
|
||||
@ -90,8 +90,7 @@ Compared to `skim` nucleo does couple simpler (but arguably even more impactful)
|
||||
<!-- * [x] verify it actually works -->
|
||||
<!-- * [x] query paring -->
|
||||
<!-- * [x] hook up to helix -->
|
||||
<!-- * [x] currently I simply use a tick system (called on every redraw) -->
|
||||
<!-- together with a redraw/tick nofication (ideally debounced) is that enough? yes works nicely -->
|
||||
<!-- * [x] currently I simply use a tick system (called on every redraw), together with a redraw/tick nofication (ideally debounced) is that enough? yes works nicely -->
|
||||
<!-- * [x] for streaming callers should buffer their data. Can we provide a better API for that beyond what is currently there? yes lock-free stream -->
|
||||
<!-- * [ ] cleanup code, improve API -->
|
||||
<!-- * [ ] write docs -->
|
||||
|
Loading…
Reference in New Issue
Block a user