small spelling and grammar fixes (#1)

This commit is contained in:
Gabriel Dinner-David 2023-08-04 02:23:01 -04:00 committed by GitHub
parent f4e19b4567
commit e499e2b601
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -6,17 +6,17 @@ Nucleo uses the exact **same scoring system as fzf**. That means you should get
**Compared to `skim`** (and the `fuzzy-matcher` crate) `nucleo` has an even larger performance advantage and is often around **six times faster** (see benchmarks below). Furthermore, the bonus system used by nucleo and fzf is (in my opinion)more consistent/superior. `nulceo` also handles non-ascii text much better. (`skim`s bonus system and even case insensitivity only work for ASCII).
Nucleo also handles Unicode graphemes more correctly. `Fzf` and `skim` both operate on Unicode code points (chars). That means that multi codepoint graphemes can have weird effects (match multiple times, weirdly change the score, ..). `nucleo` will always use the first codepoint of the grapheme for matching instead (and reports grapheme indices, so they can be highlighted correctly).
Nucleo also handles Unicode graphemes more correctly. `Fzf` and `skim` both operate on Unicode code points (chars). That means that multi codepoint graphemes can have weird effects (match multiple times, weirdly change the score, ...). `nucleo` will always use the first codepoint of the grapheme for matching instead (and reports grapheme indices, so they can be highlighted correctly).
## Benchmarks
> WIP currently more of a demonstration than a comprehensive benchmark suit
> most notably scientific comparisons with `fzf` are missing (a pain because it can't be called as a library
> most notably scientific comparisons with `fzf` are missing (a pain because it can't be called as a library)
### Matcher micro benchmarks
Benchmark comparing the runtime of various pattterns mathec against all files in the source thee of the linux kernel. Repeat on your system with `BENCHMARK_DIR=<path_to_linux> cargo run -p benches --release` (you can specify an empty directory and the kernel is cloned automatically).
Benchmark comparing the runtime of various patterns matched against all files in the source of the linux kernel. Repeat on your system with `BENCHMARK_DIR=<path_to_linux> cargo run -p benches --release` (you can specify an empty directory and the kernel is cloned automatically).
Method | Mean | Samples
-----------------------|-----------|-----------
@ -51,9 +51,9 @@ For example in the following two screencasts the pattern `///.` is pasted into `
# Naming
The name `nucleo` plays at the fact that the `Smith-Waterman` algorithm (that it's based on) war originally developed for matching DNA/RNA sequences. The elements of DNA/RNA that are matched are called *nucleotides* which was shortened to `nucleo` here.
The name `nucleo` plays on the fact that the `Smith-Waterman` algorithm (that it's based on) was originally developed for matching DNA/RNA sequences. The elements of DNA/RNA that are matched are called *nucleotides* which was shortened to `nucleo` here.
The name also indicates its close relationship with the *helix* editor (sticking with the DNA theme).# Implementation Details
The name also indicates its close relationship with the *helix* editor (sticking with the DNA theme).
# Implementation Details
@ -64,17 +64,17 @@ The name also indicates its close relationship with the *helix* editor (sticking
<!-- Furthermore, `nucleo` also features fully lock-free multithreaded streaming so if used as a library its possible to performantly scale streaming to a practically unlimited number of producer threads (for example running `ignore` or `jwalk` across all cores) without any buffering or other additional logic. -->
The fuzzy matching algorithm is based on the `Smith-Waterman` (with affine gaps) as described in https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/gaps.pdf (TODO: explain). `Nulceo` faithfully implements this algorithm and therefore has two separate matrices. However, by precomputing the next `m-matrix` row we can avoid store the p-matrix at all and instead just store the value in a variable as we iterate the row.
The fuzzy matching algorithm is based on the `Smith-Waterman` (with affine gaps) as described in https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/gaps.pdf (TODO: explain). `Nulceo` faithfully implements this algorithm and therefore has two separate matrices. However, by precomputing the next `m-matrix` row we can avoid storing the p-matrix at all and instead just store the value in a variable as we iterate the row.
Nucleo also never really stores the `m-matrix` instead we only ever store the current row (which simultaneously serves as the next row). During index calculation a full matrix is however required to barcktrack which indices were actually matched. We only store two bools here (to indicates where we came from in the matrix).
Nucleo also never really stores the `m-matrix` instead we only ever store the current row (which simultaneously serves as the next row). During index calculation a full matrix is however required to backtrack which indices were actually matched. We only store two bools here (to indicate where we came from in the matrix).
By comparison `skim` stores the full p and m matrix in that case. `Fzf` always allocates a full `mn` matrix (even during matching!).
By comparison `skim` stores the full p and m matrix in that case. `fzf` always allocates a full `mn` matrix (even during matching!).
`nucleo`s' matrix only was width `n-m+1` instead of width `n`. This comes from the observation that the `p.` char requires `p-1` chars before it and `m-p` chars after it, so there are always `p-1 + m-p = m+1` chars that can never match the current char. This works especially well with only using a single row because the first relevant char is always at the same position even tough it'rs technically further to the right. This is particularly nice because we precalculate the m-matrix row. The m-matrix is computed from diagonal elements, so the precalculated values stay in the same matrix cell.
`nucleo`s' matrix is only width `n-m+1` instead of width `n`. This comes from the observation that the `p` char requires `p-1` chars before it and `m-p` chars after it, so there are always `p-1 + m-p = m+1` chars that can never match the current char. This works especially well with only using a single row because the first relevant char is always at the same position even though it's technically further to the right. This is particularly nice because we precalculate the m-matrix row. The m-matrix is computed from diagonal elements, so the precalculated values stay in the same matrix cell.
Compared to `skim` nucleo does couple simpler (but arguably even more impactful) optimizations:
* *Presegment Unicode*: Unicode segmentation is somewhat slow and matcher will filter the same elements quite often so only doing it once is nice. It also prevents a very common source of bugs (mixing of char indices which we use here and utf8 indices) and makes the code a lot simpler as a result. Fzf does the same.
* *Aggressive prefiltering*: Especially for ASCII this works very well but we also do this for Unicode to a lesser extend. This ensures we reject non-matching haystacks as fast as possible. Usually most haystacks will not match when fuzzy matching large lists so having fast path for that case is a huge win.
* *Aggressive prefiltering*: Especially for ASCII this works very well, but we also do this for Unicode to a lesser extent. This ensures we reject non-matching haystacks as fast as possible. Usually most haystacks will not match when fuzzy matching large lists so having fast path for that case is a huge win.
* *Special-case ASCII*: 90% of practical text is ASCII. ASCII can be stored as bytes instead of `chars`, so cache locality is improved a lot, and we can use `memchar` for superfast prefilters (even case-insensitive prefilter are possible that way)
* *Fallback for very long matches*: We fall back to greedy matcher which runs in `O(N)` (and `O(1)` space complexity) to avoid the `O(mn)` blowup for large matches. This is fzfs old algorithm and yields decent (but not great) results.
@ -90,8 +90,7 @@ Compared to `skim` nucleo does couple simpler (but arguably even more impactful)
<!-- * [x] verify it actually works -->
<!-- * [x] query paring -->
<!-- * [x] hook up to helix -->
<!-- * [x] currently I simply use a tick system (called on every redraw) -->
<!-- together with a redraw/tick nofication (ideally debounced) is that enough? yes works nicely -->
<!-- * [x] currently I simply use a tick system (called on every redraw), together with a redraw/tick nofication (ideally debounced) is that enough? yes works nicely -->
<!-- * [x] for streaming callers should buffer their data. Can we provide a better API for that beyond what is currently there? yes lock-free stream -->
<!-- * [ ] cleanup code, improve API -->
<!-- * [ ] write docs -->