Engines & Concepts#

One immutable trie, two search drivers. Pick one with engine= or let engine="auto" choose per query.

Scope vs budget#

A query matches a reference when it satisfies the scope and/or the budget:

  • Scope — per-type edit caps: max_subs, max_ins, max_dels, and an optional combined max_total_edits. A per-type cap of 0 means zero of that type.

  • Budget — a score threshold max_penalty under a substitution matrix and gap costs.

max_total_edits is an independent total cap (0 means “no total cap”, falling back to the per-type sum). It is not clamped by the per-type caps, so seqtrie can be driven by it alone.

seqtm — branch-and-bound#

Enumerates each edit (substitution / insertion / deletion) while descending the trie, tracking the per-type counts, and prunes as soon as a cap or the budget is exceeded. Consequences:

  • Per-type caps are enforced exactly, and every hit reports an exact (n_subs, n_ins, n_dels) breakdown.

  • A dedicated Hamming-only path runs when max_ins == max_dels == 0.

  • Cost grows with the number of allowed edits, so it is fastest at small distances (k = 1–3) — which covers UMI collapse, CDR3 error correction, and CDR3 / epitope matching.

seqtrie — banded DP#

Carries an edit-distance DP row down the trie and prunes a subtree once its best cell exceeds the budget. Consequences:

  • Handles a matrix-weighted score budget (BLOSUM62 or a custom matrix) and indels naturally.

  • Cost is independent of the edit count, so it scales better to large budgets / long range.

  • It tracks a single cost, so it enforces max_total_edits + max_penalty but not the per-type caps; n_subs / n_ins / n_dels are reported as 0 (use align() to recover the breakdown).

engine="auto" routes substitution-only / small-k indel work to seqtm and matrix-weighted budgets to seqtrie.

Scoring#

Scores are non-negative penalties where 0 is an exact match. Similarity matrices such as BLOSUM62 are converted at load time via the Gram→squared-distance transform pen[a][b] = sim[a][a] + sim[b][b] - 2*sim[a][b] — i.e. ‖φ(a) - φ(b)‖² if the score is read as an inner product sim(a,b) = ⟨φ(a), φ(b)⟩. It is symmetric, 0 on the identity, and non-negative for BLOSUM/PAM (the diagonal is each row’s maximum), so every edit adds a non-negative cost and the budget prune stays valid. Note penalties are roughly twice the scale of the raw score gaps, so in matrix mode set gap_open / max_penalty accordingly. With no matrix, the cost is unit (1 per substitution and per gap position; linear gaps in v1).

Alphabets#

"aa" (20 amino acids in BLOSUM62 order plus B Z X *), "nt" (ACGT), and "iupac" (nucleotide ambiguity codes). Encoding is case-insensitive; a symbol outside the alphabet raises an error at build or search time.