Roadmap
=======

The v1 core ships both engines, scope + budget search, BLOSUM62 / PAM50 / custom matrices
(Gram-distance penalty) with linear gaps, top / all hits, on-demand alignment, parallel batch and
batch-of-batches, auto-selecting pairwise search, and **control-set E-values** for significance (see
:doc:`evalue`). Planned next:

1. **Alignment polish** — affine gaps (gap open vs extend), CIGAR-style output, batch alignment.
2. **Position-specific scoring matrices** — ``penalty(pos, a, b)``; consume per-position PWMs.
3. **Significance, continued.** Control-set E-values ship (``load_control`` + ``evalues``); a
   *tf-idf redundancy weighting* — down-weighting hits to common / clustered motifs intrinsically,
   without an external control set — is still under consideration.
4. **Memory** — succinct (LOUDS) trie / packed reference strings if reference counts reach the
   10M+ range; memory-mapped frozen index for zero-copy load.
5. **Batch-of-batches & distribution** — streaming query batches and optional process-level fan-out.
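
As an illustration of item 2, the sketch below builds a ``penalty(pos, a, b)`` callable from a
per-position PWM. Only the ``penalty(pos, a, b)`` signature comes from the roadmap. The helper name
``make_pssm_penalty``, the PWM layout (one dict of residue probabilities per position) and the
negative-log-probability scoring rule are assumptions made for this example, not the planned
seqtree API.

.. code-block:: python

   import math


   def make_pssm_penalty(pwm, floor=1e-6):
       """Build a penalty(pos, a, b) callable from a per-position PWM.

       pwm[pos][residue] is the probability of ``residue`` at ``pos``.
       Hypothetical scoring rule: a mismatch costs -log p(b | pos), so
       substituting into a residue rarely seen at that position is expensive.
       """
       def penalty(pos, a, b):
           if a == b:
               return 0.0  # exact match is always free
           p_b = max(pwm[pos].get(b, 0.0), floor)  # floor avoids log(0)
           return -math.log(p_b)

       return penalty


   # Toy two-position PWM: position 0 is almost always C, position 1 is mixed.
   pwm = [{"C": 0.97, "S": 0.03}, {"A": 0.5, "S": 0.5}]
   penalty = make_pssm_penalty(pwm)
   cost_conserved = penalty(0, "C", "S")  # high: S is rare at position 0
   cost_variable = penalty(1, "A", "S")   # low: S is common at position 1

Because the probabilities are at most 1, the penalty is never negative, which keeps it compatible
with a budget-style search where costs only accumulate.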

Downstream consumers (built in *other* libraries, not here): UMI collapsing, CDR3 nucleotide error
correction, VDJdb CDR3 matching, IEDB epitope matching. seqtree stays payload-agnostic; those
libraries apply V-gene / MHC / count filters on top of ``ref_id`` results.

Benchmark findings
------------------

The simple max-edit-3 benchmark (``bench/bench_gnuplot.py``) makes the cost structure clear:
throughput is dominated by **scope**, not reference-set size. ``seqtm`` substitution-only search
stays cheap; allowing insertions/deletions widens the branch-and-bound frontier sharply, and the
``seqtrie`` banded DP pays a larger constant per candidate. Crucially, enumeration cost depends on
**reference-set redundancy**: a low-redundancy (uniform-random) set yields a bushy trie with few
shared prefixes beyond the first levels, so indel-heavy scopes blow up; real, clustered repertoires
share structure, the trie collapses, and the same scope stays tractable. This shapes the per-domain
direction below.
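
The redundancy effect can be seen without seqtree at all. The sketch below is not the benchmark
script: it counts nodes in a plain dict-based prefix trie, used here as a crude proxy for
enumeration cost, for a uniform-random set and for a synthetic clustered set. The set sizes,
sequence length and single-substitution clustering model are arbitrary choices for illustration.

.. code-block:: python

   import random


   def trie_node_count(seqs):
       """Count the nodes of a plain prefix trie over ``seqs`` (root excluded)."""
       root = {}
       count = 0
       for s in seqs:
           node = root
           for ch in s:
               if ch not in node:
                   node[ch] = {}
                   count += 1
               node = node[ch]
       return count


   AA = "ACDEFGHIKLMNPQRSTVWY"
   rng = random.Random(0)
   n, length = 2000, 12


   def random_seq():
       return "".join(rng.choice(AA) for _ in range(length))


   def mutate(s):
       """Return ``s`` with one randomly chosen position redrawn."""
       i = rng.randrange(len(s))
       return s[:i] + rng.choice(AA) + s[i + 1:]


   # Low-redundancy set: every position drawn uniformly at random.
   uniform = {random_seq() for _ in range(n)}

   # Clustered set: single-substitution variants of a handful of seeds.
   seeds = [random_seq() for _ in range(20)]
   clustered = {mutate(rng.choice(seeds)) for _ in range(n)}

   for name, seqs in (("uniform", uniform), ("clustered", clustered)):
       nodes = trie_node_count(seqs)
       print(f"{name}: {len(seqs)} sequences, {nodes / len(seqs):.1f} nodes per sequence")

The clustered set needs fewer trie nodes per distinct sequence because variants of one seed share
the seed's prefix up to the mutated position. Real repertoires will differ in detail.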

Per-domain search strategy (open questions)
-------------------------------------------

The three target domains have very different null structure and need different indexing:

- **CDR3 (amino acid).** Index from the **3' / J-proximal (C-terminal) end** rather than the 5'
  (N-terminal) end: the J segment is longer and more conserved than the V-side N-region, so a trie
  built over reversed sequences turns shared J-side suffixes into shared prefixes and prunes faster.
  Reverse each sequence before insertion (and each query before search).
  See also https://arxiv.org/abs/2604.26190 for an alternative formulation.
- **UMIs / nucleotide.** Largely a birthday-problem + Hamming-tree regime: short fixed-length tags,
  substitution-dominated errors, no indels in the common case. A plain Hamming trie with a small
  ``max_subs`` budget is sufficient; the interesting part is collision probability, not search.
- **Epitopes (MHC-presented peptides).** The hardest case — the null is not uniform over peptide
  space but shaped by presentation. Need to characterize the real structure of MHC-I and MHC-II
  ligand repertoires (from eluted-ligand / immunopeptidomics data) or adopt a probabilistic
  presentation model (https://journals.aps.org/prxlife/abstract/10.1103/fbct-vzwm) to define a
  meaningful background before E-values over epitope hits are trustworthy.
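
Two of the points above are small enough to sketch directly: the sequence reversal proposed for
CDR3 indexing, and the birthday-problem collision estimate for UMIs. The function names are
hypothetical, and the collision estimate assumes UMIs are drawn uniformly and independently, which
real libraries only approximate.

.. code-block:: python

   import math


   def j_anchored(seq):
       """Reverse a CDR3 so its J-proximal end becomes the trie prefix."""
       return seq[::-1]


   def umi_collision_probability(n_molecules, umi_length, alphabet_size=4):
       """Probability that at least two of ``n_molecules`` share a UMI.

       Classic birthday problem over ``alphabet_size ** umi_length`` tags,
       assuming uniform, independent draws.
       """
       space = alphabet_size ** umi_length
       if n_molecules > space:
           return 1.0  # pigeonhole: a collision is certain
       # P(all distinct) = prod_{i < n} (1 - i / space); sum logs for stability.
       log_p_distinct = sum(math.log1p(-i / space) for i in range(n_molecules))
       return -math.expm1(log_p_distinct)


   reversed_cdr3 = j_anchored("CASSLGQETQYF")
   p_collision = umi_collision_probability(1000, 10)

With 10-nt UMIs the tag space is 4\ :sup:`10` = 1,048,576, and 1,000 molecules already give a
collision probability of roughly 0.38 under these assumptions, which is why the collision model
matters more than the search itself in this domain.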