Benchmarks ========== Several harnesses ship with the repo, covering raw throughput, recall, E-value significance, and the epitope-detection formalism. Most bootstrap realistic TCR CDR3 sequences from OLGA (if installed) and otherwise fall back to seeded random sequences. Figures are rendered to SVG with gnuplot. C++ (raw throughput + scaling) ------------------------------ .. code-block:: console cmake -S . -B build -DSEQTREE_BENCH=ON && cmake --build build ./build/seqtree_bench 1000 10000 100000 1000000 Reports build time, peak RSS, single-query latency (median / p99), batch throughput, and thread scaling, followed by a two-engine comparison and per-call alignment cost. Python (methods + recall) ------------------------- ``bench/bench_methods.py`` compares ``seqtm`` vs ``seqtrie`` across reference sizes, edit scopes, and budgets (edit count and BLOSUM62 score), plus alignment-fetch cost: .. code-block:: fish python bench/bench_methods.py env RUN_BENCHMARK=1 python bench/bench_methods.py --sizes 100000 1000000 ``bench/bench.py`` measures **recall** against ground truth on the AIRR VDJdb table (queries are mutated references with known parents), with throughput and peak RSS: .. code-block:: fish python bench/bench.py env RUN_BENCHMARK=1 python bench/bench.py --sizes 1000000 --queries 1000000 --threads 16 E-value benchmark ----------------- ``bench/bench_evalue.py`` is the **true E-value benchmark**. For a target repertoire (VDJdb, antigen-selected) scored against the ``airr_control`` background, at each scope/budget it reports the number of **neighbours** (distinct hits, **excluding exact/self matches** — the queries are members of the target, so the self-match is dropped per the punctured-null lemma), the **exact** self-hits removed, the number of **collisions** (references re-reached via a different edit path — non-zero only for ``seqtm`` with indels), and the **fraction of neighbours called significant** both at fixed E-value cutoffs and after a Benjamini–Hochberg FDR correction across the query family: .. code-block:: fish python bench/bench_evalue.py env RUN_BENCHMARK=1 python bench/bench_evalue.py --target-size 200000 --control-size 2000000 The discriminating result is the contrast between query sets: antigen-selected VDJdb queries produce orders of magnitude more neighbours and are largely significant (BH FDR < 0.05), whereas background (control) queries produce almost none and survive no correction. The smallest resolvable E-value is :math:`N/M`, so finer fixed cutoffs (``E < 0.01``) require a control much larger than the target (the ``RUN_BENCHMARK`` tier uses :math:`M = 2{,}000{,}000`); the BH correction is what makes the fixed-cutoff fractions trustworthy at small control sizes. Comprehensive matrix ~~~~~~~~~~~~~~~~~~~~~ ``bench/bench_evalue_matrix.py`` sweeps the full grid — reference set (``vdjdb`` / ``olga`` / ``vdjdb+noise`` / ``olga+noise``, all built to the same size :math:`N`), background control (OLGA ``1M`` / ``2M`` / ``10M``), query set (``vdjdb`` / ``olga``), and scope (1–3 substitutions) — and emits a TSV table plus ``evalue_matrix.svg``: .. code-block:: fish python bench/bench_evalue_matrix.py # control 1M env RUN_BENCHMARK=1 python bench/bench_evalue_matrix.py # controls 1M / 2M / 10M The signal is structural: ``vdjdb`` queries against a ``vdjdb`` reference are ~0.87–0.98 significant (BH), ``vdjdb+noise`` stays ~0.77–0.97 (real clusters survive 50 % dilution), while ``olga``, ``olga+noise``, and all cross combinations sit at ~0. The naive ``E < 1`` column over-calls (e.g. ``olga``-vs-``olga`` reads 1.0) where BH reads 0.0 — a direct demonstration that the multiple-testing correction is what separates genuine convergence from background. .. image:: _static/bench/evalue_matrix.svg :alt: significant fraction across reference/control/query/scope combinations :width: 80% Epitope detection complexity ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``bench/bench_epitope.py`` tests the detectability formalism on two HLA-A*02 epitopes of opposite repertoire structure: **GIL** (GILGFVFTL, one dominant convergent cluster) and **NLV** (NLVPMVATV, many diverse small clusters). It reports each epitope's within-set neighbour density, degree, and cluster sizes, then subsamples to depth :math:`n` and plots the fraction called significant (BH FDR < 0.05) against the degree-distribution prediction of Eq. (φ) in the appendix: .. code-block:: fish python bench/bench_epitope.py --scopes 1 2 At scope 1, GIL (ρ = 3.4×10⁻⁴, largest cluster 896 = 17 % of the set) is ~20–30 % recovered by :math:`n\sim10^3` sampled TCRs, while NLV (ρ = 2.8×10⁻⁵, largest cluster 152 = 1.2 %) stays below 5 % even at :math:`n\sim5\times10^3` — detection complexities differing by an order of magnitude purely from repertoire structure. See ``appendix/evalue.tex`` §"Epitope detection complexity". .. image:: _static/bench/epitope_detection.svg :alt: significant fraction vs sampled depth for GIL and NLV epitopes :width: 80% TCR-beta benchmark (gnuplot figures) ------------------------------------ ``bench/bench_gnuplot.py`` is the main benchmark. It measures **two reference families separately**, so the effect of sequence structure is visible rather than averaged away: * **olga** — OLGA-generated human TRB CDR3 (a generative model, **no antigen motif**); queried with 1000 fresh OLGA TRB sequences. * **vdjdb** — VDJdb CDR3 **mutated** (real antigen-specific receptors, **with shared motif** structure); queried with 1000 held-out VDJdb CDR3. Both families are expanded by substitution-mutation to each target size. Timings are over the 1000 queries. Figures are **vertically stacked two-panel SVGs**; **seqtm is drawn with a long dash and seqtrie with a dash-dot**, and the reference family is encoded by colour. .. code-block:: fish python bench/bench_gnuplot.py # fast tier: 10k / 100k env RUN_BENCHMARK=1 python bench/bench_gnuplot.py # full tier: 10k / 100k / 1M / 10M Each figure is written to ``bench/figures/.svg`` (+ per-panel ``.tsv``). Requires ``gnuplot`` and ``olga-generate_sequences`` on PATH (``pip install olga``). The *scaling*, *matrix* and *per-op* figures span all reference sizes; the edit-budget sweeps (*scope*, *selectivity*, *collisions*) run at one representative size to stay tractable. A note on engine semantics: at an edit budget *e*, **seqtm** explores the **Hamming ball** (``max_subs=e``, substitutions only — the dominant TCR diversity/error mode) while **seqtrie** explores the **edit-distance ball** (``max_total_edits=e``, substitutions *and* indels). They answer subtly different questions, which shows as a higher match count for seqtrie at the same *e*. Scaling and parallelism ~~~~~~~~~~~~~~~~~~~~~~~~~ Throughput (queries per millisecond) versus reference-set size, for both engines at 1, 4, and 8 threads (fixed scope: 2 substitutions), with the olga family on top and vdjdb below. Batches parallelize near-linearly to 8 cores (~6.5–7×): .. image:: _static/bench/scaling.svg :alt: throughput vs reference size per engine and thread count, olga and vdjdb :width: 80% Edit budget ~~~~~~~~~~~ Cost (top) and selectivity (bottom) as the edit budget grows from 1 to 5. Throughput is governed by **scope** far more than by reference-set size, and the match count grows steeply — by *e* = 5 a query already pulls hundreds (seqtm Hamming ball) to thousands (seqtrie edit ball) of neighbours, so loose budgets are rarely useful: .. image:: _static/bench/scope.svg :alt: throughput and matches per query vs edit budget 1..5 :width: 80% Matrix scoring (BLOSUM62 / PAM50 / custom) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ seqtm scores substitutions through a substitution matrix, reporting the best (minimum-penalty) score across all alignments to each reference. The **time overhead of matrix scoring is small** (within ~5–10 % of unit cost — one table lookup replaces a character compare), for both families: .. image:: _static/bench/matrix.svg :alt: seqtm throughput unit vs BLOSUM62 vs PAM50, olga and vdjdb :width: 80% Besides the built-in ``BLOSUM62`` and ``PAM50``, a custom matrix is supplied via ``SubstitutionMatrix.from_similarity`` (row/column order from ``seqtree.amino_acids()``). Selectivity and collisions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Top: matches per query versus the seqtrie ``max_penalty`` budget — PAM50 is stricter than BLOSUM62 at equal budget. Bottom: **collisions** — how often seqtm's branch-and-bound re-reaches the *same* reference via a *different* edit path (reported by ``Index.collisions_batch``). Substitution-only search never collides; once indels are allowed, collisions rise with the edit budget, and the motif-rich **vdjdb** family collides far more than **olga** because shared structure makes a reference reachable by many distinct edit paths: .. image:: _static/bench/selectivity_collisions.svg :alt: matches per query vs penalty budget, and seqtm collisions per query vs edit budget :width: 80% Per-operation costs ~~~~~~~~~~~~~~~~~~~~~ Top: fetching a global-alignment CIGAR (the C++ Needleman–Wunsch in ``Index.align``) is on-demand and about a microsecond per call, roughly flat in reference count. Bottom: peak resident memory after the index build, which scales with the reference count (the trie is shared by both engines): .. image:: _static/bench/perop.svg :alt: align CIGAR fetch cost and peak RSS vs reference size :width: 80% Indicative numbers ~~~~~~~~~~~~~~~~~~~ Apple M3, OLGA TRB references, 1000 queries, 8 threads (``bench/bench_gnuplot.py``, full tier): .. list-table:: :header-rows: 1 * - metric - 10k - 100k - 1M - 10M * - seqtm, 2 subs (q/ms) - ~271 - ~48 - ~8.6 - ~2.4 * - seqtrie, edits≤2 (q/ms) - ~55 - ~11 - ~1.2 - ~0.28 * - align CIGAR fetch (µs) - ~0.9 - ~1.1 - ~1.1 - ~1.8 * - peak RSS (MB) - ~91 - ~183 - ~990 - ~3170 The 8-thread speed-up is ~6.5–7×; matrix scoring stays within ~5–10 % of unit cost across all sizes. Takeaway ~~~~~~~~ Throughput is governed by **scope** (edit budget) far more than reference-set size, parallelizes near-linearly to 8 cores, and matrix scoring is nearly free. Sequence structure matters: the motif-rich vdjdb family is denser and collides more under indels — enumeration cost ultimately depends on reference redundancy (see :doc:`roadmap`).