Benchmarks#

Several harnesses ship with the repo, covering raw throughput, recall, E-value significance, and the epitope-detection formalism. Most bootstrap realistic TCR CDR3 sequences from OLGA (if installed) and otherwise fall back to seeded random sequences. Figures are rendered to SVG with gnuplot.

C++ (raw throughput + scaling)#

cmake -S . -B build -DSEQTREE_BENCH=ON && cmake --build build
./build/seqtree_bench 1000 10000 100000 1000000

Reports build time, peak RSS, single-query latency (median / p99), batch throughput, and thread scaling, followed by a two-engine comparison and per-call alignment cost.

Python (methods + recall)#

bench/bench_methods.py compares seqtm vs seqtrie across reference sizes, edit scopes, and budgets (edit count and BLOSUM62 score), plus alignment-fetch cost:

python bench/bench_methods.py
env RUN_BENCHMARK=1 python bench/bench_methods.py --sizes 100000 1000000

bench/bench.py measures recall against ground truth on the AIRR VDJdb table (queries are mutated references with known parents), with throughput and peak RSS:

python bench/bench.py
env RUN_BENCHMARK=1 python bench/bench.py --sizes 1000000 --queries 1000000 --threads 16

E-value benchmark#

bench/bench_evalue.py is the true E-value benchmark. For a target repertoire (VDJdb, antigen-selected) scored against the airr_control background, at each scope/budget it reports the number of neighbours (distinct hits, excluding exact/self matches — the queries are members of the target, so the self-match is dropped per the punctured-null lemma), the exact self-hits removed, the number of collisions (references re-reached via a different edit path — non-zero only for seqtm with indels), and the fraction of neighbours called significant both at fixed E-value cutoffs and after a Benjamini–Hochberg FDR correction across the query family:

python bench/bench_evalue.py
env RUN_BENCHMARK=1 python bench/bench_evalue.py --target-size 200000 --control-size 2000000

The discriminating result is the contrast between query sets: antigen-selected VDJdb queries produce orders of magnitude more neighbours and are largely significant (BH FDR < 0.05), whereas background (control) queries produce almost none and survive no correction. The smallest resolvable E-value is \(N/M\), so finer fixed cutoffs (E < 0.01) require a control much larger than the target (the RUN_BENCHMARK tier uses \(M = 2{,}000{,}000\)); the BH correction is what makes the fixed-cutoff fractions trustworthy at small control sizes.

Comprehensive matrix#

bench/bench_evalue_matrix.py sweeps the full grid — reference set (vdjdb / olga / vdjdb+noise / olga+noise, all built to the same size \(N\)), background control (OLGA 1M / 2M / 10M), query set (vdjdb / olga), and scope (1–3 substitutions) — and emits a TSV table plus evalue_matrix.svg:

python bench/bench_evalue_matrix.py                       # control 1M
env RUN_BENCHMARK=1 python bench/bench_evalue_matrix.py   # controls 1M / 2M / 10M

The signal is structural: vdjdb queries against a vdjdb reference are ~0.87–0.98 significant (BH), vdjdb+noise stays ~0.77–0.97 (real clusters survive 50 % dilution), while olga, olga+noise, and all cross combinations sit at ~0. The naive E < 1 column over-calls (e.g. olga-vs-olga reads 1.0) where BH reads 0.0 — a direct demonstration that the multiple-testing correction is what separates genuine convergence from background.

significant fraction across reference/control/query/scope combinations

Epitope detection complexity#

bench/bench_epitope.py tests the detectability formalism on two HLA-A*02 epitopes of opposite repertoire structure: GIL (GILGFVFTL, one dominant convergent cluster) and NLV (NLVPMVATV, many diverse small clusters). It reports each epitope’s within-set neighbour density, degree, and cluster sizes, then subsamples to depth \(n\) and plots the fraction called significant (BH FDR < 0.05) against the degree-distribution prediction of Eq. (φ) in the appendix:

python bench/bench_epitope.py --scopes 1 2

At scope 1, GIL (ρ = 3.4×10⁻⁴, largest cluster 896 = 17 % of the set) is ~20–30 % recovered by \(n\sim10^3\) sampled TCRs, while NLV (ρ = 2.8×10⁻⁵, largest cluster 152 = 1.2 %) stays below 5 % even at \(n\sim5\times10^3\) — detection complexities differing by an order of magnitude purely from repertoire structure. See appendix/evalue.tex §”Epitope detection complexity”.

significant fraction vs sampled depth for GIL and NLV epitopes

TCR-beta benchmark (gnuplot figures)#

bench/bench_gnuplot.py is the main benchmark. It measures two reference families separately, so the effect of sequence structure is visible rather than averaged away:

  • olga — OLGA-generated human TRB CDR3 (a generative model, no antigen motif); queried with 1000 fresh OLGA TRB sequences.

  • vdjdb — VDJdb CDR3 mutated (real antigen-specific receptors, with shared motif structure); queried with 1000 held-out VDJdb CDR3.

Both families are expanded by substitution-mutation to each target size. Timings are over the 1000 queries. Figures are vertically stacked two-panel SVGs; seqtm is drawn with a long dash and seqtrie with a dash-dot, and the reference family is encoded by colour.

python bench/bench_gnuplot.py                       # fast tier: 10k / 100k
env RUN_BENCHMARK=1 python bench/bench_gnuplot.py   # full tier: 10k / 100k / 1M / 10M

Each figure is written to bench/figures/<key>.svg (+ per-panel .tsv). Requires gnuplot and olga-generate_sequences on PATH (pip install olga). The scaling, matrix and per-op figures span all reference sizes; the edit-budget sweeps (scope, selectivity, collisions) run at one representative size to stay tractable.

A note on engine semantics: at an edit budget e, seqtm explores the Hamming ball (max_subs=e, substitutions only — the dominant TCR diversity/error mode) while seqtrie explores the edit-distance ball (max_total_edits=e, substitutions and indels). They answer subtly different questions, which shows as a higher match count for seqtrie at the same e.

Scaling and parallelism#

Throughput (queries per millisecond) versus reference-set size, for both engines at 1, 4, and 8 threads (fixed scope: 2 substitutions), with the olga family on top and vdjdb below. Batches parallelize near-linearly to 8 cores (~6.5–7×):

throughput vs reference size per engine and thread count, olga and vdjdb

Edit budget#

Cost (top) and selectivity (bottom) as the edit budget grows from 1 to 5. Throughput is governed by scope far more than by reference-set size, and the match count grows steeply — by e = 5 a query already pulls hundreds (seqtm Hamming ball) to thousands (seqtrie edit ball) of neighbours, so loose budgets are rarely useful:

throughput and matches per query vs edit budget 1..5

Matrix scoring (BLOSUM62 / PAM50 / custom)#

seqtm scores substitutions through a substitution matrix, reporting the best (minimum-penalty) score across all alignments to each reference. The time overhead of matrix scoring is small (within ~5–10 % of unit cost — one table lookup replaces a character compare), for both families:

seqtm throughput unit vs BLOSUM62 vs PAM50, olga and vdjdb

Besides the built-in BLOSUM62 and PAM50, a custom matrix is supplied via SubstitutionMatrix.from_similarity (row/column order from seqtree.amino_acids()).

Selectivity and collisions#

Top: matches per query versus the seqtrie max_penalty budget — PAM50 is stricter than BLOSUM62 at equal budget. Bottom: collisions — how often seqtm’s branch-and-bound re-reaches the same reference via a different edit path (reported by Index.collisions_batch). Substitution-only search never collides; once indels are allowed, collisions rise with the edit budget, and the motif-rich vdjdb family collides far more than olga because shared structure makes a reference reachable by many distinct edit paths:

matches per query vs penalty budget, and seqtm collisions per query vs edit budget

Per-operation costs#

Top: fetching a global-alignment CIGAR (the C++ Needleman–Wunsch in Index.align) is on-demand and about a microsecond per call, roughly flat in reference count. Bottom: peak resident memory after the index build, which scales with the reference count (the trie is shared by both engines):

align CIGAR fetch cost and peak RSS vs reference size

Indicative numbers#

Apple M3, OLGA TRB references, 1000 queries, 8 threads (bench/bench_gnuplot.py, full tier):

metric

10k

100k

1M

10M

seqtm, 2 subs (q/ms)

~271

~48

~8.6

~2.4

seqtrie, edits≤2 (q/ms)

~55

~11

~1.2

~0.28

align CIGAR fetch (µs)

~0.9

~1.1

~1.1

~1.8

peak RSS (MB)

~91

~183

~990

~3170

The 8-thread speed-up is ~6.5–7×; matrix scoring stays within ~5–10 % of unit cost across all sizes.

Takeaway#

Throughput is governed by scope (edit budget) far more than reference-set size, parallelizes near-linearly to 8 cores, and matrix scoring is nearly free. Sequence structure matters: the motif-rich vdjdb family is denser and collides more under indels — enumeration cost ultimately depends on reference redundancy (see Roadmap).