E-values for TCR hits#
Fuzzy search tells you which references are near a query; an E-value tells you whether that proximity is surprising. For TCR repertoires the hard part is biological redundancy: convergent V(D)J recombination, public clones, and clonal expansion mean a query in a common region of sequence space has many neighbours for purely generative reasons. A naive BLAST E-value (an i.i.d.-letter null) would call these wildly significant.
Full derivation
This page summarises the method; the complete derivation — theorems, finite-sample bounds, the
selection Q-factor, multiple-testing correction, and the epitope detection-complexity formalism —
is in the technical appendix: A control-calibrated E-value for fuzzy TCR sequence
search (PDF) (LaTeX source in appendix/evalue.tex).
seqtree calibrates against a background control instead. With a target index of N unique
clonotypes (e.g. VDJdb) and a control index of M unique clonotypes from an unselected
repertoire, for a query q at a fixed scope/budget:
Redundancy explained by the background process inflates n_control and hence E, so such hits
are not significant; antigen-driven convergence shows up as n_target exceeding E. This is
the TCRNET approach — counting sequence neighbours against a real-world control repertoire —
put on a rigorous, finite-sample footing; it reduces to the classical Karlin–Altschul E-value when
the background is an i.i.d. product measure and alignments are ungapped.
An empirical control already is the post-selection law: thymic and peripheral selection are baked into a real repertoire, so no extra correction is applied when calibrating against it. Where a generative null is needed instead (rare queries with too few control neighbours), the V(D)J generation probability \(P_\mathrm{gen}\) is reshaped by the per-sequence Elhanati selection factor \(Q\) (\(\langle Q\rangle_{P_\mathrm{gen}} = 1\)), giving the post-selection law \(P_0 \propto Q\,P_\mathrm{gen}\). A single global thymic-acceptance fraction only rescales absolute frequencies and is a crude last-resort fallback. See the appendix remark on the selection factor for the full treatment.
Usage#
import seqtree
control = seqtree.load_control("human_trb_aa", size=1_000_000) # cached after first build
target = seqtree.Index.build(vdjdb_cdr3s, alphabet="aa") # unique clonotypes
p = seqtree.SearchParams(max_subs=1, engine="seqtm")
for q, r in zip(queries, seqtree.evalues(target, control, queries, p)):
if r["p_enrichment"] < 1e-3:
print(q, r["E"], r["n_target"], r["n_control"])
load_control ships a small bundled subset for quick use and downloads larger controls from the
isalgo/airr_control dataset on demand (needs huggingface_hub); both are deduplicated to
unique clonotypes. Meaningful E-values need a control at least as large as the target — see the
precision bound below.
Theory#
The full derivation — Poisson approximation with an explicit Chen–Stein / Le Cam error bound,
the self-match / punctured-null lemma (benchmark-only exact-hit exclusion), clonotype-collapsing for
over-dispersion, the tf-idf = self-information equivalence, multiple-testing control (Bonferroni and
Benjamini–Hochberg), the control-size requirement \(M \gtrsim N/(\rho^2 E^\ast)\), the closest-hit
Gumbel law, epitope detection complexity from the degree distribution (worked NLV vs GIL example),
the Karlin–Altschul reduction, and the epitope-presentation limitation — is in the technical appendix
linked at the top of this page (LaTeX source appendix/evalue.tex; rebuild with make -C appendix).