E-values for TCR hits#

Fuzzy search tells you which references are near a query; an E-value tells you whether that proximity is surprising. For TCR repertoires the hard part is biological redundancy: convergent V(D)J recombination, public clones, and clonal expansion mean a query in a common region of sequence space has many neighbours for purely generative reasons. A naive BLAST E-value (an i.i.d.-letter null) would call these wildly significant.

Full derivation

This page summarises the method; the complete derivation — theorems, finite-sample bounds, the selection Q-factor, multiple-testing correction, and the epitope detection-complexity formalism — is in the technical appendix: A control-calibrated E-value for fuzzy TCR sequence search (PDF) (LaTeX source in appendix/evalue.tex).

seqtree calibrates against a background control instead. With a target index of N unique clonotypes (e.g. VDJdb) and a control index of M unique clonotypes from an unselected repertoire, for a query q at a fixed scope/budget:

\[E(q) = \frac{N}{M}\, n_{\mathrm{control}}(q), \qquad p_{\mathrm{any}} = 1 - e^{-E}, \qquad p_{\mathrm{enrich}} = \Pr\!\big(\mathrm{Poisson}(E) \ge n_{\mathrm{target}}(q)\big).\]

Redundancy explained by the background process inflates n_control and hence E, so such hits are not significant; antigen-driven convergence shows up as n_target exceeding E. This is the TCRNET approach — counting sequence neighbours against a real-world control repertoire — put on a rigorous, finite-sample footing; it reduces to the classical Karlin–Altschul E-value when the background is an i.i.d. product measure and alignments are ungapped.

An empirical control already is the post-selection law: thymic and peripheral selection are baked into a real repertoire, so no extra correction is applied when calibrating against it. Where a generative null is needed instead (rare queries with too few control neighbours), the V(D)J generation probability \(P_\mathrm{gen}\) is reshaped by the per-sequence Elhanati selection factor \(Q\) (\(\langle Q\rangle_{P_\mathrm{gen}} = 1\)), giving the post-selection law \(P_0 \propto Q\,P_\mathrm{gen}\). A single global thymic-acceptance fraction only rescales absolute frequencies and is a crude last-resort fallback. See the appendix remark on the selection factor for the full treatment.

Usage#

import seqtree

control = seqtree.load_control("human_trb_aa", size=1_000_000)   # cached after first build
target = seqtree.Index.build(vdjdb_cdr3s, alphabet="aa")          # unique clonotypes

p = seqtree.SearchParams(max_subs=1, engine="seqtm")
for q, r in zip(queries, seqtree.evalues(target, control, queries, p)):
    if r["p_enrichment"] < 1e-3:
        print(q, r["E"], r["n_target"], r["n_control"])

load_control ships a small bundled subset for quick use and downloads larger controls from the isalgo/airr_control dataset on demand (needs huggingface_hub); both are deduplicated to unique clonotypes. Meaningful E-values need a control at least as large as the target — see the precision bound below.

Note

Safe to call from many processes at once, on a cold cache. The built index is cached under ~/.cache/seqtree, and Index.save writes it through a temporary file which is then renamed into place — an atomic operation on the same filesystem. A worker that arrives while another is still writing therefore sees either no cache or a complete one, never a half-written index.

That matters on first use in any multi-process fan-out — pytest-xdist, a Snakemake or Nextflow rule, a multiprocessing pool — all sharing one ~/.cache. Once the cache is warm it is read-only and was never at risk. If filelock is installed (it arrives with huggingface_hub) the build is additionally serialized, so one worker builds and the rest wait and load rather than all N rebuilding the same index; correctness does not depend on it.

A cache that is corrupt, truncated, or written by an older seqtree is silently rebuilt.

The cache is content-addressed. Its filename carries a fingerprint of everything that decides the sequences — the bundled asset’s own bytes, or the download’s source and seed — so a release that changes the control simply misses the old cache rather than serving it. Superseded caches are removed on the next build. You never need to clear ~/.cache/seqtree by hand.

Theory#

The full derivation — Poisson approximation with an explicit Chen–Stein / Le Cam error bound, the self-match / punctured-null lemma (benchmark-only exact-hit exclusion), clonotype-collapsing for over-dispersion, the tf-idf = self-information equivalence, multiple-testing control (Bonferroni and Benjamini–Hochberg), the control-size requirement \(M \gtrsim N/(\rho^2 E^\ast)\), the closest-hit Gumbel law, epitope detection complexity from the degree distribution (worked NLV vs GIL example), the Karlin–Altschul reduction, and the epitope-presentation limitation — is in the technical appendix linked at the top of this page (LaTeX source appendix/evalue.tex; rebuild with make -C appendix).