E-values for TCR hits
=====================

Fuzzy search tells you *which* references are near a query; an **E-value** tells you whether that
proximity is *surprising*. For TCR repertoires the hard part is biological redundancy: convergent
V(D)J recombination, public clones, and clonal expansion mean a query in a common region of
sequence space has many neighbours for purely generative reasons. A naive BLAST-style E-value,
computed under an i.i.d.-letter null, would call such hits wildly significant.

.. admonition:: Full derivation

   This page summarises the method; the complete derivation — theorems, finite-sample bounds, the
   selection Q-factor, multiple-testing correction, and the epitope detection-complexity formalism —
   is in the technical appendix: :download:`A control-calibrated E-value for fuzzy TCR sequence
   search (PDF) <_static/evalue-appendix.pdf>` (LaTeX source in ``appendix/evalue.tex``).

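seqtree calibrates against a **background control** instead. Take a target index of ``N`` unique
clonotypes (e.g. VDJdb) and a control index of ``M`` unique clonotypes from an unselected
repertoire. For a query ``q`` searched at a fixed scope/budget, let ``n_target(q)`` and
``n_control(q)`` be the number of neighbours found in the target and control index respectively.
Then: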

.. math::

   E(q) = \frac{N}{M}\, n_{\mathrm{control}}(q), \qquad
   p_{\mathrm{any}} = 1 - e^{-E}, \qquad
   p_{\mathrm{enrich}} = \Pr\!\big(\mathrm{Poisson}(E) \ge n_{\mathrm{target}}(q)\big).
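
The arithmetic behind these three quantities can be sketched with the standard library alone. This
is an illustration of the formulas, not seqtree's implementation; the function names
``poisson_sf`` and ``control_calibrated_stats`` and the example counts are hypothetical.

.. code-block:: python

   import math

   def poisson_sf(k, mu):
       """Pr(Poisson(mu) >= k) for an integer k, via 1 - CDF(k - 1)."""
       if k <= 0:
           return 1.0
       if mu == 0:
           return 0.0
       term = math.exp(-mu)          # pmf at 0
       cdf = term
       for i in range(1, k):         # accumulate pmf at 1 .. k-1
           term *= mu / i
           cdf += term
       # Note: 1 - cdf loses precision for very small tails; with SciPy installed,
       # scipy.stats.poisson.sf(k - 1, mu) computes the same tail more robustly.
       return max(0.0, 1.0 - cdf)

   def control_calibrated_stats(n_target, n_control, N, M):
       """Return (E, p_any, p_enrich) from neighbour counts and index sizes."""
       E = (N / M) * n_control               # expected target neighbours under the control
       p_any = -math.expm1(-E)               # 1 - exp(-E), computed stably for small E
       p_enrich = poisson_sf(n_target, E)    # Pr(Poisson(E) >= n_target)
       return E, p_any, p_enrich

   # Hypothetical counts: 20 control neighbours, 6 target neighbours.
   E, p_any, p_enrich = control_calibrated_stats(n_target=6, n_control=20,
                                                 N=50_000, M=1_000_000)

With these toy counts the control predicts about one target neighbour (``E`` is 1), so observing
six gives a ``p_enrich`` of roughly ``6e-4``.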

Redundancy explained by the background process inflates ``n_control`` and hence ``E``, so such hits
are **not** significant; antigen-driven convergence shows up as ``n_target`` exceeding ``E``. This is
the **TCRNET** approach — counting sequence neighbours against a real-world control repertoire —
put on a rigorous, finite-sample footing; it reduces to the classical Karlin–Altschul E-value when
the background is an i.i.d. product measure and alignments are ungapped.
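
For reference, the classical Karlin–Altschul E-value for ungapped local alignment has the standard
form below. The symbols :math:`\ell_q`, :math:`\ell_{db}`, :math:`S`, :math:`K` and
:math:`\lambda` belong to that classical setting and are not used elsewhere on this page.

.. math::

   E_{\mathrm{KA}} = K\,\ell_q\,\ell_{db}\,e^{-\lambda S}

Here :math:`\ell_q` and :math:`\ell_{db}` are the query and database lengths, :math:`S` is the raw
alignment score, and :math:`K` and :math:`\lambda` are constants fixed by the scoring matrix and
the i.i.d. letter frequencies. How the control-calibrated :math:`E(q)` maps onto this expression
is covered by the Karlin–Altschul reduction in the appendix.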

An empirical control is already a sample from the post-selection law: thymic and peripheral
selection are baked into a real repertoire, so no extra correction is applied when calibrating
against it. Where a
*generative* null is needed instead (rare queries with too few control neighbours), the V(D)J
generation probability :math:`P_\mathrm{gen}` is reshaped by the per-sequence Elhanati selection
factor :math:`Q` (:math:`\langle Q\rangle_{P_\mathrm{gen}} = 1`), giving the post-selection law
:math:`P_0 \propto Q\,P_\mathrm{gen}`. A single global thymic-acceptance fraction only rescales
absolute frequencies and is a crude last-resort fallback. See the appendix remark on the selection
factor for the full treatment.
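
The reweighting :math:`P_0 \propto Q\,P_\mathrm{gen}` can be illustrated on a toy distribution.
This is a minimal sketch under stated assumptions: the three sequence labels, their probabilities
and their ``Q`` values are invented for illustration, and ``post_selection_law`` is a hypothetical
helper, not part of seqtree.

.. code-block:: python

   def post_selection_law(pgen, q):
       """Return P0 proportional to Q * Pgen, normalised over the supplied sequences."""
       weights = {seq: q[seq] * p for seq, p in pgen.items()}
       z = sum(weights.values())
       return {seq: w / z for seq, w in weights.items()}

   # Toy generation probabilities over a three-sequence "universe" (hypothetical).
   pgen = {"seq_a": 0.5, "seq_b": 0.3, "seq_c": 0.2}

   # Per-sequence selection factors, chosen so that the Pgen-average of Q is 1.
   q = {"seq_a": 0.6, "seq_b": 1.0, "seq_c": 2.0}
   mean_q = sum(q[s] * pgen[s] for s in pgen)

   # Per-sequence Q reshapes the law: seq_c gains mass, seq_a loses it.
   p0 = post_selection_law(pgen, q)

   # A single global acceptance fraction is a constant Q: after normalisation
   # the shape of the distribution is unchanged.
   flat = post_selection_law(pgen, {s: 0.05 for s in pgen})

Because :math:`\langle Q\rangle_{P_\mathrm{gen}} = 1`, the product :math:`Q\,P_\mathrm{gen}` is
already normalised here, while the constant factor leaves ``flat`` equal to ``pgen``. That is the
sense in which a global fraction only rescales absolute frequencies.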

Usage
-----

.. code-block:: python

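   import seqtree

   # vdjdb_cdr3s and queries are assumed to be collections of CDR3 amino-acid
   # strings prepared by the caller; they are not defined by seqtree.
   control = seqtree.load_control("human_trb_aa", size=1_000_000)   # cached after first build
   target = seqtree.Index.build(vdjdb_cdr3s, alphabet="aa")          # unique clonotypes

   # One SearchParams object is passed to evalues for both indexes.
   p = seqtree.SearchParams(max_subs=1, engine="seqtm")
   for q, r in zip(queries, seqtree.evalues(target, control, queries, p)):
       # Report queries whose target neighbourhood is enriched over the control.
       if r["p_enrichment"] < 1e-3:
           print(q, r["E"], r["n_target"], r["n_control"])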

``load_control`` ships a small bundled subset for quick use and downloads larger controls from the
``isalgo/airr_control`` dataset on demand (this requires ``huggingface_hub``); both are
deduplicated to unique clonotypes. Meaningful E-values need a control at least as large as the
target; see the control-size requirement :math:`M \gtrsim N/(\rho^2 E^\ast)` under Theory.
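
When many queries are screened at once, the per-query ``p_enrichment`` values call for a
multiple-testing correction; the Theory section names Bonferroni and Benjamini–Hochberg. The
sketch below shows the two textbook procedures on a plain list of p-values. It is an illustration
rather than seqtree's own code, the helper names are hypothetical, and the p-values are invented.

.. code-block:: python

   def bonferroni(pvalues, alpha=0.05):
       """Reject where p <= alpha / m (family-wise error control)."""
       m = len(pvalues)
       return [p <= alpha / m for p in pvalues]

   def benjamini_hochberg(pvalues, alpha=0.05):
       """Reject the k smallest p-values, with k the largest rank such that
       p_(k) <= (k / m) * alpha (false-discovery-rate control)."""
       m = len(pvalues)
       order = sorted(range(m), key=lambda i: pvalues[i])
       k_max = 0
       for rank, i in enumerate(order, start=1):
           if pvalues[i] <= rank / m * alpha:
               k_max = rank
       rejected = [False] * m
       for i in order[:k_max]:
           rejected[i] = True
       return rejected

   # Invented p-values standing in for [r["p_enrichment"] for r in results].
   pvals = [0.001, 0.2, 0.012, 0.04, 0.6]
   keep_bonf = bonferroni(pvals)
   keep_bh = benjamini_hochberg(pvals)

On this toy input Bonferroni keeps only the first query, while Benjamini–Hochberg also keeps the
third; which procedure is appropriate depends on whether family-wise error or false-discovery
rate is the quantity to control.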

Theory
------

The full derivation is in the technical appendix linked at the top of this page (LaTeX source
``appendix/evalue.tex``; rebuild with ``make -C appendix``). It covers:

- the Poisson approximation with an explicit Chen–Stein / Le Cam error bound;
- the self-match / punctured-null lemma (benchmark-only exact-hit exclusion);
- clonotype-collapsing for over-dispersion;
- the tf-idf = self-information equivalence;
- multiple-testing control (Bonferroni and Benjamini–Hochberg);
- the control-size requirement :math:`M \gtrsim N/(\rho^2 E^\ast)`;
- the closest-hit Gumbel law;
- **epitope detection complexity** from the degree distribution (worked NLV vs GIL example);
- the Karlin–Altschul reduction;
- the epitope-presentation limitation.
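
The control-size requirement :math:`M \gtrsim N/(\rho^2 E^\ast)` named above can be turned into a
quick sizing check. The sketch assumes one particular reading, which the appendix should be
consulted to confirm: :math:`\rho` is the tolerated relative error of the estimated E-value and
:math:`E^\ast` is the smallest E-value that must be resolved. Under that reading, a Poisson count
``n_control`` with mean :math:`M E / N` has relative standard deviation :math:`\sqrt{N/(M E)}`,
and requiring this to be at most :math:`\rho` at :math:`E = E^\ast` gives the bound. Both function
names are hypothetical.

.. code-block:: python

   import math

   def relative_error(N, M, E):
       """Relative standard deviation of (N / M) * n_control when
       n_control ~ Poisson(M * E / N). Assumed reading of rho."""
       return math.sqrt(N / (M * E))

   def required_control_size(N, rho, e_star):
       """Smallest integer M with N / (rho**2 * e_star) <= M."""
       return math.ceil(N / (rho ** 2 * e_star))

   # Hypothetical target of 50,000 unique clonotypes.
   m_loose = required_control_size(N=50_000, rho=0.5, e_star=1.0)
   m_tight = required_control_size(N=50_000, rho=0.25, e_star=0.5)
   check = relative_error(N=50_000, M=m_loose, E=1.0)

Halving :math:`\rho` quadruples the required control, and halving :math:`E^\ast` doubles it, so
resolving small E-values precisely is what drives the control size well beyond the target size.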