vdjmatch#
Fast, control-calibrated annotation of T-cell receptor antigen specificity.
vdjmatch annotates clonotypes in large AIRR repertoires against
VDJdb by fuzzy CDR3 search, reporting a
control-calibrated E-value (BLAST-style significance against a background repertoire) and
enriched antigen-specificity labels. It is a Python rewrite of the legacy Java/Groovy vdjmatch,
built on the seqtree search core.
Note
vdjmatch 0.0.1 is an early release. The single-chain annotator (VDJdb fetch, AIRR I/O,
fuzzy search, E-values, epitope-enrichment summaries, CLI) is in place; paired α/β scoring, a
re-derived segment-aware substitution matrix (VDJAM), and tool comparisons are in progress.
Installation#
pip install vdjmatch
seqtree (the search engine) is installed as a dependency. For development:
python -m venv .venv && source .venv/bin/activate
pip install -e ".[test,bench]"
Quickstart#
Fetch the latest VDJdb release and annotate an AIRR rearrangement sample:
vdjmatch update # cache the latest VDJdb release
vdjmatch match --species HomoSapiens --scope 1,0,0,1 -o out sample_airr.tsv
This writes three tab-separated tables per sample:
out.<sample>.hits.txt: every query→VDJdb hit with CDR3 alignment, CIGAR, edit counts and score.
out.<sample>.calls.txt: one predicted epitope per query clonotype with its E-value.
out.<sample>.summary.txt: epitope-level enrichment (unique clonotypes, reads) by MHC class and antigen species.
Key ideas#
Control-calibrated E-value. Immune repertoires are biologically redundant (convergent
recombination, public clones), so a naive i.i.d. null massively over-calls. vdjmatch counts a
query’s VDJdb neighbours within a fixed search scope and compares that count with the one expected from a
matched background control repertoire. The Poisson-tail p_enrichment is significant only
when a clonotype has more VDJdb neighbours than the generative process predicts, which is the hallmark of
antigen-driven selection. The theory is derived in the seqtree appendix.
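As a rough illustration of the Poisson-tail idea, the sketch below computes P(X >= k) for an observed neighbour count k. It assumes the expected count is simply the mean number of VDJdb neighbours per control clonotype at the same scope. That calibration, and the names poisson_tail and enrichment_p, are assumptions made for this example and are not vdjmatch's API; the actual derivation is in the seqtree appendix.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam) by summing the upper tail directly."""
    if k <= 0:
        return 1.0
    if lam <= 0.0:
        return 0.0  # nothing expected under the null, so any hit has zero tail mass
    # P(X = k), computed in log space to avoid overflow in k!
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    j = k
    while True:
        total += term
        j += 1
        term *= lam / j  # P(X = j) obtained from P(X = j - 1)
        # Past the mode the terms shrink; stop once they no longer change the sum.
        if j > lam and term <= total * 1e-17:
            break
    return min(1.0, total)


def enrichment_p(observed: int, control_counts: list[int]) -> float:
    """Hypothetical helper: tail probability of `observed` VDJdb neighbours.

    `control_counts` holds the number of VDJdb neighbours found for each
    control clonotype at the same scope (an assumption of this sketch).
    """
    lam = sum(control_counts) / len(control_counts)
    return poisson_tail(observed, lam)


# A query with 3 VDJdb neighbours, against a control where most clonotypes have none.
p = enrichment_p(3, [0, 0, 1, 0, 0, 0, 1, 0, 0, 0])
print(f"p_enrichment ~ {p:.2e}")
```

Summing the tail directly, rather than computing one minus the CDF, keeps small p-values accurate, which matters when many clonotypes are tested.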
Scope / budget. --scope s,i,d,t sets the maximum substitutions, insertions, deletions and
total edits of the CDR3 search ball.
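To make the four budgets concrete, here is a minimal sketch that parses a scope string and checks whether some alignment of two CDR3s fits within it. It is a brute-force recursion for illustration only, not how seqtree searches. The helper names are hypothetical, and it assumes "insertion" means an extra residue in the target relative to the query.

```python
from functools import lru_cache


def parse_scope(text: str) -> tuple[int, int, int, int]:
    """Parse 's,i,d,t' into (substitutions, insertions, deletions, total)."""
    s, i, d, t = (int(x) for x in text.split(","))
    return s, i, d, t


def within_scope(query: str, target: str, scope: tuple[int, int, int, int]) -> bool:
    """True if some alignment of query to target respects every budget in scope."""
    max_s, max_i, max_d, max_t = scope

    @lru_cache(maxsize=None)
    def search(qi: int, ti: int, s: int, i: int, d: int) -> bool:
        # s, i, d count the edits used so far; reject as soon as any budget is exceeded.
        if s > max_s or i > max_i or d > max_d or s + i + d > max_t:
            return False
        if qi == len(query) and ti == len(target):
            return True
        # Match (free) or substitution (costs one).
        if qi < len(query) and ti < len(target):
            cost = 0 if query[qi] == target[ti] else 1
            if search(qi + 1, ti + 1, s + cost, i, d):
                return True
        # Insertion: the target carries a residue the query lacks.
        if ti < len(target) and search(qi, ti + 1, s, i + 1, d):
            return True
        # Deletion: a query residue is absent from the target.
        if qi < len(query) and search(qi + 1, ti, s, i, d + 1):
            return True
        return False

    return search(0, 0, 0, 0, 0)


scope = parse_scope("1,0,0,1")  # the Quickstart setting: one substitution, no indels
print(within_scope("CASSLGQYF", "CASSLGRYF", scope))   # single substitution
print(within_scope("CASSLGQYF", "CASSLGQAYF", scope))  # needs an insertion
```

The total budget t is what lets a scope such as 1,1,1,1 mean "one edit of any kind" rather than "up to three edits".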
VDJAM. A TCR-specific amino-acid substitution matrix (bundled), with optional region-aware
weighting that emphasises the antigen-contacting NDN core over the germline-fixed V/J flanks
(germline-retention profiles derived from the OLGA model via mirpy).
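The effect of region-aware weighting can be sketched with a toy position profile. The example below is not VDJAM and does not use OLGA or mirpy: it assumes a fixed number of flank residues at each end and a unit mismatch cost, purely to show how a core mismatch can be made to count more than a flank mismatch. All names and parameter values are hypothetical.

```python
def region_weights(length: int, flank: int = 3, flank_weight: float = 0.25) -> list[float]:
    """Toy stand-in for a germline-retention profile.

    The first and last `flank` positions get `flank_weight`; the central
    positions (taken here as the NDN core) get weight 1.0.
    """
    return [
        flank_weight if (pos < flank or pos >= length - flank) else 1.0
        for pos in range(length)
    ]


def weighted_mismatch(a: str, b: str, flank: int = 3, flank_weight: float = 0.25) -> float:
    """Sum of position weights over mismatching positions of two equal-length CDR3s."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles substitutions (equal lengths)")
    weights = region_weights(len(a), flank, flank_weight)
    return sum(w for w, x, y in zip(weights, a, b) if x != y)


core = weighted_mismatch("CASSLGQYF", "CASSPGQYF")   # mismatch at a central position
flank = weighted_mismatch("CASSLGQYF", "CSSSLGQYF")  # mismatch in the V-side flank
print(core, flank)
```

In a real profile the weights would vary smoothly with position and CDR3 length instead of switching at a fixed flank boundary.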
What scoring actually buys. An empirical study on VDJdb (the scoring appendix) reports three findings. First, Hamming distance 1 is the signal-to-noise optimum (neighbour purity falls from 0.53 to 0.13 as edit distance grows from 1 to 5; the original VDJdb observation). Second, central substitutions carry the specificity signal: a core mismatch changes specificity far more than a germline-flank one. Third, no substitution matrix beats BLOSUM62 for epitope retrieval. The substitution matrix is therefore a second-order lever, and the first-order statistic is the control-calibrated E-value.