API reference

mhcmatch.store module

MHC restriction & presentation from a reference epitope panel.

Productionizes the validated reverse-problem method (seqtree bench/bench_mhc_guess.py): index reference peptides by their anchored presentation signature (seqtree.layout.presentation_features()), widen the search scope around a query until it has enough neighbours, then rank presenting alleles by neighbour vote fraction and score confidence by a binomial-tail enrichment over the panel background. The vote fraction is the ranking statistic (robust to panel skew); the enrichment is the non-binder filter.

Significance theory: appendix/mhcmatch.tex §2-3 (forward per-allele E-value + reverse problem).

mhcmatch.store.infer_class(peptide)

Heuristic MHC class from peptide length: MHC-I if the peptide has <= 11 residues, else MHC-II. Pass cls explicitly (e.g. to the Store methods, which default to cls=None) to override.

Parameters: peptide (str)
Return type: str

class mhcmatch.store.Restriction(allele: str, vote: float, enrichment: float, n_votes: int, binder: bool, anchor_score: float | None = None)

Bases: object

Parameters:

allele (str)
vote (float)
enrichment (float)
n_votes (int)
binder (bool)
anchor_score (float | None)

allele: str

vote: float

enrichment: float

n_votes: int

binder: bool

anchor_score: float | None = None

class mhcmatch.store.Decomposition(peptide: str, tcr_facing: str, presentation: str, anchors: tuple)

Bases: object

Parameters:

peptide (str)
tcr_facing (str)
presentation (str)
anchors (tuple)

peptide: str

tcr_facing: str

presentation: str

anchors: tuple

mhcmatch.store.anchor_indices(peptide, cls)

0-based anchor positions for a peptide: class-I P2/PΩ, class-II core P1/P4/P6/P9.

Parameters:

peptide (str)
cls (str)

Return type:

tuple

mhcmatch.store.resolve_anchor_index(peptide, cls, anchor)

0-based index of a scoring anchor in peptide (or None if out of range).

MHC-I: anchor is a 1-based peptide position (negatives count from the C-terminus). MHC-II: anchor is a 1-based position within the register-anchored 9-mer core (P1..P9).

Parameters:

peptide (str)
cls (str)
anchor (int)
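
The MHC-I rule above (1-based positions, negatives counted from the C-terminus, None when out of range) can be sketched as follows. This is an illustrative stand-alone function, not the library's implementation; the name resolve_class1_anchor and the choice to treat anchor=0 as out of range are assumptions.

```python
from typing import Optional


def resolve_class1_anchor(peptide: str, anchor: int) -> Optional[int]:
    """Map a 1-based anchor (negatives count from the C-terminus) to a 0-based index."""
    n = len(peptide)
    if anchor > 0:
        idx = anchor - 1          # P2 -> index 1
    elif anchor < 0:
        idx = n + anchor          # -1 -> last residue (P-omega)
    else:
        return None               # assumption: 0 is not a valid 1-based position
    return idx if 0 <= idx < n else None


peptide = "SIINFEKL"
p2 = resolve_class1_anchor(peptide, 2)
p_omega = resolve_class1_anchor(peptide, -1)
too_far = resolve_class1_anchor(peptide, 9)
```

For an 8-mer this maps P2 to index 1 and PΩ to index 7; position 9 does not exist, so the sketch returns None.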

class mhcmatch.store.Store

Bases: object

Searchable reference panel of presented peptides, partitioned by MHC class.

classmethod from_records(records)

records: dicts with the keys epitope, mhc_a (or mhc), and mhc_class. The optional weight key (default 1.0) confidence-weights the peptide in anchor-preference estimation.

classmethod from_pmhc(path=None, tier='full', species=None, classes=('mhc1', 'mhc2'))

Load the isalgo/pmhc_data TSV(.gz). species filters by MHC species ("human" / "mouse"). If path is None, uses $MHCMATCH_PMHC/pmhc_<tier>.tsv.gz.

alleles(cls)

restriction(peptide, cls=None, alleles='all', top=10, alpha=0.05, diffuse=False)

Rank presenting alleles for peptide (vote fraction), flag binders (enrichment).

alleles: "all", a single allele, or a list. alpha: per-allele significance for the non-binder flag (binder iff binomial-tail p <= alpha and the allele got votes).

With diffuse=True, the diffusion-shrunk anchor log-odds (mhcmatch.diffusion.AnchorModel) ranks the alleles (vote breaks ties) and the neighbour vote/enrichment feeds the binder gate: an allele is a binder if it is vote-significant or its anchors are plausible (anchor_score > 0). On held-out (novel) peptides the anchor log-odds is the far better ranker. The vote method relies on same-allele signature neighbours, which are sparse for a genuinely new peptide, so vote-first ranking buries the true allele; the diffused anchor score scores every allele directly and rescues rare ones.

Without diffusion, vote fraction ranks and the call returns [] when there are no neighbours.
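
As a rough illustration of the vote fraction and the binomial-tail binder flag described above, the following stand-alone sketch computes both for one allele. It is not the library's code: the function names are hypothetical, and taking enrichment as -log10 of the tail p-value is an assumption inferred from the statement elsewhere in this reference that the vote tail p-value is 10**(-enrichment).

```python
import math


def binomial_tail(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(math.comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def vote_and_enrichment(n_votes: int, n_neighbours: int, background_freq: float):
    """Vote fraction plus a -log10 binomial-tail enrichment over the panel background."""
    vote = n_votes / n_neighbours
    p = binomial_tail(n_votes, n_neighbours, background_freq)
    enrichment = -math.log10(p)   # assumption: enrichment = -log10(p)
    return vote, enrichment, p


# Toy numbers: 8 of 10 neighbours carry the allele, which covers 20% of the panel.
n_votes, alpha = 8, 0.05
vote, enrichment, p = vote_and_enrichment(n_votes, 10, 0.2)
binder = (p <= alpha) and (n_votes > 0)
```

The vote fraction only depends on the neighbours, while the tail probability compares the vote count with the allele's background frequency, which is why a frequent allele needs more votes to be flagged.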

is_binder(peptide, allele, cls=None, alpha=0.05)

is_presented(peptide, cls=None, alpha=0.05)

Overall presentation: does any panel allele present this peptide?

scan_protein(protein, cls='mhc1', alleles='all', lengths=None, alpha=0.05, top=3, correction=None)

Slide all binding-length windows over protein and return presented peptides.

Returns [(position, peptide, [Restriction, ...]), ...] for windows with >=1 binder.

correction controls multiple testing over the (window, allele) presentation tests in the scan (appendix §5): None (default) keeps the per-window per-allele alpha; "bonferroni" controls the family-wise error rate (threshold alpha/m); "bh" controls the Benjamini-Hochberg false-discovery rate. m is the number of voted (window, allele) tests. The vote tail p-value is 10**(-enrichment); corrected calls replace the per-window binder flag.
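
The two corrections can be sketched on a list of p-values recovered from enrichment scores. This is a generic textbook implementation of Bonferroni and Benjamini-Hochberg, offered as an illustration; the function names and the toy enrichment values are assumptions, and the library's own bookkeeping over (window, allele) tests may differ.

```python
def bonferroni(pvals, alpha=0.05):
    """Family-wise control: keep tests with p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals] if m else []


def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: keep the k smallest p-values, k = max rank with p_(k) <= alpha * k / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k_max = rank
    keep = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            keep[i] = True
    return keep


# Toy enrichment scores for m = 4 voted (window, allele) tests.
enrichments = [4.0, 2.0, 1.6, 1.0]
pvals = [10 ** (-e) for e in enrichments]
bonf_calls = bonferroni(pvals)
bh_calls = benjamini_hochberg(pvals)
```

With these toy values Bonferroni keeps the two strongest tests, while Benjamini-Hochberg also keeps the third, reflecting its less conservative error criterion.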

decompose(peptide, cls=None, allele=None)

Split peptide into anchor and TCR-facing parts, each masked with X.

tcr_facing: anchors -> X (the recognition readout). presentation: TCR-facing -> X (the anchor readout). allele is accepted for forward-compat (allele-specific learned anchors, Phase 1); v0 uses class-default anchor positions.
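
The two masked views can be illustrated with a small stand-alone function. It is a sketch of the masking idea only; mask_views is a hypothetical name, and the anchor tuple is supplied by hand (0-based P2 and PΩ of an 8-mer) rather than derived by the library.

```python
def mask_views(peptide: str, anchors):
    """Return (tcr_facing, presentation) with the complementary positions masked by X."""
    anchor_set = set(anchors)
    tcr_facing = "".join("X" if i in anchor_set else aa for i, aa in enumerate(peptide))
    presentation = "".join(aa if i in anchor_set else "X" for i, aa in enumerate(peptide))
    return tcr_facing, presentation


# 0-based anchors for an MHC-I 8-mer: P2 (index 1) and the C-terminal residue (index 7).
tcr_facing, presentation = mask_views("SIINFEKL", (1, 7))
```

Here tcr_facing is "SXINFEKX" and presentation is "XIXXXXXL": every residue appears in exactly one of the two views.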

anchor_model(cls='mhc1', h=2.0, prior_strength=10.0, anchors=None, learn_weights=True, prune_dpi=False, weights='learned')

Anchor-factored presentation model with cross-allele kernel-shrinkage diffusion.

See mhcmatch.diffusion.AnchorModel. The diffusion rescues rare alleles by borrowing anchor preferences from groove-similar frequent ones, with a bounded prior strength so a large neighbour cannot swamp a rare allele’s own peptides.

anchor_preferences(cls, anchor)

{allele: Counter(residue)} at a 1-based anchor position (negatives count from the C-terminus).

mhcmatch.search module

Large-scale peptide similarity search over big peptide sets / proteomes.

Two notions of “similar”, both via the seqtree C++ KmerIndex seed-and-gather:

mode="tcr" – anchor-masked TCR-facing homology: similar T-cell recognition profile (the basis for cross-reactivity / molecular mimicry).
mode="mhc" – anchored presentation signature: likely presented by the same MHC.

For neoantigen mimicry with per-allele presentation-aware E-values, use find_mimics() (re-exported from seqtree). See appendix/mhcmatch.tex §5.

class mhcmatch.search.Match(peptide: str, shared_kmers: int, score: int)

Bases: object

Parameters:

peptide (str)
shared_kmers (int)
score (int)

peptide: str

shared_kmers: int

score: int

mhcmatch.search.search(query, peptides, mode='tcr', cls='mhc1', k=4, max_subs=1, min_shared=1, exclude_self=True, threads=0)

Returns the peptides in peptides that are similar to query under mode ("tcr" or "mhc").

mhcmatch.proteome module

Near-exact source-peptide lookup against a reference proteome.

Given a query peptide (e.g. a neoantigen), find the near-exact self peptide it derives from, together with its parent protein and position. The search compares the full, unmasked query against every proteome window of the query’s length and keeps windows within max_subs substitutions, using the seqtree Hamming fast path. This mode is distinct from the anchor-masked TCR-facing homology search and the presentation-signature search. See appendix/mhcmatch.tex §5 (near-exact source identification).

mhcmatch.proteome.read_fasta(path)

{name: sequence} from an (optionally gzipped) FASTA; name is the first whitespace-delimited token of the header.

class mhcmatch.proteome.SourceHit(protein: str, position: int, ref_peptide: str, n_subs: int, mutations: tuple)

Bases: object

Parameters:

protein (str)
position (int)
ref_peptide (str)
n_subs (int)
mutations (tuple)

protein: str

position: int

ref_peptide: str

n_subs: int

mutations: tuple

class mhcmatch.proteome.Proteome(seqs)

Bases: object

A reference proteome with lazily-built per-length window indices.

classmethod from_fasta(path)

find_source(peptide, max_subs=1, exclude_exact=False)

Self peptides within max_subs substitutions of peptide, nearest first.

Returns [SourceHit, ...]. exclude_exact=True drops perfect (0-mismatch) matches, which is useful for finding the wild-type peptide a mutated neoantigen derives from when the query itself also occurs in the proteome.
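
The search semantics can be sketched with a brute-force window scan. This is an illustration only: the library uses the seqtree Hamming fast path and lazily built per-length indices, whereas this sketch simply loops over windows. The function name, the 0-based position, and the (index, reference residue, query residue) mutation tuples are assumptions, not the documented SourceHit layout.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_source_sketch(peptide, proteome, max_subs=1, exclude_exact=False):
    """Brute-force scan of all windows of len(peptide); nearest hits first."""
    n = len(peptide)
    hits = []
    for name, seq in proteome.items():
        for pos in range(len(seq) - n + 1):
            window = seq[pos:pos + n]
            d = hamming(peptide, window)
            if d > max_subs or (exclude_exact and d == 0):
                continue
            muts = tuple((i, w, q) for i, (w, q) in enumerate(zip(window, peptide)) if w != q)
            hits.append((d, name, pos, window, muts))
    hits.sort()  # fewest substitutions first
    return hits


proteome = {"P1": "MKTSIINFEKLAA"}           # toy single-protein proteome
exact = find_source_sketch("SIINFEKL", proteome)
mutant = find_source_sketch("SIIYFEKL", proteome)
no_exact = find_source_sketch("SIINFEKL", proteome, exclude_exact=True)
```

The mutant query recovers the reference window SIINFEKL with one substitution, while exclude_exact=True removes the only hit for the unmutated query.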

mhcmatch.pseudoseq module

MHC pseudosequence allele-similarity & cross-allele diffusion.

Each allele is a 34-residue groove pseudosequence (NetMHCpan-style; vendored in data/{mhci,mhcii}_pseudo.fa). Allele similarity is an anchor-factored kernel over these positions: K_j(a,b) = exp(-d_j(a,b)/h) where d_j is a position-weighted Hamming distance and the per-anchor weights w_j say which groove residues govern peptide anchor j (e.g. MHC-I P2 vs PΩ). learn_anchor_weights() learns w_j from data (mutual information between a groove position and the allele’s anchor-residue choice) – the “feature importance” of each pocket.

Kernel-weighted shrinkage (Pseudoseq.shrink()) borrows presented-peptide statistics from similar alleles to rescue rare ones, lifting the seqtree limitation “distinct alleles are distinct nulls”. See appendix/mhcmatch.tex §4.
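
The kernel formula K_j(a,b) = exp(-d_j(a,b)/h) can be sketched with a plain position-weighted Hamming distance. This is a toy illustration under stated assumptions: the function names are hypothetical, the pseudosequences are shortened to 4 residues instead of 34, and the Pseudoseq class may use a different distance (its constructor exposes a metric argument).

```python
import math


def weighted_hamming(a: str, b: str, weights=None) -> float:
    """Sum of per-position weights over mismatching positions (weights default to 1)."""
    if len(a) != len(b):
        raise ValueError("pseudosequences must have equal length")
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def kernel(a: str, b: str, h: float = 2.0, weights=None) -> float:
    """K(a, b) = exp(-d(a, b) / h); equals 1 for identical grooves."""
    return math.exp(-weighted_hamming(a, b, weights) / h)


same = kernel("AAAA", "AAAA")
one_diff = kernel("AAAA", "AAAB", h=2.0)
# If the differing position has zero weight for this anchor, the alleles look identical.
ignored = kernel("AAAA", "AAAB", h=2.0, weights=[1.0, 1.0, 1.0, 0.0])
```

The per-anchor weights decide which groove differences matter: a mismatch at a position that is irrelevant to the anchor leaves the kernel at 1.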

mhcmatch.pseudoseq.normalize_allele(a)

pmhc allele name -> pseudosequence-FASTA key.

Drops the * ('HLA-A*02:01' -> 'HLA-A02:01') and repairs the mouse H-2 dash (pmhc 'H-2Kb' -> FASTA 'H-2-Kb').

Parameters: a (str)
Return type: str
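
The two documented rewrites can be reproduced with a short stand-alone sketch. It covers only the cases named above and is not the library's implementation; normalize_allele_sketch is a hypothetical name and the regular expression is an assumption about how the H-2 dash could be repaired.

```python
import re


def normalize_allele_sketch(a: str) -> str:
    """Drop '*' and insert the missing dash after a leading 'H-2'."""
    a = a.replace("*", "")
    # 'H-2Kb' -> 'H-2-Kb'; names that already have the dash are left alone.
    return re.sub(r"^H-2(?!-)", "H-2-", a)


human = normalize_allele_sketch("HLA-A*02:01")
mouse = normalize_allele_sketch("H-2Kb")
already_ok = normalize_allele_sketch("H-2-Kb")
```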

mhcmatch.pseudoseq.class2_key(mhc_a, mhc_b='')

pmhc class-II allele -> pseudosequence-FASTA key (locus-aware).

DR (the DRA chain is monomorphic) is keyed by the beta chain alone, e.g. 'HLA-DRB1*01:01' -> 'DRB1_0101'. DP/DQ are keyed by the alpha-beta pair, e.g. ('HLA-DPA1*01:03', 'HLA-DPB1*04:01') -> 'HLA-DPA10103-DPB10401'. With no beta chain the input is returned unchanged (mouse H-2 and fallbacks).

Parameters:

mhc_a (str)
mhc_b (str)

Return type:

str

mhcmatch.pseudoseq.resolve_allele(name, cls)

Resolve a user-typed allele name to a pseudosequence key for cls.

Returns (key, exact). exact=True when name (after normalize_allele()) is a known key. Otherwise the closest key by name is returned with exact=False: a missing HLA- prefix is repaired, and a too-short name (e.g. two-field 'HLA-A02:01') is completed by prefix to its first matching key. Returns (None, False) if nothing matches. Serotype names ('HLA-A2') are not expanded. This lets callers accept messy input ('A*02:01', 'HLA-A0201') and report when a requested allele is unknown rather than silently dropping it.

Parameters:

name (str)
cls (str)

mhcmatch.pseudoseq.load_pseudo(cls)

allele-id -> 34-mer for the bundled pseudosequence FASTA of a class.

Parameters: cls (str)
Return type: dict

mhcmatch.pseudoseq.mutual_information(xs, ys)

MI(X;Y) in bits for two aligned categorical sequences.

Return type: float
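
A plug-in estimate of mutual information in bits can be written in a few lines. This is the standard empirical formula, shown as a sketch; the function name and the handling of empty input are assumptions, and the library may apply corrections not shown here.

```python
import math
from collections import Counter


def mutual_information_bits(xs, ys) -> float:
    """Empirical MI(X;Y) = sum_xy p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    if len(xs) != len(ys):
        raise ValueError("sequences must be aligned (equal length)")
    n = len(xs)
    if n == 0:
        return 0.0  # assumption: empty input carries no information
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


dependent = mutual_information_bits("AABB", "CCDD")    # Y is a function of X
independent = mutual_information_bits("ABAB", "CCDD")  # X tells nothing about Y
```

Two perfectly coupled binary variables share 1 bit; the independent pair shares 0 bits.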

mhcmatch.pseudoseq.learn_anchor_weights(pseudo_seqs, anchor_residue, prune_dpi=False, tol=0.0)

Per-position relevance w[p] = MI(groove position p residue ; anchor residue) across alleles, normalized to mean 1. anchor_residue: {allele: residue} (e.g. the modal residue at one peptide anchor for that allele). Positions that discriminate the anchor get more weight.

Raw MI is inflated by linkage between groove positions (they co-vary across alleles), so many positions look relevant and the per-pocket profile is smeared. With prune_dpi=True an ARACNE data-processing-inequality prune removes indirect links: position p’s edge to the pocket is dropped if some other position q is more informative about the pocket and about p (I(p;pocket) <= min(I(q;pocket), I(p;q))), leaving the direct pocket positions sparse and distinct.

Parameters:

pseudo_seqs (dict)
anchor_residue (dict)
prune_dpi (bool)
tol (float)

Return type:

list
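
The raw (unpruned) weighting can be sketched as per-position mutual information normalized to mean 1. This is a simplified illustration, not the library's implementation: the function names are hypothetical, the DPI pruning step is omitted, the toy pseudosequences have 2 positions instead of 34, and falling back to uniform weights when every MI is zero is an assumption.

```python
import math
from collections import Counter


def mi_bits(xs, ys) -> float:
    """Empirical mutual information in bits between two aligned sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def anchor_weights_sketch(pseudo_seqs: dict, anchor_residue: dict) -> list:
    """w[p] = MI(groove position p ; anchor residue) across alleles, scaled to mean 1."""
    alleles = [a for a in pseudo_seqs if a in anchor_residue]
    ys = [anchor_residue[a] for a in alleles]
    width = len(pseudo_seqs[alleles[0]])
    mi = [mi_bits([pseudo_seqs[a][p] for a in alleles], ys) for p in range(width)]
    mean = sum(mi) / width
    if mean == 0:
        return [1.0] * width  # assumption: uniform weights when nothing is informative
    return [m / mean for m in mi]


# Position 0 is constant across alleles; position 1 tracks the anchor residue.
pseudo_seqs = {"a1": "AK", "a2": "AK", "a3": "AE", "a4": "AE"}
anchor_residue = {"a1": "L", "a2": "L", "a3": "K", "a4": "K"}
weights = anchor_weights_sketch(pseudo_seqs, anchor_residue)
```

All of the weight lands on the discriminating position, and the mean-1 normalization keeps the total distance scale comparable to unweighted Hamming.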

mhcmatch.pseudoseq.load_structural_weights(cls)

Per-anchor structural pocket weights from the vendored structural_pockets_<cls>.tsv (contact frequency of each groove position with each peptide anchor, over pMHC structures; see bench/structural_pockets.py). Returns {anchor:int -> [34 weights]} normalized to mean 1, or {} if the file is absent. A structural alternative/prior to learn_anchor_weights().

Parameters: cls (str)
Return type: dict

class mhcmatch.pseudoseq.Pseudoseq(cls, h=2.0, weights=None, metric='blosum')

Bases: object

Allele-similarity kernel and diffusion over groove pseudosequences for one MHC class.

kernel(a, b, anchor=None)

Return type: float

neighbors(allele, candidates=None, anchor=None, top=10, min_k=0.0)

[(allele, kernel), ...] for the alleles most groove-similar to allele (self excluded).

cluster(alleles, anchor=None, threshold=0.5)

Single-linkage clusters: merge alleles with kernel >= threshold. O(n^2); use on a panel (~hundreds of alleles), not the full 4k-allele set.
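
Single-linkage clustering at a similarity threshold amounts to taking connected components of the graph whose edges are the pairs with kernel >= threshold. The sketch below does this with a union-find over all pairs, which is where the O(n^2) cost comes from. The function name, the toy similarity table, and the sorted output format are assumptions for illustration.

```python
def single_linkage(items, sim, threshold=0.5):
    """Connected components of the graph with an edge wherever sim(a, b) >= threshold."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(items):
        for b in items[i + 1:]:            # all pairs: O(n^2) similarity evaluations
            if sim(a, b) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


# Toy symmetric similarities; unlisted pairs count as 0.
table = {("A", "B"): 0.9, ("B", "C"): 0.6, ("A", "C"): 0.1}
sim = lambda a, b: table.get((a, b), table.get((b, a), 0.0))
clusters = single_linkage(["A", "B", "C", "D"], sim, threshold=0.5)
```

A and C end up in the same cluster even though their direct similarity is 0.1, because single linkage chains them through B.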

shrink(prefs, allele, anchor=None, candidates=None, prior_strength=None)

Kernel-weighted empirical-Bayes pooling of a per-anchor residue distribution.

prefs: {allele: Counter(residue -> count)} for one anchor. Returns the shrunk probability dict for allele.

With prior_strength=None (default) this is the counts-weighted form (n_a π_a + Σ_b K_ab n_b π_b) / (n_a + Σ_b K_ab n_b) with limits h -> 0 (raw per-allele) and h -> ∞ (global pool). With prior_strength=τ it uses the fixed-concentration form (n_a π_a + τ m_a) / (n_a + τ) where m_a is the kernel-weighted neighbour mean – a bounded prior that prevents one large neighbour from swamping a rare allele’s own peptides and self-adapts to n_a (appendix §4, Prop. on bias–variance). The latter is the recommended default for the forward scorer.

Return type: dict
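
The two pooling forms can be compared on toy counts. The sketch below follows the formulas quoted above but is not the library's code: shrink_sketch is a hypothetical name, the kernel is passed in as a plain function, and computing the neighbour mean m_a from kernel-weighted counts is an assumption about a detail the text leaves open.

```python
from collections import Counter


def shrink_sketch(prefs, allele, kernel, prior_strength=None):
    """Pool one anchor's residue counts for `allele` with kernel-weighted neighbours."""
    own = prefs.get(allele, Counter())
    n_a = sum(own.values())
    pooled = Counter()
    for b, counts in prefs.items():
        if b == allele:
            continue
        k = kernel(allele, b)
        for residue, c in counts.items():
            pooled[residue] += k * c          # K_ab * n_b * pi_b
    w = sum(pooled.values())                  # sum_b K_ab * n_b
    residues = set(own) | set(pooled)
    if prior_strength is None:                # counts-weighted form
        total = n_a + w
        return {r: (own[r] + pooled[r]) / total for r in residues} if total else {}
    if w == 0:                                # no neighbours: fall back to own frequencies
        return {r: own[r] / n_a for r in own} if n_a else {}
    tau = prior_strength                      # fixed-concentration form
    return {r: (own[r] + tau * pooled[r] / w) / (n_a + tau) for r in residues}


prefs = {"rare": Counter(L=2), "big": Counter(L=100, M=900)}
half = lambda a, b: 0.5                       # toy constant kernel
counts_form = shrink_sketch(prefs, "rare", half)
fixed_form = shrink_sketch(prefs, "rare", half, prior_strength=10.0)
```

In the counts-weighted form the large neighbour dominates (L falls to 52/502), whereas the fixed-concentration form caps the neighbour's influence at tau pseudo-counts and keeps L at 3/12.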

mhcmatch.diffusion module

Anchor-factored presentation scoring with cross-allele kernel-shrinkage diffusion.

A per-allele anchor log-odds predictor (a small PWM over the anchor positions, MHC-I P2/PΩ by default) whose per-allele anchor residue distributions are smoothed toward groove-similar alleles via mhcmatch.Pseudoseq. With raw=True (or bandwidth h -> 0) there is no borrowing and a rare allele is scored from its own few peptides; with diffusion on, it borrows from frequent groove-neighbours, which rescues rare alleles. This provides the data-rescued null for the forward per-allele E-value (appendix/mhcmatch.tex §4).

class mhcmatch.diffusion.AnchorModel(store, cls='mhc1', anchors=None, h=2.0, prior_strength=10.0, learn_weights=True, prune_dpi=False, weights='learned', blend_alpha=0.5)

Bases: object

Per-allele anchor presentation model with optional cross-allele diffusion.

Built from a mhcmatch.Store. anchors are 1-based positions (default MHC-I P2/PΩ). Per-anchor groove-position weights are learned by mutual information unless learn_weights is False; the kernel bandwidth h controls how much rare alleles borrow.

score(peptide, allele, raw=False, eps=0.001)

Anchor log-odds of peptide for allele vs the panel background.

raw=True uses the allele’s own anchor frequencies (no borrowing); the default diffuses over groove-similar alleles. Returns -inf if the peptide is too short for the anchors.
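
An anchor log-odds score of this kind can be sketched as a sum of per-anchor log ratios between allele and background residue frequencies. The sketch is illustrative and makes several assumptions that the text does not confirm: the function name, the use of eps as an additive pseudo-frequency, the natural logarithm, and the toy frequency tables are all hypothetical.

```python
import math


def anchor_log_odds(peptide, allele_freqs, background, anchors=(2, -1), eps=0.001):
    """Sum over anchors of log((f_allele + eps) / (f_background + eps))."""
    n = len(peptide)
    score = 0.0
    for anchor in anchors:
        idx = anchor - 1 if anchor > 0 else n + anchor  # 1-based; negatives from C-term
        if not 0 <= idx < n:
            return float("-inf")                        # peptide too short for this anchor
        residue = peptide[idx]
        fg = allele_freqs[anchor].get(residue, 0.0) + eps
        bg = background[anchor].get(residue, 0.0) + eps
        score += math.log(fg / bg)
    return score


# Toy per-anchor frequency tables keyed by anchor position (P2 and P-omega).
allele_freqs = {2: {"L": 0.7, "M": 0.3}, -1: {"V": 0.6, "L": 0.4}}
background = {2: {"L": 0.1, "M": 0.05, "A": 0.85}, -1: {"V": 0.1, "L": 0.1, "K": 0.8}}
good = anchor_log_odds("SLYNTVATL", allele_freqs, background)
poor = anchor_log_odds("SAYNTVATK", allele_freqs, background)
too_short = anchor_log_odds("A", allele_freqs, background)
```

A peptide with preferred residues at both anchors scores above zero, one with disfavoured residues scores below zero, and a peptide too short to carry the anchors gets -inf.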

mhcmatch.logo module

Per-allele motif logos (information content) + length distributions.

motif() returns the numeric logo (PWM, per-position bits, length histogram) – pure-Python, always available. render() draws it with logomaker (optional [logo] extra). MHC-I logos use peptides of a fixed length (default the modal length); MHC-II uses register-anchored 9-mer cores. See appendix/mhcmatch.tex §6.

mhcmatch.logo.motif(store, allele, cls, length=None)

Logo data for allele’s presented peptides.

Returns {allele, cls, width, n, pwm, bits, length_hist} where pwm[i] is a residue->freq dict (sums to 1), bits[i] the information content (log2(20) - entropy) in [0, log2(20)], and length_hist a length->count dict over all the allele’s peptides.
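
The pwm and bits fields can be reproduced for a toy set of equal-length peptides. This is a minimal sketch of the information-content calculation (log2(20) minus the Shannon entropy of each column); motif_sketch is a hypothetical name, and the library's handling of lengths, cores, and weights is not modelled.

```python
import math
from collections import Counter


def motif_sketch(peptides):
    """Per-position residue frequencies and information content in bits."""
    width = len(peptides[0])
    pwm, bits = [], []
    for i in range(width):
        counts = Counter(p[i] for p in peptides)
        n = sum(counts.values())
        freqs = {aa: c / n for aa, c in counts.items()}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        pwm.append(freqs)
        bits.append(math.log2(20) - entropy)
    return pwm, bits


# Positions 0 and 1 are fully conserved; position 2 is uniform over four residues.
pwm, bits = motif_sketch(["ALA", "ALC", "ALD", "ALE"])
```

A fully conserved column reaches the maximum of log2(20) bits, while a column spread evenly over four residues loses exactly 2 bits.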

mhcmatch.logo.render(m, ax=None)

Render a motif() result as an information-content sequence logo (requires the optional [logo] extra).