API reference#
mhcmatch.store module#
MHC restriction & presentation from a reference epitope panel.
Productionizes the validated reverse-problem method (seqtree bench/bench_mhc_guess.py):
index reference peptides by their anchored presentation signature
(seqtree.layout.presentation_features()), widen the search scope around a query until it
has enough neighbours, then rank presenting alleles by neighbour vote fraction and score
confidence by a binomial-tail enrichment over the panel background. The vote fraction is the
ranking statistic (robust to panel skew); the enrichment is the non-binder filter.
Significance theory: appendix/mhcmatch.tex §2-3 (forward per-allele E-value + reverse problem).
- mhcmatch.store.infer_class(peptide)[source]#
Heuristic class from length: MHC-I if the peptide has <= 11 residues, else MHC-II. Pass cls to override.
- Parameters:
peptide (str)
- Return type:
str
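The length heuristic can be sketched in a few lines. This is an illustration only: the function name and the "mhc1" / "mhc2" return labels are assumptions (the labels are borrowed from the classes argument of Store.from_pmhc), not the library's confirmed return values.

```python
def infer_class_sketch(peptide: str, max_class1_len: int = 11) -> str:
    """Guess the MHC class from peptide length alone (hypothetical helper)."""
    # Peptides of up to 11 residues are treated as MHC-I, longer ones as MHC-II.
    return "mhc1" if len(peptide) <= max_class1_len else "mhc2"


short_call = infer_class_sketch("SIINFEKL")       # 8 residues
long_call = infer_class_sketch("PKYVKQNTLKLAT")   # 13 residues
```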
- class mhcmatch.store.Restriction(allele: 'str', vote: 'float', enrichment: 'float', n_votes: 'int', binder: 'bool', anchor_score: 'float | None' = None)[source]#
Bases: object
- Parameters:
allele (str)
vote (float)
enrichment (float)
n_votes (int)
binder (bool)
anchor_score (float | None)
- allele: str#
- vote: float#
- enrichment: float#
- n_votes: int#
- binder: bool#
- anchor_score: float | None = None#
- class mhcmatch.store.Decomposition(peptide: 'str', tcr_facing: 'str', presentation: 'str', anchors: 'tuple')[source]#
Bases: object
- Parameters:
peptide (str)
tcr_facing (str)
presentation (str)
anchors (tuple)
- peptide: str#
- tcr_facing: str#
- presentation: str#
- anchors: tuple#
- mhcmatch.store.anchor_indices(peptide, cls)[source]#
0-based anchor positions for a peptide: class-I P2/PΩ, class-II core P1/P4/P6/P9.
- Parameters:
peptide (str)
cls (str)
- Return type:
tuple
- mhcmatch.store.resolve_anchor_index(peptide, cls, anchor)[source]#
0-based index of a scoring anchor in peptide (or None if out of range).
MHC-I: anchor is a 1-based peptide position (negatives count from the C-terminus). MHC-II: anchor is a 1-based position within the register-anchored 9-mer core (P1..P9).
- Parameters:
peptide (str)
cls (str)
anchor (int)
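For the MHC-I case, the 1-based-to-0-based conversion described above might look like the following sketch. The function name is hypothetical, treating anchor=0 as invalid is an assumption, and the MHC-II core-register lookup is deliberately left out.

```python
from typing import Optional


def resolve_class1_anchor(peptide: str, anchor: int) -> Optional[int]:
    """Map a 1-based MHC-I anchor position to a 0-based index (sketch)."""
    if anchor > 0:
        idx = anchor - 1               # P1 -> index 0, P2 -> index 1, ...
    elif anchor < 0:
        idx = len(peptide) + anchor    # -1 -> last residue (PΩ)
    else:
        return None                    # assumption: 0 is not a valid position
    return idx if 0 <= idx < len(peptide) else None


p2 = resolve_class1_anchor("SIINFEKL", 2)
p_omega = resolve_class1_anchor("SIINFEKL", -1)
```

With the class-I default anchors (P2 and PΩ) this gives the pair of 0-based indices (1, len(peptide) - 1).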
- class mhcmatch.store.Store[source]#
Bases: object
Searchable reference panel of presented peptides, partitioned by MHC class.
- classmethod from_records(records)[source]#
records: dicts with keys epitope, mhc_a (or mhc), and mhc_class; an optional weight (default 1.0) confidence-weights the peptide in anchor-preference estimation.
- classmethod from_pmhc(path=None, tier='full', species=None, classes=('mhc1', 'mhc2'))[source]#
Load the isalgo/pmhc_data TSV(.gz). species filters by MHC species ("human" / "mouse"). If path is None, uses $MHCMATCH_PMHC/pmhc_<tier>.tsv.gz.
- restriction(peptide, cls=None, alleles='all', top=10, alpha=0.05, diffuse=False)[source]#
Rank presenting alleles for peptide (by vote fraction) and flag binders (by enrichment).
alleles: "all", a single allele, or a list. alpha: per-allele significance for the non-binder flag (binder iff binomial-tail p <= alpha and the allele got votes).
With diffuse=True, the diffusion-shrunk anchor log-odds (mhcmatch.diffusion.AnchorModel) does the ranking and the neighbour vote/enrichment does the gating: an allele is a binder if it is vote-significant or its anchors are plausible (anchor_score > 0). On held-out (novel) peptides the anchor log-odds is the far better ranker: the vote method relies on same-allele signature neighbours, which are sparse for a genuinely new peptide, so vote-first ranking buries the true allele, whereas the diffused anchor score scores every allele directly and rescues rare ones. Vote breaks ties. Without diffusion, vote fraction ranks and the call returns [] when there are no neighbours.
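The non-diffuse path (vote fraction for ranking, binomial-tail enrichment for the binder flag) can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: the function names, the plain-dict result rows, and the use of a per-allele background frequency as the binomial success probability. Only the relation enrichment = -log10(p) is taken from the scan_protein description.

```python
import math
from collections import Counter


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), clamped to 1.0 against rounding."""
    tail = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
    return min(1.0, tail)


def rank_alleles(neighbour_alleles, background, alpha=0.05):
    """Rank alleles by neighbour vote fraction; flag binders by tail p-value."""
    n = len(neighbour_alleles)
    if n == 0:
        return []  # no neighbours, nothing to rank
    rows = []
    for allele, k in Counter(neighbour_alleles).items():
        p_tail = binomial_tail(k, n, background[allele])
        rows.append({
            "allele": allele,
            "vote": k / n,                                   # ranking statistic
            "enrichment": -math.log10(max(p_tail, 1e-300)),  # -log10 tail p
            "n_votes": k,
            "binder": p_tail <= alpha,                       # non-binder filter
        })
    rows.sort(key=lambda r: r["vote"], reverse=True)
    return rows


# Toy neighbourhood: 8 of 10 neighbours are presented by allele "A".
calls = rank_alleles(["A"] * 8 + ["B"] * 2, {"A": 0.2, "B": 0.5})
```

In this toy case "A" is enriched well beyond its 20% panel share and is flagged, while "B" gets fewer votes than its 50% share would predict and is not.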
- is_presented(peptide, cls=None, alpha=0.05)[source]#
Overall presentation: does any panel allele present this peptide?
- scan_protein(protein, cls='mhc1', alleles='all', lengths=None, alpha=0.05, top=3, correction=None)[source]#
Slide all binding-length windows over protein and return presented peptides.
Returns [(position, peptide, [Restriction, ...]), ...] for windows with >= 1 binder.
correction controls multiple testing over the (window, allele) presentation tests in the scan (appendix §5): None (default) keeps the per-window per-allele alpha; "bonferroni" controls the family-wise error rate (threshold alpha/m); "bh" controls the Benjamini-Hochberg false-discovery rate. m is the number of voted (window, allele) tests. The vote tail p-value is 10**(-enrichment); corrected calls replace the per-window binder flag.
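The three correction modes can be sketched over a flat list of enrichment scores, one per voted (window, allele) test. The function name and the flat-list interface are assumptions; the p-value conversion 10**(-enrichment) and the alpha/m Bonferroni threshold come from the description above, and the "bh" branch is the standard Benjamini-Hochberg step-up rule.

```python
def corrected_calls(enrichments, alpha=0.05, correction=None):
    """Return one binder flag per test under the chosen correction (sketch)."""
    pvalues = [10 ** (-e) for e in enrichments]   # vote tail p-values
    m = len(pvalues)
    if m == 0:
        return []
    if correction is None:
        return [p <= alpha for p in pvalues]
    if correction == "bonferroni":
        return [p <= alpha / m for p in pvalues]
    if correction == "bh":
        order = sorted(range(m), key=lambda i: pvalues[i])
        cutoff = 0
        # Step-up: largest rank whose p-value is under alpha * rank / m.
        for rank, i in enumerate(order, start=1):
            if pvalues[i] <= alpha * rank / m:
                cutoff = rank
        calls = [False] * m
        for rank, i in enumerate(order, start=1):
            calls[i] = rank <= cutoff
        return calls
    raise ValueError(f"unknown correction: {correction!r}")


scores = [3.0, 1.7, 1.5, 0.5]
uncorrected = corrected_calls(scores)
bonferroni = corrected_calls(scores, correction="bonferroni")
bh = corrected_calls(scores, correction="bh")
```

On these four toy scores Bonferroni keeps only the strongest test, while the uncorrected and BH calls agree.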
- decompose(peptide, cls=None, allele=None)[source]#
Split peptide into anchor and TCR-facing parts, each masked with X.
tcr_facing: anchors -> X (the recognition readout). presentation: TCR-facing -> X (the anchor readout). allele is accepted for forward-compat (allele-specific learned anchors, Phase 1); v0 uses class-default anchor positions.
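The two masked readouts are complementary, which a minimal sketch makes concrete. The function name is hypothetical and the anchors are passed in explicitly as 0-based indices, rather than looked up from the class defaults.

```python
def decompose_sketch(peptide: str, anchors: tuple) -> tuple:
    """Return (tcr_facing, presentation) with the complementary parts masked."""
    anchor_set = set(anchors)
    # tcr_facing: anchor residues hidden, TCR-facing residues kept.
    tcr_facing = "".join("X" if i in anchor_set else aa for i, aa in enumerate(peptide))
    # presentation: anchor residues kept, everything else hidden.
    presentation = "".join(aa if i in anchor_set else "X" for i, aa in enumerate(peptide))
    return tcr_facing, presentation


# Class-I default anchors P2 and PΩ as 0-based indices for an 8-mer.
tcr_facing, presentation = decompose_sketch("SIINFEKL", (1, 7))
```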
- anchor_model(cls='mhc1', h=2.0, prior_strength=10.0, anchors=None, learn_weights=True, prune_dpi=False, weights='learned')[source]#
Anchor-factored presentation model with cross-allele kernel-shrinkage diffusion.
See mhcmatch.diffusion.AnchorModel. The diffusion rescues rare alleles by borrowing anchor preferences from groove-similar frequent ones, with a bounded prior strength so a large neighbour cannot swamp a rare allele’s own peptides.
mhcmatch.search module#
Large-scale peptide similarity search over big peptide sets / proteomes.
Two notions of “similar”, both via the seqtree C++ KmerIndex seed-and-gather:
- mode="tcr" – anchor-masked TCR-facing homology: similar T-cell recognition profile (the basis for cross-reactivity / molecular mimicry).
- mode="mhc" – anchored presentation signature: likely presented by the same MHC.
For neoantigen mimicry with per-allele presentation-aware E-values, use find_mimics()
(re-exported from seqtree). See appendix/mhcmatch.tex §5.
mhcmatch.proteome module#
Near-exact source-peptide lookup against a reference proteome.
Given a query peptide (e.g. a neoantigen), find the nearly-exact self peptide it derives from and
its parent protein / position via full-sequence (unmasked) <= max_subs search over all
windows of the proteome of the query’s length – using the seqtree Hamming fast path. This is a
distinct mode from the anchor-masked TCR-facing homology and the presentation-signature searches.
See appendix/mhcmatch.tex §5 (near-exact source identification).
- mhcmatch.proteome.read_fasta(path)[source]#
Returns {name: sequence} from an (optionally gzipped) FASTA; name = first whitespace-delimited token of the header line.
- class mhcmatch.proteome.SourceHit(protein: 'str', position: 'int', ref_peptide: 'str', n_subs: 'int', mutations: 'tuple')[source]#
Bases: object
- Parameters:
protein (str)
position (int)
ref_peptide (str)
n_subs (int)
mutations (tuple)
- protein: str#
- position: int#
- ref_peptide: str#
- n_subs: int#
- mutations: tuple#
- class mhcmatch.proteome.Proteome(seqs)[source]#
Bases: object
A reference proteome with lazily-built per-length window indices.
- find_source(peptide, max_subs=1, exclude_exact=False)[source]#
Self peptides within max_subs substitutions of peptide, nearest first.
Returns [SourceHit, ...]. exclude_exact=True drops perfect (0-mismatch) matches – useful to find the wild-type a mutated neoantigen derives from when the query is itself a self peptide.
mhcmatch.pseudoseq module#
MHC pseudosequence allele-similarity & cross-allele diffusion.
Each allele is a 34-residue groove pseudosequence (NetMHCpan-style; vendored in
data/{mhci,mhcii}_pseudo.fa). Allele similarity is an anchor-factored kernel over these
positions: K_j(a,b) = exp(-d_j(a,b)/h) where d_j is a position-weighted Hamming distance and
the per-anchor weights w_j say which groove residues govern peptide anchor j (e.g. MHC-I P2
vs PΩ). learn_anchor_weights() learns w_j from data (mutual information between a groove
position and the allele’s anchor-residue choice) – the “feature importance” of each pocket.
Kernel-weighted shrinkage (Pseudoseq.shrink()) borrows presented-peptide statistics from
similar alleles to rescue rare ones, lifting the seqtree limitation “distinct alleles are distinct
nulls”. See appendix/mhcmatch.tex §4.
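The kernel K_j(a,b) = exp(-d_j(a,b)/h) with a position-weighted Hamming distance can be written out directly. This is a minimal sketch: the function names are hypothetical, the toy pseudosequences are 4 residues long rather than 34, and it covers only the Hamming distance from the formula above, not the metric='blosum' option of Pseudoseq.

```python
import math


def weighted_hamming(a: str, b: str, weights) -> float:
    """d_j(a, b): sum of the weights of the positions where a and b differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def groove_kernel(a: str, b: str, weights, h: float = 2.0) -> float:
    """K_j(a, b) = exp(-d_j(a, b) / h); equals 1.0 for identical grooves."""
    return math.exp(-weighted_hamming(a, b, weights) / h)


# Toy grooves differing only at the last position, which carries weight 2.
k_same = groove_kernel("YFAM", "YFAM", [1, 1, 1, 2])
k_diff = groove_kernel("YFAM", "YFAV", [1, 1, 1, 2], h=2.0)
```

A larger bandwidth h flattens the kernel, so distant alleles still contribute; a smaller one concentrates the weight on near-identical grooves.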
- mhcmatch.pseudoseq.normalize_allele(a)[source]#
pmhc allele name -> pseudosequence-FASTA key.
Drops the * ('HLA-A*02:01' -> 'HLA-A02:01') and repairs the mouse H-2 dash (pmhc 'H-2Kb' -> FASTA 'H-2-Kb').
- Parameters:
a (str)
- Return type:
str
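The two documented rewrites can be reproduced with a short sketch. It covers only the two examples given above; the function name is hypothetical, and the real normalizer may handle further cases not described here.

```python
def normalize_allele_sketch(a: str) -> str:
    """Apply the two documented rewrites: drop '*', repair the H-2 dash."""
    a = a.replace("*", "")
    # Mouse alleles: pmhc writes 'H-2Kb', the FASTA key is 'H-2-Kb'.
    if a.startswith("H-2") and not a.startswith("H-2-"):
        a = "H-2-" + a[3:]
    return a


human_key = normalize_allele_sketch("HLA-A*02:01")
mouse_key = normalize_allele_sketch("H-2Kb")
```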
- mhcmatch.pseudoseq.class2_key(mhc_a, mhc_b='')[source]#
pmhc class-II allele -> pseudosequence-FASTA key (locus-aware).
DR (the DRA chain is monomorphic) is keyed by the beta chain alone, e.g. 'HLA-DRB1*01:01' -> 'DRB1_0101'. DP/DQ are keyed by the alpha-beta pair, e.g. ('HLA-DPA1*01:03', 'HLA-DPB1*04:01') -> 'HLA-DPA10103-DPB10401'. With no beta chain the input is returned unchanged (mouse H-2 and fallbacks).
- Parameters:
mhc_a (str)
mhc_b (str)
- Return type:
str
- mhcmatch.pseudoseq.resolve_allele(name, cls)[source]#
Resolve a user-typed allele name to a pseudosequence key for cls.
Returns (key, exact). exact=True when name (after normalize_allele()) is a known key; otherwise the closest key by name is returned with exact=False: a missing HLA- prefix is repaired, and a too-short name (e.g. two-field 'HLA-A02:01') is completed by prefix to its first matching key. Returns (None, False) if nothing matches. Serotype names ('HLA-A2') are not expanded. Lets callers accept messy input ('A*02:01', 'HLA-A0201') and report when a requested allele is unknown rather than silently dropping it.
- Parameters:
name (str)
cls (str)
- mhcmatch.pseudoseq.load_pseudo(cls)[source]#
allele-id -> 34-mer for the bundled pseudosequence FASTA of a class.
- Parameters:
cls (str)
- Return type:
dict
- mhcmatch.pseudoseq.mutual_information(xs, ys)[source]#
MI(X;Y) in bits for two aligned categorical sequences.
- Return type:
float
- mhcmatch.pseudoseq.learn_anchor_weights(pseudo_seqs, anchor_residue, prune_dpi=False, tol=0.0)[source]#
Per-position relevance w[p] = MI(groove position p residue ; anchor residue) across alleles, normalized to mean 1.
anchor_residue: {allele: residue} (e.g. the modal residue at one peptide anchor for that allele). Positions that discriminate the anchor get more weight.
Raw MI is inflated by linkage between groove positions (they co-vary across alleles), so many positions look relevant and the per-pocket profile is smeared. With prune_dpi=True an ARACNE data-processing-inequality prune removes indirect links: position p’s edge to the pocket is dropped if some other position q is more informative about the pocket and about p (I(p;pocket) <= min(I(q;pocket), I(p;q))), leaving the direct pocket positions sparse and distinct.
- Parameters:
pseudo_seqs (dict)
anchor_residue (dict)
prune_dpi (bool)
tol (float)
- Return type:
list
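The unpruned weighting (plug-in mutual information per groove position, normalized to mean 1) might be sketched as follows. The function names are hypothetical, the DPI prune is not implemented, and falling back to uniform weights when no position is informative is an assumption rather than documented behaviour. The toy pseudosequences are 2 residues long instead of 34.

```python
import math
from collections import Counter


def mutual_information_bits(xs, ys) -> float:
    """Plug-in MI(X;Y) in bits for two aligned categorical sequences."""
    n = len(xs)
    if n == 0:
        return 0.0
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def learn_weights_sketch(pseudo_seqs: dict, anchor_residue: dict) -> list:
    """w[p] = MI(groove position p ; anchor residue), rescaled to mean 1."""
    alleles = [a for a in pseudo_seqs if a in anchor_residue]
    if not alleles:
        return []
    width = len(pseudo_seqs[alleles[0]])
    ys = [anchor_residue[a] for a in alleles]
    raw = [
        mutual_information_bits([pseudo_seqs[a][p] for a in alleles], ys)
        for p in range(width)
    ]
    mean = sum(raw) / width
    # Assumption: with no informative position, fall back to uniform weights.
    return [r / mean for r in raw] if mean > 0 else [1.0] * width


toy_grooves = {"a1": "AK", "a2": "AK", "a3": "BK", "a4": "BK"}
toy_anchor = {"a1": "L", "a2": "L", "a3": "V", "a4": "V"}
weights = learn_weights_sketch(toy_grooves, toy_anchor)
```

Position 0 fully determines the anchor residue and takes all the weight; position 1 is constant across alleles and gets none.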
- mhcmatch.pseudoseq.load_structural_weights(cls)[source]#
Per-anchor structural pocket weights from the vendored structural_pockets_<cls>.tsv (contact frequency of each groove position with each peptide anchor, over pMHC structures; see bench/structural_pockets.py). Returns {anchor:int -> [34 weights]} normalized to mean 1, or {} if the file is absent. A structural alternative/prior to learn_anchor_weights().
- Parameters:
cls (str)
- Return type:
dict
- class mhcmatch.pseudoseq.Pseudoseq(cls, h=2.0, weights=None, metric='blosum')[source]#
Bases: object
Allele-similarity kernel and diffusion over groove pseudosequences for one MHC class.
- neighbors(allele, candidates=None, anchor=None, top=10, min_k=0.0)[source]#
Returns [(allele, kernel), ...]: the alleles most groove-similar to allele (self excluded).
- cluster(alleles, anchor=None, threshold=0.5)[source]#
Single-linkage clusters: merge alleles with kernel >= threshold. O(n^2); use on a panel (~hundreds of alleles), not the full 4k-allele set.
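Single-linkage clustering at a kernel threshold amounts to taking connected components of the "kernel >= threshold" graph. The sketch below does this with a small union-find over all pairs, which is where the O(n^2) cost comes from. The function name, the kernel callable, and the toy similarity table are all assumptions made for illustration.

```python
def single_linkage(alleles, kernel, threshold=0.5):
    """Group alleles connected by any chain of pairs with kernel >= threshold."""
    parent = {a: a for a in alleles}

    def find(a):
        # Walk to the root, halving the path as we go.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    items = list(alleles)
    for i, a in enumerate(items):          # all pairs: O(n^2) kernel calls
        for b in items[i + 1:]:
            if kernel(a, b) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for a in items:
        clusters.setdefault(find(a), []).append(a)
    return list(clusters.values())


# Toy similarities: A~B and B~C pass the threshold, A~C does not, D is isolated.
toy_sim = {frozenset("AB"): 0.9, frozenset("BC"): 0.6, frozenset("AC"): 0.1}
groups = single_linkage("ABCD", lambda a, b: toy_sim.get(frozenset((a, b)), 0.0))
```

A and C end up in the same cluster even though their own kernel is low, because B links them; that chaining is characteristic of single linkage.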
- shrink(prefs, allele, anchor=None, candidates=None, prior_strength=None)[source]#
Kernel-weighted empirical-Bayes pooling of a per-anchor residue distribution.
prefs: {allele: Counter(residue -> count)} for one anchor. Returns the shrunk probability dict for allele.
With prior_strength=None (default) this is the counts-weighted form (n_a π_a + Σ_b K_ab n_b π_b) / (n_a + Σ_b K_ab n_b) with limits h -> 0 (raw per-allele) and h -> ∞ (global pool). With prior_strength=τ it uses the fixed-concentration form (n_a π_a + τ m_a) / (n_a + τ) where m_a is the kernel-weighted neighbour mean – a bounded prior that prevents one large neighbour from swamping a rare allele’s own peptides and self-adapts to n_a (appendix §4, Prop. on bias–variance). The latter is the recommended default for the forward scorer.
- Return type:
dict
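Both pooling forms can be compared side by side in a standalone sketch. The assumptions are worth stating: the function name and the kernel callable are hypothetical, and m_a is taken to be the kernel-weighted mean of the neighbours' residue distributions, Σ_b K_ab π_b / Σ_b K_ab, which is one reading of "kernel-weighted neighbour mean" and may differ from the library's definition.

```python
from collections import Counter


def shrink_sketch(prefs, allele, kernel, prior_strength=None):
    """Pool one anchor's residue counts across alleles (illustrative only)."""
    own = prefs.get(allele, Counter())
    n_a = sum(own.values())
    pooled_counts, pooled_mass = Counter(), 0.0   # Σ_b K_ab n_b π_b and Σ_b K_ab n_b
    mean_num, mean_den = Counter(), 0.0           # Σ_b K_ab π_b and Σ_b K_ab
    for b, counts in prefs.items():
        n_b = sum(counts.values())
        if b == allele or n_b == 0:
            continue
        k = kernel(allele, b)
        pooled_mass += k * n_b
        mean_den += k
        for res, c in counts.items():
            pooled_counts[res] += k * c
            mean_num[res] += k * c / n_b
    residues = set(own) | set(pooled_counts)
    if prior_strength is None:
        # Counts-weighted form: a large neighbour contributes in proportion to n_b.
        total = n_a + pooled_mass
        return {r: (own[r] + pooled_counts[r]) / total for r in residues} if total else {}
    if mean_den == 0:
        # No usable neighbours: fall back to the raw per-allele frequencies.
        return {r: own[r] / n_a for r in residues} if n_a else {}
    total = n_a + prior_strength
    if total == 0:
        return {}
    # Fixed-concentration form: the neighbours act as tau pseudo-observations.
    return {r: (own[r] + prior_strength * mean_num[r] / mean_den) / total for r in residues}


toy_prefs = {"rare": Counter({"L": 2}), "big": Counter({"M": 80, "L": 20})}
half = lambda a, b: 0.5   # toy kernel: every pair has similarity 0.5
counts_form = shrink_sketch(toy_prefs, "rare", half)
fixed_form = shrink_sketch(toy_prefs, "rare", half, prior_strength=10.0)
```

In the toy case the rare allele has two peptides, both with L. The counts-weighted form pulls its L frequency down to 12/52, dominated by the large neighbour, while the fixed-concentration form with τ = 10 keeps it at 1/3.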
mhcmatch.diffusion module#
Anchor-factored presentation scoring with cross-allele kernel-shrinkage diffusion.
A per-allele anchor log-odds predictor – a small PWM over the anchor positions (MHC-I P2/PΩ) –
whose per-allele anchor residue distributions are smoothed toward groove-similar alleles via
mhcmatch.Pseudoseq. With raw=True (or bandwidth h -> 0) there is no borrowing and a
rare allele scores off its own few peptides; with diffusion on, it borrows from frequent
groove-neighbours, rescuing rare alleles. This is the forward per-allele E-value’s data-rescued null
of appendix/mhcmatch.tex §4.
- class mhcmatch.diffusion.AnchorModel(store, cls='mhc1', anchors=None, h=2.0, prior_strength=10.0, learn_weights=True, prune_dpi=False, weights='learned', blend_alpha=0.5)[source]#
Bases: object
Per-allele anchor presentation model with optional cross-allele diffusion.
Built from a mhcmatch.Store. anchors are 1-based positions (default MHC-I P2/PΩ). Per-anchor groove-position weights are learned by mutual information unless learn_weights is False; the kernel bandwidth h controls how much rare alleles borrow.
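An anchor log-odds score of the kind described here (a small PWM over the anchor positions, compared with a background) could look like the sketch below. It is not the AnchorModel implementation: the function name, the natural-log base, the probability floor for unseen residues, and the dict layout of the per-anchor preferences are all assumptions, and the preferences are supplied directly rather than diffused from a Store.

```python
import math


def anchor_log_odds(peptide, anchors, allele_prefs, background, floor=1e-3):
    """Sum over anchors of log(P(residue | allele, anchor) / P(residue))."""
    score = 0.0
    for pos in anchors:
        # 1-based position; negatives count from the C-terminus.
        idx = pos - 1 if pos > 0 else len(peptide) + pos
        res = peptide[idx]
        p = max(allele_prefs[pos].get(res, 0.0), floor)   # floor avoids log(0)
        q = max(background.get(res, 0.0), floor)
        score += math.log(p / q)
    return score


# Toy allele preferring L/M at P2 and V/L at PΩ, against a uniform background.
toy_prefs = {2: {"L": 0.7, "M": 0.3}, -1: {"V": 0.6, "L": 0.4}}
uniform = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
good = anchor_log_odds("SLYNTVATL", (2, -1), toy_prefs, uniform)
bad = anchor_log_odds("SKYNTVATK", (2, -1), toy_prefs, uniform)
```

A positive score means the anchor residues are more typical of the allele than of the background, which matches the anchor_score > 0 plausibility gate used by Store.restriction.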
mhcmatch.logo module#
Per-allele motif logos (information content) + length distributions.
motif() returns the numeric logo (PWM, per-position bits, length histogram) – pure-Python,
always available. render() draws it with logomaker (optional [logo] extra). MHC-I
logos use peptides of a fixed length (default the modal length); MHC-II uses register-anchored
9-mer cores. See appendix/mhcmatch.tex §6.
- mhcmatch.logo.motif(store, allele, cls, length=None)[source]#
Logo data for allele’s presented peptides.
Returns {allele, cls, width, n, pwm, bits, length_hist} where pwm[i] is a residue -> freq dict (sums to 1), bits[i] the information content (log2(20) - entropy) in [0, log2(20)], and length_hist a length -> count dict over all the allele’s peptides.
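The numeric logo for the MHC-I case can be computed from a plain list of peptides. This sketch takes the peptides directly instead of a Store and allele, omits the allele and cls fields, and skips the MHC-II core registration; the function name is hypothetical, and n is assumed to count the peptides of the selected length.

```python
import math
from collections import Counter


def motif_sketch(peptides, length=None):
    """PWM and per-position information content for fixed-length peptides."""
    length_hist = Counter(len(p) for p in peptides)
    if length is None and length_hist:
        length = length_hist.most_common(1)[0][0]   # modal length
    fixed = [p for p in peptides if len(p) == length]
    pwm, bits = [], []
    if fixed:
        for i in range(length):
            counts = Counter(p[i] for p in fixed)
            freqs = {aa: c / len(fixed) for aa, c in counts.items()}
            entropy = -sum(f * math.log2(f) for f in freqs.values())
            pwm.append(freqs)
            bits.append(math.log2(20) - entropy)    # information content
    return {"width": length, "n": len(fixed), "pwm": pwm,
            "bits": bits, "length_hist": dict(length_hist)}


# Three 9-mers and one 10-mer; P2 is always L among the 9-mers.
logo = motif_sketch(["ALAAAAAAV", "ALCCCCCCV", "GLDDDDDDL", "GMEEEEEEEL"])
```

A fully conserved position such as P2 here reaches the maximum of log2(20), about 4.32 bits, while a position with varied residues scores lower.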