tcren package#
Structure I/O#
tcren.structure.model module#
Lightweight structure data model.
These dataclasses wrap the parsed contents of a PDB/mmCIF file in the shape the rest of
the pipeline needs: per-chain residue lists carrying both a 0-based sequential index
(matching the legacy mir residue.index) and the original author numbering, plus
heavy-atom coordinates for contact computation.
- class tcren.structure.model.Atom(name, element, coord)[source]#
Bases: object
A single (heavy) atom.
- Parameters:
name (str)
element (str)
coord (ndarray)
- name: str#
- element: str#
- coord: ndarray#
- class tcren.structure.model.Residue(seq_index, pdb_index, insertion_code, aa, resname, atoms)[source]#
Bases: object
A polymer residue.
- Variables:
seq_index (int) – 0-based sequential index over the chain’s resolved polymer residues (the legacy residue.index); independent of author numbering gaps.
pdb_index (int) – Author residue number (residue.index.pdb).
insertion_code (str) – Author insertion code ('' when absent).
aa (str) – One-letter amino-acid code ('X' for unknown).
resname (str) – Three-letter residue name (HIS, MSE, …).
atoms (tuple[tcren.structure.model.Atom, ...]) – Heavy atoms of the residue.
- Parameters:
seq_index (int)
pdb_index (int)
insertion_code (str)
aa (str)
resname (str)
atoms (tuple[Atom, ...])
- seq_index: int#
- pdb_index: int#
- insertion_code: str#
- aa: str#
- resname: str#
- property ca: ndarray | None#
Cα coordinate, or None if the residue has no Cα atom.
- property cb: ndarray | None#
Cβ coordinate, or None if the residue has no Cβ atom (e.g. glycine).
- property cb_or_ca: ndarray | None#
Cβ coordinate, falling back to Cα (glycine / missing Cβ); None if neither.
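The fallback order of the three coordinate properties can be illustrated with a minimal sketch. `SketchAtom` and `SketchResidue` are hypothetical stand-ins (plain tuples instead of ndarray coordinates), not the tcren classes, and the atom names "CA"/"CB" are assumed to follow the usual PDB convention.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Coord = Tuple[float, float, float]


@dataclass(frozen=True)
class SketchAtom:
    """Hypothetical stand-in for Atom, with a tuple coordinate."""
    name: str
    coord: Coord


@dataclass(frozen=True)
class SketchResidue:
    """Hypothetical stand-in for Residue; only the atoms are modelled."""
    atoms: Tuple[SketchAtom, ...]

    def _find(self, atom_name: str) -> Optional[Coord]:
        # Return the first atom with the requested name, or None.
        for atom in self.atoms:
            if atom.name == atom_name:
                return atom.coord
        return None

    @property
    def ca(self) -> Optional[Coord]:
        return self._find("CA")

    @property
    def cb(self) -> Optional[Coord]:
        return self._find("CB")

    @property
    def cb_or_ca(self) -> Optional[Coord]:
        # Prefer the C-beta; fall back to the C-alpha (glycine, missing C-beta).
        cb = self.cb
        return cb if cb is not None else self.ca
```

A glycine-like residue with only a Cα therefore reports `cb` as None while `cb_or_ca` still yields a usable coordinate.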
- class tcren.structure.model.Chain(chain_id, residues, chain_type=None, chain_supertype=None, allele_info=None, regions=<factory>)[source]#
Bases: object
A polymer chain and its annotations.
- Parameters:
chain_id (str)
residues (list[Residue])
chain_type (str | None)
chain_supertype (str | None)
allele_info (str | None)
regions (list[RegionMarkup])
- chain_id: str#
- chain_type: str | None#
- chain_supertype: str | None#
- allele_info: str | None#
- regions: list[RegionMarkup]#
- class tcren.structure.model.RegionMarkup(region_type, start_seq_index, end_seq_index, sequence, residues)[source]#
Bases: object
An annotated region (CDR/FR for TCR, groove regions for MHC).
- Parameters:
region_type (str)
start_seq_index (int)
end_seq_index (int)
sequence (str)
residues (list[Residue])
- region_type: str#
- start_seq_index: int#
- end_seq_index: int#
- sequence: str#
- class tcren.structure.model.Structure(pdb_id, chains, complex_species=None, cell_type=None)[source]#
Bases: object
A parsed complex: a set of annotated chains.
- Parameters:
pdb_id (str)
chains (list[Chain])
complex_species (str | None)
cell_type (str | None)
- pdb_id: str#
- complex_species: str | None#
- cell_type: str | None#
tcren.structure.io module#
Parse PDB / mmCIF files into the tcren.structure.model data model.
Accepts plain .pdb/.ent/.cif/.mmcif files, their gzip-compressed forms
(.pdb.gz/.cif.gz …), and — for batches — directories or .tar/.tar.gz archives
of any of those (see iter_structures()). Structure identifiers are resolved from the file
name by structure_id_from_path().
- tcren.structure.io.is_structure_file(name)[source]#
True if name is an (optionally gzipped) PDB/mmCIF structure file.
- Parameters:
name (str | Path)
- Return type:
bool
- tcren.structure.io.structure_id_from_path(path)[source]#
Resolve a structure identifier from a file name.
Strips a trailing .gz and the structure extension, then takes the part before the first _ (so 4x5w_renumbered.cif, 1ao7.pdb.gz and 6uk4_TCRpMHCmodels.pdb all resolve to their PDB id).
- Parameters:
path (str | Path)
- Return type:
str
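The naming rule can be sketched in a few lines. This is an illustrative re-implementation of the documented behaviour, not the library code: `resolve_id` and `STRUCTURE_EXTS` are hypothetical names, and the extension list is assumed from the formats named in the module description.

```python
from pathlib import Path

# Assumed extension list, taken from the formats the module description names.
STRUCTURE_EXTS = (".pdb", ".ent", ".cif", ".mmcif")


def resolve_id(path):
    """Hypothetical sketch of the documented identifier rule."""
    name = Path(path).name
    # 1. Strip a trailing .gz.
    if name.lower().endswith(".gz"):
        name = name[:-3]
    # 2. Strip the structure extension, if present.
    for ext in STRUCTURE_EXTS:
        if name.lower().endswith(ext):
            name = name[: -len(ext)]
            break
    # 3. Keep the part before the first underscore.
    return name.split("_", 1)[0]
```

Applied to the three file names quoted above, the sketch yields 4x5w, 1ao7 and 6uk4.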
- tcren.structure.io.parse_structure(path, pdb_id=None, model=0, keep_hydrogens=True)[source]#
Parse a structure file into a Structure.
Residues are taken in author order; only amino-acid residues (standard or modified, via the extended three→one table) are kept; waters, ions and ligands are dropped. Each kept residue receives a 0-based sequential seq_index per chain, matching the legacy mir residue.index.
- Parameters:
path (str | Path) – Path to a .pdb/.ent or .cif/.mmcif file.
pdb_id (str | None) – Structure identifier; defaults to the file stem.
model (int) – Model index to read (default 0, the first model).
keep_hydrogens (bool) – Keep hydrogen atoms (default True; the legacy mir contact definition counts hydrogens when a structure provides them).
- Returns:
The parsed Structure.
- Return type:
Structure
- tcren.structure.io.import_structure(path, pdb_id=None, model=0, keep_hydrogens=True, trim_c_gene=True, keep_c_gene=False, min_constant_score=80.0)[source]#
Parse a structure and prepare it for interface analysis.
Wraps parse_structure(), records the αβ/γδ cell type from the TCR constant region, and (by default) trims that constant region so downstream analysis works on the variable domains and the interface.
- Parameters:
path (str | Path) – as in parse_structure().
pdb_id (str | None) – as in parse_structure().
model (int) – as in parse_structure().
keep_hydrogens (bool) – as in parse_structure().
trim_c_gene (bool) – Trim the TCR constant domain (default True).
keep_c_gene (bool) – Retain the constant domain even if trim_c_gene is set. Use this for molecular-dynamics / FlexPepDock and any workflow that needs the full chain, since those depend on the presence of the C-gene.
min_constant_score (float) – Minimum constant-region alignment score to trim on.
- Returns:
The imported Structure with cell_type set.
- Return type:
Structure
- tcren.structure.io.structure_paths(src)[source]#
List structure files for src (a single file or a directory), sorted.
Recognises plain and gzipped PDB/mmCIF (.pdb, .cif.gz, …). For archives or streaming, use iter_structures().
- Parameters:
src (str | Path)
- Return type:
list[Path]
- tcren.structure.io.iter_structures(src, importer=<function import_structure>, on_error='raise', **kwargs)[source]#
Yield (pdb_id, Structure) for a file, directory, or .tar/.tar.gz archive.
Handles plain and gzipped PDB/mmCIF (.pdb/.cif/.pdb.gz/.cif.gz …); a directory is scanned for those; a tar archive is streamed member-by-member (each member materialised to a temp file so the path-based importer works unchanged). The identifier is resolved per file by structure_id_from_path().
- Parameters:
src (str | Path) – structure file, directory, or tar archive.
importer (Callable[[...], Structure]) – per-file parser: import_structure() (default, trims the C-gene) or parse_structure() (parity-pure). Extra kwargs are forwarded to it.
on_error (str) – "raise" (default) or "skip" to ignore files that fail to parse.
- Return type:
Iterator[tuple[str, Structure]]
- tcren.structure.io.pdb_lines(structure, transform=None, keep_hydrogens=True)[source]#
ATOM/TER/END record lines for structure (optionally coordinate-transformed).
One conformer per atom name per residue (drops duplicate altlocs). transform is an optional coord -> coord callable (e.g. for an oriented frame); identity if None. Author residue numbers + insertion codes are preserved.
- Parameters:
structure (Structure)
keep_hydrogens (bool)
- Return type:
list[str]
- tcren.structure.io.cif_lines(structure, transform=None, keep_hydrogens=True)[source]#
Minimal mmCIF atom_site loop for structure (optionally transformed).
Same atom selection as pdb_lines() (one conformer per atom name per residue). Only the _atom_site category is written: enough to round-trip coordinates and chain/residue identity through the Biopython MMCIF parser, which is all tcren consumes.
- Parameters:
structure (Structure)
keep_hydrogens (bool)
- Return type:
list[str]
- tcren.structure.io.write_pdb(structure, path, transform=None, keep_hydrogens=True)[source]#
Write structure to a PDB file; return the path.
A .gz suffix (foo.pdb.gz) transparently gzip-compresses the output.
- Parameters:
structure (Structure)
path (str | Path)
keep_hydrogens (bool)
- Return type:
Path
- tcren.structure.io.structure_output_path(directory, pdb_id, mmcif=False, compress=False)[source]#
Build an output path <directory>/<pdb_id>.<ext> from format flags.
.pdb by default, .cif if mmcif, with a trailing .gz if compress.
- Parameters:
directory (str | Path)
pdb_id (str)
mmcif (bool)
compress (bool)
- Return type:
Path
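How the two flags combine into an extension can be shown with a small sketch. `output_path` is a hypothetical helper that follows the rule as documented; it is not the library function.

```python
from pathlib import Path


def output_path(directory, pdb_id, mmcif=False, compress=False):
    """Hypothetical sketch of the documented extension rule."""
    ext = ".cif" if mmcif else ".pdb"
    if compress:
        ext += ".gz"  # trailing .gz marks gzip-compressed output
    return Path(directory) / f"{pdb_id}{ext}"
```

The four flag combinations give `.pdb`, `.cif`, `.pdb.gz` and `.cif.gz` respectively.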
Annotation#
tcren.annotation.arda_adapter module#
TCR chain annotation via the arda library.
Extracts each chain’s amino-acid sequence, runs arda’s AIRR annotation, and projects the
returned region coordinates (1-based, end-inclusive into the input sequence) back onto
structure residues as RegionMarkup. Region names are
mapped to the legacy mir vocabulary (FR1/CDR1/…/FR4).
- tcren.annotation.arda_adapter.annotate_chain(chain, organism)[source]#
Annotate one chain with arda; return the AIRR record if it is a TCR chain.
Mutates chain in place when arda recognises it as TRA/TRB: sets chain_type, chain_supertype ("TRAB"), allele_info and regions. Returns the arda record (for any locus) or None if arda produced no locus.
- Parameters:
organism (str)
- Return type:
dict | None
- tcren.annotation.arda_adapter.apply_records(chains, by_id)[source]#
Project a cached {chain_id: record} map onto chains in place (no arda call).
- Parameters:
by_id (dict[str, dict])
- Return type:
None
- tcren.annotation.arda_adapter.score_records(chains, by_id)[source]#
(receptor_ids, summed mmseqs2_score) from already-computed records.
- Parameters:
by_id (dict[str, dict])
- Return type:
tuple[list[str], float]
- tcren.annotation.arda_adapter.annotate_chains(chains, organism)[source]#
Annotate a batch of chains in a single arda call; apply records in place.
One mmseqs invocation for all chains (the per-call process/DB overhead dominates the actual alignment, so batching is roughly hundreds of times faster than per-chain calls). Returns a {chain_id: record} map for chains that had a sequence.
- Parameters:
organism (str)
- Return type:
dict[str, dict]
- tcren.annotation.arda_adapter.annotate_tcr_chains(structure, organism='human')[source]#
Annotate all chains; return ids recognised as antigen-receptor (TCR/BCR) chains.
- Parameters:
organism (str)
- Return type:
list[str]
- tcren.annotation.arda_adapter.annotate_tcr_chains_scored(structure, organism='human')[source]#
Annotate all chains; return (receptor_ids, summed mmseqs2_score).
The summed mmseqs2 alignment score over the receptor chains measures how well the structure’s TCR/BCR chains match this organism’s germline reference. It is the signal used to pick the correct species when annotating against human vs mouse.
- Parameters:
organism (str)
- Return type:
tuple[list[str], float]
tcren.annotation.chains module#
Chain classification: TRA/TRB (via arda), PEPTIDE, and (provisional) MHC.
Precise MHC sub-typing (MHCa/MHCb/B2M, class I/II) is added in Phase B; here MHC chains
are left with the generic type "MHC" so the TCR↔peptide scoring path is complete.
- tcren.annotation.chains.classify_chains(structure, organism='human', peptide_max_len=30, autodetect_species=True, precomputed_records=None)[source]#
Classify every chain of structure in place.
TRA/TRB are assigned from arda’s locus call; the shortest remaining chains (length ≤ peptide_max_len) become PEPTIDE; longer remaining chains are tagged "MHC".
- Parameters:
structure (Structure) – Structure to annotate (mutated in place).
organism (str) – arda organism ("human"/"mouse").
peptide_max_len (int) – Maximum residue count for a chain to be called PEPTIDE.
autodetect_species (bool) – Annotate against both supported species (human and mouse) and keep whichever gives the higher total mmseqs alignment score over the receptor chains. TCR/BCR germlines are organism-specific, so the wrong species scores measurably lower (e.g. mouse BM3.3 scores ~435 under the mouse reference vs ~197 under human); this avoids mis-typing a chain under the wrong reference. Ties keep the requested organism. Disable to force organism verbatim.
precomputed_records (dict[str, dict[str, dict]] | None) – Optional {organism: {chain_id: record}} of arda records for this structure’s chains, to reuse instead of calling arda (the batch path in annotate_structure_set() annotates the whole dataset in one mmseqs call per organism and injects the per-structure slices).
- Return type:
None
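The species auto-detection rule (higher summed score wins, ties keep the requested organism) can be sketched as follows. `pick_species` is a hypothetical helper operating on already-computed scores; it assumes nothing about how tcren obtains them.

```python
def pick_species(scores, requested):
    """Hypothetical sketch: choose the organism with the highest summed score.

    scores    -- {organism: summed alignment score over the receptor chains}
    requested -- the organism the caller asked for; it wins ties.
    """
    best = requested
    best_score = scores.get(requested, float("-inf"))
    for organism, score in scores.items():
        # Strict comparison, so an equal score never displaces the requested organism.
        if score > best_score:
            best, best_score = organism, score
    return best
```

With the BM3.3 figures quoted above (about 435 under mouse, about 197 under human), the sketch selects mouse even when human was requested.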
tcren.annotation.cgene module#
αβ vs γδ T-cell classification from the TCR constant region (C-gene).
arda annotates the variable V(D)J region but not the constant domain. When a structure
includes an ordered constant domain, aligning each chain to the TCR constant references
(TRAC/TRBC1/TRBC2 → αβ; TRGC/TRDC → γδ) identifies the chain (α/β/γ/δ) unambiguously and
therefore the cell type. This is authoritative for αβ-vs-γδ and independent of the
(occasionally ambiguous, e.g. TRAV/DV) V-gene call. Variable-domain-only chains carry no
constant region and yield no call (cell type "unknown").
- tcren.annotation.cgene.MIN_CONSTANT_SCORE = 80.0#
Minimum local-alignment score to accept a constant-domain match. The V domain alone scores ~30-45 against any constant; a real constant domain scores in the hundreds.
- class tcren.annotation.cgene.ConstantCall(chain_id, gene, chain_class, cell_type, score)[source]#
Bases: object
A TCR constant-region identification for one chain.
- Parameters:
chain_id (str)
gene (str)
chain_class (str)
cell_type (str)
score (float)
- chain_id: str#
- gene: str#
- chain_class: str#
- cell_type: str#
- score: float#
- tcren.annotation.cgene.classify_chain_constant(sequence, min_score=80.0)[source]#
Identify the constant region of a single chain sequence, if one is present.
- Parameters:
sequence (str)
min_score (float)
- Return type:
ConstantCall | None
- tcren.annotation.cgene.constant_span(sequence, min_score=80.0)[source]#
Return the (start, end) query span aligning to the best TCR constant region.
start/end are 0-based half-open indices into sequence. Returns None if no constant domain is present (score below min_score). The constant region is C-terminal, so callers trim residues with seq_index >= start.
- Parameters:
sequence (str)
min_score (float)
- Return type:
tuple[int, int] | None
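A caller-side trim based on the returned span might look like the sketch below. `trim_constant` is a hypothetical helper working on bare sequential indices; it only illustrates the `seq_index >= start` rule stated above.

```python
def trim_constant(seq_indices, span):
    """Hypothetical sketch: drop residues that fall in the constant region.

    seq_indices -- iterable of 0-based sequential residue indices
    span        -- (start, end) half-open span as described above, or None
    """
    if span is None:
        # No constant domain detected: keep the chain unchanged.
        return list(seq_indices)
    start, _end = span
    # The constant region is C-terminal, so everything from start onwards goes.
    return [i for i in seq_indices if i < start]
```

Because the constant region is C-terminal, only the span start matters for trimming; the end index is unused here.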
- tcren.annotation.cgene.classify_constants(structure, min_score=80.0)[source]#
Identify the constant region of every chain that has one.
- Parameters:
min_score (float)
- Return type:
list[ConstantCall]
- tcren.annotation.cgene.cell_type(structure, min_score=80.0)[source]#
Return "ab", "gd" or "unknown" from the constant regions present.
γδ wins if any γ/δ constant is found; otherwise αβ if any α/β constant is found; otherwise "unknown" (no ordered constant domain, e.g. variable-only chains).
- Parameters:
min_score (float)
- Return type:
str
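The precedence rule can be written out as a short sketch. `cell_type_from_classes` is a hypothetical helper, and the chain-class labels "alpha", "beta", "gamma" and "delta" are assumed for illustration; the actual values stored in ConstantCall.chain_class are not specified here.

```python
def cell_type_from_classes(chain_classes):
    """Hypothetical sketch of the precedence rule: gamma/delta beats alpha/beta."""
    classes = set(chain_classes)
    if classes & {"gamma", "delta"}:
        return "gd"
    if classes & {"alpha", "beta"}:
        return "ab"
    # No constant-region call at all, e.g. variable-domain-only chains.
    return "unknown"
```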
MHC mapping#
tcren.mhc.imgt module#
Download and parse MHC allele references (IMGT/HLA + UniProt mouse H-2 + B2M).
Produces MhcAllele records labelled with species, MHC class and chain role
(MHCa = class-I heavy or class-II alpha; MHCb = class-II beta; B2M). Human
alleles come from IMGT/HLA (hla_prot.fasta); mouse H-2 and beta-2-microglobulin come
from reviewed UniProt entries.
- class tcren.mhc.imgt.MhcAllele(allele, locus, mhc_class, chain_role, species, sequence)[source]#
Bases: object
A reference MHC allele sequence with its functional labels.
- Parameters:
allele (str)
locus (str)
mhc_class (str)
chain_role (str)
species (str)
sequence (str)
- allele: str#
- locus: str#
- mhc_class: str#
- chain_role: str#
- species: str#
- sequence: str#
- tcren.mhc.imgt.download_human(cache_dir, force=False)[source]#
Download IMGT/HLA hla_prot.fasta into cache_dir.
- Parameters:
cache_dir (Path)
force (bool)
- Return type:
Path
- tcren.mhc.imgt.download_mouse(cache_dir, force=False)[source]#
Download mouse H-2 / B2m and human B2M into cache_dir.
- Parameters:
cache_dir (Path)
force (bool)
- Return type:
tuple[Path, Path]
tcren.mhc.reference module#
Build and load the curated MHC reference shipped under database/mhc/.
The committed reference is a single FASTA (alleles.aa.fasta) whose headers encode the
metadata (allele|locus|mhc_class|chain_role|species) plus a metadata.tsv mirror.
The mmseqs search index is built on demand into a gitignored cache (mirroring arda’s
commit-FASTA / build-index-on-demand split).
- tcren.mhc.reference.build(species=('human', 'mouse'), cache_dir=PosixPath('/home/runner/work/tcren/tcren/data/mhc_cache'), out_dir=PosixPath('/home/runner/work/tcren/tcren/database/mhc'), force_download=False)[source]#
Download, curate and write the committed MHC reference.
- Parameters:
species (tuple[str, ...]) – Which species to include.
cache_dir (Path) – Where raw downloads are cached (gitignored).
out_dir (Path) – Where the curated alleles.aa.fasta + metadata.tsv are written.
force_download (bool) – Re-download even if cached files exist.
- Returns:
Path to the written alleles.aa.fasta.
- Return type:
Path
- tcren.mhc.reference.reference_fasta(out_dir=PosixPath('/home/runner/work/tcren/tcren/database/mhc'))[source]#
Path to the committed reference FASTA (raises if the reference is not built).
- Parameters:
out_dir (Path)
- Return type:
Path
- tcren.mhc.reference.reference_db(cache_dir=PosixPath('/home/runner/work/tcren/tcren/data/mhc_cache'))[source]#
Path to a compiled, pre-indexed mmseqs DB of the allele reference (built once, cached).
mmseqs easy-search otherwise rebuilds the target DB and its k-mer prefilter index from the ~28k-allele FASTA on every call. Caching createdb saves little; the dominant cost is the prefilter index, so we also run createindex once. Reusing this DB cuts a single-structure MHC search from ~4.5 s to ~0.9 s. Built into the gitignored data/mhc_cache when missing or older than the FASTA.
- Parameters:
cache_dir (Path)
- Return type:
Path
tcren.mhc.mapper module#
Map a structure’s MHC chains to allele / class / role via mmseqs.
Searches each not-yet-typed chain against the curated MHC reference and assigns the best hit’s class (MHCI/MHCII), chain role (MHCa/MHCb/B2M), locus and allele. Class is reconciled across the complex (B2M ⇒ class I; a class-II beta chain ⇒ class II).
- class tcren.mhc.mapper.MhcCall(chain_id, chain_role, mhc_class, allele, locus, species, identity, bits, qstart, qend, tstart, tend, cigar)[source]#
Bases: object
Result of mapping one chain to the MHC reference.
- Parameters:
chain_id (str)
chain_role (str)
mhc_class (str)
allele (str)
locus (str)
species (str)
identity (float)
bits (float)
qstart (int)
qend (int)
tstart (int)
tend (int)
cigar (str)
- chain_id: str#
- chain_role: str#
- mhc_class: str#
- allele: str#
- locus: str#
- species: str#
- identity: float#
- bits: float#
- qstart: int#
- qend: int#
- tstart: int#
- tend: int#
- cigar: str#
- tcren.mhc.mapper.map_mhc(structure, sensitivity=5.7)[source]#
Map the structure’s MHC chains against the curated reference.
- tcren.mhc.mapper.calls_from_hits(candidates, best, key=None)[source]#
Build reconciled MhcCalls for candidates from precomputed mmseqs hits.
key(chain) -> str maps a candidate chain to its key in best (default the chain id; a batched search uses "<struct_idx>|<chain_id>"). Lets one mmseqs search over many structures’ chains be sliced back per structure, with no per-structure mmseqs call.
- Parameters:
best (dict[str, dict])
- Return type:
list[MhcCall]
tcren.mhc.domains module#
Canonical MHC groove region definitions.
Loads the bundled mhc_canonical.json: for each "<class>|<role>" key it holds a
canonical mature chain sequence and the 0-based positions of each groove region
(HELIX_A1/HELIX_A2 for class I, HELIX_A1/HELIX_B1 for class II, and
GROOVE_FLOOR). Region boundaries follow established mature-numbering ranges for the
α1/α2 (class I) and α1/β1 (class II) groove domains. Regions are projected onto query
chains in tcren.mhc.regions.
tcren.mhc.regions module#
Project canonical MHC groove regions onto a structure’s MHC chains.
Each MHC chain is aligned (global, BLOSUM62) to the canonical chain for its class/role;
the canonical region positions are then mapped through the alignment onto the chain’s
residues, producing RegionMarkup entries (HELIX_A1,
HELIX_A2/HELIX_B1, GROOVE_FLOOR) in the same schema as the TCR region markup.
- tcren.mhc.regions.partition_chain(chain, mhc_class, chain_role)[source]#
Return groove RegionMarkups for one MHC chain (empty for B2M / unknown roles).
- Parameters:
chain (Chain)
mhc_class (str)
chain_role (str)
- Return type:
list[RegionMarkup]
- tcren.mhc.regions.partition_mhc(structure, calls)[source]#
Assign groove regions to every mapped MHC chain in the structure (in place).
- tcren.mhc.regions.annotate_mhc(structure)[source]#
Map and partition the MHC chains of an (already chain-typed) structure.
Returns the MhcCall list and, in place, sets each MHC chain’s type (MHCa/MHCb/B2M), class supertype, allele and groove regions.
- tcren.mhc.regions.annotate_mhc_batch(structures, sensitivity=5.7)[source]#
MHC-annotate many (chain-typed) structures with a SINGLE mmseqs search.
Gathers every candidate MHC chain across all structures, runs one easy_search (mmseqs parallelises internally: no Python threads, no per-structure call), then slices the hits back and applies the calls + groove partitioning to each structure in place. This is the batched equivalent of calling annotate_mhc() per structure, for dataset-scale work.
- Parameters:
structures (list[Structure])
sensitivity (float)
- Return type:
None
tcren.mhc.linker module#
Detect and split covalently linked (single-chain) peptides via MHC alignment.
Engineered single-chain pMHC constructs fuse the peptide to an MHC chain through a flexible (usually Gly/Ser-rich) linker, so the peptide is not a separate chain. Aligning each chain to the MHC reference reveals this: the MHC domain aligns, leaving an unaligned terminal segment that — after stripping the linker — is the peptide. This module provides the alignment check and a splitter that lifts such peptides into their own chain.
No covalently linked peptides occur in the bundled TCR3D / PDB datasets (all conventional, separate-chain complexes); this is robustness for engineered and predicted structures.
- class tcren.mhc.linker.MhcAlignmentCheck(chain_id, best_ref, score, query_start, query_end, n_term_extra, c_term_extra)[source]#
Bases: object
Result of aligning a chain onto the MHC reference (a chain-identity check).
- Parameters:
chain_id (str)
best_ref (str)
score (float)
query_start (int)
query_end (int)
n_term_extra (int)
c_term_extra (int)
- chain_id: str#
- best_ref: str#
- score: float#
- query_start: int#
- query_end: int#
- n_term_extra: int#
- c_term_extra: int#
- property is_mhc: bool#
- tcren.mhc.linker.check_against_mhc(chain)[source]#
Align a chain onto the MHC reference and report coverage / terminal extensions.
- Parameters:
chain (Chain)
- Return type:
MhcAlignmentCheck
- tcren.mhc.linker.detect_linked_peptide(chain, min_len=7, max_len=25)[source]#
Return the residues of a peptide fused to an MHC chain, or None.
Looks for a peptide-length segment (after stripping an adjacent Gly/Ser linker) at the N- or C-terminus of a chain whose remainder aligns to the MHC reference.
- Parameters:
- Parameters:
chain (Chain)
min_len (int)
max_len (int)
- Return type:
list | None
- tcren.mhc.linker.split_linked_peptides(structure, peptide_chain_id='p')[source]#
Split covalently linked peptides off their MHC chains, in place.
For each chain carrying a fused peptide, the peptide residues are removed and added as a new PEPTIDE chain. Returns the list of chain ids that were split (empty if none).
- Parameters:
structure (Structure)
peptide_chain_id (str)
- Return type:
list[str]
tcren.mhc.pseudo module#
MHC pseudosequence (MPS) annotation.
NetMHCpan defines a 34-residue “pseudosequence” per allele — the polymorphic groove positions
that contact the peptide (class I: α1/α2 of MHCa; class II: α1 of MHCa + β1 of MHCb). The
committed mhci_pseudo.fa / mhcii_pseudo.fa (see scripts/build_pseudo_fasta.py) hold
the unique pseudosequences.
annotate_pseudo() adds an MPS region to a chain-typed + MHC-annotated structure, on
demand. The 34 pseudo positions are scattered along the chain (not a contiguous motif), so an
mmseqs/local search can’t find them — there is no shared k-mer to seed on. Instead we thread each
candidate 34-mer through the chain with a fitting alignment (gaps in the chain are free, the
pseudosequence may not gap), which recovers the positions because NetMHCpan lists them N→C. The
best-scoring pseudosequence is chosen (one hit), and its identically-matched residues are marked —
across MHCa only for class I, split across MHCa+MHCb for class II, never β2m. Scoring all ~4k
pseudosequences this way is ~0.1 s, so no prebuilt index is needed.
- tcren.mhc.pseudo.annotate_pseudo(structure)[source]#
Add an MPS region to each groove chain from the best-matching pseudosequence.
structure must already be chain-typed + MHC-annotated. Returns the chosen pseudosequence id (or None if there is no MHC). The best hit is selected once over the class groove sequence (MHCa for class I; MHCa+MHCb for class II) and its residues marked per chain.
- Parameters:
structure (Structure)
- Return type:
str | None
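The threading step described in the module notes (the pseudosequence may not gap, gaps in the chain are free, order is N→C) can be sketched as a small dynamic program. This is an illustration under stated assumptions, not the tcren implementation: `thread` is a hypothetical name, and scoring by plain identity count is assumed, since the actual scoring scheme is not documented here.

```python
def thread(pseudo, chain):
    """Hypothetical sketch: place every pseudo residue on the chain, in order.

    Returns one chain position per pseudo residue, maximising the number of
    identical residues, or None if the chain is shorter than the pseudosequence.
    """
    n, m = len(pseudo), len(chain)
    if n == 0 or n > m:
        return None
    neg = float("-inf")
    # score[i][j]: best identity count with pseudo[i] placed at chain[j].
    score = [[neg] * m for _ in range(n)]
    back = [[-1] * m for _ in range(n)]
    for i in range(n):
        # Running maximum of the previous row over columns strictly left of j.
        best_prev, best_prev_j = (0, -1) if i == 0 else (neg, -1)
        for j in range(m):
            if i > 0 and j > 0 and score[i - 1][j - 1] > best_prev:
                best_prev, best_prev_j = score[i - 1][j - 1], j - 1
            if best_prev > neg:
                score[i][j] = best_prev + (1 if pseudo[i] == chain[j] else 0)
                back[i][j] = best_prev_j
    # Trace back from the best placement of the last pseudo residue.
    j = max(range(m), key=lambda k: score[n - 1][k])
    positions = []
    for i in range(n - 1, -1, -1):
        positions.append(j)
        j = back[i][j]
    return positions[::-1]
```

The running maximum keeps the cost at O(n·m) per candidate, which is consistent with scoring a few thousand 34-mers against one chain without a prebuilt index.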
Contacts#
tcren.contacts.geometry module#
Atom-level contact and Cα-distance computation.
Ports the legacy mir compute-pdb-contacts / compute-pdb-geom steps using a
scipy.spatial.cKDTree for the all-atom neighbour search.
- tcren.contacts.geometry.all_atom_contacts(structure, cutoff=5.0)[source]#
Closest inter-chain atom contact for each residue pair within cutoff Å.
For every pair of residues on different chains that have at least one heavy-atom pair within cutoff (inclusive, matching the legacy dist <= 5), the row with the minimum atom–atom distance is kept.
- Returns:
A table with the columns chain.id.from, residue.index.from, chain.id.to, residue.index.to, residue.aa.from, residue.aa.to, atom.from, atom.to, dist. Each unordered residue pair appears once, in (chain.id, residue.index) lexicographic order.
- Parameters:
structure (Structure)
cutoff (float)
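The selection rule (closest atom pair per inter-chain residue pair, inclusive cutoff, one row per unordered pair) can be illustrated with a brute-force sketch. The module itself is described as using a scipy.spatial.cKDTree for the neighbour search; the quadratic loop below is swapped in only to keep the example dependency-free. `closest_contacts` and its tuple-based input are hypothetical.

```python
import math
from itertools import combinations


def closest_contacts(atoms, cutoff=5.0):
    """Hypothetical brute-force sketch of the closest-contact rule.

    atoms -- list of (chain_id, residue_index, atom_name, (x, y, z)) tuples
    Returns rows (chain_from, index_from, chain_to, index_to,
    atom_from, atom_to, dist), one per unordered residue pair.
    """
    best = {}
    for a, b in combinations(atoms, 2):
        if a[0] == b[0]:
            continue  # only inter-chain pairs count
        dist = math.dist(a[3], b[3])
        if dist > cutoff:
            continue  # inclusive cutoff: dist == cutoff is kept
        # Order the pair by (chain id, residue index) so each pair has one key.
        first, second = sorted((a, b), key=lambda atom: (atom[0], atom[1]))
        key = (first[0], first[1], second[0], second[1])
        if key not in best or dist < best[key][2]:
            best[key] = (first[2], second[2], dist)
    return [key + value for key, value in sorted(best.items())]
```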
- tcren.contacts.geometry.representative_atom_contacts(structure, kind='ca', cutoff=12.0)[source]#
Inter-chain residue contacts by a single representative atom per residue.
kind="ca" uses Cα (default cutoff 12 Å); kind="cb" uses Cβ with a glycine / missing-Cβ fallback to Cα (default cutoff 8 Å). Mirrors the residue-pair schema of all_atom_contacts() (atom.from/atom.to carry the representative atom kind).
- Parameters:
structure (Structure)
kind (str)
cutoff (float)
- Return type:
DataFrame
- tcren.contacts.geometry.ca_distance_matrix(structure)[source]#
Pairwise Cα–Cα distance matrix over all residues with a Cα atom.
- Returns:
(matrix, keys) where matrix[a, b] is the Cα distance and keys[a] is the (chain_id, seq_index) of row/column a.
- Parameters:
structure (Structure)
- Return type:
tuple[ndarray, list[tuple[str, int]]]
tcren.contacts.definitions module#
Flexible multi-threshold contact definition.
Beyond the legacy single 5 Å all-atom contact (the TCRen parity default, d1), this adds
two coarser residue-level layers: d2 over Cβ atoms (Cα for glycine) and d3 over Cα
atoms. The layers nest from tight side-chain proximity to backbone neighbourhood, giving the
2D maps and scoring a tunable contact model without changing the 5 Å default.
- class tcren.contacts.definitions.ContactDefinition(d1=5.0, d2=8.0, d3=12.0)[source]#
Bases: object
Three nested contact thresholds (Å).
- Variables:
d1 (float) – closest heavy-atom distance (all-atom contact).
d2 (float) – closest Cβ distance (Cα for glycine / missing Cβ).
d3 (float) – closest Cα distance.
- Parameters:
d1 (float)
d2 (float)
d3 (float)
- d1: float#
- d2: float#
- d3: float#
- tcren.contacts.definitions.multi_contacts(structure, definition=ContactDefinition(d1=5.0, d2=8.0, d3=12.0))[source]#
Stacked inter-chain residue contacts across the three layers.
Returns the union of the d1/d2/d3 residue-pair tables with a layer column ("d1"/"d2"/"d3") and the layer’s distance. A residue pair can appear in several layers; callers filter by layer as needed.
- Parameters:
structure (Structure)
definition (ContactDefinition)
- Return type:
DataFrame
tcren.contacts.table module#
Annotate and symmetrise the residue contact table.
Joins per-residue annotations (chain type, region type, region start, amino acid) onto
the raw contacts and mirrors the R rbind(contacts, swapped) symmetrisation, yielding
the fully annotated, bidirectional contact table the contact map is built from.
- tcren.contacts.table.residue_annotation(structure)[source]#
Per-residue annotation table for joining onto contacts.
Columns: chain.id, residue.index, chain.type, chain.supertype, region.type, region.start, residue.aa. region.type/region.start are null for residues without a region annotation.
- Parameters:
structure (Structure)
- Return type:
DataFrame
- tcren.contacts.table.symmetrize(contacts)[source]#
Return contacts plus their from/to-swapped mirror (R rbind semantics).
- Parameters:
contacts (DataFrame)
- Return type:
DataFrame
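The mirroring can be sketched on plain dictionaries instead of a DataFrame. `symmetrize_rows` is a hypothetical helper; it assumes, as in the column names listed elsewhere in this reference, that directional columns end in `.from` and `.to`, and that all other columns (such as `dist`) are direction-independent.

```python
def symmetrize_rows(rows):
    """Hypothetical sketch: append a from/to-swapped copy of every row."""
    mirrored = []
    for row in rows:
        swapped = dict(row)  # non-directional columns are copied unchanged
        for key, value in row.items():
            if key.endswith(".from"):
                swapped[key[: -len(".from")] + ".to"] = value
            elif key.endswith(".to"):
                swapped[key[: -len(".to")] + ".from"] = value
        mirrored.append(swapped)
    # Originals first, mirrors after, like rbind(contacts, swapped).
    return list(rows) + mirrored
```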
- tcren.contacts.table.tidy_contacts(structure, cutoff=5.0)[source]#
Symmetrised, fully annotated contact table for a structure.
Each inter-chain residue contact appears in both directions, with chain type, region type, region start and amino acid attached on both the from and to sides. This table is the input to tcren.contactmap.ContactMap.
- Parameters:
structure (Structure)
cutoff (float)
- Return type:
DataFrame
tcren.contactmap module#
Residue-level contact map and interface partitioning.
A ContactMap wraps the annotated, symmetrised contact table and exposes the
three biological interfaces (TCR↔peptide, TCR↔MHC, peptide↔MHC). The TCR↔peptide
interface is the central object for scoring and reproduces the schema of
data/contact_maps_PDB.csv once chains and regions are annotated.
- class tcren.contactmap.ContactMap(pdb_id, contacts, peptide_length=None)[source]#
Bases: object
Annotated, symmetrised residue contacts for one structure.
- Parameters:
pdb_id (str)
contacts (DataFrame)
peptide_length (int | None)
- pdb_id: str#
- contacts: DataFrame#
- peptide_length: int | None#
- classmethod from_structure(structure, cutoff=5.0)[source]#
Build a contact map from an (annotated) structure.
- Parameters:
structure (Structure)
cutoff (float)
- Return type:
ContactMap
Potentials#
tcren.potential.model module#
Pairwise residue-level statistical potentials.
A Potential is a long-form table of pairwise amino-acid energies keyed on
(residue.aa.from, residue.aa.to). The “from” side is conventionally the TCR
residue and the “to” side the antigen (peptide) residue, matching the orientation of
the legacy R pipeline. Potentials can be loaded from the two CSV layouts shipped with
the project (wide and long) and exported to a dense matrix for fast scoring.
- tcren.potential.model.AA20: tuple[str, ...] = ('L', 'F', 'I', 'M', 'V', 'W', 'Y', 'C', 'H', 'A', 'G', 'P', 'T', 'S', 'Q', 'N', 'D', 'E', 'R', 'K')#
20 standard amino acids (one-letter), TCRen ordering used in the paper.
- tcren.potential.model.AA21: tuple[str, ...] = ('A', 'I', 'L', 'V', 'R', 'H', 'K', 'C', 'M', 'S', 'T', 'D', 'E', 'N', 'Q', 'G', 'P', 'Y', 'F', 'W', '-')#
Alphabet of the alignment-matrix variant: the 20 standard amino acids plus the gap symbol (21 symbols in total).
- class tcren.potential.model.Potential(name, matrix, alphabet)[source]#
Bases: object
A pairwise amino-acid potential in long form.
- Variables:
name (str) – Identifier of the potential (e.g. "TCRen", "MJ", "Keskin").
matrix (polars.dataframe.frame.DataFrame) – Long-form table with columns residue.aa.from, residue.aa.to, value.
alphabet (tuple[str, ...]) – Amino-acid symbols present on each axis.
- Parameters:
name (str)
matrix (DataFrame)
alphabet (tuple[str, ...])
- name: str#
- matrix: DataFrame#
- alphabet: tuple[str, ...]#
- value(aa_from, aa_to)[source]#
Return the energy for an ordered residue pair.
- Parameters:
aa_from (str) – One-letter code of the “from” (TCR) residue.
aa_to (str) – One-letter code of the “to” (antigen) residue.
- Returns:
The pairwise energy.
- Raises:
KeyError – If the pair is absent from the potential.
- Return type:
float
- as_matrix()[source]#
Return a dense (n, n) matrix and an amino-acid → index map.
Rows are indexed by residue.aa.from, columns by residue.aa.to. Missing pairs are filled with nan.
- Return type:
tuple[ndarray, dict[str, int]]
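The long-to-dense conversion described for as_matrix() can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name to_dense, the (aa_from, aa_to, value) row tuples and the nested-list matrix are assumptions made to keep the example dependency-free.

```python
import math


def to_dense(rows, alphabet):
    """Build a dense n x n table from long-form (aa_from, aa_to, value) rows."""
    # Map each amino-acid symbol to its row/column index.
    index = {aa: i for i, aa in enumerate(alphabet)}
    n = len(alphabet)
    # Start from an all-NaN table so that missing pairs stay NaN.
    dense = [[math.nan] * n for _ in range(n)]
    for aa_from, aa_to, value in rows:
        dense[index[aa_from]][index[aa_to]] = value
    return dense, index


# Toy values, made up for illustration.
rows = [("L", "F", -0.5), ("F", "L", 0.25)]
dense, index = to_dense(rows, ("L", "F", "I"))
```

Rows follow the "from" axis and columns the "to" axis, so dense[index["L"]][index["F"]] holds the L→F energy while the reverse pair is stored separately.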
- to_csv(path)[source]#
Write the potential to a long-form CSV (from, to, value).
- Parameters:
path (str | Path)
- Return type:
None
- classmethod from_csv(path, name=None, value_col=None)[source]#
Load a potential from a CSV, auto-detecting wide vs long layout.
Two layouts are supported:
wide: residue.aa.from, residue.aa.to, <name> (e.g. TCRen_potential.csv with a TCRen value column).
long: residue.aa.from, residue.aa.to, potential, value (e.g. MJ_Keskin_potentials.csv); load a single named potential from it.
- Parameters:
path (str | Path) – Path to the CSV file.
name (str | None) – Which potential to select (long layout) or the name to assign (wide layout). Defaults to the value-column name (wide) and is required when a long file holds more than one potential.
value_col (str | None) – Override the value column name for the wide layout.
- Returns:
The loaded Potential.
- Return type:
Potential
tcren.potential.derive module#
Derivation of the TCRen statistical potential from observed contact maps.
This is a direct port of the R derivations in code_paper/2_TCRen_derivation.Rmd
(variant="classic") and tcren_am/tcren_am.Rmd (variant="am"). The classic
variant reproduces TCRen_potential.csv; the alignment-matrix variant reproduces
tcren_am/tcren.txt.
- tcren.potential.derive.derive_tcren(contacts, include=None, exclude=None, pseudocount=1, variant='classic', beta=44.0, drop_cys=None)[source]#
Derive a TCRen potential from a table of residue contacts.
- Parameters:
contacts (DataFrame) – Long table of TCR↔peptide contacts with at least residue.aa.from, residue.aa.to and (for filtering) pdb.id.
include (list[str] | None) – If given, keep only contacts whose pdb.id is in this list.
exclude (list[str] | None) – If given, drop contacts whose pdb.id is in this list.
pseudocount (int) – Added to every amino-acid pair count (default 1).
variant (str) – "classic" (natural-log log-odds over 20 aa, Cys dropped from the “from” axis) or "am" (log2/beta over 21 symbols including a gap, Cys retained).
beta (float) – Temperature divisor used by the "am" variant.
drop_cys (bool | None) – Override the per-variant default for dropping from == "C" rows.
- Returns:
The derived Potential. For "am" the long matrix additionally carries a count column.
- Return type:
Potential
- tcren.potential.derive.derive_tcren_loo(contacts, pdb_ids, **kwargs)[source]#
Leave-one-out TCRen: derive once per structure, excluding it each time.
- Parameters:
contacts (DataFrame) – Contact table (see derive_tcren()).
pdb_ids (list[str]) – Structures to leave out one at a time (also the inclusion set).
**kwargs – Forwarded to derive_tcren().
- Returns:
Long table residue.aa.from, residue.aa.to, TCRen.LOO, pdb.id stacking the per-structure potentials.
- Return type:
DataFrame
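The leave-one-out pattern can be illustrated with a small stand-alone sketch. It is not the library's code: count_pairs is a hypothetical stand-in for the real derivation (it only counts residue pairs instead of computing log-odds), and the dict keys and structure ids are made up.

```python
from collections import Counter


def count_pairs(contacts, exclude=()):
    """Stand-in for a derivation step: count residue pairs, skipping excluded ids."""
    skip = set(exclude)
    return Counter(
        (c["from"], c["to"]) for c in contacts if c["pdb_id"] not in skip
    )


def leave_one_out(contacts, pdb_ids):
    """Derive once per structure, excluding that structure each time."""
    # pdb_ids doubles as the inclusion set, so restrict the pool first.
    keep = set(pdb_ids)
    pool = [c for c in contacts if c["pdb_id"] in keep]
    stacked = []
    for left_out in pdb_ids:
        table = count_pairs(pool, exclude=[left_out])
        for (aa_from, aa_to), n in sorted(table.items()):
            # Tag every row with the structure that was left out.
            stacked.append((aa_from, aa_to, n, left_out))
    return stacked


contacts = [
    {"pdb_id": "s1", "from": "Y", "to": "L"},
    {"pdb_id": "s2", "from": "Y", "to": "L"},
    {"pdb_id": "s2", "from": "G", "to": "K"},
    {"pdb_id": "s3", "from": "W", "to": "D"},  # outside the inclusion set
]
loo = leave_one_out(contacts, ["s1", "s2"])
```

Each stacked row carries the id of the structure that was excluded when it was derived, which is what lets a structure later be scored with a potential that never saw it.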
Scoring#
tcren.scoring module#
Candidate-peptide scoring by amino-acid substitution.
Ports the second half of run_TCRen.R: for each candidate peptide, substitute its
amino acids at the contacted peptide positions of a structure’s contact map and sum the
pairwise potential over all contacts. Lower scores indicate more favourable interactions.
- tcren.scoring.score_peptides(contact_map, candidates, potential, interface='tcr_peptide', require_same_length=True, substituted_side=None)[source]#
Score candidate peptides against a structure’s contact map.
- Parameters:
contact_map (ContactMap) – The structure’s contact map.
candidates (Iterable[str]) – Candidate peptide sequences (one-letter).
potential (Potential) – Pairwise potential to score with.
interface (Literal['tcr_peptide', 'tcr_mhc', 'peptide_mhc']) – Which interface to score over (default "tcr_peptide").
require_same_length (bool) – Only score candidates whose length matches the structure’s peptide length (mirrors the legacy length join). Ignored when the contact map has no recorded peptide length.
substituted_side (str | None) – "to" or "from": which contact side the candidate is threaded onto. Defaults to the peptide side of interface.
- Returns:
Columns complex.id, peptide, potential, score, sorted by complex.id then ascending score.
- Return type:
DataFrame
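Scoring by substitution can be sketched without the library's data model. In this hedged example the contact map is reduced to (TCR residue, 0-based peptide position) pairs and the potential to a plain dict; both representations, the function names and the energy values are assumptions for illustration.

```python
def score_candidate(contacts, candidate, potential):
    """Sum pairwise energies after threading `candidate` onto the peptide side.

    contacts: iterable of (tcr_aa, peptide_position) pairs, 0-based positions.
    potential: dict mapping (aa_from, aa_to) -> energy.
    """
    return sum(potential[(tcr_aa, candidate[pos])] for tcr_aa, pos in contacts)


def rank_candidates(contacts, candidates, potential, peptide_len):
    """Score equal-length candidates and sort ascending (lower is better)."""
    scored = [
        (pep, score_candidate(contacts, pep, potential))
        for pep in candidates
        if len(pep) == peptide_len  # mirrors the same-length requirement
    ]
    return sorted(scored, key=lambda item: item[1])


# Toy energies, made up for illustration only.
potential = {("Y", "A"): 0.5, ("Y", "L"): -1.0, ("G", "A"): 0.25, ("G", "L"): 0.0}
contacts = [("Y", 0), ("G", 1), ("Y", 1)]
ranking = rank_candidates(contacts, ["AL", "LA", "ALA"], potential, peptide_len=2)
```

The TCR side of every contact stays fixed; only the peptide residue at each contacted position changes with the candidate, and the three-residue candidate is skipped by the length filter.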
- tcren.scoring.score_structures(contact_maps, candidates, potential, **kwargs)[source]#
Score candidates against several structures and stack the results.
- Parameters:
contact_maps (Iterable[ContactMap])
candidates (Iterable[str])
potential (Potential)
- Return type:
DataFrame
tcren.pipeline module#
End-to-end TCRen pipeline: structure → annotation → orientation → contacts → score.
One call takes a TCR-pMHC structure all the way through the tcren workflow:
import the structure (C-gene trimmed);
annotate chains — TCR loci/CDRs via arda, MHC allele/class/role + groove regions;
superimpose onto the canonical database (canonical Cα frame; optional);
markup + contacts — the per-residue region table and the 5 Å contact map;
score each interface with its potential: TCRen for TCR↔peptide, MJ for TCR↔MHC and peptide↔MHC, plus the total.
The interface energy is the sum of the residue-pair potential over the observed contacts of that interface (the closest-atom contact per residue pair, as everywhere in tcren).
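As a rough sketch of the closest-atom rule, the snippet below reduces each residue to a list of heavy-atom coordinates and reports one contact per residue pair whose nearest atoms lie within the cutoff. The dict layout, the residue labels and the inclusive comparison (d <= cutoff) are assumptions, not details confirmed by the library.

```python
import math


def closest_atom_distance(atoms_a, atoms_b):
    """Smallest distance between any atom of residue A and any atom of residue B."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def residue_contacts(residues_a, residues_b, cutoff=5.0):
    """Return (id_a, id_b, distance) for residue pairs within `cutoff` angstroms."""
    contacts = []
    for id_a, atoms_a in residues_a.items():
        for id_b, atoms_b in residues_b.items():
            d = closest_atom_distance(atoms_a, atoms_b)
            if d <= cutoff:  # one contact per residue pair, at its closest atoms
                contacts.append((id_a, id_b, d))
    return contacts


# Toy coordinates (angstroms), made up for illustration.
tcr = {"Y100": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]}
peptide = {"L5": [(4.0, 0.0, 0.0), (9.0, 0.0, 0.0)], "K8": [(20.0, 0.0, 0.0)]}
contacts = residue_contacts(tcr, peptide)
```

An interface energy is then the sum of the pairwise potential over the returned residue pairs.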
- class tcren.pipeline.PipelineResult(pdb_id, mhc_calls, markup, contacts, scores, oriented=None, rmsd=None, extra=<factory>)[source]#
Bases: object
Everything the pipeline produces for one structure.
- Parameters:
- pdb_id: str#
- markup: DataFrame#
- contacts: DataFrame#
- scores: dict[str, float]#
- rmsd: float | None#
- extra: dict#
- tcren.pipeline.run(structure, organism='human', superimpose=True, db_dir=None, cutoff=5.0)[source]#
Run the full pipeline on one structure (path or parsed Structure).
- Parameters:
structure (str | Path | Structure) – a structure file (any tcren-readable format) or an already-parsed structure.
organism (str) – organism for TCR annotation.
superimpose (bool) – also orient onto the canonical database (sets oriented + rmsd).
db_dir (str | Path | None) – canonical database for superimpose (default data/Canonical2026).
cutoff (float) – contact distance threshold (Å).
- Returns:
A PipelineResult with the markup, contacts, per-interface scores and (if requested) the canonical-frame oriented structure.
- Return type:
PipelineResult
- tcren.pipeline.score_row(result)[source]#
Flatten a PipelineResult to a one-row scores dict (for a CSV table).
- Parameters:
result (PipelineResult)
- Return type:
dict
tcren.refine package#
Peptide substitution + potential-guided refinement.
substitute_peptide() threads a new sequence onto the peptide backbone; refine_peptide()
runs a knowledge-based rigid-body Monte-Carlo refinement of the peptide pose via the compiled
tcren._refine kernel. The refinement energy is the DOPE atom-level distance-dependent
statistical potential (Shen & Sali, Protein Science 2006), used here independently of the
TCRen/MJ potentials tcren scores epitopes with — so the pose is not optimised against the same
quantity it is later scored with. This is a lightweight, knowledge-based refine, NOT physics
relaxation (that is Rosetta FlexPepDock, as a subprocess).
- tcren.refine.substitute_peptide(structure, new_peptide, chain_type='PEPTIDE')[source]#
Return a copy of structure with the peptide chain threaded to new_peptide.
The peptide backbone (and Cβ) is preserved; side-chain atoms beyond Cβ are dropped (and Cβ too for any position mutated to glycine). new_peptide must have the same length as the peptide and use the 20 standard one-letter amino acids.
- tcren.refine.refine_peptide(structure, *, shell=12.0, restraint_w=0.5, n_steps=2000, trans_sigma=0.2, rot_sigma=0.05, temp0=1.0, temp1=0.05, seed=0)[source]#
Rigid-body refine the peptide pose against its TCR+MHC partners; returns (structure, energy).
The energy is the DOPE atom-level distance-dependent statistical potential summed over all peptide↔partner heavy-atom pairs within DOPE’s range (its short-range bins are repulsive, so it provides its own clash term), plus a harmonic restraint to the input pose (restraint_w) that keeps the search local. Only partner atoms within shell Å of the peptide are considered. The structure must be chain-typed (peptide = chain of chain_type == 'PEPTIDE'). Requires the compiled _refine extension and the bundled DOPE table.
- Parameters:
structure (Structure)
shell (float)
restraint_w (float)
n_steps (int)
trans_sigma (float)
rot_sigma (float)
temp0 (float)
temp1 (float)
seed (int)
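The interplay of temp0, temp1, n_steps, restraint_w and seed can be illustrated with a toy annealed Metropolis search. Everything here is an assumption made for illustration: the real kernel moves a rigid body under the DOPE energy, whereas this sketch moves a single 1-D coordinate under a caller-supplied energy, and the geometric cooling schedule and the w * (x - x0)**2 restraint form are guesses rather than documented behaviour.

```python
import math
import random


def temperature(step, n_steps, temp0=1.0, temp1=0.05):
    """Geometric interpolation from temp0 (first step) to temp1 (last step)."""
    frac = step / max(n_steps - 1, 1)
    return temp0 * (temp1 / temp0) ** frac


def metropolis_accept(delta, temp, rng):
    """Accept downhill moves always, uphill moves with probability exp(-delta/T)."""
    return delta <= 0 or rng.random() < math.exp(-delta / temp)


def anneal(energy, x0, n_steps=2000, sigma=0.2, restraint_w=0.5, seed=0):
    """Minimise energy(x) + restraint_w * (x - x0)**2 over a 1-D toy coordinate."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible

    def total(x):
        # Harmonic restraint keeps the search near the starting pose x0.
        return energy(x) + restraint_w * (x - x0) ** 2

    x, e = x0, total(x0)
    best_x, best_e = x, e
    for step in range(n_steps):
        cand = x + rng.gauss(0.0, sigma)
        e_cand = total(cand)
        if metropolis_accept(e_cand - e, temperature(step, n_steps), rng):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = anneal(lambda x: (x - 1.0) ** 2, x0=0.0)
```

A larger restraint_w pulls the optimum back toward the input pose, which is the sense in which the restraint keeps the search local.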
Backbone-preserving peptide substitution.
score_peptides scores a candidate peptide virtually (it re-indexes the potential matrix over
the native contact map — no atoms move). When you want to actually re-dock / refine a candidate you
first need its coordinates: substitute_peptide() threads an equal-length sequence onto the
existing peptide backbone, keeping N/Cα/C/O(+Cβ) and dropping the old side-chain atoms beyond Cβ
(a refiner / rotamer repack rebuilds them). Pure data-model manipulation; returns a new structure.
- tcren.refine.substitute.substitute_peptide(structure, new_peptide, chain_type='PEPTIDE')[source]#
Return a copy of structure with the peptide chain threaded to new_peptide.
The peptide backbone (and Cβ) is preserved; side-chain atoms beyond Cβ are dropped (and Cβ too for any position mutated to glycine). new_peptide must have the same length as the peptide and use the 20 standard one-letter amino acids.
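The atom-filtering rule behind the substitution can be sketched on plain dictionaries. This is a simplified illustration, not the library's data model: thread_peptide and the residue dict layout are hypothetical, and the sketch does not build a Cβ that the original residue lacks (for example glycine mutated to alanine).

```python
BACKBONE = ("N", "CA", "C", "O")
THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def thread_peptide(residues, new_peptide):
    """Return new residue dicts carrying `new_peptide` on the old backbone.

    residues: list of {"resname": str, "atoms": {atom_name: (x, y, z)}}.
    """
    if len(new_peptide) != len(residues):
        raise ValueError("new_peptide must match the peptide length")
    threaded = []
    for residue, aa in zip(residues, new_peptide):
        if aa not in THREE_LETTER:
            raise ValueError(f"non-standard amino acid: {aa!r}")
        # Glycine has no CB; every other residue keeps it when present.
        keep = BACKBONE if aa == "G" else BACKBONE + ("CB",)
        atoms = {n: xyz for n, xyz in residue["atoms"].items() if n in keep}
        threaded.append({"resname": THREE_LETTER[aa], "atoms": atoms})
    return threaded


origin = (0.0, 0.0, 0.0)  # placeholder coordinate for every atom
peptide = [
    {"resname": "LEU", "atoms": dict.fromkeys(("N", "CA", "C", "O", "CB", "CG"), origin)},
    {"resname": "ALA", "atoms": dict.fromkeys(("N", "CA", "C", "O", "CB"), origin)},
]
threaded = thread_peptide(peptide, "GV")
```

The input list is left untouched and a new list is returned, in line with the "returns a new structure" behaviour described above.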
Data paths#
tcren.paths module#
Filesystem locations for tcren’s reference data.
The library’s runtime dataset lives in the repo data/ directory (or $TCREN_DATA_DIR):
the canonical Native2026 structure set (HF isalgo/tcren_structures, gitignored),
PDB_date.tsv and orient_metadata.json. Structures are fetched lazily; nothing here is
bundled into the installed package.
- tcren.paths.data_dir()[source]#
Root of the runtime dataset: $TCREN_DATA_DIR or the repo data/ directory.
- Return type:
Path
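The lookup order (environment override first, repository directory second) might look like the sketch below. The function name resolve_data_dir and its repo_root / environ parameters are hypothetical and exist only so the example can run without the package; treating an empty variable as unset is likewise an assumption.

```python
import os
from pathlib import Path


def resolve_data_dir(repo_root, environ=None):
    """Return $TCREN_DATA_DIR when set, else <repo_root>/data."""
    env = os.environ if environ is None else environ
    override = env.get("TCREN_DATA_DIR")
    # An empty or missing variable falls back to the repository data/ folder.
    return Path(override) if override else Path(repo_root) / "data"
```

Passing environ explicitly, as in resolve_data_dir("/repo", {"TCREN_DATA_DIR": "/tmp/tcren"}), makes the override easy to exercise in tests without touching the real environment.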
- tcren.paths.native_dir()[source]#
Directory holding the canonical Native2026 structures (data/Native2026).
- Return type:
Path
- tcren.paths.reference_structure_path(pdb_id)[source]#
Resolve a canonical reference structure by id (plain/gzipped PDB/mmCIF).
Looks under data/Native2026 first; if absent (e.g. a pip-installed library with no repo data/), lazily downloads it from the HF dataset into the HF cache. This makes orienting a new, non-canonical structure work out of the box for both the library and CLI.
Raises FileNotFoundError if it is neither local nor fetchable.
- Parameters:
pdb_id (str)
- Return type:
Path
Analysis#
tcren.analysis module#
Dataset-level analyses for the TCRen contact statistics and potentials.
Helpers for the analysis notebook / benchmarks, following the TCRen manuscript logic:
potential heatmaps and comparisons, the distribution of TCR↔peptide contacts per
structure and per region, and how contacts distribute over peptide / CDR3 positions as a
function of peptide / CDR3 length. They take the manuscript contact + summary tables as
explicit paths (the committed oracle lives under tests/assets/oracle/).
- tcren.analysis.load_interface_contacts(contact_maps, summary)[source]#
Load and enrich the manuscript TCR↔peptide contact table.
Adds, per contact:
peptide_pos (0-based peptide position = residue.index.to), peptide_len, cdr3_len (CDR3α/β length for the contacting TCR chain), cdr3_rel_pos (residue index relative to the chain’s first contacting CDR3 residue; a relative position, since the committed table carries no region start), and the nonred flag.
- Parameters:
contact_maps (str | Path)
summary (str | Path)
- Return type:
DataFrame
- tcren.analysis.contacts_per_structure(df, nonred_only=True)[source]#
Number of TCR↔peptide contacts per structure (with TRA/TRB split).
- Parameters:
df (DataFrame)
nonred_only (bool)
- Return type:
DataFrame
- tcren.analysis.region_contact_counts(df, nonred_only=True)[source]#
Total contacts by TCR region (CDR1/2/3, FR) and chain (TRA/TRB).
- Parameters:
df (DataFrame)
nonred_only (bool)
- Return type:
DataFrame
- tcren.analysis.position_distribution(df, side='peptide', nonred_only=True)[source]#
Contact counts by position, stratified by chain/molecule length.
- Parameters:
df (DataFrame) – enriched contacts (see load_interface_contacts()).
side (str) – "peptide" (peptide position vs peptide length) or "cdr3a" / "cdr3b" (relative CDR3 position vs CDR3 length, for that TCR chain).
nonred_only (bool) – restrict to non-redundant structures.
- Returns:
Long counts: length, position, n_contacts.
- tcren.analysis.potential_long(potential)[source]#
Heatmap-ready long form of a potential: residue.aa.from, residue.aa.to, value.
- Parameters:
potential (Potential)
- Return type:
DataFrame
Orientation#
tcren.orient.align module#
Bring a structure into a canonical reference frame by MHC superposition.
A query complex is oriented onto a native reference by superposing the conserved MHC
groove Cα atoms (the helix/floor residues from tcren.mhc.regions). Because every
structure is aligned to the same reference, all oriented complexes share one frame —
the basis for overlaying structures and for 2D interface projection. Correspondence
between query and reference groove residues is established by sequence alignment, so
different alleles/numbering are handled.
- class tcren.orient.align.OrientationResult(rotation, translation, rmsd, n_anchor_atoms, reference_id)[source]#
Bases: object
Rigid transform that maps a structure onto the canonical reference frame.
- Parameters:
rotation (ndarray)
translation (ndarray)
rmsd (float)
n_anchor_atoms (int)
reference_id (str)
- rotation: ndarray#
- translation: ndarray#
- rmsd: float#
- n_anchor_atoms: int#
- reference_id: str#
- tcren.orient.align.align_to_native(structure, reference_id=None)[source]#
Compute the transform orienting structure onto a native reference by MHC.
structure must already be chain-typed and MHC-annotated (see tcren.mhc.annotate_mhc()). The reference (default a canonical complex for the structure’s MHC class) is loaded from the Native2026 dataset (tcren.paths).
- Parameters:
structure (Structure)
reference_id (str | None)
- Return type:
- tcren.orient.align.apply_transform(structure, result)[source]#
Return a copy of structure with the orientation transform applied to all atoms.
- Parameters:
structure (Structure)
result (OrientationResult)
- Return type:
tcren.orient.frame module#
Canonical TCR-pMHC frame by PCA: z ≈ PC1 (MHC→TCR), y ≈ PC2 (peptide), x ≈ PC3.
Every structure is first superposed onto a per-class native reference by its MHC groove Cα
(tcren.orient.align.align_to_native()); a fixed per-class rotation R_canon then maps
that reference frame into the canonical axes. R_canon is obtained by centring the reference
complex’s Cα cloud at its centre of mass and taking its principal axes (PCA):
z = PC1 (largest variance, the MHC→TCR long axis), signed +z toward the TCR so the MHC sits at −z;
y = PC2 (the groove/peptide axis), signed +y toward the peptide C-terminus;
x = PC3 (the thin axis), signed for a right-handed frame.
R_canon + the variance fractions are cached in the bundled tcren/data/canonical_frame.json
so orientation is reproducible and inspectable. When no native database is available the same
PCA axes are fit directly from the query (the PCA fallback).
- class tcren.orient.frame.CanonResult(rotation, translation, rmsd, n_anchor_atoms, reference_id, frame, reversed_dock=None, chain_map=<factory>)[source]#
Bases: object
Composed rigid transform that maps a structure into the canonical frame.
- Parameters:
rotation (ndarray)
translation (ndarray)
rmsd (float)
n_anchor_atoms (int)
reference_id (str | None)
frame (Literal['native', 'pca'])
reversed_dock (bool | None)
chain_map (dict[str, str])
- rotation: ndarray#
- translation: ndarray#
- rmsd: float#
- n_anchor_atoms: int#
- reference_id: str | None#
- frame: Literal['native', 'pca']#
- reversed_dock: bool | None#
- chain_map: dict[str, str]#
- tcren.orient.frame.canonical_frame(structure, reference_id=None, force_pca=False)[source]#
Compose the MHC superposition with the per-class R_canon (native), or fit the canonical axes directly from the query (PCA fallback when no DB / too few anchors).
- Parameters:
structure (Structure)
reference_id (str | None)
force_pca (bool)
- Return type:
tcren.orient.exceptions module#
Detect reverse-docked TCR-pMHC complexes (a biological exception, flagged not flipped).
The canonical frame is fixed by peptide polarity (+y = peptide C-terminus). The conserved
diagonal docking then places the VDJ chain (TRB/TRD, Vβ) on the peptide-C side (+y) and the
VJ chain (TRA/TRG, Vα) on the peptide-N side (−y) — consistent with the CDR footprint
CDR1α·CDR2α·CDR3α·CDR3β·CDR2β·CDR1β laid out N→C. A genuinely reverse-docked TCR lands with the
α/β sides mirrored. We report it; we never force-flip, because the orientation is meaningful.
- tcren.orient.exceptions.detect_reverse_dock(structure, rotation, translation, margin=2.0)[source]#
Apply the canonical transform and check the TCR α/β handedness.
Canonical: VDJ (TRB/TRD) at +y (peptide-C side) and VJ (TRA/TRG) at −y. Returns True when the VJ chain is on the +y side of the VDJ chain by more than margin Å (reverse dock), False for a canonical dock, and None when a TCR side is missing.
- Parameters:
structure (Structure)
rotation (ndarray)
translation (ndarray)
margin (float)
- Return type:
bool | None
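The three-way outcome of the handedness test can be sketched once each chain is reduced to a single y coordinate in the canonical frame. Using a per-chain centroid as that landmark is an assumption, as is the helper name classify_dock.

```python
def classify_dock(vj_y, vdj_y, margin=2.0):
    """Return True for a reverse dock, False for canonical, None if a side is missing.

    vj_y / vdj_y: canonical-frame y coordinate of each chain's landmark, or None.
    """
    if vj_y is None or vdj_y is None:
        return None  # cannot judge handedness with only one receptor chain
    # Reverse dock: the VJ chain sits on the +y side of the VDJ chain by > margin.
    return (vj_y - vdj_y) > margin
```

With a canonical dock such as classify_dock(-8.0, 9.0) the result is False; swapping the two values gives True, and a small offset like classify_dock(1.0, 0.0) stays False because it falls inside the margin.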
tcren.orient.chains module#
Select a single TCR-pMHC complex and rename its chains to the canonical A–E scheme.
- tcren.orient.chains.select_primary_complex(structure)[source]#
Keep one mutually-contacting chain per canonical role (one TCR-pMHC complex).
Chosen as a connected unit so chains that do not touch the peptide (notably β2m) are not grabbed from another copy: the primary peptide is the one most embedded in an MHC-α groove (then most TCR contacts, then shortest); the TCR α/β and MHC-α are taken by contacts to that peptide; β2m / MHC-β by contacts to the chosen MHC-α. No-op for single complexes.
- tcren.orient.chains.rename_chains(structure)[source]#
Return a copy with only the canonical complex, chain ids remapped per CHAIN_RENAME, plus the old→new map.
Chains with no canonical role (tags, additives, unrelated proteins) are dropped so the output is exactly the A–E TCR-pMHC complex. Raises ValueError if two source chains map to the same canonical id (unresolved multi-copy; run select_primary_complex() first).
tcren.orient.pipeline module#
Orchestrate canonicalization of TCR-pMHC structures into the common MHC frame.
- tcren.orient.pipeline.canonicalize_structure(structure, reference_id=None, force_pca=False, select_primary=True)[source]#
Orient an (already chain-typed + MHC-annotated) structure into the canonical frame.
Returns the oriented, A–E renamed structure and the populated CanonResult (transform, frame, rmsd, reverse-dock flag, chain map). Coordinates are transformed; the chain roles drive the rename, so order matters (frame + reverse-dock are read before the transform clears region markup).
- Parameters:
structure (Structure)
reference_id (str | None)
force_pca (bool)
select_primary (bool)
- Return type:
tuple[Structure, CanonResult]
- tcren.orient.pipeline.align_to_canonical(structure, reference_id=None, organism='human', force_pca=False)[source]#
Align a NEW (parsed) structure onto the Native2026 canonical frame.
Runs chain typing + MHC annotation, then canonicalize_structure(). The stored per-class R_canon is reused, so the result is in the same frame as the dataset and the composed transform in the returned CanonResult replays the placement exactly.
- Parameters:
structure (Structure)
reference_id (str | None)
organism (str)
force_pca (bool)
- Return type:
tuple[Structure, CanonResult]
- tcren.orient.pipeline.check_oriented_complex(structure, max_peptide_len=25, max_offset=25.0, max_tcr_gap=15.0, max_orphan=70.0)[source]#
Geometric sanity check on an oriented A–E complex; returns (ok, reason).
Rejects structures whose canonical placement is inconsistent: missing / overlong peptide, peptide not at the groove centre (≈ origin), the TCR not engaging the peptide, or any chain stranded far from the complex (an orphan copy that survived primary-complex selection).
- Parameters:
max_peptide_len (int)
max_offset (float)
max_tcr_gap (float)
max_orphan (float)
- tcren.orient.pipeline.run_folder(structures, out, metadata=None, organism='human', reference_id=None, force_pca=False, threads=None, mmcif=False, compress=False)[source]#
Canonicalize a file or folder of structures; write oriented structures + a metadata table.
Output format follows mmcif (.cif vs .pdb) and compress (trailing .gz); plain PDB by default (pass compress=True to rebuild the gzipped Canonical2026 set).
Annotation is BATCHED: one mmseqs search for all TCR chains (per organism) and one for all MHC chains across the whole set (mmseqs parallelises internally; never per-structure, never Python-threaded). Only the embarrassingly parallel, mmseqs-free stages (parsing and the structural alignment + write) use a thread pool (threads worker threads, default os.cpu_count()).
- Parameters:
structures (str | Path)
out (str | Path)
metadata (str | Path | None)
organism (str)
reference_id (str | None)
force_pca (bool)
threads (int | None)
mmcif (bool)
compress (bool)
- Return type:
DataFrame
- tcren.orient.pipeline.run_superimpose(structures, out, db_dir=None, organism='human', mmcif=False, compress=False, threads=None)[source]#
Superimpose input structure(s) onto a canonical database; write oriented structures.
structures is a file, directory, .tar.gz, or a shell glob. out is an output directory or, for a single input, a structure file whose extension must match mmcif / compress. Annotation is BATCHED (one mmseqs pass over all inputs); only the mmseqs-free ensemble alignment + write runs on the thread pool (threads workers, default all cores). See tcren.orient.superimpose() for the MHC-ensemble method.
- Parameters:
structures (str | Path)
out (str | Path)
db_dir (str | Path | None)
organism (str)
mmcif (bool)
compress (bool)
threads (int | None)
- Return type:
DataFrame
tcren.orient.docking module#
TCR docking geometry: crossing angle and incident (tilt) angle.
The TCR “docking angle” (Rudolph, Stanfield & Wilson 2006; Garcia et al.) describes how the
αβ (or γδ) TCR sits on top of the peptide-MHC groove. It is computed here directly from the
canonical frame (tcren.orient.frame.canonical_frame()), so no external package
(TCRdock / STCRpy) is required:
the crossing angle is the angle between the Vα→Vβ pseudo-axis projected into the MHC groove plane and the groove long axis (peptide N→C, canonical +y, collinear with the MHC α1 helix to within a few degrees). Reported on [0, 180); canonical αβ TCRs cluster around 20–70°. A signed variant ([-180, 180)) carries the handedness of the docking.
the incident (tilt) angle is the elevation of the same Vα→Vβ vector out of the groove plane (canonical z is the MHC→TCR normal): positive when Vβ rides higher above the groove than Vα.
The Vα/Vβ landmarks are the centroids of the variable-domain Cα atoms of the two receptor chains (TRA/TRB for αβ, TRG/TRD for γδ). The frame is fit from the query itself (PCA), so the calculation needs neither the native database nor mmseqs once the structure is chain-typed.
- class tcren.orient.docking.DockingAngles(crossing_angle, crossing_angle_signed, incident_angle, cell_type, n_va, n_vb)[source]#
Bases: object
TCR docking geometry relative to the MHC groove (all angles in degrees).
- Parameters:
crossing_angle (float)
crossing_angle_signed (float)
incident_angle (float)
cell_type (str)
n_va (int)
n_vb (int)
- crossing_angle: float#
- crossing_angle_signed: float#
- incident_angle: float#
- cell_type: str#
- n_va: int#
- n_vb: int#
- tcren.orient.docking.crossing_incident_from_vector(v_canon)[source]#
(crossing, crossing_signed, incident) degrees from a Vα→Vβ vector in canonical axes.
v_canon is [vx, vy, vz] along canonical x (groove width), y (groove long axis, peptide N→C) and z (MHC→TCR normal). The crossing angle is measured in the groove plane (xy) from the long axis; the incident angle is the elevation out of that plane.
- Parameters:
v_canon (ndarray)
- Return type:
tuple[float, float, float]
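The geometry can be written out with the standard library alone. This is a sketch under stated assumptions rather than the library's code: the sign of the signed crossing angle (positive toward +x) is a guess, the unsigned angle is taken as its absolute value, and the exact interval boundaries documented above are not reproduced.

```python
import math


def crossing_incident(v):
    """Return (crossing, crossing_signed, incident) in degrees for v = (vx, vy, vz)."""
    vx, vy, vz = v
    in_plane = math.hypot(vx, vy)  # length of the projection onto the groove plane
    if in_plane == 0.0:
        raise ValueError("vector has no component in the groove plane")
    # Signed angle from +y (groove long axis); positive toward +x is an assumption.
    signed = math.degrees(math.atan2(vx, vy))
    crossing = abs(signed)
    # Elevation out of the xy groove plane, positive toward +z (the TCR side).
    incident = math.degrees(math.atan2(vz, in_plane))
    return crossing, signed, incident
```

A vector along +y gives a crossing angle of 0°, a vector along +x gives 90°, and equal in-plane and z components give a 45° incident angle.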
- tcren.orient.docking.docking_angles(structure)[source]#
Crossing + incident angle of a chain-typed TCR-pMHC complex.
The structure must already be chain-typed (classify_chains) and MHC-annotated (annotate_mhc) so the canonical frame can be fit. The frame is taken from the query’s own Cα cloud (PCA), so the result needs no native database.
- Parameters:
structure (Structure) – a chain-typed, MHC-annotated TCR-pMHC structure.
- Returns:
A DockingAngles with the crossing and incident angles.
- Raises:
ValueError – if a receptor chain pair (TRA/TRB or TRG/TRD) is missing, or the canonical frame is degenerate.
- Return type:
DockingAngles
Command line#
tcren.cli module#
Command-line interface for tcren.
Subcommands:
tcren info: environment / dependency check.
tcren annotate: chain typing + region markup (TCR/MHC/peptide; --regions to filter, --pseudo for MHC pseudosequence residues) for input structures.
tcren contacts: annotated contact table for input structures.
tcren derive-potential: derive a TCRen potential from a contact-map table.
tcren score: end-to-end candidate scoring (drop-in for run_TCRen.R).
- tcren.cli.paper_bootstrap(structures=<typer.models.OptionInfo object>, canonical=<typer.models.OptionInfo object>)[source]#
Fetch HF structure sets into notebooks/data/<Set>/ (gitignored; non-structure inputs are already committed under natcompsci2022/data_legacy/).
- Parameters:
structures (bool)
canonical (bool)
- Return type:
None
- tcren.cli.annotate(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, regions=<typer.models.OptionInfo object>, pseudo=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>)[source]#
Annotate chains and emit a per-residue region-markup table.
Covers TCR (CDR/FR), MHC groove (helices/floor) and peptide in one pass; --regions restricts the output to one chain class. --pseudo additionally marks the NetMHCpan MHC pseudosequence residues (region MPS). MHC groove + MPS require MHC annotation, which runs automatically when needed.
- Parameters:
structures (Path)
out (Path)
regions (str)
pseudo (bool)
organism (str)
- Return type:
None
- tcren.cli.contacts(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, cutoff=<typer.models.OptionInfo object>, interface=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>)[source]#
Compute and emit an annotated contact table.
- Parameters:
structures (Path)
out (Path)
cutoff (float)
interface (str)
organism (str)
- Return type:
None
- tcren.cli.orient(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, metadata=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>, reference_id=<typer.models.OptionInfo object>, force_pca=<typer.models.OptionInfo object>, threads=<typer.models.OptionInfo object>, mmcif=<typer.models.OptionInfo object>, compress=<typer.models.OptionInfo object>)[source]#
Build a canonical database: orient native TCR-pMHC complexes into the common MHC frame.
Derives the per-class canonical frame and writes every complex into it (A–E chains). This is how the bundled Canonical2026 set is produced; use superimpose to bring a new structure into an existing canonical database.
- Parameters:
structures (Path)
out (Path)
metadata (Path)
organism (str)
reference_id (str)
force_pca (bool)
threads (int)
mmcif (bool)
compress (bool)
- Return type:
None
- tcren.cli.superimpose(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, db=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>, threads=<typer.models.OptionInfo object>, mmcif=<typer.models.OptionInfo object>, compress=<typer.models.OptionInfo object>)[source]#
Superimpose structure(s) onto a canonical database by MHC.
Detects each input’s MHC chains, class, and species, then superposes its conserved groove Cα onto every database structure of the same class and species and averages the transforms into one consensus placement. The database defaults to data/Canonical2026 (populated at install). -s accepts a file, directory, .tar.gz, or a shell glob. -o is an output directory or, for a single input, a structure file whose extension must match --mmCIF / --compress. Annotation is one batched mmseqs call; -t threads the alignment + write.
- Parameters:
structures (str)
out (Path)
db (Path)
organism (str)
threads (int)
mmcif (bool)
compress (bool)
- Return type:
None
- tcren.cli.derive_potential(contact_maps=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, summary=<typer.models.OptionInfo object>, nonred=<typer.models.OptionInfo object>, variant=<typer.models.OptionInfo object>, pseudocount=<typer.models.OptionInfo object>, loo=<typer.models.OptionInfo object>)[source]#
Derive a TCRen potential from observed contacts.
- Parameters:
contact_maps (Path)
out (Path)
summary (Path | None)
nonred (bool)
variant (str)
pseudocount (int)
loo (bool)
- Return type:
None
- tcren.cli.fetch_data(canonical=<typer.models.OptionInfo object>)[source]#
Populate data/ with the reference structure sets from the HF dataset.
Run once at install (setup.sh does this). Fetches Native2026 (orientation references) and, by default, Canonical2026 (the default superimpose database) into $TCREN_DATA_DIR / repo data/. Skips folders already present.
- Parameters:
canonical (bool)
- Return type:
None
- tcren.cli.build_mhc_ref(species=<typer.models.OptionInfo object>, force_download=<typer.models.OptionInfo object>)[source]#
Download and curate the MHC allele reference (IMGT/HLA + UniProt mouse).
- Parameters:
species (str)
force_download (bool)
- Return type:
None
- tcren.cli.fetch_recent(dest=<typer.models.OptionInfo object>, discover=<typer.models.OptionInfo object>, after=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>)[source]#
Download recent TCR-pMHC structures from RCSB into data/pdb_recent.
Seeds with the Native2026 ids; with --discover also full-text-searches RCSB for new entries. Each is pulled as mmCIF (.cif.gz; handles extended PDB ids), annotated, and kept only if it has all 5 required chains (MHCa + b2m/MHCb + peptide + TCR pair).
- Parameters:
dest (Path)
discover (bool)
after (str)
organism (str)
- Return type:
None
- tcren.cli.score(structures=<typer.models.OptionInfo object>, candidates=<typer.models.OptionInfo object>, potential=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, interface=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>, cutoff=<typer.models.OptionInfo object>)[source]#
Score candidate epitopes against input structures (end-to-end pipeline).
- Parameters:
structures (Path)
candidates (Path)
potential (str | None)
out (Path)
interface (str)
organism (str)
cutoff (float)
- Return type:
None
- tcren.cli.pipeline(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, no_superimpose=<typer.models.OptionInfo object>, db=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>, cutoff=<typer.models.OptionInfo object>)[source]#
Run the full pipeline and write per-interface energies for each structure.
structure → annotate (alleles + chains) → superimpose → resmarkup / canonical Cα / contacts → score (TCRen for TCR↔peptide, MJ for TCR↔MHC and peptide↔MHC) + total.
- Parameters:
structures (Path)
out (Path)
no_superimpose (bool)
db (Path)
organism (str)
cutoff (float)
- Return type:
None
- tcren.cli.refine(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, substitute=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>, n_steps=<typer.models.OptionInfo object>, restraint_w=<typer.models.OptionInfo object>, seed=<typer.models.OptionInfo object>, mmcif=<typer.models.OptionInfo object>, compress=<typer.models.OptionInfo object>)[source]#
Potential-guided rigid-body refinement of the peptide pose (knowledge-based, not physics).
Optionally --substitute a new equal-length peptide first, then run a Monte-Carlo refinement scored by the DOPE atom-level statistical potential (restrained to the input pose; independent of the TCRen/MJ scoring potentials). Writes one structure per input and prints the final DOPE energy. (For physics-grade relaxation use Rosetta FlexPepDock externally.)
- Parameters:
structures (str)
out (Path)
substitute (str)
organism (str)
n_steps (int)
restraint_w (float)
seed (int)
mmcif (bool)
compress (bool)
- Return type:
None