tcren package#

Structure I/O#

tcren.structure.model module#

Lightweight structure data model.

These dataclasses wrap the parsed contents of a PDB/mmCIF file in the shape the rest of the pipeline needs: per-chain residue lists carrying both a 0-based sequential index (matching the legacy mir residue.index) and the original author numbering, plus heavy-atom coordinates for contact computation.

class tcren.structure.model.Atom(name, element, coord)[source]#

Bases: object

A single (heavy) atom.

Parameters:
  • name (str)

  • element (str)

  • coord (ndarray)

name: str#
element: str#
coord: ndarray#
class tcren.structure.model.Residue(seq_index, pdb_index, insertion_code, aa, resname, atoms)[source]#

Bases: object

A polymer residue.

Variables:
  • seq_index (int) – 0-based sequential index over the chain’s resolved polymer residues (the legacy residue.index); independent of author numbering gaps.

  • pdb_index (int) – Author residue number (residue.index.pdb).

  • insertion_code (str) – Author insertion code ('' when absent).

  • aa (str) – One-letter amino-acid code ('X' for unknown).

  • resname (str) – Three-letter residue name (HIS, MSE …).

  • atoms (tuple[tcren.structure.model.Atom, ...]) – Heavy atoms of the residue.

Parameters:
  • seq_index (int)

  • pdb_index (int)

  • insertion_code (str)

  • aa (str)

  • resname (str)

  • atoms (tuple[Atom, ...])

seq_index: int#
pdb_index: int#
insertion_code: str#
aa: str#
resname: str#
atoms: tuple[Atom, ...]#
property ca: ndarray | None#

Cα coordinate, or None if the residue has no Cα atom.

property cb: ndarray | None#

Cβ coordinate, or None if the residue has no Cβ atom (e.g. glycine).

property cb_or_ca: ndarray | None#

Cβ coordinate, falling back to Cα (glycine / missing Cβ); None if neither.
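
The Cβ-to-Cα fallback can be illustrated with a minimal stand-in. The class names below (SketchAtom, SketchResidue) are hypothetical, and coordinates are plain tuples rather than ndarray, so this is a sketch of the lookup logic under those assumptions and not the tcren implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Coord = Tuple[float, float, float]


@dataclass(frozen=True)
class SketchAtom:
    # Hypothetical stand-in for tcren.structure.model.Atom.
    name: str
    element: str
    coord: Coord


@dataclass(frozen=True)
class SketchResidue:
    # Hypothetical stand-in carrying only what the lookup needs.
    aa: str
    atoms: Tuple[SketchAtom, ...]

    def _coord_of(self, atom_name: str) -> Optional[Coord]:
        # Return the coordinate of the first atom with this name, or None.
        for atom in self.atoms:
            if atom.name == atom_name:
                return atom.coord
        return None

    @property
    def cb_or_ca(self) -> Optional[Coord]:
        # Prefer CB; fall back to CA (glycine or an unresolved side chain).
        cb = self._coord_of("CB")
        return cb if cb is not None else self._coord_of("CA")


gly = SketchResidue("G", (SketchAtom("CA", "C", (1.0, 2.0, 3.0)),))
ala = SketchResidue("A", (SketchAtom("CA", "C", (0.0, 0.0, 0.0)),
                          SketchAtom("CB", "C", (1.5, 0.0, 0.0))))
empty = SketchResidue("X", ())
```

Here gly falls back to its Cα coordinate, ala returns its Cβ, and a residue with neither atom yields None.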

class tcren.structure.model.Chain(chain_id, residues, chain_type=None, chain_supertype=None, allele_info=None, regions=<factory>)[source]#

Bases: object

A polymer chain and its annotations.

Parameters:
  • chain_id (str)

  • residues (list[Residue])

  • chain_type (str | None)

  • chain_supertype (str | None)

  • allele_info (str | None)

  • regions (list[RegionMarkup])

chain_id: str#
residues: list[Residue]#
chain_type: str | None#
chain_supertype: str | None#
allele_info: str | None#
regions: list[RegionMarkup]#
sequence()[source]#

One-letter sequence in residue order.

Return type:

str

by_seq_index(seq_index)[source]#

Return the residue at a given sequential index, or None.

Parameters:

seq_index (int)

Return type:

Residue | None

class tcren.structure.model.RegionMarkup(region_type, start_seq_index, end_seq_index, sequence, residues)[source]#

Bases: object

An annotated region (CDR/FR for TCR, groove regions for MHC).

Parameters:
  • region_type (str)

  • start_seq_index (int)

  • end_seq_index (int)

  • sequence (str)

  • residues (list[Residue])

region_type: str#
start_seq_index: int#
end_seq_index: int#
sequence: str#
residues: list[Residue]#
class tcren.structure.model.Structure(pdb_id, chains, complex_species=None, cell_type=None)[source]#

Bases: object

A parsed complex: a set of annotated chains.

Parameters:
  • pdb_id (str)

  • chains (list[Chain])

  • complex_species (str | None)

  • cell_type (str | None)

pdb_id: str#
chains: list[Chain]#
complex_species: str | None#
cell_type: str | None#
chain(chain_id)[source]#

Return the chain with the given id (raises KeyError if absent).

Parameters:

chain_id (str)

Return type:

Chain

by_type(*types)[source]#

Return chains whose chain_type is in types.

Parameters:

types (str)

Return type:

list[Chain]

tcren.structure.io module#

Parse PDB / mmCIF files into the tcren.structure.model data model.

Accepts plain .pdb/.ent/.cif/.mmcif files, their gzip-compressed forms (.pdb.gz/.cif.gz …), and — for batches — directories or .tar/.tar.gz archives of any of those (see iter_structures()). Structure identifiers are resolved from the file name by structure_id_from_path().

tcren.structure.io.is_structure_file(name)[source]#

True if name is an (optionally gzipped) PDB/mmCIF structure file.

Parameters:

name (str | Path)

Return type:

bool

tcren.structure.io.structure_id_from_path(path)[source]#

Resolve a structure identifier from a file name.

Strips a trailing .gz and the structure extension, then takes the part before the first _ (so 4x5w_renumbered.cif, 1ao7.pdb.gz and 6uk4_TCRpMHCmodels.pdb all resolve to their PDB id).

Parameters:

path (str | Path)

Return type:

str
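
The resolution rule described above can be sketched as follows. The function name sketch_structure_id and the extension tuple are assumptions taken from the module description; the real function may recognise a different set of suffixes.

```python
from pathlib import Path

# Assumed extension set, copied from the module description.
STRUCTURE_EXTS = (".pdb", ".ent", ".cif", ".mmcif")


def sketch_structure_id(path):
    """Hypothetical re-implementation of the documented naming rule."""
    name = Path(path).name
    # 1. Strip a trailing .gz.
    if name.lower().endswith(".gz"):
        name = name[:-3]
    # 2. Strip the structure extension, if present.
    for ext in STRUCTURE_EXTS:
        if name.lower().endswith(ext):
            name = name[: -len(ext)]
            break
    # 3. Keep the part before the first underscore.
    return name.split("_", 1)[0]
```

Applied to the three file names quoted above, the sketch returns 4x5w, 1ao7 and 6uk4 respectively.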

tcren.structure.io.parse_structure(path, pdb_id=None, model=0, keep_hydrogens=True)[source]#

Parse a structure file into a Structure.

Residues are taken in author order; only amino-acid residues (standard or modified, via the extended three→one table) are kept — waters, ions and ligands are dropped. Each kept residue receives a 0-based sequential seq_index per chain, matching the legacy mir residue.index.

Parameters:
  • path (str | Path) – Path to a .pdb/.ent or .cif/.mmcif file.

  • pdb_id (str | None) – Structure identifier; defaults to the file stem.

  • model (int) – Model index to read (default 0 — the first model).

  • keep_hydrogens (bool) – Keep hydrogen atoms (default True — the legacy mir contact definition counts hydrogens when a structure provides them).

Returns:

The parsed Structure.

Return type:

Structure
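
The relationship between seq_index and author numbering can be sketched without a parser. The table excerpt and the function assign_seq_index are hypothetical; in particular, this sketch treats any residue name missing from the excerpt as non-polymer, which is a simplification of the extended table the text refers to.

```python
# Small excerpt of a three-letter to one-letter table (assumption); the
# extended table mentioned above also covers further modified residues.
THREE_TO_ONE = {"ALA": "A", "GLY": "G", "HIS": "H", "MSE": "M", "SER": "S"}


def assign_seq_index(records):
    """records: (resname, author_number, insertion_code) in author order."""
    kept = []
    for resname, pdb_index, icode in records:
        aa = THREE_TO_ONE.get(resname)
        if aa is None:
            continue  # water, ion or ligand: dropped, consumes no index
        kept.append({
            "seq_index": len(kept),   # 0-based, gap-free, per chain
            "pdb_index": pdb_index,   # author number, kept verbatim
            "insertion_code": icode,
            "aa": aa,
            "resname": resname,
        })
    return kept


chain = assign_seq_index([
    ("ALA", 1, ""), ("GLY", 2, ""), ("HOH", 301, ""),
    ("HIS", 27, ""), ("MSE", 27, "A"),
])
```

The author numbers jump from 2 to 27 and carry an insertion code, while seq_index runs 0, 1, 2, 3 and the water consumes no index.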

tcren.structure.io.import_structure(path, pdb_id=None, model=0, keep_hydrogens=True, trim_c_gene=True, keep_c_gene=False, min_constant_score=80.0)[source]#

Parse a structure and prepare it for interface analysis.

Wraps parse_structure(), records the αβ/γδ cell type from the TCR constant region, and — by default — trims that constant region so downstream analysis works on the variable domains and the interface.

Parameters:
  • path (str | Path) – as in parse_structure().

  • pdb_id (str | None) – as in parse_structure().

  • model (int) – as in parse_structure().

  • keep_hydrogens (bool) – as in parse_structure().

  • trim_c_gene (bool) – Trim the TCR constant domain (default True).

  • keep_c_gene (bool) – Retain the constant domain even if trim_c_gene is set. Use this for molecular-dynamics / FlexPepDock and any workflow that needs the full chain — those depend on the presence of the C-gene.

  • min_constant_score (float) – Minimum constant-region alignment score to trim on.

Returns:

The imported Structure with cell_type set.

Return type:

Structure

tcren.structure.io.structure_paths(src)[source]#

List structure files for src (a single file or a directory), sorted.

Recognises plain and gzipped PDB/mmCIF (.pdb, .cif.gz, …). For archives or streaming, use iter_structures().

Parameters:

src (str | Path)

Return type:

list[Path]

tcren.structure.io.iter_structures(src, importer=<function import_structure>, on_error='raise', **kwargs)[source]#

Yield (pdb_id, Structure) for a file, directory, or .tar/.tar.gz archive.

Handles plain and gzipped PDB/mmCIF (.pdb/.cif/.pdb.gz/.cif.gz …); a directory is scanned for those; a tar archive is streamed member-by-member (each member materialised to a temp file so the path-based importer works unchanged). The identifier is resolved per file by structure_id_from_path().

Parameters:
  • src (str | Path) – structure file, directory, or tar archive.

  • importer (Callable[[...], Structure]) – per-file parser — import_structure() (default, trims the C-gene) or parse_structure() (parity-pure). Extra kwargs are forwarded to it.

  • on_error (str) – "raise" (default) or "skip" to ignore files that fail to parse.

Return type:

Iterator[tuple[str, Structure]]
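
The "raise" / "skip" policy can be sketched as a generic generator. Everything here is hypothetical (iter_imported, fake_importer); it only illustrates how extra keyword arguments are forwarded and how a failing file is either propagated or skipped, and it yields the path rather than a resolved identifier.

```python
def iter_imported(paths, importer, on_error="raise", **kwargs):
    """Yield (path, result) pairs, applying the raise/skip policy."""
    if on_error not in ("raise", "skip"):
        raise ValueError(f"unknown on_error policy: {on_error!r}")
    for path in paths:
        try:
            result = importer(path, **kwargs)  # extra kwargs are forwarded
        except Exception:
            if on_error == "skip":
                continue  # ignore files that fail to parse
            raise
        yield path, result


def fake_importer(path, suffix=""):
    # Hypothetical importer used only to exercise the policy.
    if "broken" in path:
        raise ValueError(f"cannot parse {path}")
    return path.upper() + suffix


parsed = list(iter_imported(["1ao7.pdb", "broken.pdb", "4x5w.cif"],
                            fake_importer, on_error="skip", suffix="!"))
```

With on_error="skip" the broken file is silently dropped; with the default it would raise on the second item.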

tcren.structure.io.pdb_lines(structure, transform=None, keep_hydrogens=True)[source]#

ATOM/TER/END record lines for structure (optionally coordinate-transformed).

One conformer per atom name per residue (drops duplicate altlocs). transform is an optional coord -> coord callable (e.g. for an oriented frame); identity if None. Author residue numbers + insertion codes are preserved.

Parameters:
  • structure (Structure)

  • keep_hydrogens (bool)

Return type:

list[str]

tcren.structure.io.cif_lines(structure, transform=None, keep_hydrogens=True)[source]#

Minimal mmCIF atom_site loop for structure (optionally transformed).

Same atom selection as pdb_lines() (one conformer per atom name per residue). Only the _atom_site category is written — enough to round-trip coordinates + chain/residue identity through the Biopython MMCIF parser, which is all tcren consumes.

Parameters:
  • structure (Structure)

  • keep_hydrogens (bool)

Return type:

list[str]

tcren.structure.io.write_pdb(structure, path, transform=None, keep_hydrogens=True)[source]#

Write structure to a PDB file; return the path.

A .gz suffix (foo.pdb.gz) transparently gzip-compresses the output.

Parameters:
  • structure (Structure)

  • path (str | Path)

  • keep_hydrogens (bool)

Return type:

Path

tcren.structure.io.structure_output_path(directory, pdb_id, mmcif=False, compress=False)[source]#

Build an output path <directory>/<pdb_id>.<ext> from format flags.

.pdb by default, .cif if mmcif, with a trailing .gz if compress.

Parameters:
  • directory (str | Path)

  • pdb_id (str)

  • mmcif (bool)

  • compress (bool)

Return type:

Path
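
A minimal sketch of the naming rule, assuming nothing beyond the description above (the function name sketch_output_path is hypothetical):

```python
from pathlib import Path


def sketch_output_path(directory, pdb_id, mmcif=False, compress=False):
    # .pdb by default, .cif if mmcif, plus a trailing .gz if compress.
    ext = "cif" if mmcif else "pdb"
    name = f"{pdb_id}.{ext}" + (".gz" if compress else "")
    return Path(directory) / name


plain = sketch_output_path("out", "1ao7")
packed = sketch_output_path("out", "1ao7", mmcif=True, compress=True)
```

The two calls produce the file names 1ao7.pdb and 1ao7.cif.gz inside out.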

tcren.structure.io.write_structure(structure, path, transform=None, keep_hydrogens=True)[source]#

Format-dispatch writer: PDB or mmCIF, optionally gzipped (by the path suffix).

Parameters:
  • structure (Structure)

  • path (str | Path)

  • keep_hydrogens (bool)

Return type:

Path

Annotation#

tcren.annotation.arda_adapter module#

TCR chain annotation via the arda library.

Extracts each chain’s amino-acid sequence, runs arda’s AIRR annotation, and projects the returned region coordinates (1-based, end-inclusive into the input sequence) back onto structure residues as RegionMarkup. Region names are mapped to the legacy mir vocabulary (FR1/CDR1/…/FR4).
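
Because the region coordinates are 1-based and end-inclusive while Python sequences are 0-based and half-open, the projection needs a shift on the start only. A minimal sketch, using a hypothetical helper region_slice and a placeholder sequence:

```python
def region_slice(residues, start_1based, end_1based):
    """Select the items covered by a 1-based, end-inclusive region.

    [start, end] (1-based, inclusive) maps to the Python slice
    [start - 1 : end] (0-based, half-open).
    """
    return residues[start_1based - 1:end_1based]


sequence = "ABCDEFGHIJ"          # placeholder, not a real TCR sequence
region = region_slice(sequence, 3, 5)
first = region_slice(sequence, 1, 1)
```

Region 3..5 selects the third to fifth residues (CDE), and region 1..1 selects the first residue alone.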

tcren.annotation.arda_adapter.annotate_chain(chain, organism)[source]#

Annotate one chain with arda; return the AIRR record if it is a TCR chain.

Mutates chain in place when arda recognises it as TRA/TRB: sets chain_type, chain_supertype ("TRAB"), allele_info and regions. Returns the arda record (for any locus) or None if arda produced no locus.

Parameters:

organism (str)

Return type:

dict | None

tcren.annotation.arda_adapter.apply_records(chains, by_id)[source]#

Project a cached {chain_id: record} map onto chains in place (no arda call).

Parameters:

by_id (dict[str, dict])

Return type:

None

tcren.annotation.arda_adapter.score_records(chains, by_id)[source]#

Return (receptor_ids, summed mmseqs2_score) from already-computed records.

Parameters:

by_id (dict[str, dict])

Return type:

tuple[list[str], float]

tcren.annotation.arda_adapter.annotate_chains(chains, organism)[source]#

Annotate a batch of chains in a single arda call; apply records in place.

One mmseqs invocation for all chains (the per-call process/DB overhead dominates the actual alignment, so batching is roughly hundreds of times faster than per-chain calls). Returns a {chain_id: record} map for chains that had a sequence.

Parameters:

organism (str)

Return type:

dict[str, dict]

tcren.annotation.arda_adapter.annotate_tcr_chains(structure, organism='human')[source]#

Annotate all chains; return the ids of those recognised as antigen-receptor (TCR/BCR) chains.

Parameters:

organism (str)

Return type:

list[str]

tcren.annotation.arda_adapter.annotate_tcr_chains_scored(structure, organism='human')[source]#

Annotate all chains; return (receptor_ids, summed mmseqs2_score).

The summed mmseqs2 alignment score over the receptor chains measures how well the structure’s TCR/BCR chains match this organism’s germline reference — the signal used to pick the correct species when annotating against human vs mouse.

Parameters:

organism (str)

Return type:

tuple[list[str], float]

tcren.annotation.chains module#

Chain classification: TRA/TRB (via arda), PEPTIDE, and (provisional) MHC.

Precise MHC sub-typing (MHCa/MHCb/B2M, class I/II) is added in Phase B; here MHC chains are left with the generic type "MHC" so the TCR↔peptide scoring path is complete.

tcren.annotation.chains.classify_chains(structure, organism='human', peptide_max_len=30, autodetect_species=True, precomputed_records=None)[source]#

Classify every chain of structure in place.

TRA/TRB are assigned from arda’s locus call; the shortest remaining chains (length ≤ peptide_max_len) become PEPTIDE; longer remaining chains are tagged "MHC".

Parameters:
  • structure (Structure) – Structure to annotate (mutated in place).

  • organism (str) – arda organism ("human"/"mouse").

  • peptide_max_len (int) – Maximum residue count for a chain to be called PEPTIDE.

  • autodetect_species (bool) – Annotate against both supported species (human and mouse) and keep whichever gives the higher total mmseqs alignment score over the receptor chains. TCR/BCR germlines are organism-specific, so the wrong species scores measurably lower (e.g. mouse BM3.3 scores ~435 vs ~197 under human); this avoids mis-typing a chain under the wrong reference. Ties keep the requested organism. Disable to force organism verbatim.

  • precomputed_records (dict[str, dict[str, dict]] | None) – Optional {organism: {chain_id: record}} of arda records for this structure’s chains, to reuse instead of calling arda (the batch path in annotate_structure_set() annotates the whole dataset in one mmseqs call per organism and injects the per-structure slices).

Return type:

None
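
The species auto-detection rule (highest summed score wins, ties keep the requested organism) can be sketched as below. pick_organism is a hypothetical helper; the example scores are the approximate BM3.3 values quoted above.

```python
def pick_organism(scores, requested):
    """scores: {organism: summed alignment score over receptor chains}."""
    best = requested
    for organism, score in scores.items():
        # Strictly greater, so a tie never displaces the requested organism.
        if score > scores[best]:
            best = organism
    return best


chosen = pick_organism({"human": 197.0, "mouse": 435.0}, requested="human")
tied = pick_organism({"human": 300.0, "mouse": 300.0}, requested="human")
```

The first call switches to mouse because it scores higher; the tied case keeps human.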

tcren.annotation.cgene module#

αβ vs γδ T-cell classification from the TCR constant region (C-gene).

arda annotates the variable V(D)J region but not the constant domain. When a structure includes an ordered constant domain, aligning each chain to the TCR constant references (TRAC/TRBC1/TRBC2 → αβ; TRGC/TRDC → γδ) identifies the chain (α/β/γ/δ) unambiguously and therefore the cell type. This is authoritative for αβ-vs-γδ and independent of the (occasionally ambiguous, e.g. TRAV/DV) V-gene call. Variable-domain-only chains carry no constant region and yield no call (cell type "unknown").

tcren.annotation.cgene.MIN_CONSTANT_SCORE = 80.0#

Minimum local-alignment score to accept a constant-domain match. The V domain alone scores ~30-45 against any constant; a real constant domain scores in the hundreds.

class tcren.annotation.cgene.ConstantCall(chain_id, gene, chain_class, cell_type, score)[source]#

Bases: object

A TCR constant-region identification for one chain.

Parameters:
  • chain_id (str)

  • gene (str)

  • chain_class (str)

  • cell_type (str)

  • score (float)

chain_id: str#
gene: str#
chain_class: str#
cell_type: str#
score: float#
tcren.annotation.cgene.classify_chain_constant(sequence, min_score=80.0)[source]#

Identify the constant region of a single chain sequence, if one is present.

Parameters:
  • sequence (str)

  • min_score (float)

Return type:

ConstantCall | None

tcren.annotation.cgene.constant_span(sequence, min_score=80.0)[source]#

Return the (start, end) query span aligning to the best TCR constant region.

start/end are 0-based half-open indices into sequence. Returns None if no constant domain is present (score below min_score). The constant region is C-terminal, so callers trim residues with seq_index >= start.

Parameters:
  • sequence (str)

  • min_score (float)

Return type:

tuple[int, int] | None
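
How a caller might use the returned span to trim the constant domain, as a sketch. Residues are represented as plain dicts and trim_constant is a hypothetical helper, not part of tcren.

```python
def trim_constant(residues, span):
    """Drop residues at or after the constant-region start.

    span is a 0-based half-open (start, end) pair, or None when no
    constant domain was found (nothing is trimmed in that case).
    """
    if span is None:
        return list(residues)
    start, _end = span
    return [r for r in residues if r["seq_index"] < start]


residues = [{"seq_index": i} for i in range(6)]
variable_only = trim_constant(residues, (4, 6))
untouched = trim_constant(residues, None)
```

With a span starting at 4, residues 0 to 3 survive; with no span the chain is returned unchanged.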

tcren.annotation.cgene.classify_constants(structure, min_score=80.0)[source]#

Identify the constant region of every chain that has one.

Parameters:

min_score (float)

Return type:

list[ConstantCall]

tcren.annotation.cgene.cell_type(structure, min_score=80.0)[source]#

Return "ab", "gd" or "unknown" from the constant regions present.

γδ wins if any γ/δ constant is found; otherwise αβ if any α/β constant is found; otherwise "unknown" (no ordered constant domain — e.g. variable-only chains).

Parameters:

min_score (float)

Return type:

str
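
The precedence rule can be written out directly. This sketch assumes each constant call carries a cell-type label of "ab" or "gd"; sketch_cell_type is a hypothetical name.

```python
def sketch_cell_type(call_cell_types):
    """call_cell_types: cell-type labels of the constant calls found."""
    found = set(call_cell_types)
    if "gd" in found:      # any gamma/delta constant wins
        return "gd"
    if "ab" in found:      # otherwise any alpha/beta constant
        return "ab"
    return "unknown"       # no ordered constant domain at all
```

A structure with both kinds of call is reported as "gd", and a variable-only structure (no calls) as "unknown".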

MHC mapping#

tcren.mhc.imgt module#

Download and parse MHC allele references (IMGT/HLA + UniProt mouse H-2 + B2M).

Produces MhcAllele records labelled with species, MHC class and chain role (MHCa = class-I heavy or class-II alpha; MHCb = class-II beta; B2M). Human alleles come from IMGT/HLA (hla_prot.fasta); mouse H-2 and beta-2-microglobulin come from reviewed UniProt entries.

class tcren.mhc.imgt.MhcAllele(allele, locus, mhc_class, chain_role, species, sequence)[source]#

Bases: object

A reference MHC allele sequence with its functional labels.

Parameters:
  • allele (str)

  • locus (str)

  • mhc_class (str)

  • chain_role (str)

  • species (str)

  • sequence (str)

allele: str#
locus: str#
mhc_class: str#
chain_role: str#
species: str#
sequence: str#
tcren.mhc.imgt.download_human(cache_dir, force=False)[source]#

Download IMGT/HLA hla_prot.fasta into cache_dir.

Parameters:
  • cache_dir (Path)

  • force (bool)

Return type:

Path

tcren.mhc.imgt.download_mouse(cache_dir, force=False)[source]#

Download mouse H-2 / B2m and human B2M into cache_dir.

Parameters:
  • cache_dir (Path)

  • force (bool)

Return type:

tuple[Path, Path]

tcren.mhc.imgt.parse_human(path)[source]#

Parse IMGT/HLA, keeping classical loci collapsed to two-field resolution.

Parameters:

path (Path)

Return type:

list[MhcAllele]

tcren.mhc.imgt.parse_mouse(mouse_path, human_b2m_path)[source]#

Parse reviewed mouse H-2 / B2m and human B2M from UniProt FASTA headers.

Parameters:
  • mouse_path (Path)

  • human_b2m_path (Path)

Return type:

list[MhcAllele]

tcren.mhc.reference module#

Build and load the curated MHC reference shipped under database/mhc/.

The committed reference is a single FASTA (alleles.aa.fasta) whose headers encode the metadata (allele|locus|mhc_class|chain_role|species) plus a metadata.tsv mirror. The mmseqs search index is built on demand into a gitignored cache (mirroring arda’s commit-FASTA / build-index-on-demand split).

tcren.mhc.reference.build(species=('human', 'mouse'), cache_dir=PosixPath('/home/runner/work/tcren/tcren/data/mhc_cache'), out_dir=PosixPath('/home/runner/work/tcren/tcren/database/mhc'), force_download=False)[source]#

Download, curate and write the committed MHC reference.

Parameters:
  • species (tuple[str, ...]) – Which species to include.

  • cache_dir (Path) – Where raw downloads are cached (gitignored).

  • out_dir (Path) – Where the curated alleles.aa.fasta + metadata.tsv are written.

  • force_download (bool) – Re-download even if cached files exist.

Returns:

Path to the written alleles.aa.fasta.

Return type:

Path

tcren.mhc.reference.reference_fasta(out_dir=PosixPath('/home/runner/work/tcren/tcren/database/mhc'))[source]#

Path to the committed reference FASTA (raise if the reference is not built).

Parameters:

out_dir (Path)

Return type:

Path

tcren.mhc.reference.reference_db(cache_dir=PosixPath('/home/runner/work/tcren/tcren/data/mhc_cache'))[source]#

Path to a compiled, pre-indexed mmseqs DB of the allele reference (built once, cached).

mmseqs easy-search otherwise rebuilds the target DB and its k-mer prefilter index from the ~28k-allele FASTA on every call. Caching createdb saves little; the dominant cost is the prefilter index, so we also run createindex once. Reusing this DB cuts a single-structure MHC search from ~4.5 s to ~0.9 s. Built into the gitignored data/mhc_cache when missing or older than the FASTA.

Parameters:

cache_dir (Path)

Return type:

Path

tcren.mhc.reference.parse_header(header)[source]#

Parse a reference FASTA header back into its metadata fields.

Parameters:

header (str)

Return type:

dict[str, str]
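
A sketch of header parsing based on the pipe-separated layout given in the module description (allele|locus|mhc_class|chain_role|species). The function name and the example header values are illustrative assumptions; the committed FASTA may spell alleles and classes differently.

```python
FIELDS = ("allele", "locus", "mhc_class", "chain_role", "species")


def sketch_parse_header(header):
    # Tolerate a leading '>' and surrounding whitespace.
    parts = header.strip().lstrip(">").split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


# Illustrative header; the field values are assumptions.
meta = sketch_parse_header(">A*02:01|A|MHCI|MHCa|human")
```

The result is a flat dict keyed by the five field names.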

tcren.mhc.mapper module#

Map a structure’s MHC chains to allele / class / role via mmseqs.

Searches each not-yet-typed chain against the curated MHC reference and assigns the best hit’s class (MHCI/MHCII), chain role (MHCa/MHCb/B2M), locus and allele. Class is reconciled across the complex (B2M ⇒ class I; a class-II beta chain ⇒ class II).

class tcren.mhc.mapper.MhcCall(chain_id, chain_role, mhc_class, allele, locus, species, identity, bits, qstart, qend, tstart, tend, cigar)[source]#

Bases: object

Result of mapping one chain to the MHC reference.

Parameters:
  • chain_id (str)

  • chain_role (str)

  • mhc_class (str)

  • allele (str)

  • locus (str)

  • species (str)

  • identity (float)

  • bits (float)

  • qstart (int)

  • qend (int)

  • tstart (int)

  • tend (int)

  • cigar (str)

chain_id: str#
chain_role: str#
mhc_class: str#
allele: str#
locus: str#
species: str#
identity: float#
bits: float#
qstart: int#
qend: int#
tstart: int#
tend: int#
cigar: str#
tcren.mhc.mapper.map_mhc(structure, sensitivity=5.7)[source]#

Map the structure’s MHC chains against the curated reference.

Parameters:
  • structure (Structure) – A structure whose TCR/peptide chains are already typed.

  • sensitivity (float) – mmseqs search sensitivity.

Returns:

One MhcCall per chain that produced a reference hit.

Return type:

list[MhcCall]

tcren.mhc.mapper.calls_from_hits(candidates, best, key=None)[source]#

Build reconciled MhcCalls for candidates from precomputed mmseqs hits.

key(chain) -> str maps a candidate chain to its key in best (default the chain id; a batched search uses "<struct_idx>|<chain_id>"). Lets one mmseqs search over many structures’ chains be sliced back per structure — no per-structure mmseqs call.

Parameters:

best (dict[str, dict])

Return type:

list[MhcCall]
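
The batched key scheme can be sketched as follows. batch_key and hits_for_structure are hypothetical helpers, and the hit dicts are placeholders; the sketch only shows how one flat {key: hit} map is sliced back per structure.

```python
def batch_key(struct_idx, chain_id):
    # "<struct_idx>|<chain_id>", as described for batched searches.
    return f"{struct_idx}|{chain_id}"


def hits_for_structure(best, struct_idx, chain_ids):
    """Return {chain_id: hit} for one structure out of a batched map."""
    out = {}
    for chain_id in chain_ids:
        hit = best.get(batch_key(struct_idx, chain_id))
        if hit is not None:
            out[chain_id] = hit
    return out


# Placeholder hits from one search over two structures.
best = {"0|A": {"target": "ref1"}, "1|A": {"target": "ref2"},
        "1|B": {"target": "ref3"}}
second = hits_for_structure(best, 1, ["A", "B", "C"])
```

Chain C of the second structure produced no hit and is simply absent from the slice.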

tcren.mhc.mapper.apply_mhc_calls(structure, calls)[source]#

Write MHC calls onto the structure’s chains in place.

Parameters:
  • structure (Structure)

  • calls (list[MhcCall])

Return type:

None

tcren.mhc.domains module#

Canonical MHC groove region definitions.

Loads the bundled mhc_canonical.json: for each "<class>|<role>" key it holds a canonical mature chain sequence and the 0-based positions of each groove region (HELIX_A1/HELIX_A2 for class I, HELIX_A1/HELIX_B1 for class II, and GROOVE_FLOOR). Region boundaries follow established mature-numbering ranges for the α1/α2 (class I) and α1/β1 (class II) groove domains. Regions are projected onto query chains in tcren.mhc.regions.

tcren.mhc.domains.canonical_groove()[source]#

Return the bundled canonical groove definitions.

Return type:

dict

tcren.mhc.domains.groove_for(mhc_class, chain_role)[source]#

Canonical groove definition for a (class, role), or None if none exists.

Parameters:
  • mhc_class (str)

  • chain_role (str)

Return type:

dict | None

tcren.mhc.regions module#

Project canonical MHC groove regions onto a structure’s MHC chains.

Each MHC chain is aligned (global, BLOSUM62) to the canonical chain for its class/role; the canonical region positions are then mapped through the alignment onto the chain’s residues, producing RegionMarkup entries (HELIX_A1, HELIX_A2/HELIX_B1, GROOVE_FLOOR) in the same schema as the TCR region markup.

tcren.mhc.regions.partition_chain(chain, mhc_class, chain_role)[source]#

Return groove RegionMarkups for one MHC chain (empty for B2M / unknown roles).

Parameters:
  • chain (Chain)

  • mhc_class (str)

  • chain_role (str)

Return type:

list[RegionMarkup]

tcren.mhc.regions.partition_mhc(structure, calls)[source]#

Assign groove regions to every mapped MHC chain in the structure (in place).

Parameters:
  • structure (Structure)

  • calls (list[MhcCall])

Return type:

None

tcren.mhc.regions.annotate_mhc(structure)[source]#

Map and partition the MHC chains of an (already chain-typed) structure.

Returns the MhcCall list and, in place, sets each MHC chain’s type (MHCa/MHCb/B2M), class supertype, allele and groove regions.

Parameters:

structure (Structure)

Return type:

list[MhcCall]

tcren.mhc.regions.annotate_mhc_batch(structures, sensitivity=5.7)[source]#

MHC-annotate many (chain-typed) structures with a SINGLE mmseqs search.

Gathers every candidate MHC chain across all structures, runs one easy_search (mmseqs parallelises internally — no Python threads, no per-structure call), then slices the hits back and applies the calls + groove partitioning to each structure in place. This is the batched equivalent of calling annotate_mhc() per structure, for dataset-scale work.

Parameters:
  • structures (list[Structure])

  • sensitivity (float)

Return type:

None

tcren.mhc.linker module#

Detect and split covalently linked (single-chain) peptides via MHC alignment.

Engineered single-chain pMHC constructs fuse the peptide to an MHC chain through a flexible (usually Gly/Ser-rich) linker, so the peptide is not a separate chain. Aligning each chain to the MHC reference reveals this: the MHC domain aligns, leaving an unaligned terminal segment that — after stripping the linker — is the peptide. This module provides the alignment check and a splitter that lifts such peptides into their own chain.

No covalently linked peptides occur in the bundled TCR3D / PDB datasets (all conventional, separate-chain complexes); this is robustness for engineered and predicted structures.

class tcren.mhc.linker.MhcAlignmentCheck(chain_id, best_ref, score, query_start, query_end, n_term_extra, c_term_extra)[source]#

Bases: object

Result of aligning a chain onto the MHC reference (a chain-identity check).

Parameters:
  • chain_id (str)

  • best_ref (str)

  • score (float)

  • query_start (int)

  • query_end (int)

  • n_term_extra (int)

  • c_term_extra (int)

chain_id: str#
best_ref: str#
score: float#
query_start: int#
query_end: int#
n_term_extra: int#
c_term_extra: int#
property is_mhc: bool#
tcren.mhc.linker.check_against_mhc(chain)[source]#

Align a chain onto the MHC reference and report coverage / terminal extensions.

Parameters:

chain (Chain)

Return type:

MhcAlignmentCheck

tcren.mhc.linker.detect_linked_peptide(chain, min_len=7, max_len=25)[source]#

Return the residues of a peptide fused to an MHC chain, or None.

Looks for a peptide-length segment (after stripping an adjacent Gly/Ser linker) at the N- or C-terminus of a chain whose remainder aligns to the MHC reference.

Parameters:
  • chain (Chain)

  • min_len (int)

  • max_len (int)

Return type:

list | None
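
A sketch of the N-terminal case only, working on sequences rather than residues. It assumes the MHC alignment start within the chain is already known; n_terminal_peptide is a hypothetical helper and the example construct is illustrative.

```python
def n_terminal_peptide(sequence, mhc_start, min_len=7, max_len=25,
                       linker_residues="GS"):
    """Return the peptide preceding a Gly/Ser linker, or None.

    mhc_start is the 0-based index where the MHC-aligned part begins.
    """
    extra = sequence[:mhc_start]               # unaligned N-terminal part
    peptide = extra.rstrip(linker_residues)    # strip the adjacent linker
    if min_len <= len(peptide) <= max_len:
        return peptide
    return None


# Illustrative construct: peptide + (G4S)2 linker + placeholder MHC start.
construct = "GILGFVFTL" + "GGGGSGGGGS" + "HSMRYFFTSV"
peptide = n_terminal_peptide(construct, mhc_start=19)
```

Note that this naive stripping would also remove a genuine Gly or Ser at the peptide's C-terminus, so a production rule needs more care than the sketch shows.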

tcren.mhc.linker.split_linked_peptides(structure, peptide_chain_id='p')[source]#

Split covalently linked peptides off their MHC chains, in place.

For each chain carrying a fused peptide, the peptide residues are removed and added as a new PEPTIDE chain. Returns the list of chain ids that were split (empty if none).

Parameters:
  • structure (Structure)

  • peptide_chain_id (str)

Return type:

list[str]

tcren.mhc.pseudo module#

MHC pseudosequence (MPS) annotation.

NetMHCpan defines a 34-residue “pseudosequence” per allele — the polymorphic groove positions that contact the peptide (class I: α1/α2 of MHCa; class II: α1 of MHCa + β1 of MHCb). The committed mhci_pseudo.fa / mhcii_pseudo.fa (see scripts/build_pseudo_fasta.py) hold the unique pseudosequences.

annotate_pseudo() adds an MPS region to a chain-typed + MHC-annotated structure, on demand. The 34 pseudo positions are scattered along the chain (not a contiguous motif), so an mmseqs/local search can’t find them — there is no shared k-mer to seed on. Instead we thread each candidate 34-mer through the chain with a fitting alignment (gaps in the chain are free, the pseudosequence may not gap), which recovers the positions because NetMHCpan lists them N→C. The best-scoring pseudosequence is chosen (one hit), and its identically-matched residues are marked — across MHCa only for class I, split across MHCa+MHCb for class II, never β2m. Scoring all ~4k pseudosequences this way is ~0.1 s, so no prebuilt index is needed.
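
The threading step can be sketched as a small dynamic programme: every pseudosequence residue must be placed on a chain residue, in order, while chain residues may be skipped at no cost. The scoring below (1 per identical residue, 0 otherwise) and the name thread_pseudo are assumptions for illustration; the actual scoring scheme is not specified here.

```python
def thread_pseudo(pseudo, chain):
    """Return (identities, chain_positions) or None if chain is too short."""
    n, m = len(pseudo), len(chain)
    if n > m:
        return None
    NEG = -1  # marks impossible states (fewer chain than pseudo residues)
    # best[i][j]: max identities placing pseudo[:i] in order onto chain[:j]
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        best[0][j] = 0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            skip = best[i][j - 1]              # chain[j-1] left unused
            use = best[i - 1][j - 1] + (pseudo[i - 1] == chain[j - 1])
            best[i][j] = max(skip, use)
    # Trace back, preferring to skip trailing chain residues.
    positions = []
    i, j = n, m
    while i > 0:
        if j > i and best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            positions.append(j - 1)
            i -= 1
            j -= 1
    positions.reverse()
    return best[n][m], positions


score, positions = thread_pseudo("ACD", "XAXXCXD")
```

In the toy case all three residues are recovered at chain positions 1, 4 and 6 even though they are not contiguous, which is the property a local k-mer search lacks.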

tcren.mhc.pseudo.annotate_pseudo(structure)[source]#

Add an MPS region to each groove chain from the best-matching pseudosequence.

structure must already be chain-typed + MHC-annotated. Returns the chosen pseudosequence id (or None if there is no MHC). The best hit is selected once over the class groove sequence (MHCa for class I; MHCa+MHCb for class II) and its residues marked per chain.

Parameters:

structure (Structure)

Return type:

str | None

Contacts#

tcren.contacts.geometry module#

Atom-level contact and Cα-distance computation.

Ports the legacy mir compute-pdb-contacts / compute-pdb-geom steps using a scipy.spatial.cKDTree for the all-atom neighbour search.

tcren.contacts.geometry.all_atom_contacts(structure, cutoff=5.0)[source]#

Closest inter-chain atom contact for each residue pair within cutoff Å.

For every pair of residues on different chains that have at least one heavy-atom pair within cutoff (inclusive, matching the legacy dist <= 5), the row with the minimum atom–atom distance is kept.

Parameters:
  • structure (Structure)

  • cutoff (float)

Returns:

A table with the columns chain.id.from, residue.index.from, chain.id.to, residue.index.to, residue.aa.from, residue.aa.to, atom.from, atom.to, dist. Each unordered residue pair appears once, in (chain.id, residue.index) lexicographic order.

Return type:

DataFrame

tcren.contacts.geometry.representative_atom_contacts(structure, kind='ca', cutoff=12.0)[source]#

Inter-chain residue contacts by a single representative atom per residue.

kind="ca" uses Cα (default cutoff 12 Å); kind="cb" uses Cβ with a glycine/missing-Cβ fallback to Cα (default cutoff 8 Å). Mirrors the residue-pair schema of all_atom_contacts() (atom.from/atom.to carry the representative atom kind).

Parameters:
  • structure (Structure)

  • kind (str)

  • cutoff (float)

Return type:

DataFrame
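
A brute-force sketch of single-atom contacts, using only the standard library instead of a KD-tree. The function name, the tuple layout and the inclusive cutoff are assumptions made for illustration.

```python
import math


def representative_contacts(residues, cutoff=12.0):
    """residues: (chain_id, seq_index, coord), one coordinate per residue.

    Returns (chain_from, index_from, chain_to, index_to, dist) rows for
    residue pairs on different chains within cutoff (inclusive here).
    """
    ordered = sorted(residues, key=lambda r: (r[0], r[1]))
    rows = []
    for a in range(len(ordered)):
        for b in range(a + 1, len(ordered)):
            chain_a, idx_a, xyz_a = ordered[a]
            chain_b, idx_b, xyz_b = ordered[b]
            if chain_a == chain_b:
                continue  # inter-chain contacts only
            dist = math.dist(xyz_a, xyz_b)
            if dist <= cutoff:
                rows.append((chain_a, idx_a, chain_b, idx_b, dist))
    return rows


rows = representative_contacts([
    ("A", 0, (0.0, 0.0, 0.0)), ("A", 1, (3.0, 0.0, 0.0)),
    ("B", 0, (0.0, 12.0, 0.0)), ("B", 1, (0.0, 30.0, 0.0)),
])
```

The quadratic loop is fine for a sketch; for whole complexes a spatial index such as the cKDTree mentioned above avoids comparing every pair.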

tcren.contacts.geometry.ca_distance_matrix(structure)[source]#

Pairwise Cα–Cα distance matrix over all residues with a Cα atom.

Returns:

(matrix, keys) where matrix[a, b] is the Cα distance and keys[a] is the (chain_id, seq_index) of row/column a.

Parameters:

structure (Structure)

Return type:

tuple[ndarray, list[tuple[str, int]]]

tcren.contacts.definitions module#

Flexible multi-threshold contact definition.

Beyond the legacy single 5 Å all-atom contact (the TCRen parity default, d1), this adds two coarser residue-level layers: d2 over Cβ atoms (Cα for glycine) and d3 over Cα atoms. The layers nest from tight side-chain proximity to backbone neighbourhood, giving the 2D maps and scoring a tunable contact model without changing the 5 Å default.

class tcren.contacts.definitions.ContactDefinition(d1=5.0, d2=8.0, d3=12.0)[source]#

Bases: object

Three nested contact thresholds (Å).

Variables:
  • d1 (float) – closest heavy-atom distance (all-atom contact).

  • d2 (float) – closest Cβ distance (Cα for glycine / missing Cβ).

  • d3 (float) – closest Cα distance.

Parameters:
  • d1 (float)

  • d2 (float)

  • d3 (float)

d1: float#
d2: float#
d3: float#
tcren.contacts.definitions.multi_contacts(structure, definition=ContactDefinition(d1=5.0, d2=8.0, d3=12.0))[source]#

Stacked inter-chain residue contacts across the three layers.

Returns the union of the d1/d2/d3 residue-pair tables with a layer column ("d1"/"d2"/"d3") and the layer’s distance. A residue pair can appear in several layers; callers filter by layer as needed.

Parameters:
  • structure (Structure)

  • definition (ContactDefinition)

Return type:

DataFrame

tcren.contacts.table module#

Annotate and symmetrise the residue contact table.

Joins per-residue annotations (chain type, region type, region start, amino acid) onto the raw contacts and mirrors the R rbind(contacts, swapped) symmetrisation, yielding the fully annotated, bidirectional contact table the contact map is built from.

tcren.contacts.table.residue_annotation(structure)[source]#

Per-residue annotation table for joining onto contacts.

Columns: chain.id, residue.index, chain.type, chain.supertype, region.type, region.start, residue.aa. region.type/region.start are null for residues without a region annotation.

Parameters:

structure (Structure)

Return type:

DataFrame

tcren.contacts.table.symmetrize(contacts)[source]#

Return contacts plus their from/to-swapped mirror (R rbind semantics).

Parameters:

contacts (DataFrame)

Return type:

DataFrame
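
The mirroring step can be sketched on plain dict rows. It assumes every column ending in .from has a matching .to partner and that all other columns (such as dist) are direction-independent; sketch_symmetrize is a hypothetical name.

```python
def sketch_symmetrize(rows):
    """Return rows followed by their from/to-swapped mirrors."""
    mirrored = []
    for row in rows:
        swapped = dict(row)
        for key in row:
            if key.endswith(".from"):
                partner = key[: -len(".from")] + ".to"
                swapped[key], swapped[partner] = row[partner], row[key]
        mirrored.append(swapped)
    return list(rows) + mirrored


contacts = [{"chain.id.from": "A", "chain.id.to": "C",
             "residue.index.from": 5, "residue.index.to": 2, "dist": 3.9}]
both = sketch_symmetrize(contacts)
```

The output has twice as many rows as the input, with originals first and mirrors appended, matching the rbind(contacts, swapped) pattern.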

tcren.contacts.table.tidy_contacts(structure, cutoff=5.0)[source]#

Symmetrised, fully annotated contact table for a structure.

Each inter-chain residue contact appears in both directions, with chain type, region type, region start and amino acid attached on both the from and to sides — the input to tcren.contactmap.ContactMap.

Parameters:
  • structure (Structure)

  • cutoff (float)

Return type:

DataFrame

tcren.contactmap module#

Residue-level contact map and interface partitioning.

A ContactMap wraps the annotated, symmetrised contact table and exposes the three biological interfaces (TCR↔peptide, TCR↔MHC, peptide↔MHC). The TCR↔peptide interface is the central object for scoring and reproduces the schema of data/contact_maps_PDB.csv once chains and regions are annotated.

class tcren.contactmap.ContactMap(pdb_id, contacts, peptide_length=None)[source]#

Bases: object

Annotated, symmetrised residue contacts for one structure.

Parameters:
  • pdb_id (str)

  • contacts (DataFrame)

  • peptide_length (int | None)

pdb_id: str#
contacts: DataFrame#
peptide_length: int | None#
classmethod from_structure(structure, cutoff=5.0)[source]#

Build a contact map from an (annotated) structure.

Parameters:
  • structure (Structure)

  • cutoff (float)

Return type:

ContactMap

interface(which)[source]#

Return the contacts of one interface with within-region positions.

Parameters:

which (Literal['tcr_peptide', 'tcr_mhc', 'peptide_mhc']) – "tcr_peptide", "tcr_mhc" or "peptide_mhc".

Returns:

Filtered contacts with added pos.from/pos.to columns.

Return type:

DataFrame

tcr_peptide()[source]#

Convenience accessor for the TCR↔peptide interface.

Return type:

DataFrame

to_csv(path)[source]#

Write the full annotated contact table to CSV.

Parameters:

path (str | Path)

Return type:

None

Potentials#

tcren.potential.model module#

Pairwise residue-level statistical potentials.

A Potential is a long-form table of pairwise amino-acid energies keyed on (residue.aa.from, residue.aa.to). The “from” side is conventionally the TCR residue and the “to” side the antigen (peptide) residue, matching the orientation of the legacy R pipeline. Potentials can be loaded from the two CSV layouts shipped with the project (wide and long) and exported to a dense matrix for fast scoring.

tcren.potential.model.AA20: tuple[str, ...] = ('L', 'F', 'I', 'M', 'V', 'W', 'Y', 'C', 'H', 'A', 'G', 'P', 'T', 'S', 'Q', 'N', 'D', 'E', 'R', 'K')#

20 standard amino acids (one-letter), TCRen ordering used in the paper.

tcren.potential.model.AA21: tuple[str, ...] = ('A', 'I', 'L', 'V', 'R', 'H', 'K', 'C', 'M', 'S', 'T', 'D', 'E', 'N', 'Q', 'G', 'P', 'Y', 'F', 'W', '-')#

Alphabet of the alignment-matrix variant: the 20 standard amino acids plus the gap symbol (21 symbols).

class tcren.potential.model.Potential(name, matrix, alphabet)[source]#

Bases: object

A pairwise amino-acid potential in long form.

Variables:
  • name (str) – Identifier of the potential (e.g. "TCRen", "MJ", "Keskin").

  • matrix (polars.dataframe.frame.DataFrame) – Long-form table with columns residue.aa.from, residue.aa.to, value.

  • alphabet (tuple[str, ...]) – Amino-acid symbols present on each axis.

Parameters:
  • name (str)

  • matrix (DataFrame)

  • alphabet (tuple[str, ...])

name: str#
matrix: DataFrame#
alphabet: tuple[str, ...]#
value(aa_from, aa_to)[source]#

Return the energy for an ordered residue pair.

Parameters:
  • aa_from (str) – One-letter code of the “from” (TCR) residue.

  • aa_to (str) – One-letter code of the “to” (antigen) residue.

Returns:

The pairwise energy.

Raises:

KeyError – If the pair is absent from the potential.

Return type:

float

as_matrix()[source]#

Return a dense (n, n) matrix and an amino-acid → index map.

Rows are indexed by residue.aa.from, columns by residue.aa.to. Missing pairs are filled with nan.

Return type:

tuple[ndarray, dict[str, int]]
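
The long-to-dense conversion described above can be sketched without any array library. `dense_matrix` is a hypothetical helper that uses nested lists in place of an ndarray.

```python
import math


def dense_matrix(long_rows, alphabet):
    """Build an n x n list-of-lists matrix from (aa_from, aa_to, value) rows."""
    index = {aa: i for i, aa in enumerate(alphabet)}
    n = len(alphabet)
    matrix = [[math.nan] * n for _ in range(n)]  # missing pairs stay nan
    for aa_from, aa_to, value in long_rows:
        # Rows follow the "from" residue, columns the "to" residue.
        matrix[index[aa_from]][index[aa_to]] = value
    return matrix, index


rows = [("A", "L", -0.4), ("L", "A", 0.2)]
matrix, index = dense_matrix(rows, ("A", "L"))
energy = matrix[index["A"]][index["L"]]
```

Looking values up through the index map avoids a table scan per pair, which is the point of the dense form.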

to_csv(path)[source]#

Write the potential to a long-form CSV (from, to, value).

Parameters:

path (str | Path)

Return type:

None

classmethod from_csv(path, name=None, value_col=None)[source]#

Load a potential from a CSV, auto-detecting wide vs long layout.

Two layouts are supported:

  • wide: residue.aa.from, residue.aa.to, <name> (e.g. TCRen_potential.csv with a TCRen value column).

  • long: residue.aa.from, residue.aa.to, potential, value (e.g. MJ_Keskin_potentials.csv); load a single named potential from it.

Parameters:
  • path (str | Path) – Path to the CSV file.

  • name (str | None) – Which potential to select (long layout) or the name to assign (wide layout). Defaults to the value-column name (wide) and is required when a long file holds more than one potential.

  • value_col (str | None) – Override the value column name for the wide layout.

Returns:

The loaded Potential.

Return type:

Potential
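
One way such layout detection could work is sketched below with the standard csv module. The detection rule (a file with both potential and value columns is the multi-potential layout) is an assumption, and `detect_layout` / `load_pairs` are hypothetical names, not the library's implementation.

```python
import csv
import io


def detect_layout(header):
    """Classify a header row using the 'wide'/'long' labels from the text.

    Assumption: both 'potential' and 'value' present means the "long"
    layout; exactly one extra column means the "wide" layout.
    """
    keys = {"residue.aa.from", "residue.aa.to"}
    if not keys <= set(header):
        raise ValueError("missing residue.aa.from / residue.aa.to columns")
    extra = [c for c in header if c not in keys]
    if {"potential", "value"} <= set(extra):
        return "long", "value"
    if len(extra) == 1:
        return "wide", extra[0]
    raise ValueError(f"ambiguous value columns: {extra}")


def load_pairs(text, name=None):
    """Return {(aa_from, aa_to): value} for one potential."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    layout, value_col = detect_layout(reader.fieldnames)
    if layout == "long":
        names = {r["potential"] for r in rows}
        if name is None:
            if len(names) != 1:
                raise ValueError("name is required: several potentials found")
            name = next(iter(names))
        rows = [r for r in rows if r["potential"] == name]
    return {(r["residue.aa.from"], r["residue.aa.to"]): float(r[value_col])
            for r in rows}


# Toy file contents, not real potential values.
wide_text = "residue.aa.from,residue.aa.to,TCRen\nA,L,-0.5\n"
long_text = ("residue.aa.from,residue.aa.to,potential,value\n"
             "A,L,MJ,-1.0\nA,L,Keskin,0.3\n")
wide_pairs = load_pairs(wide_text)
mj_pairs = load_pairs(long_text, name="MJ")
```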

tcren.potential.model.tcren()[source]#

Load the bundled classic TCRen potential.

Return type:

Potential

tcren.potential.model.mj()[source]#

Load the bundled Miyazawa–Jernigan potential.

Return type:

Potential

tcren.potential.model.keskin()[source]#

Load the bundled Keskin contact potential.

Return type:

Potential

tcren.potential.derive module#

Derivation of the TCRen statistical potential from observed contact maps.

This is a direct port of the R derivations in code_paper/2_TCRen_derivation.Rmd (variant="classic") and tcren_am/tcren_am.Rmd (variant="am"). The classic variant reproduces TCRen_potential.csv; the alignment-matrix variant reproduces tcren_am/tcren.txt.

tcren.potential.derive.derive_tcren(contacts, include=None, exclude=None, pseudocount=1, variant='classic', beta=44.0, drop_cys=None)[source]#

Derive a TCRen potential from a table of residue contacts.

Parameters:
  • contacts (DataFrame) – Long table of TCR↔peptide contacts with at least residue.aa.from, residue.aa.to and (for filtering) pdb.id.

  • include (list[str] | None) – If given, keep only contacts whose pdb.id is in this list.

  • exclude (list[str] | None) – If given, drop contacts whose pdb.id is in this list.

  • pseudocount (int) – Added to every amino-acid pair count (default 1).

  • variant (str) – "classic" (natural-log log-odds over 20 aa, Cys dropped from the “from” axis) or "am" (log2/beta over 21 symbols including a gap, Cys retained).

  • beta (float) – Temperature divisor used by the "am" variant.

  • drop_cys (bool | None) – Override the per-variant default for dropping from == "C" rows.

Returns:

The derived Potential. For "am" the long matrix additionally carries a count column.

Return type:

Potential
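
For orientation, a generic pseudocounted log-odds contact potential looks roughly like the sketch below. The exact normalisation used by derive_tcren is not given here, so the observed/expected ratio and the sign convention (negative = enriched, matching "lower is more favourable") are assumptions, and `log_odds_potential` is a hypothetical name.

```python
import math
from collections import Counter


def log_odds_potential(pairs, alphabet, pseudocount=1):
    """Return {(aa_from, aa_to): -ln(observed / expected)}.

    pairs: iterable of (aa_from, aa_to) contacts.
    Expected frequencies are the product of the row and column marginals
    of the pseudocounted table (an assumed, generic choice).
    """
    counts = Counter(pairs)
    table = {(a, b): counts[(a, b)] + pseudocount
             for a in alphabet for b in alphabet}
    total = sum(table.values())
    row = {a: sum(table[(a, b)] for b in alphabet) for a in alphabet}
    col = {b: sum(table[(a, b)] for a in alphabet) for b in alphabet}
    return {
        (a, b): -math.log(table[(a, b)] * total / (row[a] * col[b]))
        for a in alphabet for b in alphabet
    }


# Toy contacts: Y-L is over-represented, Y-D is never observed.
observed = [("Y", "L")] * 6 + [("Y", "K"), ("D", "K"), ("D", "K")]
potential = log_odds_potential(observed, alphabet=("Y", "D", "L", "K"))
```

The pseudocount keeps unobserved pairs finite instead of producing a logarithm of zero.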

tcren.potential.derive.derive_tcren_loo(contacts, pdb_ids, **kwargs)[source]#

Leave-one-out TCRen: derive once per structure, excluding it each time.

Parameters:
  • contacts (DataFrame) – Contact table (see derive_tcren()).

  • pdb_ids (list[str]) – Structures to leave out one at a time (also the inclusion set).

  • **kwargs – Forwarded to derive_tcren().

Returns:

Long table residue.aa.from, residue.aa.to, TCRen.LOO, pdb.id stacking the per-structure potentials.

Return type:

DataFrame
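
The leave-one-out loop itself is simple; a sketch with plain dicts follows. `leave_one_out` and the toy `pair_counts` derive function are hypothetical, and the structure ids are placeholders.

```python
def leave_one_out(contacts, pdb_ids, derive):
    """Derive one potential per held-out structure and stack the results.

    contacts: list of dicts with 'pdb.id', 'residue.aa.from', 'residue.aa.to'.
    derive:   hypothetical callable mapping a list of (aa_from, aa_to) pairs
              to a {(aa_from, aa_to): value} dict.
    """
    keep = set(pdb_ids)  # pdb_ids doubles as the inclusion set
    stacked = []
    for held_out in pdb_ids:
        training = [(c["residue.aa.from"], c["residue.aa.to"])
                    for c in contacts
                    if c["pdb.id"] in keep and c["pdb.id"] != held_out]
        for (aa_from, aa_to), value in derive(training).items():
            stacked.append({"residue.aa.from": aa_from, "residue.aa.to": aa_to,
                            "TCRen.LOO": value, "pdb.id": held_out})
    return stacked


def pair_counts(pairs):
    """Toy derive function: raw pair counts instead of a real potential."""
    out = {}
    for pair in pairs:
        out[pair] = out.get(pair, 0) + 1
    return out


contacts = [
    {"pdb.id": "s1", "residue.aa.from": "Y", "residue.aa.to": "L"},
    {"pdb.id": "s2", "residue.aa.from": "Y", "residue.aa.to": "L"},
    {"pdb.id": "s2", "residue.aa.from": "D", "residue.aa.to": "K"},
]
loo = leave_one_out(contacts, ["s1", "s2"], pair_counts)
```

Each output row is tagged with the structure that was held out, so a structure can later be scored with a potential that never saw it.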

Scoring#

tcren.scoring module#

Candidate-peptide scoring by amino-acid substitution.

Ports the second half of run_TCRen.R: for each candidate peptide, substitute its amino acids at the contacted peptide positions of a structure’s contact map and sum the pairwise potential over all contacts. Lower scores indicate more favourable interactions.

tcren.scoring.score_peptides(contact_map, candidates, potential, interface='tcr_peptide', require_same_length=True, substituted_side=None)[source]#

Score candidate peptides against a structure’s contact map.

Parameters:
  • contact_map (ContactMap) – The structure’s contact map.

  • candidates (Iterable[str]) – Candidate peptide sequences (one-letter).

  • potential (Potential) – Pairwise potential to score with.

  • interface (Literal['tcr_peptide', 'tcr_mhc', 'peptide_mhc']) – Which interface to score over (default "tcr_peptide").

  • require_same_length (bool) – Only score candidates whose length matches the structure’s peptide length (mirrors the legacy length join). Ignored when the contact map has no recorded peptide length.

  • substituted_side (str | None) – "to" or "from" — which contact side the candidate is threaded onto. Defaults to the peptide side of interface.

Returns:

Columns complex.id, peptide, potential, score sorted by complex.id then ascending score.

Return type:

DataFrame
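
The substitution scoring idea can be sketched as follows, assuming contacts are reduced to (TCR amino acid, 0-based peptide position) pairs. `score_candidates` and the toy potential values are hypothetical and only illustrate the summation and the length filter.

```python
def score_candidates(contacts, candidates, potential, peptide_length=None):
    """Thread each candidate onto the peptide side and sum pair energies.

    contacts:  list of (tcr_aa, peptide_pos) with 0-based peptide positions.
    potential: {(aa_from, aa_to): value}, 'from' = TCR, 'to' = peptide.
    """
    scored = []
    for peptide in candidates:
        if peptide_length is not None and len(peptide) != peptide_length:
            continue  # analogous to require_same_length=True
        score = sum(potential[(tcr_aa, peptide[pos])]
                    for tcr_aa, pos in contacts)
        scored.append((peptide, score))
    return sorted(scored, key=lambda item: item[1])  # lower = more favourable


# Toy inputs: three contacts touching peptide positions 3 and 4.
contacts = [("Y", 3), ("D", 4), ("Y", 4)]
potential = {("Y", "L"): -1.0, ("Y", "K"): 0.5,
             ("D", "K"): -2.0, ("D", "L"): 1.0}
ranking = score_candidates(
    contacts, ["AAAKLAAAA", "AAALKAAAA", "AAALK"], potential, peptide_length=9
)
```

Only the contacted positions influence the score; residues elsewhere in the candidate are never looked up.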

tcren.scoring.score_structures(contact_maps, candidates, potential, **kwargs)[source]#

Score candidates against several structures and stack the results.

Parameters:
Return type:

DataFrame

tcren.pipeline module#

End-to-end TCRen pipeline: structure → annotation → orientation → contacts → score.

One call takes a TCR-pMHC structure all the way through the tcren workflow:

  1. import the structure (C-gene trimmed);

  2. annotate chains — TCR loci/CDRs via arda, MHC allele/class/role + groove regions;

  3. superimpose onto the canonical database (canonical Cα frame; optional);

  4. markup + contacts — the per-residue region table and the 5 Å contact map;

  5. score each interface with its potential: TCRen for TCR↔peptide, MJ for TCR↔MHC and peptide↔MHC, plus the total.

The interface energy is the sum of the residue-pair potential over the observed contacts of that interface (the closest-atom contact per residue pair, as everywhere in tcren).

class tcren.pipeline.PipelineResult(pdb_id, mhc_calls, markup, contacts, scores, oriented=None, rmsd=None, extra=<factory>)[source]#

Bases: object

Everything the pipeline produces for one structure.

Parameters:
  • pdb_id (str)

  • mhc_calls (list[MhcCall])

  • markup (DataFrame)

  • contacts (DataFrame)

  • scores (dict[str, float])

  • oriented (Structure | None)

  • rmsd (float | None)

  • extra (dict)

pdb_id: str#
mhc_calls: list[MhcCall]#
markup: DataFrame#
contacts: DataFrame#
scores: dict[str, float]#
oriented: Structure | None#
rmsd: float | None#
extra: dict#
tcren.pipeline.run(structure, organism='human', superimpose=True, db_dir=None, cutoff=5.0)[source]#

Run the full pipeline on one structure (path or parsed Structure).

Parameters:
  • structure (str | Path | Structure) – a structure file (any tcren-readable format) or an already-parsed structure.

  • organism (str) – organism for TCR annotation.

  • superimpose (bool) – also orient onto the canonical database (sets oriented + rmsd).

  • db_dir (str | Path | None) – canonical database for superimpose (default data/Canonical2026).

  • cutoff (float) – contact distance threshold (Å).

Returns:

A PipelineResult with the markup, contacts, per-interface scores and (if requested) the canonical-frame oriented structure.

Return type:

PipelineResult

tcren.pipeline.score_row(result)[source]#

Flatten a PipelineResult to a one-row scores dict (for a CSV table).

Parameters:

result (PipelineResult)

Return type:

dict

tcren.refine package#

Peptide substitution + potential-guided refinement.

substitute_peptide() threads a new sequence onto the peptide backbone; refine_peptide() runs a knowledge-based rigid-body Monte-Carlo refinement of the peptide pose via the compiled tcren._refine kernel. The refinement energy is the DOPE atom-level distance-dependent statistical potential (Shen & Sali, Protein Science 2006), used here independently of the TCRen/MJ potentials tcren scores epitopes with — so the pose is not optimised against the same quantity it is later scored with. This is a lightweight, knowledge-based refine, NOT physics relaxation (that is Rosetta FlexPepDock, as a subprocess).

tcren.refine.substitute_peptide(structure, new_peptide, chain_type='PEPTIDE')[source]#

Return a copy of structure with the peptide chain threaded to new_peptide.

The peptide backbone (and Cβ) is preserved; side-chain atoms beyond Cβ are dropped (and Cβ too for any position mutated to glycine). new_peptide must equal the peptide length and use the 20 standard one-letter amino acids.

Raises:

ValueError – if there is no peptide chain, the length differs, or a code is non-standard.

Parameters:
  • structure (Structure)

  • new_peptide (str)

  • chain_type (str)

Return type:

Structure
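
A minimal sketch of the threading rules described above, using a hypothetical `ToyResidue` that carries only atom names (no coordinates). It is not the library's data model; it just shows which atoms survive and which inputs raise.

```python
from dataclasses import dataclass, replace

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
BACKBONE = {"N", "CA", "C", "O"}


@dataclass(frozen=True)
class ToyResidue:
    """Hypothetical minimal residue: one-letter code plus atom names."""
    aa: str
    atom_names: tuple


def thread(residues, new_peptide):
    """Return new residues with the sequence replaced and side chains trimmed."""
    if len(new_peptide) != len(residues):
        raise ValueError("new_peptide must match the peptide length")
    bad = set(new_peptide) - STANDARD
    if bad:
        raise ValueError(f"non-standard codes: {sorted(bad)}")
    threaded = []
    for res, aa in zip(residues, new_peptide):
        # Glycine keeps the backbone only; everything else also keeps Cβ.
        keep = BACKBONE if aa == "G" else BACKBONE | {"CB"}
        atoms = tuple(n for n in res.atom_names if n in keep)
        threaded.append(replace(res, aa=aa, atom_names=atoms))
    return threaded


peptide = [
    ToyResidue("L", ("N", "CA", "C", "O", "CB", "CG", "CD1", "CD2")),
    ToyResidue("A", ("N", "CA", "C", "O", "CB")),
]
mutant = thread(peptide, "GW")
```

Atoms are only removed, never built: the tryptophan at position 2 keeps just the backbone and Cβ until a later step rebuilds its side chain.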

tcren.refine.refine_peptide(structure, *, shell=12.0, restraint_w=0.5, n_steps=2000, trans_sigma=0.2, rot_sigma=0.05, temp0=1.0, temp1=0.05, seed=0)[source]#

Rigid-body refine the peptide pose against its TCR+MHC partners; returns (structure, energy).

The energy is the DOPE atom-level distance-dependent statistical potential summed over all peptide↔partner heavy-atom pairs within DOPE’s range (its short-range bins are repulsive, so it provides its own clash term), plus a harmonic restraint to the input pose (restraint_w) that keeps the search local. Only partner atoms within shell Å of the peptide are considered. The structure must be chain-typed (peptide = chain of chain_type == 'PEPTIDE'). Requires the compiled _refine extension and the bundled DOPE table.

Parameters:
  • structure (Structure)

  • shell (float)

  • restraint_w (float)

  • n_steps (int)

  • trans_sigma (float)

  • rot_sigma (float)

  • temp0 (float)

  • temp1 (float)

  • seed (int)
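
To make the roles of n_steps, temp0, temp1, restraint_w and seed concrete, here is a toy Metropolis search with an annealed temperature and a harmonic restraint. It is a sketch under loud assumptions: translation only (no rotation), geometric cooling, Gaussian moves, and a quadratic toy energy in place of DOPE. `anneal` and `toy_energy` are hypothetical names, not the compiled kernel.

```python
import math
import random


def anneal(energy, start, n_steps=2000, sigma=0.2, temp0=1.0, temp1=0.05,
           restraint_w=0.5, seed=0):
    """Metropolis search over a 3-D translation with a harmonic restraint.

    Assumptions: geometric cooling from temp0 to temp1, Gaussian moves,
    and translation only (the described refinement also rotates the pose).
    """
    rng = random.Random(seed)

    def total(pose):
        drift = sum((p - s) ** 2 for p, s in zip(pose, start))
        return energy(pose) + restraint_w * drift

    pose, e_pose = list(start), total(start)
    best, e_best = list(pose), e_pose
    for step in range(n_steps):
        temp = temp0 * (temp1 / temp0) ** (step / max(n_steps - 1, 1))
        trial = [p + rng.gauss(0.0, sigma) for p in pose]
        e_trial = total(trial)
        # Always accept downhill moves; accept uphill with Boltzmann odds.
        if e_trial <= e_pose or rng.random() < math.exp((e_pose - e_trial) / temp):
            pose, e_pose = trial, e_trial
            if e_pose < e_best:
                best, e_best = list(pose), e_pose
    return best, e_best


def toy_energy(pose):
    """Toy stand-in for the DOPE term: a well centred at (1, 0, 0)."""
    return (pose[0] - 1.0) ** 2 + pose[1] ** 2 + pose[2] ** 2


start = [0.0, 0.0, 0.0]
best_pose, best_energy = anneal(toy_energy, start)
```

The restraint pulls the optimum back toward the starting pose, so the search settles between the input and the unrestrained minimum rather than drifting freely.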

Backbone-preserving peptide substitution.

score_peptides scores a candidate peptide virtually (it re-indexes the potential matrix over the native contact map — no atoms move). When you want to actually re-dock / refine a candidate you first need its coordinates: substitute_peptide() threads an equal-length sequence onto the existing peptide backbone, keeping N/Cα/C/O(+Cβ) and dropping the old side-chain atoms beyond Cβ (a refiner / rotamer repack rebuilds them). Pure data-model manipulation; returns a new structure.

tcren.refine.substitute.substitute_peptide(structure, new_peptide, chain_type='PEPTIDE')[source]#

Return a copy of structure with the peptide chain threaded to new_peptide.

The peptide backbone (and Cβ) is preserved; side-chain atoms beyond Cβ are dropped (and Cβ too for any position mutated to glycine). new_peptide must equal the peptide length and use the 20 standard one-letter amino acids.

Raises:

ValueError – if there is no peptide chain, the length differs, or a code is non-standard.

Parameters:
  • structure (Structure)

  • new_peptide (str)

  • chain_type (str)

Return type:

Structure

Data paths#

tcren.paths module#

Filesystem locations for tcren’s reference data.

The library’s runtime dataset lives in the repo data/ directory (or $TCREN_DATA_DIR): the canonical Native2026 structure set (HF isalgo/tcren_structures, gitignored), PDB_date.tsv and orient_metadata.json. Structures are fetched lazily; nothing here is bundled into the installed package.

tcren.paths.data_dir()[source]#

Root of the runtime dataset: $TCREN_DATA_DIR or the repo data/ directory.

Return type:

Path
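
The environment-variable override can be pictured like this. `resolve_data_dir` and its `repo_root` argument are hypothetical; how the library locates the repo directory is not specified here.

```python
import os
from pathlib import Path


def resolve_data_dir(repo_root):
    """Prefer $TCREN_DATA_DIR; otherwise fall back to <repo_root>/data."""
    override = os.environ.get("TCREN_DATA_DIR")
    if override:
        return Path(override).expanduser()
    return Path(repo_root) / "data"
```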

tcren.paths.native_dir()[source]#

Directory holding the canonical Native2026 structures (data/Native2026).

Return type:

Path

tcren.paths.reference_structure_path(pdb_id)[source]#

Resolve a canonical reference structure by id (plain/gzipped PDB/mmCIF).

Looks under data/Native2026 first; if absent (e.g. a pip-installed library with no repo data/), lazily downloads it from the HF dataset into the HF cache. This makes orienting a new, non-canonical structure work out of the box for both the library and CLI.

Raises FileNotFoundError if it is neither local nor fetchable.

Parameters:

pdb_id (str)

Return type:

Path

Analysis#

tcren.analysis module#

Dataset-level analyses for the TCRen contact statistics and potentials.

Helpers for the analysis notebook / benchmarks, following the TCRen manuscript logic: potential heatmaps and comparisons, the distribution of TCR↔peptide contacts per structure and per region, and how contacts distribute over peptide / CDR3 positions as a function of peptide / CDR3 length. They take the manuscript contact + summary tables as explicit paths (the committed oracle lives under tests/assets/oracle/).

tcren.analysis.load_interface_contacts(contact_maps, summary)[source]#

Load and enrich the manuscript TCR↔peptide contact table.

Adds, per contact: peptide_pos (0-based peptide position = residue.index.to), peptide_len, cdr3_len (CDR3α/β length for the contacting TCR chain), cdr3_rel_pos (residue index relative to the chain’s first contacting CDR3 residue — a relative position, since the committed table carries no region start), and the nonred flag.

Parameters:
  • contact_maps (str | Path)

  • summary (str | Path)

Return type:

DataFrame

tcren.analysis.contacts_per_structure(df, nonred_only=True)[source]#

Number of TCR↔peptide contacts per structure (with TRA/TRB split).

Parameters:
  • df (DataFrame)

  • nonred_only (bool)

Return type:

DataFrame

tcren.analysis.region_contact_counts(df, nonred_only=True)[source]#

Total contacts by TCR region (CDR1/2/3, FR) and chain (TRA/TRB).

Parameters:
  • df (DataFrame)

  • nonred_only (bool)

Return type:

DataFrame

tcren.analysis.position_distribution(df, side='peptide', nonred_only=True)[source]#

Contact counts by position, stratified by chain/molecule length.

Parameters:
  • df (DataFrame) – enriched contacts (see load_interface_contacts()).

  • side (str) – "peptide" (peptide position vs peptide length) or "cdr3a" / "cdr3b" (relative CDR3 position vs CDR3 length, for that TCR chain).

  • nonred_only (bool) – restrict to non-redundant structures.

Returns:

Long counts with columns length, position, n_contacts.

tcren.analysis.potential_long(potential)[source]#

Heatmap-ready long form of a potential: residue.aa.from, residue.aa.to, value.

Parameters:

potential (Potential)

Return type:

DataFrame

tcren.analysis.compare_potentials(a, b)[source]#

Join two potentials on their amino-acid pairs and add their difference.

Returns residue.aa.from, residue.aa.to, value_a, value_b, diff over shared pairs.

Parameters:
  • a (Potential)

  • b (Potential)

Return type:

DataFrame
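
The join over shared pairs amounts to an inner join on the (from, to) key. In this sketch `compare` is a hypothetical helper over plain dicts, and the direction of diff (value_a minus value_b) is an assumption.

```python
def compare(a, b):
    """Join two {(aa_from, aa_to): value} dicts on their shared pairs."""
    shared = sorted(a.keys() & b.keys())
    return [
        {"residue.aa.from": f, "residue.aa.to": t,
         "value_a": a[(f, t)], "value_b": b[(f, t)],
         "diff": a[(f, t)] - b[(f, t)]}  # assumption: diff = value_a - value_b
        for f, t in shared
    ]


# Toy values; only the Y-L pair is present in both potentials.
pot_a = {("Y", "L"): -1.0, ("D", "K"): -2.0}
pot_b = {("Y", "L"): -0.25, ("W", "G"): 0.5}
rows = compare(pot_a, pot_b)
```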

tcren.analysis.potential_matrix(potential)[source]#

Dense matrix + (from_labels, to_labels) for plotting a potential heatmap.

Parameters:

potential (Potential)

Return type:

tuple[ndarray, list[str], list[str]]

Orientation#

tcren.orient.align module#

Bring a structure into a canonical reference frame by MHC superposition.

A query complex is oriented onto a native reference by superposing the conserved MHC groove Cα atoms (the helix/floor residues from tcren.mhc.regions). Because every structure is aligned to the same reference, all oriented complexes share one frame — the basis for overlaying structures and for 2D interface projection. Correspondence between query and reference groove residues is established by sequence alignment, so different alleles/numbering are handled.

class tcren.orient.align.OrientationResult(rotation, translation, rmsd, n_anchor_atoms, reference_id)[source]#

Bases: object

Rigid transform that maps a structure onto the canonical reference frame.

Parameters:
  • rotation (ndarray)

  • translation (ndarray)

  • rmsd (float)

  • n_anchor_atoms (int)

  • reference_id (str)

rotation: ndarray#
translation: ndarray#
rmsd: float#
n_anchor_atoms: int#
reference_id: str#
tcren.orient.align.align_to_native(structure, reference_id=None)[source]#

Compute the transform orienting structure onto a native reference by MHC.

structure must already be chain-typed and MHC-annotated (see tcren.mhc.annotate_mhc()). The reference (default a canonical complex for the structure’s MHC class) is loaded from the Native2026 dataset (tcren.paths).

Parameters:
  • structure (Structure)

  • reference_id (str | None)

Return type:

OrientationResult

tcren.orient.align.apply_transform(structure, result)[source]#

Return a copy of structure with the orientation transform applied to all atoms.

Parameters:
  • structure (Structure)

  • result (OrientationResult)

Return type:

Structure
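
Applying a rigid transform to coordinates is a rotation followed by a translation. The sketch below assumes the column-vector convention x' = R·x + t, which may differ from how the library stores its rotation; `apply_rigid` is a hypothetical helper.

```python
def apply_rigid(coords, rotation, translation):
    """Map each point x to R @ x + t (column-vector convention, assumed)."""
    return [
        tuple(sum(rotation[i][k] * x[k] for k in range(3)) + translation[i]
              for i in range(3))
        for x in coords
    ]


# 90 degree rotation about z followed by a shift of +1 Å along z.
rot_z90 = [[0.0, -1.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0]]
moved = apply_rigid([(1.0, 0.0, 0.0)], rot_z90, (0.0, 0.0, 1.0))
```

Because the input list is not modified, the original coordinates remain available, in the same spirit as returning a copy of the structure.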

tcren.orient.frame module#

Canonical TCR-pMHC frame by PCA: z ≈ PC1 (MHC→TCR), y ≈ PC2 (peptide), x ≈ PC3.

Every structure is first superposed onto a per-class native reference by its MHC groove Cα (tcren.orient.align.align_to_native()); a fixed per-class rotation R_canon then maps that reference frame into the canonical axes. R_canon is obtained by centring the reference complex’s Cα cloud at its centre of mass and taking its principal axes (PCA):

  • z = PC1 (largest variance, the MHC→TCR long axis), signed +z toward the TCR so the MHC sits at −z;

  • y = PC2 (the groove/peptide axis), signed +y toward the peptide C-terminus;

  • x = PC3 (the thin axis), signed for a right-handed frame.

R_canon + the variance fractions are cached in the bundled tcren/data/canonical_frame.json so orientation is reproducible and inspectable. When no native database is available the same PCA axes are fit directly from the query (the PCA fallback).

class tcren.orient.frame.CanonResult(rotation, translation, rmsd, n_anchor_atoms, reference_id, frame, reversed_dock=None, chain_map=<factory>)[source]#

Bases: object

Composed rigid transform that maps a structure into the canonical frame.

Parameters:
  • rotation (ndarray)

  • translation (ndarray)

  • rmsd (float)

  • n_anchor_atoms (int)

  • reference_id (str | None)

  • frame (Literal['native', 'pca'])

  • reversed_dock (bool | None)

  • chain_map (dict[str, str])

rotation: ndarray#
translation: ndarray#
rmsd: float#
n_anchor_atoms: int#
reference_id: str | None#
frame: Literal['native', 'pca']#
reversed_dock: bool | None#
chain_map: dict[str, str]#
tcren.orient.frame.canonical_frame(structure, reference_id=None, force_pca=False)[source]#

Compose the MHC superposition with the per-class R_canon (native), or fit the canonical axes directly from the query (PCA fallback when no DB / too few anchors).

Parameters:
  • structure (Structure)

  • reference_id (str | None)

  • force_pca (bool)

Return type:

CanonResult

tcren.orient.frame.build_canonical_frame()[source]#

(Re)compute R_canon for each class reference and return the artifact dict.

Writes nothing; the caller serialises to tcren/data/canonical_frame.json.

Return type:

dict

tcren.orient.exceptions module#

Detect reverse-docked TCR-pMHC complexes (a biological exception, flagged not flipped).

The canonical frame is fixed by peptide polarity (+y = peptide C-terminus). The conserved diagonal docking then places the VDJ chain (TRB/TRD, Vβ) on the peptide-C side (+y) and the VJ chain (TRA/TRG, Vα) on the peptide-N side (−y) — consistent with the CDR footprint CDR1α·CDR2α·CDR3α·CDR3β·CDR2β·CDR1β laid out N→C. A genuinely reverse-docked TCR lands with the α/β sides mirrored. We report it; we never force-flip, because the orientation is meaningful.

tcren.orient.exceptions.detect_reverse_dock(structure, rotation, translation, margin=2.0)[source]#

Apply the canonical transform and check the TCR α/β handedness.

Canonical: VDJ (TRB/TRD) at +y (peptide-C side) and VJ (TRA/TRG) at −y. Returns True when the VJ chain is on the +y side of the VDJ chain by more than margin Å (reverse dock), False for a canonical dock, and None when a TCR side is missing.

Parameters:
  • structure (Structure)

  • rotation (ndarray)

  • translation (ndarray)

  • margin (float)

Return type:

bool | None
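
The three-valued result can be sketched from canonical-frame y coordinates alone. `reverse_dock_flag` is a hypothetical helper, and summarising each chain by its mean y coordinate is an assumption about how the two sides are compared.

```python
def reverse_dock_flag(vj_y, vdj_y, margin=2.0):
    """Classify docking polarity from canonical-frame y coordinates.

    vj_y / vdj_y: y coordinates (Å) of the VJ (TRA/TRG) and VDJ (TRB/TRD)
    chain atoms after the canonical transform; empty or None if missing.
    Assumption: each chain is summarised by its mean y coordinate.
    """
    if not vj_y or not vdj_y:
        return None  # a TCR side is missing
    vj_mean = sum(vj_y) / len(vj_y)
    vdj_mean = sum(vdj_y) / len(vdj_y)
    # Reverse dock: VJ sits on the +y side of VDJ by more than the margin.
    return (vj_mean - vdj_mean) > margin


canonical = reverse_dock_flag([-12.0, -8.0], [9.0, 11.0])
reverse = reverse_dock_flag([9.0, 11.0], [-12.0, -8.0])
missing = reverse_dock_flag([], [9.0, 11.0])
```

The margin keeps near-symmetric placements from being flagged on noise alone.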

tcren.orient.chains module#

Select a single TCR-pMHC complex and rename its chains to the canonical A–E scheme.

tcren.orient.chains.select_primary_complex(structure)[source]#

Keep one mutually-contacting chain per canonical role (one TCR-pMHC complex).

Chosen as a connected unit so chains that do not touch the peptide (notably β2m) are not grabbed from another copy: the primary peptide is the one most embedded in an MHC-α groove (then most TCR contacts, then shortest); the TCR α/β and MHC-α are taken by contacts to that peptide; β2m / MHC-β by contacts to the chosen MHC-α. No-op for single complexes.

Parameters:

structure (Structure)

Return type:

Structure

tcren.orient.chains.rename_chains(structure)[source]#

Return a copy with only the canonical complex, chain ids remapped per CHAIN_RENAME, plus the old→new map.

Chains with no canonical role (tags, additives, unrelated proteins) are dropped so the output is exactly the A–E TCR-pMHC complex. Raises ValueError if two source chains map to the same canonical id (unresolved multi-copy — run select_primary_complex() first).

Parameters:

structure (Structure)

Return type:

tuple[Structure, dict[str, str]]

tcren.orient.pipeline module#

Orchestrate canonicalization of TCR-pMHC structures into the common MHC frame.

tcren.orient.pipeline.canonicalize_structure(structure, reference_id=None, force_pca=False, select_primary=True)[source]#

Orient an (already chain-typed + MHC-annotated) structure into the canonical frame.

Returns the oriented, A–E renamed structure and the populated CanonResult (transform, frame, rmsd, reverse-dock flag, chain map). Coordinates are transformed; the chain roles drive the rename, so order matters (frame + reverse-dock are read before the transform clears region markup).

Parameters:
  • structure (Structure)

  • reference_id (str | None)

  • force_pca (bool)

  • select_primary (bool)

Return type:

tuple[Structure, CanonResult]

tcren.orient.pipeline.align_to_canonical(structure, reference_id=None, organism='human', force_pca=False)[source]#

Align a NEW (parsed) structure onto the Native2026 canonical frame.

Runs chain typing + MHC annotation, then canonicalize_structure(). The stored per-class R_canon is reused, so the result is in the same frame as the dataset and the composed transform in the returned CanonResult replays the placement exactly.

Parameters:
  • structure (Structure)

  • reference_id (str | None)

  • organism (str)

  • force_pca (bool)

Return type:

tuple[Structure, CanonResult]

tcren.orient.pipeline.check_oriented_complex(structure, max_peptide_len=25, max_offset=25.0, max_tcr_gap=15.0, max_orphan=70.0)[source]#

Geometric sanity check on an oriented A–E complex; returns (ok, reason).

Rejects structures whose canonical placement is inconsistent: missing / overlong peptide, peptide not at the groove centre (≈ origin), the TCR not engaging the peptide, or any chain stranded far from the complex (an orphan copy that survived primary-complex selection).

Parameters:
  • max_peptide_len (int)

  • max_offset (float)

  • max_tcr_gap (float)

  • max_orphan (float)

tcren.orient.pipeline.run_folder(structures, out, metadata=None, organism='human', reference_id=None, force_pca=False, threads=None, mmcif=False, compress=False)[source]#

Canonicalize a file or folder of structures; write oriented structures + a metadata table.

Output format follows mmcif (.cif vs .pdb) and compress (trailing .gz); plain PDB by default (pass compress=True to rebuild the gzipped Canonical2026 set).

Annotation is BATCHED — one mmseqs search for all TCR chains (per organism) and one for all MHC chains across the whole set (mmseqs parallelises internally; never per-structure, never Python-threaded). Only the embarrassingly-parallel, mmseqs-free stages — parsing and the structural alignment + write — use a thread pool (threads worker threads, default os.cpu_count()).

Parameters:
  • structures (str | Path)

  • out (str | Path)

  • metadata (str | Path | None)

  • organism (str)

  • reference_id (str | None)

  • force_pca (bool)

  • threads (int | None)

  • mmcif (bool)

  • compress (bool)

Return type:

DataFrame

tcren.orient.pipeline.run_superimpose(structures, out, db_dir=None, organism='human', mmcif=False, compress=False, threads=None)[source]#

Superimpose input structure(s) onto a canonical database; write oriented structures.

structures is a file, directory, .tar.gz, or a shell glob. out is an output directory, or — for a single input — a structure file whose extension must match mmcif/compress. Annotation is BATCHED (one mmseqs pass over all inputs); only the mmseqs-free ensemble alignment + write runs on the thread pool (threads workers, default all cores). See tcren.orient.superimpose() for the MHC-ensemble method.

Parameters:
  • structures (str | Path)

  • out (str | Path)

  • db_dir (str | Path | None)

  • organism (str)

  • mmcif (bool)

  • compress (bool)

  • threads (int | None)

Return type:

DataFrame

tcren.orient.docking module#

TCR docking geometry: crossing angle and incident (tilt) angle.

The TCR “docking angle” (Rudolph, Stanfield & Wilson 2006; Garcia et al.) describes how the αβ (or γδ) TCR sits on top of the peptide-MHC groove. It is computed here directly from the canonical frame (tcren.orient.frame.canonical_frame()), so no external package (TCRdock / STCRpy) is required:

  • the crossing angle is the angle between the Vα→Vβ pseudo-axis projected into the MHC groove plane and the groove long axis (peptide N→C, canonical +y — collinear with the MHC α1 helix to within a few degrees). Reported on [0, 180); canonical αβ TCRs cluster around ~20–70°. A signed variant ([-180, 180)) carries the handedness of the docking.

  • the incident (tilt) angle is the elevation of the same Vα→Vβ vector out of the groove plane (canonical z is the MHC→TCR normal): positive when Vβ rides higher above the groove than Vα.

The Vα/Vβ landmarks are the centroids of the variable-domain Cα atoms of the two receptor chains (TRA/TRB for αβ, TRG/TRD for γδ). The frame is fit from the query itself (PCA), so the calculation needs neither the native database nor mmseqs once the structure is chain-typed.

class tcren.orient.docking.DockingAngles(crossing_angle, crossing_angle_signed, incident_angle, cell_type, n_va, n_vb)[source]#

Bases: object

TCR docking geometry relative to the MHC groove (all angles in degrees).

Parameters:
  • crossing_angle (float)

  • crossing_angle_signed (float)

  • incident_angle (float)

  • cell_type (str)

  • n_va (int)

  • n_vb (int)

crossing_angle: float#
crossing_angle_signed: float#
incident_angle: float#
cell_type: str#
n_va: int#
n_vb: int#
tcren.orient.docking.crossing_incident_from_vector(v_canon)[source]#

(crossing, crossing_signed, incident) degrees from a Vα→Vβ vector in canonical axes.

v_canon is [vx, vy, vz] along canonical x (groove width), y (groove long axis, peptide N→C) and z (MHC→TCR normal). The crossing angle is measured in the groove plane (xy) from the long axis; the incident angle is the elevation out of that plane.

Parameters:

v_canon (ndarray)

Return type:

tuple[float, float, float]
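
The geometry can be sketched with the standard math module. The sign convention (signed angle measured from +y toward +x via atan2) and taking the unsigned angle as its absolute value are assumptions; note also that atan2 yields (-180, 180] rather than the half-open interval quoted above. `angles_from_vector` is a hypothetical name.

```python
import math


def angles_from_vector(v):
    """Return (crossing, crossing_signed, incident) in degrees.

    Assumptions: the signed crossing angle is atan2(vx, vy), i.e. measured
    from +y toward +x, and the unsigned angle is its absolute value.
    """
    vx, vy, vz = v
    in_plane = math.hypot(vx, vy)  # length of the projection into the xy plane
    if in_plane == 0.0:
        raise ValueError("vector has no component in the groove plane")
    signed = math.degrees(math.atan2(vx, vy))
    incident = math.degrees(math.atan2(vz, in_plane))  # elevation out of plane
    return abs(signed), signed, incident


crossing, signed, incident = angles_from_vector((1.0, 1.0, 0.0))
tilted = angles_from_vector((0.0, 1.0, 1.0))
```

A vector lying along the groove axis has a crossing angle of zero, and any z component shows up only in the incident angle.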

tcren.orient.docking.docking_angles(structure)[source]#

Crossing + incident angle of a chain-typed TCR-pMHC complex.

The structure must already be chain-typed (classify_chains) and MHC-annotated (annotate_mhc) so the canonical frame can be fit. The frame is taken from the query’s own Cα cloud (PCA), so the result needs no native database.

Parameters:

structure (Structure) – a chain-typed, MHC-annotated TCR-pMHC structure.

Returns:

A DockingAngles with the crossing and incident angles.

Raises:

ValueError – if a receptor chain pair (TRA/TRB or TRG/TRD) is missing, or the canonical frame is degenerate.

Return type:

DockingAngles

Command line#

tcren.cli module#

Command-line interface for tcren.

Subcommands:

  • tcren info — environment / dependency check.

  • tcren annotate — chain typing + region markup (TCR/MHC/peptide; --regions to filter, --pseudo for MHC pseudosequence residues) for input structures.

  • tcren contacts — annotated contact table for input structures.

  • tcren derive-potential — derive a TCRen potential from a contact-map table.

  • tcren score — end-to-end candidate scoring (drop-in for run_TCRen.R).

tcren.cli.paper_bootstrap(structures=<typer.models.OptionInfo object>, canonical=<typer.models.OptionInfo object>)[source]#

Fetch HF structure sets into notebooks/data/<Set>/ (gitignored; non-structure inputs are already committed under natcompsci2022/data_legacy/).

Parameters:
  • structures (bool)

  • canonical (bool)

Return type:

None

tcren.cli.info()[source]#

Show version and dependency availability.

Return type:

None

tcren.cli.annotate(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, regions=<typer.models.OptionInfo object>, pseudo=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>)[source]#

Annotate chains and emit a per-residue region-markup table.

Covers TCR (CDR/FR), MHC groove (helices/floor) and peptide in one pass — --regions restricts the output to one chain class. --pseudo additionally marks the NetMHCpan MHC pseudosequence residues (region MPS). MHC groove + MPS require MHC annotation, which runs automatically when needed.

Parameters:
  • structures (Path)

  • out (Path)

  • regions (str)

  • pseudo (bool)

  • organism (str)

Return type:

None

tcren.cli.contacts(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, cutoff=<typer.models.OptionInfo object>, interface=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>)[source]#

Compute and emit an annotated contact table.

Parameters:
  • structures (Path)

  • out (Path)

  • cutoff (float)

  • interface (str)

  • organism (str)

Return type:

None
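
To make the role of the cutoff parameter concrete, here is a minimal sketch of a distance-cutoff contact search over heavy atoms. It assumes two residues are in contact when any pair of their heavy atoms lies within the cutoff. The input layout (a dict from residue key to coordinate list) and the function name `residue_contacts` are hypothetical and do not reflect tcren's internal data model or output columns.

```python
import math
from itertools import product


def residue_contacts(chain_a, chain_b, cutoff):
    """List (key_a, key_b, min_distance) for residue pairs within `cutoff`.

    chain_a / chain_b: dict mapping a residue key to a non-empty list of
    (x, y, z) heavy-atom coordinates (hypothetical layout).
    """
    contacts = []
    for (key_a, atoms_a), (key_b, atoms_b) in product(chain_a.items(), chain_b.items()):
        # Closest heavy-atom pair between the two residues.
        dmin = min(math.dist(p, q) for p in atoms_a for q in atoms_b)
        if dmin <= cutoff:
            contacts.append((key_a, key_b, dmin))
    return contacts


# Toy coordinates, in the same length unit as the cutoff.
tcr = {"A:95": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]}
peptide = {"C:4": [(4.0, 0.0, 0.0)], "C:5": [(10.0, 0.0, 0.0)]}
found = residue_contacts(tcr, peptide, cutoff=5.0)
```

The all-pairs loop is quadratic in the number of residues; a spatial index would be the usual choice for large inputs.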

tcren.cli.orient(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, metadata=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>, reference_id=<typer.models.OptionInfo object>, force_pca=<typer.models.OptionInfo object>, threads=<typer.models.OptionInfo object>, mmcif=<typer.models.OptionInfo object>, compress=<typer.models.OptionInfo object>)[source]#

Build a canonical database: orient native TCR-pMHC complexes into the common MHC frame.

Derives the per-class canonical frame and writes every complex into it (A–E chains). This is how the bundled Canonical2026 set is produced; use superimpose to bring a new structure into an existing canonical database.

Parameters:
  • structures (Path)

  • out (Path)

  • metadata (Path)

  • organism (str)

  • reference_id (str)

  • force_pca (bool)

  • threads (int)

  • mmcif (bool)

  • compress (bool)

Return type:

None

tcren.cli.superimpose(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, db=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>, threads=<typer.models.OptionInfo object>, mmcif=<typer.models.OptionInfo object>, compress=<typer.models.OptionInfo object>)[source]#

Superimpose structure(s) onto a canonical database by MHC.

Detects each input’s MHC chains, class, and species, then superposes its conserved groove Cα onto every database structure of the same class and species and averages the transforms into one consensus placement. The database defaults to data/Canonical2026 (populated at install).

-s accepts a file, a directory, a .tar.gz archive, or a shell glob. -o is an output directory or, for a single input, a structure file whose extension must match --mmCIF/--compress. Annotation runs as one batched mmseqs call; -t sets the number of threads for the alignment and write steps.

Parameters:
  • structures (str)

  • out (Path)

  • db (Path)

  • organism (str)

  • threads (int)

  • mmcif (bool)

  • compress (bool)

Return type:

None
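
The rule that a single-file -o path must agree with --mmCIF/--compress can be sketched as a small check. The exact suffix spellings below (.pdb, .cif, and a trailing .gz) are assumptions for illustration, and `expected_suffix` and `output_matches` are hypothetical helpers, not part of tcren.

```python
def expected_suffix(mmcif: bool, compress: bool) -> str:
    """Suffix implied by the two flags (assumed spellings)."""
    suffix = ".cif" if mmcif else ".pdb"
    return suffix + ".gz" if compress else suffix


def output_matches(path: str, mmcif: bool, compress: bool) -> bool:
    """True if a single-file output path agrees with the chosen flags."""
    return path.lower().endswith(expected_suffix(mmcif, compress))
```

Under these assumptions, `out/model.cif.gz` would be accepted only when both flags are set.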

tcren.cli.derive_potential(contact_maps=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, summary=<typer.models.OptionInfo object>, nonred=<typer.models.OptionInfo object>, variant=<typer.models.OptionInfo object>, pseudocount=<typer.models.OptionInfo object>, loo=<typer.models.OptionInfo object>)[source]#

Derive a TCRen potential from observed contacts.

Parameters:
  • contact_maps (Path)

  • out (Path)

  • summary (Path | None)

  • nonred (bool)

  • variant (str)

  • pseudocount (int)

  • loo (bool)

Return type:

None
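
For readers unfamiliar with contact-derived potentials, the sketch below shows a generic log-odds construction with a pseudocount: observed pair counts are smoothed, compared with the counts expected if the two residue types paired independently, and converted to an energy by a negative logarithm. This is a textbook form chosen for illustration; it is not confirmed to be the formula tcren uses, and `derive_log_odds` and its input layout are hypothetical.

```python
import math
from collections import Counter


def derive_log_odds(contacts, pseudocount=1):
    """Map (aa_tcr, aa_peptide) -> -ln(observed / expected).

    contacts: iterable of (aa_tcr, aa_peptide) pairs (hypothetical layout).
    """
    counts = Counter(contacts)
    tcr_aas = sorted({a for a, _ in counts})
    pep_aas = sorted({b for _, b in counts})
    # Smooth every cell of the grid so unseen pairs do not give log(0).
    smoothed = {
        (a, b): counts.get((a, b), 0) + pseudocount
        for a in tcr_aas
        for b in pep_aas
    }
    total = sum(smoothed.values())
    row = {a: sum(smoothed[(a, b)] for b in pep_aas) for a in tcr_aas}
    col = {b: sum(smoothed[(a, b)] for a in tcr_aas) for b in pep_aas}
    # Expected count under independence is row * col / total.
    return {
        (a, b): -math.log(smoothed[(a, b)] * total / (row[a] * col[b]))
        for (a, b) in smoothed
    }
```

Negative values mark pairs seen more often than expected under independence. With a pseudocount of 0, any unobserved pair in the grid would raise a math domain error, which is one reason a pseudocount is commonly applied.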

tcren.cli.fetch_data(canonical=<typer.models.OptionInfo object>)[source]#

Populate data/ with the reference structure sets from the HF dataset.

Run once at install (setup.sh does this). Fetches Native2026 (orientation references) and, by default, Canonical2026 (the default superimpose database) into $TCREN_DATA_DIR or the repo data/ directory. Skips folders already present.

Parameters:

canonical (bool)

Return type:

None

tcren.cli.build_mhc_ref(species=<typer.models.OptionInfo object>, force_download=<typer.models.OptionInfo object>)[source]#

Download and curate the MHC allele reference (IMGT/HLA + UniProt mouse).

Parameters:
  • species (str)

  • force_download (bool)

Return type:

None

tcren.cli.fetch_recent(dest=<typer.models.OptionInfo object>, discover=<typer.models.OptionInfo object>, after=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>)[source]#

Download recent TCR-pMHC structures from RCSB into data/pdb_recent.

Seeds with the Native2026 ids; with --discover, also runs a full-text search of RCSB for new entries. Each entry is pulled as mmCIF (.cif.gz; handles extended PDB ids), annotated, and kept only if it has all 5 required chains (MHCa + b2m/MHCb + peptide + TCR pair).

Parameters:
  • dest (Path)

  • discover (bool)

  • after (str)

  • organism (str)

Return type:

None
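
The five-chain completeness rule can be expressed as a simple predicate. The labels MHCa, b2m, MHCb, TRA/TRB and TRG/TRD are taken from the text of this reference, but whether they match the actual chain_type values assigned by tcren is an assumption, and the "peptide" label and the function `has_required_chains` are hypothetical.

```python
def has_required_chains(chain_types):
    """True if a complex has MHCa, b2m or MHCb, a peptide, and a TCR pair.

    chain_types: iterable of chain-type labels (assumed spellings).
    """
    types = set(chain_types)
    has_mhc = "MHCa" in types and bool(types & {"b2m", "MHCb"})
    has_peptide = "peptide" in types
    # Either an alpha/beta or a gamma/delta receptor pair is acceptable.
    has_tcr = {"TRA", "TRB"} <= types or {"TRG", "TRD"} <= types
    return has_mhc and has_peptide and has_tcr
```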

tcren.cli.score(structures=<typer.models.OptionInfo object>, candidates=<typer.models.OptionInfo object>, potential=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, interface=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>, cutoff=<typer.models.OptionInfo object>)[source]#

Score candidate epitopes against input structures (end-to-end pipeline).

Parameters:
  • structures (Path)

  • candidates (Path)

  • potential (str | None)

  • out (Path)

  • interface (str)

  • organism (str)

  • cutoff (float)

Return type:

None

tcren.cli.pipeline(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, no_superimpose=<typer.models.OptionInfo object>, db=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>, cutoff=<typer.models.OptionInfo object>)[source]#

Run the full pipeline and write per-interface energies for each structure.

Stages: structure → annotate (alleles + chains) → superimpose → resmarkup / canonical Cα / contacts → score (TCRen for TCR↔peptide, MJ (Miyazawa-Jernigan) for TCR↔MHC and peptide↔MHC), plus the total.

Parameters:
  • structures (Path)

  • out (Path)

  • no_superimpose (bool)

  • db (Path)

  • organism (str)

  • cutoff (float)

Return type:

None
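
The final scoring stage can be pictured as a table lookup per contact, summed per interface and then overall. The sketch below assumes each interface has its own pairwise potential stored as a dict. The interface names, the toy energy values, and the function `score_interfaces` are hypothetical and are not taken from tcren's output format.

```python
def score_interfaces(contacts, potentials):
    """Sum pairwise energies per interface and add a 'total' entry.

    contacts: list of (interface, aa_from, aa_to) tuples.
    potentials: dict mapping interface -> {(aa_from, aa_to): energy}.
    Pairs missing from a potential contribute 0.0 in this sketch; an
    interface missing from `potentials` raises KeyError.
    """
    energies = {name: 0.0 for name in potentials}
    for interface, aa_from, aa_to in contacts:
        energies[interface] += potentials[interface].get((aa_from, aa_to), 0.0)
    energies["total"] = sum(energies.values())
    return energies


# Toy values only, chosen for illustration.
toy_potentials = {
    "TCR-peptide": {("Y", "L"): -0.5, ("D", "K"): -1.0},
    "TCR-MHC": {("Y", "E"): 0.25},
    "peptide-MHC": {("L", "W"): -0.75},
}
toy_contacts = [
    ("TCR-peptide", "Y", "L"),
    ("TCR-peptide", "D", "K"),
    ("TCR-MHC", "Y", "E"),
    ("peptide-MHC", "L", "W"),
]
energies = score_interfaces(toy_contacts, toy_potentials)
```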

tcren.cli.refine(structures=<typer.models.OptionInfo object>, out=<typer.models.OptionInfo object>, substitute=<typer.models.OptionInfo object>, organism=<typer.models.OptionInfo object>, n_steps=<typer.models.OptionInfo object>, restraint_w=<typer.models.OptionInfo object>, seed=<typer.models.OptionInfo object>, mmcif=<typer.models.OptionInfo object>, compress=<typer.models.OptionInfo object>)[source]#

Potential-guided rigid-body refinement of the peptide pose (knowledge-based, not physics).

Optionally --substitute a new equal-length peptide first, then run a Monte-Carlo refinement scored by the DOPE atom-level statistical potential (restrained to the input pose; independent of the TCRen/MJ scoring potentials). Writes one structure per input and prints the final DOPE energy. (For physics-grade relaxation use Rosetta FlexPepDock externally.)

Parameters:
  • structures (str)

  • out (Path)

  • substitute (str)

  • organism (str)

  • n_steps (int)

  • restraint_w (float)

  • seed (int)

  • mmcif (bool)

  • compress (bool)

Return type:

None
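
The restrained Monte-Carlo idea can be sketched with a toy example. The version below perturbs only a translation vector (a full rigid-body move would also include rotations), scores it with a caller-supplied energy function standing in for DOPE, and adds a harmonic restraint to the starting pose. The names n_steps, restraint_w and seed mirror the CLI parameters; `step`, `temperature`, the Metropolis acceptance rule, and the function `refine_translation` are assumptions for illustration rather than a description of tcren's implementation.

```python
import math
import random


def refine_translation(energy_fn, start, n_steps=200, restraint_w=1.0,
                       step=0.5, temperature=1.0, seed=0):
    """Return (best_pose, best_energy) from a restrained Metropolis walk."""
    rng = random.Random(seed)  # seeded for reproducible trajectories
    start = tuple(start)

    def total(pose):
        # Harmonic restraint penalises drift away from the input pose.
        drift2 = sum((p - s) ** 2 for p, s in zip(pose, start))
        return energy_fn(pose) + restraint_w * drift2

    current, e_current = start, total(start)
    best, e_best = current, e_current
    for _ in range(n_steps):
        trial = tuple(c + rng.uniform(-step, step) for c in current)
        e_trial = total(trial)
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if e_trial <= e_current or rng.random() < math.exp((e_current - e_trial) / temperature):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best
```

Tracking the best pose separately from the current one means the returned energy can never exceed the starting energy, even when uphill moves are accepted along the way.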