mir.basic package

Submodules

mir.basic.gene_usage module

Gene usage tracking for immune repertoires.

GeneUsage accumulates V-J gene combination counts from LocusRepertoire or SampleRepertoire objects and exposes joint and marginal usage statistics together with Laplace-smoothed fractions.

Allele Handling

By default, gene allele suffixes are stripped during initialization (e.g., TRBV1*01 → TRBV1) so that different allele naming conventions are treated as the same gene. This behavior can be disabled by setting strip_alleles=False when constructing a GeneUsage object.

When resampling using mir.common.sampling.resample_to_gene_usage(), clonotypes retain their original alleles while only stripped gene bases are used for frequency comparison.
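
The stripping rule described above can be illustrated with a minimal sketch. The helper name strip_allele is hypothetical and not part of the package; it assumes the allele suffix is everything from the first "*" onward, as in the TRBV1*01 example.

```python
def strip_allele(gene: str) -> str:
    """Drop an IMGT-style allele suffix, e.g. 'TRBV1*01' -> 'TRBV1'."""
    # Assumption: everything after the first '*' is the allele designation.
    return gene.split("*", 1)[0]


genes = ["TRBV1*01", "TRBV1", "TRBV5-1*02"]
stripped = [strip_allele(g) for g in genes]
print(stripped)  # ['TRBV1', 'TRBV1', 'TRBV5-1']
```

Names without a suffix pass through unchanged, which is why TRBV1*01 and TRBV1 end up under the same key.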

mir.basic.gene_usage.precompute_olga_gene_usage_probabilities(*, species, locus, synthetic_n=10000000, n_jobs=None, seed=42, overwrite=False, progress=True, control_dir=None, control_manager=None, control_kwargs=None, cache_in_memory=True)

Precompute and persist OLGA V/J/VJ usage probabilities for one model.

This helper ensures a synthetic OLGA control exists on disk for the requested (species, locus, synthetic_n) and returns marginal and joint usage probabilities derived from that control. Generation can be parallelized via n_jobs.

Parameters:
  • species (str) – Species alias accepted by ControlManager.

  • locus (str) – IMGT locus code (for example, "TRB").

  • synthetic_n (int) – Number of synthetic clonotypes used to estimate usage.

  • n_jobs (int | None) – Number of worker processes for synthetic generation. When None, uses all available CPUs.

  • seed (int) – Random seed used when creating synthetic controls.

  • overwrite (bool) – Regenerate control even if a cached artifact exists.

  • progress (bool) – Whether to print progress during control generation.

  • control_dir (str | Path | None) – Optional control cache root. Defaults to manager default.

  • control_manager – Optional preconfigured ControlManager.

  • control_kwargs (dict | None) – Extra kwargs forwarded to ControlManager.ensure_and_load_control_df.

  • cache_in_memory (bool) – Whether to populate in-process OLGA usage cache.

Returns:

Dict with keys "v", "j", "vj" containing probability maps.

Return type:

dict[str, dict[object, float]]
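
As a rough illustration of the returned structure, the sketch below estimates "v", "j" and "vj" probability maps from a small list of (v_gene, j_gene) pairs. The function usage_probabilities_sketch is hypothetical, and the use of plain relative frequencies (no smoothing) is an assumption; the documentation above does not state how the package derives probabilities from the synthetic control.

```python
from collections import Counter


def usage_probabilities_sketch(pairs):
    """Estimate V, J and VJ usage probabilities from (v_gene, j_gene) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        return {"v": {}, "j": {}, "vj": {}}
    vj = Counter(pairs)
    v = Counter(v_gene for v_gene, _ in pairs)
    j = Counter(j_gene for _, j_gene in pairs)
    # Assumption: unsmoothed relative frequencies over the sampled pairs.
    return {
        "v": {gene: c / n for gene, c in v.items()},
        "j": {gene: c / n for gene, c in j.items()},
        "vj": {pair: c / n for pair, c in vj.items()},
    }


sample = [
    ("TRBV5-1", "TRBJ1-1"),
    ("TRBV5-1", "TRBJ2-7"),
    ("TRBV12-3", "TRBJ1-1"),
    ("TRBV5-1", "TRBJ1-1"),
]
probs = usage_probabilities_sketch(sample)
```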

mir.basic.gene_usage.get_gene_usage_from_olga_model(model)

Return V/J/VJ usage probabilities read directly from OLGA model marginals.

This is an analytical alternative to precompute_olga_gene_usage_probabilities(): instead of generating millions of synthetic sequences, it reads the IGoR probability parameters from the loaded model. The result is instantaneous, deterministic, and matches the asymptotic limit of the sampling approach.

Probabilities are aggregated at the major gene level (allele suffixes stripped), e.g. "TRBV5-1*01" and "TRBV5-1*02" are summed under "TRBV5-1".
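
The aggregation step can be sketched as follows. The function aggregate_by_gene is a hypothetical stand-in, and the input probabilities are made-up illustration values, not OLGA model parameters.

```python
from collections import defaultdict


def aggregate_by_gene(allele_probs):
    """Sum allele-level probabilities under the allele-stripped gene name."""
    totals = defaultdict(float)
    for allele, p in allele_probs.items():
        gene = allele.split("*", 1)[0]  # 'TRBV5-1*02' -> 'TRBV5-1'
        totals[gene] += p
    return dict(totals)


# Illustrative values only.
allele_probs = {"TRBV5-1*01": 0.25, "TRBV5-1*02": 0.125, "TRBV12-3*01": 0.625}
gene_probs = aggregate_by_gene(allele_probs)
```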

Parameters:

model – A loaded OlgaModel instance.

Returns:

Dict with keys "v", "j", "vj" mapping to probability dicts. "vj" keys are (v_gene, j_gene) tuples.

Return type:

dict[str, dict[object, float]]

Example

>>> from mir.basic.pgen import OlgaModel
>>> from mir.basic.gene_usage import get_gene_usage_from_olga_model
>>> m = OlgaModel(locus="TRB", species="human")
>>> gu = get_gene_usage_from_olga_model(m)
>>> round(sum(gu["v"].values()), 6)
1.0

class mir.basic.gene_usage.GeneUsage(*, strip_alleles=True)

Bases: object

Joint and marginal V-J gene usage statistics.

Stores per-locus clonotype counts and duplicate-count totals for every observed (V-gene, J-gene) pair.

Parameters:

strip_alleles (bool, optional) – When True (default), remove allele suffixes during initialization so that TRBV1*01 and TRBV1 are treated as the same gene. When False, alleles are preserved as-is.

strip_alleles

Whether allele suffixes were stripped during initialization.

Type:

bool

Examples

Build from a repertoire, automatically stripping alleles:

gu = GeneUsage.from_repertoire(trb_repertoire)
gu.vj_fraction("TRB")
{('TRBV12-3', 'TRBJ1-2'): 0.42, ...}

Build with alleles preserved:

gu = GeneUsage.from_repertoire(trb_repertoire, strip_alleles=False)
gu.vj_fraction("TRB")
{('TRBV12-3*01', 'TRBJ1-2*01'): 0.42, ...}

classmethod from_repertoire(repertoire, *, locus='', strip_alleles=True)

Build from a LocusRepertoire.

Parameters:
  • repertoire (LocusRepertoire) – Source locus repertoire.

  • locus (str) – Override locus. When empty the repertoire’s own locus is used.

  • strip_alleles (bool) – Whether to strip allele suffixes (default True).

Return type:

GeneUsage

classmethod from_sample(sample, *, strip_alleles=True)

Build from a SampleRepertoire.

Iterates over all loci in the sample.

Parameters:
  • sample (SampleRepertoire) – Source sample repertoire.

  • strip_alleles (bool) – Whether to strip allele suffixes (default True).

Return type:

GeneUsage

classmethod from_list(repertoires, *, strip_alleles=True)

Build by accumulating data from a list of repertoire objects.

Each element may be a LocusRepertoire or a SampleRepertoire.

Parameters:
  • repertoires – List of LocusRepertoire or SampleRepertoire objects.

  • strip_alleles (bool) – Whether to strip allele suffixes (default True).

Return type:

GeneUsage

classmethod from_dataframe(table, *, locus, v_col='v_gene', j_col='j_gene', duplicate_count_col='duplicate_count', strip_alleles=True)

Build from a DataFrame with V/J columns.

Parameters:
  • table (pandas.DataFrame | polars.DataFrame) – Input table (pandas or polars) containing V/J gene fields and optionally duplicate counts.

  • locus (str) – IMGT locus code to assign to all rows.

  • v_col (str) – Column name for the V gene.

  • j_col (str) – Column name for the J gene.

  • duplicate_count_col (str) – Column name for duplicate count. If absent, duplicates default to 1.

  • strip_alleles (bool) – Whether to strip allele suffixes (default True).

Return type:

GeneUsage

property loci: list[str]

Loci with observed data.

total(locus, *, count='clonotypes')

Total count for locus.

Parameters:
  • locus (str) – IMGT locus code.

  • count (str) – "clonotypes" (unique rearrangements) or "duplicates".

Return type:

int

vj_usage(locus, *, count='clonotypes')

Joint V-J usage for locus.

Parameters:
  • locus (str) – IMGT locus code.

  • count (str) – "clonotypes" or "duplicates".

Returns:

Dict mapping (v_base, j_base) to the requested count.

Return type:

dict[tuple[str, str], int]

v_usage(locus, *, count='clonotypes')

Marginal V-gene usage (sum over all J) for locus.

Parameters:
  • locus (str)

  • count (str)

Return type:

dict[str, int]

j_usage(locus, *, count='clonotypes')

Marginal J-gene usage (sum over all V) for locus.

Parameters:
  • locus (str)

  • count (str)

Return type:

dict[str, int]
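
The relationship between the joint and marginal counts can be sketched in a few lines. The function marginals_sketch is hypothetical; it only shows the "sum over the other gene" step on a dict shaped like the vj_usage() return value, with made-up counts.

```python
from collections import Counter


def marginals_sketch(vj_counts):
    """Derive V and J marginal counts from joint (v, j) -> count data."""
    v_counts, j_counts = Counter(), Counter()
    for (v_gene, j_gene), n in vj_counts.items():
        v_counts[v_gene] += n  # sum over all J for this V
        j_counts[j_gene] += n  # sum over all V for this J
    return dict(v_counts), dict(j_counts)


# Illustrative counts only.
vj_counts = {
    ("TRBV5-1", "TRBJ1-1"): 3,
    ("TRBV5-1", "TRBJ2-7"): 1,
    ("TRBV12-3", "TRBJ1-1"): 2,
}
v_counts, j_counts = marginals_sketch(vj_counts)
```

Both marginals sum to the same total as the joint table, here 6.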

vj_fraction(locus, *, count='clonotypes', pseudocount=1.0)

Laplace-smoothed V-J fraction for locus.

Fractions sum to 1 over observed pairs using:

(n_observed + pseudocount) / (total + n_observed_pairs * pseudocount)

Parameters:
  • locus (str) – IMGT locus code.

  • count (str) – "clonotypes" or "duplicates".

  • pseudocount (float) – Added to each count and the denominator term.

Return type:

dict[tuple[str, str], float]
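
The smoothing formula above can be checked by hand with a short sketch. The function laplace_fractions is a hypothetical reimplementation of the stated formula, not the package's code, and the counts are made up.

```python
def laplace_fractions(counts, pseudocount=1.0):
    """Apply (n + pc) / (total + n_keys * pc) to every observed key."""
    if pseudocount < 0:
        raise ValueError("pseudocount must be >= 0")
    total = sum(counts.values())
    denom = total + len(counts) * pseudocount
    if denom == 0:
        return {}
    return {key: (n + pseudocount) / denom for key, n in counts.items()}


# Illustrative counts only: two observed pairs, 4 clonotypes in total.
vj_counts = {("TRBV5-1", "TRBJ1-1"): 3, ("TRBV12-3", "TRBJ1-2"): 1}
fractions = laplace_fractions(vj_counts)
# (3 + 1) / (4 + 2 * 1) = 4/6 and (1 + 1) / 6 = 2/6
```

Because the pseudocount is added once per observed key in both numerator and denominator, the fractions still sum to 1 over the observed pairs.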

v_fraction(locus, *, count='clonotypes', pseudocount=1.0)

Laplace-smoothed marginal V-gene fraction for locus.

Parameters:
  • locus (str)

  • count (str)

  • pseudocount (float)

Return type:

dict[str, float]

j_fraction(locus, *, count='clonotypes', pseudocount=1.0)

Laplace-smoothed marginal J-gene fraction for locus.

Parameters:
  • locus (str)

  • count (str)

  • pseudocount (float)

Return type:

dict[str, float]

usage_comparison(reference, locus, *, scope='vj', count='count_rearrangement', pseudocount=1.0)

Compare smoothed usage frequencies against another GeneUsage.

Frequencies are computed independently for self and reference using Laplace smoothing with the same pseudocount:

(n_key + pseudocount) / (total + n_observed_keys * pseudocount).

Parameters:
  • reference (GeneUsage) – Baseline gene usage to compare against (e.g. OLGA).

  • locus (str) – IMGT locus code.

  • scope (str) – "v", "j", or "vj".

  • count (str) – Count mode alias (default count_rearrangement).

  • pseudocount (float) – Additive smoothing constant (must be >= 0).

Returns:

Mapping from each key (gene or VJ tuple) to {"p_self": ..., "p_reference": ..., "factor": ...}.

correction_factors(reference, locus, *, scope='vj', count='count_rearrangement', pseudocount=1.0)

Return correction factors P_self / P_reference by key.

Parameters:
  • reference (GeneUsage)

  • locus (str)

  • scope (str)

  • count (str)

  • pseudocount (float)

Return type:

dict[object, float]
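
A minimal sketch of the P_self / P_reference ratio, using plain dicts of counts in place of GeneUsage objects. Both function names are hypothetical. Restricting the result to keys observed on both sides is an assumption made here; how the package treats keys missing from one side is not stated above.

```python
def smoothed(counts, pseudocount=1.0):
    """Laplace-smoothed fractions: (n + pc) / (total + n_keys * pc)."""
    denom = sum(counts.values()) + len(counts) * pseudocount
    return {k: (n + pseudocount) / denom for k, n in counts.items()} if denom else {}


def correction_factors_sketch(self_counts, reference_counts, pseudocount=1.0):
    """Ratio of smoothed frequencies for keys present in both inputs."""
    p_self = smoothed(self_counts, pseudocount)
    p_ref = smoothed(reference_counts, pseudocount)
    # Assumption: only shared keys with non-zero reference frequency are compared.
    shared = p_self.keys() & p_ref.keys()
    return {k: p_self[k] / p_ref[k] for k in shared if p_ref[k] > 0}


# Illustrative counts only.
observed = {"TRBV5-1": 7, "TRBV12-3": 1}
reference = {"TRBV5-1": 3, "TRBV12-3": 5}
factors = correction_factors_sketch(observed, reference)
# TRBV5-1: (8/10) / (4/10) = 2;  TRBV12-3: (2/10) / (6/10) = 1/3
```

A factor above 1 marks a gene that is over-represented relative to the reference, and a factor below 1 marks one that is under-represented.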

mir.basic.gene_usage.zscore_to_sigmoid(z)

Map a (batch-corrected) z-score to a bounded sigmoid value in (0, 1).

sigmoid(z) = 1 / (1 + exp(-z))

This is the canonical transform to turn per-gene z-scores from compute_batch_corrected_gene_usage() into bounded, comparable corrected probabilities that can be directly used in PCA/UMAP embeddings.

Parameters:

z (ndarray | float) – Scalar or array of z-scores.

Return type:

np.ndarray or float with the same shape as z, values in (0, 1).
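
For reference, a scalar-only sketch of the same transform. The function sigmoid_sketch is hypothetical and handles only Python floats, whereas the documented function also accepts arrays. It uses a two-branch form so that large negative inputs do not overflow math.exp.

```python
import math


def sigmoid_sketch(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)) for a scalar z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(z) is small, so this branch cannot overflow.
    e = math.exp(z)
    return e / (1.0 + e)


mid = sigmoid_sketch(0.0)       # 0.5: a z-score of 0 maps to the midpoint
far_low = sigmoid_sketch(-1000.0)  # evaluates without OverflowError
```

The mapping is symmetric around 0.5, so sigmoid(z) + sigmoid(-z) equals 1 up to rounding.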

mir.basic.gene_usage.compute_batch_corrected_gene_usage(dataset, *, batch_field='batch_id', scope='vj', weighted=True, pseudocount=1.0, z_cap=6.0)

Compute batch-corrected gene usage for all samples/loci/genes.

Uses a pseudocount on raw counts prior to normalization:

p = (count + pseudocount) / (total + pseudocount * n_genes)

Then computes log_p, batch-wise winsorized (95%) mu and sigma over (locus, gene, batch_id), capped z-scores, and final corrected probabilities:

correction_factor = exp(z)
pfinal_raw = p * correction_factor

Finally, for each (sample_id, locus) group we renormalize pfinal so probabilities sum to 1. If a group’s raw corrected mass is invalid or non-positive, we fall back to normalized raw p for that group.

Empty sample loci and loci absent in a sample are skipped without error.

Parameters:
  • dataset (RepertoireDataset)

  • batch_field (str)

  • scope (GeneScope)

  • weighted (bool)

  • pseudocount (float)

  • z_cap (float)

Return type:

pd.DataFrame
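
The last two steps (capping z, applying exp(z), and renormalizing within one (sample_id, locus) group) can be sketched as below. The function corrected_probabilities_sketch is hypothetical: it takes the smoothed p and the z-scores as given inputs and does not reproduce the batch-wise winsorized mu/sigma estimation. Symmetric capping to [-z_cap, z_cap] is an assumption.

```python
import math


def corrected_probabilities_sketch(p, z, z_cap=6.0):
    """Renormalized p * exp(capped z) for one (sample, locus) group."""
    raw = {}
    for gene, prob in p.items():
        # Assumption: z is clipped symmetrically to [-z_cap, z_cap].
        z_capped = max(-z_cap, min(z_cap, z[gene]))
        raw[gene] = prob * math.exp(z_capped)
    mass = sum(raw.values())
    if not math.isfinite(mass) or mass <= 0:
        # Fallback described above: use normalized raw p for this group.
        raw, mass = dict(p), sum(p.values())
    if not math.isfinite(mass) or mass <= 0:
        raise ValueError("no usable probability mass in this group")
    return {gene: value / mass for gene, value in raw.items()}


# Illustrative inputs only.
p = {"TRBV5-1": 0.5, "TRBV12-3": 0.5}
pfinal = corrected_probabilities_sketch(p, {"TRBV5-1": 0.0, "TRBV12-3": math.log(3.0)})
capped = corrected_probabilities_sketch(p, {"TRBV5-1": 0.0, "TRBV12-3": 100.0})
```

In the first call the raw values are 0.5 and 1.5, which renormalize to 0.25 and 0.75. In the second call the z-score of 100 is treated as 6.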

mir.basic.gene_usage.marginalize_batch_corrected_gene_usage(df, *, scope)

Marginalize batch-corrected VJ usage to V or J usage.

This helper converts output from compute_batch_corrected_gene_usage() computed with scope='vj' into V- or J-marginal usage by summing over the opposite dimension.

Parameters:
  • df (DataFrame) – DataFrame from compute_batch_corrected_gene_usage(..., scope='vj'). Required columns: sample_id, batch_id, locus, gene, p, pfinal, pavg.

  • scope (Literal['v', 'j']) – Target marginal scope: "v" (sum over J) or "j" (sum over V).

Returns:

Columns: sample_id, batch_id, locus, gene, p, pfinal, pavg.

Return type:

pd.DataFrame

mir.basic.pgen module

mir.basic.token_tables module

mir.basic.token_tables_pl module

Polars-based rearrangement k-mer indexing and summarisation.

Mirrors the object-based API in token_tables using Polars DataFrames. The rearrangement table has columns:

id (Int64), locus (Utf8), v_gene (Utf8), c_gene (Utf8), junction_aa (Utf8), duplicate_count (Int64).

Functions

  • expand_kmers — Expand each rearrangement row into one row per k-mer, adding kmer_pos and kmer_seq columns.

  • summarize_by_gene — Group by (locus, v_gene, c_gene, kmer_seq) → rearrangement_count, duplicate_count.

  • summarize_by_pos — Group by (locus, kmer_seq, kmer_pos).

  • summarize_by_v — Group by (locus, kmer_seq, v_gene).

  • summarize_by_c — Group by (locus, kmer_seq, c_gene).

  • fetch_by_kmer — Rows from the original table matching (locus, kmer_seq).

  • fetch_by_annotated_kmer — Rows matching (locus, v_gene, c_gene, kmer_seq).

mir.basic.token_tables_pl.expand_kmers(df, k)

Expand rearrangement table: one row per overlapping k-mer.

For each rearrangement with junction_aa of length n ≥ k, produces n − k + 1 rows with new columns kmer_pos (Int64) and kmer_seq (Utf8). Clonotypes shorter than k are dropped.

Parameters:
  • df (DataFrame) – Clonotype table with at least id, locus, v_gene, c_gene, junction_aa, duplicate_count.

  • k (int) – K-mer length.

Returns:

Expanded polars.DataFrame.

Return type:

DataFrame
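
The expansion rule (n − k + 1 overlapping k-mers per junction, shorter junctions dropped) can be sketched in pure Python with a list of dicts standing in for the Polars table. The function expand_kmers_sketch is hypothetical, and 0-based kmer_pos is an assumption.

```python
def expand_kmers_sketch(rows, k):
    """Emit one output row per overlapping k-mer of junction_aa."""
    if k < 1:
        raise ValueError("k must be >= 1")
    expanded = []
    for row in rows:
        seq = row["junction_aa"]
        # range() is empty when len(seq) < k, so short junctions are dropped.
        for pos in range(len(seq) - k + 1):
            # Assumption: kmer_pos is the 0-based start offset.
            expanded.append({**row, "kmer_pos": pos, "kmer_seq": seq[pos:pos + k]})
    return expanded


# Illustrative rows only.
rows = [
    {"id": 0, "locus": "TRB", "v_gene": "TRBV5-1", "c_gene": "TRBC1",
     "junction_aa": "CASSF", "duplicate_count": 2},
    {"id": 1, "locus": "TRB", "v_gene": "TRBV12-3", "c_gene": "TRBC2",
     "junction_aa": "CA", "duplicate_count": 1},
]
expanded = expand_kmers_sketch(rows, 3)
kmers = [(r["id"], r["kmer_pos"], r["kmer_seq"]) for r in expanded]
# [(0, 0, 'CAS'), (0, 1, 'ASS'), (0, 2, 'SSF')]
```

The first junction has length 5, so k = 3 yields 5 − 3 + 1 = 3 rows. The second junction is shorter than k and contributes nothing.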

mir.basic.token_tables_pl.summarize_by_gene(expanded)

Group by (locus, v_gene, c_gene, kmer_seq).

Returns columns: locus, v_gene, c_gene, kmer_seq, rearrangement_count, duplicate_count.

Parameters:

expanded (DataFrame)

Return type:

DataFrame
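
A pure-Python sketch of the grouping, with a list of dicts standing in for the expanded table. The function summarize_sketch is hypothetical. It assumes rearrangement_count is the number of expanded rows in a group and duplicate_count is the sum of their duplicate counts; whether the package counts a k-mer once or several times when it repeats within one junction is not stated above.

```python
from collections import defaultdict


def summarize_sketch(expanded, keys):
    """Group expanded k-mer rows by `keys` and accumulate both counts."""
    acc = defaultdict(lambda: [0, 0])
    for row in expanded:
        group = tuple(row[col] for col in keys)
        acc[group][0] += 1                       # assumed: one per expanded row
        acc[group][1] += row["duplicate_count"]  # summed duplicate counts
    return {
        group: {"rearrangement_count": r, "duplicate_count": d}
        for group, (r, d) in acc.items()
    }


# Illustrative expanded rows only.
expanded = [
    {"locus": "TRB", "v_gene": "TRBV5-1", "c_gene": "TRBC1", "kmer_seq": "CAS", "duplicate_count": 2},
    {"locus": "TRB", "v_gene": "TRBV5-1", "c_gene": "TRBC1", "kmer_seq": "CAS", "duplicate_count": 5},
    {"locus": "TRB", "v_gene": "TRBV12-3", "c_gene": "TRBC2", "kmer_seq": "CAS", "duplicate_count": 1},
]
summary = summarize_sketch(expanded, ("locus", "v_gene", "c_gene", "kmer_seq"))
```

Passing a different key tuple, such as ("locus", "kmer_seq", "v_gene"), gives the grouping used by the other summarize_by_* variants.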

mir.basic.token_tables_pl.summarize_by_gene_chunked(df, k, *, chunk_size=100000)

Chunked summary by (locus, v_gene, c_gene, kmer_seq).

Parameters:
  • df (DataFrame)

  • k (int)

  • chunk_size (int)

Return type:

DataFrame
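
One plausible reading of the chunked variants is that they expand and summarize chunk_size rows at a time and merge the partial results, which bounds the size of the intermediate expanded table. The sketch below illustrates that idea only; both functions are hypothetical, the grouping key is simplified to (locus, kmer_seq), and the package's actual chunking strategy is not described above.

```python
from collections import Counter


def kmer_counts(rows, k):
    """Row counts and duplicate-count sums keyed by (locus, kmer_seq)."""
    rearr, dup = Counter(), Counter()
    for row in rows:
        seq = row["junction_aa"]
        for pos in range(len(seq) - k + 1):
            key = (row["locus"], seq[pos:pos + k])
            rearr[key] += 1
            dup[key] += row["duplicate_count"]
    return rearr, dup


def kmer_counts_chunked(rows, k, chunk_size=100_000):
    """Same result as kmer_counts, computed one slice of rows at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    rearr, dup = Counter(), Counter()
    for start in range(0, len(rows), chunk_size):
        r, d = kmer_counts(rows[start:start + chunk_size], k)
        rearr.update(r)  # Counter.update adds counts rather than replacing
        dup.update(d)
    return rearr, dup


# Illustrative rows only.
rows = [
    {"locus": "TRB", "junction_aa": "CASSF", "duplicate_count": 2},
    {"locus": "TRB", "junction_aa": "CASSL", "duplicate_count": 3},
    {"locus": "TRA", "junction_aa": "CAVR", "duplicate_count": 1},
]
full = kmer_counts(rows, 3)
chunked = kmer_counts_chunked(rows, 3, chunk_size=2)
```

Because counts are additive across disjoint row sets, the merged result matches the single-pass result.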

mir.basic.token_tables_pl.summarize_by_pos(expanded)

Group by (locus, kmer_seq, kmer_pos).

Returns columns: locus, kmer_seq, kmer_pos, rearrangement_count, duplicate_count.

Parameters:

expanded (DataFrame)

Return type:

DataFrame

mir.basic.token_tables_pl.summarize_by_pos_chunked(df, k, *, chunk_size=100000)

Chunked summary by (locus, kmer_seq, kmer_pos).

Parameters:
  • df (DataFrame)

  • k (int)

  • chunk_size (int)

Return type:

DataFrame

mir.basic.token_tables_pl.summarize_by_v(expanded)

Group by (locus, kmer_seq, v_gene).

Returns columns: locus, kmer_seq, v_gene, rearrangement_count, duplicate_count.

Parameters:

expanded (DataFrame)

Return type:

DataFrame

mir.basic.token_tables_pl.summarize_by_v_chunked(df, k, *, chunk_size=100000)

Chunked summary by (locus, kmer_seq, v_gene).

Parameters:
  • df (DataFrame)

  • k (int)

  • chunk_size (int)

Return type:

DataFrame

mir.basic.token_tables_pl.summarize_by_c(expanded)

Group by (locus, kmer_seq, c_gene).

Returns columns: locus, kmer_seq, c_gene, rearrangement_count, duplicate_count.

Parameters:

expanded (DataFrame)

Return type:

DataFrame

mir.basic.token_tables_pl.summarize_by_c_chunked(df, k, *, chunk_size=100000)

Chunked summary by (locus, kmer_seq, c_gene).

Parameters:
  • df (DataFrame)

  • k (int)

  • chunk_size (int)

Return type:

DataFrame

mir.basic.token_tables_pl.fetch_by_kmer(df, expanded, locus, kmer_seq)

Return rows from the original rearrangement table whose junction_aa contains the given k-mer at the specified locus.

Parameters:
  • df (DataFrame) – Original rearrangement table.

  • expanded (DataFrame) – Expanded k-mer table (from expand_kmers()).

  • locus (str) – Locus string to match.

  • kmer_seq (str) – K-mer sequence string to match.

Returns:

Subset of df (original columns only, deduplicated by id).

Return type:

DataFrame
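
The lookup can be sketched as a two-step filter: collect the ids of expanded rows that match (locus, kmer_seq), then select those ids from the original table. The function fetch_by_kmer_sketch is hypothetical and uses lists of dicts instead of Polars DataFrames.

```python
def fetch_by_kmer_sketch(rows, expanded, locus, kmer_seq):
    """Return original rows whose id has a matching expanded k-mer row."""
    # A set removes duplicates when a k-mer occurs more than once per junction.
    ids = {
        e["id"] for e in expanded
        if e["locus"] == locus and e["kmer_seq"] == kmer_seq
    }
    return [row for row in rows if row["id"] in ids]


# Illustrative data only: 'SS' occurs twice in the first junction.
rows = [
    {"id": 0, "locus": "TRB", "junction_aa": "CASSSF"},
    {"id": 1, "locus": "TRB", "junction_aa": "CAWF"},
]
expanded = [
    {"id": 0, "locus": "TRB", "kmer_seq": "SS"},
    {"id": 0, "locus": "TRB", "kmer_seq": "SS"},
    {"id": 1, "locus": "TRB", "kmer_seq": "AW"},
]
hits = fetch_by_kmer_sketch(rows, expanded, "TRB", "SS")
```

The result contains row 0 once, with only its original columns.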

mir.basic.token_tables_pl.fetch_by_annotated_kmer(df, expanded, locus, v_gene, c_gene, kmer_seq)

Return rows from the original rearrangement table matching a fully annotated k-mer query (locus, v_gene, c_gene, kmer_seq).

Parameters:
  • df (DataFrame) – Original rearrangement table.

  • expanded (DataFrame) – Expanded k-mer table (from expand_kmers()).

  • locus (str) – Locus string to match.

  • v_gene (str) – V-gene name to match.

  • c_gene (str) – C-gene name to match.

  • kmer_seq (str) – K-mer sequence string to match.

Returns:

Subset of df (original columns only, deduplicated by id).

Return type:

DataFrame

mir.basic.tokens module

mir.basic.alphabets module

Alphabets, constants, and amino-acid → reduced-alphabet translation.

This module holds the lightweight, GC-friendly parts that are faster in pure Python (bytes.translate) than in C. Heavy-lifting functions (codon translation, tokenisation, distances) live in the mirseq C extension.

Types

  • Seq — Union type str | bytes | bytearray.

Helpers

  • _to_bytes — Normalise Seq to bytes.

Alphabets

  • NT_ALPHABET / AA_ALPHABET / REDUCED_AA_ALPHABET — 256-byte LUTs.

  • NT_MASK / AA_MASK / REDUCED_AA_MASK — Mask byte values.

Translation

  • aa_to_reduced — AA → reduced via bytes.translate (fastest path).

  • validate — Check every byte belongs to an alphabet.

  • mask — Replace position(s) with a mask character.

  • matches — Wildcard-aware positional comparison.

  • matches_aa_reduced — Cross-alphabet wildcard match (AA vs reduced).

mir.basic.alphabets.make_alphabet(chars)

Build a 256-byte lookup table where allowed positions are 1.

Parameters:

chars (str)

Return type:

bytes
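
A minimal sketch of such a lookup table, under the assumption that chars contains only ASCII characters. The function make_alphabet_sketch is hypothetical.

```python
def make_alphabet_sketch(chars: str) -> bytes:
    """Build a 256-byte table: 1 at allowed byte values, 0 elsewhere."""
    allowed = set(chars.encode("ascii"))  # assumption: ASCII-only input
    return bytes(1 if value in allowed else 0 for value in range(256))


lut = make_alphabet_sketch("ACGT")
is_a_allowed = lut[ord("A")] == 1  # True
is_n_allowed = lut[ord("N")] == 1  # False
```

Indexing the table with a byte value gives a constant-time membership check, which is what makes this layout convenient for per-byte validation.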

mir.basic.alphabets.aa_to_reduced(seq)

Convert an amino-acid sequence to the reduced physico-chemical alphabet.

Uses bytes.translate with a pre-built table — faster than C for this particular operation.

Parameters:

seq (str | bytes | bytearray)

Return type:

bytes
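
The bytes.translate approach can be illustrated with a toy table. The grouping below (A, I, L, M, V collapsed to "h") is purely hypothetical and chosen for illustration; the package's actual reduced alphabet is not given in this documentation.

```python
# Hypothetical reduced alphabet: five hydrophobic residues collapse to b"h".
# All other bytes map to themselves.
_TOY_TABLE = bytes.maketrans(b"AILMV", b"hhhhh")


def aa_to_reduced_sketch(seq) -> bytes:
    """Translate an amino-acid sequence through a pre-built 256-byte table."""
    data = seq.encode("ascii") if isinstance(seq, str) else bytes(seq)
    return data.translate(_TOY_TABLE)


reduced = aa_to_reduced_sketch("CASSL")
print(reduced)  # b'ChSSh'
```

The table is built once at import time, so each call is a single pass over the input performed inside the interpreter's bytes implementation.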

mir.basic.alphabets.validate(seq, alphabet)

Validate every byte of seq belongs to alphabet (256-byte LUT).

Parameters:
  • seq (str | bytes | bytearray)

  • alphabet (bytes)

Return type:

bytes

mir.basic.alphabets.mask(seq, position, mask_byte)

Return a copy of seq with the given position(s) replaced by mask_byte.

Parameters:
  • seq (str | bytes | bytearray)

  • position (int | slice | tuple[int, int])

  • mask_byte (int)

Return type:

bytes
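
A sketch of the three accepted position forms. The function mask_sketch is hypothetical, and interpreting a tuple as a half-open (start, stop) range is an assumption not confirmed by the signature above.

```python
def mask_sketch(seq, position, mask_byte: int) -> bytes:
    """Return a copy of seq with the selected position(s) set to mask_byte."""
    buf = bytearray(seq.encode("ascii") if isinstance(seq, str) else seq)
    if isinstance(position, int):
        buf[position] = mask_byte
    else:
        if isinstance(position, tuple):
            # Assumption: (start, stop) is a half-open range like a slice.
            position = slice(*position)
        width = len(buf[position])
        buf[position] = bytes([mask_byte]) * width
    return bytes(buf)


X = ord("X")
single = mask_sketch("CASSF", 2, X)            # b'CAXSF'
ranged = mask_sketch(b"CASSF", (1, 3), X)      # b'CXXSF'
tail = mask_sketch("CASSF", slice(3, None), X)  # b'CASXX'
```

The input is copied into a bytearray first, so the caller's sequence is never modified.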

mir.basic.alphabets.matches(a, b, mask_byte)

Wildcard-aware positional comparison.

Returns True when a and b have the same length and at every position the bytes are equal or at least one side carries mask_byte.

Parameters:
  • a (str | bytes | bytearray)

  • b (str | bytes | bytearray)

  • mask_byte (int)

Return type:

bool
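
The comparison rule stated above translates directly into a short sketch. Both function names are hypothetical, and the inputs are assumed to be ASCII.

```python
def _as_bytes(seq) -> bytes:
    """Normalise str / bytes / bytearray to bytes (ASCII assumed for str)."""
    return seq.encode("ascii") if isinstance(seq, str) else bytes(seq)


def matches_sketch(a, b, mask_byte: int) -> bool:
    """True when lengths agree and each position is equal or masked."""
    a, b = _as_bytes(a), _as_bytes(b)
    if len(a) != len(b):
        return False
    return all(
        x == y or x == mask_byte or y == mask_byte
        for x, y in zip(a, b)
    )


X = ord("X")
same_with_wildcard = matches_sketch("CASSF", "CAXSF", X)  # True
real_mismatch = matches_sketch("CASSF", "CAWSF", X)       # False
length_differs = matches_sketch("CASSF", "CASS", X)       # False
```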

mir.basic.alphabets.back_translate(aa_seq, unknown_codon='NNN')

Back-translate aa_seq to a nucleotide sequence.

Each residue is mapped to the most frequently used human codon (Kazusa Homo sapiens codon usage database). Non-standard residues (X, *, _, etc.) produce unknown_codon (default "NNN").

The returned sequence has length len(aa_seq) * 3.

Examples

>>> back_translate("CA")
'TGCGCC'
>>> back_translate("X")
'NNN'

Parameters:
  • aa_seq (str)

  • unknown_codon (str)

Return type:

str
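
The lookup-with-fallback behavior can be sketched with a deliberately partial table. The function back_translate_sketch is hypothetical, and its table contains only the two codons confirmed by the example above (C and A); every other residue therefore falls back to unknown_codon here, which is not how the full codon table behaves.

```python
# Partial table: only the codons shown in the documented example.
_PARTIAL_CODONS = {"C": "TGC", "A": "GCC"}


def back_translate_sketch(aa_seq: str, unknown_codon: str = "NNN") -> str:
    """Map each residue to a codon, using unknown_codon when none is known."""
    return "".join(_PARTIAL_CODONS.get(aa, unknown_codon) for aa in aa_seq)


nt = back_translate_sketch("CAX")
print(nt)  # TGCGCCNNN
```

With a three-letter unknown_codon, the output length is always len(aa_seq) * 3.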

mir.basic.alphabets.matches_aa_reduced(aa_seq, reduced_seq)

Wildcard-aware match between an amino-acid and a reduced-alphabet sequence.

Parameters:
  • aa_seq (str | bytes | bytearray)

  • reduced_seq (str | bytes | bytearray)

Return type:

bool

mir.basic.aliases module

Centralized species/locus alias utilities.

This module provides canonicalization helpers shared across control setup, parser logic, and OLGA model resolution.

mir.basic.aliases.normalize_species_alias(species)

Parameters:

species (str)

Return type:

str

mir.basic.aliases.normalize_locus_alias(locus)

Parameters:

locus (str)

Return type:

str

mir.basic.aliases.normalize_airr_locus_value(value)

Parameters:

value (str)

Return type:

str

mir.basic.aliases.airr_aliases_for_locus(locus)

Parameters:

locus (str)

Return type:

set[str]

mir.basic.aliases.locus_search_tokens(locus_imgt)

Parameters:

locus_imgt (str)

Return type:

set[str]

mir.basic.mirseq_compat module