API reference

Library entry point

Library-facing API.

A stable, import-friendly surface for embedding arda in other Python tools. The heavy lifting lands in Phase 2 (arda.annotate); this module keeps the public signature stable.

arda.adapter.annotate_sequences(sequences, seqtype='nt', organism='human', map_d=True)

Annotate FR/CDR regions for a batch of sequences.

Parameters:
  • sequences (Iterable[str] | Iterable[tuple[str, str]]) – Either raw sequence strings or (id, sequence) pairs.

  • seqtype (Literal['nt', 'aa']) – "nt" for nucleotide input, "aa" for amino acid.

  • organism (str) – One of the supported organisms (human, mouse, rat, rabbit, rhesus_monkey).

  • map_d (bool) – Map D segments (d_call/d2_call/np*) for VDJ-locus hits; False skips D mapping. Applies to nucleotide input only.

Returns:

A list of AIRR-style annotation record dicts (one per input sequence).
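As a rough illustration of the two accepted input shapes, the sketch below normalizes a mixed iterable into (id, sequence) pairs. The helper name `normalize_records` and the `seq{index}` id scheme are assumptions made for this example and are not part of arda.

```python
from typing import Iterable, Iterator, Union

SeqInput = Union[str, tuple[str, str]]


def normalize_records(sequences: Iterable[SeqInput]) -> Iterator[tuple[str, str]]:
    """Yield (id, sequence) pairs from raw strings or (id, sequence) tuples."""
    for index, item in enumerate(sequences):
        if isinstance(item, str):
            # Hypothetical id scheme for bare strings: position in the input.
            yield f"seq{index}", item
        else:
            seq_id, seq = item
            yield seq_id, seq


records = list(normalize_records(["ACGT", ("clone7", "TTGA")]))
```

Normalizing up front lets the rest of a pipeline assume a single record shape.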

Runtime annotation

Runtime annotation: map input sequences to the reference and transfer markup.

Pipeline: read input (FASTA/FASTQ) -> MMseqs2 search against the curated scaffold DB -> best hit per query -> project reference region markup onto the query (C++ hot path) -> AIRR TSV.

arda.annotate.mapper.annotate_file(input, output, organism='human', seqtype='nt', *, threads=0, sensitivity=None, strand='both', chunk_size=50000, map_d=True)

Annotate a FASTA/FASTQ file and stream an AIRR TSV.

The input is processed in bounded chunks. A background reader thread prefetches the next chunk while the current one is annotated (the GIL is released while Python waits on the mmseqs subprocess), so memory stays flat for arbitrarily large FASTQ files and read parsing overlaps with compute. The reference and target DB are loaded once and reused across all chunks.
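The prefetching pattern described above can be sketched with a bounded queue and a reader thread. This is a minimal illustration, not arda's implementation: `prefetch_chunks` and its `depth` parameter are hypothetical names, and error propagation from the reader thread is left out.

```python
import queue
import threading
from itertools import islice
from typing import Iterable, Iterator

_DONE = object()  # sentinel marking the end of the stream


def prefetch_chunks(records: Iterable, chunk_size: int, depth: int = 1) -> Iterator[list]:
    """Yield lists of up to chunk_size records, read ahead by a background thread."""
    buf: queue.Queue = queue.Queue(maxsize=depth)  # bounded, so memory stays capped

    def reader() -> None:
        it = iter(records)
        while True:
            chunk = list(islice(it, chunk_size))
            if not chunk:
                break
            buf.put(chunk)  # blocks while the queue is full
        buf.put(_DONE)

    threading.Thread(target=reader, daemon=True).start()
    while True:
        chunk = buf.get()
        if chunk is _DONE:
            return
        yield chunk


chunks = list(prefetch_chunks(((f"r{i}", "ACGT") for i in range(5)), chunk_size=2))
```

With `depth=1`, at most one chunk waits in the queue while another is being consumed, which bounds peak memory regardless of input size.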

Parameters:
  • input (str | Path)

  • output (str | Path)

  • organism (str)

  • seqtype (str)

  • threads (int)

  • sensitivity (float | None)

  • strand (str)

  • chunk_size (int)

  • map_d (bool)

Return type:

Path

arda.annotate.mapper.annotate_records(records, organism='human', seqtype='nt', *, threads=0, sensitivity=None, strand='both', map_d=True)

Annotate in-memory (id, sequence) records; return AIRR record dicts.

Parameters:
  • records (list[tuple[str, str]])

  • organism (str)

  • seqtype (str)

  • threads (int)

  • sensitivity (float | None)

  • strand (str) – "both" (default, nt only) searches both strands and re-orients reverse-complement hits; "forward" searches the plus strand only. Ignored for protein input.

  • map_d (bool) – True (default) maps D segments into the junction of VDJ-locus hits (d_call/d2_call/np*); False skips D mapping (nt input only; D mapping never runs for protein input).

Return type:

list[dict]

arda.annotate.mapper.build_index(organism='all', *, force=False)

(Re)build the precompiled mmseqs DBs shipped under database/.

Writes database/vdj/<org>/mmseqs/<seqtype>/db* plus a VERSION marker, so the runtime can use the DBs out of the box and detect an mmseqs version mismatch. Up-to-date DBs are skipped unless force is True.

Parameters:
  • organism (str)

  • force (bool)

Return type:

None
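The version check implied by the VERSION marker might look like the sketch below. It assumes the marker is a plain-text file named `VERSION` holding only the mmseqs version string; the actual file contents and the helper name `db_is_current` are assumptions.

```python
import tempfile
from pathlib import Path


def db_is_current(db_dir, mmseqs_version: str) -> bool:
    """True when db_dir holds a VERSION marker matching the given version."""
    marker = Path(db_dir) / "VERSION"
    return marker.is_file() and marker.read_text().strip() == mmseqs_version


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "VERSION").write_text("18.8cc5c\n")
    same = db_is_current(tmp, "18.8cc5c")
    stale = db_is_current(tmp, "other-version")  # placeholder version string
```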

Project reference region markup onto a query via the C++ hot path.

Takes a parsed mmseqs hit plus the reference entry for the matched scaffold and returns an AIRR-style record dict for the query.

Junction handling follows AIRR strictly: junction spans the conserved Cys104 through the [FW]118 that opens FR4; junction_aa starts with C and ends with F/W for a canonical rearrangement. A junction is reported even when not canonical (out-of-frame, missing the conserved residues). For an out-of-frame junction (V and J in different frames) the amino-acid translation inserts 1-2 N bases after the V germline end to restore the J frame; the codon that then contains an inserted N is rendered as _. The V/J split inside the junction is located from the transferred v_sequence_end / j_sequence_start.
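The out-of-frame rule can be illustrated with a small pure-Python sketch. It assumes the V/J split is given as `v_len`, the number of junction bases assigned to V; the function names are hypothetical, and the real projection runs in the C++ extension.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AA)}


def junction_aa(junction: str, v_len: int) -> str:
    """Translate a junction, padding with N after the V part if out of frame."""
    pad = -len(junction) % 3  # 0 when in frame, otherwise 1 or 2
    padded = junction[:v_len] + "N" * pad + junction[v_len:]
    codons = (padded[i:i + 3] for i in range(0, len(padded), 3))
    # A codon containing an inserted N is rendered as "_".
    return "".join("_" if "N" in c else CODON_TABLE.get(c, "X") for c in codons)


def looks_canonical(aa: str) -> bool:
    """Starts with C and ends with F or W."""
    return aa.startswith("C") and aa.endswith(("F", "W"))


in_frame = junction_aa("TGTGCCAGCAGCTTC", v_len=9)
out_of_frame = junction_aa("TGTGCCAGCAGTTC", v_len=9)
```

Here the 14-base junction gets a single N after the ninth base, so the J side is read in its own frame and only the codon holding the N becomes `_`.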

arda.annotate.transfer.transfer_hit(query_id, query_seq, hit, ref, seqtype='nt', rev_comp=False, d_germlines=None)

Build an AIRR record by projecting ref region coords onto the query.

Parameters:
  • query_id (str)

  • query_seq (str)

  • hit (dict)

  • ref (RefEntry)

  • seqtype (str)

  • rev_comp (bool)

  • d_germlines (list[tuple[str, str]] | None)

Return type:

dict
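To make the idea of coordinate projection concrete, the sketch below maps a 1-based reference position to the query position aligned to it, using the gapped `qaln`/`taln` strings that convertalis can emit. The function name and the choice to return None opposite a query gap are assumptions; the C++ path may resolve gaps differently.

```python
from typing import Optional


def project_position(t_pos: int, qaln: str, taln: str,
                     qstart: int, tstart: int) -> Optional[int]:
    """Map a 1-based target position to the aligned 1-based query position.

    Returns None if t_pos is outside the alignment or faces a query gap.
    """
    q, t = qstart - 1, tstart - 1  # last positions consumed so far
    for q_char, t_char in zip(qaln, taln):
        if q_char != "-":
            q += 1
        if t_char != "-":
            t += 1
            if t == t_pos:
                return q if q_char != "-" else None
    return None


# Query starts at position 5, target at position 1.
qaln, taln = "ACG-TAC", "ACGGT-C"
projected = [project_position(p, qaln, taln, 5, 1) for p in range(1, 8)]
```

Projecting each region's start and end this way yields query-space coordinates for the region boundaries.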

Load the curated reference markup for runtime projection.

For nucleotide annotation we use markup.tsv + alleles.fasta; for amino acid annotation markup.aa.tsv + alleles.aa.fasta. Both expose region *_start/*_end columns in the same coordinate space (nt or aa), so the projection code is identical.

class arda.annotate.reference.RefEntry(locus, v_call, j_call, starts, ends, v_sequence_end=0, j_sequence_start=0)

Bases: object

Per-scaffold reference markup: region coords (in target space) + calls.

Parameters:
  • locus (str)

  • v_call (str)

  • j_call (str)

  • starts (list[int])

  • ends (list[int])

  • v_sequence_end (int)

  • j_sequence_start (int)

locus: str
v_call: str
j_call: str
starts: list[int]
ends: list[int]
v_sequence_end: int
j_sequence_start: int

class arda.annotate.reference.Reference(organism, seqtype, target_fasta, entries, d_germlines)

Bases: object

In-memory reference for one (organism, seqtype).

Parameters:
  • organism (str)

  • seqtype (str)

  • target_fasta (Path)

  • entries (dict[str, RefEntry])

  • d_germlines (dict[str, list[tuple[str, str]]])

organism: str
seqtype: str
target_fasta: Path
entries: dict[str, RefEntry]
d_germlines: dict[str, list[tuple[str, str]]]

get(scaffold_id)
Parameters:

scaffold_id (str)

Return type:

RefEntry | None

arda.annotate.reference.load_reference(organism, seqtype='nt')

Load reference markup + target FASTA path for an organism.

Parameters:
  • organism (str)

  • seqtype (str)

Return type:

Reference

Sequence I/O: streaming FASTA/FASTQ readers and chunking.

Native parsing (no BioPython). Transparently handles gzip by .gz extension.

arda.annotate.io.open_text(path)

Open a (possibly gzipped) text file for reading.

Parameters:

path (str | Path)

arda.annotate.io.read_sequences(path)

Yield (id, sequence) from a FASTA or FASTQ file (auto-detected).

Parameters:

path (str | Path)

Return type:

Iterator[tuple[str, str]]

arda.annotate.io.detect_format(path)

Return "fasta" or "fastq" by peeking at the first non-empty char.

Parameters:

path (str | Path)

Return type:

str
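A minimal sketch of format detection with transparent gzip handling follows. It assumes the usual conventions that FASTA records begin with `>` and FASTQ records with `@`; the name `detect_format_sketch` and the error raised for other input are choices made for this example.

```python
import gzip
import tempfile
from pathlib import Path


def detect_format_sketch(path) -> str:
    """Return "fasta" or "fastq" from the first non-empty character."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open  # gzip chosen by extension
    with opener(path, "rt") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped:
                continue  # skip leading blank lines
            if stripped[0] == ">":
                return "fasta"
            if stripped[0] == "@":
                return "fastq"
            break
    raise ValueError(f"cannot detect FASTA/FASTQ format: {path}")


with tempfile.TemporaryDirectory() as tmp:
    fastq_gz = Path(tmp) / "reads.fastq.gz"
    with gzip.open(fastq_gz, "wt") as out:
        out.write("\n@r1\nACGT\n+\nIIII\n")
    fasta = Path(tmp) / "refs.fasta"
    fasta.write_text(">s1\nACGT\n")
    detected = (detect_format_sketch(fastq_gz), detect_format_sketch(fasta))
```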

arda.annotate.io.write_fasta(records, path)

Write (id, sequence) records to a FASTA file.

Parameters:
  • records (Iterator[tuple[str, str]])

  • path (str | Path)

Return type:

Path

arda.annotate.io.chunked(it, size)

Yield lists of up to size items from an iterator.

Parameters:
  • it (Iterator)

  • size (int)

Return type:

Iterator[list]

Reference build

Orchestrate the per-species reference database build.

For each locus: enumerate deduplicated V-J scaffolds, annotate them with IgBLAST, keep those with complete FR1-FR4 + CDR1-3 markup, translate to protein, and derive protein markup. Writes the committed artifacts under database/vdj/<organism>/ plus a comprehensive build.log.

arda.refbuild.build.build(organism='all')

Build one organism or "all" supported organisms.

Parameters:

organism (str)

Return type:

None

arda.refbuild.build.build_species(organism)

Build the reference DB for one organism. Returns the output directory.

Parameters:

organism (str)

Return type:

Path

Enumerate in-frame V-J reference scaffolds.

For markup transfer the FR/CDR region coordinates are fully determined by the V gene (FR1-3, CDR1-2, CDR3 start at the conserved Cys104) and the J gene (CDR3 end, FR4). The D segment lies inside the hypervariable CDR3 — query-specific at runtime — so we enumerate V×J scaffolds for every locus and, for VDJ loci, insert a short frame-neutral N spacer where D would sit so IgBLAST still annotates a plausible CDR3 + FR4.

Each scaffold is V + N*pad + J where pad keeps the J coding frame aligned to V’s reading frame (jframe from the IgBLAST aux file). Byte-identical scaffolds are deduplicated: one DB entry, with all contributing (V,J) allele pairs recorded.
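The assembly and deduplication step can be sketched as follows. The pad formula assumes every V sequence starts in reading frame 0 and that any D spacer is a multiple of three; both are assumptions of this example, as is the function name.

```python
def assemble_scaffolds(v_alleles, j_alleles, j_frames, d_spacer=0):
    """Return {scaffold sequence: [(V allele, J allele), ...]} for all V x J pairs."""
    scaffolds = {}
    for v_name, v_seq in v_alleles.items():
        for j_name, j_seq in j_alleles.items():
            # Smallest pad that puts J's first full codon on V's reading frame.
            frame_pad = -(len(v_seq) + j_frames[j_name]) % 3
            sequence = v_seq + "N" * (d_spacer + frame_pad) + j_seq
            # Byte-identical scaffolds collapse onto one key.
            scaffolds.setdefault(sequence, []).append((v_name, j_name))
    return scaffolds


# Toy sequences, not real germline alleles.
v_alleles = {"V1*01": "ATGGCC", "V1*02": "ATGGCC", "V2*01": "ATGGCCT"}
j_alleles = {"J1*01": "GTTCGGA"}
j_frames = {"J1*01": 1}
scaffolds = assemble_scaffolds(v_alleles, j_alleles, j_frames)
```

In both resulting scaffolds, J's first full codon begins at 0-based offset 9, a multiple of three.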

class arda.refbuild.combinations.Scaffold(scaffold_id, locus, sequence, v_calls=<factory>, j_calls=<factory>, n_pad=0)

Bases: object

A deduplicated V-J reference scaffold.

Fields: scaffold_id (stable "{locus}_{index}"), locus, sequence (assembled V + N*pad + J), v_calls / j_calls (all alleles producing this scaffold), and n_pad (N nucleotides between V and J).

Parameters:
  • scaffold_id (str)

  • locus (str)

  • sequence (str)

  • v_calls (list[str])

  • j_calls (list[str])

  • n_pad (int)

scaffold_id: str
locus: str
sequence: str
v_calls: list[str]
j_calls: list[str]
n_pad: int = 0

arda.refbuild.combinations.load_j_frames(organism)

Parse bin/optional_file/<organism>_gl.aux -> {J allele: frame}.

Frame is the 0-based “first coding frame start position” (column 2).

Parameters:

organism (str)

Return type:

dict[str, int]
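Parsing could be sketched as below, assuming the aux file is whitespace-separated with `#` comment lines, the allele name in column 1, and the frame in column 2. The sample rows are illustrative values, not copied from a real aux file.

```python
def parse_j_frames(lines):
    """Map J allele name -> 0-based coding frame from aux-file lines."""
    frames = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) >= 2:
            frames[fields[0]] = int(fields[1])
    return frames


# Illustrative content only.
AUX_TEXT = (
    "#gene/allele name\tframe\tchain type\tCDR3 stop\textra bps\n"
    "IGHJ1*01\t0\tJH\t17\t1\n"
    "\n"
    "IGHJ2*01\t2\tJH\t18\t1\n"
)
frames = parse_j_frames(AUX_TEXT.splitlines())
```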

arda.refbuild.combinations.build_locus_scaffolds(locus, v_alleles, j_alleles, j_frames, *, d_spacer=None)

Build deduplicated V×J scaffolds for one locus.

Parameters:
  • locus (Locus) – The locus definition.

  • v_alleles (dict[str, str]) – {allele: ungapped V sequence}.

  • j_alleles (dict[str, str]) – {allele: ungapped J sequence}.

  • j_frames (dict[str, int]) – {J allele: 0-based coding frame} from the aux file.

  • d_spacer (int | None) – N spacer length for VDJ loci (default DEFAULT_D_SPACER_NT); forced to 0 for VJ loci.

Returns:

Scaffolds, one per unique assembled sequence.

Return type:

list[Scaffold]

Native nucleotide translation and reading-frame utilities.

No BioPython. The hot functions (translate, detect_coding_frame, reverse_complement, back_translate) are implemented in the C++ extension arda._markup and re-exported here; a pure-Python fallback keeps the module importable if the extension is unavailable. These mirror mirpy’s mirseq API so mirpy can later import arda and reuse them.

arda.refbuild.translate.translate(nt, frame=0)

Translate a nucleotide string from frame (0/1/2).

Parameters:
  • nt (str)

  • frame (int)

Return type:

str

arda.refbuild.translate.detect_coding_frame(nt)

Return the reading frame (0/1/2) with the fewest stop codons.

Parameters:

nt (str)

Return type:

int

arda.refbuild.translate.reverse_complement(nt)

Reverse-complement a nucleotide string (non-ACGT -> N).

Parameters:

nt (str)

Return type:

str
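For orientation, here is one way pure-Python versions of the three functions above could look. This is a sketch, not the shipped fallback: rendering unknown codons as `X`, dropping a trailing partial codon, and breaking ties toward the lowest frame are all assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AA)}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(nt: str, frame: int = 0) -> str:
    """Translate from frame 0/1/2; a trailing partial codon is dropped."""
    nt = nt.upper()
    end = len(nt) - (len(nt) - frame) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(frame, end, 3))


def detect_coding_frame(nt: str) -> int:
    """Frame with the fewest stop codons; ties go to the lowest frame."""
    return min(range(3), key=lambda f: translate(nt, f).count("*"))


def reverse_complement(nt: str) -> str:
    """Reverse-complement; any non-ACGT base becomes N."""
    cleaned = "".join(b if b in "ACGT" else "N" for b in nt.upper())
    return cleaned.translate(_COMPLEMENT)[::-1]
```

For example, `translate("ATGGCCTAA")` gives `"MA*"`, and `reverse_complement("ACGTRN")` gives `"NNACGT"`.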

arda.refbuild.translate.back_translate(aa, unknown='NNN')

Mock back-translation via most-frequent human codons.

Parameters:
  • aa (str)

  • unknown (str)

Return type:

str

arda.refbuild.translate.aa_coords_from_nt(nt_start, nt_end, coding_start)

Map a 1-based closed nt interval to 1-based closed aa coordinates.

Parameters:
  • nt_start (int) – 1-based start of the region in the nucleotide sequence.

  • nt_end (int) – 1-based end (closed).

  • coding_start (int) – 1-based nt position where translation begins (frame origin).

Returns:

(aa_start, aa_end) 1-based closed, in the translated protein.

Return type:

tuple[int, int]
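One plausible form of this mapping is shown below. It assumes that any codon the interval touches counts toward the amino-acid range; how arda treats partial codons at region boundaries is not stated here, so treat this as an assumption.

```python
def aa_coords_sketch(nt_start: int, nt_end: int, coding_start: int) -> tuple[int, int]:
    """Map a 1-based closed nt interval to 1-based closed aa coordinates."""
    # Integer-divide the offset from the frame origin by the codon length,
    # then shift back to 1-based numbering.
    aa_start = (nt_start - coding_start) // 3 + 1
    aa_end = (nt_end - coding_start) // 3 + 1
    return aa_start, aa_end


first_region = aa_coords_sketch(1, 9, coding_start=1)
second_region = aa_coords_sketch(10, 24, coding_start=1)
shifted_origin = aa_coords_sketch(2, 10, coding_start=2)
```

Nucleotides 1-9 cover codons 1-3 and nucleotides 10-24 cover codons 4-8; moving the frame origin to position 2 shifts the mapping accordingly.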

IMGT/V-QUEST germline reference download, parsing, and ungapping.

The IMGT V-QUEST reference directory ships gapped germline FASTAs laid out as <Species>/<IG|TR>/<GENE>.fasta (e.g. Homo_sapiens/IG/IGHV.fasta). Sequences carry IMGT-numbering gap dots; IgBLAST’s edit_imgt_file.pl ungaps them and rewrites headers to bare allele names (what makeblastdb wants).

This module:

  • downloads & extracts the reference zip into data/imgt (gitignored),

  • parses the original gapped FASTA headers for per-allele functionality,

  • ungaps a gene file via edit_imgt_file.pl into data/imgt/ungapped.

class arda.refbuild.imgt.ImgtAllele(allele, functionality, sequence)

Bases: object

A germline allele parsed from an IMGT FASTA header + sequence.

Parameters:
  • allele (str)

  • functionality (str)

  • sequence (str)

allele: str
functionality: str
sequence: str
property is_functional: bool

arda.refbuild.imgt.download_reference(*, force=False)

Download and extract the IMGT V-QUEST reference directory.

Returns the extraction root (containing the per-species directories). Idempotent unless force.

Parameters:

force (bool)

Return type:

Path

arda.refbuild.imgt.gene_fasta_path(species_dir, group, gene_stem)

Path to a gene-type FASTA, e.g. Homo_sapiens/IG/IGHV.fasta.

Handles the occasional top-level wrapper directory inside the zip.

Parameters:
  • species_dir (str)

  • group (str)

  • gene_stem (str)

Return type:

Path

arda.refbuild.imgt.parse_functionality(path)

Map allele name -> normalized functionality from gapped IMGT headers.

IMGT header: accession|allele|species|functionality|region|.... The functionality field may be wrapped, e.g. (F) / [F] for inferred.

Parameters:

path (Path)

Return type:

dict[str, str]
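Header parsing along these lines might be sketched as follows. The helper name and the example headers are illustrative, and normalization here only strips the wrapping brackets; arda may normalize further.

```python
def functionality_from_header(header: str) -> tuple[str, str]:
    """Return (allele, functionality) from a pipe-delimited IMGT header."""
    fields = header.lstrip(">").split("|")
    allele = fields[1]
    # "(F)" and "[F]" both normalize to "F".
    functionality = fields[3].strip().strip("()[]")
    return allele, functionality


# Illustrative headers following the accession|allele|species|functionality|region layout.
plain = functionality_from_header(">X00000|IGHV1-18*01|Homo sapiens|F|V-REGION|")
wrapped = functionality_from_header(">X00001|IGHV1-2*05|Homo sapiens|(F)|V-REGION|")
```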

arda.refbuild.imgt.ungap_gene(species_dir, group, gene_stem)

Ungap a gene file with edit_imgt_file.pl; return the ungapped path.

Parameters:
  • species_dir (str)

  • group (str)

  • gene_stem (str)

Return type:

Path

arda.refbuild.imgt.read_fasta(path)

Read a FASTA file into (header, sequence) pairs (sequence joined).

Parameters:

path (Path)

Return type:

list[tuple[str, str]]

External tool wrappers

Thin wrapper around the mmseqs binary.

Inspired by pymmseqs (MIT) but deliberately dependency-free: we only need binary discovery, a subprocess runner, and the createdb / search / convertalis (and easy-search) pipeline used by the annotator.

Discovery order for the binary: $ARDA_MMSEQS, then <project>/bin/mmseqs, then mmseqs on PATH. If none is found, a static binary is auto-fetched into <project>/bin/mmseqs (one-time, transparent) unless $ARDA_NO_AUTO_FETCH is set, so neither pip nor conda users need to install mmseqs manually.

exception arda.mmseqs.MMseqsError

Bases: RuntimeError

Raised when an mmseqs invocation exits non-zero.

arda.mmseqs.mmseqs_binary()

Locate the mmseqs executable, auto-fetching a static build if needed.

Resolution: $ARDA_MMSEQS, then <project>/bin/mmseqs, then mmseqs on PATH. If still not found, download a static binary into <project>/bin/mmseqs (one-time) unless $ARDA_NO_AUTO_FETCH is set.

Return type:

str
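The lookup order can be sketched with the standard library alone. The auto-fetch step is not modeled; `find_mmseqs` and its `project_root` argument are hypothetical, and the executable check assumes a POSIX system.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def find_mmseqs(project_root) -> Optional[str]:
    """Return the first mmseqs found: $ARDA_MMSEQS, <root>/bin/mmseqs, then PATH."""
    candidates = [
        os.environ.get("ARDA_MMSEQS"),
        str(Path(project_root) / "bin" / "mmseqs"),
    ]
    for candidate in candidates:
        if candidate and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return shutil.which("mmseqs")  # None when nothing is on PATH


with tempfile.TemporaryDirectory() as root:
    local = Path(root) / "bin" / "mmseqs"
    local.parent.mkdir()
    local.write_text("#!/bin/sh\n")  # stand-in for the real binary
    local.chmod(0o755)
    os.environ.pop("ARDA_MMSEQS", None)
    found = find_mmseqs(root)
    expected = str(local)
```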

arda.mmseqs.run(args, *, check=True)

Run mmseqs <args> capturing stdout/stderr.

Parameters:
  • args (list[str])

  • check (bool)

Return type:

CompletedProcess

arda.mmseqs.version()

Return the mmseqs version string (e.g. 18.8cc5c).

Return type:

str

arda.mmseqs.createdb(fasta, db, *, dbtype=None)

Create an mmseqs sequence DB from a FASTA file.

dbtype: None auto-detect, 1 amino-acid, 2 nucleotide.

Parameters:
  • fasta (str | Path)

  • db (str | Path)

  • dbtype (int | None)

Return type:

Path

arda.mmseqs.search(query_db, target_db, result_db, tmp_dir, *, search_type=0, sensitivity=5.7, evalue=0.001, max_seqs=300, threads=1, extra=None)

Run mmseqs search with backtrace enabled (-a).

Parameters:
  • query_db (str | Path)

  • target_db (str | Path)

  • result_db (str | Path)

  • tmp_dir (str | Path)

  • search_type (int)

  • sensitivity (float)

  • evalue (float)

  • max_seqs (int)

  • threads (int)

  • extra (list[str] | None)

Return type:

Path
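For reference, the keyword arguments above plausibly map onto an mmseqs command line as in this sketch. The flags shown (`-a`, `--search-type`, `-s`, `-e`, `--max-seqs`, `--threads`, `--strand`) are standard mmseqs search options, but the exact argument list arda builds is an assumption, as is the helper name.

```python
def search_args(query_db, target_db, result_db, tmp_dir, *, search_type=0,
                sensitivity=5.7, evalue=0.001, max_seqs=300, threads=1, extra=None):
    """Build the argument list for `mmseqs search` with backtrace enabled."""
    args = [
        "search", str(query_db), str(target_db), str(result_db), str(tmp_dir),
        "-a",  # keep backtraces so alignments can be reconstructed later
        "--search-type", str(search_type),
        "-s", str(sensitivity),
        "-e", str(evalue),
        "--max-seqs", str(max_seqs),
        "--threads", str(threads),
    ]
    return args + list(extra or [])


# Nucleotide search on both strands, with the strand flag passed through extra.
args = search_args("qdb", "tdb", "res", "tmp", search_type=3, threads=4,
                   extra=["--strand", "2"])
```

A list like this could then be handed to a subprocess runner with the binary path prepended.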

arda.mmseqs.convertalis(query_db, target_db, result_db, out_tsv, *, format_output='query,target,qstart,qend,tstart,tend,qlen,tlen,alnlen,mismatch,gapopen,cigar,qaln,taln,evalue,bits,pident', threads=1, search_type=None)

Convert an alignment result DB to a TSV with the requested columns.

Pass search_type=3 for nucleotide results so that convertalis can interpret the alignment; without it, mmseqs cannot tell a nucleotide search from a translated one.

Parameters:
  • query_db (str | Path)

  • target_db (str | Path)

  • result_db (str | Path)

  • out_tsv (str | Path)

  • format_output (str)

  • threads (int)

  • search_type (int | None)

Return type:

Path
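Downstream of convertalis, the TSV can be read back using the same column list. The sketch below also keeps one hit per query, assuming "best" means the highest bit score; that criterion and the function name are assumptions rather than arda's documented rule.

```python
def best_hits(tsv_lines, format_output):
    """Parse convertalis rows and keep the highest-bits hit per query."""
    columns = format_output.split(",")
    best = {}
    for line in tsv_lines:
        row = dict(zip(columns, line.rstrip("\n").split("\t")))
        current = best.get(row["query"])
        if current is None or float(row["bits"]) > float(current["bits"]):
            best[row["query"]] = row
    return best


# Shortened column list and made-up rows for illustration.
rows = ["q1\ts1\t50", "q1\ts2\t80", "q2\ts3\t40"]
best = best_hits(rows, "query,target,bits")
```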

One-shot createdb+search+convertalis producing a TSV.

strand (nucleotide search only): 1 forward, 2 both strands; None lets mmseqs default (forward).

Parameters:
  • query_fasta (str | Path)

  • target_fasta_or_db (str | Path)

  • out_tsv (str | Path)

  • tmp_dir (str | Path)

  • search_type (int)

  • sensitivity (float)

  • evalue (float)

  • max_seqs (int)

  • threads (int)

  • format_output (str)

  • strand (int | None)

  • extra (list[str] | None)

Return type:

Path

Wrapper around the downloaded IgBLAST release.

Used only at build time (Phase 1) to construct the curated reference DB; the runtime annotator does not depend on IgBLAST.

The IgBLAST release is expected under <project>/bin (placed there by setup.sh), laid out as:

bin/
  igblastn  igblastp  makeblastdb  edit_imgt_file.pl
  internal_data/   optional_file/

$IGDATA is pointed at bin/ so IgBLAST finds internal_data and the per-organism optional_file/<organism>_gl.aux auxiliary files.

exception arda.igblast.IgBlastError

Bases: RuntimeError

Raised when an IgBLAST tool invocation fails or is missing.

arda.igblast.igdata_env()

Environment with IGDATA pointing at the IgBLAST data root.

Return type:

dict[str, str]

arda.igblast.tool(name)

Resolve an IgBLAST tool path under bin/.

Parameters:

name (str)

Return type:

Path

arda.igblast.edit_imgt_file(imgt_fasta, out_fasta)

Ungap an IMGT germline FASTA via edit_imgt_file.pl.

Parameters:
  • imgt_fasta (str | Path)

  • out_fasta (str | Path)

Return type:

Path

arda.igblast.makeblastdb(in_fasta, out_db, *, dbtype='nucl')

Build a germline BLAST database from an ungapped FASTA.

Parameters:
  • in_fasta (str | Path)

  • out_db (str | Path)

  • dbtype (str)

Return type:

Path

arda.igblast.igblastn_airr(query_fasta, out_tsv, *, organism, germline_db_v, germline_db_j, germline_db_d=None, auxiliary_data=None, ig_seqtype='TCR', num_threads=1)

Run igblastn -outfmt 19 (AIRR rearrangement TSV).

Parameters:
  • query_fasta (str | Path)

  • out_tsv (str | Path)

  • organism (str)

  • germline_db_v (str | Path)

  • germline_db_j (str | Path)

  • germline_db_d (str | Path | None)

  • auxiliary_data (str | Path | None)

  • ig_seqtype (str)

  • num_threads (int)

Return type:

Path
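As a sketch of how such a call could be assembled, the helper below builds the igblastn argument list and an environment with IGDATA set. The flags are standard igblastn options; the helper name, the `bin_dir` argument, and the flag order are assumptions.

```python
import os
from pathlib import Path


def igblastn_command(bin_dir, query_fasta, out_tsv, *, organism, germline_db_v,
                     germline_db_j, germline_db_d=None, auxiliary_data=None,
                     ig_seqtype="TCR", num_threads=1):
    """Return (argv, env) for an igblastn run writing AIRR TSV (-outfmt 19)."""
    bin_dir = Path(bin_dir)
    argv = [
        str(bin_dir / "igblastn"),
        "-query", str(query_fasta),
        "-out", str(out_tsv),
        "-outfmt", "19",
        "-organism", organism,
        "-germline_db_V", str(germline_db_v),
        "-germline_db_J", str(germline_db_j),
        "-ig_seqtype", ig_seqtype,
        "-num_threads", str(num_threads),
    ]
    if germline_db_d is not None:  # D database only for VDJ loci
        argv += ["-germline_db_D", str(germline_db_d)]
    if auxiliary_data is not None:
        argv += ["-auxiliary_data", str(auxiliary_data)]
    env = dict(os.environ, IGDATA=str(bin_dir))  # IgBLAST reads internal_data from here
    return argv, env


argv, env = igblastn_command("bin", "scaffolds.fasta", "out.tsv", organism="human",
                             germline_db_v="db/V", germline_db_j="db/J")
```

The pair can be passed to `subprocess.run(argv, env=env, check=True)` once the IgBLAST release is in place.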