API reference
Library entry point
Library-facing API.
A stable, import-friendly surface for embedding arda in other Python tools.
The heavy lifting lands in Phase 2 (arda.annotate); this module keeps the
public signature stable.
- arda.adapter.annotate_sequences(sequences, seqtype='nt', organism='human', map_d=True)[source]#
Annotate FR/CDR regions for a batch of sequences.
- Parameters:
sequences (Iterable[str] | Iterable[tuple[str, str]]) – Either raw sequence strings or (id, sequence) pairs.
seqtype (Literal['nt', 'aa']) – "nt" for nucleotide input, "aa" for amino acid.
organism (str) – One of the supported organisms (human, mouse, rat, rabbit, rhesus_monkey).
map_d (bool) – Map D segments (d_call/d2_call/np*) for VDJ-locus hits; False skips D mapping. Applies to nucleotide input only.
- Returns:
A list of AIRR-style annotation record dicts (one per input sequence).
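As a rough illustration of the two accepted input shapes, the sketch below normalizes either raw strings or (id, sequence) pairs into (id, sequence) tuples. The helper name normalize_inputs and the "seq{n}" id scheme are assumptions made for this example; how arda names unlabeled sequences is not stated here.

```python
from typing import Iterable, Union

SeqInput = Union[str, tuple[str, str]]


def normalize_inputs(sequences: Iterable[SeqInput]) -> list[tuple[str, str]]:
    """Turn raw strings or (id, sequence) pairs into (id, sequence) tuples."""
    records = []
    for index, item in enumerate(sequences):
        if isinstance(item, str):
            # Raw sequence: invent a positional id (hypothetical naming scheme).
            records.append((f"seq{index}", item))
        else:
            seq_id, sequence = item
            records.append((str(seq_id), sequence))
    return records
```

Mixed input such as [("clone_a", "ACGT"), "GG"] would come out as two (id, sequence) tuples, the second with a generated id.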
Runtime annotation
Runtime annotation: map input sequences to the reference and transfer markup.
Pipeline: read input (FASTA/FASTQ) -> MMseqs2 search against the curated scaffold DB -> best hit per query -> project reference region markup onto the query (C++ hot path) -> AIRR TSV.
- arda.annotate.mapper.annotate_file(input, output, organism='human', seqtype='nt', *, threads=0, sensitivity=None, strand='both', chunk_size=50000, map_d=True)[source]#
Annotate a FASTA/FASTQ file and stream an AIRR TSV.
The input is processed in bounded chunks. A background reader thread prefetches the next chunk while the current one is annotated (the GIL is released while Python waits on the mmseqs subprocess), so memory stays flat for arbitrarily large FASTQ files and read parsing overlaps with compute. The reference and target DB are loaded once and reused across all chunks.
- Parameters:
input (str | Path)
output (str | Path)
organism (str)
seqtype (str)
threads (int)
sensitivity (float | None)
strand (str)
chunk_size (int)
map_d (bool)
- Return type:
Path
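The chunk-and-prefetch pattern described for annotate_file can be sketched with a bounded queue and a reader thread. This is a minimal illustration, not arda's implementation: chunked, process_prefetched and the work callback are hypothetical names, and error handling is reduced to always delivering the end-of-input sentinel.

```python
import queue
import threading
from itertools import islice

_DONE = object()  # sentinel marking the end of the input


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def process_prefetched(items, size, work):
    """Apply `work` to each chunk while a reader thread loads the next one."""
    # maxsize=1 keeps only a few chunks alive at once, whatever the input size.
    buffer = queue.Queue(maxsize=1)

    def reader():
        try:
            for chunk in chunked(items, size):
                buffer.put(chunk)  # blocks while the consumer is still busy
        finally:
            buffer.put(_DONE)  # always unblock the consumer

    threading.Thread(target=reader, daemon=True).start()
    while (chunk := buffer.get()) is not _DONE:
        yield work(chunk)
```

Here work stands in for the per-chunk annotation step; because the queue is bounded, the reader can run at most one chunk ahead of the consumer.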
- arda.annotate.mapper.annotate_records(records, organism='human', seqtype='nt', *, threads=0, sensitivity=None, strand='both', map_d=True)[source]#
Annotate in-memory (id, sequence) records; return AIRR record dicts.
- Parameters:
strand (str) – "both" (default, nt only) searches both strands and re-orients reverse-complement hits; "forward" searches the plus strand only. Ignored for protein input.
map_d (bool) – True (default) maps D segments into the junction of VDJ-locus hits (d_call/d2_call/np*); False skips D mapping (nt input only; D mapping never runs for protein input).
records (list[tuple[str, str]])
organism (str)
seqtype (str)
threads (int)
sensitivity (float | None)
- Return type:
list[dict]
- arda.annotate.mapper.build_index(organism='all', *, force=False)[source]#
(Re)build the precompiled mmseqs DBs shipped under database/.
Writes database/vdj/<org>/mmseqs/<seqtype>/db* + a VERSION marker so the runtime can use them out of the box (and detect a mmseqs-version mismatch). Skips up-to-date DBs unless force.
- Parameters:
organism (str)
force (bool)
- Return type:
None
Project reference region markup onto a query via the C++ hot path.
Takes a parsed mmseqs hit plus the reference entry for the matched scaffold and returns an AIRR-style record dict for the query.
Junction handling follows AIRR strictly: junction spans the conserved Cys104
through the [FW]118 that opens FR4; junction_aa starts with C and ends with
F/W for a canonical rearrangement. A junction is reported even when not
canonical (out-of-frame, missing the conserved residues). For an out-of-frame
junction (V and J in different frames) the amino-acid translation inserts 1-2 N
bases after the V germline end to restore the J frame; the codon that then
contains an inserted N is rendered as _. The V/J split inside the junction is
located from the transferred v_sequence_end / j_sequence_start.
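The out-of-frame rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the C++ hot path: translate_junction is a hypothetical name, and v_len (the number of junction bases contributed by V) stands in for the split derived from v_sequence_end / j_sequence_start.

```python
from itertools import product

_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate_junction(junction_nt: str, v_len: int) -> str:
    """Translate a junction, padding with N after the V part if out of frame.

    `v_len` (hypothetical parameter) is the number of junction bases that
    come from the V germline, i.e. the offset where padding is inserted.
    """
    pad = -len(junction_nt) % 3  # 0 when in frame, otherwise 1 or 2
    padded = junction_nt[:v_len] + "N" * pad + junction_nt[v_len:]
    codons = (padded[i:i + 3] for i in range(0, len(padded), 3))
    # Any codon containing N (here, the inserted padding) is rendered as "_".
    return "".join("_" if "N" in c else CODON_TABLE.get(c, "X") for c in codons)
```

For a 13-nt junction with 7 V-derived bases, two N bases are inserted after position 7, so the V side stays in the Cys frame and the J side ends on a complete F/W codon.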
- arda.annotate.transfer.transfer_hit(query_id, query_seq, hit, ref, seqtype='nt', rev_comp=False, d_germlines=None)[source]#
Build an AIRR record by projecting ref region coords onto the query.
- Parameters:
query_id (str)
query_seq (str)
hit (dict)
ref (RefEntry)
seqtype (str)
rev_comp (bool)
d_germlines (list[tuple[str, str]] | None)
- Return type:
dict
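To give a feel for what "projecting region coords onto the query" involves, here is a small Python sketch that maps one reference (target) position to the query through an alignment CIGAR. It is not the C++ implementation: project_position is a hypothetical helper, coordinates are taken to be 1-based, and the CIGAR semantics (M consumes both sequences, I only the query, D only the target) are an assumption of this example.

```python
import re
from typing import Optional

_CIGAR_OP = re.compile(r"(\d+)([MID])")


def project_position(t_pos: int, q_start: int, t_start: int, cigar: str) -> Optional[int]:
    """Map a 1-based target position to the aligned 1-based query position.

    Returns None when the position lies outside the alignment or falls in a
    stretch of the target that has no counterpart in the query.
    """
    q, t = q_start, t_start
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "M":  # consumes query and target
            if t <= t_pos < t + n:
                return q + (t_pos - t)
            q, t = q + n, t + n
        elif op == "I":  # assumed: bases present in the query only
            q += n
        else:  # "D", assumed: bases present in the target only
            if t <= t_pos < t + n:
                return None
            t += n
    return None
```

Projecting each region's *_start and *_end this way yields query-space boundaries; a real implementation would also need a policy for boundaries that land in a gap, which this sketch simply reports as None.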
Load the curated reference markup for runtime projection.
For nucleotide annotation we use markup.tsv + alleles.fasta; for amino
acid annotation markup.aa.tsv + alleles.aa.fasta. Both expose region
*_start/*_end columns in the same coordinate space (nt or aa), so the
projection code is identical.
- class arda.annotate.reference.RefEntry(locus, v_call, j_call, starts, ends, v_sequence_end=0, j_sequence_start=0)[source]#
Bases: object
Per-scaffold reference markup: region coords (in target space) + calls.
- Parameters:
locus (str)
v_call (str)
j_call (str)
starts (list[int])
ends (list[int])
v_sequence_end (int)
j_sequence_start (int)
- locus: str#
- v_call: str#
- j_call: str#
- starts: list[int]#
- ends: list[int]#
- v_sequence_end: int#
- j_sequence_start: int#
- class arda.annotate.reference.Reference(organism, seqtype, target_fasta, entries, d_germlines)[source]#
Bases: object
In-memory reference for one (organism, seqtype).
- Parameters:
organism (str)
seqtype (str)
target_fasta (Path)
entries (dict[str, RefEntry])
d_germlines (dict[str, list[tuple[str, str]]])
- organism: str#
- seqtype: str#
- target_fasta: Path#
- d_germlines: dict[str, list[tuple[str, str]]]#
- arda.annotate.reference.load_reference(organism, seqtype='nt')[source]#
Load reference markup + target FASTA path for an organism.
- Parameters:
organism (str)
seqtype (str)
- Return type:
Sequence I/O: streaming FASTA/FASTQ readers and chunking.
Native parsing (no BioPython). Transparently handles gzip by .gz extension.
- arda.annotate.io.open_text(path)[source]#
Open a (possibly gzipped) text file for reading.
- Parameters:
path (str | Path)
- arda.annotate.io.read_sequences(path)[source]#
Yield (id, sequence) from a FASTA or FASTQ file (auto-detected).
- Parameters:
path (str | Path)
- Return type:
Iterator[tuple[str, str]]
- arda.annotate.io.detect_format(path)[source]#
Return "fasta" or "fastq" by peeking at the first non-empty char.
- Parameters:
path (str | Path)
- Return type:
str
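The three I/O helpers above follow a common pattern that can be sketched with the standard library alone. The functions below (open_maybe_gzip, sniff_format, iter_fasta) are hypothetical stand-ins written for this example; they cover gzip detection by extension, format sniffing, and FASTA streaming, and leave FASTQ parsing out for brevity.

```python
import gzip
from pathlib import Path
from typing import Iterator, TextIO, Union


def open_maybe_gzip(path: Union[str, Path]) -> TextIO:
    """Open a text file, decompressing when the name ends in .gz."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def sniff_format(path: Union[str, Path]) -> str:
    """Return "fasta" or "fastq" from the first non-whitespace character."""
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                if line[0] == ">":
                    return "fasta"
                if line[0] == "@":
                    return "fastq"
                break
    raise ValueError(f"cannot detect FASTA/FASTQ format: {path}")


def iter_fasta(path: Union[str, Path]) -> Iterator[tuple[str, str]]:
    """Yield (id, sequence) pairs, joining wrapped sequence lines."""
    seq_id, parts = None, []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(parts)
                # The id is the header text up to the first whitespace.
                fields = line[1:].split(None, 1)
                seq_id, parts = (fields[0] if fields else ""), []
            elif line and seq_id is not None:
                parts.append(line)
    if seq_id is not None:
        yield seq_id, "".join(parts)
```

Because iter_fasta is a generator, only one record is held in memory at a time, which is the property the streaming readers rely on.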
Reference build
Orchestrate the per-species reference database build.
For each locus: enumerate deduplicated V-J scaffolds, annotate them with IgBLAST,
keep those with complete FR1-FR4 + CDR1-3 markup, translate to protein, and
derive protein markup. Writes the committed artifacts under
database/vdj/<organism>/ plus a comprehensive build.log.
- arda.refbuild.build.build(organism='all')[source]#
Build one organism or "all" supported organisms.
- Parameters:
organism (str)
- Return type:
None
- arda.refbuild.build.build_species(organism)[source]#
Build the reference DB for one organism. Returns the output directory.
- Parameters:
organism (str)
- Return type:
Path
Enumerate in-frame V-J reference scaffolds.
For markup transfer the FR/CDR region coordinates are fully determined by the V gene (FR1-3, CDR1-2, CDR3 start at the conserved Cys104) and the J gene (CDR3 end, FR4). The D segment lies inside the hypervariable CDR3 — query-specific at runtime — so we enumerate V×J scaffolds for every locus and, for VDJ loci, insert a short frame-neutral N spacer where D would sit so IgBLAST still annotates a plausible CDR3 + FR4.
Each scaffold is V + N*pad + J where pad keeps the J coding frame aligned
to V’s reading frame (jframe from the IgBLAST aux file). Byte-identical
scaffolds are deduplicated: one DB entry, with all contributing (V,J) allele
pairs recorded.
- class arda.refbuild.combinations.Scaffold(scaffold_id, locus, sequence, v_calls=<factory>, j_calls=<factory>, n_pad=0)[source]#
Bases: object
A deduplicated V-J reference scaffold.
Fields: scaffold_id (stable "{locus}_{index}"), locus, sequence (assembled V + N*pad + J), v_calls/j_calls (all alleles producing this scaffold), and n_pad (N nucleotides between V and J).
- Parameters:
scaffold_id (str)
locus (str)
sequence (str)
v_calls (list[str])
j_calls (list[str])
n_pad (int)
- scaffold_id: str#
- locus: str#
- sequence: str#
- v_calls: list[str]#
- j_calls: list[str]#
- n_pad: int = 0#
- arda.refbuild.combinations.load_j_frames(organism)[source]#
Parse bin/optional_file/<organism>_gl.aux -> {J allele: frame}.
Frame is the 0-based “first coding frame start position” (column 2).
- Parameters:
- Parameters:
organism (str)
- Return type:
dict[str, int]
- arda.refbuild.combinations.build_locus_scaffolds(locus, v_alleles, j_alleles, j_frames, *, d_spacer=None)[source]#
Build deduplicated V×J scaffolds for one locus.
- Parameters:
locus (Locus) – The locus definition.
v_alleles (dict[str, str]) – {allele: ungapped V sequence}.
j_alleles (dict[str, str]) – {allele: ungapped J sequence}.
j_frames (dict[str, int]) – {J allele: 0-based coding frame} from the aux file.
d_spacer (int | None) – N spacer length for VDJ loci (default DEFAULT_D_SPACER_NT); forced to 0 for VJ loci.
- Returns:
Scaffolds, one per unique assembled sequence.
- Return type:
list[Scaffold]
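The V + N*pad + J assembly and the deduplication step can be sketched as follows. This is an illustration only: frame_pad and build_scaffolds are hypothetical names, V is assumed to be read in frame from its first base, the D spacer is assumed to be a multiple of 3 so that it stays frame-neutral, and how the real n_pad field accounts for the spacer is not specified here.

```python
def frame_pad(v_len: int, j_frame: int) -> int:
    """N count that puts the J coding frame on V's reading frame.

    Assumes V is read in frame from its first base and `j_frame` is the
    0-based offset of the first complete J codon.
    """
    return -(v_len + j_frame) % 3


def build_scaffolds(v_alleles, j_alleles, j_frames, d_spacer=0):
    """Assemble V + N*pad + J for every pair; merge identical sequences."""
    merged = {}  # sequence -> (set of V alleles, set of J alleles)
    for v_name, v_seq in v_alleles.items():
        for j_name, j_seq in j_alleles.items():
            # d_spacer should be a multiple of 3 so it does not shift the frame.
            pad = d_spacer + frame_pad(len(v_seq), j_frames[j_name])
            sequence = v_seq + "N" * pad + j_seq
            v_calls, j_calls = merged.setdefault(sequence, (set(), set()))
            v_calls.add(v_name)
            j_calls.add(j_name)
    return {seq: (sorted(v), sorted(j)) for seq, (v, j) in merged.items()}
```

Two V alleles with identical sequence collapse into one scaffold that lists both names, which mirrors the v_calls/j_calls fields described above.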
Native nucleotide translation and reading-frame utilities.
No BioPython. The hot functions (translate, detect_coding_frame,
reverse_complement, back_translate) are implemented in the C++ extension
arda._markup and re-exported here; a pure-Python fallback keeps the module
importable if the extension is unavailable. These mirror mirpy’s mirseq API so
mirpy can later import arda and reuse them.
- arda.refbuild.translate.translate(nt, frame=0)[source]#
Translate a nucleotide string from frame (0/1/2).
- Parameters:
nt (str)
frame (int)
- Return type:
str
- arda.refbuild.translate.detect_coding_frame(nt)[source]#
Return the reading frame (0/1/2) with the fewest stop codons.
- Parameters:
nt (str)
- Return type:
int
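A pure-Python sketch of translation and frame detection is shown below. It is not the arda._markup code: translate_py and detect_frame_py are hypothetical names, and the handling of edge cases (trailing partial codons are dropped, unknown codons become "X", ties go to the lowest frame) is an assumption of this example.

```python
from itertools import product

_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate_py(nt: str, frame: int = 0) -> str:
    """Translate complete codons starting at offset `frame` (0, 1 or 2)."""
    nt = nt.upper()
    codons = (nt[i:i + 3] for i in range(frame, len(nt) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def detect_frame_py(nt: str) -> int:
    """Return the frame with the fewest stop codons (lowest frame on ties)."""
    return min(range(3), key=lambda frame: translate_py(nt, frame).count("*"))
```

With "TAAATGGCC", frame 0 starts on a stop codon while frames 1 and 2 have none, so the sketch picks frame 1.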
- arda.refbuild.translate.reverse_complement(nt)[source]#
Reverse-complement a nucleotide string (non-ACGT -> N).
- Parameters:
nt (str)
- Return type:
str
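The "non-ACGT -> N" rule is easy to show in plain Python. The function name reverse_complement_py is hypothetical, and upper-casing the input first is an assumption of this sketch rather than documented behavior.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement_py(nt: str) -> str:
    """Reverse-complement a sequence; anything outside ACGT becomes N."""
    return "".join(_COMPLEMENT.get(base, "N") for base in reversed(nt.upper()))
```

Ambiguity codes such as R or Y are therefore flattened to N rather than complemented.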
- arda.refbuild.translate.back_translate(aa, unknown='NNN')[source]#
Mock back-translation via most-frequent human codons.
- Parameters:
aa (str)
unknown (str)
- Return type:
str
- arda.refbuild.translate.aa_coords_from_nt(nt_start, nt_end, coding_start)[source]#
Map a 1-based closed nt interval to 1-based closed aa coordinates.
- Parameters:
nt_start (int) – 1-based start of the region in the nucleotide sequence.
nt_end (int) – 1-based end (closed).
coding_start (int) – 1-based nt position where translation begins (frame origin).
- Returns:
(aa_start, aa_end), 1-based closed, in the translated protein.
- Return type:
tuple[int, int]
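One plausible way to do this mapping is integer division by the codon length, as in the sketch below. The name aa_coords is hypothetical, and the treatment of partial codons (a boundary inside a codon maps to the residue that codon encodes) is an assumption; positions before coding_start are not handled.

```python
def aa_coords(nt_start: int, nt_end: int, coding_start: int) -> tuple[int, int]:
    """Map a 1-based closed nt interval to 1-based closed aa coordinates."""
    # Offset from the frame origin, divided into codons, back to 1-based.
    aa_start = (nt_start - coding_start) // 3 + 1
    aa_end = (nt_end - coding_start) // 3 + 1
    return aa_start, aa_end
```

For example, with translation starting at nt 2, the region covering nt 5 to 10 spans the second and third residues.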
IMGT/V-QUEST germline reference download, parsing, and ungapping.
The IMGT V-QUEST reference directory ships gapped germline FASTAs laid out as
<Species>/<IG|TR>/<GENE>.fasta (e.g. Homo_sapiens/IG/IGHV.fasta).
Sequences carry IMGT-numbering gap dots; IgBLAST’s edit_imgt_file.pl ungaps
them and rewrites headers to bare allele names (what makeblastdb wants).
This module:
downloads & extracts the reference zip into data/imgt (gitignored),
parses the original gapped FASTA headers for per-allele functionality,
ungaps a gene file via edit_imgt_file.pl into data/imgt/ungapped.
- class arda.refbuild.imgt.ImgtAllele(allele, functionality, sequence)[source]#
Bases: object
A germline allele parsed from an IMGT FASTA header + sequence.
- Parameters:
allele (str)
functionality (str)
sequence (str)
- allele: str#
- functionality: str#
- sequence: str#
- property is_functional: bool#
- arda.refbuild.imgt.download_reference(*, force=False)[source]#
Download and extract the IMGT V-QUEST reference directory.
Returns the extraction root (containing the per-species directories). Idempotent unless force.
- Parameters:
force (bool)
- Return type:
Path
- arda.refbuild.imgt.gene_fasta_path(species_dir, group, gene_stem)[source]#
Path to a gene-type FASTA, e.g. Homo_sapiens/IG/IGHV.fasta.
Handles the occasional top-level wrapper directory inside the zip.
- Parameters:
- Parameters:
species_dir (str)
group (str)
gene_stem (str)
- Return type:
Path
- arda.refbuild.imgt.parse_functionality(path)[source]#
Map allele name -> normalized functionality from gapped IMGT headers.
IMGT header: accession|allele|species|functionality|region|.... The functionality field may be wrapped, e.g. (F)/[F] for inferred.
- Parameters:
path (Path)
- Return type:
dict[str, str]
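The header layout above suggests a simple split on "|". The sketch below works on an iterable of lines instead of a path, uses the hypothetical name parse_functionality_lines, and assumes that "normalized" means stripping the (…) or […] wrapper; the header in the usage note is made up for illustration.

```python
def parse_functionality_lines(lines) -> dict[str, str]:
    """Map allele name -> functionality from IMGT-style FASTA header lines."""
    result = {}
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line
        fields = line[1:].rstrip().split("|")
        if len(fields) < 4:
            continue  # not an IMGT-style header
        # "(F)" and "[F]" mark inferred functionality; strip the wrapper.
        result[fields[1]] = fields[3].strip().strip("()[]")
    return result
```

Feeding it a header such as ">X00000|IGHV1-1*01|Homo sapiens|(F)|V-REGION|" would map IGHV1-1*01 to F.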
External tool wrappers
Thin wrapper around the mmseqs binary.
Inspired by pymmseqs (MIT) but deliberately dependency-free: we only need
binary discovery, a subprocess runner, and the createdb / search /
convertalis (and easy-search) pipeline used by the annotator.
Discovery order for the binary: $ARDA_MMSEQS → <project>/bin/mmseqs →
mmseqs on PATH. If none are found, a static binary is auto-fetched into
<project>/bin/mmseqs (one-time, transparent) unless $ARDA_NO_AUTO_FETCH
is set — so neither pip nor conda users need to install mmseqs manually.
- exception arda.mmseqs.MMseqsError[source]#
Bases: RuntimeError
Raised when an mmseqs invocation exits non-zero.
- arda.mmseqs.mmseqs_binary()[source]#
Locate the mmseqs executable, auto-fetching a static build if needed.
Resolution: $ARDA_MMSEQS → <project>/bin/mmseqs → mmseqs on PATH. If still not found, download a static binary into <project>/bin/mmseqs (one-time) unless $ARDA_NO_AUTO_FETCH is set.
- Return type:
str
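The resolution order can be sketched with os.environ and shutil.which. This example stops short of the auto-fetch step and returns None instead; find_mmseqs, its project_root argument, and the is_file checks are assumptions of the sketch, not arda's code.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def find_mmseqs(project_root, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Resolve mmseqs: $ARDA_MMSEQS, then <project>/bin/mmseqs, then PATH."""
    env = os.environ if env is None else env
    override = env.get("ARDA_MMSEQS")
    if override and Path(override).is_file():
        return override
    bundled = Path(project_root) / "bin" / "mmseqs"
    if bundled.is_file():
        return str(bundled)
    # Falls back to PATH lookup; None means "not found" (no auto-fetch here).
    return shutil.which("mmseqs")
```

Passing env explicitly makes the lookup easy to test without touching the real process environment.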
- arda.mmseqs.run(args, *, check=True)[source]#
Run mmseqs <args> capturing stdout/stderr.
- Parameters:
args (list[str])
check (bool)
- Return type:
CompletedProcess
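A subprocess runner with this shape (capture output, optionally raise on a non-zero exit) can be written with subprocess.run. The names ToolError and run_tool are hypothetical, and the demo uses the Python interpreter as a stand-in binary so the sketch runs without mmseqs installed.

```python
import subprocess
import sys


class ToolError(RuntimeError):
    """Raised when the wrapped command exits non-zero."""


def run_tool(binary: str, args: list[str], *, check: bool = True) -> subprocess.CompletedProcess:
    """Run `binary <args>`, capturing stdout/stderr as text."""
    proc = subprocess.run([binary, *args], capture_output=True, text=True)
    if check and proc.returncode != 0:
        raise ToolError(
            f"{binary} {' '.join(args)} exited {proc.returncode}:\n{proc.stderr}"
        )
    return proc


# Demo with the Python interpreter standing in for the mmseqs binary.
result = run_tool(sys.executable, ["-c", "print('ok')"])
```

Including stderr in the exception message keeps the tool's own diagnostics visible to the caller.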
- arda.mmseqs.createdb(fasta, db, *, dbtype=None)[source]#
Create an mmseqs sequence DB from a FASTA file.
dbtype: None auto-detect, 1 amino-acid, 2 nucleotide.
- Parameters:
fasta (str | Path)
db (str | Path)
dbtype (int | None)
- Return type:
Path
- arda.mmseqs.search(query_db, target_db, result_db, tmp_dir, *, search_type=0, sensitivity=5.7, evalue=0.001, max_seqs=300, threads=1, extra=None)[source]#
Run mmseqs search with backtrace enabled (-a).
- Parameters:
query_db (str | Path)
target_db (str | Path)
result_db (str | Path)
tmp_dir (str | Path)
search_type (int)
sensitivity (float)
evalue (float)
max_seqs (int)
threads (int)
extra (list[str] | None)
- Return type:
Path
- arda.mmseqs.convertalis(query_db, target_db, result_db, out_tsv, *, format_output='query,target,qstart,qend,tstart,tend,qlen,tlen,alnlen,mismatch,gapopen,cigar,qaln,taln,evalue,bits,pident', threads=1, search_type=None)[source]#
Convert an alignment result DB to a TSV with the requested columns.
search_type must be passed for nucleotide results (3) so convertalis can interpret the alignment; otherwise mmseqs cannot tell nt from translated.
- Parameters:
query_db (str | Path)
target_db (str | Path)
result_db (str | Path)
out_tsv (str | Path)
format_output (str)
threads (int)
search_type (int | None)
- Return type:
Path
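Downstream of convertalis, the "best hit per query" step of the pipeline amounts to a reduction over the TSV rows. The sketch below assumes the default format_output column order shown in the signature, a TSV without a header row, and ranking by bit score; best_hits is a hypothetical helper and the actual selection criterion may differ.

```python
COLUMNS = (
    "query,target,qstart,qend,tstart,tend,qlen,tlen,alnlen,"
    "mismatch,gapopen,cigar,qaln,taln,evalue,bits,pident"
).split(",")


def best_hits(tsv_text: str) -> dict[str, dict[str, str]]:
    """Keep the highest-bit-score row for each query (first one wins ties)."""
    best: dict[str, dict[str, str]] = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        hit = dict(zip(COLUMNS, line.split("\t")))
        current = best.get(hit["query"])
        if current is None or float(hit["bits"]) > float(current["bits"]):
            best[hit["query"]] = hit
    return best
```

Each retained dict still carries the coordinate and cigar fields that a projection step would need.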
- arda.mmseqs.easy_search(query_fasta, target_fasta_or_db, out_tsv, tmp_dir, *, search_type=0, sensitivity=5.7, evalue=0.001, max_seqs=300, threads=1, format_output='query,target,qstart,qend,tstart,tend,qlen,tlen,alnlen,mismatch,gapopen,cigar,qaln,taln,evalue,bits,pident', strand=None, extra=None)[source]#
One-shot createdb+search+convertalis producing a TSV.
strand (nucleotide search only): 1 forward, 2 both strands; None lets mmseqs use its default (forward).
- Parameters:
query_fasta (str | Path)
target_fasta_or_db (str | Path)
out_tsv (str | Path)
tmp_dir (str | Path)
search_type (int)
sensitivity (float)
evalue (float)
max_seqs (int)
threads (int)
format_output (str)
strand (int | None)
extra (list[str] | None)
- Return type:
Path
Wrapper around the downloaded IgBLAST release.
Used only at build time (Phase 1) to construct the curated reference DB; the runtime annotator does not depend on IgBLAST.
The IgBLAST release is expected under <project>/bin (placed there by
setup.sh), laid out as:
bin/
igblastn igblastp makeblastdb edit_imgt_file.pl
internal_data/ optional_file/
$IGDATA is pointed at bin/ so IgBLAST finds internal_data and the
per-organism optional_file/<organism>_gl.aux auxiliary files.
- exception arda.igblast.IgBlastError[source]#
Bases: RuntimeError
Raised when an IgBLAST tool invocation fails or is missing.
- arda.igblast.igdata_env()[source]#
Environment with IGDATA pointing at the IgBLAST data root.
- Return type:
dict[str, str]
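Building such an environment usually means copying the current one and overriding a single variable, as in this sketch. The name igdata_env_sketch and its arguments are hypothetical; the real function takes no arguments and resolves the bin/ directory itself.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def igdata_env_sketch(igblast_root, base_env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return a copy of the environment with IGDATA set to `igblast_root`."""
    env = dict(os.environ if base_env is None else base_env)
    env["IGDATA"] = str(Path(igblast_root))
    return env
```

The returned dict can be passed as env= to subprocess.run, which leaves the parent process environment untouched.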
- arda.igblast.tool(name)[source]#
Resolve an IgBLAST tool path under bin/.
- Parameters:
name (str)
- Return type:
Path
- arda.igblast.edit_imgt_file(imgt_fasta, out_fasta)[source]#
Ungap an IMGT germline FASTA via edit_imgt_file.pl.
- Parameters:
imgt_fasta (str | Path)
out_fasta (str | Path)
- Return type:
Path
- arda.igblast.makeblastdb(in_fasta, out_db, *, dbtype='nucl')[source]#
Build a germline BLAST database from an ungapped FASTA.
- Parameters:
in_fasta (str | Path)
out_db (str | Path)
dbtype (str)
- Return type:
Path
- arda.igblast.igblastn_airr(query_fasta, out_tsv, *, organism, germline_db_v, germline_db_j, germline_db_d=None, auxiliary_data=None, ig_seqtype='TCR', num_threads=1)[source]#
Run igblastn -outfmt 19 (AIRR rearrangement TSV).
- Parameters:
query_fasta (str | Path)
out_tsv (str | Path)
organism (str)
germline_db_v (str | Path)
germline_db_j (str | Path)
germline_db_d (str | Path | None)
auxiliary_data (str | Path | None)
ig_seqtype (str)
num_threads (int)
- Return type:
Path