API reference

Library entry point

Library-facing API.

A stable, import-friendly surface for embedding arda in other Python tools. The heavy lifting lands in Phase 2 (arda.annotate); this module keeps the public signature stable.

arda.adapter.annotate_sequences(sequences, seqtype='nt', organism='human', map_d=True)

Annotate FR/CDR regions for a batch of sequences.

Parameters:

sequences (Iterable[str] | Iterable[tuple[str, str]]) – Either raw sequence strings or (id, sequence) pairs.
seqtype (Literal['nt', 'aa']) – "nt" for nucleotide input, "aa" for amino acid.
organism (str) – One of the supported organisms (human, mouse, rat, rabbit, rhesus_monkey).
map_d (bool) – Map D segments (d_call/d2_call/np*) for VDJ-locus hits; False skips D mapping. Applies to nucleotide input only.

Returns:

A list of AIRR-style annotation record dicts (one per input sequence).
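The two accepted input shapes can be brought to a single (id, sequence) form before annotation. The following is a minimal sketch of that normalisation, not arda's implementation: the helper name normalize_records and the generated ids (seq1, seq2, ...) are hypothetical, since the ids arda assigns to raw strings are not documented here.

```python
from typing import Iterable, Iterator, Tuple, Union

Record = Union[str, Tuple[str, str]]


def normalize_records(sequences: Iterable[Record]) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs from raw strings or (id, sequence) pairs."""
    for index, item in enumerate(sequences, start=1):
        if isinstance(item, str):
            # Hypothetical id scheme for raw strings (an assumption).
            yield f"seq{index}", item
        else:
            seq_id, seq = item
            yield seq_id, seq


records = list(normalize_records(["CAGGTG", ("clone_7", "GAGGTC")]))
```

Mixed input is tolerated here for brevity; a stricter version could reject a batch that mixes both shapes.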

Runtime annotation

Runtime annotation: map input sequences to the reference and transfer markup.

Pipeline: read input (FASTA/FASTQ) -> MMseqs2 search against the curated scaffold DB -> best hit per query -> project reference region markup onto the query (C++ hot path) -> AIRR TSV.

arda.annotate.mapper.annotate_file(input, output, organism='human', seqtype='nt', *, threads=0, sensitivity=None, strand='both', chunk_size=50000, map_d=True)

Annotate a FASTA/FASTQ file and stream an AIRR TSV.

The input is processed in bounded chunks. A background reader thread prefetches the next chunk while the current one is annotated (the GIL is released while Python waits on the mmseqs subprocess), so memory stays flat for arbitrarily large FASTQ files and read parsing overlaps with compute. The reference and target DB are loaded once and reused across all chunks.

Parameters:

input (str | Path)
output (str | Path)
organism (str)
seqtype (str)
threads (int)
sensitivity (float | None)
strand (str)
chunk_size (int)
map_d (bool)

Return type:

Path
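The bounded-chunk prefetch described for annotate_file can be pictured with a reader thread feeding a bounded queue. This is a sketch of the general pattern only; prefetch_chunks and its depth parameter are hypothetical names, and arda's actual loop may differ.

```python
import queue
import threading
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of input


def prefetch_chunks(items: Iterable[T], chunk_size: int, depth: int = 1) -> Iterator[List[T]]:
    """Yield lists of up to chunk_size items, read ahead by a background thread."""
    buf: "queue.Queue[object]" = queue.Queue(maxsize=depth)

    def reader() -> None:
        it = iter(items)
        while True:
            chunk = list(islice(it, chunk_size))
            if not chunk:
                break
            buf.put(chunk)  # blocks while the queue is full, which bounds memory
        buf.put(_DONE)

    threading.Thread(target=reader, daemon=True).start()
    while True:
        chunk = buf.get()
        if chunk is _DONE:
            return
        yield chunk  # the reader fills the next chunk while the caller works


chunk_sizes = [len(chunk) for chunk in prefetch_chunks(range(10), chunk_size=4)]
```

With depth=1 at most one finished chunk waits in the queue while another is being read and a third is being processed, so memory stays proportional to chunk_size rather than to the file size. The sketch does not propagate reader exceptions to the consumer; production code would need to.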

arda.annotate.mapper.annotate_records(records, organism='human', seqtype='nt', *, threads=0, sensitivity=None, strand='both', map_d=True)

Annotate in-memory (id, sequence) records; return AIRR record dicts.

Parameters:

strand (str) – "both" (default, nt only) searches both strands and re-orients reverse-complement hits; "forward" searches the plus strand only. Ignored for protein input.
map_d (bool) – True (default) maps D segments into the junction of VDJ-locus hits (d_call/d2_call/np*); False skips D mapping (nt input only — D mapping never runs for protein input).
records (list[tuple[str, str]])
organism (str)
seqtype (str)
threads (int)
sensitivity (float | None)

Return type:

list[dict]

arda.annotate.mapper.build_index(organism='all', *, force=False)

(Re)build the precompiled mmseqs DBs shipped under database/.

Writes database/vdj/<org>/mmseqs/<seqtype>/db* plus a VERSION marker so the runtime can use them out of the box (and detect an mmseqs version mismatch). Skips up-to-date DBs unless force is True.

Parameters:

organism (str)
force (bool)

Return type:

None

Project reference region markup onto a query via the C++ hot path.

Takes a parsed mmseqs hit plus the reference entry for the matched scaffold and returns an AIRR-style record dict for the query.

Junction handling follows AIRR strictly: junction spans the conserved Cys104 through the [FW]118 that opens FR4; junction_aa starts with C and ends with F/W for a canonical rearrangement. A junction is reported even when not canonical (out-of-frame, missing the conserved residues). For an out-of-frame junction (V and J in different frames) the amino-acid translation inserts 1-2 N bases after the V germline end to restore the J frame; the codon that then contains an inserted N is rendered as _. The V/J split inside the junction is located from the transferred v_sequence_end / j_sequence_start.

arda.annotate.transfer.transfer_hit(query_id, query_seq, hit, ref, seqtype='nt', rev_comp=False, d_germlines=None, submitted_seq=None, anchors=None)

Build an AIRR record by projecting ref region coords onto the query.

query_seq is the coding-strand sequence all markup/coords/CIGARs are computed on. submitted_seq is the read AS SUBMITTED, stored verbatim in the AIRR sequence field; for a reverse-strand hit it is the reverse complement of query_seq and rev_comp is set, per AIRR (“if rev_comp is True, all output data are based on the reverse complement of sequence”). Defaults to query_seq (forward reads, where the two are identical).

Parameters:

query_id (str)
query_seq (str)
hit (dict)
ref (RefEntry)
seqtype (str)
rev_comp (bool)
d_germlines (list[tuple[str, str]] | None)
submitted_seq (str | None)
anchors (dict | None)

Return type:

dict

Load the curated reference markup for runtime projection.

For nucleotide annotation we use markup.tsv + alleles.fasta; for amino acid annotation markup.aa.tsv + alleles.aa.fasta. Both expose region *_start/*_end columns in the same coordinate space (nt or aa), so the projection code is identical.

class arda.annotate.reference.RefEntry(locus, v_call, j_call, starts, ends, v_sequence_end=0, j_sequence_start=0, c_call='', vj_end=0)

Bases: object

Per-scaffold reference markup: region coords (in target space) + calls.

Parameters:

locus (str)
v_call (str)
j_call (str)
starts (list[int])
ends (list[int])
v_sequence_end (int)
j_sequence_start (int)
c_call (str)
vj_end (int)

locus: str

v_call: str

j_call: str

starts: list[int]

ends: list[int]

v_sequence_end: int

j_sequence_start: int

c_call: str

vj_end: int

property is_jc: bool

A constant-region scaffold: a J followed by the CH1 exon, with no V.

class arda.annotate.reference.Reference(organism, seqtype, target_fasta, entries, d_germlines, anchors=<factory>)

Bases: object

In-memory reference for one (organism, seqtype).

Parameters:

organism (str)
seqtype (str)
target_fasta (Path)
entries (dict[str, RefEntry])
d_germlines (dict[str, list[tuple[str, str]]])
anchors (dict)

organism: str

seqtype: str

target_fasta: Path

entries: dict[str, RefEntry]

d_germlines: dict[str, list[tuple[str, str]]]

anchors: dict

get(scaffold_id)

Parameters:

scaffold_id (str)

Return type:

RefEntry | None

arda.annotate.reference.load_reference(organism, seqtype='nt')

Load reference markup + target FASTA path for an organism.

Parameters:

organism (str)
seqtype (str)

Return type:

Reference

Sequence I/O: streaming FASTA/FASTQ readers and chunking.

Native parsing (no BioPython). Transparently handles gzip by .gz extension.

arda.annotate.io.open_text(path)

Open a (possibly gzipped) text file for reading.

Parameters:

path (str | Path)

arda.annotate.io.read_sequences(path, *, with_qual=False)

Yield (id, sequence) from a FASTA or FASTQ file (auto-detected).

with_qual=True yields (id, sequence, qual) instead, where qual is the FASTQ Phred string, or None for FASTA (which has no quality). The default behaviour is unchanged and costs nothing extra: the quality line is consumed either way and only kept when asked for (it is needed solely by the paired overlap-merge, arda.rnaseq.map.merge_pair()).

Parameters:

path (str | Path)
with_qual (bool)

Return type:

Iterator[tuple]

arda.annotate.io.detect_format(path)

Return "fasta" or "fastq" by peeking at the first non-empty char.

Parameters:

path (str | Path)

Return type:

str
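Format detection by the first non-empty character relies on FASTA headers starting with > and FASTQ headers with @. A minimal sketch of that check follows; sniff_format is a hypothetical name and, unlike detect_format, it takes an open text handle rather than a path so that it runs without any file on disk.

```python
import io
from typing import TextIO


def sniff_format(handle: TextIO) -> str:
    """Return "fasta" or "fastq" from the first non-blank line of a text handle."""
    for line in handle:
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped[0] == ">":
            return "fasta"
        if stripped[0] == "@":
            return "fastq"
        raise ValueError(f"unrecognised first character: {stripped[0]!r}")
    raise ValueError("empty input")


fasta_kind = sniff_format(io.StringIO("\n>read1\nACGT\n"))
fastq_kind = sniff_format(io.StringIO("@read1\nACGT\n+\nIIII\n"))
```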

arda.annotate.io.write_fasta(records, path)

Write (id, sequence) records to a FASTA file.

Parameters:

records (Iterator[tuple[str, str]])
path (str | Path)

Return type:

Path

arda.annotate.io.chunked(it, size)

Yield lists of up to size items from an iterator.

Parameters:

it (Iterator)
size (int)

Return type:

Iterator[list]

Per-segment AIRR CIGAR strings from the mmseqs scaffold alignment.

arda aligns a query to a V + N*pad + J [+ C] scaffold, not to each germline segment separately. AIRR wants a CIGAR per segment (v_cigar/j_cigar/c_cigar) whose reference is that segment’s germline; since the scaffold’s V part IS the V germline (target position == germline position), its J part IS the J allele, and its C part the CH1 exon, each segment’s CIGAR is the sub-walk of the one query->scaffold alignment whose target falls in that segment’s range.

CIGAR operators follow the AIRR spec (SAM subset):

S – query positions before the alignment starts (query 5’ offset). Required, precedes N.
N – reference positions before the alignment starts (germline 5’ offset). Required.
M / I (gap in reference) / D (gap in query) – the aligned body.
trailing S (query 3’ remainder) is emitted; trailing N (germline 3’ remainder) is optional per the spec and omitted (arda does not always know the full germline length, e.g. the C-region CH1 exon is longer than the shipped stub).

segment_cigars builds all three in a SINGLE pass over the aligned strings.

Correcting cigars for CONTIGS (Stage 3). A contig is just a long query, so BOTH ways to get its cigars end in segment_cigars and produce the same record (see arda.annotate.contig):

RE-ANNOTATE the assembled contig through mapper.annotate_records – one mmseqs alignment, then segment_cigars. No cigar arithmetic; check_cigar validates it.

MERGE the reads’ existing alignments column-by-column into the contig’s (C++ _markup.merge_alignment), skipping the alignment pass.

Both paths are implemented and tested to be byte-for-byte equal (tests/unit/test_contig_merge.py on 29 real GenBank contigs). Measured (arda-benchmark scripts/bench_contig_cigars.py): at scRNA-seq scale (~10^5 contigs/sample) merge is ~9x faster. The whole gap is the mmseqs alignment pass; the C++ stitch is ~3 % of merge's wall time and barely grows with read depth. Prefer merge when the assembly layout is available (the reads carry their scaffold + offset); re-annotate is the fallback when it is not.

arda.annotate.cigar.parse_cigar(cigar)

"57S291M1054S" -> [(57,"S"), (291,"M"), (1054,"S")]. Inverse of build_cigar().

Parameters:

cigar (str)

Return type:

list[tuple[int, str]]
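The parsing step can be reproduced with a single regular expression. This is a sketch, not arda's code: parse_cigar_sketch is a hypothetical name, and the accepted operator set (M, I, D, N, S, =, X) is taken from the operators mentioned in this module's documentation.

```python
import re
from typing import List, Tuple

_CIGAR_OP = re.compile(r"(\d+)([MIDNS=X])")


def parse_cigar_sketch(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operator) pairs."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Re-joining must give back the input, otherwise something was skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


parsed = parse_cigar_sketch("57S291M1054S")
```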

arda.annotate.cigar.cigar_query_length(cigar)

Query (read/contig) bases the CIGAR spans – M/I/S/=/X; D and N are reference-side.

Parameters:

cigar (str)

Return type:

int

arda.annotate.cigar.cigar_reference_length(cigar)

Reference (germline) bases the CIGAR spans – M/D/N/=/X; I and S are query-side.

Parameters:

cigar (str)

Return type:

int

arda.annotate.cigar.check_cigar(cigar, query_len)

A CIGAR is consistent with a query of query_len iff its query-side ops sum to it.

This is the invariant a corrected/re-annotated sequence (a read OR an assembled contig – a contig is just a long query) must satisfy: v_cigar/j_cigar/c_cigar each lay over the WHOLE sequence, soft-clipping the parts outside their own segment. Use it to validate a cigar after correcting or re-deriving it.

Parameters:

cigar (str)
query_len (int)

Return type:

bool
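The query-side and reference-side sums, and the consistency check built on them, can be sketched as follows. The function names are hypothetical stand-ins for cigar_query_length, cigar_reference_length and check_cigar; only the operator sets are taken from the descriptions above.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNS=X])")
QUERY_OPS = frozenset("MIS=X")      # operators that consume query bases
REFERENCE_OPS = frozenset("MDN=X")  # operators that consume reference bases


def _span(cigar: str, ops: frozenset) -> int:
    """Sum the lengths of the operators in `ops`."""
    return sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in ops)


def query_length(cigar: str) -> int:
    return _span(cigar, QUERY_OPS)


def reference_length(cigar: str) -> int:
    return _span(cigar, REFERENCE_OPS)


def is_consistent(cigar: str, query_len: int) -> bool:
    """True when the query-side operators cover exactly query_len bases."""
    return query_length(cigar) == query_len


v_cigar = "57S291M1054S"
```

For v_cigar the query side is 57 + 291 + 1054 = 1402 bases and the reference side is 291, so the CIGAR is consistent with a 1402-base query and with no other length.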

arda.annotate.cigar.build_cigar(q_lead, g_lead, ops, q_trail)

Assemble one AIRR CIGAR: {q_lead}S {g_lead}N <body> {q_trail}S (parts of length 0 are dropped). ops is the per-column M/I/D sequence of the aligned body; consecutive equal operators are run-length encoded. Trailing germline N is intentionally omitted (optional).

Parameters:

q_lead (int)
g_lead (int)
ops (list[str])
q_trail (int)

Return type:

str
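The assembly rule (leading S, leading N, run-length encoded body, trailing S, zero-length parts dropped) can be sketched with itertools.groupby. build_cigar_sketch is a hypothetical name; the parameter meanings follow the description above.

```python
from itertools import groupby
from typing import List


def build_cigar_sketch(q_lead: int, g_lead: int, ops: List[str], q_trail: int) -> str:
    """Assemble {q_lead}S{g_lead}N<body>{q_trail}S, dropping zero-length parts."""
    parts = []
    if q_lead:
        parts.append(f"{q_lead}S")
    if g_lead:
        parts.append(f"{g_lead}N")
    # Run-length encode consecutive equal per-column operators.
    parts.extend(f"{len(list(run))}{op}" for op, run in groupby(ops))
    if q_trail:
        parts.append(f"{q_trail}S")
    return "".join(parts)


v_cigar = build_cigar_sketch(57, 0, ["M"] * 291, 1054)
gapped = build_cigar_sketch(3, 10, list("MMMIIDM"), 0)
```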

arda.annotate.cigar.segment_cigars(qaln, taln, qstart, tstart, qlen, t_vend, t_jstart, t_vjend)

Return {"v_cigar":…, "j_cigar":…, "c_cigar":…} (only the segments that have a body).

qaln/taln are the mmseqs aligned strings (- for gaps), qstart/tstart their 1-based query/target start, qlen the full query length. Boundaries are 1-based scaffold positions; pass 0 for an absent segment.

Parameters:

qaln (str)
taln (str)
qstart (int)
tstart (int)
qlen (int)
t_vend (int)
t_jstart (int)
t_vjend (int)

Return type:

dict[str, str]

D-segment mapping on a bare nucleotide junction — no read, no mmseqs search.

transfer._map_d already maps D into the V..J interior of a query, but it is fed v_sequence_end / j_sequence_start projected from an mmseqs scaffold hit. A VDJdb-style record has no read to align: it has a junction and a V/J call. The per-allele germlines shipped in database/vdj/<org>/cdr3_anchors.tsv close that gap, so the interior can be located directly and the existing mapper reused.

Junction space, as everywhere in arda.cdr3fix: the input runs Cys104 -> Phe/Trp118 inclusive.

Finding the interior. The V germline is exact at the junction’s 5’ end (V/D/J are not somatically mutated in TCR, and IGH mutation is rare this close to the anchor), so the V contribution is the longest common prefix of the junction and the V’s CDR3-region germline; the J contribution is the longest common suffix. Validated against OLGA ground truth on 1300 junctions across human IGH/TRB/TRD and mouse TRB: the prefix length is exact for 80-85 % of records and never underestimates (it can overshoot by 1-2 nt when the first N-region base happens to match), and the derived interior contains the whole true D segment in 94-99 % of records.

Only IGH, TRB and TRD have D germlines; VJ loci return an empty call.
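The prefix/suffix rule for finding the interior can be sketched in a few lines. This is an illustration under stated assumptions, not arda's implementation: locate_interior is a hypothetical name, the coordinates it returns are 0-based half-open (unlike the 1-based closed coordinates of DCall), and the three sequences are toy data rather than real germlines.

```python
from typing import Tuple


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def locate_interior(junction: str, v_germline: str, j_germline: str) -> Tuple[int, int]:
    """Return (start, end) of the V..J interior, 0-based half-open.

    v_germline is assumed to start at the Cys104 codon and j_germline
    to end at the Phe/Trp118 codon, i.e. both are in junction orientation.
    """
    v_len = common_prefix_len(junction, v_germline)
    j_len = common_prefix_len(junction[::-1], j_germline[::-1])
    # Keep the two germline contributions from overlapping on short junctions.
    j_len = min(j_len, len(junction) - v_len)
    return v_len, len(junction) - j_len


junction = "TGTGCCGGGACAGCTTTCTTT"  # toy junction, 21 nt
v_toy = "TGTGCCAGC"                 # toy V CDR3-region germline
j_toy = "AACTTTCTTT"                # toy J germline up to the anchor

start, end = locate_interior(junction, v_toy, j_toy)
interior = junction[start:end]
```

Here the V prefix matches 6 nt and the J suffix 8 nt, leaving the 7-nt interior GGGACAG for D mapping. As the text notes, a prefix match can overshoot when the first N-region base happens to equal the germline base; this sketch does nothing to correct for that.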

class arda.annotate.dmap.DCall(locus='', d_call='', d_sequence_start=-1, d_sequence_end=-1, d_support='', d2_call='', d2_sequence_start=-1, d2_sequence_end=-1, d2_support='', np1='', np2='', np3='', v_sequence_end=-1, j_sequence_start=-1, extra=<factory>)

Bases: object

D mapping of one junction. Coordinates are 1-based closed, junction space.

Parameters:

locus (str)
d_call (str)
d_sequence_start (int)
d_sequence_end (int)
d_support (str)
d2_call (str)
d2_sequence_start (int)
d2_sequence_end (int)
d2_support (str)
np1 (str)
np2 (str)
np3 (str)
v_sequence_end (int)
j_sequence_start (int)
extra (dict)

locus: str = ''

d_call: str = ''

d_sequence_start: int = -1

d_sequence_end: int = -1

d_support: str = ''

d2_call: str = ''

d2_sequence_start: int = -1

d2_sequence_end: int = -1

d2_support: str = ''

np1: str = ''

np2: str = ''

np3: str = ''

v_sequence_end: int = -1

j_sequence_start: int = -1

extra: dict

property called: bool

property is_dd: bool

arda.annotate.dmap.map_d_junction(junction_nt, v_call, j_call, species='human')

Map D (and a tandem second D) into a bare nucleotide junction.

Parameters:

junction_nt (str)
v_call (str)
j_call (str)
species (str)

Return type:

DCall

Two ways to give an assembled contig its AIRR cigars — and they agree.

A Stage-3 contig is a consensus of reads that Stage 1 already aligned to a scaffold. Its v_cigar/j_cigar/c_cigar + alignment strings can be produced two ways:

reannotate_contigs() — treat the contig as one long query and run it back through annotate_records() (one mmseqs alignment, then segment_cigars). Simple, exact, no new code; the cost is a second alignment pass.
merge_contigs() — stitch the reads’ existing alignments into the contig’s (the C++ _markup.merge_alignment per-column consensus over N reads), skipping the alignment pass. Wins when a sample has ~10^5 contigs (scRNA-seq).

Both converge on the same synthetic hit and reuse transfer_hit(), so their output is field-for-field comparable. Which is optimal is a measured question — see tests/unit/test_contig_merge.py and the arda-benchmark Phase-D benchmark.

class arda.annotate.contig.ReadPlacement(qaln, taln, qstart, tstart, offset)

Bases: object

One read’s placement in a contig: its scaffold alignment + contig offset.

qaln/taln are the read’s coding-strand aligned strings vs the scaffold (- for gaps, as Stage 1 emits). qstart/tstart are 1-based starts in the read / scaffold. offset is the 0-based position of the read’s first base within the contig (the assembly layout), in contig orientation.

Parameters:

qaln (str)
taln (str)
qstart (int)
tstart (int)
offset (int)

qaln: str

taln: str

qstart: int

tstart: int

offset: int

class arda.annotate.contig.Contig(sequence_id, sequence, target, reads=<factory>)

Bases: object

An assembled contig + the reads it was built from, all hitting one scaffold.

Parameters:

sequence_id (str)
sequence (str)
target (str)
reads (list[ReadPlacement])

sequence_id: str

sequence: str

target: str

reads: list[ReadPlacement]

arda.annotate.contig.reannotate_contigs(records, organism='human', seqtype='nt', *, threads=0, sensitivity=None, strand='both', map_d=True)

Annotate assembled contigs by re-aligning them (baseline path).

records are (contig_id, contig_seq). A thin wrapper over annotate_records(): a contig is just a long query.

Parameters:

records (list[tuple[str, str]])
organism (str)
seqtype (str)
threads (int)
sensitivity (float | None)
strand (str)
map_d (bool)

Return type:

list[dict]

arda.annotate.contig.merge_contig(contig, reference, *, map_d=True)

Annotate one contig by stitching its reads’ alignments (merge path).

reference is a preloaded Reference (load it once for a whole sample; see merge_contigs()). Raises KeyError if the contig’s target scaffold is absent from the reference.

Parameters:

contig (Contig)
reference (Reference)
map_d (bool)

Return type:

dict

arda.annotate.contig.merge_contigs(contigs, organism='human', seqtype='nt', *, reference=None, map_d=True)

Annotate contigs by the merge path; loads the reference once for all of them.

Parameters:

contigs (list[Contig])
organism (str)
seqtype (str)
reference (Reference | None)
map_d (bool)

Return type:

list[dict]

Junction markup and repair

This section works from a bare (junction_aa, v_call, j_call, species) record (a VDJdb row, with no read to align) rather than from a sequenced fragment.

Markup and repair of bare (junction_aa, V, J) records — the VDJdb case.

Coordinate convention. Everything here is junction space: Cys104 through the Phe/Trp118 that opens FR4, both anchors included. That is what VDJdb’s cdr3 column actually holds (CASSARSGELFF with vEnd=4, jStart=7), and it is NOT arda’s cdr3 (which excludes both anchors). Conflating the two silently corrupts every coordinate emitted here.
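The difference between the two conventions is one residue at each end. A minimal sketch of the conversion, using the CASSARSGELFF example from the paragraph above; junction_to_cdr3 is a hypothetical helper, not part of arda.cdr3fix.

```python
def junction_to_cdr3(junction_aa: str) -> str:
    """Strip both conserved anchors from a canonical junction.

    Junction space runs Cys104..Phe/Trp118 inclusive; the anchor-free
    cdr3 is the same string without its first and last residue.
    """
    if len(junction_aa) < 2 or junction_aa[0] != "C" or junction_aa[-1] not in "FW":
        raise ValueError(f"not a canonical junction: {junction_aa!r}")
    return junction_aa[1:-1]


cdr3 = junction_to_cdr3("CASSARSGELFF")
```

Because the leading Cys is dropped, every position expressed in junction space sits one residue further along than the same position in the anchor-free cdr3. That one-residue shift is what gets lost when the two conventions are mixed.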

The V and J germlines each template a known run of residues into the junction, and database/vdj/<organism>/cdr3_anchors.tsv ships them per allele. So marking up a record needs no germline search: align the junction’s 5’ end against the V’s templated residues (anchored at Cys104) and its 3’ end against the J’s (anchored at [FW]118), and read off the edit operations.

Both alignments are one semi-global Needleman-Wunsch anchored at the conserved residue with free end gaps on the junction-interior side. The free end gap is what makes the result honest: the germline templated run is an upper bound (V and J are exonuclease-trimmed), so the alignment stops wherever germline agreement stops paying for itself, and the untemplated N/D region is never scored. Concretely:

germline CASS    vs CCSS...   -> sub at index 1, d=1 -> repaired to CASS...
germline CASS    vs CGGS...   -> v_end = 1, no error (that is the V/N boundary)
germline TNEKLFF vs ...NEKLF  -> deletion, d=0 -> repaired to ...NEKLFF
germline TNEKLFF vs ...NNKLFF -> sub at index 8, d=4 -> REPORTED, not repaired

Detection and repair are deliberately separate: every germline disagreement is reported (with its position, extent and distance from the anchor), but only edits adjacent to the conserved anchor are applied. See _MAX_REPLACE for why.

Repair always targets a canonical junction. cdr3_repaired is only accepted when it opens with Cys104 and closes with Phe/Trp118 (_canonicalise); a repair that would hand back a junction missing either anchor is refused and the submission returned untouched. A repair exists to restore the anchors, and downstream every consumer trusts cdr3_repaired. So good implies canonical, by construction rather than by luck.

Fix-type names mirror VDJdb’s Cdr3Fixer so its cdr3fix JSON is directly comparable — on the committed 250-row fixture arda now reproduces VDJdb’s repair on all 100 records it flags, and agrees with its good/vCanonical/jCanonical verdicts on every row. The per-position errors list is arda’s addition.

class arda.cdr3fix.Cdr3Error(side, kind, pos, length, frm, to, dist=0, applied=False)

Bases: object

One edit between the observed junction and the germline-templated run.

pos indexes the observed junction and length is how far the error extends. frm is what the record has, to what the germline says. dist is the distance from the conserved anchor.

applied is true only when this edit was actually written into Cdr3Markup.cdr3_repaired. Being within _MAX_REPLACE of the anchor makes an edit eligible; the whole side’s repair is still discarded if its fix type comes back Failed* — no alignment, more than _MAX_FIX invented residues, more than _MAX_TRIM trimmed ones, or a result that would not be canonical. An error can therefore be reported with applied=False and the junction left alone — which is the point: detection and repair are separate decisions.

Parameters:

side (str)
kind (str)
pos (int)
length (int)
frm (str)
to (str)
dist (int)
applied (bool)

side: str

kind: str

pos: int

length: int

frm: str

to: str

dist: int = 0

applied: bool = False

class arda.cdr3fix.Cdr3Markup(cdr3, cdr3_repaired, v_call='', j_call='', locus='', species='', v_end=-1, j_start=-1, v_fix='FailedBadSegment', j_fix='FailedBadSegment', errors=<factory>, sequence_id='')

Bases: object

Result of marking up one (junction_aa, V, J) record.

Parameters:

cdr3 (str)
cdr3_repaired (str)
v_call (str)
j_call (str)
locus (str)
species (str)
v_end (int)
j_start (int)
v_fix (str)
j_fix (str)
errors (list[Cdr3Error])
sequence_id (str)

cdr3: str

cdr3_repaired: str

v_call: str = ''

j_call: str = ''

locus: str = ''

species: str = ''

v_end: int = -1

j_start: int = -1

v_fix: str = 'FailedBadSegment'

j_fix: str = 'FailedBadSegment'

errors: list[Cdr3Error]

sequence_id: str = ''

property v_canonical: bool

Does the junction as repaired open with the conserved Cys104?

Read off cdr3_repaired, not the submission – restoring the anchor is the whole point of the repair, and VDJdb’s vCanonical/jCanonical mean the same thing. Reading the submission instead disagreed with VDJdb on 76 of 250 fixture rows.

property j_canonical: bool

Does the junction as repaired close with the conserved Phe/Trp118?

property good: bool

Both sides repaired, and the result carries both conserved anchors.

A repair exists to produce a canonical junction. One that ends up without its Cys104 or its Phe/Trp118 has not repaired the record, it has invented a junction nobody submitted – so it can never be good (see _canonicalise).

property fix_needed: bool

to_cdr3fix()

The VDJdb cdr3fix JSON object, key-for-key.

Return type:

dict

explain()

One human-readable line: what happened, why, and where.

Return type:

str

class arda.cdr3fix.Anchor(locus, segment, templated_aa, functionality, status, source, anchor_nt=-1, partial_nt=0, germline_nt='')

Bases: object

A germline segment’s contribution to the junction.

Parameters:

locus (str)
segment (str)
templated_aa (str)
functionality (str)
status (str)
source (str)
anchor_nt (int)
partial_nt (int)
germline_nt (str)

locus: str

segment: str

templated_aa: str

functionality: str

status: str

source: str

anchor_nt: int = -1

partial_nt: int = 0

germline_nt: str = ''

arda.cdr3fix.load_anchors(organism)

{(segment, allele): Anchor} for one organism; {} if not built.

Parameters:

organism (str)

Return type:

dict[tuple[str, str], Anchor]

arda.cdr3fix.markup_cdr3(cdr3, v_call, j_call, species='human', *, anchors=None, sequence_id='', max_replace=1)

Mark up and repair one junction. cdr3 is junction space (C..[FW]).

max_replace is how far from the conserved anchor an edit may sit and still be repaired; edits beyond it are reported with applied=False. Raising it repairs more, at the cost of rewriting N-region residues that merely look like germline (see _MAX_REPLACE).

Parameters:

cdr3 (str)
v_call (str)
j_call (str)
species (str)
anchors (dict | None)
sequence_id (str)
max_replace (int)

Return type:

Cdr3Markup

arda.cdr3fix.markup_records(df, *, cdr3='cdr3', v='v', j='j', species='species', sequence_id=None, organism=None, max_replace=1)

Mark up a whole table. Anchors are loaded (and cached) once per organism.

Parameters:

df (DataFrame)
cdr3 (str)
v (str)
j (str)
species (str)
sequence_id (str | None)
organism (str | None)
max_replace (int)

Return type:

list[Cdr3Markup]

arda.cdr3fix.markup_batch(df, **kw)

markup_records + to_frame.

Parameters:

df (DataFrame)

Return type:

DataFrame

arda.cdr3fix.to_frame(records)

Records -> a TSV-ready frame with the vdjdb-compatible cdr3fix column.

Parameters:

records (Iterable[Cdr3Markup])

Return type:

DataFrame

arda.cdr3fix.format_report(records, *, show_ok=False)

Human-readable log: a summary table, then a line per fixed/failed record.

show_ok=True also lists the records that needed no repair.

Parameters:

records (Iterable[Cdr3Markup])
show_ok (bool)

Return type:

str

Which D, and where — from an amino-acid junction alone.

A VDJdb-style record has no nucleotides, and D segments are short and trimmed at both ends, so the D is often invisible in the translated junction. Two independent sources of information remain, and they are complementary:

Where. The junction’s nucleotide length is known (3x its amino-acid length), so insVD + |D surviving| + insDJ is pinned. Marginalising the generative model’s insertion-length and D-trimming distributions therefore places the D even when the sequence says nothing at all about it. Measured against OLGA ground truth, the MAP d_start is a median 1 nt off for mouse TRB, 2 nt for human TRB, and 3 nt for TRD and IGH.

Which. The length constraint is nearly useless for identity — the D length distributions overlap, so the posterior barely moves off the prior. What identity the prior does carry is P(D | J), and for TRB that is mostly genomic order: TRBD2 lies 3’ of the whole TRBJ1 cluster, so a TRBJ1 junction can only have used TRBD1 (see _mask_forbidden). What otherwise identifies a D is the amino-acid match, and only where enough D survives: median surviving D is 17 nt for IGH (~5.7 aa) but 5 nt for human TRB (~1.7 aa).

So neither source alone is enough, and which one dominates flips by locus:

locus       prior only   aa only   combined   n (held-out seed, generated)
human IGH   15 %         81 %      82 %       345
human TRB   76 %         70 %      82 %       595
human TRD   86 %         88 %      87 %       699
mouse TRB   76 %         83 %      85 %       699

“prior only” is beta = 0; “aa only” is argmax of the match score under a uniform prior, ties broken by marginal usage. Combining wins at IGH and both TRB; TRD is a wash, because one D gene (TRDD3) accounts for 85 % of rearrangements and the aa match already finds it.

The combination is log P(D | M, J) + beta * s_D: the length-and-J prior, tempered by the best gapless local alignment score s_D of the D’s three-frame translations against the non-templated middle of the junction. beta is fitted per locus and shipped in database/vdj/<org>/d_prior.tsv with the distributions themselves, so nothing here needs OLGA at runtime. It is flat above ~1.25 for TRB, so the shipped values are not delicate.
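The way beta trades the prior against the amino-acid match can be sketched numerically. This is an illustration only: combine is a hypothetical name, normalising the combined score with a softmax is an assumption about how the posterior is obtained, and the prior and score values below are invented for the example.

```python
import math
from typing import Dict


def combine(log_prior: Dict[str, float], scores: Dict[str, float], beta: float) -> Dict[str, float]:
    """Normalise log P(D | M, J) + beta * s_D into a posterior over D genes."""
    logits = {d: lp + beta * scores.get(d, 0.0) for d, lp in log_prior.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    weights = {d: math.exp(v - top) for d, v in logits.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


# Invented numbers: a flat prior and a match score that favours TRBD2.
log_prior = {"TRBD1": math.log(0.5), "TRBD2": math.log(0.5)}
scores = {"TRBD1": 0.0, "TRBD2": 2.0}

prior_only = combine(log_prior, scores, beta=0.0)
combined = combine(log_prior, scores, beta=1.0)
```

With beta = 0 the result is the prior unchanged, which is the "prior only" column of the table; raising beta lets the match score pull the posterior toward the better-matching D.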

Honesty about the numbers. The table is measured on junctions drawn from the same generative model that supplies the prior, so the prior’s contribution is flattered. The amino-acid contribution is not — it is germline matching. Rearrangements that genomic order forbids are excluded from the truth: OLGA’s human TRB model emits TRBD2 x TRBJ1 in 8.7 % of draws, and scoring against those measures agreement with a model artifact.

Out of model, against nucleotide D calls (E <= 0.05) on the real GenBank fixtures: human TRB 94 %, IGH 85 %, TRD 91 %, mouse TRB 85 %. On TRB, note that both this posterior and the nucleotide caller enforce the same D2-x-J1 constraint, so their agreement on TRBJ1 records is guaranteed rather than earned; the TRBJ2 rows, where both D genes stay possible, score 91 % (human) and 81 % (mouse).

Priors exist only for the (organism, locus) pairs with a published model: human IGH, TRB and TRD, and mouse TRB. Everything else returns None rather than guessing.

class arda.dpost.DPosterior(locus, d_call, posterior, entropy, by_gene=<factory>, support_aa=0, d_start=-1, d_start_ci90=(-1, -1), n_middle_nt=0)

Bases: object

Posterior over the D gene, and over where it sits in the junction.

Parameters:

locus (str)
d_call (str)
posterior (float)
entropy (float)
by_gene (dict[str, float])
support_aa (int)
d_start (int)
d_start_ci90 (tuple[int, int])
n_middle_nt (int)

locus: str

d_call: str

posterior: float

entropy: float

by_gene: dict[str, float]

support_aa: int = 0

d_start: int = -1

d_start_ci90: tuple[int, int] = (-1, -1)

n_middle_nt: int = 0

property confident: bool

A hard call. 0.9 keeps ~the top decile of TRB records and most of IGH.

arda.dpost.posterior_d(junction_aa, v_call, j_call, species='human')

Posterior over the D gene (and its position) for an amino-acid junction.

junction_aa is junction space (Cys104 .. Phe/Trp118, both included), as in arda.cdr3fix. Returns None when the locus has no D, no shipped model, or the junction cannot be marked up.

Parameters:

junction_aa (str)
v_call (str)
j_call (str)
species (str)

Return type:

DPosterior | None

arda.dpost.load_d_prior(organism)

{locus: DPrior}; empty when the organism has no shipped model.

Parameters:

organism (str)

Return type:

dict[str, DPrior]

Bulk RNA-seq

Stage 1 — RNA-seq filter + map.

Reuses the streaming, memory-bounded annotator (annotate.mapper._prep + _annotate_chunk + the background-reader/bounded-queue loop of annotate_file): MMseqs2 is the parallel layer and its k-mer prefilter rejects non-receptor reads before alignment, so mostly-non-receptor RNA-seq is cheap. The difference from arda annotate is that we write only the reads that map (keyed by read id, so the AIRR TSV is the read-id → junction map), plus an optional candidate FASTA and a run report.

Paired FASTQ mates are streamed independently, tagged <id>/1 / <id>/2 so query ids stay unique; a pair is kept if either mate maps (recall-first — the base id recovers the pair).

arda.rnaseq.map.map_rnaseq(r1, output, *, r2=None, organism='human', seqtype='nt', threads=0, sensitivity=None, strand='both', chunk_size=200000, map_d=True, reconstruct=False, min_score=75.0, max_seqs=300, kmer=-1, drop_constant_only=True, limit=None, emit_reads=None, report_path=None)

Filter + map an RNA-seq FASTQ (single or paired); write mapped reads as AIRR.

Parameters:

r1 (str | Path) – FASTA/FASTQ (gzip by .gz). Single-end, or R1 of a pair.
output (str | Path) – AIRR TSV of the mapped reads only (keyed by sequence_id).
r2 (str | Path | None) – R2 FASTQ for paired input; None for single-end.
min_score (float) – drop mapped reads below this MMseqs2 bit score. 0 disables the filter (recall-max). See _MIN_SCORE for the calibration.
kmer (int | None) – MMseqs2 -k. The memory knob: the nucleotide prefilter allocates 4**k index entries, so the tool default k=15 costs ~8.4 GB peak RSS whatever else you set. arda defaults to 13 (~0.7 GB, and never slower). None = MMseqs2’s default.
max_seqs (int) – MMseqs2 target hits per read. Does not change which reads are kept, only which V/J scaffold wins. See arda.annotate.mapper._MAX_SEQS.
limit (int | None) – analyse only the first limit reads (single-end) / read pairs (paired), then stop — a native head, so a subsample no longer needs an external zcat | head | gzip round-trip. None maps the whole file.
emit_reads (str | Path | None) – optional path — write the mapped reads’ sequences as FASTA (coding-strand oriented) for downstream handoff.
report_path (str | Path | None) – optional path — write the RnaseqReport as JSON.
organism (str)
seqtype (str)
threads (int)
sensitivity (float | None)
strand (str)
chunk_size (int)
map_d (bool)
reconstruct (bool)
drop_constant_only (bool)

Returns:

The run RnaseqReport (also printed by the CLI).

Return type:

RnaseqReport

arda.rnaseq.map.read_pairs(r1, r2=None, *, reconstruct=False, limit=None)[source]#

Stream (id, sequence) reads for single-end (r1 only) or paired input.

For paired input the two mates carry the same id, so they are tagged <id>/1 and <id>/2 to keep query ids unique (strip the suffix to recover the pair). With reconstruct, overlapping mates are merged into one fragment (merge_pair()) keyed by the bare id — giving a short read the mate’s V/J context; non-overlapping mates fall back to the tagged-independent form.

Parameters:

limit (int | None) – analyse only the first limit input records — reads (single-end) or read pairs (paired) — then stop, without decompressing the rest of the file. None reads everything. The mate-order / truncation checks below still run on every record actually read; a truncation beyond limit is simply never reached — that is the intent of a head-style limit, not a hole in the check.
r1 (str | Path)
r2 (str | Path | None)
reconstruct (bool)

Raises:

ValueError – if the two files disagree on read names or record count. This is not paranoia: a truncated R2 makes zip stop early and silently analyse a prefix, and a shuffled R2 pairs mate 1 of one fragment with mate 2 of another. Both were observed in this project’s own data and produced a published false discovery (a spurious R2-only blind spot) that had to be retracted. A pair of FASTQs is an assertion; check it.

Return type:

Iterator[tuple[str, str]]
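The mate tagging and the pairing check described above can be sketched as follows. This is a simplified illustration, not arda's implementation: tag_pairs and base_id are hypothetical names, the inputs are already-parsed (id, sequence) iterables rather than FASTQ files, and reconstruct is left out.

```python
from itertools import islice, zip_longest
from typing import Iterable, Iterator, Optional, Tuple

Read = Tuple[str, str]  # (id, sequence)
_MISSING = object()     # sentinel marking the shorter of the two inputs


def base_id(name: str) -> str:
    """Strip a trailing /1 or /2 mate suffix to recover the pair id."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def tag_pairs(r1: Iterable[Read], r2: Iterable[Read],
              limit: Optional[int] = None) -> Iterator[Read]:
    """Yield <id>/1 then <id>/2 for each pair, failing loudly on bad pairing."""
    pairs = zip_longest(r1, r2, fillvalue=_MISSING)
    # islice stops after `limit` pairs, so records beyond it are never read.
    for m1, m2 in islice(pairs, limit):
        if m1 is _MISSING or m2 is _MISSING:
            raise ValueError("R1 and R2 differ in record count")
        if base_id(m1[0]) != base_id(m2[0]):
            raise ValueError(f"mate name mismatch: {m1[0]!r} vs {m2[0]!r}")
        name = base_id(m1[0])
        yield f"{name}/1", m1[1]
        yield f"{name}/2", m2[1]
```

Using zip_longest instead of zip is the point of the sketch: plain zip would stop silently at the shorter file, which is exactly the truncation failure the ValueError is meant to catch.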

arda.rnaseq.map.merge_pair(s1, s2, *, q1=None, q2=None, min_overlap=12, max_mismatch_rate=0.1)[source]#

Overlap-merge a read pair into one fragment, or None if they don’t overlap.

Aligns s1 (R1) with reverse_complement(s2) (R2 flipped to the same strand) by finding an exact _MERGE_ANCHOR-mer from the flipped R2’s 5’ end inside R1 (C-level str.find → O(len), so non-overlapping pairs — the common RNA-seq case — cost almost nothing), then verifying the implied overlap; the mate provides the V/J context a short read lacks.

In the overlap the two mates may disagree. Given Phred qualities q1/q2 the base with the higher quality wins per position (rc(R2)’s quality is q2 reversed, not complemented); without them R2 wins the whole overlap (the historical behaviour). Outside the overlap, R1 supplies its 5’ part and R2 its 3’ tail.

Parameters:

s1 (str)
s2 (str)
q1 (str | None)
q2 (str | None)
min_overlap (int)
max_mismatch_rate (float)

Return type:

str | None
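A minimal sketch of the anchor-then-verify merge, under stated assumptions: merge_sketch and revcomp are hypothetical names, anchor_len=12 is a guess (the value of _MERGE_ANCHOR is not given here), and only the no-quality branch is shown, so R2 wins the whole overlap.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse-complement an uppercase ACGTN string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_sketch(s1: str, s2: str, *, anchor_len: int = 12,
                 min_overlap: int = 12,
                 max_mismatch_rate: float = 0.1) -> Optional[str]:
    """Merge R1 with the reverse complement of R2, or return None."""
    rc2 = revcomp(s2)
    if len(rc2) < anchor_len:
        return None
    # Locate the 5' anchor of the flipped R2 inside R1 (exact match).
    start = s1.find(rc2[:anchor_len])
    if start < 0:
        return None
    # Verify the overlap implied by that anchor position.
    overlap = min(len(s1) - start, len(rc2))
    if overlap < min_overlap:
        return None
    mismatches = sum(a != b for a, b in zip(s1[start:start + overlap], rc2))
    if mismatches > max_mismatch_rate * overlap:
        return None
    # R1 supplies the 5' part; R2 supplies the overlap and the 3' tail.
    return s1[:start] + rc2 + s1[start + overlap:]
```

Because str.find returns immediately when the anchor is absent, a non-overlapping pair costs one substring search and nothing more.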

class arda.rnaseq.map.RnaseqReport(input, organism, total_reads=0, mapped_reads=0, per_locus=<factory>, constant_only_fragments=0, isotype_from_mate=0, min_score=0.0, threads=0, wall_seconds=0.0, reads_per_second=0.0, peak_rss_mb=0.0)[source]#

Bases: object

Counts + timing for one map run (written as JSON with --report).

Parameters:

input (str)
organism (str)
total_reads (int)
mapped_reads (int)
per_locus (dict[str, int])
constant_only_fragments (int)
isotype_from_mate (int)
min_score (float)
threads (int)
wall_seconds (float)
reads_per_second (float)
peak_rss_mb (float)

input: str#

organism: str#

total_reads: int = 0#

mapped_reads: int = 0#

per_locus: dict[str, int]#

constant_only_fragments: int = 0#

isotype_from_mate: int = 0#

min_score: float = 0.0#

threads: int = 0#

wall_seconds: float = 0.0#

reads_per_second: float = 0.0#

peak_rss_mb: float = 0.0#

property mapped_fraction: float#

as_dict()[source]#

Return type:

dict

Stage 3 — contig assembly (anchored greedy overlap-extension).

Reconstruct the clonotypes that Stage 1 maps but cannot call: a long CDR3 (V(DD)J ultralong, ~20-40 aa) does not fit in one 100-150 bp read, so no read spans the junction and correct’s complete-junction filter drops every read of it. Assembly-based extractors recover these by assembly; this module does the same on the reads Stage 1 already mapped.

Why overlap-extension and not a de Bruijn graph: every clonotype sharing a germline V/J contributes identical k-mers, so a dBG collapses distinct clones exactly across the CDR3 (the region of interest). This is the reason the Pevzner-lab Ig assemblers use a read graph, not a k-mer graph (Safonova 2015, 10.1093/bioinformatics/btv238). We exploit arda’s own anchors instead: Stage 1 gives every V-side read a cdr3_start offset, so reads of one clone are already coordinate-aligned at the CDR3 – we seed from those and extend 3’ through the CDR3 into J, where the sequence is clone-specific, and stop before running deep into the (shared) constant region. Seeding never extends 5’ into the germline V, which is what keeps distinct clones apart and bounds the germline-k-mer blow-up.
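The 3' extension step can be illustrated with a toy greedy extender. This is a sketch of the general overlap-extension idea only, not arda's assembler: extend_right and max_len are hypothetical, there is no cdr3_start anchoring or scan_cap, and reads are assumed to be on the contig's strand already.

```python
from typing import List, Optional


def extend_right(seed: str, reads: List[str], *, k: int = 21,
                 min_overlap: int = 21, min_id: float = 0.9,
                 max_len: Optional[int] = None) -> str:
    """Greedily extend `seed` in the 3' direction using overlapping reads."""
    contig = seed
    used = set()
    while max_len is None or len(contig) < max_len:
        tail = contig[-k:]                      # current 3' k-mer of the contig
        best = None                             # (read index, new 3' bases)
        for i, read in enumerate(reads):
            if i in used:
                continue
            pos = read.find(tail)
            if pos < 0:
                continue
            end = pos + len(tail)
            if end >= len(read):
                continue                        # read adds no new bases
            overlap = min(end, len(contig))
            if overlap < min_overlap:
                continue
            # Check identity over the whole implied overlap, not just the k-mer.
            a, b = read[end - overlap:end], contig[-overlap:]
            if sum(x == y for x, y in zip(a, b)) / overlap < min_id:
                continue
            ext = read[end:]
            if best is None or len(ext) > len(best[1]):
                best = (i, ext)
        if best is None:
            break                               # no read extends the contig
        used.add(best[0])
        contig += best[1]
    return contig if max_len is None else contig[:max_len]
```

The extender never looks 5' of the seed, which mirrors the design choice above: growth happens only towards the clone-specific side.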

The assembled contig physically unites a clone’s V-side reads with its J/C-side reads under one junction, so correct also gets the clone’s isotype (from the J/C mates’ c_class) for free – the long clones were previously invisible to both.

Output is a per-member-read AIRR fragment (sequence_id -> the contig’s complete junction), meant to be concatenated with the Stage-1 mapped AIRR and fed to correct in one pass: a read that was incomplete in Stage 1 is dropped there and kept here, so fragments count once.

arda.rnaseq.assemble.assemble_contigs(airr_tsv, output, *, organism='human', k=21, min_overlap=21, min_id=0.9, max_ext_past_cdr3=130, scan_cap=400, threads=0, map_d=True, report_path=None)[source]#

Assemble long-CDR3 contigs from Stage-1 mapped reads and attribute their junctions.

Reads the Stage-1 mapped-reads AIRR (needs sequence, rev_comp, locus, cdr3_start and the v/j/c_call columns), assembles per-locus contigs, re-annotates them (reannotate_contigs()), and writes an AIRR TSV with one row per incomplete member read carrying the contig’s complete junction (and the read’s own c_class so isotype survives). Concatenate this with the mapped AIRR and run correct once: the read’s incomplete Stage-1 row is dropped and this complete row kept, so each fragment is counted exactly once.

The contig’s D call travels with the junction (d_call, d2_call, d_support, d2_support, np1-np3). An ultralong CDR3 is the one place a tandem D-D is both most likely and least visible to a single read, so the contig is where it must be called.

Parameters:

airr_tsv (str | Path) – Stage-1 mapped-reads AIRR TSV.
output (str | Path) – assembled-reads AIRR TSV (header only if nothing assembles).
map_d (bool) – map D segments on the assembled contig (default True).
max_ext_past_cdr3 (int) – stop extending a contig once it reaches this many nt past the CDR3 start – enough to cross the junction into J without running into the shared C region.
scan_cap (int) – per-step cap on candidate reads examined for a (germline-frequent) k-mer.
organism (str)
k (int)
min_overlap (int)
min_id (float)
threads (int)
report_path (str | Path | None)

Returns:

An AssembleReport.

Return type:

AssembleReport

class arda.rnaseq.assemble.AssembleReport(reads_in=0, seeds=0, contigs=0, contigs_complete=0, reads_rescued=0)[source]#

Bases: object

Parameters:

reads_in (int)
seeds (int)
contigs (int)
contigs_complete (int)
reads_rescued (int)

reads_in: int = 0#

seeds: int = 0#

contigs: int = 0#

contigs_complete: int = 0#

reads_rescued: int = 0#

as_dict()[source]#

Return type:

dict

Stage 2 — CDR3 error correction (sequencing-error model).

Collapses sequencing-error CDR3 variants onto their parent clonotype, using seqtree neighbour search (a fast edit-bounded index) to find substitution/indel neighbours. A clonotype C is an error child of a more-abundant neighbour P (differing by n_subs substitutions and n_indel inserted/deleted bases) iff the expected number of such misread parent reads – count[P] * p_sub**n_subs * p_ind**n_indel – is at least count[C]. The rates are PER BASE and the per-mismatch probability is length-scaled (p_sub = error_rate * L): a single mismatch over a longer junction sheds proportionally more error mass, so the default error_rate = 0.001 reproduces vdjtools’ ~1/20 at a 45 nt (15 aa) junction and scales elsewhere. A multi-base (in-frame SHM) indel costs p_ind**len and is kept as a real clonotype. The count is the SPANNING read depth – reads that fully observe the junction – so the test is over the reads that actually saw the discriminating base ("2/2, not 2/200"); error_method in {binom, betabinom} instead piles up partial reads per position for extra depth at very low coverage. Children route to the parent; chains collapse to the ultimate ancestor; count[parent] * p_err >= count[child] with p_err < 1 gives strictly increasing counts along parent pointers, so there are no cycles.
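The collapse test can be written out directly from the formula above. The function name is_error_child is hypothetical, and the sketch covers only the simple spanning-count test, not the binom / betabinom pileup.

```python
def is_error_child(count_parent: int, count_child: int, n_subs: int,
                   n_indel: int, junction_len: int, *,
                   error_rate: float = 0.001,
                   indel_rate: float = 0.001) -> bool:
    """True if the child is explainable as misread copies of the parent."""
    # Per-mismatch probabilities are length-scaled from the per-base rates.
    p_sub = error_rate * junction_len
    p_ind = indel_rate * junction_len
    expected_errors = count_parent * p_sub ** n_subs * p_ind ** n_indel
    return expected_errors >= count_child
```

For a 45 nt junction and the default rates, p_sub is 0.045, so a parent with 100 spanning reads is expected to shed about 4.5 single-substitution reads: a neighbour with 4 reads collapses into it, one with 10 reads is kept as its own clonotype.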

arda.rnaseq.correct.correct_airr(airr_tsv, output, *, organism='human', map_d=True, max_subs=2, max_indel=0, error_rate=0.001, indel_rate=0.001, require_vj=True, error_method='simple', complete_only=True, coverage=True, read_map=None, extra_airr=None, report_path=None)[source]#

Aggregate mapped reads into clonotypes and collapse CDR3 sequencing errors.

Parameters:

airr_tsv (str | Path) – Stage-1 mapped-reads AIRR TSV (needs junction, sequence_id).
organism (str) – reference organism, used only to map D into each clonotype’s junction.
map_d (bool) – append d_call/d2_call/d_support/d2_support, called once per clonotype on its corrected junction (see _clonotype_d()). Default True.
output (str | Path) – corrected clonotype table TSV (junction, junction_aa, v_call, j_call, c_call, locus, duplicate_count, consensus_count, and with map_d the four D columns), sorted by abundance. A clonotype is keyed by (locus, v_call, j_call, junction). Per the AIRR schema, duplicate_count is the number of READS supporting the clonotype (both paired mates of a molecule count) and consensus_count is the number of distinct fragment consensuses (the two mates of one molecule are one consensus). c_call is the clonotype’s dominant isotype CLASS (from c_class: IGHG, IGHA, …), preferring a resolved class over the ambiguous IGHC; empty when no read carried a constant call.
max_subs (int) – max substitutions between an error child and its parent (seqtree neighbour search).
max_indel (int) – max inserted/deleted bases searched for indel error children (default 0). A 1-2 bp instrument indel is a frameshift and is already dropped by complete_only, so on complete junctions the indel search only costs time (~160x slower) and collapses nothing; a multi-base in-frame SHM indel costs (indel_rate*L)**len and is kept as a real clonotype either way. Set it > 0 only with --all-junctions (frameshift indels kept).
error_rate (float) – per-BASE substitution error rate (~Phred 30 = 0.001). The per-substitution collapse probability is length-scaled, error_rate * junction_len, so the default reproduces vdjtools’ ~1/20 at a 45 nt (15 aa) junction and scales for other lengths.
indel_rate (float) – per-BASE indel error rate (instrument-dependent; default 0.001, length-scaled).
error_method (str) – "simple" (default) tests on spanning read counts; "binom" / "betabinom" pile up partial reads per discriminating position for extra depth at very low coverage (_error_pileup()).
require_vj (bool) – only collapse neighbours sharing v_call and j_call (default True – a true sequencing error does not change the germline-anchored V/J call).
complete_only (bool) – keep only reads whose junction spans both conserved anchors, is in frame, and has no stop codon (see _COMPLETE). A read that stops short of the [FW]118 anchor yields a prefix of a junction, not a clonotype. Setting this False reproduces the raw per-read behaviour and is almost never what you want. (This governs which reads DEFINE clonotypes, not how they are counted – see coverage.)
coverage (bool) – count a clonotype’s abundance as EVERY read that encompasses its junction (aligns to it), not only the reads that span it end-to-end (default True). A long CDR3 is covered by many partial V-side / J-side reads that never reach both anchors; counting only spanning reads under-reports it non-uniformly (the deficit scales with CDR3 length). Coverage counting (_assign_coverage()) is the true expression estimate. False reverts to spanning-read counts.
read_map (str | Path | None) – optional TSV sequence_id -> junction (the corrected clonotype a read ends up in) — the read-id → junction map after correction.
extra_airr (str | Path | None) – optional Stage-3 assembled-reads AIRR (from assemble_contigs()), concatenated with airr_tsv before aggregation. Its rows carry a contig’s complete junction for reads whose own Stage-1 junction was incomplete, so a long-CDR3 clone no single read spans is counted once (the read’s incomplete Stage-1 row is dropped by complete_only).
report_path (str | Path | None)

Returns:

A CorrectReport.

Return type:

CorrectReport

class arda.rnaseq.correct.CorrectReport(clonotypes_in=0, clonotypes_out=0, reads=0, collapsed=0, reads_with_junction=0, reads_incomplete=0)[source]#

Bases: object

Parameters:

clonotypes_in (int)
clonotypes_out (int)
reads (int)
collapsed (int)
reads_with_junction (int)
reads_incomplete (int)

clonotypes_in: int = 0#

clonotypes_out: int = 0#

reads: int = 0#

collapsed: int = 0#

reads_with_junction: int = 0#

reads_incomplete: int = 0#

as_dict()[source]#

Return type:

dict

Reference build#

Orchestrate the per-species reference database build.

For each locus: enumerate deduplicated V-J scaffolds, annotate them with IgBLAST, keep those with complete FR1-FR4 + CDR1-3 markup, translate to protein, and derive protein markup. Writes the committed artifacts under database/vdj/<organism>/ plus a comprehensive build.log.

arda.refbuild.build.build(organism='all', *, one_allele_per_gene=False)[source]#

Build one organism or "all" supported organisms.

Parameters:

organism (str)
one_allele_per_gene (bool)

Return type:

None

arda.refbuild.build.build_species(organism, *, one_allele_per_gene=False)[source]#

Build the reference DB for one organism. Returns the output directory.

one_allele_per_gene builds scaffolds from a single representative allele per gene (*01 where it exists, else the lowest-numbered) – roughly a 4x smaller reference with no allele-level ambiguity. Off by default.

Parameters:

organism (str)
one_allele_per_gene (bool)

Return type:

Path

Enumerate in-frame V-J reference scaffolds.

For markup transfer the FR/CDR region coordinates are fully determined by the V gene (FR1-3, CDR1-2, CDR3 start at the conserved Cys104) and the J gene (CDR3 end, FR4). The D segment lies inside the hypervariable CDR3 — query-specific at runtime — so we enumerate V×J scaffolds for every locus and, for VDJ loci, insert a short frame-neutral N spacer where D would sit so IgBLAST still annotates a plausible CDR3 + FR4.

Each scaffold is V + N*pad + J where pad keeps the J coding frame aligned to V’s reading frame (jframe from the IgBLAST aux file). Byte-identical scaffolds are deduplicated: one DB entry, with all contributing (V,J) allele pairs recorded.
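One way to picture the padding and deduplication is the sketch below. It is an assumption-laden illustration, not the module's code: assemble and dedupe are hypothetical names, V is assumed to be read from its first base (frame 0), and the pad is taken to be the smallest length at or above the D spacer that puts J's first full codon on a codon boundary.

```python
from typing import Dict, Iterable, List, Tuple


def assemble(v_seq: str, j_seq: str, j_frame: int,
             d_spacer: int = 0) -> Tuple[str, int]:
    """Return (scaffold, n_pad) with J's first codon in V's reading frame."""
    # J's first full codon starts at len(V) + n_pad + j_frame in the scaffold;
    # choose n_pad >= d_spacer so that this position is a multiple of 3.
    n_pad = d_spacer + (-(len(v_seq) + d_spacer + j_frame)) % 3
    return v_seq + "N" * n_pad + j_seq, n_pad


def dedupe(candidates: Iterable[Tuple[str, str, str]]
           ) -> Dict[str, List[Tuple[str, str]]]:
    """Group (v_call, j_call, sequence) candidates by identical sequence."""
    by_sequence: Dict[str, List[Tuple[str, str]]] = {}
    for v_call, j_call, sequence in candidates:
        by_sequence.setdefault(sequence, []).append((v_call, j_call))
    return by_sequence
```

Keying the dictionary by the assembled sequence gives one entry per unique scaffold while keeping every contributing (V, J) allele pair.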

class arda.refbuild.combinations.Scaffold(scaffold_id, locus, sequence, v_calls=<factory>, j_calls=<factory>, n_pad=0)[source]#

Bases: object

A deduplicated V-J reference scaffold.

Fields: scaffold_id (stable "{locus}_{index}"), locus, sequence (assembled V + N*pad + J), v_calls / j_calls (all alleles producing this scaffold), and n_pad (N nucleotides between V and J).

Parameters:

scaffold_id (str)
locus (str)
sequence (str)
v_calls (list[str])
j_calls (list[str])
n_pad (int)

scaffold_id: str#

locus: str#

sequence: str#

v_calls: list[str]#

j_calls: list[str]#

n_pad: int = 0#

arda.refbuild.combinations.load_j_frames(organism)[source]#

Parse bin/optional_file/<organism>_gl.aux -> {J allele: frame}.

Frame is the 0-based “first coding frame start position” (column 2).

Parameters:

organism (str)

Return type:

dict[str, int]

arda.refbuild.combinations.load_j_fr4_offsets(organism)[source]#

Parse the same aux file -> {J allele: (cdr3_stop, extra_bp)}, both 0-based nt counts.

IgBLAST’s aux carries five columns: allele, coding-frame start, chain type, CDR3 stop, and extra bp beyond the J coding end. arda only ever read column 2. Columns 4 and 5 pin FR4 inside the J exactly:

fwr4 = j_seq[cdr3_stop + 1 : len(j_seq) - extra_bp]

That is how a J + C scaffold gets an FR4 at all: igblastn cannot annotate a V-less sequence, so those scaffolds are not routed through it (see refbuild.build).

Verified against every V-J scaffold arda builds: the string this yields is byte-identical to IgBLAST’s own fwr4 on all 125 human J alleles where both exist – including IgBLAST’s own non-multiple-of-3 cases (IGHJ6*02 has extra_bp = 0 and a 34 nt FR4; 726 V-J scaffolds already carry one). Reproducing IgBLAST, quirks included, is the requirement here: the two scaffold kinds must agree, or a J->C read and a V-J read of the same clone disagree on FR4.

Pseudogene J entries carry only three columns and are skipped – they have no FR4 to report.

Parameters:

organism (str)

Return type:

dict[str, tuple[int, int]]
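The five-column parse and the FR4 slice can be sketched as below. parse_aux_fr4 and fwr4_from_j are hypothetical names, the aux file is assumed to be whitespace-delimited with # comment lines, and the allele names and numbers in any example input are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def parse_aux_fr4(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map J allele -> (cdr3_stop, extra_bp) from five-column aux lines."""
    offsets: Dict[str, Tuple[int, int]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 5:
            continue  # e.g. three-column pseudogene entries: no FR4 to report
        allele, _frame, _chain, cdr3_stop, extra_bp = fields[:5]
        offsets[allele] = (int(cdr3_stop), int(extra_bp))
    return offsets


def fwr4_from_j(j_seq: str, cdr3_stop: int, extra_bp: int) -> str:
    """FR4 is everything after the CDR3 stop, minus the trailing extra bases."""
    return j_seq[cdr3_stop + 1: len(j_seq) - extra_bp]
```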

arda.refbuild.combinations.load_v_fwr3_stops(organism)[source]#

Parse bin/internal_data/<organism>/<organism>.ndm.imgt -> {V allele: FWR3 stop}.

IgBLAST’s own IMGT annotation of its V germlines. Column 11 is the 1-based FWR3 stop, and FR3-IMGT ends at position 104 – the conserved 2nd-CYS – so the Cys104 codon starts at fwr3_stop - 3 (0-based). This is the authoritative V junction anchor, and it is the same IgBLAST metadata the rest of the build already trusts.

It does NOT cover every IMGT allele (IgBLAST ships a subset), hence the motif fallback in refbuild.build._v_anchor.

Parameters:

organism (str)

Return type:

dict[str, int]
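The off-by-one between the 1-based FWR3 stop and the 0-based codon start is easy to get wrong, so a short sketch may help. cys104_codon is a hypothetical helper, and the V sequence in any example is a toy string rather than a real germline.

```python
def cys104_codon(v_seq: str, fwr3_stop: int) -> str:
    """Return the last codon of FR3 given the 1-based inclusive FWR3 stop."""
    # The codon occupies 1-based positions fwr3_stop-2 .. fwr3_stop,
    # i.e. the 0-based slice [fwr3_stop - 3, fwr3_stop).
    start = fwr3_stop - 3
    return v_seq[start:start + 3]
```

For a toy V of "GCCGCCGCCTGT" with fwr3_stop = 12, the slice starts at index 9 and returns the TGT cysteine codon.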

arda.refbuild.combinations.build_locus_scaffolds(locus, v_alleles, j_alleles, j_frames, *, d_spacer=None)[source]#

Build deduplicated V×J scaffolds for one locus.

Parameters:

locus (Locus) – The locus definition.
v_alleles (dict[str, str]) – {allele: ungapped V sequence}.
j_alleles (dict[str, str]) – {allele: ungapped J sequence}.
j_frames (dict[str, int]) – {J allele: 0-based coding frame} from the aux file.
d_spacer (int | None) – N spacer length for VDJ loci (default DEFAULT_D_SPACER_NT); forced to 0 for VJ loci.

Returns:

Scaffolds, one per unique assembled sequence.

Return type:

list[Scaffold]

Native nucleotide translation and reading-frame utilities.

No BioPython. The hot functions (translate, detect_coding_frame, reverse_complement, back_translate) are implemented in the C++ extension arda._markup and re-exported here; a pure-Python fallback keeps the module importable if the extension is unavailable. These mirror mirpy’s mirseq API so mirpy can later import arda and reuse them.

arda.refbuild.translate.translate(nt, frame=0)[source]#

Translate a nucleotide string from frame (0/1/2).

Parameters:

nt (str)
frame (int)

Return type:

str

arda.refbuild.translate.detect_coding_frame(nt)[source]#

Return the reading frame (0/1/2) with the fewest stop codons.

Parameters:

nt (str)

Return type:

int
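A pure-Python sketch of translation and fewest-stops frame detection, in the spirit of the fallback mentioned above. It is not the extension's code: translate_py and detect_frame_py are hypothetical names, unknown codons become X, trailing partial codons are dropped, and ties go to the lowest frame; the real functions may differ on each of these points.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate_py(nt: str, frame: int = 0) -> str:
    """Translate complete codons starting at `frame` (0, 1 or 2)."""
    nt = nt.upper()
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X")
                   for i in range(frame, len(nt) - 2, 3))


def detect_frame_py(nt: str) -> int:
    """Return the frame whose translation contains the fewest stop codons."""
    return min(range(3), key=lambda f: translate_py(nt, f).count("*"))
```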

arda.refbuild.translate.reverse_complement(nt)[source]#

Reverse-complement a nucleotide string (non-ACGT -> N).

Parameters:

nt (str)

Return type:

str

arda.refbuild.translate.back_translate(aa, unknown='NNN')[source]#

Mock back-translation via most-frequent human codons.

Parameters:

aa (str)
unknown (str)

Return type:

str

arda.refbuild.translate.aa_coords_from_nt(nt_start, nt_end, coding_start)[source]#

Map a 1-based closed nt interval to 1-based closed aa coordinates.

Parameters:

nt_start (int) – 1-based start of the region in the nucleotide sequence.
nt_end (int) – 1-based end (closed).
coding_start (int) – 1-based nt position where translation begins (frame origin).

Returns:

(aa_start, aa_end) 1-based closed, in the translated protein.

Return type:

tuple[int, int]
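The coordinate mapping amounts to integer division by the codon length. The sketch below assumes that a codon only partly covered by the interval still counts towards the aa range; how the real function treats partial codons is not stated here, and aa_coords_sketch is a hypothetical name.

```python
from typing import Tuple


def aa_coords_sketch(nt_start: int, nt_end: int,
                     coding_start: int) -> Tuple[int, int]:
    """Map a 1-based closed nt interval to 1-based closed aa coordinates."""
    # Offset from the frame origin, floor-divided into codons, back to 1-based.
    aa_start = (nt_start - coding_start) // 3 + 1
    aa_end = (nt_end - coding_start) // 3 + 1
    return aa_start, aa_end
```

With coding_start = 1, nucleotides 4 to 9 cover the second and third codons, so the result is (2, 3).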

IMGT/V-QUEST germline reference download, parsing, and ungapping.

The IMGT V-QUEST reference directory ships gapped germline FASTAs laid out as <Species>/<IG|TR>/<GENE>.fasta (e.g. Homo_sapiens/IG/IGHV.fasta). Sequences carry IMGT-numbering gap dots; IgBLAST’s edit_imgt_file.pl ungaps them and rewrites headers to bare allele names (what makeblastdb wants).

This module:

downloads & extracts the reference zip into data/imgt (gitignored),
parses the original gapped FASTA headers for per-allele functionality,
ungaps a gene file via edit_imgt_file.pl into data/imgt/ungapped.

class arda.refbuild.imgt.ImgtAllele(allele, functionality, sequence)[source]#

Bases: object

A germline allele parsed from an IMGT FASTA header + sequence.

Parameters:

allele (str)
functionality (str)
sequence (str)

allele: str#

functionality: str#

sequence: str#

property is_functional: bool#

arda.refbuild.imgt.download_reference(*, force=False)[source]#

Download and extract the IMGT V-QUEST reference directory.

Returns the extraction root (containing the per-species directories). Idempotent unless force.

Parameters:

force (bool)

Return type:

Path

arda.refbuild.imgt.gene_fasta_path(species_dir, group, gene_stem)[source]#

Path to a gene-type FASTA, e.g. Homo_sapiens/IG/IGHV.fasta.

Handles the occasional top-level wrapper directory inside the zip.

Parameters:

species_dir (str)
group (str)
gene_stem (str)

Return type:

Path

arda.refbuild.imgt.parse_functionality(path)[source]#

Map allele name -> normalized functionality from gapped IMGT headers.

Parameters:

path (Path)

Return type:

dict[str, str]
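A rough sketch of what header parsing and ungapping involve. IMGT germline FASTA headers are pipe-delimited with the allele name in the second field and the functionality in the fourth; everything else here is an assumption: parse_imgt_header and strip_gaps are hypothetical names, normalisation is taken to mean removing the parentheses or brackets IMGT puts around some functionality codes, and the header in any example is made up.

```python
from typing import Tuple


def parse_imgt_header(header: str) -> Tuple[str, str]:
    """Return (allele, normalised functionality) from a gapped IMGT header."""
    fields = header.lstrip(">").split("|")
    allele = fields[1].strip()
    # "(F)", "[ORF]" and similar collapse to the bare code.
    functionality = fields[3].strip().strip("()[]")
    return allele, functionality


def strip_gaps(gapped: str) -> str:
    """Remove IMGT-numbering gap dots from a germline sequence."""
    return gapped.replace(".", "")
```

In the build itself the ungapping is delegated to edit_imgt_file.pl; strip_gaps only shows what removing the gap dots means.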

arda.refbuild.imgt.ungap_gene(species_dir, group, gene_stem)[source]#

Ungap a gene file with edit_imgt_file.pl; return the ungapped path.

Parameters:

species_dir (str)
group (str)
gene_stem (str)

Return type:

Path

arda.refbuild.imgt.read_fasta(path)[source]#

Read a FASTA file into (header, sequence) pairs (sequence joined).

Parameters:

path (Path)

Return type:

list[tuple[str, str]]

External tool wrappers#

Thin wrapper around the mmseqs binary.

Inspired by pymmseqs (MIT) but deliberately dependency-free: we only need binary discovery, a subprocess runner, and the createdb / search / convertalis (and easy-search) pipeline used by the annotator.

Discovery order for the binary: $ARDA_MMSEQS → <project>/bin/mmseqs → mmseqs on PATH. If none are found, a static binary is auto-fetched into <project>/bin/mmseqs (one-time, transparent) unless $ARDA_NO_AUTO_FETCH is set — so neither pip nor conda users need to install mmseqs manually.

exception arda.mmseqs.MMseqsError[source]#

Bases: RuntimeError

Raised when an mmseqs invocation exits non-zero.

arda.mmseqs.mmseqs_binary()[source]#

Locate the mmseqs executable, auto-fetching a static build if needed.

Resolution: $ARDA_MMSEQS → <project>/bin/mmseqs → mmseqs on PATH. If still not found, download a static binary into <project>/bin/mmseqs (one-time) unless $ARDA_NO_AUTO_FETCH is set.

Return type:

str
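The lookup part of the resolution order can be sketched as follows. find_mmseqs is a hypothetical name, the project root is passed in explicitly, the override is returned without checking that it exists, and the auto-fetch step is deliberately left out (the sketch returns None where the real function would download a binary).

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def find_mmseqs(project_root: Path,
                env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Resolve mmseqs: $ARDA_MMSEQS, then <project>/bin/mmseqs, then PATH."""
    env = os.environ if env is None else env
    override = env.get("ARDA_MMSEQS")
    if override:
        return override
    bundled = project_root / "bin" / "mmseqs"
    if bundled.is_file():
        return str(bundled)
    # shutil.which returns None when nothing on the given PATH matches.
    return shutil.which("mmseqs", path=env.get("PATH"))
```

Passing the environment as a mapping keeps the function easy to exercise in tests without touching the real process environment.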

arda.mmseqs.run(args, *, check=True)[source]#

Run mmseqs <args> capturing stdout/stderr.

Parameters:

args (list[str])
check (bool)

Return type:

CompletedProcess

arda.mmseqs.version()[source]#

Return the mmseqs version string (e.g. 18-8cc5c).

Return type:

str

arda.mmseqs.version_key(v)[source]#

Canonical form of a version string, for comparing an index marker to the running mmseqs.

mmseqs version prints the same release+commit with different punctuation across builds: the official static binary says 18-8cc5c, the bioconda build 18.8cc5c. They are the same mmseqs and produce byte-compatible indexes; only the separator differs. Comparing the raw strings with == therefore rejected every committed index the moment the toolchain moved from conda’s mmseqs to the static one – so the precompiled DBs shipped in database/ were never used and every run rebuilt a private cache instead.

Fold each run of separator characters to a single - and lowercase. This bridges the cosmetic difference while still distinguishing genuinely different versions (17-b804f != 18-8cc5c), so an incompatible index is never accepted.

Parameters:

v (str)

Return type:

str
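A minimal sketch of the folding rule, assuming that a separator is any character that is not a lowercase letter or digit after lowercasing; the exact separator set used by version_key is not stated here, and version_key_sketch is a hypothetical name.

```python
import re


def version_key_sketch(v: str) -> str:
    """Lowercase and fold each run of separator characters to a single '-'."""
    return re.sub(r"[^0-9a-z]+", "-", v.strip().lower()).strip("-")
```

Under this rule 18-8cc5c and 18.8cc5c share one key, while 17-b804f keeps a different one, so the comparison stays strict about real version changes.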

arda.mmseqs.createdb(fasta, db, *, dbtype=None)[source]#

Create an mmseqs sequence DB from a FASTA file.

dbtype: None auto-detect, 1 amino-acid, 2 nucleotide.

Parameters:

fasta (str | Path)
db (str | Path)
dbtype (int | None)

Return type:

Path

arda.mmseqs.search(query_db, target_db, result_db, tmp_dir, *, search_type=0, sensitivity=5.7, evalue=0.001, max_seqs=300, threads=1, kmer=None, extra=None)[source]#

Run mmseqs search with backtrace enabled (-a).

Parameters:

kmer (int | None) – MMseqs2 -k. This is the memory knob. The nucleotide prefilter allocates a k-mer index table of 4**k entries, so the default k=15 costs 4**15 * 8 B ~ 8.6 GB regardless of database size, thread count or chunk size. k=13 costs ~0.7 GB. Leave None for MMseqs2’s own default.
query_db (str | Path)
target_db (str | Path)
result_db (str | Path)
tmp_dir (str | Path)
search_type (int)
sensitivity (float)
evalue (float)
max_seqs (int)
threads (int)
extra (list[str] | None)

Return type:

Path

arda.mmseqs.convertalis(query_db, target_db, result_db, out_tsv, *, format_output='query,target,qstart,qend,tstart,tend,qlen,tlen,alnlen,mismatch,gapopen,cigar,qaln,taln,evalue,bits,pident', threads=1, search_type=None)[source]#

Convert an alignment result DB to a TSV with the requested columns.

search_type must be passed for nucleotide results (3) so convertalis can interpret the alignment; otherwise mmseqs cannot tell nt from translated.

Parameters:

query_db (str | Path)
target_db (str | Path)
result_db (str | Path)
out_tsv (str | Path)
format_output (str)
threads (int)
search_type (int | None)

Return type:

Path

arda.mmseqs.top_hit(result_db, out_db)[source]#

Reduce an alignment DB to the single best-scoring hit per query.

MMseqs2 already stores each query’s results sorted by descending score, so taking the first line per entry is the best hit – this is the idiom mmseqs’ own filterdb usage message shows. Verified on 100 k reads against the human reference: 4,101 queries, identical target and identical bit score to a full sort-and-dedupe in polars, on every one.

Why it matters: with --max-seqs 300, 4,101 hitting queries produced 804,341 alignment rows (194 MB of TSV, each row carrying cigar/qaln/taln). Parsing that dominated arda’s peak RSS – 877 MB, against 284 MB for the mmseqs subprocess itself. Reducing before convertalis writes 1.0 MB instead, and costs 0.04 s.

Parameters:

result_db (str | Path)
out_db (str | Path)

Return type:

Path
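The first-line-per-entry idea has a simple in-memory analogue, shown here on already-parsed rows. This only illustrates the logic; top_hit itself works on the mmseqs result DB, and first_hit_per_query is a hypothetical helper. It assumes that rows for one query are contiguous and already sorted by descending score.

```python
from itertools import groupby
from typing import Iterable, Iterator, Sequence


def first_hit_per_query(rows: Iterable[Sequence[str]]) -> Iterator[Sequence[str]]:
    """Yield the first row of each contiguous run sharing a query id."""
    for _query, hits in groupby(rows, key=lambda row: row[0]):
        # Rows are pre-sorted by score, so the first one is the best hit.
        yield next(hits)
```

No sorting and no buffering of the other rows is needed, which is why reducing before the TSV conversion is so cheap.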

arda.mmseqs.easy_search(query_fasta, target_fasta_or_db, out_tsv, tmp_dir, *, search_type=0, sensitivity=5.7, evalue=0.001, max_seqs=300, threads=1, format_output='query,target,qstart,qend,tstart,tend,qlen,tlen,alnlen,mismatch,gapopen,cigar,qaln,taln,evalue,bits,pident', strand=None, extra=None)[source]#

One-shot createdb+search+convertalis producing a TSV.

strand (nucleotide search only): 1 forward, 2 both strands; None lets mmseqs default (forward).

Parameters:

query_fasta (str | Path)
target_fasta_or_db (str | Path)
out_tsv (str | Path)
tmp_dir (str | Path)
search_type (int)
sensitivity (float)
evalue (float)
max_seqs (int)
threads (int)
format_output (str)
strand (int | None)
extra (list[str] | None)

Return type:

Path

Wrapper around the downloaded IgBLAST release.

Used only at build time (Phase 1) to construct the curated reference DB; the runtime annotator does not depend on IgBLAST.

The IgBLAST release is expected under <project>/bin (placed there by setup.sh), laid out as:

bin/
  igblastn  igblastp  makeblastdb  edit_imgt_file.pl
  internal_data/   optional_file/

$IGDATA is pointed at bin/ so IgBLAST finds internal_data and the per-organism optional_file/<organism>_gl.aux auxiliary files.

exception arda.igblast.IgBlastError[source]#

Bases: RuntimeError

Raised when an IgBLAST tool invocation fails or is missing.

arda.igblast.igdata_env()[source]#

Environment with IGDATA pointing at the IgBLAST data root.

Return type:

dict[str, str]

arda.igblast.tool(name)[source]#

Resolve an IgBLAST tool path under bin/.

Parameters:

name (str)

Return type:

Path

arda.igblast.edit_imgt_file(imgt_fasta, out_fasta)[source]#

Ungap an IMGT germline FASTA via edit_imgt_file.pl.

Parameters:

imgt_fasta (str | Path)
out_fasta (str | Path)

Return type:

Path

arda.igblast.makeblastdb(in_fasta, out_db, *, dbtype='nucl')[source]#

Build a germline BLAST database from an ungapped FASTA.

Parameters:

in_fasta (str | Path)
out_db (str | Path)
dbtype (str)

Return type:

Path

arda.igblast.igblastn_airr(query_fasta, out_tsv, *, organism, germline_db_v, germline_db_j, germline_db_d=None, auxiliary_data=None, ig_seqtype='TCR', num_threads=1)[source]#

Run igblastn -outfmt 19 (AIRR rearrangement TSV).

Parameters:

query_fasta (str | Path)
out_tsv (str | Path)
organism (str)
germline_db_v (str | Path)
germline_db_j (str | Path)
germline_db_d (str | Path | None)
auxiliary_data (str | Path | None)
ig_seqtype (str)
num_threads (int)

Return type:

Path