Reference database build
========================

The reference build is an offline, reproducible process (``arda build-db``). It
is only needed to regenerate ``database/vdj//``; normal annotation uses the
committed references.

Pipeline
--------

#. **Download** the IMGT/V-QUEST germline reference directory.
#. **Ungap** each gene file with IgBLAST's ``edit_imgt_file.pl``.
#. **Enumerate** deduplicated in-frame V·J scaffolds. The D segment is *not*
   enumerated (it only affects the CDR3 interior, which is query-specific at
   runtime), but VDJ loci get a short frame-neutral N spacer where D would sit.
#. **Annotate** the scaffolds with ``igblastn -outfmt 19`` (AIRR) and extract
   the FR/CDR coordinates with polars.
#. **Translate** each scaffold and derive protein markup.

Outputs (per organism)
----------------------

* ``alleles.fasta`` / ``alleles.aa.fasta``: scaffold nucleotide / protein
  sequences.
* ``markup.tsv`` / ``markup.aa.tsv``: region coordinates + sequences.
* ``combinations.tsv``: scaffold → contributing (V, J) allele pairs.
* ``build.log``: per-locus counts and dropped/incomplete summaries.

Reading frames
--------------

V germline is normalized to its coding frame by stop-free frame detection (the
IMGT "codon start" header field is unreliable for 5'-partial alleles). The J
coding frame comes from the IgBLAST auxiliary file; junction N-padding keeps
the J frame aligned to V so the conserved FR4 ``[FW]GXG`` motif translates
correctly.
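
As an illustration of stop-free frame detection, here is a minimal sketch in
plain Python. It is not the ``arda`` implementation: the function names
``stop_free_frames`` and ``detect_coding_frame`` are hypothetical, and treating
an allele with zero or several stop-free frames as undetermined (``None``) is
an assumption made for this example. The actual build may resolve such cases
differently.

```python
# Standard stop codons (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_free_frames(seq):
    """Return the offsets (0, 1, 2) whose complete in-frame codons contain no stop."""
    seq = seq.upper()
    frames = []
    for offset in range(3):
        # Walk complete codons only; a trailing partial codon is ignored.
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(offset)
    return frames


def detect_coding_frame(seq):
    """Return the unique stop-free offset, or None when it is ambiguous.

    Assumption for this sketch: zero or multiple stop-free frames means the
    frame cannot be decided from stop codons alone.
    """
    frames = stop_free_frames(seq)
    return frames[0] if len(frames) == 1 else None


# Frame 0 reads GCT GAC TAA (stop) and frame 2 reads TGA (stop),
# so only offset 1 (CTG ACT AAC) is stop-free.
print(detect_coding_frame("GCTGACTAAC"))  # 1
# Every frame of a poly-A run is stop-free, so the result is undetermined.
print(detect_coding_frame("AAAAAA"))  # None
```

Short or low-complexity sequences often leave more than one frame stop-free,
which is why the sketch reports ambiguity instead of guessing.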