Reference database build
========================

The reference build is an offline, reproducible process (``arda build-db``). It
is only needed to regenerate ``database/vdj/<organism>/``; normal annotation uses
the committed references.

Pipeline
--------

#. **Download** the IMGT/V-QUEST germline reference directory.
#. **Ungap** each gene file with IgBLAST's ``edit_imgt_file.pl``.
#. **Enumerate** deduplicated in-frame V·J scaffolds. The D segment is *not*
   enumerated: it affects only the CDR3 interior, which is query-specific at
   runtime. VDJ loci instead get a short frame-neutral N spacer where D would
   sit.
#. **Annotate** the scaffolds with ``igblastn -outfmt 19`` (AIRR) and extract the
   FR/CDR coordinates with polars.
#. **Translate** each scaffold and derive the protein-level region markup.
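
The enumeration step could be sketched roughly as follows. This is a minimal
illustration, not the ``arda`` implementation: ``enumerate_scaffolds`` and its
arguments are hypothetical names, the inputs are assumed to be frame-normalized
already, and the spacer is passed in as a parameter because its actual length
is not specified here. Identical scaffold sequences are collapsed while every
contributing (V, J) pair is retained.

.. code-block:: python

   from itertools import product


   def enumerate_scaffolds(v_alleles, j_alleles, spacer=""):
       """Map each unique V + spacer + J sequence to its (V, J) allele pairs.

       v_alleles / j_alleles: dicts of allele name -> nucleotide sequence
       (hypothetical input shape, assumed to be in coding frame already).
       spacer: N run placed where D would sit; use "" for VJ loci.
       """
       scaffolds = {}
       for (v_name, v_seq), (j_name, j_seq) in product(
           v_alleles.items(), j_alleles.items()
       ):
           seq = (v_seq + spacer + j_seq).upper()
           # Deduplicate on the sequence, but remember who produced it.
           scaffolds.setdefault(seq, []).append((v_name, j_name))
       return scaffolds


   # Toy sequences, not real germline alleles.
   toy_v = {"V1*01": "ATGGCC", "V1*02": "ATGGCC", "V2*01": "ATGGAA"}
   toy_j = {"J1*01": "TGGGGC"}
   toy_scaffolds = enumerate_scaffolds(toy_v, toy_j, spacer="NNN")
   for seq, pairs in toy_scaffolds.items():
       print(seq, pairs)

Keeping the pair list per scaffold is what would allow a table such as
``combinations.tsv`` to be written after deduplication.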

Outputs (per organism)
----------------------

* ``alleles.fasta`` / ``alleles.aa.fasta``: scaffold nucleotide / protein
  sequences.
* ``markup.tsv`` / ``markup.aa.tsv``: region coordinates and sequences.
* ``combinations.tsv``: scaffold → contributing (V, J) allele pairs.
* ``build.log``: per-locus counts and summaries of dropped or incomplete
  entries.
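
As a rough illustration of how region coordinates might be pulled out of an
AIRR table with polars, consider the sketch below. The column names
(``sequence_id``, ``fwr1_start``, ``fwr1_end``, and so on) follow the AIRR
Rearrangement schema, whose coordinates are 1-based and closed. The function
name, the ``_start0`` suffix, and the conversion to 0-based half-open
coordinates are assumptions made for this example; the actual layout of
``markup.tsv`` is not described here.

.. code-block:: python

   import io

   import polars as pl

   REGIONS = ["fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3", "fwr4"]


   def region_coordinates(airr_tsv):
       """Return sequence_id plus 0-based half-open bounds for each region.

       airr_tsv: path or file-like object holding a tab-separated AIRR table.
       Regions whose columns are absent from the table are skipped.
       """
       df = pl.read_csv(airr_tsv, separator="\t")
       cols = [pl.col("sequence_id")]
       for region in REGIONS:
           start, end = f"{region}_start", f"{region}_end"
           if start in df.columns and end in df.columns:
               # 1-based closed -> 0-based half-open: shift the start only.
               cols.append((pl.col(start) - 1).alias(f"{region}_start0"))
               cols.append(pl.col(end))
       return df.select(cols)


   # Toy table with two regions only, not real IgBLAST output.
   toy = b"sequence_id\tfwr1_start\tfwr1_end\tcdr1_start\tcdr1_end\n" \
         b"s1\t1\t75\t76\t99\n"
   print(region_coordinates(io.BytesIO(toy)))

With half-open bounds, a region's sequence is simply
``scaffold[start0:end]``.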

Reading frames
--------------

Each V germline sequence is normalized to its coding frame by stop-free frame
detection, because the IMGT "codon start" header field is unreliable for
5'-partial alleles. The J coding frame comes from the IgBLAST auxiliary file;
junction N-padding keeps the J frame aligned to V so that the conserved FR4
``[FW]GXG`` motif translates correctly.
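
The two frame rules can be sketched as below, under stated assumptions: all
function names are hypothetical, ``j_frame`` is taken to be the 0-based offset
of the first complete codon in J, and the sequences are toy data rather than
real alleles. Short sequences can have more than one stop-free frame, so the
sketch returns all candidates instead of choosing one. The ``X`` in the motif
is written as ``.`` in the regular expression.

.. code-block:: python

   import re
   from itertools import product

   # Standard genetic code, codons ordered T, C, A, G at each position.
   BASES = "TCAG"
   AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
   CODON_TABLE = {
       "".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)
   }
   FR4_MOTIF = re.compile(r"[FW]G.G")


   def translate(seq):
       """Translate from offset 0; codons containing N become X."""
       seq = seq.upper()
       return "".join(
           CODON_TABLE.get(seq[i:i + 3], "X")
           for i in range(0, len(seq) - 2, 3)
       )


   def stop_free_frames(seq):
       """Return every offset (0, 1, 2) whose translation has no stop codon."""
       return [f for f in range(3) if "*" not in translate(seq[f:])]


   def pad_junction(v_in_frame, j_seq, j_frame):
       """Join V and J with 0-2 N bases so J's first codon lands in V's frame."""
       pad = (-(len(v_in_frame) + j_frame)) % 3
       return v_in_frame + "N" * pad + j_seq


   # Toy 5'-partial V: one extra leading base, so only offset 1 is stop-free.
   v_raw = "ACTGACCGATAACTGT"
   frames = stop_free_frames(v_raw)
   v_in_frame = v_raw[frames[0]:]

   # Toy J whose first complete codon starts at offset 2.
   scaffold = pad_junction(v_in_frame, "ACTGGGGCCAGGGAACC", j_frame=2)
   protein = translate(scaffold)
   print(frames, scaffold, protein, bool(FR4_MOTIF.search(protein)))

Without the padding, the same V and J would be joined out of frame and the
motif would not appear in the translation.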