Reference database build

The reference build is an offline, reproducible process (arda build-db). It is only needed to regenerate database/vdj/<organism>/; normal annotation uses the committed references.

Pipeline

  1. Download the IMGT/V-QUEST germline reference directory.

  2. Ungap each gene file with IgBLAST’s edit_imgt_file.pl.

  3. Enumerate deduplicated in-frame V·J scaffolds. The D segment is not enumerated because it only affects the CDR3 interior, which is query-specific at runtime. VDJ loci instead get a short frame-neutral N spacer where D would sit.

  4. Annotate the scaffolds with igblastn -outfmt 19 (AIRR) and extract the FR/CDR coordinates with polars.

  5. Translate each scaffold and derive protein markup.
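Step 3 can be pictured with a minimal sketch. It is not the arda implementation: the function name, the three-base spacer, and the length-based in-frame test are all assumptions made for illustration, and the toy sequences are not real germline alleles.

```python
from itertools import product

# Hypothetical spacer: one whole codon of N, so inserting it does not shift
# the J reading frame. The real spacer length is not stated in this document.
D_SPACER = "NNN"


def enumerate_scaffolds(v_alleles, j_alleles, has_d):
    """Join every V with every J and deduplicate by nucleotide sequence.

    v_alleles and j_alleles map allele name -> ungapped sequence. They are
    assumed to be trimmed already so that V ends on a codon boundary and J
    starts on one. Returns {scaffold_sequence: [(v_name, j_name), ...]}.
    """
    spacer = D_SPACER if has_d else ""
    scaffolds = {}
    for (v_name, v_seq), (j_name, j_seq) in product(
        v_alleles.items(), j_alleles.items()
    ):
        seq = v_seq + spacer + j_seq
        # Simplified in-frame test: the scaffold must be a whole number of codons.
        if len(seq) % 3 != 0:
            continue
        # Identical sequences collapse into one scaffold that remembers
        # every (V, J) pair contributing to it.
        scaffolds.setdefault(seq, []).append((v_name, j_name))
    return scaffolds


# Toy data: two V alleles share a sequence, so they collapse into one scaffold.
toy_v = {"V1*01": "ATGGCC", "V1*02": "ATGGCC", "V2*01": "ATGGCA"}
toy_j = {"J1*01": "TGGGGC", "J2*01": "GTGGGGC"}  # J2 would break the frame
toy_scaffolds = enumerate_scaffolds(toy_v, toy_j, has_d=True)
```

The mapping from each scaffold to its contributing pairs is the same kind of relation that combinations.tsv records, although the actual file layout may differ.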

Outputs (per organism)

  • alleles.fasta / alleles.aa.fasta — scaffold nucleotide / protein seqs.

  • markup.tsv / markup.aa.tsv — region coordinates + sequences.

  • combinations.tsv — scaffold → contributing (V, J) allele pairs.

  • build.log — per-locus counts and dropped/incomplete summaries.
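As an illustration of how region coordinates like those in markup.tsv could be derived from the AIRR output of step 4, here is a rough sketch. It uses the standard-library csv module in place of polars to stay dependency-free. The column names follow the AIRR Rearrangement schema (fwr1_start, fwr1_end, and so on), while the output tuple layout is hypothetical and is not the actual markup.tsv schema.

```python
import csv
import io

REGIONS = ("fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3", "fwr4")


def region_rows(airr_tsv):
    """Yield (sequence_id, region, start, end, subsequence) from AIRR TSV text."""
    reader = csv.DictReader(io.StringIO(airr_tsv), delimiter="\t")
    for rec in reader:
        for region in REGIONS:
            start = rec.get(f"{region}_start")
            end = rec.get(f"{region}_end")
            if not start or not end:
                continue  # column missing or region not annotated
            s, e = int(start), int(end)
            # AIRR coordinates are 1-based and inclusive, so convert the
            # start to a 0-based slice index and keep the end as-is.
            yield rec["sequence_id"], region, s, e, rec["sequence"][s - 1 : e]


# Toy AIRR-style input with only two annotated regions.
toy_airr = (
    "sequence_id\tsequence\tfwr1_start\tfwr1_end\tcdr1_start\tcdr1_end\n"
    "s1\tAAACCCGGG\t1\t3\t4\t6\n"
)
toy_rows = list(region_rows(toy_airr))
```

Regions with empty or absent coordinates are skipped rather than guessed, which is one way incomplete scaffolds could end up in a dropped/incomplete summary.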

Reading frames

Each V germline sequence is normalized to its coding frame by stop-free frame detection, because the IMGT “codon start” header field is unreliable for 5'-partial alleles. The J coding frame comes from the IgBLAST auxiliary file. Junction N-padding keeps the J frame aligned to V so that the conserved FR4 [FW]GXG motif (X being any residue) translates correctly.
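The two frame checks can be sketched as follows. This is an illustrative approximation and not the code behind arda build-db: the function names are hypothetical, the J frame is passed in as a plain 0-based offset instead of being parsed from the auxiliary file, and the example does not say how ties between several stop-free frames would be resolved.

```python
import re

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Conserved FR4 motif: F or W, then G, any residue, G.
FR4_MOTIF = re.compile(r"[FW]G.G")


def translate(seq, frame=0):
    """Translate complete codons from a 0-based frame; unknown codons become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i : i + 3], "X") for i in range(frame, len(seq) - 2, 3)
    )


def stop_free_frames(seq):
    """Return every 0-based frame whose translation contains no stop codon."""
    return [f for f in (0, 1, 2) if "*" not in translate(seq, f)]


def has_fr4_motif(j_seq, j_frame):
    """Check that J, read in the given frame, contains the [FW]GXG motif."""
    return FR4_MOTIF.search(translate(j_seq, j_frame)) is not None


# Toy 5'-partial V: one extra leading base puts the coding frame at offset 1.
toy_v = "CGCTGACCTAGCC"
v_frames = stop_free_frames(toy_v)

# Toy J whose motif only appears when read from offset 2.
toy_j = "ACTGGGGCCAGGGC"
j_ok = has_fr4_motif(toy_j, 2)
j_wrong_frame = has_fr4_motif(toy_j, 0)
```

In the toy V sequence only one frame survives the stop-codon test, which is the situation where detection is unambiguous. Short or stop-poor sequences can leave several candidate frames, and a real build would need an additional rule for those.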