mir.common package#
Submodules#
mir.common.clonotype module#
mir.common.diversity module#
mir.common.filter module#
mir.common.alleles module#
mir.common.gene_library module#
mir.common.parser module#
mir.common.repertoire module#
mir.common.metaclonotype module#
mir.common.single_cell module#
mir.common.single_cell_parser module#
mir.common.single_cell_repair module#
mir.common.single_cell_util module#
mir.common.sampling module#
mir.common.pool module#
mir.common.control module#
mir.common.io_parallel module#
Parallel Default And Fallback Policy#
Default mode uses parallel parsing with 4 workers.
Sequential fallback is used when any of these are true:
n_jobs is set to 1.
Parsed row count is below 10,000 (parallel_min_rows default).
The file fits in one chunk (n_rows <= chunk_size).
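The three fallback conditions can be read as a single predicate. The following is a minimal sketch of that decision, not the module's actual code: the function name `use_parallel` is hypothetical, and `chunk_size` is left as a required argument because its default is not stated here.

```python
def use_parallel(
    n_rows: int,
    chunk_size: int,
    n_jobs: int = 4,
    parallel_min_rows: int = 10_000,
) -> bool:
    """Return True when parallel parsing would be chosen under the policy above.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if n_jobs == 1:
        # Explicit request for a single worker.
        return False
    if n_rows < parallel_min_rows:
        # Too few rows for parallel overhead to pay off.
        return False
    if n_rows <= chunk_size:
        # The whole file fits in one chunk, so there is nothing to split.
        return False
    return True
```

Any one failing condition is enough to force sequential loading, which is why the checks return early rather than being combined into a score.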
Practical estimate for typical AIRR tables:
A reference AIRR file of about 3,000 rows occupies about 0.07 MB gzipped, which gives approximately 43,000 rows per MB gz for similarly narrow AIRR tables.
Under this approximation, 10,000 rows correspond to roughly 0.23 MB gz.
Rule of thumb:
If a gzipped AIRR file is smaller than about 0.23 MB, sequential loading is typically chosen.
If it is larger than about 0.23 MB, parallel loading is typically selected by default, provided the file spans more than one chunk.
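The arithmetic behind the 0.23 MB figure can be reproduced directly from the numbers quoted above. This is only a worked calculation. The variable names are illustrative, and the estimate holds only for tables about as narrow as the reference file.

```python
# Reference point quoted above: about 3,000 rows in about 0.07 MB gzipped.
rows_sample = 3_000
size_sample_mb = 0.07
parallel_min_rows = 10_000  # default row threshold for parallel parsing

# Row density of the reference file, then the file size holding 10,000 rows.
rows_per_mb = rows_sample / size_sample_mb
threshold_mb = parallel_min_rows / rows_per_mb

print(f"{rows_per_mb:,.0f} rows per MB gz")  # 42,857 (about 43,000)
print(f"{threshold_mb:.2f} MB gz")           # 0.23
```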
mir.common.repertoire_dataset module#
TSV And Parquet I/O Layouts#
The repertoire classes provide Polars-first TSV/Parquet I/O helpers with roundtrip-safe schemas:
LocusRepertoire:
to_tsv / from_tsv
to_parquet / from_parquet
SampleRepertoire:
single-file: one TSV/Parquet with a locus column
split-loci: one file per locus via split_loci=True
RepertoireDataset:
per_sample_locus layout: one file per sample and locus
single_file layout: one combined file with sample_id and locus columns, plus a separate metadata.tsv
All dataset loaders parallelize at the sample level: each worker task handles one individual sample.
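To illustrate what one worker task per sample looks like, here is a standard-library sketch. It does not use the `mir.common` API. The helper names `load_sample` and `load_dataset`, the one-TSV-per-sample file naming, and the use of a thread pool are all assumptions made for the example.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_sample(path: Path) -> tuple[str, list[dict[str, str]]]:
    """Worker task (hypothetical): read one sample's TSV, return (sample_id, rows)."""
    with path.open(newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    return path.stem, rows


def load_dataset(directory: Path, n_jobs: int = 4) -> dict[str, list[dict[str, str]]]:
    """Load every *.tsv in `directory`, submitting one task per sample file."""
    paths = sorted(directory.glob("*.tsv"))
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return dict(pool.map(load_sample, paths))


# Build a tiny two-sample dataset in a temporary directory and load it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sample_id, locus in [("s1", "TRB"), ("s2", "TRA")]:
        with (root / f"{sample_id}.tsv").open("w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(["locus", "junction_aa", "duplicate_count"])
            writer.writerow([locus, "CASSLGQAYEQYF", "3"])
    dataset = load_dataset(root)
```

Because each task owns exactly one sample, the tasks share no state and their results can be collected in any order.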