GLIPH-like multi-family enrichment and full-graph clustering (TRB)#

This notebook implements a highly customized MIR workflow inspired by GLIPH concepts for TRB clonotypes with expanded token families and a full graph stage. It is inspired by ideas described in the GLIPH paper and is not a literal line-by-line reimplementation of the original software.

  1. Load gliph_trb.tsv.gz from the local benchmark cache.

  2. Collapse each study to a unique list of (v_gene, junction_aa) clonotypes before token counting.

  3. Build five token families per clonotype:

    • v+3-mer

    • pos+3mer

    • ungapped 4-mer

    • gapped 4-mer

    • gapped 5-mer

  4. Compare sample-vs-control VJ usage after V-only control normalization to diagnose residual gene-usage drift.

  5. For each study and family, run binomial enrichment against a V-only matched real TRB control sample of size 1,000,000.

  6. Plot separate volcano plots for each family/study pair, showing only the enrichment ratio > 1.0 side.

  7. Merge all enriched tokens into a heterogeneous graph (token-clonotype edges + clonotype-clonotype Hamming <= 1 edges).

  8. Expand the clonotype set by one Hamming-hop from the current graph.

  9. Build one-mode projections (clonotype and k-mer sides), run connected components and Leiden communities, and evaluate concordance against stimulus and gliph_cluster_id.

Reference: Glanville J, Huang H, Nau A, et al. Identifying specificity groups in the T cell receptor repertoire. Nature. 2017;547(7661):94-98. doi:10.1038/nature22976. PMID:28636589. PubMed: https://pubmed.ncbi.nlm.nih.gov/28636589/

The implementation reuses mir/** components for token extraction, control normalization, binomial enrichment, and edit-distance graph construction.

[1]:
import importlib.metadata as _meta
import sys as _sys
print(f"Python {_sys.version.split()[0]}")
for _pkg in ["mirpy-lib", "numpy", "pandas", "matplotlib", "scipy", "polars"]:
    try:
        print(f"  {_pkg}: {_meta.version(_pkg)}")
    except _meta.PackageNotFoundError:
        pass

from __future__ import annotations

from collections import Counter, defaultdict
from itertools import combinations
from pathlib import Path
import importlib
import re
import time

import igraph as ig
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
import polars as pl
from IPython.display import Markdown, display
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
)

import mir.graph.token_graph as token_graph_mod
import mir.biomarkers.gliph as gliph_mod
importlib.reload(token_graph_mod)
importlib.reload(gliph_mod)

from mir.biomarkers.gliph import (
    GliphTokenArtifacts,
    build_full_gliph_clonotype_graph,
    build_kmer_projection_graph,
    combine_enriched_token_maps,
    compare_gliph_token_incidence,
    extract_gliph_artifacts_batch_from_repertoire,
)
from mir.basic.gene_usage import GeneUsage
from mir.common.repertoire import LocusRepertoire
from mir.utils.notebook_assets import ensure_airr_benchmark, find_repo_root

# Publication-quality matplotlib style (Nature/Science aesthetics)
plt.rcParams.update({
    "font.family": "sans-serif",
    "font.sans-serif": ["Arial", "Helvetica", "DejaVu Sans"],
    "font.size": 10,
    "axes.titlesize": 11,
    "axes.labelsize": 10,
    "xtick.labelsize": 9,
    "ytick.labelsize": 9,
    "legend.fontsize": 9,
    "figure.dpi": 150,
    "savefig.dpi": 300,
    "axes.linewidth": 0.8,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "xtick.major.size": 3.5,
    "ytick.major.size": 3.5,
    "xtick.direction": "out",
    "ytick.direction": "out",
    "axes.grid": False,
})

Python 3.12.12
  mirpy-lib: 1.0.0
  numpy: 2.4.4
  pandas: 2.3.3
  matplotlib: 3.10.9
  scipy: 1.17.1
  polars: 1.39.3
/Users/mikesh/vcs/mirpy/venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
[2]:
# Configuration: analysis parameters and runtime caps

import re
import sys
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from IPython.display import display, Markdown

repo_root = find_repo_root()
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

SEED = 42
np.random.seed(SEED)

STUDY_KEYWORDS = {
    'Glanville2017': 'glanville',
    'Huang2020': 'huang',
}

FAMILIES = ['v3', 'pos3', 'u3', 'u4', 'g4', 'g5']
TOKEN_THREADS = 8
COUNT_MODE = 'clonotype'
MIN_TOKEN_CLONOTYPES = 2
KMER_PSEUDOCOUNT = 1
SIG_FDR = 0.05
SIG_ENRICHMENT = 1.0
SIG_ODDS = 1.5
MIN_CLUSTER_SIZE = 2
CLONE_EDGE_MIN_WEIGHT = 0.35
MAX_GRAPH_NODES = 800
TOP_LABELS = 5

# Unnormalized control background size used for token extraction.
# This keeps runtime predictable while preserving native control composition.
CONTROL_BACKGROUND_N = 1_000_000

FAMILY_LABELS = {
    'v3': 'V3', 'pos3': 'Pos3', 'u3': 'U3', 'u4': 'U4', 'g4': 'G4', 'g5': 'G5'
}
FAMILY_COLORS = {
    'v3': '#FF6B6B', 'pos3': '#4ECDC4', 'u3': '#577590', 'u4': '#45B7D1',
    'g4': '#FFA07A', 'g5': '#98D8C8'
}
STIMULUS_PALETTE = {
    'HIV': '#e74c3c', 'SIV': '#3498db', 'CMV': '#f39c12',
    'CEF': '#27ae60', 'EBV': '#9b59b6', 'Flu': '#1abc9c',
}

CLUSTER_METHODS = ['components', 'leiden']
AA_RE = re.compile(r'^[A-Z]+$')

REPO_GLPH_PATH = repo_root / 'airr_benchmark' / 'gliph' / 'gliph_trb.tsv.gz'
NOTEBOOK_GLPH_PATH = repo_root / 'notebooks' / 'assets' / 'large' / 'airr_benchmark' / 'gliph' / 'gliph_trb.tsv.gz'
HF_PATTERN = ['gliph/gliph_trb.tsv.gz']
FORCE_HF_REFRESH = False
CONTROL_LOCUS = 'TRB'

[3]:
# Helper functions for GLIPH data processing

def _resolve_gliph_path(force_hf_refresh: bool = FORCE_HF_REFRESH) -> Path:
    """Resolve path to gliph_trb.tsv.gz dataset."""
    if REPO_GLPH_PATH.exists() and not force_hf_refresh:
        return REPO_GLPH_PATH
    if NOTEBOOK_GLPH_PATH.exists() and not force_hf_refresh:
        return NOTEBOOK_GLPH_PATH
    ds_root = ensure_airr_benchmark(repo_root=repo_root, allow_patterns=HF_PATTERN)
    path = ds_root / 'gliph' / 'gliph_trb.tsv.gz'
    if not path.exists():
        raise FileNotFoundError(f'Could not find gliph_trb.tsv.gz at {path}')
    return path


def _repertoire_summary(repertoire: LocusRepertoire) -> dict[str, int]:
    """Get summary statistics for a repertoire."""
    clones = repertoire.clonotypes
    return {
        'clonotypes': len(clones),
        'total_duplicates': sum(c.duplicate_count for c in clones),
        'with_aa': sum(1 for c in clones if c.junction_aa and len(c.junction_aa) >= 5),
    }

[4]:
# Load GLIPH table and build per-study repertoires

gliph_path = _resolve_gliph_path()
print(f'Loading GLIPH data from {gliph_path}')

raw_df = pd.read_csv(gliph_path, sep='\t', compression='infer', low_memory=False)

df = pd.DataFrame({
    'sequence_id': raw_df.index.astype(str),
    'junction_aa': raw_df['junction_aa'].astype(str).str.strip(),
    'v_gene': raw_df['v_gene'].astype(str).str.strip(),
    'j_gene': raw_df.get('j_gene', '').astype(str).str.strip() if 'j_gene' in raw_df.columns else '',
    'duplicate_count': pd.to_numeric(raw_df.get('duplicate_count', 1), errors='coerce').fillna(1).astype(int),
    'reference_id': raw_df['reference_id'].astype(str).str.strip(),
    'stimulus': raw_df.get('stimulus', '').astype(str).str.strip() if 'stimulus' in raw_df.columns else '',
    'epitope': raw_df.get('epitope', '').astype(str).str.strip() if 'epitope' in raw_df.columns else '',
    'gliph_cluster_id': raw_df.get('gliph_cluster_id', '').astype(str).str.strip() if 'gliph_cluster_id' in raw_df.columns else '',
})

# Keep only valid amino-acid clonotypes
df = df[(df['junction_aa'].str.len() >= 5) & (df['junction_aa'].str.match(AA_RE))].copy()

# Deduplicate once for token incidence model
df_dedup = df.groupby(['reference_id', 'junction_aa', 'v_gene'], as_index=False, sort=False).agg({
    'duplicate_count': 'sum',
    'j_gene': 'first',
    'stimulus': 'first',
    'epitope': 'first',
    'gliph_cluster_id': 'first',
}).reset_index(drop=True)
df_dedup['sequence_id'] = df_dedup.index.astype(str)
df_dedup['row_id'] = df_dedup['sequence_id']

# Build repertoires per study for extraction
study_repertoires = {}
for study, sdf in df_dedup.groupby('reference_id', sort=True):
    pl_df = pl.from_pandas(
        sdf[['sequence_id', 'junction_aa', 'v_gene', 'j_gene', 'duplicate_count']],
        include_index=False,
    )
    study_repertoires[str(study)] = LocusRepertoire.from_polars(
        pl_df,
        locus=CONTROL_LOCUS,
        repertoire_id=str(study),
    )

print(f'Loaded {len(raw_df):,} raw clonotypes')
print(f'After filtering: {len(df):,}')
print(f'Deduplicated clonotypes: {len(df_dedup):,}')
print('Per-study repertoire sizes:')
for study, rep in study_repertoires.items():
    print(f'  {study}: {rep.clonotype_count:,}')

display(df_dedup.groupby('reference_id').size().to_frame('clonotypes'))
display(df_dedup.head(3))

Loading GLIPH data from /Users/mikesh/vcs/mirpy/airr_benchmark/gliph/gliph_trb.tsv.gz
Loaded 17,936 raw clonotypes
After filtering: 17,936
Deduplicated clonotypes: 13,864
Per-study repertoire sizes:
  Glanville2017: 3,930
  Huang2020: 9,934
clonotypes
reference_id
Glanville2017 3930
Huang2020 9934
reference_id junction_aa v_gene duplicate_count j_gene stimulus epitope gliph_cluster_id sequence_id row_id
0 Glanville2017 CASSPNGIMNTEAFF TRBV7-6 50835 TRBJ1-1 MegaIL2 nan C3282 0 0
1 Glanville2017 CASTQSGGTNEKLFF TRBV28 110298 TRBJ1-4 MegaIL2 nan C3374 1 1
2 Glanville2017 CASSAWDRGRFTEAFF TRBV25-1 91536 TRBJ1-1 MegaIL2 nan C626 2 2
[5]:
# Build control repertoire once and extract all families once

from mir.common.control import ControlManager

ctrl_manager = ControlManager()
print('Loading real human TRB control repertoire...')
t0 = time.perf_counter()
ctrl_raw = ctrl_manager.ensure_and_load_control_df('real', 'human', CONTROL_LOCUS)
print(f'Loaded {len(ctrl_raw):,} control rows in {time.perf_counter() - t0:.1f}s')
ctrl_raw_pd = ctrl_raw.to_pandas() if hasattr(ctrl_raw, 'to_pandas') else ctrl_raw

# Convert full control to repertoire object
ctrl_pl = pl.from_pandas(
    pd.DataFrame({
        'sequence_id': np.arange(len(ctrl_raw_pd)).astype(str),
        'junction_aa': ctrl_raw_pd['junction_aa'].astype(str).str.strip(),
        'v_gene': ctrl_raw_pd['v_gene'].astype(str).str.strip(),
        'j_gene': ctrl_raw_pd.get('j_gene', '').astype(str).str.strip() if 'j_gene' in ctrl_raw_pd.columns else '',
        'duplicate_count': pd.to_numeric(ctrl_raw_pd.get('duplicate_count', 1), errors='coerce').fillna(1).astype(int),
    }),
    include_index=False,
).filter((pl.col('junction_aa').str.len_chars() >= 5))

control_repertoire_full = LocusRepertoire.from_polars(
    ctrl_pl,
    locus=CONTROL_LOCUS,
    repertoire_id='real_human_trb_full',
)

# Unnormalized random sample from control repertoire to keep extraction bounded
rng = np.random.default_rng(SEED)
control_clones_all = control_repertoire_full.clonotypes
n_bg = min(CONTROL_BACKGROUND_N, len(control_clones_all))
bg_idx = rng.choice(len(control_clones_all), size=n_bg, replace=False)
control_repertoire = LocusRepertoire(
    [control_clones_all[i] for i in bg_idx],
    locus=CONTROL_LOCUS,
    repertoire_id=f'real_human_trb_bg_{n_bg}',
)

print(f'Using control background size: {control_repertoire.clonotype_count:,} (unnormalized sample)')

# Compute control artifacts once for all families
t0 = time.perf_counter()
ctrl_artifacts = extract_gliph_artifacts_batch_from_repertoire(
    control_repertoire,
    FAMILIES,
    count_mode=COUNT_MODE,
)
print(f'Control token extraction finished in {time.perf_counter() - t0:.1f}s')
for family in FAMILIES:
    print(f'  {family}: {len(ctrl_artifacts[family].counts):,} tokens')

Loading real human TRB control repertoire...
Loaded 28,257,621 control rows in 16.4s
Using control background size: 1,000,000 (unnormalized sample)
Control token extraction finished in 55.4s
  v3: 215,583 tokens
  pos3: 760,410 tokens
  u3: 7,692 tokens
  u4: 102,198 tokens
  g4: 30,468 tokens
  g5: 487,083 tokens
[6]:
# Compare marginal V/J usage (study vs shared control background) on log-log axes
# N.B. Huang2020 does not report J genes
from adjustText import adjust_text
import contextlib
import io

def _usage_frame(usage: dict[str, int], label: str) -> pd.DataFrame:
    return pd.DataFrame({'gene': list(usage.keys()), label: list(usage.values())})

def _plot_usage_scatter(ax, merged: pd.DataFrame, x_col: str, y_col: str, title: str):
    if merged.empty:
        ax.set_title(f'{title} (no overlap)')
        ax.axis('off')
        return

    merged = merged.copy()
    merged[x_col] = merged[x_col].astype(float)
    merged[y_col] = merged[y_col].astype(float)
    eps = 1e-8
    merged['x_plot'] = merged[x_col] + eps
    merged['y_plot'] = merged[y_col] + eps
    merged['log2_fc'] = np.log2(merged['y_plot'] / merged['x_plot'])

    ax.scatter(merged['x_plot'], merged['y_plot'], s=14, alpha=0.65, color='#4c78a8')
    min_lim = min(float(merged['x_plot'].min()), float(merged['y_plot'].min()))
    max_lim = max(float(merged['x_plot'].max()), float(merged['y_plot'].max())) * 1.05
    ax.plot([min_lim, max_lim], [min_lim, max_lim], '--', color='gray', linewidth=1.0)
    ax.set_xscale('log')
    ax.set_yscale('log')
    ax.set_xlim(min_lim, max_lim)
    ax.set_ylim(min_lim, max_lim)
    ax.set_xlabel('control frequency (log scale)')
    ax.set_ylabel('study frequency (log scale)')
    ax.set_title(title)

    top_over = merged.nlargest(5, 'log2_fc')
    top_under = merged.nsmallest(5, 'log2_fc')
    label_df = pd.concat([top_over, top_under], axis=0).drop_duplicates('gene')

    texts = []
    for _, row in label_df.iterrows():
        texts.append(ax.text(row['x_plot'], row['y_plot'], str(row['gene']), fontsize=7))
    if texts:
        with contextlib.redirect_stdout(io.StringIO()):
            adjust_text(texts, ax=ax, arrowprops=dict(arrowstyle='-', color='black', lw=0.4))


control_usage = GeneUsage.from_repertoire(control_repertoire)
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))

for col_i, study in enumerate(sorted(study_repertoires.keys())):
    study_usage = GeneUsage.from_repertoire(study_repertoires[study])

    ctrl_v = control_usage.v_fraction(CONTROL_LOCUS, count='clonotypes', pseudocount=1.0)
    std_v = study_usage.v_fraction(CONTROL_LOCUS, count='clonotypes', pseudocount=1.0)
    v_df = _usage_frame(ctrl_v, 'control').merge(_usage_frame(std_v, 'study'), on='gene', how='outer').fillna(0.0)
    _plot_usage_scatter(axes[0, col_i], v_df, 'control', 'study', f'{study} V usage')

    ctrl_j = control_usage.j_fraction(CONTROL_LOCUS, count='clonotypes', pseudocount=1.0)
    std_j = study_usage.j_fraction(CONTROL_LOCUS, count='clonotypes', pseudocount=1.0)
    j_df = _usage_frame(ctrl_j, 'control').merge(_usage_frame(std_j, 'study'), on='gene', how='outer').fillna(0.0)
    _plot_usage_scatter(axes[1, col_i], j_df, 'control', 'study', f'{study} J usage')

plt.tight_layout()
plt.show()

../_images/notebooks_gliph_analysis_6_0.png

Tokenization and Binomial Enrichment#

Each token family is tested separately with a binomial enrichment model:

  • Sample clonotypes are deduplicated first to unique (reference_id, v_gene, junction_aa) rows.

  • Controls are resampled to match V usage only.

  • For each token, we test sample support (count_1 / total_sample_clonotypes) against p_background = count_2 / total_control_clonotypes.

  • The enrichment cutoff for downstream graph construction is FDR < 0.05, enrichment_ratio > 1.0, and at least 2 supporting clonotypes.

[7]:
# Enrichment helper functions

def _run_enrichment(
    sample_art: GliphTokenArtifacts,
    ctrl_art: GliphTokenArtifacts,
) -> pd.DataFrame:
    return compare_gliph_token_incidence(
        sample_art,
        ctrl_art,
        test='binom',
        p_adj_method='fdr_bh',
        pseudocount=KMER_PSEUDOCOUNT,
    )


def _significant_tokens(comp: pd.DataFrame, sample_art: GliphTokenArtifacts) -> tuple[pd.DataFrame, pd.Series]:
    if hasattr(comp, 'to_pandas'):
        comp = comp.to_pandas()
    if 'token' in comp.columns and comp.index.name != 'token':
        comp = comp.set_index('token')
    support = pd.Series(sample_art.clonotype_counts, name='sample_clonotypes')
    comp = comp.join(support, how='left').fillna({'sample_clonotypes': 0})
    comp['sample_clonotypes'] = comp['sample_clonotypes'].astype(int)
    sig_mask = (
        (comp['p_val_adj'] < SIG_FDR)
        & (comp['freq_fc'] > SIG_ODDS)
        & (comp['sample_clonotypes'] >= MIN_TOKEN_CLONOTYPES)
    )
    return comp, sig_mask


def _family_from_token(token: str) -> str:
    return token.split('::', 1)[0]


def _token_display_name(token: str) -> str:
    parts = token.split('::')
    family = parts[0]
    if family == 'v3':
        return f"{parts[1]}:{parts[2]}"
    if family == 'pos3':
        return f"{parts[1]}@{parts[2]}:{parts[3]}"
    return parts[1]

[8]:
# Run per-study enrichment with shared control background

study_results: dict[str, dict] = {fam: {} for fam in FAMILIES}
enrichment_overview_rows: list[dict[str, object]] = []

for study, study_rep in study_repertoires.items():
    sdf = df_dedup[df_dedup['reference_id'] == study].copy()
    print(f'\n=== {study} ===')

    t0 = time.perf_counter()
    sample_batch = extract_gliph_artifacts_batch_from_repertoire(
        study_rep,
        FAMILIES,
        count_mode=COUNT_MODE,
    )
    print(f'  Sample token extraction (all families): {time.perf_counter() - t0:.1f}s')

    for family in FAMILIES:
        sample_art = sample_batch[family]
        ctrl_art = ctrl_artifacts[family]

        comp = _run_enrichment(sample_art, ctrl_art)
        comp, sig_mask = _significant_tokens(comp, sample_art)
        enriched_tokens = set(comp.index[sig_mask])

        study_results[family][study] = {
            'study_art': sample_art,
            'ctrl_art': ctrl_art,
            'comparison': comp,
            'sig_mask': sig_mask,
            'enriched_tokens': enriched_tokens,
            'study_df': sdf,
        }

        enrichment_overview_rows.append({
            'reference_id': study,
            'family': family,
            'token_family': FAMILY_LABELS[family],
            'n_tokens_total': len(comp),
            'n_tokens_sig': int(sig_mask.sum()),
            'median_sample_clonotypes': float(comp['sample_clonotypes'].median()),
        })
        print(f'  {family}: {int(sig_mask.sum())} enriched / {len(comp)} total')

enrichment_overview = pd.DataFrame(enrichment_overview_rows).sort_values(
    ['reference_id', 'family']
).reset_index(drop=True)
display(
    enrichment_overview.style
    .background_gradient(cmap='YlOrRd', axis=0)
    .format(precision=4)
)


=== Glanville2017 ===
  Sample token extraction (all families): 0.1s
  v3: 0 enriched / 216782 total
  pos3: 0 enriched / 763136 total
  u3: 0 enriched / 7698 total
  u4: 0 enriched / 102283 total
  g4: 0 enriched / 30492 total
  g5: 0 enriched / 487482 total

=== Huang2020 ===
  Sample token extraction (all families): 10.5s
  v3: 0 enriched / 217287 total
  pos3: 0 enriched / 765298 total
  u3: 0 enriched / 7704 total
  u4: 0 enriched / 102554 total
  g4: 0 enriched / 30512 total
  g5: 0 enriched / 488724 total
  reference_id family token_family n_tokens_total n_tokens_sig median_sample_clonotypes
0 Glanville2017 g4 G4 30492 0 0.0000
1 Glanville2017 g5 G5 487482 0 0.0000
2 Glanville2017 pos3 Pos3 763136 0 0.0000
3 Glanville2017 u3 U3 7698 0 0.0000
4 Glanville2017 u4 U4 102283 0 0.0000
5 Glanville2017 v3 V3 216782 0 0.0000
6 Huang2020 g4 G4 30512 0 0.0000
7 Huang2020 g5 G5 488724 0 0.0000
8 Huang2020 pos3 Pos3 765298 0 0.0000
9 Huang2020 u3 U3 7704 0 0.0000
10 Huang2020 u4 U4 102554 0 0.0000
11 Huang2020 v3 V3 217287 0 0.0000

Separate Volcano Plots by Family and Study#

Each panel shows one family/study binomial comparison. Tokens passing FDR < 0.05, enrichment_ratio > 1.0, and sample_clonotypes >= 2 are highlighted in red.

[9]:
studies = sorted(next(iter(study_results.values())).keys())
n_studies = len(studies)

fig, axes = plt.subplots(
    nrows=len(FAMILIES),
    ncols=n_studies,
    figsize=(7 * n_studies, 4.5 * len(FAMILIES)),
    squeeze=False,
)

for row_i, family in enumerate(FAMILIES):
    for col_i, study in enumerate(studies):
        ax = axes[row_i, col_i]
        comp = study_results[family][study]['comparison'].copy()
        sig = study_results[family][study]['sig_mask']

        comp = comp[comp['freq_fc'] > 1.0].copy()
        sig = sig.reindex(comp.index).fillna(False)
        if comp.empty:
            ax.set_title(f"{FAMILY_LABELS[family]} | {study} (enrichment>1.0: none)")
            ax.axis('off')
            continue

        comp['log2fc'] = np.log2(comp['freq_fc'].clip(lower=1e-6))
        comp['neglog10p'] = -np.log10(comp['p_val_adj'].clip(lower=1e-300))

        ax.scatter(
            comp.loc[~sig, 'log2fc'],
            comp.loc[~sig, 'neglog10p'],
            s=8, alpha=0.35, color='steelblue', label='n.s. (enrichment>1.0)'
        )
        ax.scatter(
            comp.loc[sig, 'log2fc'],
            comp.loc[sig, 'neglog10p'],
            s=12, alpha=0.8, color='tomato', label=f'sig (n={int(sig.sum())})'
        )

        top = comp[sig].nlargest(TOP_LABELS, 'sample_clonotypes')
        for _, row in top.iterrows():
            token_label = _token_display_name(row.name)
            ax.annotate(token_label, xy=(row['log2fc'], row['neglog10p']), fontsize=6, ha='left', va='bottom')

        ax.axvline(np.log2(SIG_ODDS), color='grey', linestyle='--', linewidth=0.8)
        ax.axhline(-np.log10(SIG_FDR), color='grey', linestyle=':', linewidth=0.8)
        ax.set_xlabel('log2(enrichment ratio)')
        ax.set_ylabel('-log10(FDR)')
        ax.set_title(f'{FAMILY_LABELS[family]} | {study} (enrichment > 1.0)')
        ax.legend(fontsize=8)

plt.tight_layout()
plt.show()
../_images/notebooks_gliph_analysis_11_0.png

Full GLIPH Graph and One-Mode Projections#

For each study, all enriched k-mers from all five families are merged and used in a full graph workflow:

  • Start with token-clonotype edges from enriched k-mers.

  • Add clonotype-clonotype edges from Hamming <= 1 using the mirpy edit-distance graph builder.

  • Expand by one Hamming hop: add all Hamming neighbors of already selected clonotypes.

  • Build one-mode projections (graph-theory term):

    • clonotype projection (clonotypes connected if they co-occur with shared enriched tokens and/or Hamming edge)

    • k-mer projection (k-mers connected if at least one clonotype carries both)

[10]:
def _stimulus_color_map(values: pd.Series) -> dict[str, str]:
    unique_values = [v for v in sorted(values.fillna('').astype(str).str.strip().unique()) if v]
    fallback = ['#bc6c25', '#219ebc', '#b5179e', '#588157', '#ff7f11', '#4d908e']
    colors = {}
    for idx, value in enumerate(unique_values):
        colors[value] = STIMULUS_PALETTE.get(value, fallback[idx % len(fallback)])
    return colors


def _token_shape(family: str) -> str:
    shape_map = {
        'v3': 'circle',
        'pos3': 'square',
        'u3': 'triangle-up',
        'u4': 'triangle-down',
        'g4': 'diamond',
        'g5': 'rectangle',
    }
    return shape_map.get(family, 'circle')


def _build_augmented_bipartite_graph(
    study_df: pd.DataFrame,
    token_to_clones: dict[str, set[str]],
    clone_to_tokens: dict[str, set[str]],
    token_family: dict[str, str],
    full_clone_graph: ig.Graph,
) -> ig.Graph:
    token_nodes = sorted(token_to_clones)
    clone_nodes = sorted(str(name) for name in full_clone_graph.vs['name'])
    all_nodes = token_nodes + clone_nodes
    node_idx = {name: i for i, name in enumerate(all_nodes)}

    graph = ig.Graph(n=len(all_nodes), directed=False)
    graph.vs['name'] = all_nodes
    graph.vs['kind'] = ['token'] * len(token_nodes) + ['clone'] * len(clone_nodes)

    edges: list[tuple[int, int]] = []
    edge_kind: list[str] = []
    edge_color: list[str] = []
    edge_width: list[float] = []

    for clone_id, tokens in clone_to_tokens.items():
        if clone_id not in node_idx:
            continue
        for token in sorted(tokens):
            if token not in node_idx:
                continue
            edges.append((node_idx[token], node_idx[clone_id]))
            edge_kind.append('token_clone')
            edge_color.append('#c7c7c7')
            edge_width.append(0.5)

    for edge in full_clone_graph.es:
        left = str(full_clone_graph.vs[edge.source]['name'])
        right = str(full_clone_graph.vs[edge.target]['name'])
        if left not in node_idx or right not in node_idx:
            continue
        edges.append((node_idx[left], node_idx[right]))
        is_hamming = bool(edge['is_hamming']) if 'is_hamming' in full_clone_graph.es.attributes() else False
        shared = int(edge['shared_kmers']) if 'shared_kmers' in full_clone_graph.es.attributes() else 0
        edge_kind.append('clone_hamming' if is_hamming else 'clone_shared')
        edge_color.append('#4d4d4d' if is_hamming else '#8d99ae')
        edge_width.append(0.8 + 0.2 * np.log2(shared + 1.0))

    if edges:
        graph.add_edges(edges)
        graph.es['kind'] = edge_kind
        graph.es['color'] = edge_color
        graph.es['width'] = edge_width

    dup_map = study_df.set_index('row_id')['duplicate_count'].astype(float).to_dict()
    graph.vs['color'] = [
        FAMILY_COLORS.get(token_family.get(v['name'], ''), '#999999') if v['kind'] == 'token' else '#9e9e9e'
        for v in graph.vs
    ]
    graph.vs['shape'] = [
        _token_shape(token_family.get(v['name'], '')) if v['kind'] == 'token' else 'circle'
        for v in graph.vs
    ]

    degree_map = {v['name']: graph.degree(v.index) for v in graph.vs}
    graph.vs['size'] = [
        5.0 + 1.8 * np.log2(degree_map.get(v['name'], 0) + 1.0) if v['kind'] == 'token'
        else 2.0 + 2.2 * np.log2(dup_map.get(v['name'], 1.0) + 1.0)
        for v in graph.vs
    ]
    return graph


def _prepare_clone_graph_style(study_df: pd.DataFrame, clone_graph: ig.Graph) -> ig.Graph:
    graph = clone_graph.copy()
    if graph.vcount() == 0:
        return graph
    dup_map = study_df.set_index('row_id')['duplicate_count'].astype(float).to_dict()
    stimulus_map = study_df.set_index('row_id')['stimulus'].astype(str).to_dict()
    stim_colors = _stimulus_color_map(study_df['stimulus'])
    graph.vs['color'] = [stim_colors.get(stimulus_map.get(str(name), ''), '#bdbdbd') for name in graph.vs['name']]
    graph.vs['shape'] = ['circle'] * graph.vcount()
    graph.vs['size'] = [2.0 + 2.2 * np.log2(dup_map.get(str(name), 1.0) + 1.0) for name in graph.vs['name']]
    if graph.ecount() and 'weight' in graph.es.attributes():
        graph.es['width'] = [0.35 + 0.18 * np.log2(float(weight) + 1.0) for weight in graph.es['weight']]
    else:
        graph.es['width'] = [0.5] * graph.ecount()
    graph.es['kind'] = [
        'clone_hamming' if ('is_hamming' in graph.es.attributes() and bool(edge['is_hamming'])) else 'clone_shared'
        for edge in graph.es
    ]
    graph.es['color'] = ['#4d4d4d' if k == 'clone_hamming' else '#a8dadc' for k in graph.es['kind']]
    return graph


def _project_kmer_graph_with_style(
    study_df: pd.DataFrame,
    token_to_clones: dict[str, set[str]],
    token_family: dict[str, str],
) -> ig.Graph:
    graph, token_degree = build_kmer_projection_graph(token_to_clones)
    if graph.vcount() == 0:
        return graph

    graph.vs['color'] = [FAMILY_COLORS.get(token_family.get(name, ''), '#999999') for name in graph.vs['name']]
    graph.vs['shape'] = [_token_shape(token_family.get(name, '')) for name in graph.vs['name']]
    graph.vs['size'] = [4.0 + 1.6 * np.log2(token_degree.get(name, 0) + 1.0) for name in graph.vs['name']]
    if graph.ecount() and 'weight' in graph.es.attributes():
        graph.es['width'] = [0.35 + 0.15 * np.log2(float(weight) + 1.0) for weight in graph.es['weight']]
    else:
        graph.es['width'] = [0.45] * graph.ecount()
    graph.es['kind'] = ['token_cooccurrence'] * graph.ecount()
    graph.es['color'] = ['#d0d0d0'] * graph.ecount()
    return graph


def _plot_graph(
    graph: ig.Graph,
    title: str,
    *,
    node_legend_items: list[tuple[str, str]] | None = None,
    edge_legend_items: list[tuple[str, str]] | None = None,
    label_nodes: list[str] | None = None,
    label_map: dict[str, str] | None = None,
    max_nodes: int = MAX_GRAPH_NODES,
) -> None:
    from matplotlib.lines import Line2D

    if graph.vcount() == 0:
        print(f'{title}: no nodes to plot')
        return
    if graph.vcount() > max_nodes:
        print(f'{title}: {graph.vcount()} nodes - subsampling to {max_nodes}')
        keep = sorted(range(graph.vcount()), key=lambda idx: -graph.degree(idx))[:max_nodes]
        graph = graph.induced_subgraph(keep)

    labels = []
    label_nodes = set(label_nodes or [])
    for vertex in graph.vs:
        name = str(vertex['name'])
        if name in label_nodes:
            labels.append(label_map.get(name, name) if label_map else name)
        else:
            labels.append(None)

    layout = graph.layout('fr')
    fig, ax = plt.subplots(figsize=(10, 8))
    ig.plot(
        graph,
        layout=layout,
        target=ax,
        vertex_size=graph.vs['size'] if 'size' in graph.vs.attributes() else 6,
        vertex_color=graph.vs['color'] if 'color' in graph.vs.attributes() else '#9e9e9e',
        vertex_shape=graph.vs['shape'] if 'shape' in graph.vs.attributes() else 'circle',
        vertex_label=labels,
        vertex_label_size=8,
        edge_width=graph.es['width'] if 'width' in graph.es.attributes() else 0.5,
        edge_color=graph.es['color'] if 'color' in graph.es.attributes() else '#cccccc',
    )
    ax.set_title(title)

    handles = []
    if node_legend_items:
        handles.extend([mpatches.Patch(color=color, label=label) for label, color in node_legend_items])
    if edge_legend_items:
        handles.extend([
            Line2D([0], [0], color=color, lw=2.0, label=label)
            for label, color in edge_legend_items
        ])
    if handles:
        ax.legend(handles=handles, loc='upper right', fontsize=8)

    plt.tight_layout()
    plt.show()


def _top_token_labels(token_to_clones: dict[str, set[str]], top_n: int = TOP_LABELS) -> list[str]:
    ranked = sorted(token_to_clones, key=lambda token: (-len(token_to_clones[token]), token))[:top_n]
    return ranked


def _top_clone_labels(study_df: pd.DataFrame, clone_ids: list[str], top_n: int = TOP_LABELS) -> list[str]:
    dup_map = study_df.set_index('row_id')['duplicate_count'].astype(float).to_dict()
    return sorted(clone_ids, key=lambda clone_id: (-dup_map.get(clone_id, 0.0), clone_id))[:top_n]

[11]:
combined_graph_payloads: dict[str, dict[str, object]] = {}

for study in sorted(next(iter(study_results.values())).keys()):
    study_payloads = {family: study_results[family][study] for family in FAMILIES}
    study_df = study_payloads[FAMILIES[0]]['study_df']

    artifacts_by_family = {family: study_payloads[family]['study_art'] for family in FAMILIES}
    enriched_by_family = {family: study_payloads[family]['enriched_tokens'] for family in FAMILIES}
    token_to_clones, clone_to_tokens, token_family = combine_enriched_token_maps(
        artifacts_by_family,
        enriched_by_family,
    )

    full_clone_graph, clone_to_tokens_expanded, hamming_graph = build_full_gliph_clonotype_graph(
        study_df,
        token_to_clones,
        hamming_threshold=1,
        hamming_threads=TOKEN_THREADS,
        expand_hamming_neighbors=True,
        min_kmer_edge_weight=CLONE_EDGE_MIN_WEIGHT,
    )

    combined_graph_payloads[study] = {
        'study_df': study_df,
        'token_to_clones': token_to_clones,
        'clone_to_tokens': clone_to_tokens_expanded,
        'token_family': token_family,
        'full_clone_graph': full_clone_graph,
        'hamming_graph': hamming_graph,
    }

    print(f'\n=== {study} | full GLIPH graphs ===')
    family_counts = {family: len(study_payloads[family]['enriched_tokens']) for family in FAMILIES}
    print('Enriched token counts by family:', family_counts)
    print('Combined enriched token total:', len(token_to_clones))
    print('Initial token-linked clonotypes:', len(clone_to_tokens))
    print('Full hamming graph:', hamming_graph.vcount(), 'nodes /', hamming_graph.ecount(), 'edges')
    print('Expanded full clone graph:', full_clone_graph.vcount(), 'nodes /', full_clone_graph.ecount(), 'edges')

    bipartite_graph = _build_augmented_bipartite_graph(
        study_df,
        token_to_clones,
        clone_to_tokens_expanded,
        token_family,
        full_clone_graph,
    )
    _plot_graph(
        bipartite_graph,
        title=f'{study} | enriched tokens + clonotypes + hamming<=1 edges',
        node_legend_items=[(FAMILY_LABELS[family], FAMILY_COLORS[family]) for family in FAMILIES] + [('clonotype', '#9e9e9e')],
        edge_legend_items=[
            ('token-clonotype', '#c7c7c7'),
            ('shared token (clone-clone)', '#8d99ae'),
            ('hamming<=1 (clone-clone)', '#4d4d4d'),
        ],
        label_nodes=_top_token_labels(token_to_clones),
    )

    clone_graph = _prepare_clone_graph_style(study_df, full_clone_graph)
    stim_legend = [(label, color) for label, color in _stimulus_color_map(study_df['stimulus']).items()]
    clone_label_map = (
        study_df.assign(row_id=study_df['row_id'].astype(str))
        .drop_duplicates('row_id')
        .set_index('row_id')['junction_aa']
        .astype(str)
        .to_dict()
    )
    top_clone_ids = _top_clone_labels(study_df, [str(name) for name in clone_graph.vs['name']])
    _plot_graph(
        clone_graph,
        title=f'{study} | full clonotype graph (shared enriched k-mers + hamming<=1)',
        node_legend_items=stim_legend,
        edge_legend_items=[
            ('shared token (clone-clone)', '#a8dadc'),
            ('hamming<=1 (clone-clone)', '#4d4d4d'),
        ],
        label_nodes=top_clone_ids,
        label_map=clone_label_map,
    )

    kmer_graph = _project_kmer_graph_with_style(study_df, token_to_clones, token_family)
    _plot_graph(
        kmer_graph,
        title=f'{study} | k-mer one-mode projection (co-occurrence graph)',
        node_legend_items=[(FAMILY_LABELS[family], FAMILY_COLORS[family]) for family in FAMILIES],
        edge_legend_items=[('token co-occurrence', '#d0d0d0')],
        label_nodes=_top_token_labels(token_to_clones),
    )


=== Glanville2017 | full GLIPH graphs ===
Enriched token counts by family: {'v3': 0, 'pos3': 0, 'u3': 0, 'u4': 0, 'g4': 0, 'g5': 0}
Combined enriched token total: 0
Initial token-linked clonotypes: 0
Full hamming graph: 3930 nodes / 182 edges
Expanded full clone graph: 0 nodes / 0 edges
Glanville2017 | enriched tokens + clonotypes + hamming<=1 edges: no nodes to plot
Glanville2017 | full clonotype graph (shared enriched k-mers + hamming<=1): no nodes to plot
Glanville2017 | k-mer one-mode projection (co-occurrence graph): no nodes to plot
Skipping sequence with non-canonical amino acid: CASSYXQGVRGRLYGYTF
Skipping sequence with non-canonical amino acid: CASSYXQGVRGRLYGYTF
Skipping sequence with non-canonical amino acid: CASSFPIXLAGTRNTGELFF
Skipping sequence with non-canonical amino acid: CASSFPIXLAGTRNTGELFF

=== Huang2020 | full GLIPH graphs ===
Enriched token counts by family: {'v3': 0, 'pos3': 0, 'u3': 0, 'u4': 0, 'g4': 0, 'g5': 0}
Combined enriched token total: 0
Initial token-linked clonotypes: 0
Full hamming graph: 9934 nodes / 1725 edges
Expanded full clone graph: 0 nodes / 0 edges
Huang2020 | enriched tokens + clonotypes + hamming<=1 edges: no nodes to plot
Huang2020 | full clonotype graph (shared enriched k-mers + hamming<=1): no nodes to plot
Huang2020 | k-mer one-mode projection (co-occurrence graph): no nodes to plot
[12]:
def _canonical_study_name(study_name: str) -> str | None:
    s = str(study_name).lower()
    for canon, keyword in STUDY_KEYWORDS.items():
        if keyword in s:
            return canon
    return None


def _purity_score(y_true: pd.Series, y_pred: pd.Series) -> float:
    tab = pd.crosstab(y_pred, y_true)
    if tab.empty:
        return float('nan')
    return float(tab.max(axis=1).sum() / tab.to_numpy().sum())


def _cluster_projected_graph(
    study_df: pd.DataFrame,
    clone_graph: ig.Graph,
    method: str,
    min_cluster_size: int = MIN_CLUSTER_SIZE,
) -> tuple[dict[str, int], dict[str, int]]:
    clone_ids_all = study_df['row_id'].astype(str).tolist()
    labels_all = {clone_id: -1 for clone_id in clone_ids_all}
    if clone_graph.vcount() == 0 or clone_graph.ecount() == 0:
        return labels_all, {'n_clusters': 0, 'n_clustered': 0, 'n_total': len(clone_ids_all)}

    if method == 'components':
        membership = np.full(clone_graph.vcount(), -1, dtype=int)
        for comp_id, vertices in enumerate(clone_graph.components()):
            for vertex in vertices:
                membership[vertex] = comp_id
    elif method == 'leiden':
        membership = np.array(
            clone_graph.community_leiden(weights='weight', objective_function='modularity', n_iterations=5).membership,
            dtype=int,
        )
    else:
        raise ValueError(f'Unknown method: {method}')

    keep_mask = np.zeros_like(membership, dtype=bool)
    for label in np.unique(membership):
        if (membership == label).sum() >= min_cluster_size:
            keep_mask |= (membership == label)
    membership = np.where(keep_mask, membership, -1)

    for clone_id, label in zip(clone_graph.vs['name'], membership):
        labels_all[str(clone_id)] = int(label)

    stats = {
        'n_clusters': int(len(set(membership[membership >= 0]))),
        'n_clustered': int((membership >= 0).sum()),
        'n_total': len(clone_ids_all),
    }
    return labels_all, stats


def _make_cluster_table(
    study_df: pd.DataFrame,
    clone_to_tokens: dict[str, set[str]],
    token_family: dict[str, str],
    labels_map: dict[str, int],
    top_n: int = 15,
) -> pd.DataFrame:
    rows = []
    study_index = study_df.set_index('row_id')
    valid_labels = sorted({label for label in labels_map.values() if label >= 0})
    for label in valid_labels:
        member_ids = [clone_id for clone_id, value in labels_map.items() if value == label]
        member_rows = study_index.loc[study_index.index.intersection(member_ids)]
        token_support = Counter()
        family_support = Counter()
        for clone_id in member_ids:
            for token in clone_to_tokens.get(clone_id, set()):
                token_support[token] += 1
                family_support[token_family.get(token, _family_from_token(token))] += 1
        top_tokens = ', '.join(
            f"{_token_display_name(token)} ({count})"
            for token, count in token_support.most_common(5)
        )
        top_families = ', '.join(f'{FAMILY_LABELS[family]}:{count}' for family, count in family_support.most_common())
        log2_mean_duplicate_count = (
            float(np.log2(1.0 + member_rows['duplicate_count'].astype(float)).mean())
            if len(member_rows)
            else np.nan
        )
        rows.append({
            'cluster': label,
            'size': len(member_ids),
            'log2_mean_duplicate_count': log2_mean_duplicate_count,
            'top_families': top_families,
            'top_enriched_tokens': top_tokens,
            'stimuli': ', '.join(member_rows['stimulus'].astype(str).value_counts().head(3).index.tolist()),
            'sequences': ' | '.join(member_rows['junction_aa'].astype(str).tolist()[:5]),
        })
    if not rows:
        return pd.DataFrame(columns=['cluster', 'size', 'log2_mean_duplicate_count', 'top_families', 'top_enriched_tokens', 'stimuli', 'sequences'])
    return pd.DataFrame(rows).sort_values('size', ascending=False).head(top_n)


def _concordance_row(
    study_df: pd.DataFrame,
    labels_map: dict[str, int],
    target_col: str,
    target_name: str,
    canonical_study: str | None,
    method: str,
) -> dict[str, object]:
    tmp = study_df[['row_id', target_col]].copy()
    tmp['target'] = tmp[target_col].fillna('').astype(str).str.strip()
    tmp = tmp[~tmp['target'].str.lower().isin({'', 'nan', 'none', 'na'})].copy()

    if target_col == 'gliph_cluster_id' and not tmp.empty:
        target_sizes = tmp['target'].value_counts()
        keep = target_sizes[target_sizes >= 3].index
        tmp = tmp[tmp['target'].isin(keep)].copy()

    n_target_total = len(tmp)
    if n_target_total == 0:
        return {
            'canonical_study': canonical_study,
            'reference_id': str(study_df['reference_id'].iloc[0]),
            'method': method,
            'config': f'full_gliph+{method}',
            'target_name': target_name,
            'n_total': len(study_df),
            'n_target_total': 0,
            'n_eval': 0,
            'coverage': 0.0,
            'ami': np.nan,
            'nmi': np.nan,
            'ari': np.nan,
            'purity': np.nan,
        }

    tmp['pred_cluster'] = tmp['row_id'].map(labels_map).fillna(-1).astype(int)
    coverage = float((tmp['pred_cluster'] >= 0).mean())
    eval_df = tmp[tmp['pred_cluster'] >= 0].copy()

    if len(eval_df) < 2 or eval_df['pred_cluster'].nunique() < 2 or eval_df['target'].nunique() < 2:
        ami = nmi = ari = purity = np.nan
    else:
        ami = adjusted_mutual_info_score(eval_df['target'], eval_df['pred_cluster'])
        nmi = normalized_mutual_info_score(eval_df['target'], eval_df['pred_cluster'])
        ari = adjusted_rand_score(eval_df['target'], eval_df['pred_cluster'])
        purity = _purity_score(eval_df['target'], eval_df['pred_cluster'])

    return {
        'canonical_study': canonical_study,
        'reference_id': str(study_df['reference_id'].iloc[0]),
        'method': method,
        'config': f'full_gliph+{method}',
        'target_name': target_name,
        'n_total': len(study_df),
        'n_target_total': n_target_total,
        'n_eval': len(eval_df),
        'coverage': coverage,
        'ami': ami,
        'nmi': nmi,
        'ari': ari,
        'purity': purity,
    }
[13]:
cluster_tables: dict[tuple[str, str], pd.DataFrame] = {}
cluster_labels: dict[tuple[str, str], dict[str, int]] = {}
cluster_stats_rows: list[dict[str, object]] = []
concordance_rows: list[dict[str, object]] = []

for study, payload in combined_graph_payloads.items():
    study_df = payload['study_df']
    clone_to_tokens = payload['clone_to_tokens']
    token_family = payload['token_family']
    full_clone_graph = payload['full_clone_graph']
    canonical = _canonical_study_name(study)
    total_enriched = len(payload['token_to_clones'])
    n_hamming_edges = int(sum(bool(v) for v in full_clone_graph.es['is_hamming'])) if (full_clone_graph.ecount() and 'is_hamming' in full_clone_graph.es.attributes()) else 0

    for method in CLUSTER_METHODS:
        labels_map, stats = _cluster_projected_graph(
            study_df,
            full_clone_graph,
            method=method,
            min_cluster_size=MIN_CLUSTER_SIZE,
        )
        cluster_labels[(study, method)] = labels_map
        cluster_tables[(study, method)] = _make_cluster_table(study_df, clone_to_tokens, token_family, labels_map)

        cluster_stats_rows.append({
            'canonical_study': canonical,
            'reference_id': study,
            'method': method,
            'config': f'full_gliph+{method}',
            'n_enriched_tokens_total': total_enriched,
            'n_full_graph_nodes': int(full_clone_graph.vcount()),
            'n_full_graph_edges': int(full_clone_graph.ecount()),
            'n_hamming_edges': n_hamming_edges,
            'n_clusters': stats['n_clusters'],
            'n_clustered': stats['n_clustered'],
            'n_total': stats['n_total'],
            'clustered_frac': stats['n_clustered'] / stats['n_total'] if stats['n_total'] else np.nan,
        })

        concordance_rows.append(_concordance_row(
            study_df=study_df,
            labels_map=labels_map,
            target_col='gliph_cluster_id',
            target_name='gliph_cluster_id',
            canonical_study=canonical,
            method=method,
        ))

        if canonical == 'Glanville2017':
            concordance_rows.append(_concordance_row(
                study_df=study_df,
                labels_map=labels_map,
                target_col='stimulus',
                target_name='stimulus',
                canonical_study=canonical,
                method=method,
            ))

cluster_stats_df = pd.DataFrame(cluster_stats_rows).sort_values(
    ['canonical_study', 'reference_id', 'method']
).reset_index(drop=True)
concordance_df = pd.DataFrame(concordance_rows).sort_values(
    ['canonical_study', 'reference_id', 'target_name', 'method']
).reset_index(drop=True)

print('Cluster statistics table:')
display(
    cluster_stats_df.style
    .background_gradient(cmap='YlOrRd', axis=0)
    .format(precision=4)
)

print('Concordance table:')
display(
    concordance_df.style
    .background_gradient(cmap='YlOrRd', axis=0)
    .format(precision=4)
)
Cluster statistics table:
  canonical_study reference_id method config n_enriched_tokens_total n_full_graph_nodes n_full_graph_edges n_hamming_edges n_clusters n_clustered n_total clustered_frac
0 Glanville2017 Glanville2017 components full_gliph+components 0 0 0 0 0 0 3930 0.0000
1 Glanville2017 Glanville2017 leiden full_gliph+leiden 0 0 0 0 0 0 3930 0.0000
2 Huang2020 Huang2020 components full_gliph+components 0 0 0 0 0 0 9934 0.0000
3 Huang2020 Huang2020 leiden full_gliph+leiden 0 0 0 0 0 0 9934 0.0000
Concordance table:
/Users/mikesh/vcs/mirpy/venv/lib/python3.12/site-packages/pandas/io/formats/style.py:3807: RuntimeWarning: All-NaN slice encountered
  smin = np.nanmin(gmap) if vmin is None else vmin
/Users/mikesh/vcs/mirpy/venv/lib/python3.12/site-packages/pandas/io/formats/style.py:3808: RuntimeWarning: All-NaN slice encountered
  smax = np.nanmax(gmap) if vmax is None else vmax
  canonical_study reference_id method config target_name n_total n_target_total n_eval coverage ami nmi ari purity
0 Glanville2017 Glanville2017 components full_gliph+components gliph_cluster_id 3930 260 0 0.0000 nan nan nan nan
1 Glanville2017 Glanville2017 leiden full_gliph+leiden gliph_cluster_id 3930 260 0 0.0000 nan nan nan nan
2 Glanville2017 Glanville2017 components full_gliph+components stimulus 3930 3930 0 0.0000 nan nan nan nan
3 Glanville2017 Glanville2017 leiden full_gliph+leiden stimulus 3930 3930 0 0.0000 nan nan nan nan
4 Huang2020 Huang2020 components full_gliph+components gliph_cluster_id 9934 1202 0 0.0000 nan nan nan nan
5 Huang2020 Huang2020 leiden full_gliph+leiden gliph_cluster_id 9934 1202 0 0.0000 nan nan nan nan
[14]:
# Diagnostics and compact run summary
for key in sorted(cluster_tables.keys()):
    study, method = key
    tab = cluster_tables[key]
    print(f'\n=== full GLIPH graph + {method} | {study} ===')
    if tab.empty:
        print('  (no clusters)')
    else:
        display(tab.head(10))

focus_studies = {'Glanville2017', 'Huang2020'}
focus_conc = concordance_df[concordance_df['canonical_study'].isin(focus_studies)].copy()
display(focus_conc)

availability_rows = []
for study_name in sorted(focus_studies):
    sdf = df_dedup[df_dedup['reference_id'].str.contains(STUDY_KEYWORDS[study_name], case=False, na=False)].copy()
    if sdf.empty:
        continue
    availability_rows.append({
        'study': study_name,
        'n_rows': len(sdf),
        'n_with_gliph_cluster_id': int((sdf['gliph_cluster_id'].fillna('').astype(str).str.strip() != '').sum()),
        'n_with_stimulus': int((sdf['stimulus'].fillna('').astype(str).str.strip() != '').sum()),
    })
availability_df = pd.DataFrame(availability_rows)
print('Target-label availability:')
display(availability_df)

config_order = [f'full_gliph+{method}' for method in CLUSTER_METHODS]
for study_name in sorted(focus_conc['canonical_study'].dropna().unique()):
    study_df_plot = focus_conc[focus_conc['canonical_study'] == study_name].copy()
    fig, axes = plt.subplots(1, 2, figsize=(14, 4), sharey=True)
    for ax, metric in zip(axes, ['ami', 'nmi']):
        piv = study_df_plot.pivot_table(
            index='config', columns='target_name', values=metric, aggfunc='mean'
        ).reindex(config_order)
        piv_num = piv.apply(pd.to_numeric, errors='coerce').dropna(axis=0, how='all').dropna(axis=1, how='all')
        if piv_num.empty:
            ax.text(0.5, 0.5, 'No numeric data to plot', ha='center', va='center', transform=ax.transAxes)
        else:
            piv_num.plot(kind='bar', ax=ax)
        ax.set_title(f'{study_name} | {metric.upper()}')
        ax.set_xlabel('full GLIPH config')
        ax.set_ylabel(metric.upper())
        ax.set_ylim(0, 1)
        ax.tick_params(axis='x', rotation=0)
    plt.tight_layout()
    plt.show()

requested_targets = {
    'Glanville2017': ['gliph_cluster_id', 'stimulus'],
    'Huang2020': ['gliph_cluster_id'],
}
best_rows = []
for study_name, targets in requested_targets.items():
    for target_name in targets:
        sub_all = focus_conc[
            (focus_conc['canonical_study'] == study_name)
            & (focus_conc['target_name'] == target_name)
        ].copy()
        if sub_all.empty:
            best_rows.append(f'- {study_name} vs {target_name}: no rows available in concordance table.')
            continue
        if int(sub_all['n_target_total'].max()) == 0:
            best_rows.append(f'- {study_name} vs {target_name}: no non-empty labels in this dataset snapshot.')
            continue
        sub = sub_all.dropna(subset=['ami']).copy()
        if sub.empty:
            best_rows.append(
                f"- {study_name} vs {target_name}: labels exist but no evaluable configuration (needs >=2 predicted communities and >=2 target classes)."
            )
            continue
        row = sub.sort_values(['ami', 'n_eval'], ascending=[False, False]).iloc[0]
        best_rows.append(
            f"- {study_name} vs {target_name}: best `{row['config']}` with AMI={row['ami']:.3f}, NMI={row['nmi']:.3f}, ARI={row['ari']:.3f}, purity={row['purity']:.3f}, coverage={row['coverage']:.1%}."
        )

family_count_lines = []
for study in sorted(combined_graph_payloads):
    counts = {family: len(study_results[family][study]['enriched_tokens']) for family in FAMILIES}
    formatted = ', '.join(f"{family}={counts[family]}" for family in FAMILIES)
    family_count_lines.append(f'- {study} enriched token counts: {formatted}.')

summary_lines = [
    '## End-of-Notebook Summary',
    '',
    '- GLIPH input was deduplicated to unique `(reference_id, v_gene, junction_aa)` clonotypes before token counting.',
    f'- Token extraction uses `{COUNT_MODE}` counts, `{TOKEN_THREADS}` threads, and unnormalized control background `CONTROL_BACKGROUND_N={CONTROL_BACKGROUND_N:,}`.',
    '- Control token families are computed once and reused across both studies.',
    '- Separate binomial tests were run for `v3`, `pos3`, `u3`, `u4`, `g4`, and `g5` with pseudocount smoothing.',
    '- Concordance for `gliph_cluster_id` is evaluated only for target clusters with >=3 clonotypes.',
    '- Concordance targets: Glanville2017 -> `gliph_cluster_id`, `stimulus`; Huang2020 -> `gliph_cluster_id`.',
]
summary_lines.extend(family_count_lines)
summary_lines.extend(best_rows)
display(Markdown('\n'.join(summary_lines)))

=== full GLIPH graph + components | Glanville2017 ===
  (no clusters)

=== full GLIPH graph + leiden | Glanville2017 ===
  (no clusters)

=== full GLIPH graph + components | Huang2020 ===
  (no clusters)

=== full GLIPH graph + leiden | Huang2020 ===
  (no clusters)
canonical_study reference_id method config target_name n_total n_target_total n_eval coverage ami nmi ari purity
0 Glanville2017 Glanville2017 components full_gliph+components gliph_cluster_id 3930 260 0 0.0 NaN NaN NaN NaN
1 Glanville2017 Glanville2017 leiden full_gliph+leiden gliph_cluster_id 3930 260 0 0.0 NaN NaN NaN NaN
2 Glanville2017 Glanville2017 components full_gliph+components stimulus 3930 3930 0 0.0 NaN NaN NaN NaN
3 Glanville2017 Glanville2017 leiden full_gliph+leiden stimulus 3930 3930 0 0.0 NaN NaN NaN NaN
4 Huang2020 Huang2020 components full_gliph+components gliph_cluster_id 9934 1202 0 0.0 NaN NaN NaN NaN
5 Huang2020 Huang2020 leiden full_gliph+leiden gliph_cluster_id 9934 1202 0 0.0 NaN NaN NaN NaN
Target-label availability:
study n_rows n_with_gliph_cluster_id n_with_stimulus
0 Glanville2017 3930 3930 3930
1 Huang2020 9934 9934 9934
../_images/notebooks_gliph_analysis_17_4.png
../_images/notebooks_gliph_analysis_17_5.png

End-of-Notebook Summary#

  • GLIPH input was deduplicated to unique (reference_id, v_gene, junction_aa) clonotypes before token counting.

  • Token extraction uses clonotype counts, 8 threads, and unnormalized control background CONTROL_BACKGROUND_N=1,000,000.

  • Control token families are computed once and reused across both studies.

  • Separate binomial tests were run for v3, pos3, u3, u4, g4, and g5 with pseudocount smoothing.

  • Concordance for gliph_cluster_id is evaluated only for target clusters with >=3 clonotypes.

  • Concordance targets: Glanville2017 -> gliph_cluster_id, stimulus; Huang2020 -> gliph_cluster_id.

  • Glanville2017 enriched token counts: v3=0, pos3=0, u3=0, u4=0, g4=0, g5=0.

  • Huang2020 enriched token counts: v3=0, pos3=0, u3=0, u4=0, g4=0, g5=0.

  • Glanville2017 vs gliph_cluster_id: labels exist but no evaluable configuration (needs >=2 predicted communities and >=2 target classes).

  • Glanville2017 vs stimulus: labels exist but no evaluable configuration (needs >=2 predicted communities and >=2 target classes).

  • Huang2020 vs gliph_cluster_id: labels exist but no evaluable configuration (needs >=2 predicted communities and >=2 target classes).

Brief Conclusions and Context#

  • This pipeline reproduces a core GLIPH-like idea: motif/token enrichment against controls, followed by motif-linked clustering.

  • Compared to Huang et al., 2020 (Nat Biotech) (https://www.nature.com/articles/s41587-020-0505-4), this notebook is a lighter implementation and does not include all GLIPH2 constraints (e.g., global/local motif scoring calibration and full HLA-aware priors).

  • Relative to Glanville et al., 2017 (Nature) (https://pubmed.ncbi.nlm.nih.gov/28636589/), the bipartite token graph here captures motif-sharing structure similarly, but exact cluster boundaries may differ because we use explicit binomial enrichment + FDR thresholds and connected-component/community heuristics.

  • Practical interpretation:

    • Higher Cramer’s V / AMI / NMI and lower stimulated-unclustered fraction indicate better stimulus separation by token-defined clusters.

    • Silhouette can be conservative in sparse high-dimensional token spaces; use it together with chi2/Cramer’s V and cluster composition tables.

A run-specific summary is generated in the final code cell as End-of-Notebook Summary after all computations complete.

[15]:
# Repeat final summary at literal notebook end for convenience.
if 'summary_lines' in globals():
    display(Markdown('\n'.join(summary_lines)))
else:
    print('Run the previous analysis cell first to generate end-of-notebook summary.')

End-of-Notebook Summary#

  • GLIPH input was deduplicated to unique (reference_id, v_gene, junction_aa) clonotypes before token counting.

  • Token extraction uses clonotype counts, 8 threads, and unnormalized control background CONTROL_BACKGROUND_N=1,000,000.

  • Control token families are computed once and reused across both studies.

  • Separate binomial tests were run for v3, pos3, u3, u4, g4, and g5 with pseudocount smoothing.

  • Concordance for gliph_cluster_id is evaluated only for target clusters with >=3 clonotypes.

  • Concordance targets: Glanville2017 -> gliph_cluster_id, stimulus; Huang2020 -> gliph_cluster_id.

  • Glanville2017 enriched token counts: v3=0, pos3=0, u3=0, u4=0, g4=0, g5=0.

  • Huang2020 enriched token counts: v3=0, pos3=0, u3=0, u4=0, g4=0, g5=0.

  • Glanville2017 vs gliph_cluster_id: labels exist but no evaluable configuration (needs >=2 predicted communities and >=2 target classes).

  • Glanville2017 vs stimulus: labels exist but no evaluable configuration (needs >=2 predicted communities and >=2 target classes).

  • Huang2020 vs gliph_cluster_id: labels exist but no evaluable configuration (needs >=2 predicted communities and >=2 target classes).

[16]:
# Diagnose inconsistencies between original GLIPH IDs and our predicted clusters
inconsistency_frames = []

for study, payload in combined_graph_payloads.items():
    canonical = _canonical_study_name(study)
    if canonical not in {'Glanville2017', 'Huang2020'}:
        continue

    study_df = payload['study_df'][['row_id', 'gliph_cluster_id', 'stimulus', 'duplicate_count']].copy()
    study_df['gliph_cluster_id'] = study_df['gliph_cluster_id'].fillna('').astype(str).str.strip()
    study_df = study_df[~study_df['gliph_cluster_id'].str.lower().isin({'', 'nan', 'none', 'na'})].copy()
    if study_df.empty:
        continue

    cluster_sizes = study_df['gliph_cluster_id'].value_counts()
    keep_gliph = cluster_sizes[cluster_sizes >= 3].index
    study_df = study_df[study_df['gliph_cluster_id'].isin(keep_gliph)].copy()
    if study_df.empty:
        continue

    for method in CLUSTER_METHODS:
        labels_map = cluster_labels[(study, method)]
        tmp = study_df.copy()
        tmp['pred_cluster'] = tmp['row_id'].map(labels_map).fillna(-1).astype(int)
        tmp_eval = tmp[tmp['pred_cluster'] >= 0].copy()
        if tmp_eval.empty:
            continue

        rows = []
        for gliph_id, grp in tmp_eval.groupby('gliph_cluster_id'):
            pred_sizes = grp['pred_cluster'].value_counts().sort_values(ascending=False)
            n_pred = int(pred_sizes.shape[0])
            dominant = int(pred_sizes.iloc[0]) if n_pred else 0
            consistency = float(dominant / len(grp)) if len(grp) else np.nan
            split_flag = n_pred > 1
            rows.append({
                'canonical_study': canonical,
                'reference_id': study,
                'method': method,
                'gliph_cluster_id': gliph_id,
                'n_clonotypes_eval': int(len(grp)),
                'n_pred_clusters': n_pred,
                'dominant_pred_cluster': int(pred_sizes.index[0]) if n_pred else -1,
                'dominant_pred_size': dominant,
                'consistency_ratio': consistency,
                'is_split': split_flag,
                'stimulus_top2': ', '.join(grp['stimulus'].astype(str).value_counts().head(2).index.tolist()),
                'log2_mean_duplicate_count': float(np.log2(1.0 + grp['duplicate_count'].astype(float)).mean()),
                'pred_distribution': ', '.join(
                    f"{int(pred)}:{int(sz)}" for pred, sz in pred_sizes.head(4).items()
                ),
            })

        split_df = pd.DataFrame(rows).sort_values(
            ['is_split', 'consistency_ratio', 'n_clonotypes_eval'],
            ascending=[False, True, False],
        ).reset_index(drop=True)
        inconsistency_frames.append(split_df)

        print(f"\n=== {canonical} | full_gliph+{method} | GLIPH-ID inconsistencies ===")
        print(f"evaluated GLIPH IDs: {len(split_df)}")
        print(f"split GLIPH IDs (mapped to >1 predicted cluster): {int(split_df['is_split'].sum())}")
        display(split_df.head(12))

if inconsistency_frames:
    inconsistency_df = pd.concat(inconsistency_frames, ignore_index=True)
    summary = (
        inconsistency_df.groupby(['canonical_study', 'method'], as_index=False)
        .agg(
            n_gliph_ids=('gliph_cluster_id', 'nunique'),
            n_split_ids=('is_split', 'sum'),
            mean_consistency_ratio=('consistency_ratio', 'mean'),
            median_consistency_ratio=('consistency_ratio', 'median'),
        )
    )
    summary['split_fraction'] = summary['n_split_ids'] / summary['n_gliph_ids']
    print('\n=== Split/consistency summary ===')
    display(summary.sort_values(['canonical_study', 'split_fraction', 'mean_consistency_ratio'], ascending=[True, False, True]))
else:
    print('No evaluable GLIPH-ID inconsistencies found (insufficient non-empty gliph_cluster_id labels).')
No evaluable GLIPH-ID inconsistencies found (insufficient non-empty gliph_cluster_id labels).

Interpreting split_fraction and consistency_ratio#

  • split_fraction: fraction of original gliph_cluster_id groups that are split across more than one predicted cluster (n_split_ids / n_gliph_ids).

    • Closer to 0 means stronger one-to-one agreement with original GLIPH IDs.

    • Larger values indicate more fragmentation (one original GLIPH group mapped to multiple predicted clusters).

  • consistency_ratio (per GLIPH group): share of clonotypes assigned to its dominant predicted cluster.

    • Values near 1.0 indicate that most members of that GLIPH group stay together.

    • Lower values indicate stronger internal disagreement between original and predicted grouping.

  • In the summary table:

    • Higher split_fraction = more frequent GLIPH-group splitting.

    • Lower mean_consistency_ratio = weaker alignment overall.

Use both metrics together: split_fraction captures how often splitting occurs, while consistency_ratio captures how severe the splitting is within each affected group.