Edit-Distance Graph Analysis: GILGFVFTL-specific TCRs#

Builds Hamming and Levenshtein edit-distance graphs from GILGFVFTL-specific TRB CDR3 sequences pulled from the isalgo/airr_benchmark Hugging Face dataset (latest bundled vdjdb.slim.txt.gz, filtered to human HLA-A*02 entries). Finds the largest connected component, verifies the conserved RS motif, and draws the subgraph with hub nodes highlighted.

[1]:
import importlib.metadata as _meta
import sys as _sys
print(f"Python {_sys.version.split()[0]}")
for _pkg in ["mirpy-lib", "numpy", "pandas", "matplotlib", "scipy", "polars"]:
    try:
        print(f"  {_pkg}: {_meta.version(_pkg)}")
    except _meta.PackageNotFoundError:
        pass

import sys
import time
from pathlib import Path

import igraph as ig
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import pandas as pd

repo_root = Path.cwd().resolve().parent if Path.cwd().name == "notebooks" else Path.cwd().resolve()
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

from mir.common.clonotype import Clonotype
from mir.graph.edit_distance_graph import build_edit_distance_graph
from mir.utils.notebook_assets import ensure_airr_benchmark, find_airr_benchmark_vdjdb_slim

# Publication-quality matplotlib style (Nature/Science aesthetics)
plt.rcParams.update({
    "font.family": "sans-serif",
    "font.sans-serif": ["Arial", "Helvetica", "DejaVu Sans"],
    "font.size": 10,
    "axes.titlesize": 11,
    "axes.labelsize": 10,
    "xtick.labelsize": 9,
    "ytick.labelsize": 9,
    "legend.fontsize": 9,
    "figure.dpi": 150,
    "savefig.dpi": 300,
    "axes.linewidth": 0.8,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "xtick.major.size": 3.5,
    "ytick.major.size": 3.5,
    "xtick.direction": "out",
    "ytick.direction": "out",
    "axes.grid": False,
})

Python 3.12.12
  mirpy-lib: 1.1.0
  numpy: 1.26.4
  pandas: 3.0.3
  matplotlib: 3.10.9
  scipy: 1.17.1
  polars: 1.40.1
/Users/mikesh/vcs/mirpy/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

1. Load data#

The notebook downloads airr_benchmark on first use, locates the latest bundled vdjdb.slim.txt.gz, and filters to human TRB rearrangements for the GILGFVFTL epitope with HLA-A*02 in mhc.a. Because the working list carries no consistent V-gene annotation across entries, we assign the placeholder "TRB" so the v_gene_match parameter can still be exercised.

[2]:
benchmark_root = ensure_airr_benchmark(repo_root, allow_patterns=["vdjdb/**"])
vdjdb_path = find_airr_benchmark_vdjdb_slim(benchmark_root)
vdjdb = pd.read_csv(vdjdb_path, sep="\t")

mask = vdjdb["gene"].eq("TRB") & vdjdb["antigen.epitope"].eq("GILGFVFTL")
if "species" in vdjdb.columns:
    mask &= vdjdb["species"].eq("HomoSapiens")
if "mhc.a" in vdjdb.columns:
    mask &= vdjdb["mhc.a"].fillna("").str.contains("A*02", regex=False)

cdr3s = sorted(vdjdb.loc[mask, "cdr3"].dropna().astype(str).unique())
rearrangements = [
    Clonotype(sequence_id=str(i), locus="TRB", v_gene="TRB", j_gene="TRBJ2-1", junction_aa=seq, duplicate_count=1, _validate=False)
    for i, seq in enumerate(cdr3s)
]

print(f"Loaded {len(rearrangements)} CDR3 sequences from {vdjdb_path}")
print(f"Length distribution: {sorted(set(len(s) for s in cdr3s))}")
print(f"Sequences with RS motif: {sum(1 for s in cdr3s if 'RS' in s)}")
Loaded 5236 CDR3 sequences from /Users/mikesh/vcs/mirpy/notebooks/assets/large/airr_benchmark/vdjdb/vdjdb-2025-12-29/vdjdb.slim.txt.gz
Length distribution: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 33, 38]
Sequences with RS motif: 680

2. Hamming graph (threshold = 1)#

[3]:
t0 = time.perf_counter()
g_hamming = build_edit_distance_graph(
    rearrangements,
    metric="hamming",
    threshold=1,
    n_jobs=1,   # use n_jobs > 1 for larger datasets
)
print(f"Hamming graph built in {time.perf_counter()-t0:.2f}s")
print(f"  Vertices : {g_hamming.vcount()}")
print(f"  Edges    : {g_hamming.ecount()}")

comps = g_hamming.components()
print(f"  Components: {len(comps)}")
cc_sizes = sorted(comps.sizes(), reverse=True)
print(f"  Top-5 component sizes: {cc_sizes[:5]}")
Hamming graph built in 0.06s
  Vertices : 5236
  Edges    : 4727
  Components: 3550
  Top-5 component sizes: [896, 129, 129, 116, 18]

2a. Largest connected component#

[4]:
largest_ham = comps.giant()
seqs_ham = largest_ham.vs["name"]

print(f"Largest CC: {largest_ham.vcount()} vertices, {largest_ham.ecount()} edges")
print("\nSample sequences from largest CC:")
for s in sorted(seqs_ham)[:10]:
    print(f"  {s}")
Largest CC: 896 vertices, 2985 edges

Sample sequences from largest CC:
  CAASSRSTDTQYF
  CAASTRSTDTQYF
  CACSGRSGETQYF
  CACSIRSTDTQYF
  CAGSIRSSYEQYF
  CAGSIRSTETQYF
  CAGSSRSGHEQYF
  CAGSTRSSYEQYF
  CANSFRSGETQYF
  CANSIRSSYEQYF

2b. Verify RS motif#

[5]:
rs_in_cc = [s for s in seqs_ham if "RS" in s]
print(f"RS-motif sequences in largest CC: {len(rs_in_cc)}")
print("Examples:")
for s in rs_in_cc[:8]:
    print(f"  {s}")

assert len(rs_in_cc) > 0, "Largest CC must contain RS-motif sequences"
RS-motif sequences in largest CC: 455
Examples:
  CAASSRSTDTQYF
  CAASTRSTDTQYF
  CACSGRSGETQYF
  CACSIRSTDTQYF
  CAGSIRSSYEQYF
  CAGSIRSTETQYF
  CAGSSRSGHEQYF
  CAGSTRSSYEQYF

3. Hamming graph with v_gene_match = True#

Because all rearrangements share the placeholder "TRB" V-gene, the filter leaves the graph unchanged. In real data with annotated V-genes this would restrict edges to within-V-gene pairs.

[6]:
g_hamming_vmatch = build_edit_distance_graph(
    rearrangements,
    metric="hamming",
    threshold=1,
    v_gene_match=True,
    n_jobs=1,
)
print(f"Hamming (v_gene_match=True): {g_hamming_vmatch.ecount()} edges  "
      f"(plain: {g_hamming.ecount()})")
Hamming (v_gene_match=True): 4727 edges  (plain: 4727)

4. Levenshtein graph (threshold = 1)#

[7]:
t0 = time.perf_counter()
g_lev = build_edit_distance_graph(
    rearrangements,
    metric="levenshtein",
    threshold=1,
    n_jobs=1,
)
print(f"Levenshtein graph built in {time.perf_counter()-t0:.2f}s")
print(f"  Vertices : {g_lev.vcount()}")
print(f"  Edges    : {g_lev.ecount()}  (≥ Hamming: {g_lev.ecount() >= g_hamming.ecount()})")

largest_lev = g_lev.components().giant()
seqs_lev = largest_lev.vs["name"]
print(f"\nLargest CC (Levenshtein): {largest_lev.vcount()} vertices")
rs_lev = [s for s in seqs_lev if "RS" in s]
print(f"RS-motif sequences: {len(rs_lev)}")
Levenshtein graph built in 0.23s
  Vertices : 5236
  Edges    : 4951  (≥ Hamming: True)

Largest CC (Levenshtein): 1195 vertices
RS-motif sequences: 473

Levenshtein with v_gene_match#

[8]:
g_lev_vmatch = build_edit_distance_graph(
    rearrangements,
    metric="levenshtein",
    threshold=1,
    v_gene_match=True,
    n_jobs=1,
)
print(f"Levenshtein (v_gene_match=True): {g_lev_vmatch.ecount()} edges  "
      f"(plain: {g_lev.ecount()})")
Levenshtein (v_gene_match=True): 4951 edges  (plain: 4951)

5. Visualise the largest Hamming CC#

Hub nodes (top 7 by degree) are drawn in red and labelled with their junction_aa and V-gene placeholder.

[9]:
subg = largest_ham
degrees = subg.degree()

N_HUBS = 7
top_idx = sorted(range(subg.vcount()), key=lambda i: degrees[i], reverse=True)[:N_HUBS]
hub_set = set(top_idx)

print(f"Hub nodes (top {N_HUBS} by degree):")
for i in top_idx:
    v = subg.vs[i]
    print(f"  {v['name']}  v_gene={v['v_gene']}  degree={degrees[i]}")
Hub nodes (top 7 by degree):
  CASSIRSSYEQYF  v_gene=TRB  degree=52
  CASSIRSTDTQYF  v_gene=TRB  degree=42
  CASSTRSTDTQYF  v_gene=TRB  degree=31
  CASSIRSAYEQYF  v_gene=TRB  degree=30
  CASSIRSGYEQYF  v_gene=TRB  degree=29
  CASSIRSTGELFF  v_gene=TRB  degree=28
  CASSGRSTDTQYF  v_gene=TRB  degree=24
[10]:
layout = subg.layout("fr")

vertex_colors = ["#e74c3c" if i in hub_set else "#aed6f1" for i in range(subg.vcount())]
vertex_sizes  = [12 + degrees[i] * 2 for i in range(subg.vcount())]
vertex_labels = [
    f"{subg.vs[i]['name']}\n{subg.vs[i]['v_gene']}" if i in hub_set else ""
    for i in range(subg.vcount())
]

fig, ax = plt.subplots(figsize=(14, 12))
ig.plot(
    subg,
    target=ax,
    layout=layout,
    vertex_color=vertex_colors,
    vertex_size=vertex_sizes,
    vertex_label=vertex_labels,
    vertex_label_size=8,
    vertex_label_color="black",
    edge_width=0.6,
    edge_color="#99a3a4",
)

hub_patch = mpatches.Patch(color="#e74c3c", label=f"Hub (top-{N_HUBS} by degree)")
node_patch = mpatches.Patch(color="#aed6f1", label="Other node")
ax.legend(handles=[hub_patch, node_patch], loc="upper right", fontsize=9)

ax.set_title(
    f"GILGFVFTL TCR CDR3 — Hamming edit-distance graph (threshold=1)\n"
    f"Largest CC: {subg.vcount()} nodes, {subg.ecount()} edges",
    fontsize=11,
)
ax.axis("off")
plt.tight_layout()
plt.show()
../_images/notebooks_edit_distance_graph_18_0.png