261 residues, chain A. Catalytic triad: Ser165 (nucleophile), Asp210 (acid), His242 (base). Oxyanion hole: Gly167, Asn246. All are DMS coldspots.
| Property | LCC WT | ICCG |
|---|---|---|
| Tm | 84.7°C | 90.9°C (+6.2°C) |
| PET degradation | 31% Pf-PET, 3d, 65°C | 90% in 10h, 72°C |
| Rate (Gf-PET) | 93.2 mgTA/h/mg | Higher at 72°C |
| Productivity | — | 16.7 g TPA/L/h |
Source: Tournier et al., Nature 2020. ICCG = F243I/D238C/S283C/Y127G. All residue numbers in this report use PDB/UniProt numbering.
| Mutation | DMS Fitness | Note |
|---|---|---|
| F243I | −0.130 | Slightly deleterious alone |
| D238C | not measured | D238G=+0.42 |
| S283C | +0.803 | Disulfide partner — beneficial alone |
| Y127G | not measured | Y127H=+0.25, Y127C=−1.57 |
| N246D | −1.027 | Strongly deleterious alone |
| Variant | Additional Muts | vs ICCG | DMS Singles Sum† | Coverage | Source |
|---|---|---|---|---|---|
| LCC-LANL ★ | +P38L/Y61C/M91I/L117P/A183V/H218Y/Q224H/S247L/T256I | 14.3× | +1.26 | 11/13 scored (D238C, Y127G absent from DMS) | NREL 2024 |
| RITK | +D53R/R143I/D193T/E208K | 8.3× | +0.90 | 3/8 scored | Fang 2023 |
| LCC-I40M | +6 mutations (many outside DMS range) | 3.6× | +0.67 | partial | ML study |
| ICCG/H252Y | +H218Y | 2.6× | +1.16 | 3/5 scored | Cribari |
| LCC-A2 | +H218Y/N248D | +40% | +1.43 | 4/6 scored | Zheng 2024 |
| ICCG | (baseline) | 1.0× | +0.67 | 2/4 scored | Tournier 2020 |
† Interpretation and limitations of DMS Singles Sum: This column is a naive linear sum of single-site DMS fitness values for scored mutations only — it does not predict true multi-mutant activity. Three caveats:
(1) Missing mutations (e.g. D238C, Y127G absent from DMS singles) contribute 0 to the sum, causing underestimation;
(2) Epistasis is entirely ignored — DMS directly measures F243I+P38L combinatorial fitness = −1.52, far below the additive sum of the two singles, demonstrating strong antagonism that this column cannot capture;
(3) Coverage varies widely across variants (3/8 to 11/13), making cross-row comparisons unreliable.
This column is a rough proxy for "how many DMS-beneficial singles does this variant carry" and must not be used for activity ranking.
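The naive sum and its coverage can be reproduced in a few lines. This is a minimal sketch, assuming single-mutant fitness values are held in a plain dictionary with `None` for unscored mutations; the function name `dms_singles_sum` is hypothetical. The numbers are the ICCG values listed in the tables above.

```python
from typing import Dict, List, Optional, Tuple

def dms_singles_sum(
    mutations: List[str],
    single_fitness: Dict[str, Optional[float]],
) -> Tuple[float, int, int]:
    """Naive linear sum over scored singles; unscored mutations contribute 0.

    Returns (sum, n_scored, n_total) so coverage travels with the sum.
    """
    scored = [single_fitness[m] for m in mutations
              if single_fitness.get(m) is not None]
    return round(sum(scored), 3), len(scored), len(mutations)

# Single-mutant values as listed in this report; None = not measured in DMS.
singles = {"F243I": -0.130, "S283C": 0.803, "D238C": None, "Y127G": None}
iccg = ["F243I", "D238C", "S283C", "Y127G"]

total, n_scored, n_total = dms_singles_sum(iccg, singles)
print(f"ICCG: sum={total:+.2f}, coverage={n_scored}/{n_total}")
```

Returning the coverage alongside the sum keeps caveat (1) visible: a row with 2/4 scored is not comparable to a row with 11/13 scored.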
★ LCC-LANL (14.3× ICCG, NREL/LANL ACS Catal. 2024). H218Y appears in 3 of the top-4 variants (single-mutant DMS fitness +0.490).
Definition: Residues with direct contact to PET/MHET substrate, identified by molecular docking of 2-HE(MHET)₃ into PDB 4EB0 (Tournier et al., Nature 2020). Highest priority zone.
| PDB | AA | Role | DMS fitness† | Scorecons | Published variant |
|---|---|---|---|---|---|
| 165 | Ser | Catalytic nucleophile | −0.685 (n=6) | 0.948 | — |
| 210 | Asp | Catalytic acid | −0.467 (n=5) | 1.000 | — |
| 242 | His | Catalytic base | −0.802 (n=6) | 1.000 | — |
| 95 | Tyr | Oxyanion hole (2nd) | −0.231 (n=5) | 0.883 | — |
| 166 | Met | Oxyanion hole (1st) | −0.719 (n=6) | 1.000 | — |
| 164 | His | Subsite S2 | −0.138 (n=6) | 1.000 | — |
| 125 | Phe | Hydrophobic groove | −0.542 (n=5) | 0.625 | M125I (LCC-LANL) |
| 127 | Tyr | Aromatic clamp | −0.680 (n=4) | 0.610 | Y127G (ICCG) |
| 130 | Ser | Substrate binding | −0.232 (n=5) | 0.984 | T130M (WCCM) |
| 94 | Gly | Binding pocket | −0.317 (n=4) | 1.000 | — |
| 190 | Trp | Aromatic binding | −0.782 (n=5) | 1.000 | — |
| 212 | Val | Binding pocket | +0.111 (n=4) | 0.836 | — |
| 243 | Phe | Binding pocket | −0.319 (n=6) | 0.748 | F243I (ICCG/LCC-LANL) |
| 246 | Asn | Binding pocket | −1.271 (n=6) | 0.985 | N246D (ICCG) |
†DMS fitness = mean log₂-enrichment across all measured substitutions at this position. Scorecons 0–1 (1=fully conserved); catalytic triad values from ganon scorecons_conservation.csv.
Definition: Any residue not in the binding site whose closest atom is ≤5Å from any atom of a binding site or catalytic triad residue. Computed from PDB 4EB0 all-atom coordinates. Priority: binding site > secondary shell > surface/core.
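The shell definition above can be sketched as a minimum atom-to-atom distance filter. This is an illustrative sketch only: the coordinates are hypothetical toy values rather than PDB 4EB0 atoms, and the function names are not from any pipeline used in this report.

```python
import math
from typing import Dict, List, Set, Tuple

Atom = Tuple[float, float, float]

def min_distance(a: List[Atom], b: List[Atom]) -> float:
    """Smallest distance between any atom of a and any atom of b."""
    return min(math.dist(p, q) for p in a for q in b)

def secondary_shell(
    residues: Dict[int, List[Atom]],
    binding_site: Set[int],
    cutoff: float = 5.0,
) -> Dict[int, float]:
    """Residues outside the binding site whose closest atom lies within
    `cutoff` angstroms of any binding-site residue, with that distance."""
    shell = {}
    for res_id, atoms in residues.items():
        if res_id in binding_site:
            continue
        d = min(min_distance(atoms, residues[b]) for b in binding_site)
        if d <= cutoff:
            shell[res_id] = d
    return shell

# Hypothetical toy coordinates (one or two atoms per residue).
toy_residues = {
    165: [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    207: [(4.5, 0.0, 0.0)],
    40: [(30.0, 0.0, 0.0)],
}
shell = secondary_shell(toy_residues, binding_site={165})
print(shell)
```

Under an all-atom definition like this, the sequence neighbours of a binding-site residue always qualify, because a peptide bond is about 1.3 Å long. That is the likely origin of the many 1.3Å entries in the table below.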
| PDB | AA | Dist to BS† | DMS fitness | Scorecons | Published variant |
|---|---|---|---|---|---|
| 92 | Ser | 2.9Å | −0.557 (n=3) | 0.910 | — |
| 93 | Pro | 1.3Å | −0.397 (n=5) | 1.000 | — |
| 96 | Thr | 1.3Å | −0.709 (n=3) | 1.000 | — |
| 97 | Ala | 3.0Å | −0.283 (n=5) | 0.725 | — |
| 101 | Ser | 3.0Å | −0.301 (n=6) | 0.884 | — |
| 102 | Leu | 4.0Å | −0.492 (n=4) | 0.697 | — |
| 104 | Trp | 3.7Å | +0.395 (n=4) | 1.000 | — |
| 123 | Ser | 3.0Å | +0.123 (n=4) | 0.697 | — |
| 124 | Arg | 1.3Å | −0.398 (n=5) | 0.514 | — |
| 126 | Asp | 1.3Å | −1.275 (n=5) | 1.000 | — |
| 128 | Pro | 1.3Å | −0.694 (n=5) | 0.984 | — |
| 129 | Asp | 1.3Å | +0.030 (n=5) | 0.914 | — |
| 131 | Arg | 1.3Å | −0.646 (n=7) | 0.978 | — |
| 132 | Ala | 3.2Å | −0.221 (n=4) | 0.786 | — |
| 133 | Ser | 2.9Å | +0.163 (n=6) | 0.567 | — |
| 134 | Gln | 3.0Å | −0.604 (n=5) | 0.986 | — |
| 162 | Ala | 4.7Å | +0.149 (n=5) | 0.630 | — |
| 163 | Gly | 1.3Å | −0.634 (n=6) | 1.000 | — |
| 167 | Gly | 1.3Å | −1.097 (n=4) | 1.000 | — |
| 168 | Gly | 2.5Å | −0.195 (n=4) | 1.000 | — |
| 169 | Gly | 2.9Å | −0.657 (n=5) | 1.000 | — |
| 170 | Gly | 2.9Å | −0.737 (n=5) | 0.965 | — |
| 187 | Leu | 2.9Å | −0.404 (n=4) | 1.000 | — |
| 188 | Thr | 2.9Å | −0.118 (n=4) | 0.854 | — |
| 189 | Pro | 1.3Å | −0.243 (n=5) | 0.766 | — |
| 191 | His | 1.3Å | −0.639 (n=3) | 0.792 | — |
| 192 | Thr | 3.8Å | +0.826 (n=5) | 0.763 | — |
| 207 | Ala | 3.0Å | +0.503 (n=5) | 0.845 | — |
| 208 | Glu | 3.3Å | +0.101 (n=4) | 0.897 | — |
| 209 | Ala | 1.3Å | +0.256 (n=5) | 0.462 | — |
| 211 | Thr | 1.3Å | +0.297 (n=3) | 0.770 | — |
| 213 | Ala | 1.3Å | +0.400 (n=3) | 1.000 | — |
| 214 | Pro | 3.4Å | +0.071 (n=6) | 0.845 | — |
| 215 | Val | 4.7Å | −0.112 (n=5) | 1.000 | — |
| 218 | His | 3.3Å | +0.041 (n=7) | 0.960 | H218Y (LCC-LANL) |
| 222 | Phe | 4.1Å | +0.204 (n=5) | 0.975 | — |
| 240 | Ala | 3.5Å | −0.327 (n=4) | 0.986 | — |
| 241 | Ser | 1.3Å | −0.418 (n=3) | 0.796 | — |
| 244 | Ala | 1.3Å | −0.078 (n=5) | 0.747 | — |
| 245 | Pro | 1.3Å | +0.361 (n=5) | 1.000 | — |
| 247 | Ser | 1.3Å | +0.213 (n=3) | 0.582 | S247L (LCC-LANL) |
| 248 | Asn | 4.7Å | −0.181 (n=6) | 0.550 | — |
†Min atom-to-atom distance to any binding site residue (incl. catalytic triad). DMS fitness = mean log₂-enrichment. Scorecons 0–1 (1=fully conserved).
The DMS library was generated by random combinatorial mutagenesis, not targeted at specific known variants. None of the 7 published LCC variants has an exact combination match in the 8,179-variant dataset. This means we cannot directly validate any published variant's performance using DMS data.
| Variant | Activity | Total Muts | Singles in DMS | Key Missing | Subset Match in DMS? |
|---|---|---|---|---|---|
| LCC-LANL ★ | 14.3× ICCG | 13 | 11/13 | D238C, Y127G | 1 pair found: F243I+P38L combinatorial fitness = −1.52 (Directly measured DMS double-mutant; independent of Singles Sum — demonstrates strong antagonism between these two mutations in the LCC-LANL background) |
| RITK | 8.3× | 8 | 3/8 | D238C, Y127G, D53R, R143I, D193T | None |
| LCC-A2 | +40% | 6 | 4/6 | D238C, Y127G | None |
| ICCG/H252Y | 2.6× | 5 | 3/5 | D238C, Y127G | None |
| ICCG | 1.0× (baseline) | 4 | 2/4 | D238C, Y127G | None |
| WCCG | Tm+7°C | 4 | 1/4 | F243W, D238C, Y127G | None |
D238C and Y127G, two ICCG core mutations, are absent from all DMS singles. D238 has D238G (+0.42), D238V (−0.90) but not D238C. Y127 has Y127H (+0.25), Y127N (−0.19) but not Y127G. This is a fundamental coverage gap: the DMS library simply didn't sample these specific amino acid substitutions.
The only subset match found: LCC-LANL contains both F243I and P38L, and this pair exists in DMS as a double with fitness −1.52. Additive prediction: F243I(−0.13) + P38L(+0.42) = +0.29. Actual: −1.52. Δ = −1.52 − 0.29 = −1.81, i.e. severe antagonism. Yet in LCC-LANL (with 11 other mutations), the combination works brilliantly. This suggests, but does not prove, that higher-order epistasis can rescue pairwise antagonism.
The distribution is roughly symmetric around 0 (WT level) with a slight left skew. The high beneficial fraction (30%) is unusual for enzymes — most DMS datasets show <10% beneficial. This indicates LCC has extensive room for improvement through mutation, consistent with it being a natural enzyme not previously optimized for PET degradation.
Each LCC variant is encapsulated in a water-in-oil micro-droplet with a PET substrate. After 40h of hydrolysis at 65°C, droplets are sorted by fluorescence (FACS) — brighter = more PET degraded = higher activity. Fitness = log₂(enrichment ratio) after sorting. A fitness of 0 = wild-type level; >0 = better than WT; <0 = worse than WT. The measurement integrates activity, stability, and expression into a single readout.
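A minimal sketch of the fitness calculation follows. It assumes fitness is derived from sequencing read counts before and after sorting and is normalized to wild-type (which is what "0 = wild-type level" implies). The count inputs, the pseudocount, and the function name are assumptions for illustration, not the pipeline actually used.

```python
import math

def log2_enrichment(
    pre: int, post: int, wt_pre: int, wt_post: int, pseudocount: float = 0.5
) -> float:
    """log2 of the variant's post/pre sort ratio relative to wild-type.

    The pseudocount (an assumption) avoids division by zero for variants
    that are absent from one of the two pools.
    """
    variant_ratio = (post + pseudocount) / (pre + pseudocount)
    wt_ratio = (wt_post + pseudocount) / (wt_pre + pseudocount)
    return math.log2(variant_ratio / wt_ratio)

# Hypothetical counts: a variant enriched 4-fold relative to wild-type.
print(log2_enrichment(pre=100, post=400, wt_pre=100, wt_post=100, pseudocount=0.0))
```

Because the scale is logarithmic, a fitness of +1 corresponds to a 2-fold enrichment over wild-type and +2 to a 4-fold enrichment.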
Total: 1,246 single mutants covering 259 of 261 positions. 30% beneficial rate is unusually high — LCC is a mutationally tolerant enzyme.
| k | Count | Mean Fitness | Note |
|---|---|---|---|
| 2 | 2,701 | −0.75 | Most common; antagonistic epistasis dominates |
| 3 | 1,902 | −0.93 | Further fitness decline on average |
| 4 | 1,225 | −1.10 | Best combo: +4.00 (R47H+T82A+A209V+S241P) |
| 5 | 636 | −1.00 | Best combo: +6.98 (top of entire dataset) |
| 6-13 | 544 | −1.321 | Diminishing returns at higher k |
Total: 8,179 variants. Mean fitness generally declines with k (k=5, at −1.00, is a slight exception), but rare combinations massively outperform WT: the dataset contains hidden gems.
| # | Mutation | Fitness | Structural Zone | OHM Zone |
|---|---|---|---|---|
| 1 | N249H | +1.839 | Surface | structural_essential |
| 2 | N140D | +1.835 | Surface | safe_target |
| 3 | A207E | +1.827 | 2nd Shell | allosteric_core |
| 4 | T192P | +1.808 | 2nd Shell | safe_target |
| 5 | L142M | +1.789 | Surface | structural_essential |
| 6 | Q40P | +1.757 | Surface | safe_target |
| 7 | N44H | +1.664 | Surface | safe_target |
| 8 | N225K | +1.653 | Surface | allosteric_handle |
| 9 | Y95F | +1.621 | Binding site | structural_essential |
| 10 | Q217P | +1.597 | Surface | allosteric_handle |
OHM Zone legend:
allosteric_core = high ACI + conserved, directly relays catalytic signal;
allosteric_handle = high ACI + low conservation, tunable modulator — ideal engineering target; safe_target = low ACI, mutations are additive with low epistasis risk;
structural_essential = conserved but low ACI, maintains fold integrity.
| k | Mutations | Fitness |
|---|---|---|
| 5 | T121M+Y127N+A183V+F196S+A281T | +6.975 |
| 2 | I204F+A207T | +3.911 |
| 3 | K194R+I204N+N288I | +3.888 |
| 4 | R47H+T82A+A209V+S241P | +3.998 |
| 6 | T60S+P93R+S100P+P189L+S258T+P280Q | +4.607 |
The best 5-mutant combination (+6.98) is 3.8× the best single mutant (+1.84) on the log₂ fitness scale. This is not merely additive: specific combinations synergize. The challenge: with 259 positions and 20 amino acids, the combinatorial space is vast. We need computational tools to navigate it efficiently.
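To put a number on "vast", the count of distinct k-mutants can be worked out directly. This sketch assumes 19 possible substitutions per position (20 amino acids minus the wild-type residue) and that all 259 positions are freely combinable; `n_variants` is a hypothetical helper name.

```python
import math

def n_variants(n_positions: int, k: int, n_substitutions: int = 19) -> int:
    """Number of distinct variants carrying exactly k substitutions:
    choose k positions, then one of 19 non-wild-type residues at each."""
    return math.comb(n_positions, k) * n_substitutions ** k

for k in range(1, 6):
    print(f"k={k}: {n_variants(259, k):,} possible variants")
```

Under these assumptions there are 4,921 possible singles and about 12 million possible doubles, so the 8,179 measured variants sample only a small corner of the space.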
Each dot = one position (mean fitness across all single mutations at that position). Ideal targets: top-left quadrant (low conservation + high fitness).
Low conservation + high fitness. Evolutionarily unconstrained AND mutationally tolerant. Q217P (cons=0.531, fit=+0.34) and N140D (cons=0.713, fit=+0.06) sit here — safest candidates to engineer.
High conservation but positive fitness for specific substitutions. W104L (cons=1.0) and A207E (cons=0.845) — conserved positions where rare mutations improve function. Similar to ICCG's strategy. Higher epistasis risk.
High conservation + low fitness. Catalytic triad (Ser165, Asp210, His242) and structural core cluster here. Locked by evolution — almost all mutations are catastrophic.
Low conservation + low fitness. Not evolutionarily constrained, but mutations don't help either. These are surface-exposed or disordered positions with little functional relevance.
Method: BLASTp search against UniRef90 (E-value < 1e-30, query coverage > 80%) yielded ~150 LCC homologs. Multiple sequence alignment via Clustal Omega, then conservation scored by Scorecons (Valdar 2002). Score range 0→1, where 1.0 = identical across all orthologs.
Rationale: Conserved positions are under evolutionary constraint — mutations are more likely to be deleterious.
Conservation vs DMS fitness (per-position, single-mutation mean): Spearman ρ = −0.247 (p = 6.8×10⁻⁵, n=255 positions). As expected, more conserved positions have lower mean fitness across all substitutions.
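NDCG@10% = 0.70 (computed on per-position mean single-mutation fitness, ranking 259 positions by negated conservation score). NDCG is a normalized ranking-quality score (1.0 = ideal ordering of the top 10%), not the fraction of top positions recovered, so 0.70 indicates that conservation alone gives a useful but imperfect ranking of the top-performing positions.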
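For readers unfamiliar with the metric, here is one common way to compute NDCG at a top fraction. The exact gain transform used for the numbers in this report is not stated, so shifting fitness to non-negative gains is an assumption, as are the toy values below.

```python
import math
from typing import Sequence

def ndcg_at_fraction(
    scores: Sequence[float], fitness: Sequence[float], fraction: float = 0.10
) -> float:
    """NDCG over the top `fraction` of items ranked by `scores`."""
    n = len(fitness)
    k = max(1, math.ceil(fraction * n))
    floor = min(fitness)
    gains = [f - floor for f in fitness]  # assumption: shift gains to >= 0

    def dcg(order):
        # Discounted cumulative gain over the first k items of an ordering.
        return sum(gains[i] / math.log2(rank + 2)
                   for rank, i in enumerate(order[:k]))

    predicted = sorted(range(n), key=lambda i: scores[i], reverse=True)
    ideal = sorted(range(n), key=lambda i: gains[i], reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(predicted) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical per-position values (not taken from the report).
conservation = [1.00, 0.53, 0.99, 0.71, 0.85, 0.46, 0.98, 0.80, 0.90, 0.60]
mean_fitness = [-0.80, 0.34, -0.60, 0.06, 0.23, 0.04, -0.70, -0.38, -0.05, 0.10]

# Rank positions by negated conservation, as described above.
score = ndcg_at_fraction([-c for c in conservation], mean_fitness)
print(round(score, 3))
```

A perfect ranking scores 1.0 under this definition, and the value depends on the gains of the selected items, not just on how many of them are in the true top set.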
| Position | WT AA | Mean Fitness | % Deleterious | Role |
|---|---|---|---|---|
| 165 | Ser | −0.69 | 83% | Catalytic nucleophile |
| 167 | Gly | −1.10 | 100% | Oxyanion hole |
| 210 | Asp | −0.47 | 80% | Catalytic acid |
| 242 | His | −0.80 | 100% | Catalytic base |
| 246 | Asn | −1.27 | 100% | Oxyanion hole |
| 96 | Thr | −0.71 | 100% | Buried structural |
| 170 | Gly | −0.74 | 100% | Near active site |
| Position | Mean Fitness | % Beneficial | Max | Note |
|---|---|---|---|---|
| 217 | +0.76 | 67% | +1.60 | allosteric handle ★ candidate |
| 207 | +0.50 | 60% | +1.83 | allosteric core ★ candidate |
| 286 | +0.44 | 60% | +1.33 | structural_essential C-term loop, ACI 60.1% ★ candidate |
| 117 | +0.64 | 100% | +1.05 | Surface exposed |
| 44 | +0.63 | 86% | +1.70 | N-terminus |
| 229 | +0.56 | 100% | +0.84 | Surface loop |
| Position | Scorecons | Median | Safe? | Status |
|---|---|---|---|---|
| 217 (Q→P) | 0.531 | 0.826 | Yes | Hotspot |
| 140 (N→D) | 0.713 | 0.826 | Yes | Neutral |
| 207 (A→E) | 0.845 | 0.826 | Borderline | Hotspot |
| 104 (W→L) | 1.000 | 0.826 | Conserved | Neutral |
| 286 (R→P) | 0.990 | 0.826 | Conserved | Hotspot |
Apparent contradiction: conserved yet hotspot? "Hotspot" means that specific substitutions at this position are beneficial (e.g. R286P = +1.33), while "conserved" means most organisms keep the wild-type residue. This happens when one or two specific mutations escape the evolutionary constraint — e.g. Pro rigidifies a loop in a way evolution didn't explore. ICCG similarly mutated the highly conserved N246 (scorecons 0.985) successfully.
DMS fitness alone ranks mutations by observed performance, but doesn't explain why they work or predict how they'll combine. We integrate 5 orthogonal zero-shot tools + DMS epistasis analysis — each capturing a different aspect of protein function — to identify positions with convergent multi-signal support, maximizing confidence for wet lab validation.
Identifies positions that participate in allosteric signal transduction to the active site. Mutations at allosteric positions can modulate activity through long-range effects — a mechanism distinct from direct fitness.
Builds a residue interaction network from atomic contacts in the PDB. Identifies structural hub residues. Mutations at hubs are usually catastrophic — exceptions are exceptionally valuable engineering targets.
Protein language model trained on millions of sequences. Captures evolutionary constraints beyond simple conservation — understands amino acid context and co-evolution patterns.
Arc Institute framework (Science 2026). The only tool that directly predicts multi-mutant fitness from single/double data using a neural network. Tests if top positions remain top in combination.
Exhaustive search of all 65,535 subsets of 16 tools via rank averaging. Best combo: ESM-2 3B + ThermoMPNN + OHM ACI (Spearman=0.248, NDCG@10%=0.833) — outperforms any single tool.
OHM (Ohm-based Allosteric Model) analyzes how perturbations at one residue propagate through the protein to the active site.
Output: one ACI score per position (not per amino acid) — it is a property of the position in the structure, not of specific mutations. ACI is a percentile (0–100%) measuring how strongly that position participates in signal transduction to the catalytic triad.
Higher ACI = stronger allosteric coupling to the catalytic triad. This is not simply "better" — it depends on context: high-ACI positions in the Allosteric Handle zone (high ACI + low conservation) are the preferred engineering targets, as mutations there can tune catalysis with manageable epistasis risk. High-ACI positions in the Allosteric Core (conserved) are risky to mutate. Low-ACI positions (Safe Target) produce additive, predictable effects.
OHM classifies each position into one of 4 zones based on ACI and conservation:
| Zone | n | Meaning |
|---|---|---|
| Allosteric Core | 39 | High ACI + conserved → signal relay backbone |
| Allosteric Handle | 23 | High ACI + not conserved → tunable modulators |
| Safe Target | 97 | Low ACI → mutations are additive, low epistasis risk |
| Structural Essential | 94 | Conserved but low ACI → structural integrity |
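The four-zone scheme amounts to two thresholds. The sketch below is an assumption-laden illustration, not OHM's own code: the conservation cutoff reuses the median Scorecons (0.826) quoted elsewhere in this report, and the ACI cutoff of 75 is a guess (39 + 23 = 62 of the 253 classified positions are high-ACI, roughly the top quarter).

```python
def ohm_zone(
    aci_pct: float,
    conservation: float,
    aci_cutoff: float = 75.0,   # assumption: top quartile counts as high ACI
    cons_cutoff: float = 0.826, # assumption: median Scorecons from this report
) -> str:
    """Assign a position to one of the four zones from ACI and conservation."""
    high_aci = aci_pct >= aci_cutoff
    conserved = conservation >= cons_cutoff
    if high_aci:
        return "allosteric_core" if conserved else "allosteric_handle"
    return "structural_essential" if conserved else "safe_target"

# (ACI %, Scorecons) pairs as listed in this report for the five candidates.
candidates = {
    "Q217": (84.9, 0.531),
    "A207": (98.4, 0.845),
    "W104": (78.7, 1.000),
    "R286": (60.1, 0.990),
    "N140": (39.9, 0.713),
}
for name, (aci, cons) in candidates.items():
    print(name, ohm_zone(aci, cons))
```

With these assumed cutoffs the five candidate positions land in the zones reported for them, but other cutoffs would do so too, so this should not be read as recovering the actual thresholds.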
ACI vs DMS single-mutation fitness (per-mutation, n=1,228): Spearman=0.127, NDCG@10%=0.822. ACI has modest Spearman but high NDCG — it excels at identifying the top-10% beneficial positions, even if overall rank ordering is weaker.
Red = Path 1 (core relay), Orange = Path 2 (substrate access), Blue = Path 3 (distal handle). Yellow = catalytic triad. Magenta = candidate mutation sites.
His164→Ser165(cat.)→Gly168→Thr171→Ala207→Glu208→Asp210(cat.)
A207E sits on this path — highest ACI (98.4%) among beneficial mutations. Directly connects two catalytic residues.
Ser241→Phe243→Ala244→Pro245→Asn246(oxa.)→His242(cat.)
ICCG's F243I and N246D are both on this path, which may explain their epistatic rescue despite negative single fitness.
Gln217→Thr188→Leu187→Gly169→[junction]→Ser165(cat.)
Q217P = allosteric handle, rigidifies loop. Longest-range path (~25 Å surface→active site).
ACI radiates outward through a highly conserved, mutationally intolerant core:
| Pos | AA | ACI% | Cons | Fitness | Insight |
|---|---|---|---|---|---|
| 164 | His | 99.6 | 1.00 | −0.18 | DO NOT TOUCH — relay backbone |
| 168 | Gly | 99.2 | 1.00 | −0.13 | DO NOT TOUCH — relay backbone |
| 167 | Gly | 98.1 | 1.00 | −0.67 | Oxyanion hole backbone — ABSOLUTELY IMMUTABLE |
| 171 | Thr | 96.9 | 0.84 | +0.03 | Neutral — tolerable but no gain |
| 169 | Gly | 95.0 | 1.00 | −0.42 | Path 3 junction — Q217P relay feeds through here |
| 166 | Met | 93.8 | 1.00 | −0.47 | Conserved relay residue |
| 170 | Gly | 92.2 | 0.97 | −0.47 | Relay backbone |
| 163 | Gly | 91.9 | 1.00 | −0.46 | Relay backbone |
Insight: The Ser165 relay core is entirely locked by conservation + negative fitness. Engineering must approach from the periphery (Path 3: Q217P → ... → Ser165).
| Pos | AA | ACI% | Cons | Fitness | Insight |
|---|---|---|---|---|---|
| 207 | Ala | 98.4 | 0.85 | +0.23 | ★ A207E — HOTSPOT on relay! |
| 209 | Ala | 96.1 | 0.46 | +0.04 | Low conservation — handle zone |
| 213 | Ala | 94.6 | 1.00 | +0.13 | Synergistic (A213T+T230A Δ=+1.22) |
| 212 | Val | 94.2 | 0.84 | −0.02 | Neutral |
| 208 | Glu | 91.1 | 0.90 | −0.05 | Conserved, relay core |
| 217 | Gln | 84.9 | 0.53 | +0.34 | ★ Q217P — HANDLE + HOTSPOT |
Insight: Position 207 is the only high-ACI position near Asp210 that is also a DMS hotspot. All other high-ACI neighbors are conserved + deleterious.
| Pos | AA | ACI% | Cons | Fitness | Insight |
|---|---|---|---|---|---|
| 241 | Ser | 98.8 | 0.80 | −0.38 | Handle — substrate entry |
| 244 | Ala | 97.3 | 0.75 | −0.10 | Handle zone |
| 240 | Ala | 96.5 | 0.99 | −0.27 | Core relay, conserved |
| 245 | Pro | 95.7 | 1.00 | +0.15 | Only mutable residue on channel |
| 243 | Phe | 93.4 | 0.75 | −0.29 | ICCG F243I sits here |
| 246 | Asn | 93.0 | 0.99 | −0.75 | Oxyanion hole — ICCG N246D |
| 238 | Asp | 90.3 | 0.55 | −0.20 | ICCG D238C disulfide partner |
Insight: The His242 channel is tightly optimized. ICCG mutated two positions here (F243I, N246D) — both individually harmful but combinatorially rescued.
RINpy (Residue Interaction Network in Python) builds a graph where each residue is a node and edges connect residues within 4.5 Å (non-bonded atomic contacts in PDB 4EB0). It then computes betweenness centrality (BC) — the fraction of all shortest paths in the network that pass through each residue.
High-BC residues are structural "hubs" — removing or modifying them disrupts the most communication pathways in the protein. Result: 258 nodes, 1,314 edges.
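Betweenness centrality on an unweighted contact graph can be sketched with Brandes' algorithm. This is not RINpy's code: the adjacency list below is a hypothetical toy network, and in the real analysis the edges would come from the 4.5 Å contacts described above.

```python
from collections import deque
from typing import Dict, List

def betweenness(adj: Dict[int, List[int]]) -> Dict[int, float]:
    """Normalized betweenness centrality for an undirected, unweighted graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes inward.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # Each unordered pair was visited from both ends, so divide by
    # (n-1)(n-2) to get the fraction of pairs routed through a node.
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: b * scale for v, b in bc.items()}

# Hypothetical toy network: residue 2 bridges residues 1 and 3.
toy = {1: [2], 2: [1, 3], 3: [2]}
print(betweenness(toy))
```

Ranking residues by this value in descending order gives a BC rank of the kind used in the table below, with #1 as the most central hub.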
BC vs DMS single-mutation fitness (per-mutation, n=1,068): Spearman=+0.027 (near zero — BC alone barely predicts fitness), NDCG@10%=0.771. Hub residues tend to have lower fitness tolerance, but the relationship is weak at per-mutation level. BC contributes mainly through its inclusion in the best doubles combo.
| Mutation | BC Rank | Degree | Fitness | Category |
|---|---|---|---|---|
| N140D | #5/258 | 14 | +1.84 | Beneficial Hub |
| S136Y | #8/258 | 16 | +1.43 | Beneficial Hub |
| W104L | #104/258 | 9 | +1.39 | Moderate |
| R286P | #99/258 | 10 | +1.33 | Moderate |
| Q217P | #253/258 | 6 | +1.60 | Low BC (Safe) |
BC Rank: betweenness centrality rank out of 258 residues; #1 = most central hub (most shortest paths pass through it). Degree: number of direct residue contacts within 4.5 Å; higher degree = more packed neighbors in the structure. Categories: Beneficial Hub = BC top-10% AND DMS fitness > +0.3 (rare: hub residues usually can't be mutated). Low BC (Safe) = BC bottom-10%, very few structural contacts, safe to mutate without disrupting the fold. Moderate = BC in middle range, some structural role but not a critical hub.
ESM-2 is a transformer-based protein language model trained on ~250M protein sequences; this report scores mutations with several checkpoints, including the 650M- and 3B-parameter models. We use masked marginal scoring: for each position, mask the residue, and compute log P(mutant|context) − log P(wildtype|context). A positive score means the PLM considers the mutant residue more likely than the wild-type residue in this sequence context.
Unlike simple conservation (MSA counting), ESM-2 captures context-dependent co-evolutionary patterns. It can detect that a mutation is acceptable in this specific protein even if the residue is conserved across the family — because the surrounding context compensates.
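The masked marginal score itself is a difference of two log-probabilities at the masked position. The sketch below does not call ESM-2; it assumes the model's logits for the masked position have already been obtained and are passed in as a dictionary. The toy logits and function names are hypothetical.

```python
import math
from typing import Dict

def log_softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Convert per-residue logits into log-probabilities (numerically stable)."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {aa: v - log_z for aa, v in logits.items()}

def masked_marginal_score(
    masked_logits: Dict[str, float], wt_aa: str, mut_aa: str
) -> float:
    """log P(mutant | context) - log P(wild-type | context) at one position."""
    logp = log_softmax(masked_logits)
    return logp[mut_aa] - logp[wt_aa]

# Hypothetical logits at a masked position (three residues shown for brevity).
toy_logits = {"Q": 2.0, "P": 3.0, "A": 0.0}
print(masked_marginal_score(toy_logits, wt_aa="Q", mut_aa="P"))
```

Because both terms share the same normalizer, the score reduces to the difference of the two logits, so only one forward pass per masked position is needed to score all 19 substitutions there.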
Spearman ρ = 0.242 · NDCG@10% = 0.817 (n=1,246 single mutations scored by ESM-2 3B)
ESM-2 is a weak predictor of micro-droplet fitness. This is expected: DMS fitness integrates activity + stability + expression, while ESM-2 primarily captures evolutionary plausibility. ESM-2 is one input signal, not a standalone predictor.
| Mutation | Fitness | SASA | Min Dist† | Structural Zone | Location |
|---|---|---|---|---|---|
| N249H | +1.84 | Surface | 6.6Å | Surface | Surface, near C-terminus |
| N140D | +1.84 | Surface | 12.3Å | Surface | β-sheet edge, RINpy BC major hub (#5/258) |
| A207E | +1.83 | Buried | 3.0Å | 2nd Shell | Secondary shell, allosteric core (3.0Å to binding pocket) |
| T192P | +1.81 | Surface | 3.8Å | 2nd Shell | Secondary shell, surface loop |
| L142M | +1.79 | Buried | 14.9Å | Core | Core, buried |
| Q40P | +1.76 | Surface | 26.0Å | Surface | N-terminus, flexible |
| N44H | +1.66 | Surface | 25.5Å | Surface | N-terminus, flexible |
| N225K | +1.65 | Surface | 12.3Å | Surface | Surface loop |
| Y95F | +1.62 | Surface | — | Binding site | Oxyanion hole (Tournier 2020), direct substrate contact |
| Q217P | +1.60 | Surface | 7.1Å | Surface | Allosteric handle, surface-exposed |
Pattern: 7/10 surface/distal; 2/10 (A207E, L142M) buried; 1/10 (Y95F) in binding site; 2/10 (A207E, T192P) in secondary shell. Binding site: 14 residues with direct PET/MHET contact from Tournier 2020 molecular docking (PDB 94, 95, 125, 127, 130, 164, 165, 166, 190, 210, 212, 242, 243, 246). Secondary shell: any residue with closest atom ≤5Å to any binding site residue (42 residues total, computed from PDB 4EB0). †Min Dist = minimum atom-to-atom distance to any binding site residue; "—" = residue IS in binding site.
ICCG Y127G — IN BINDING SITE (aromatic clamp, direct substrate contact). ICCG/LCC-LANL F243I — IN BINDING SITE (binding pocket). LCC-LANL M125I — IN BINDING SITE (hydrophobic groove). WCCM T130M — IN BINDING SITE. Key insight: ICCG and LCC-LANL primarily engineer the binding pocket itself, not just the secondary shell.
23 beneficial mutations within 10Å of position 217, forming a cluster in the 192–222 region. Key neighbors: T192P (+1.81), A213S (+1.10), F222C (+1.02), P214L (+0.75).
4 synergistic doubles involving Q217, including Q217R+I252T (Δ=+2.45), G170D+Q217H (Δ=+2.17), and Q217K+A250S (Δ=+1.96).
Position 217 is surface-exposed (46% rSASA), polar, 18Å from active site — ideal for engineering without disrupting the catalytic machinery.
Allosteric+Allosteric pairs: worst epistasis (mean Δ=−1.63). Secondary shell+shell: also bad (−1.55). Best pairs: other+other (−0.50) and close_to_active+close_to_active (~39% positive Δ). Structural proximity to active site may increase synergy potential.
MULTI-evolve (Arc Institute, Science 2026) trains an FCNN (fully connected neural network) on measured single + double mutant fitness to predict multi-mutant combinations. We have no experimental doubles for the top-20 singles, so we tried using zero-shot pseudo-doubles instead.
Step 1: Use best doubles zero-shot combo (ESM-2 650M + BLOSUM62 + SaProt + OHM ACI, Sp=0.186 with real doubles) to score 190 pairwise doubles of top-20 singles.
Step 2: Normalize scores to DMS fitness scale using linear mapping learned from 2,467 real doubles (DMS_fitness = 0.0044 × zs_score − 3.87, R²=0.03).
Step 3: Train FCNN on 21 real singles + 190 normalized pseudo-doubles = 211 training points.
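Steps 1 and 2 can be sketched as follows. The slope and intercept are the ones quoted in Step 2. Treating a pair's zero-shot score as the sum of its two single scores is an assumption here, and the mutation names and scores are placeholders.

```python
from itertools import combinations
from typing import Dict, Tuple

SLOPE, INTERCEPT = 0.0044, -3.87  # linear mapping quoted in Step 2

def pseudo_doubles(single_scores: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    """Score every pair of singles and map it onto the DMS fitness scale."""
    out = {}
    for a, b in combinations(sorted(single_scores), 2):
        zs = single_scores[a] + single_scores[b]  # assumption: additive pair score
        out[(a, b)] = SLOPE * zs + INTERCEPT
    return out

# 20 placeholder singles with hypothetical zero-shot scores.
top20 = {f"M{i}": float(i) for i in range(20)}
pairs = pseudo_doubles(top20)
print(len(pairs))
```

With a slope this small and R²=0.03, the mapped pseudo-labels vary very little from pair to pair, which limits how much epistasis signal a model trained on them can pick up.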
| Method | Sp | NDCG |
|---|---|---|
| ESM-2 3B additive (no training) | +0.177 | 0.703 |
| ZS combo additive (no training) | +0.166 | 0.706 |
| DMS additive (no training) | +0.105 | 0.667 |
| FCNN w/ ZS pseudo-doubles | −0.011 | 0.646 |
| FCNN w/ DMS additive pseudo | −0.023 | 0.620 |
The FCNN uses 5,180-dim one-hot features (259 positions × 20 AAs). It only sees the 20 positions in the top-20 singles during training. For k=3 variants involving any other position, the model has zero learned weights — it outputs random predictions.
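The coverage problem is easy to see in the encoding itself. This is a minimal sketch of a 259 × 20 one-hot layout; the position numbering (1 to 259) and the helper names are hypothetical and only illustrate the indexing.

```python
from typing import List, Sequence, Set, Tuple

AAS = "ACDEFGHIKLMNPQRSTVWY"       # 20 amino acids
POSITIONS = list(range(1, 260))    # 259 positions (hypothetical numbering)

Variant = Sequence[Tuple[int, str]]  # e.g. [(217, "P"), (140, "D")]

def one_hot(variant: Variant, positions: List[int] = POSITIONS) -> List[int]:
    """Flat 0/1 vector with one slot per (position, amino acid) pair."""
    index = {p: i for i, p in enumerate(positions)}
    vec = [0] * (len(positions) * len(AAS))
    for pos, aa in variant:
        vec[index[pos] * len(AAS) + AAS.index(aa)] = 1
    return vec

def active_columns(variants: Sequence[Variant]) -> Set[int]:
    """Feature columns that are non-zero in at least one variant."""
    cols: Set[int] = set()
    for v in variants:
        cols.update(i for i, x in enumerate(one_hot(v)) if x)
    return cols

train_cols = active_columns([[(217, "P")], [(140, "D")], [(217, "P"), (140, "D")]])
unseen_cols = active_columns([[(40, "P"), (44, "H"), (225, "K")]])
print(len(one_hot([(217, "P")])), train_cols & unseen_cols)
```

When a validation variant shares no active columns with the training set, the first-layer weights it uses were never updated, so its prediction reflects initialization rather than anything learned.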
Only 36 of 1,650 k=3 validation variants had even 1 mutation in the top-20. On those 36, the zero-shot FCNN achieves Sp=+0.215 — but this is too few to be reliable.
Replace one-hot with 16 zero-shot model scores per mutation (sum for doubles). Train Ridge/GBR on 1,787 real doubles:
| Method | k=3 Sp | k=3 NDCG |
|---|---|---|
| Ridge (16-dim, all doubles) | +0.129 | 0.681 |
| ESM-2 3B additive (no training) | +0.177 | 0.703 |
Use top-50 DMS doubles (fitness 2.1–3.9) as training data. 154 samples: 87 singles + 66 doubles + WT. 16-dim features, FCNN (128-64):
| Method | k=3 All (n=1650) | k=3 In-Training Positions (n=59) |
|---|---|---|
| FCNN 16-dim (top doubles) | Sp=+0.066 | Sp=+0.132, NDCG=0.728 |
| ESM-2 3B additive | Sp=+0.177 | Sp=+0.241, NDCG=0.710 |
Key finding: Within the trained position set, 16-dim FCNN achieves the best NDCG (0.728) — it captures some epistasis signal from real doubles. But it doesn't generalize to unseen positions.
For each of 259 positions, we count how many tools independently flag it as "interesting": (1) DMS fitness in top 25% · (2) ACI above median · (3) BC above median · (4) Appears in top-100 multi-mutants · (5) Low conservation (scorecons < median). Only 13 of 259 positions (5%) have 4+ tools agreeing — these are the consensus positions.
| PDB Pos | Structural Zone | n Tools | DMS Top25% | ACI > median | BC > median | In Top Multimuts | Low Conserv. | Candidate? |
|---|---|---|---|---|---|---|---|---|
| 217 | Surface | 4/5 | ✓ | ✓ | — | ✓ | ✓ | ★ Yes (Q217P) |
| 286 | Surface | 4/5 | ✓ | ✓ | ✓ | ✓ | — | ★ Yes (R286P) |
| 104 | 2nd Shell | 4/5 | ✓ | ✓ | ✓ | ✓ | — | ★ Yes (W104L) |
| 140 | Surface | 4/5 | ✓ | — | ✓ | ✓ | ✓ | ★ Yes (N140D) |
| 207 | 2nd Shell | 3/5 | ✓ | ✓ | — | ✓ | — | ★ Yes (A207E) |
| 197 | Surface | 5/5 | ✓ | ✓ | ✓ | ✓ | ✓ | — |
| 203 | Surface | 5/5 | ✓ | ✓ | ✓ | ✓ | ✓ | — |
| 192 | 2nd Shell | 5/5 | ✓ | ✓ | ✓ | ✓ | ✓ | — |
| 193 | Surface | 4/5 | ✓ | ✓ | — | ✓ | ✓ | — |
| 136 | Surface | 5/5 | ✓ | ✓ | ✓ | ✓ | ✓ | — |
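The consensus count is a per-position tally of the five criteria. The sketch below re-enters a few rows of the table above as booleans, in the table's column order; the data structure and function name are illustrative choices.

```python
from typing import Dict, Tuple

CRITERIA = (
    "dms_top25", "aci_above_median", "bc_above_median",
    "in_top_multimutants", "low_conservation",
)

# Rows copied from the table above, in the column order of CRITERIA.
flags: Dict[int, Tuple[bool, ...]] = {
    217: (True, True, False, True, True),
    286: (True, True, True, True, False),
    104: (True, True, True, True, False),
    140: (True, False, True, True, True),
    207: (True, True, False, True, False),
    197: (True, True, True, True, True),
}

def consensus(rows: Dict[int, Tuple[bool, ...]], min_tools: int = 4) -> Dict[int, int]:
    """Positions flagged by at least `min_tools` criteria, with their count."""
    counts = {pos: sum(f) for pos, f in rows.items()}
    return {pos: n for pos, n in counts.items() if n >= min_tools}

print(consensus(flags))
```

Position 207 drops out at the 4+ threshold with 3/5, which matches the table: it is kept as a candidate on other grounds, not on consensus count.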
Orange = 5 candidate mutations. Red = catalytic triad. Grey = protein backbone.
3D viewer labels use PDB 4EB0 residue numbers (= DMS position + 34).
| Mutation | Viewer label | Location | Structural Role |
|---|---|---|---|
| Q217P | GLN217 | Surface loop | Allosteric handle — remote from active site, modulates via Path 3 |
| W104L | TRP104 | Core helix | Allosteric core — buried, part of signal relay network |
| A207E | ALA207 | Active site adjacent | On Path 1 relay — directly influences catalytic Asp210 |
| R286P | ARG286 | C-terminal loop | Structural essential — Pro rigidifies C-terminus |
| N140D | ASN140 | β-sheet edge | RINpy BC #5/258 (major hub) — sits on many structural shortest paths; defies trend as a beneficial hub |
| Ser165 | SER165 | Active site | Catalytic nucleophile — DO NOT TOUCH |
| Asp210 | ASP210 | Active site | Catalytic acid — DO NOT TOUCH |
| His242 | HIS242 | Active site | Catalytic base — DO NOT TOUCH |
| Mutation | Fitness | Deep Mining Rank† | ACI % | BC Centrality Rank (of 258 positions) | Scorecons | OHM Zone | Hotspot? | In Top Combos? | Consensus |
|---|---|---|---|---|---|---|---|---|---|
| Q217P | +1.60 | #1 | 84.9% | #253 | 0.531 | handle | Yes | ✓ | 4/5 |
| W104L | +1.39 | #2 | 78.7% | #104 | 1.000 | core | — | ✓ | 4/5 |
| A207E | +1.83 | #3 | 98.4% | #132 | 0.845 | core | Yes | ✓ | 3/5 |
| R286P | +1.33 | #4 | 60.1% | #99 | 0.990 | essential | Yes | ✓ | 3/5† |
| N140D | +1.84 | #5 | 39.9% | #5 | 0.713 | safe_target (RINpy hub) | — | ✓ | 4/5 |
This is why we propose a phased approach with k=3 as a safer first test.
Select 15 top mutations → synthesize C(15,2)=105 pairwise doubles. Measure in micro-droplet or plate assay. Unlocks real epistasis data for MULTI-evolve retraining.
Retrain FCNN on 105 real doubles + 15 singles. Predict k=3..10 with epistasis-aware model. Expected: identify variants exceeding ICCG (Tm > 90.9°C, >90% PET in 10h).
For 2,467 double mutants where both singles are measured: Δ = f_observed(AB) − [f(A) + f(B)]. Positive Δ = synergy (better than expected). Negative Δ = antagonism (worse than expected). We then grouped pairs by whether each single is beneficial (>+0.3), neutral, or deleterious (<−0.3).
| Pair Type | n | Mean Δ | % Synergy | % Antagonism | Mean Observed |
|---|---|---|---|---|---|
| Beneficial + Beneficial | 192 | −1.724 | 12.5% | 87.5% | −0.425 |
| Beneficial + Neutral | 606 | −1.276 | 19.3% | 80.7% | −0.610 |
| Beneficial + Deleterious | 424 | −0.737 | 31.4% | 68.6% | −0.826 |
| Neutral + Neutral | 429 | −0.653 | 33.8% | 66.2% | −0.635 |
| Deleterious + Neutral | 589 | −0.245 | 45.5% | 54.5% | −0.963 |
| Deleterious + Deleterious | 227 | +0.496 | 63.4% | 36.6% | −0.995 |
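The Δ definition and the ±0.3 grouping can be written out directly. This is a small sketch with hypothetical function names; the worked example uses the F243I + P38L values quoted earlier in this report.

```python
def classify(f: float, threshold: float = 0.3) -> str:
    """Label a single mutation by its fitness relative to +/- threshold."""
    if f > threshold:
        return "Beneficial"
    if f < -threshold:
        return "Deleterious"
    return "Neutral"

def epistasis(f_a: float, f_b: float, f_ab: float) -> float:
    """Observed double-mutant fitness minus the additive expectation."""
    return f_ab - (f_a + f_b)

def pair_type(f_a: float, f_b: float) -> str:
    """Order-independent pair label, e.g. 'Beneficial + Neutral'."""
    return " + ".join(sorted([classify(f_a), classify(f_b)]))

# Values quoted in this report: F243I = -0.13, P38L = +0.42, double = -1.52.
delta = epistasis(-0.13, 0.42, -1.52)
print(pair_type(-0.13, 0.42), round(delta, 2))
```

Note that F243I falls in the neutral band under the ±0.3 rule, so this pair counts as Beneficial + Neutral, not Beneficial + Deleterious.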
Combining two beneficial mutations is the worst strategy. 87.5% of beneficial+beneficial pairs show antagonism (mean Δ = −1.72). The average observed fitness of two beneficial mutations combined is −0.425 — worse than wild-type, despite an additive prediction of +1.30.
Conversely, two deleterious mutations combined synergize 63.4% of the time (mean Δ = +0.50). This is consistent with the best k=5 in DMS (T121M+Y127N+A183V+F196S+A281T = +6.98) being built from 5 individually deleterious mutations.
We tested 16 zero-shot scoring functions (no DMS data used for training), including the recently published FAMPNN (Full-Atom MPNN, ICML 2025). Each scores every single mutation independently. We then exhaustively searched all 2¹⁶−1 = 65,535 subsets, combining scores via rank averaging (convert each model's scores to ranks, then average). Evaluated by Spearman ρ and NDCG@10% against DMS single-mutation fitness (n=1,228 mutations with all 16 scores available).
| # | Model | Type | Spearman | NDCG@10% |
|---|---|---|---|---|
| 1 | ESM-2 3B | PLM | +0.242 | 0.817 |
| 2 | ESM-2 650M | PLM | +0.215 | 0.789 |
| 3 | SaProt 650M | Structure-PLM | +0.204 | 0.809 |
| 4 | ESM-1v | PLM | +0.191 | 0.789 |
| 5 | ESM-2 150M | PLM | +0.186 | 0.764 |
| 6 | ThermoMPNN | ddG/Structure | +0.179 | 0.803 |
| 7 | MSA log-odds | Evolution | +0.162 | 0.807 |
| 8 | ProstT5 | Structure-PLM | +0.179 | 0.773 |
| 9 | Conservation | MSA/Scorecons | +0.132 | 0.793 |
| 10 | OHM ACI | Allostery | +0.127 | 0.822 |
| 11 | MSA mut-freq | Evolution | +0.132 | 0.819 |
| 12 | BLOSUM62 | Substitution | +0.123 | 0.807 |
| 13 | ProFAM | Autoregressive PLM | +0.154 | 0.816 |
| 14 | FAMPNN | Full-Atom Design | +0.115 | 0.800 |
| 15 | ProteinMPNN | Structure | +0.140 | 0.803 |
| 16 | RINpy BC | Network | +0.027 | 0.771 |
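The report does not spell out how NDCG@10% is computed, so the following is only a sketch of one common convention. It assumes gains are min-max scaled fitness values and that the cutoff is k = floor(0.10 × n) with a minimum of 1. The function name `ndcg_at_fraction` and the toy values are hypothetical.

```python
import math

def ndcg_at_fraction(pred, truth, fraction=0.10):
    """NDCG over the top `fraction` of mutations ranked by `pred`.

    Assumptions (not stated in the report): gains are min-max scaled
    fitness values, and k = floor(fraction * n) with a minimum of 1.
    """
    lo, hi = min(truth), max(truth)
    if hi == lo:
        raise ValueError("truth must contain at least two distinct values")
    gains = [(t - lo) / (hi - lo) for t in truth]
    k = max(1, int(len(truth) * fraction))

    # Indices of the k highest-scoring mutations according to the predictor.
    top_pred = sorted(range(len(pred)), key=lambda i: pred[i], reverse=True)[:k]
    dcg = sum(gains[i] / math.log2(pos + 2) for pos, i in enumerate(top_pred))

    # Ideal ordering: the k largest gains in descending order.
    ideal = sorted(gains, reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg

# Made-up fitness values, NOT from the DMS dataset.
truth = [1.5, -0.2, 0.4, -1.0, 0.0, 0.9, -0.6, 0.2, -1.3, 0.6]
perfect = ndcg_at_fraction(truth, truth)   # predictor identical to truth
pred = list(truth)
pred[5] = 10.0                             # predictor puts the 2nd-best mutation first
second_best = ndcg_at_fraction(pred, truth)
```

Under this assumed definition, even an uninformative ranking scores well above zero because most mutations carry a non-zero scaled gain, so NDCG values are best compared between rows and not read against a zero baseline.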
| Size | Best Combination | Spearman | NDCG@10% |
|---|---|---|---|
| 1 | ESM-2 3B | +0.242 | 0.817 |
| 2 | ESM-2 3B + OHM ACI | +0.248 | 0.811 |
| 3 | ESM-2 3B + ThermoMPNN + OHM ACI | +0.248 | 0.833 |
| 4 | + ESM-2 650M | +0.254 | 0.808 |
| 5 | + SaProt + Conservation | +0.252 | 0.799 |
| 6 | + BLOSUM62 | +0.249 | 0.801 |
| ... | Spearman decreases monotonically as more features are added | | |
| 15 | All 15 features | +0.213 | 0.793 |
| Size | Best Combination | Spearman | NDCG@10% |
|---|---|---|---|
| 3 | ThermoMPNN + FAMPNN + OHM ACI | +0.200 | 0.823 |
| 5 | ThermoMPNN + BLOSUM62 + FAMPNN + OHM ACI + RINpy | +0.200 | 0.826 |
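Rank averaging followed by an exhaustive subset search, as described for the singles benchmark, might look like the sketch below. It uses only the standard library, implements Spearman ρ as the Pearson correlation of tie-averaged ranks, and runs on three hypothetical models with made-up scores. The names `rank_average` and `best_subset_per_size` are illustrative, not the report's code.

```python
import math
from itertools import combinations

def average_ranks(values):
    """Ascending 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def spearman(pred, truth):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    x, y = average_ranks(pred), average_ranks(truth)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rank_average(score_lists):
    """Convert each model's scores to ranks, then average the ranks per mutation."""
    ranked = [average_ranks(scores) for scores in score_lists]
    return [sum(column) / len(column) for column in zip(*ranked)]

def best_subset_per_size(scores, truth):
    """Try every non-empty subset of models; keep the best Spearman per subset size."""
    names = sorted(scores)
    best = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            rho = spearman(rank_average([scores[m] for m in combo]), truth)
            if size not in best or rho > best[size][1]:
                best[size] = (combo, rho)
    return best

# Made-up scores for three hypothetical models, NOT real predictor output.
truth = [0.9, -0.4, 0.3, -1.1, 0.1, 0.6]
scores = {
    "model_a": [2.0, -1.0, 0.5, -3.0, 0.0, 1.0],   # same ordering as truth
    "model_b": [0.1, 0.2, 0.9, -0.5, 0.4, 0.3],
    "model_c": [-0.2, 0.5, 0.1, 0.0, -0.7, 0.3],
}
best = best_subset_per_size(scores, truth)
n_subsets_16 = 2 ** 16 - 1   # 65,535 non-empty subsets of 16 models
```

Because every subset is scored against the same DMS labels that define "best", the winning combination is selected in-sample; the reported combo scores are upper bounds on what the same combination would achieve on unseen data.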
For each double mutant A+B with both singles measured (n=1,787), predict fitness as score(A)+score(B). Same 16 zero-shot features (including FAMPNN), exhaustive subset search (32,767 combos). Also tested with DMS additive f(A)+f(B) included as a 16th feature.
| # | Model | Spearman | NDCG@10% |
|---|---|---|---|
| 1 | ESM-2 650M | +0.159 | 0.674 |
| 2 | ESM-2 3B | +0.154 | 0.669 |
| 3 | ESM-1v | +0.141 | 0.677 |
| 4 | SaProt | +0.140 | 0.664 |
| 5 | ProstT5 | +0.137 | 0.668 |
| 6 | ESM-2 150M | +0.131 | 0.668 |
| 7 | MSA log-odds | +0.128 | 0.663 |
| — | DMS additive f(A)+f(B) | +0.127 | 0.680 |
| 8 | OHM ACI | +0.103 | 0.678 |
| 9 | ThermoMPNN | +0.066 | 0.676 |
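Note: ESM-2 650M beats ESM-2 3B for doubles. ThermoMPNN drops sharply (Spearman +0.179 on singles, +0.066 on doubles), consistent with stability not being the same as combinatorial fitness. By Spearman, all PLM additive scores outperform DMS additive f(A)+f(B); by NDCG@10%, DMS additive (0.680) is the highest of the rows listed.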
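The additive scoring of doubles, score(A)+score(B), is simple enough to sketch directly. The mutation names and score values below are placeholders, not entries from the DMS or from any predictor, and skipping doubles with an unscored single is an assumption that mirrors the "both singles measured" filter.

```python
def additive_double_score(single_scores, double):
    """score(A+B) = score(A) + score(B); `double` is a pair of mutation names."""
    a, b = double
    return single_scores[a] + single_scores[b]

def score_doubles(single_scores, doubles):
    """Score only the doubles whose two singles both have a score."""
    return {
        d: additive_double_score(single_scores, d)
        for d in doubles
        if d[0] in single_scores and d[1] in single_scores
    }

# Placeholder names and made-up scores, NOT real predictor output.
single_scores = {"mutA": 0.4, "mutB": -0.2, "mutC": 1.1}
doubles = [("mutA", "mutB"), ("mutA", "mutC"), ("mutB", "mutD")]

scored = score_doubles(single_scores, doubles)          # ("mutB", "mutD") is skipped
ranked = sorted(scored, key=scored.get, reverse=True)   # best predicted double first
```

The same function covers the "DMS additive" baseline: passing measured single-mutant fitness as `single_scores` yields f(A)+f(B), which can then be rank-averaged with the zero-shot features.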
| Category | Best Combination | Spearman | NDCG@10% |
|---|---|---|---|
| Best Spearman (zero-shot) | ESM-2 650M + BLOSUM62 + SaProt + OHM ACI | +0.186 | 0.676 |
| Best NDCG (zero-shot) | ESM-2 3B + ESM-2 650M + ProteinMPNN + MSA log-odds + MSA mut-freq + Conservation | +0.145 | 0.696 |
| Best Spearman (+DMS additive) | ESM-2 650M + BLOSUM62 + OHM ACI + DMS additive | +0.202 | 0.674 |
| Best NDCG (+DMS additive) | ProstT5 + Conservation + OHM ACI + RINpy BC + DMS additive | +0.166 | 0.703 |
| Tool | What It Measures | Sp / NDCG | Key Finding for Our Candidate |
|---|---|---|---|
| DMS Fitness | Direct experimental activity+stability+expression | — / — | All 5 mutations are beneficial singles (+1.33 to +1.84) |
| OHM ACI | Allosteric communication to active site | 0.127 / 0.822 | A207E on Path 1 (ACI 98.4%), Q217P on Path 3 (handle). Orthogonal paths |
| RINpy BC | Structural hub identification (betweenness centrality) | 0.027 / 0.771 | N140D = structural hub #5/258 (rare beneficial hub). Q217P = low BC #253/258 (safe peripheral position; its allosteric effect runs through the OHM relay, not the structural network) |
| ESM-2 3B | Evolutionary constraint (masked marginal, 1246 singles) | 0.242 / 0.817 | Best single zero-shot predictor. Correctly flags catalytic triad |
| SaProt 650M | Structure-aware PLM (AA + 3Di tokens) | 0.204 / 0.809 | Adds structure signal; enters best doubles combo but not singles |
| ProstT5 | Structure-aware PLM (3Di conditional LLR) | 0.179 / 0.773 | Moderate; enters best doubles NDCG combo |
| ThermoMPNN | Stability prediction (ddG from PDB) | 0.179 / 0.803 | Captures stability — orthogonal to PLM. In best singles combo |
| ProFAM | Autoregressive protein family LM (251M params) | 0.154 / 0.816 | Moderate Spearman, good NDCG. Family-specific autoregressive model. |
| FAMPNN | Full-atom protein design (ICML 2025) | 0.115 / 0.800 | Low Spearman; NDCG alone is mid-range, below OHM ACI (0.822) and ESM-2 3B (0.817). Appears in the ThermoMPNN + FAMPNN + OHM ACI combination (NDCG 0.823). |
| Conservation | Evolutionary constraint (150 orthologs, Scorecons) | 0.132 / 0.793 | Q217P & N140D below median (safe). W104L & R286P conserved — like ICCG |
| MULTI-evolve | Multi-mutant fitness prediction (FCNN) | — / — | Converges on same top positions. Needs real doubles to unlock |
| Epistasis (DMS obs.) | Observed non-additive interactions from 2,467 doubles (not a predictor) | — / — | Mean Δ=−0.70. Core×core pairs synergize. Data-derived, not zero-shot. |
| Best Singles Combo | ESM-2 3B + ThermoMPNN + OHM ACI (rank avg) | 0.248 / 0.833 | 65,535 subsets searched. 3 orthogonal signals: evolution + stability + allostery |
| Best Doubles Combo | ESM2-650M + BLOSUM62 + SaProt + OHM ACI (rank avg) | 0.186 / 0.676 | Different optimal combo for doubles. OHM ACI appears in both. |