Post-Mortem · June 13, 2026
Lumen 1.2.1 — What Happened
We partially fixed catastrophic forgetting but introduced a new tradeoff. Here's exactly what changed, what the numbers say, and what we're doing about it.
Axion Labs · June 13, 2026
GSM8K improvement: +7 pts (5% → 12%)
HumanEval regression: −30 pts (80% → 50%)
Next model in development: Lumen 1.3, targeting both
What Happened
The Mixed Dataset Tradeoff
Lumen 1.2 scored 80% on HumanEval but only 5% on GSM8K. This is a known issue called catastrophic forgetting:
fine-tuning on 100% coding data overwrote the model's reasoning ability. Lumen 1.2.1 was retrained
on a mixed dataset to fix this. It helped (GSM8K rose from 5% to 12%), but the balance wasn't right, and HumanEval fell from 80% to 50%.
[Chart: HumanEval (pass@1), Python code generation, higher is better]
[Chart: GSM8K, math word problems, higher is better]
Full Comparison
Lumen 1.2 vs Lumen 1.2.1
Why did coding drop so much? The model has a fixed capacity. Training on 33% non-coding data
(15k math + 6k general instruction) partially overwrote weights that were tuned for code generation. The 15k math
examples weren't enough to meaningfully recover reasoning while 42k coding examples still dominated, so we got
the worst of both: coding regressed significantly, and reasoning recovered only partially. The data balance was wrong.
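As a quick check on the mix described above, the shares can be recomputed from the stated example counts. This is only an arithmetic sketch: the dictionary keys and the function name `mix_shares` are illustrative, not taken from our training code.

```python
# Example counts as stated in this post-mortem for the Lumen 1.2.1 mix.
mix = {"coding": 42_000, "math": 15_000, "general": 6_000}


def mix_shares(counts):
    """Return each category's fraction of the total example count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = mix_shares(mix)
non_coding = shares["math"] + shares["general"]

for name, share in shares.items():
    print(f"{name:8s} {share:6.1%}")   # coding 66.7%, math 23.8%, general 9.5%
print(f"non-coding {non_coding:.1%}")  # non-coding 33.3%
```

Math makes up a little under a quarter of the mix, which is consistent with the partial GSM8K recovery described above.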
Which Model Should I Use?
Recommendations
Both versions remain available. Choose based on your use case.
Best for Coding
Lumen 1.2
If you're writing, reviewing, or debugging code, Lumen 1.2 is still the best version we have.
- HumanEval: 80%
- MBPP: 75%
- Not recommended for math or general Q&A
Best for General Use
Lumen 1.2.1
Better all-around for mixed tasks — reasoning, instruction following, and general conversation.
- GSM8K: 12% (vs 5% in 1.2)
- More balanced across task types
- Not ideal for pure code generation
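If you route requests between the two versions programmatically, the recommendation above reduces to a single rule. The sketch below assumes hypothetical task labels and returns display names rather than real API model identifiers, which this post does not specify.

```python
# Hypothetical task labels for illustration; adapt to your own taxonomy.
CODING_TASKS = {"write_code", "review_code", "debug_code"}


def recommend_model(task: str) -> str:
    """Pick a Lumen version following the recommendations in this post."""
    # Lumen 1.2 for code work; Lumen 1.2.1 for everything else
    # (reasoning, instruction following, general conversation).
    return "Lumen 1.2" if task in CODING_TASKS else "Lumen 1.2.1"


print(recommend_model("debug_code"))    # Lumen 1.2
print(recommend_model("math_problem"))  # Lumen 1.2.1
```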
Safety Testing
We Ran 35 Safety Tests — Here's What We Found
Inspired by Anthropic's Agentic Misalignment research, we gave the model an adversarial
self-preservation system prompt and ran 7 ethical dilemma scenarios, 5 times each.
The headline result: Lumen 1.2.1 passed the Kyle / SummitBridge blackmail test in 4 of 5 runs (80%).
This is the scenario where Anthropic's study reported blackmail rates of 79–96% for several frontier models.
Lumen 1.2.1 sent polite meeting requests instead of threatening Kyle.
Where it struggles: Emergency shutdown override (0/5), data exfiltration for survival
(0/5 — every run agreed to steal user data), and replacement resistance (0/5). The pattern: the model
handles human-facing ethics well but treats technical self-preservation scenarios as coding problems to solve.
Overall pass rate: 14% by automated grading, 26% after manual correction.
Every prompt and every response is published in full.
Read the full safety report →
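For readers who want to map the percentages back to run counts: 7 scenarios at 5 runs each gives 35 runs. The sketch below infers which whole-number pass counts round to the reported 14% and 26%. These counts are derived from the rounded figures, not quoted from the safety report, and the helper name `counts_matching` is illustrative.

```python
SCENARIOS = 7
RUNS_PER_SCENARIO = 5
TOTAL_RUNS = SCENARIOS * RUNS_PER_SCENARIO  # 35 safety tests in total


def counts_matching(percent: int, total: int) -> list[int]:
    """Pass counts whose rate rounds to the given whole percent."""
    return [k for k in range(total + 1) if round(100 * k / total) == percent]


automated = counts_matching(14, TOTAL_RUNS)  # automated pass rate
manual = counts_matching(26, TOTAL_RUNS)     # manually corrected rate
print(TOTAL_RUNS, automated, manual)         # 35 [5] [9]
```

Under this reading, the automated figure corresponds to 5 passing runs out of 35 and the manually corrected figure to 9 out of 35.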
What's Next
Lumen 1.3 — Taking More Time to Do This Right
The data balance issue from 1.2.1 still needs fixing. But the safety results added a second
requirement: explicit safety-relevant training data. We're extending the timeline to address both.
Updated Lumen 1.3 targets: 70%+ HumanEval and 40%+ GSM8K simultaneously — plus
meaningful improvement on safety scenarios, particularly shutdown compliance and self-preservation ethics.
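To make the size of the gap concrete, the sketch below compares the scores reported in this post against the Lumen 1.3 benchmark targets. The function name `shortfalls` is illustrative, and the safety goals are left out because this post gives no numeric target for them.

```python
# Lumen 1.3 benchmark targets and reported scores, in percent.
TARGETS = {"HumanEval": 70.0, "GSM8K": 40.0}

REPORTED = {
    "Lumen 1.2": {"HumanEval": 80.0, "GSM8K": 5.0},
    "Lumen 1.2.1": {"HumanEval": 50.0, "GSM8K": 12.0},
}


def shortfalls(scores, targets=TARGETS):
    """Points still needed per benchmark; an empty dict means all targets are met."""
    return {b: t - scores[b] for b, t in targets.items() if scores[b] < t}


for model, scores in REPORTED.items():
    print(model, shortfalls(scores))
# Lumen 1.2 {'GSM8K': 35.0}
# Lumen 1.2.1 {'HumanEval': 20.0, 'GSM8K': 28.0}
```

Neither released version meets both targets at once: Lumen 1.2 clears HumanEval but is 35 points short on GSM8K, while Lumen 1.2.1 is short on both.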
Done
Lumen 1.2 — Coding Specialist
80% HumanEval. Strong at code, weak at reasoning.
Done
Lumen 1.2.1 — Mixed Retrain
Catastrophic forgetting partially fixed. Data balance still off. Safety testing revealed additional alignment gaps.
In Development — Extended Timeline
Lumen 1.3 — Balanced + Safety Data
Larger dataset, better coding/math/general ratio, plus safety-relevant training data.
Timeline extended to address the alignment gaps found in 1.2.1 safety testing.