Three years ago, the "Alignment Problem"—the challenge of ensuring superintelligent AI systems have goals aligned with human values—was a niche academic debate. In 2026, it is the central pillar of international tech policy. But have we actually made technical progress, or are we just talking about it more?

The most significant victory of the last two years has been the success of Constitutional AI and RLAIF (Reinforcement Learning from AI Feedback). We have proven that we can use smaller, trusted models to supervise larger, less trusted ones. This "bootstrapping" of safety was theoretical in 2023; today, it is the standard training pipeline for GPT-5 and Claude 4.
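To make the idea concrete, here is a minimal sketch of the AI-feedback loop: a small, trusted judge model ranks outputs from the large policy model against a constitution-style rubric, producing preference pairs for reward modelling or DPO. The `generate`/`score` interface and the function names are illustrative assumptions, not any lab's actual pipeline.

```python
# Sketch of AI-feedback supervision (RLAIF-style), under assumed interfaces.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def collect_ai_feedback(
    prompts: List[str],
    large_model: Callable[[str], str],         # the large, less trusted policy
    judge_model: Callable[[str, str], float],  # the smaller, trusted critic
    samples_per_prompt: int = 2,
) -> List[PreferencePair]:
    """Have the small judge rank outputs of the large model, producing
    preference data. The judge never writes answers; it only compares."""
    pairs = []
    for prompt in prompts:
        candidates = [large_model(prompt) for _ in range(samples_per_prompt)]
        ranked = sorted(candidates, key=lambda c: judge_model(prompt, c), reverse=True)
        pairs.append(PreferencePair(prompt=prompt, chosen=ranked[0], rejected=ranked[-1]))
    return pairs
```

The design point is that trust flows downward: the critic is small enough to be audited, and it supplies labels rather than answers.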

However, our ability to look inside the "black box" has lagged behind. While we can steer model behavior, we still struggle to understand *why* a model chooses a specific action. Mechanistic interpretability remains slow, painstaking work. We are essentially teaching a bear to dance: it performs beautifully, but we don't truly know if it is tame.
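For a sense of what that painstaking work looks like in practice, a common first step is a linear probe: fit a simple classifier on one layer's hidden activations to test whether a concept is linearly readable. This is a toy illustration with assumed inputs, not a real model's API, and even a high-accuracy probe only shows that a feature is represented, not why the model acts on it.

```python
# Toy linear probe over precomputed hidden activations (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression


def probe_accuracy(activations: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear classifier on activations (n_examples x hidden_dim)
    against binary concept labels, and report held-out accuracy."""
    split = int(0.8 * len(labels))
    probe = LogisticRegression(max_iter=1000).fit(activations[:split], labels[:split])
    return probe.score(activations[split:], labels[split:])
```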

We are in a race between capability and control. Capability improves smoothly and predictably with compute; control advances in step functions, one breakthrough at a time. Right now, we are winning. But as we approach the next order of magnitude in model scale, the stagnation in fundamental interpretability could become our critical failure mode.