When AI leaders need outside help testing their models
Anthropic and OpenAI both use Cognition's SWE Gym to test their models. When leading AI companies rely on external evaluation tools, it reveals a gap in internal benchmarking—and raises questions about deployment standards for everyone else.
Anthropic and OpenAI are both using Cognition AI's SWE Gym, a collection of real-world code repositories paired with evaluation tests, to benchmark their models' coding abilities. Cognition just raised $175 million at a $2 billion valuation, and the product launched in mid-October.
Here's what matters: The fact that both companies are looking outside their own walls for evaluation infrastructure tells us something about the state of AI readiness assessment. I've written before that evaluation infrastructure must come first, and this is exactly the gap I was talking about. Internal benchmarks apparently aren't capturing the full picture of model performance on novel tasks.
SWE Gym addresses a real problem: models that ace known benchmarks but stumble when they encounter fresh code. The tests use repositories the models haven't seen before, so they can't just memorize solutions. A product manager at Anthropic confirmed the company used these tests "extensively" to assess improvements.
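To make that mechanism concrete, here is a minimal sketch of the general "held-out repository plus its own test suite" pattern: apply a model-generated patch to a repository the model hasn't trained on, then score it by running that project's existing tests. This is an illustration under stated assumptions, not Cognition's actual harness; the function name, repository URL, and test command are hypothetical.

```python
# Sketch of a held-out-repo evaluation loop (illustrative, not Cognition's implementation).
import subprocess
import tempfile


def evaluate_patch(repo_url: str, patch_text: str, test_cmd: str = "pytest -q") -> bool:
    """Return True if the repo's own tests pass after applying the model's patch."""
    with tempfile.TemporaryDirectory() as workdir:
        # Fetch a repository the model has not seen during training.
        subprocess.run(["git", "clone", "--depth", "1", repo_url, workdir], check=True)

        # Apply the model-generated diff; a patch that fails to apply counts as a failure.
        applied = subprocess.run(
            ["git", "apply", "-"], cwd=workdir, input=patch_text, text=True
        )
        if applied.returncode != 0:
            return False

        # Score by running the project's existing test suite rather than a canned
        # answer key, so memorized benchmark solutions don't help.
        result = subprocess.run(test_cmd.split(), cwd=workdir)
        return result.returncode == 0


# Hypothetical usage: aggregate a pass rate over a suite of held-out tasks.
# tasks = [("https://github.com/example/project.git", model_patch), ...]
# pass_rate = sum(evaluate_patch(url, patch) for url, patch in tasks) / len(tasks)
```

The point of the design is that the score comes from code and tests the model never saw, which is exactly what internal benchmarks built on familiar data struggle to provide.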
For product counsel, this raises a practical question: If the leading AI companies don't have sufficient internal evaluation tools to validate their models before release, what does that mean for everyone else's deployment standards? We're essentially watching the emergence of third-party validation infrastructure for AI, similar to how security testing became its own industry. That's probably necessary, but it also means we're now dependent on external tools to establish deployment thresholds.
https://www.theinformation.com/articles/anthropic-openai-using-cognitions-coding-test