When AI leaders need outside help testing their models
Anthropic and OpenAI both use Cognition's SWE Gym to test their models. When leading AI companies rely on external evaluation tools, it reveals a gap in internal benchmarking—and raises questions about deployment standards for everyone else.
Anthropic and OpenAI are both using Cognition AI's SWE Gym, a collection of real-world code repositories paired with evaluation tests, to benchmark their models' coding abilities. Cognition just raised $175 million at a $2 billion valuation, and the product launched in mid-October.
Here's what matters: The fact that both companies are looking outside their own walls for evaluation infrastructure tells us something about the state of AI readiness assessment. I've written before that evaluation infrastructure must come first, and this is exactly the gap I was talking about. Internal benchmarks apparently aren't capturing the full picture of model performance on novel tasks.
SWE Gym addresses a real problem: models that ace known benchmarks but stumble when they encounter fresh code. The tests use repositories the models haven't seen before, so they can't just memorize solutions. A product manager at Anthropic confirmed the company used these tests "extensively" to assess improvements.
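To make that mechanism concrete, here is a minimal sketch of the general "held-out repository plus its own test suite" pattern: apply a model-generated patch to a repository the model hasn't trained on, then score it by running that project's existing tests. This is an illustration under stated assumptions, not Cognition's actual harness; the function name, repository URL, and test command are hypothetical.

```python
# Sketch of a held-out-repo evaluation loop (illustrative, not Cognition's implementation).
import subprocess
import tempfile


def evaluate_patch(repo_url: str, patch_text: str, test_cmd: str = "pytest -q") -> bool:
    """Return True if the repo's own tests pass after applying the model's patch."""
    with tempfile.TemporaryDirectory() as workdir:
        # Fetch a repository the model has not seen during training.
        subprocess.run(["git", "clone", "--depth", "1", repo_url, workdir], check=True)

        # Apply the model-generated diff; a patch that fails to apply counts as a failure.
        applied = subprocess.run(
            ["git", "apply", "-"], cwd=workdir, input=patch_text, text=True
        )
        if applied.returncode != 0:
            return False

        # Score by running the project's existing test suite rather than a canned
        # answer key, so memorized benchmark solutions don't help.
        result = subprocess.run(test_cmd.split(), cwd=workdir)
        return result.returncode == 0


# Hypothetical usage: aggregate a pass rate over a suite of held-out tasks.
# tasks = [("https://github.com/example/project.git", model_patch), ...]
# pass_rate = sum(evaluate_patch(url, patch) for url, patch in tasks) / len(tasks)
```

The point of the design is that the score comes from code and tests the model never saw, which is exactly what internal benchmarks built on familiar data struggle to provide.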
For product counsel, this raises a practical question: If the leading AI companies don't have sufficient internal evaluation tools to validate their models before release, what does that mean for everyone else's deployment standards? We're essentially watching the emergence of third-party validation infrastructure for AI, similar to how security testing became its own industry. That's probably necessary, but it also means we're now dependent on external tools to establish deployment thresholds.
https://www.theinformation.com/articles/anthropic-openai-using-cognitions-coding-test