When AI leaders need outside help testing their models
Anthropic and OpenAI both use Cognition's SWE Gym to test their models. That leading AI companies rely on an external evaluation tool points to a gap in internal benchmarking, and it raises questions about deployment standards for everyone else.