Ant Group trained a model by removing the usual middleman

Ant Group's Online RLHF eliminates the reward model from training, cutting costs by 50% while processing 1 trillion tokens on eight GPUs: evidence that removing complexity can beat adding scale.

VentureBeat reports that Ant Group's engineers trained their Ring 1T model using a method that cuts out reward models entirely. Traditional reinforcement learning from human feedback uses three separate models working together: a policy model, a reference model, and a reward model that predicts human preferences. Ant Group's Online RLHF approach learns directly from human feedback during training instead of building a separate system to predict what humans want. The change cut training costs by 50% and reduced memory requirements.
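
The VentureBeat piece doesn't spell out the exact training objective, so the snippet below is an illustration only: a minimal, DPO-style sketch of how preference data can update the policy directly, using just the policy and a frozen reference model and no learned reward model. The function name and tensor arguments are hypothetical, and this is not Ant Group's published implementation.

```python
import torch
import torch.nn.functional as F

def direct_preference_loss(
    policy_chosen_logps: torch.Tensor,    # log-prob the policy assigns to the preferred response
    policy_rejected_logps: torch.Tensor,  # log-prob the policy assigns to the rejected response
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # strength of the implicit KL constraint to the reference
) -> torch.Tensor:
    """DPO-style loss (hypothetical sketch): the human preference label enters
    the update directly, with no separate reward model trained to predict it."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the policy to favor the preferred response relative to the reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Only two networks are involved, and the reference model stays frozen, which is where the memory and cost savings over a three-model pipeline would come from.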

Ring 1T processed 1 trillion tokens using eight H100 GPUs and handles context windows up to 1 million tokens. For product teams, this changes the engineering equation—you can iterate on model behavior without maintaining separate reward prediction infrastructure. The practical impact shows up in two places: teams working with tighter compute budgets can train models that would otherwise be out of reach, and deployment becomes simpler when you're managing one model instead of three synchronized systems.
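
As rough, hypothetical arithmetic (the sizes below are illustrative, not figures from the article), holding a same-size reference and reward model next to the policy roughly triples the weight memory before optimizer state or KV cache is even counted:

```python
# Back-of-envelope weight-memory comparison with hypothetical model sizes.
BYTES_PER_PARAM_BF16 = 2   # bf16 stores each parameter in 2 bytes
PARAMS = 7e9               # hypothetical 7B-parameter model, not Ring 1T's actual size

policy_only_gb = PARAMS * BYTES_PER_PARAM_BF16 / 1e9
policy_ref_reward_gb = 3 * policy_only_gb  # policy + frozen reference + reward model

print(f"policy only:                 {policy_only_gb:.0f} GB")
print(f"policy + reference + reward: {policy_ref_reward_gb:.0f} GB")
```

Dropping the reward model removes one of those resident copies outright, which is the kind of saving that keeps smaller compute budgets in the game.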

The efficiency gain comes from architectural simplification, not from throwing more hardware at the problem. It's another example of Chinese labs finding alternative routes to advancing capabilities rather than relying solely on scale and raw compute.

https://venturebeat.com/ai/inside-ring-1t-ant-engineers-solve-reinforcement-learning-bottlenecks-at