Ant Group trained a model by removing the usual middleman

Ant Group's Online RLHF eliminates the reward model from training, cutting costs by 50% while processing 1 trillion tokens on eight GPUs: evidence that removing complexity can beat adding scale.

VentureBeat reports that Ant Group's engineers trained their Ring 1T model using a method that cuts out reward models entirely. Traditional reinforcement learning from human feedback uses three separate models working together: a policy model, a reference model, and a reward model that predicts human preferences. Ant Group's Online RLHF approach learns directly from human feedback during training instead of building a separate system to predict what humans want. The change cut training costs by 50% and reduced memory requirements.
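
The VentureBeat piece doesn't spell out the exact training objective, so the snippet below is an illustration only: a minimal, DPO-style sketch of how preference data can update the policy directly, using just the policy and a frozen reference model and no learned reward model. The function name and tensor arguments are hypothetical, and this is not Ant Group's published implementation.

```python
import torch
import torch.nn.functional as F

def direct_preference_loss(
    policy_chosen_logps: torch.Tensor,    # log-prob the policy assigns to the preferred response
    policy_rejected_logps: torch.Tensor,  # log-prob the policy assigns to the rejected response
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # strength of the implicit KL constraint to the reference
) -> torch.Tensor:
    """DPO-style loss (hypothetical sketch): the human preference label enters
    the update directly, with no separate reward model trained to predict it."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the policy to favor the preferred response relative to the reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Only two networks are involved, and the reference model stays frozen, which is where the memory and cost savings over a three-model pipeline would come from.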

Ring 1T processed 1 trillion tokens using eight H100 GPUs and handles context windows up to 1 million tokens. For product teams, this changes the engineering equation—you can iterate on model behavior without maintaining separate reward prediction infrastructure. The practical impact shows up in two places: teams working with tighter compute budgets can train models that would otherwise be out of reach, and deployment becomes simpler when you're managing one model instead of three synchronized systems.
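
As rough, hypothetical arithmetic (the sizes below are illustrative, not figures from the article), holding a same-size reference and reward model next to the policy roughly triples the weight memory before optimizer state or KV cache is even counted:

```python
# Back-of-envelope weight-memory comparison with hypothetical model sizes.
BYTES_PER_PARAM_BF16 = 2   # bf16 stores each parameter in 2 bytes
PARAMS = 7e9               # hypothetical 7B-parameter model, not Ring 1T's actual size

policy_only_gb = PARAMS * BYTES_PER_PARAM_BF16 / 1e9
policy_ref_reward_gb = 3 * policy_only_gb  # policy + frozen reference + reward model

print(f"policy only:                 {policy_only_gb:.0f} GB")
print(f"policy + reference + reward: {policy_ref_reward_gb:.0f} GB")
```

Dropping the reward model removes one of those resident copies outright, which is the kind of saving that keeps smaller compute budgets in the game.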

The efficiency gain comes from architectural simplification, not from throwing more hardware at the problem. It's another example of Chinese labs finding alternative routes to advancing capabilities rather than relying solely on scale and raw compute.

https://venturebeat.com/ai/inside-ring-1t-ant-engineers-solve-reinforcement-learning-bottlenecks-at