Could Google and UCLA's SRL Unlock True AI Intelligence?

Supervised Reinforcement Learning teaches AI to think and reason in steps to solve complex problems
Google & UCLA researchers unveil Supervised Reinforcement Learning to help small language models reliably solve multi-step problems using dense rewards

You know how professors ask you to show your workings, rather than just the answer? Researchers at Google and UCLA are trying to teach AI to do exactly that by rewarding the process, not just the outcome. 

In a new research paper, Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning, Google Cloud AI Research and UCLA unveil Supervised Reinforcement Learning (SRL), a training approach that teaches AI to solve complex multi-step problems that today’s LLMs struggle with.

SRL trains the model by breaking complex problems down into simple steps and asking it to explain its reasoning before committing to an action.

The model receives a reward signal based on how closely its course of action matches the expert's.

This allows for a flexible, smooth process with richer learning signals, as the model is rewarded and learns even when the final output is wrong.

Supervised Learning and Reinforcement Learning: Why do they fall short?

Supervised Fine Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) are two popular training methods used today. 

Supervised Fine Tuning follows an imitation training method where the model is directly fed the expert demonstrations on how to solve the problem. 

This limits the model’s reasoning capabilities as it doesn’t learn to generalise beyond the training data, much like a student memorising a maths problem rather than understanding it.

Credit: Research paper: Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

This problem becomes severe when the training data is limited. 

RLVR, on the other hand, encourages reasoning and self-reflection by providing reward signals based on outcomes. 

But, for tasks that require complex reasoning, the success rate of models tends to drop to zero as a single wrong step can derail the entire reasoning chain.

This means that models end up receiving negative learning signals even when the reasoning chain is partly correct, which slows learning progress.
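The difference between an outcome-only signal and a step-wise one can be illustrated with a toy sketch. The function names and the exact-match comparison below are illustrative assumptions, not the reward used in the paper; the point is only that a mostly correct chain scores zero under an outcome-only reward.

```python
def outcome_reward(model_steps, expert_steps):
    """Outcome-only (RLVR-style) signal: 1.0 only if the final answer matches."""
    return 1.0 if model_steps[-1] == expert_steps[-1] else 0.0


def stepwise_rewards(model_steps, expert_steps):
    """Dense signal: one score per step (exact match is a simplification)."""
    return [1.0 if m == e else 0.0 for m, e in zip(model_steps, expert_steps)]


# Hypothetical four-step solution where only the last step is wrong.
expert = ["expand (x+1)^2", "collect terms", "solve for x", "x = 3"]
model = ["expand (x+1)^2", "collect terms", "solve for x", "x = 4"]

print(outcome_reward(model, expert))    # 0.0
print(stepwise_rewards(model, expert))  # [1.0, 1.0, 1.0, 0.0]
```

Under the outcome-only reward, the three correct steps earn nothing; the per-step view still credits them.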

Supervised Reinforcement Learning Model Training

SRL sits between copying an entire solution, as in SFT, and optimising only for the final answer, as in RLVR.

Credit: Research paper: Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

For training, an expert solution is broken down into steps, such that each step has a clear logical action to be implemented.

For each step, a new input prompt is created and fed to the model, which is tasked with predicting what the next logical action is.
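As a rough sketch of this decomposition, the snippet below turns one expert solution into one training example per step. The `StepExample` class, the `build_step_examples` helper and the prompt wording are hypothetical choices made for illustration; the article does not specify the actual prompt format.

```python
from dataclasses import dataclass


@dataclass
class StepExample:
    prompt: str         # problem plus the expert steps taken so far
    target_action: str  # the expert's next step, used only to score the model


def build_step_examples(problem: str, expert_steps: list[str]) -> list[StepExample]:
    """Create one next-step prediction example per expert step."""
    examples = []
    for k, action in enumerate(expert_steps):
        # Context for step k is every expert step before it.
        history = "\n".join(
            f"Step {i + 1}: {s}" for i, s in enumerate(expert_steps[:k])
        )
        prompt = f"Problem: {problem}\n"
        if history:
            prompt += history + "\n"
        prompt += f"What is step {k + 1}?"
        examples.append(StepExample(prompt=prompt, target_action=action))
    return examples


examples = build_step_examples(
    "Solve 2x + 3 = 11",
    ["Subtract 3 from both sides: 2x = 8", "Divide both sides by 2: x = 4"],
)
print(len(examples))  # 2
print(examples[1].prompt)
```

A solution with N steps yields N prompts, each conditioned on the expert's earlier steps rather than on the model's own earlier attempts.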

After each step, the reward function measures the similarity between the expert step and the model’s step, providing a dense learning signal.

The monologue generated by the model is not under scrutiny, freeing the model up to develop its own reasoning style rather than mimicking the teacher’s internal thinking.

This enables a single expert solution to produce many rich training examples.
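The step-wise reward described above, which scores only the action and leaves the monologue unscrutinised, might look something like the sketch below. Two assumptions are made here: the `</think>` delimiter separating monologue from action is hypothetical, and Python's standard `difflib.SequenceMatcher` is used as a stand-in similarity measure, since the article does not name the metric.

```python
import difflib

THINK_CLOSE = "</think>"  # hypothetical delimiter between monologue and action


def extract_action(model_output: str) -> str:
    """Drop the monologue; keep only the text after the closing delimiter."""
    _, sep, action = model_output.partition(THINK_CLOSE)
    # If no delimiter is present, treat the whole output as the action.
    return action.strip() if sep else model_output.strip()


def step_reward(model_output: str, expert_action: str) -> float:
    """Similarity in [0, 1] between the model's action and the expert's."""
    action = extract_action(model_output)
    return difflib.SequenceMatcher(None, action, expert_action).ratio()


expert = "Divide both sides by 2: x = 4"
same = "<think>Any reasoning at all.</think> Divide both sides by 2: x = 4"
print(step_reward(same, expert))  # 1.0
```

Because the monologue is stripped before scoring, two outputs with different reasoning but the same action receive the same reward.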


Does SRL lead to more human-like reasoning?

The research shows significantly higher performance in SRL-trained models compared to baseline models, with the best results achieved using an SRL → RLVR pipeline.

A key achievement is the model’s ability to adjust its reasoning mid-problem, inserting steps naturally, similar to human problem solving.

The model also demonstrated the ability to verify and refine answers before finalising them, a reasoning pattern previously unseen at this scale.

The researchers say that SRL-trained models could be used to power more versatile AI agents.

Whether SRL becomes the new standard in AI training remains to be seen, but it clearly moves closer to reasoning that feels less mechanical and more genuinely intelligent.
