Why Run RL?
How specialized models can outperform the biggest LLMs
Models like Claude 3.7 Sonnet, Gemini 2.5 Pro, and o4-mini have rightfully earned their places atop LLM rankings. With the right prompts and tools, they make incredibly capable agents. Still, there's a way to beat all of them, at a fraction of the cost.
Big model companies make generalist LLMs. They're trained to field user queries about any subject on the internet. But an AI travel agent doesn't need to be good at differential equations or poetry—it just needs to be excellent at finding and booking vacation packages.
Reinforcement learning (RL) optimizes models for specific tasks, creating specialist models with expertise that beats the generalists.
The promise of reinforcement learning
For a generic LLM agent, every day is like the first day on the job. Tools and tasks need to be explained exhaustively to coax models into proper behavior.

An excerpt of Cursor's prompt
Specialist agents have task knowledge encoded directly in their weights, so they intuitively know how to use the tools you provide to excel at their work.
How it works
RL works by improving a language model or agent's performance on a specific task, as quantified by a reward function. The reward function can be anything, so long as you can measure it. Historically it has often been human preference ratings (as in RLHF), but rewards written in code are even more powerful, since they can give rise to superhuman models. This sort of RL on verifiable domains is what drove the advances behind OpenAI's o1 and DeepSeek's R1.
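To make this concrete, here is a minimal sketch of a verifiable, code-defined reward, assuming a math-style task where the model is asked to put its final answer in \boxed{}; the task and the small shaping bonus are illustrative, not taken from any particular production setup:

```python
import re

def reward(prompt: str, completion: str, expected_answer: str) -> float:
    """Illustrative verifiable reward for a math-style task.

    Returns 1.0 if the final boxed answer matches the expected answer,
    0.0 otherwise, plus a small bonus for showing some work.
    """
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    score = 1.0 if match and match.group(1).strip() == expected_answer.strip() else 0.0
    # Small shaping bonus for a non-trivial reasoning trace (purely illustrative).
    if len(completion.split()) > 30:
        score += 0.1
    return score
```

Because the reward is just a function you can run, the same pattern extends to anything you can check programmatically: unit tests for code, exact-match answers, simulator scores, or business metrics.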
During training, the model learns to produce more of the outputs it is rewarded for and fewer of the outputs for which it is penalized. The result is an agent that reliably does what it's told.
RL also tends to generalize better than ordinary fine-tuning and is less prone to overfitting, since the model learns from its own sampled outputs rather than memorizing a fixed set of demonstrations. And unlike supervised fine-tuning, it doesn't require you to already have examples of good outputs.
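In outline, the training loop itself is not exotic: sample a few completions, score them with the reward function, and nudge the weights toward the higher-scoring ones. Below is a heavily simplified REINFORCE-style sketch using Hugging Face Transformers; the model name is a placeholder, padding after end-of-sequence isn't masked, and real systems typically use PPO- or GRPO-style algorithms with more careful batching:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any small causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def rl_step(prompt: str, reward_fn, num_samples: int = 4):
    """One REINFORCE-style update: sample, score, reinforce above-average outputs."""
    inputs = tok(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    # Sample several completions for the same prompt.
    sequences = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=128,
        num_return_sequences=num_samples,
        pad_token_id=tok.eos_token_id,
    )
    completions = [tok.decode(s[prompt_len:], skip_special_tokens=True) for s in sequences]

    # Score each completion with the task's reward function.
    rewards = torch.tensor([reward_fn(prompt, c) for c in completions])
    advantages = rewards - rewards.mean()  # baseline: reinforce better-than-average samples

    # Recompute log-probs of the sampled tokens and weight them by advantage.
    logits = model(sequences).logits[:, :-1, :]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, sequences[:, 1:].unsqueeze(-1)).squeeze(-1)
    completion_logprobs = token_logprobs[:, prompt_len - 1:].sum(dim=-1)

    loss = -(advantages * completion_logprobs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return completions, rewards
```

Repeated over many prompts, this is the basic shape of the loop; production systems typically add a reference model, KL penalties, and better advantage estimates.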
Applications
Chemistry
We asked LLMs to generate molecules that inhibit a coronavirus protein, and defined the reward function to be the simulated binding strength of each molecule to the protein's active site.
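A sketch of what that reward could look like in code is below; the SMILES validity check uses RDKit, while `dock_score` is a hypothetical stand-in for whatever docking simulation you run against the active site:

```python
from rdkit import Chem

def dock_score(smiles: str) -> float:
    """Hypothetical stand-in for a docking simulation against the target's
    active site, returning a binding affinity in kcal/mol (lower = tighter)."""
    raise NotImplementedError("plug in your docking pipeline here")

def molecule_reward(prompt: str, completion: str) -> float:
    """Reward a generated molecule by its simulated binding strength."""
    smiles = completion.strip()
    if Chem.MolFromSmiles(smiles) is None:
        return -1.0                      # penalize invalid SMILES outright
    affinity = dock_score(smiles)        # e.g. -9.2 kcal/mol
    return max(0.0, -affinity)           # tighter binding -> larger reward
```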
After just two hours of reinforcement learning, a 3-billion-parameter model small enough to run on an iPhone learns enough chemistry to outperform Claude 3.7 Sonnet with extended thinking and web search!
Reward progression of generated molecules over training
Browser use
We asked Claude to come up with a "Form from Hell": a web form deliberately designed to be as annoying as possible to fill out. While this stumps OpenAI Operator and Claude 3.7 Sonnet, our 3B model learns to complete the form in under a minute and a half:
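As with the chemistry example, the reward for an episode like this can stay simple. One illustrative way to score it (a sketch, not necessarily how we set it up) is to compare the form state the agent submits against a ground-truth spec and give partial credit per field:

```python
def form_reward(submitted: dict, expected: dict, submitted_ok: bool) -> float:
    """Illustrative reward for a form-filling episode.

    `submitted` maps field names to the values the agent entered;
    `expected` is the ground-truth spec for the form.
    """
    if not expected:
        return 0.0
    correct = sum(1 for field, value in expected.items() if submitted.get(field) == value)
    score = correct / len(expected)          # partial credit per field
    if submitted_ok and correct == len(expected):
        score += 1.0                         # bonus for a clean, complete submission
    return score
```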
Where to begin
If you already have evals for your agents, you're in a good position to start using RL: success on the metrics you care about becomes the reward function that RL optimizes. If not, our friends at The LLM Data Company are building tools to help you measure model performance and define rewards for RL.
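If your eval harness already scores individual transcripts, turning it into a reward can be nearly a one-liner. The names below are purely illustrative, assuming an eval that returns a pass/fail result per test case:

```python
def reward_from_eval(prompt: str, completion: str, test_case) -> float:
    """Reuse an existing per-example eval as an RL reward (illustrative names)."""
    result = run_my_eval(completion, test_case)  # hypothetical: your existing eval harness
    # Pass/fail evals become 0/1 rewards; graded evals can pass their score through as-is.
    return 1.0 if result.passed else 0.0
```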
Ready to get started?
At RunRL, we're building a platform to bring the magic of RL to every LLM and agent.