Reinforcement Learning is the Way to AGI

It was only after the release of the DeepSeek R1 model that I realised that RL was our best shot at AGI, and possibly even superintelligence.

To summarize the past (2019 to 2024):

  1. When you perform regular language-model pre-training, what you’re doing is making the model predict the next token over a very, very large corpus of data. In theory, the language model not only learns facts about the world (see the paper “Language Models as Knowledge Bases?”) but also builds an internal model of the world, and so gains what we colloquially call intelligence (see this). In learning to predict the next token, it has to understand the logic behind the text, and so it picks up skills present in the pre-training corpus (math, summarizing content, programming, etc.). (A sketch of this objective is shown in the first code block after this list.)
  2. But pre-training alone isn’t enough. Pre-trained models aren’t suitable for conversation, because the corpus they’re trained on doesn’t contain much conversational data. What happens next is the process of converting these massive models into something conversationally friendly. A key paper here was InstructGPT, but the basic idea is simple: make the model give answers that humans prefer, without deviating too much from the underlying base model that holds the reasoning and the facts. (The second sketch after this list shows how that trade-off is usually written down.)
  3. Then people did other things to increase their models’ intelligence: fine-tuning on task-specific datasets, hiring PhDs to annotate data, and so on, all to increase the quality of the training corpus. Model-training companies also used synthetic data (generated by the language model itself) to fine-tune smaller models, and even to improve the model the data came from. There are many other tricks in the bag, but these are the ones that come to mind now.
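
To make item 1 concrete, here’s a minimal sketch of the next-token prediction objective. It assumes a PyTorch-style `model` that maps token ids to logits; real pre-training pipelines add tokenization, batching, masking, and distributed training on top of this.

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Language-model pre-training objective: predict token t+1 from tokens 0..t.

    Assumes `model` maps ids of shape (batch, seq_len) to logits of shape
    (batch, seq_len, vocab_size); everything here is illustrative.
    """
    inputs = token_ids[:, :-1]    # all tokens except the last
    targets = token_ids[:, 1:]    # the same sequence shifted left by one
    logits = model(inputs)        # (batch, seq_len - 1, vocab_size)
    # Cross-entropy between the predicted distribution and the actual next
    # token, averaged over every position in the batch.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```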
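
And for item 2, the InstructGPT-style recipe of “answers humans prefer, without drifting too far from the base model” is usually expressed as a learned reward minus a penalty for diverging from the frozen base model. A rough sketch; the function name, the per-token log-probabilities, and the 0.1 coefficient are illustrative placeholders, not the paper’s exact formulation.

```python
def preference_objective(reward_model_score, policy_logprobs, base_logprobs, kl_coef=0.1):
    """Score one sampled response: reward from a learned preference model,
    minus a penalty for drifting away from the base model's distribution.

    `policy_logprobs` / `base_logprobs` are per-token log-probabilities of the
    sampled response under the tuned model and the frozen base model.
    """
    # Per-sample estimate of how far the tuned model has drifted (a KL-style term).
    drift = sum(p - b for p, b in zip(policy_logprobs, base_logprobs))
    return reward_model_score - kl_coef * drift
```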

This got us really far! By increasing the size of the corpus, increasing the number of parameters, adding efficiency improvements (like FlashAttention), and adopting architectural innovations (like mixture-of-experts), we went from GPT-2, which couldn’t count to five properly, to GPT-4 and Claude 3.5 Sonnet, which millions of developers around the world use for code. The mathematician Terence Tao even used GPT-4 as a companion for reading and writing math papers.

Then there came a sense that progress was slowing down. The jump from GPT-1 to GPT-2 was huge, and so was GPT-2 to GPT-3. GPT-3 to GPT-4 was good, but the rate of improvement was smaller. People became concerned that the internet was running out of data and that scaling was hitting a wall. We’ll let the memoirs of AI-lab insiders tell us, in hindsight, whether those fears were true. The big labs decided to move to a different line of research altogether: instead of getting language models to be better at predicting the next token or following human preferences, they decided to train them to get complex problems right.

Normally, when a child learns to walk, it tries many different strategies. You see kids trying to roll, then crawl, then do a lot of other things. Eventually one of those things works, the kid feels a sense of satisfaction when it can walk, and it does that more and more. You started off with an arm’s-length-sized human being, and now you have to chase it to stop it from running into traffic.

This is a good analogy for how reinforcement learning works with language models. In RL with a verifier, instead of focusing on how closely the produced text mimics existing text, or how well it matches human preferences, model trainers just give the (already pre-trained) model a fixed reward for getting the answer correct.

An oversimplified version of how it works: you ask the model some math questions, run its answers through a verifier, and if the verifier marks an answer right, the model gets a reward of plus 10; if the verifier marks it wrong, the model gets zero. So the model focuses on getting the right answer somehow or other, and we let it explore the state space of tokens for a long time. This is a really big change! We’re letting the model do more or less whatever it wants, without enforcing any limit on how similar its output is to (human-generated) text. In theory, the model can figure out ways of thinking that humans didn’t, because it’s only optimizing for the right answer.
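
A toy sketch of that verifier, under the assumption that the model is prompted to end its output with “Answer: <value>”; the format and the +10/0 reward values are stand-ins, not what any lab actually uses.

```python
import re

def verifier_reward(model_output: str, correct_answer: str) -> float:
    """Toy verifiable reward: +10 if the model's final answer matches the
    known ground truth, 0 otherwise. No partial credit, and no opinion on
    *how* the model reasoned its way there.
    """
    match = re.search(r"Answer:\s*(.+)", model_output)
    if match and match.group(1).strip() == correct_answer.strip():
        return 10.0
    return 0.0

# During training, many sampled solutions per question are scored this way,
# and the policy is updated to make the high-reward samples more likely
# (with a PPO- or GRPO-style update).
```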

So if you scale this up with a very large pre-trained model (i.e., one that is already trained to predict the next token, and so already “knows” a good amount of math and coding), and incentivize it only to get the right answer without caring how it thinks, the models do really, really well. They get to the point where they’re better than the vast majority of humans at competitive programming and advanced mathematics (OpenAI’s o3 on Codeforces and FrontierMath is what I’m talking about). This, in my opinion, is how we will get to better-than-human reasoning. If you watched AlphaGo play Lee Sedol, this is what happened with Move 37: the system found a move that human experts would never have played.

We’ll get the (already very capable) language model to focus only on getting the right answer, and do this at a very, very large scale. And then it will probably invent some combination of human words that leads it to the right answer, and that will be the end of it. (Or the start of it, depending on whom you ask.)

This is a large qualitative shift in how language models work. They’re no longer just predicting the next token; they’re being trained to solve problems, outside the confines of human reasoning that pre-training imposed.

Of course, there’s a catch. This only works where you can give the language model a clear reward. That’s really easy for math (check the answer) and code (run the code!), because those domains offer simple yes/no signals. In other domains the feedback isn’t yes/no but something more fine-grained, and in some (say, being the CEO of OpenAI) you won’t know whether a decision was right until years later.
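
For code, “run the code!” can literally be the verifier: execute the model’s program against known test cases and reward it for passing them. A minimal sketch, assuming the model is asked to define a `solve` function, and ignoring the sandboxing a real pipeline would need.

```python
def code_reward(generated_code: str, tests: list) -> float:
    """Reward a generated program by the fraction of test cases it passes.

    WARNING: exec() on model output is for illustration only; real pipelines
    run generated code inside a sandbox with time and memory limits.
    """
    namespace = {}
    try:
        exec(generated_code, namespace)   # expected to define solve(...)
        solve = namespace["solve"]        # assumed entry-point name
    except Exception:
        return 0.0                        # code that doesn't even load gets nothing
    passed = 0
    for args, expected in tests:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass                          # a crashing test counts as a failure
    return passed / len(tests)

# Example: code_reward("def solve(a, b):\n    return a + b", [((1, 2), 3), ((2, 2), 4)]) -> 1.0
```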

And in many ways this is concerning for alignment. It’s not that hard to get a regular text-predicting language model to avoid words or phrases we consider bad. But suppose there’s an environment that simulates the real-world economy, and someone trains a model by incentivizing it to make more money. What happens if, say, the model realises that killing competitors gets a reward from the “verifier” (which in this case is the environment) in the form of higher profits? This is obviously a contrived example, but we know that reward hacking is a thing, and so is specification gaming (where a behaviour satisfies the literal specification of an objective without achieving the intended outcome). The nice part of RL that isn’t bound to predicting human text is that you’re not bound by human thinking. The not-so-nice part is that you’re not bound by human thinking. (Of course, one can address this in many ways, by doing RL only on an aligned base model, or with other interesting ideas, but my point is that RL incentivizes behaviour we wouldn’t necessarily expect!)

So, I hope you’re informed, inspired, scared (or some combination of the three) by my post.