How to have a career even when OpenAI's o3 drops

Nobody has lived through a technological change that, in theory at least, promises to take all of our jobs, change our existence as the apex species on the planet, and potentially kill us. I’m not an expert on AI, and I’m far from an expert on labour markets. But I am often reminded of the economist Alain Enthoven, who remarked to an Air Force general challenging his analysis of nuclear war that “General, I’ve fought just as many nuclear wars as you have.”

The point is, we are living in truly unprecedented times. I’m talking specifically about advances in artificial intelligence, where OpenAI’s o3 model has blown past existing benchmarks and beaten most humans at mathematics and competitive programming.

How good are AI models now?

The state of AI progress as of late December 2024 (edit: this was written before DeepSeek dropped its R1 model) is best understood through mathematics, where benchmarks show a clear progression in difficulty. I’m focusing on mathematics because, unlike many other fields, it gives us unambiguous signals about intelligence - from basic arithmetic all the way to problems that stump Fields medalists.

The progression has been relentless. Basic mathematical reasoning, which was once considered the hallmark of human intelligence, has been completely mastered by language models. Qwen2-Math from Alibaba hits 96.7% accuracy on middle-school math problems. DeepMind’s models are now competitive at the International Mathematical Olympiad level, missing a gold medal by just a few points.

But what’s truly surprising is OpenAI’s new o3 model solving 25% of FrontierMath problems. To understand how big a change this is: FrontierMath was created by 60 top mathematicians, including Fields Medal winners and the US IMO coach Evan Chen. These are research-level mathematics problems so difficult that even Terence Tao, perhaps our generation’s greatest mathematician, says that “in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages.” Before o3, most language models scored less than 2% on these problems. If you went to Harvard, took its legendary Math 55 class (considered its hardest), and gathered the smartest students from that class over several decades, o3 would likely outperform most of them on FrontierMath.

This isn’t just mathematics. In competitive programming, another field that attracts the smartest humans, o3 has an Elo rating of 2727 on Codeforces, putting it at the 99.95th percentile. There are probably, at most, 200 (edit: corrected from 50, thanks to Abigail from NUS Hackers) people in the world who could beat this model in a competitive programming contest. The model’s capabilities extend beyond pure logical reasoning too - it can generate correct code for over seventy percent of tasks on SWE-bench Verified, a benchmark measuring the ability to solve real-world software issues from popular Python libraries.

Naturally, people are not happy about this. Some on Twitter are scared, with memes going around that software engineers should quit their jobs, or that computer science students should drop out of their degrees and do something else like plumbing or farming instead. Many of these are made in jest, but they reflect a real anxiety among students and junior professionals: if an AI model can do what I do but much better and faster, what happens to my career?

I’ll answer this from the perspective of someone who is just starting their career in a technical field.

Part 1: Be illegible

Back in 2019, the computer scientist François Chollet - creator of Keras, a popular machine learning library - wrote an influential paper called “On the Measure of Intelligence”. His key insight was that true intelligence isn’t about raw performance on specific tasks, but rather the ability to learn and generalize. To test this, he created the Abstraction and Reasoning Corpus (ARC-AGI), a set of problems designed to measure an AI’s ability to learn new tasks from just a few examples. This benchmark would go on to fundamentally shape how AI labs approached the development of more intelligent systems.
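For readers who have never seen one, here is a toy sketch in Python of what an ARC-style task looks like (the grids and the hidden rule are made up for illustration, not taken from the real benchmark): a few demonstration input-output pairs, from which a solver must infer the transformation and apply it to a new test input.

```python
# Toy illustration of the ARC task format (not a real ARC puzzle).
# Each task gives a few demonstration pairs; the solver must infer the
# rule and apply it to a fresh test input. Here the hidden rule is
# simply "mirror the grid left-to-right".

train_pairs = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0], [0, 4, 0]], [[0, 3, 3], [0, 4, 0]]),
]
test_input = [[5, 0, 0], [0, 6, 7]]

def apply_rule(grid):
    """The rule a solver would have to infer from the demonstrations."""
    return [list(reversed(row)) for row in grid]

# A solver is judged only on whether its predicted output grid matches exactly.
for inp, out in train_pairs:
    assert apply_rule(inp) == out

print(apply_rule(test_input))  # [[0, 0, 5], [7, 6, 0]]
```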

What happened next reveals a crucial truth about AI development: Once AI labs had a clear benchmark to chase, they could measure progress and optimize their systems. The presence of right and wrong answers made it possible to train “verifier” models that could check if an AI was reasoning correctly, and to generate synthetic data to improve performance. These techniques, combined with other advances, enabled OpenAI’s o3 model to achieve an unprecedented 88% score on the ARC-AGI evaluation set - far beyond what previous models could achieve.

But this remarkable progress hinges on one critical factor: the existence of datasets that clearly define correct and incorrect answers. Without such verification, it’s nearly impossible to train superhuman AI models. You can’t optimize what you can’t measure, and you can’t generate synthetic training data without a source of ground truth. This explains why domains with unambiguous right and wrong answers - like mathematics or programming - are the first to see superhuman AI performance. The very existence of clear metrics makes these fields vulnerable to automation.
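To make the mechanism concrete, here is a minimal rejection-sampling sketch in Python. The `generate_candidates` and `verifier` functions are hypothetical placeholders - in a real pipeline they would be a language model API and something like a proof checker or unit-test suite. With ground truth available, you can sample many candidate solutions, keep the ones that verify, and recycle them as synthetic training data; remove the verifier and the loop has nothing to select on.

```python
# Rejection-sampling sketch: only possible when answers are verifiable.
import random

def generate_candidates(problem, n=8):
    # Placeholder: a real system would sample n solutions from a model.
    return [f"{problem} -> guess {random.randint(0, 9)}" for _ in range(n)]

def verifier(problem, candidate):
    # Placeholder ground truth: in maths this could be a proof checker,
    # in programming a test suite. Here we just check for a token.
    return candidate.endswith("7")

def build_synthetic_dataset(problems):
    dataset = []
    for problem in problems:
        for candidate in generate_candidates(problem):
            if verifier(problem, candidate):          # keep only verified traces
                dataset.append((problem, candidate))  # becomes training data
    return dataset

print(build_synthetic_dataset(["2 + 5", "3 + 4"]))
```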

A hilarious tweet by the account @typedfemale said: “what i learned from today: if you have strong opinions about what constitutes reasoning - never make a dataset that allows your critics to prove you wrong publicly”. And while the sarcasm is funny, it points to a deeper truth: if your task or job is legible enough to be put into a dataset, AI companies will find a way to automate it sooner or later. The signals for truth are clearest in logical domains like mathematics or coding, but several companies are building human-evaluated data for other domains too. Companies like Scale AI now provide data from PhD holders in chemistry, physics and even the social sciences. Every day, new data is being created to train models to think like experts in every field possible. Even “softer” sciences like economics and psychology aren’t immune.

What lessons can one draw from this? I argue the main one is this: if your job is legible enough that people can build a dataset clearly marking what is right and what is wrong, you are at the highest risk of an AI model becoming “superhuman” at your job. The risk is even greater if your thought process can be articulated in a verifiable way.

Seen from this perspective, it makes sense that competitive programming and mathematics were attacked first by these models. Not only do they provide a clear source of ground truth, but you can also hire people to write out intermediate steps, and tools like proof assistants can tell you whether a proof is correct.

The best-placed people are those for whom it would be unprofitable or extremely difficult to create or procure datasets that give a clear signal. These people, the ones who are “illegible” to the outside world, are the least likely to have their jobs, or parts of their jobs, automated by models like the ones discussed above.

A good example of this is the economist and public intellectual Tyler Cowen. You might think that public intellectuals would be the first to be automated, because all they offer are rambling arguments with no substance. But Tyler’s work on Marginal Revolution, his food blog, podcast and grant program (confession: I’m a grantee, but I was not paid to write this) is exactly the sort of thing that would be hard to put into a dataset for a model. While you could easily collect his outputs - the blog posts, food reviews, grant selections - there is no way to capture or verify the thought process that generates them. Unlike mathematics or coding, where we can break down and verify each step of reasoning, there is no way to build a “verifier model” that could check whether someone is thinking like Tyler does, or to generate synthetic examples of his style of analysis. You can’t systematically label his cognitive steps as “right” or “wrong” the way you can with a mathematical proof or a programming solution. There is no clearly labelled dataset, and it is not obvious how you would ever generate one for someone whose thought process is as illegible as Tyler’s.

Another example is my friend the website designer; let us call him K to protect his privacy. K’s clients include several people who work in technology and want their personal websites designed. They give him a vibe, and he works out what they actually want from their websites. K’s value isn’t in the technical prowess of the website design: most of it is basic HTML, CSS and JavaScript. His value comes from understanding what his clients want and the general aesthetic they are going for, and then converting that into a working website with excruciating attention to detail. It is hard and expensive to find people to label this work as ‘right’ or ‘wrong’ - the success of each design decision depends on contextual factors and the specific client’s unstated preferences. Unlike mathematics problems, where there is a clear correct answer, it is not at all clear how to build a dataset or a verifier model of his thought process to guide LLMs to the right answer.

Both Tyler Cowen and K share the same trait: their value comes from navigating ambiguous human preferences and synthesizing disparate information in ways that can’t be reduced to clear right or wrong answers. In other words, they have taste. Their work resists dataset creation because success is contextual, subjective, and often relies on tacit knowledge that can’t be easily formalized.

Part 2: Find skills that diverge because of AI

Jeff Dean is one of the industry’s most legendary programmers. At Google, he built MapReduce, a framework that made it extremely simple to write programs that process enormous amounts of data in parallel. He helped create Spanner, a groundbreaking distributed database providing globally consistent transactions across data centers worldwide, even when networks fail. These and other innovations earned him election to the National Academy of Engineering and the American Academy of Arts and Sciences, and a fellowship of the Association for Computing Machinery. While most would stop there, Dean went on to lead Google Brain and is now Chief Scientist at Google DeepMind. He is so revered among programmers that there are Chuck Norris-style jokes about him.

Let us compare Jeff Dean to someone somewhat lower on the programming skill spectrum: the median computer science junior who has done an internship or two. At best, this person knows how to develop a full-stack application, has basic knowledge of databases (if any), and understands the fundamentals of AI (again, if any). Suppose Jeff Dean is at the 99th percentile of the programming skill spectrum and this student is at the median.

If you gave Jeff Dean AI models like Anthropic’s Claude 3.5 Sonnet or OpenAI’s o1, or AI coding tools like Cursor, his productivity would improve massively. Jeff Dean’s value isn’t in writing boilerplate code that a language model can automate, or in solving the algorithmic problems that o1 and similar models excel at. His main value to Google comes from his ability to tackle harder problems, and that ability benefits from AI assistance. These tools amplify Jeff Dean’s capabilities even more than they amplify those of the median programmer. He can now spend his time on the high-level systems architecture and the complex bugs that made him so revered in the first place.

On the other hand, the junior CS student will also benefit compared to their pre-AI self. They will be able to produce more complex programs than before, learn languages faster and make fewer errors. But my claim is that these tools benefit the highly skilled more than the less skilled. The CS student is limited not just by how much computer science they understand but also by not knowing which technical problems are worth attacking and how to break them down effectively. While Jeff Dean can use AI to rapidly prototype distributed systems, the CS student is still figuring out what makes a good system in the first place. Jeff Dean will be able to use these tools to explore more complex system designs, test different architectural approaches faster, and tackle even harder technical challenges that were previously too time-consuming to attempt. The CS student will mostly be using AI to avoid syntax errors and learn programming languages faster. And while that’s valuable, it’s nowhere near the productivity gain Jeff Dean gets from the same tools.

This isn’t true for all skills, though. Consider the skill of writing SQL queries. Before ChatGPT and similar tools, there was a large gap between the expert who could write complex window functions and optimize queries for performance, and the beginner who struggled with basic SELECT statements. Now AI can help anyone turn plain English into working SQL. The beginner’s ability to produce correct, moderately complex queries has increased dramatically. The expert still has an edge in database design and in knowing what questions to ask, but the pure SQL-writing skill gap between expert and beginner has compressed rather than expanded. This may once have been a good skill to invest time in mastering, but the returns to pure SQL expertise have diminished significantly.
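To make the compression concrete, here is a small sketch using Python’s built-in sqlite3 module and a made-up `sales` table (it assumes a SQLite build recent enough to support window functions): the kind of ranking query that once separated experts from beginners is now the sort of thing an AI assistant will happily produce from a one-line English request like “rank each sale within its region by amount”.

```python
# The sort of query that once separated SQL experts from beginners:
# a window function ranking rows within a partition. Table and data are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('North', 'Asha', 120), ('North', 'Ben', 340),
        ('South', 'Chen', 200), ('South', 'Dee', 90);
""")

query = """
    SELECT region, rep, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS region_rank
    FROM sales;
"""
for row in conn.execute(query):
    print(row)
```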

This gives us another dimension for thinking about career planning. Even if you work in a “legible” field with benchmarks, not all benchmarks are created equal. Ask whether an AI doing better on your field’s benchmarks makes you more productive or simply replaces what you do. A software architect becomes more productive when AI handles routine coding tasks, allowing them to test and iterate on different system designs faster. In contrast, someone whose main value was writing SQL queries finds their core skill becoming commoditized.

So if you’re evaluating your career path now, focus on skills where the gap between the best and the rest is sharply diverging because of AI. These are the skills worth becoming the best at, because the returns are so much larger.

Conclusion

The future belongs to people whose work cannot be easily reduced to a dataset, and who can use AI to become even better at what they do. Some jobs, like competitive programming or routine software development, are vulnerable because we can create clear benchmarks for performance. Others, like Tyler’s work or K’s website design, are safer because their core value comes from thought processes that are difficult to verify or replicate. And in many technical fields, AI will amplify the gap between the best and the rest - making Jeff Deans even more productive while merely helping average programmers avoid syntax errors.

Perhaps the meta-lesson from all of this is simpler: nobody saw this coming. A year ago, FrontierMath seemed impossible for AI. Six months ago, o3 didn’t exist. The world is changing faster than anyone expected, and we don’t know what’s going to come next. Being nimble is the only way out of this.

Thanks to Daniel Tan, Trevor Chow, Judah, Clark, Minh, James, and other people whose names I’m forgetting, for reviewing this.