OpenAI has created a new large language model that they call "o1", which has been "trained with reinforcement learning to perform complex reasoning." "o1 thinks before it answers -- it can produce a long internal chain of thought before responding to the user."
"OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
For comparison, GPT-4o was at the 11th percentile on the same Codeforces competitive programming benchmark, so between GPT-4o and o1 the improvement was from the 11th percentile to the 89th.
For AIME 2024, the improvement from GPT-4o to o1 was from 13.4% to 83.3% accuracy.
For GPQA, accuracy went from 63.2% to 68.4% for biology, from 43.0% to 65.6% for chemistry, and from 68.6% to 94.2% for physics.
Quoting further:
"Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them."
"Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn't working. This process dramatically improves the model's ability to reason."
"Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles." "We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios."
"To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework(opens in a new window). We found that chain of thought reasoning contributed to capability improvements across our evaluations. Of particular note, we observed interesting instances of reward hacking."
"We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to 'read the mind' of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users."