#chainofthought

waynerad@diasp.org

OpenAI has created a new large language model that they call "o1", which has been "trained with reinforcement learning to perform complex reasoning." "o1 thinks before it answers -- it can produce a long internal chain of thought before responding to the user."

"OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).

To put that in perspective, GPT-4o was at the 11th percentile on the Codeforces programming competition. So between GPT-4o and o1 the improvement was from the 11th percentile to the 89th.

For AIME 2024, the improvement from GPT-4o to o1 was from 13.4% to 83.3% of problems solved.

For GPQA, biology went from 63.2 to 68.4, chemistry from 43.0 to 65.6, and physics from 68.6 to 94.2.

Quoting further:

"Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them."

"Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn't working. This process dramatically improves the model's ability to reason."

"Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles." "We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios."

"To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework(opens in a new window). We found that chain of thought reasoning contributed to capability improvements across our evaluations. Of particular note, we observed interesting instances of reward hacking."

"We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to 'read the mind' of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users."

Learning to Reason with LLMs | OpenAI

#solidstatelife #ai #genai #llms #chainofthought

waynerad@diasp.org

Chain-of-thought prompting improves the correctness of answers from large language models (LLMs) for questions involving arithmetic or logical reasoning. Chain-of-thought prompting means modifying the prompt to trigger output of intermediate derivations -- for example, by appending "Let's think step by step."
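
Here's a minimal sketch of what that looks like in practice. The question and the "Let's think step by step" trigger phrase follow the standard zero-shot chain-of-thought recipe; the send_to_llm() helper is just a hypothetical stand-in for whatever LLM client you use:

```python
# Minimal sketch of chain-of-thought prompting (illustrative only).

question = "A farmer has 17 sheep. All but 9 run away. How many are left?"

# Direct prompting: the model is expected to answer immediately.
direct_prompt = f"Q: {question}\nA:"

# Chain-of-thought prompting: the trigger phrase nudges the model to
# emit its intermediate derivations before the final answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

def send_to_llm(prompt: str) -> str:
    """Hypothetical stand-in: call your LLM of choice and return its text."""
    raise NotImplementedError
```

With the chain-of-thought prompt, the model typically writes out the reasoning ("all but 9 ran away, so 9 remain") before giving the answer, and it is this written-out reasoning that empirically improves accuracy on arithmetic and logical-reasoning questions.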

You might think it would be the opposite -- intuitively the problem seems harder, since the model is required to express the entire problem-solving process, and you might expect that to require a larger model.

"The crux of our proof lies in applying circuit complexity theory. By framing the finite-precision Transformer as a computation model, we can precisely delineate its expressivity limitations through an analysis of its circuit complexity."

"Firstly, the constructions in our proof reveal the significance of several key components in the Transformer design, such as softmax attention, multi-head, and residual connection. We show how these components can be combined to implement basic operations like substring copying and bracket matching, which serve as building blocks for generating a complete chain-of-thought solution.

"Secondly, we highlight that these chain-of-thought derivations are purely written in a readable math language format, largely resembling how human write solutions. In a broad sense, our findings suggest that LLMs have the potential to convey meaningful human thoughts through grammatically precise sentences."

"Finally, one may ask how LLMs equipped with chain-of-thought can bypass the impossibility results outlined in Theorems 3.1 and 3.2. Actually, this can be understood via the effective depth of the Transformer circuit. By employing chain-of-thought, the effective depth is no longer L since the generated outputs are repeatedly looped back to the input layer. The dependency between output tokens leads to a significantly deeper circuit with depth proportional to the length of the chain-of-thought solution. Even if the recursive procedure is repeated within a fixed Transformer (or circuit), the expressivity can still be far beyond TC0."

"TC0" here refers to a circuit complexity class in circuit complexity theory. Other than telling you the TC stands for "threshold circuits", I can't really tell you much about it. This is the first time I've encountered circuit complexity theory. The paper has an appendix on circuit complexity theory (Appendix B) starting on page 14, if you really want to dig in to it.

What I want to convey here is that the gist of the proof is that asking the LLM to employ chain-of-thought is effectively equivalent to running a deeper circuit.
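
To make that "deeper circuit" point concrete, here's a toy illustration (mine, not the paper's): each generated token gets fed back in as input, so generating a chain-of-thought with a fixed L-layer Transformer amounts to a number of sequential layers roughly proportional to L times the length of the chain-of-thought.

```python
# Toy illustration (not from the paper): autoregressive chain-of-thought
# generation feeds each output token back into a fixed-depth model, so the
# total sequential computation grows with the number of generated tokens.

def transformer_forward(tokens):
    """Stand-in for one pass through a fixed L-layer Transformer.
    Here it just returns a placeholder 'derivation' token."""
    return f"step_{len(tokens)}"

def generate_with_chain_of_thought(prompt_tokens, num_steps):
    tokens = list(prompt_tokens)
    for _ in range(num_steps):
        next_token = transformer_forward(tokens)  # one L-layer pass
        tokens.append(next_token)                 # looped back as input
    return tokens

# num_steps passes through an L-layer network behave like a circuit of
# depth roughly num_steps * L -- the "effective depth" argument.
print(generate_with_chain_of_thought(["Q:", "2+3*4", "A:"], num_steps=4))
```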

The paper goes on at length about dynamic programming, a computer science technique, and how chain-of-thought maps onto it.

"The basic idea of dynamic programming lies in breaking down a complex problem into a series of small subproblems that can be tackled in a sequential manner. Here, the decomposition ensures that there is a significant interconnection (overlap) among various subproblems, so that each subproblem can be efficiently solved by utilizing the answers (or other relevant information) obtained from previous ones."

"Formally, a general dynamic programming algorithm can be characterized via three key ingredients: the state space I, the transition function T, and the aggregation function A. The state space I represents the finite set of decomposed subproblems."

They go on to say you can represent the relationships between the subproblems using a data structure known as a directed acyclic graph (DAG), and that tells you the order in which to solve the subproblems.
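
Here's a minimal sketch of that (I, T, A) framing using the classic longest-increasing-subsequence problem. The problem choice and variable names are my own illustration, not from the paper:

```python
# Dynamic programming expressed via the paper's three ingredients:
# state space I, transition function T, aggregation function A.

def longest_increasing_subsequence(a):
    n = len(a)

    # State space I: subproblem i = "length of the longest increasing
    # subsequence that ends at position i".
    dp = [0] * n

    # The dependency DAG has an edge j -> i whenever j < i and a[j] < a[i],
    # so processing indices left to right is a valid topological order.
    for i in range(n):
        # Transition function T: build on the answers to earlier subproblems.
        dp[i] = 1 + max((dp[j] for j in range(i) if a[j] < a[i]), default=0)

    # Aggregation function A: combine all subproblem answers into the final answer.
    return max(dp, default=0)

print(longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6]))  # -> 4 (e.g. 1, 4, 5, 9)
```

The point of the mapping is that each chain-of-thought step can play the role of solving one subproblem, emitted in an order consistent with the dependency DAG.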

So the idea is to prove that chain-of-thought in LLMs has equivalent complexity to a logical reasoning problem expressed as a dynamic programming problem.

Towards revealing the mystery behind chain-of-thought: A theoretical perspective

#solidstatelife #ai #genai #llms #chainofthought