Chain-of-thought prompting improves the correctness of answers from large language models (LLMs) for questions involving arithmetic or logical reasoning. Chain-of-thought prompting means modifying the prompt to trigger output of intermediate derivations. "Let's think step by step".
You might think it would be the opposite -- one might intuitively perceive this problem as more challenging as the model is required to express the entire problem solving process, and this might require a larger model size.
"The crux of our proof lies in applying circuit complexity theory. By framing the finite-precision Transformer as a computation model, we can precisely delineate its expressivity limitations through an analysis of its circuit complexity."
"Firstly, the constructions in our proof reveal the significance of several key components in the Transformer design, such as softmax attention, multi-head, and residual connection. We show how these components can be combined to implement basic operations like substring copying and bracket matching, which serve as building blocks for generating a complete chain-of-thought solution.
"Secondly, we highlight that these chain-of-thought derivations are purely written in a readable math language format, largely resembling how human write solutions. In a broad sense, our findings suggest that LLMs have the potential to convey meaningful human thoughts through grammatically precise sentences."
"Finally, one may ask how LLMs equipped with chain-of-thought can bypass the impossibility results outlined in Theorems 3.1 and 3.2. Actually, this can be understood via the effective depth of the Transformer circuit. By employing chain-of-thought, the effective depth is no longer L since the generated outputs are repeatedly looped back to the input layer. The dependency between output tokens leads to a significantly deeper circuit with depth proportional to the length of the chain-of-thought solution. Even if the recursive procedure is repeated within a fixed Transformer (or circuit), the expressivity can still be far beyond TC0."
"TC0" here refers to a circuit complexity class in circuit complexity theory. Other than telling you the TC stands for "threshold circuits", I can't really tell you much about it. This is the first time I've encountered circuit complexity theory. The paper has an appendix on circuit complexity theory (Appendix B) starting on page 14, if you really want to dig in to it.
What I want to convey here is that the gist of the proof is that asking the LLM to employ chain-of-thought is actually equivalent to a deeper circuit.
The paper goes on at length about dynamic programming, a computer science technique, and how chain-of-thought maps onto it.
"The basic idea of dynamic programming lies in breaking down a complex problem into a series of small subproblems that can be tackled in a sequential manner. Here, the decomposition ensures that there is a significant interconnection (overlap) among various subproblems, so that each subproblem can be efficiently solved by utilizing the answers (or other relevant information) obtained from previous ones."
"Formally, a general dynamic programming algorithm can be characterized via three key ingredients: the state space I, the transition function T, and the aggregation function A. The state space I represents the finite set of decomposed subproblems."
They go on to say you can represent the relationships between the subproblems using a data structure known as a directed acyclic graph, and that will tell you what order you need to solve the subproblems.
So the idea is to prove that chain-of-thought in LLMs has equivalent complexity to a logical reasoning problem expressed as a dynamic programming problem.
Towards revealing the mystery behind chain-of-thought: A theoretical perspective