AlphaCode 2 was created by starting with the Gemini Pro model and then fine-tuning it with an algorithm called GOLD. GOLD stands for "generation by off-policy learning from demonstrations". The nonsensical (to you, probably) term "off-policy" comes from the terminology used in reinforcement learning. The function in reinforcement learning that outputs an action for the agent to take is called the "policy", for reasons I've never figured out. Anyway, an "on-policy" learning algorithm executes the actions and learns from them interactively. An "off-policy" algorithm learns from a recording and lacks the ability to actually carry out the actions. In the case of GOLD, the "actions" are derived from "demonstrations". As you might guess, in this context, the "demonstrations" are samples of working code written by humans for the system to learn from.
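The tech report doesn't include training code, but the rough idea behind GOLD's offline objective is to weight ordinary next-token cross-entropy on the human demonstrations by the model's own (detached) probability of each demonstrated token, so the model concentrates on solutions it can already partly generate. Here's a minimal PyTorch-style sketch of that idea; the function name, the clamping floor, and the assumption that the logits are already aligned with the targets are mine, not anything from the report:

```python
import torch
import torch.nn.functional as F

def gold_style_loss(logits, target_ids, pad_id, min_weight=0.1):
    """Offline importance-weighted MLE on human demonstrations (GOLD-style sketch).

    logits:     (batch, seq_len, vocab), assumed already shifted so that
                position t predicts target_ids[:, t]
    target_ids: (batch, seq_len) tokens of the human-written solution
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)

    # Off-policy importance weight: the model's own probability of the
    # demonstrated token, detached so it acts as a weight rather than a
    # gradient path, and clamped from below so rare tokens aren't ignored.
    weights = token_logp.detach().exp().clamp(min=min_weight)

    mask = (target_ids != pad_id).float()
    return -(weights * token_logp * mask).sum() / mask.sum()
```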

The fine-tuning data consists of roughly 15,000 problems with 30 million human-written solutions. It comes from a dataset called CodeContests, which consists of problems and code submitted by contestants on 5 competitive programming sites: Aizu, AtCoder, CodeChef, Codeforces, and HackerEarth.

They said they did two rounds of fine-tuning and generated a "family" of models (all of which were used for code generation), but don't say how the models within the family differ from one another.

So, that's the fine-tuning phase. How does AlphaCode 2 actually solve problems?

You give it a coding problem and it generates a million code samples, using a randomized temperature parameter for each sample to encourage diversity. All of them are generated in the C++ programming language.
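As a concrete sketch of that sampling step (the `model.generate` call, the temperature range, and the unbatched loop are all placeholders of mine, not the actual AlphaCode 2 sampling code):

```python
import random

def sample_candidates(model, problem_statement, n_samples=1_000_000,
                      temp_range=(0.1, 1.0)):
    """Generate a large pool of C++ candidates, re-randomizing the
    sampling temperature for every sample to encourage diversity.
    `model.generate` stands in for whatever sampling API the underlying
    LLM exposes; in practice this would be heavily batched."""
    candidates = []
    for _ in range(n_samples):
        temperature = random.uniform(*temp_range)
        candidates.append(model.generate(problem_statement,
                                         temperature=temperature))
    return candidates
```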

Next, it tests each sample against the problem's example input and verifies that it produces the example output. This step removes about 95% of the samples, bringing the number of candidates down to (only) 50,000.
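This filtering step is essentially just compile-and-run, something like the following (the compiler flags and time limit are my guesses rather than anything specified in the report):

```python
import os
import subprocess
import tempfile

def passes_examples(cpp_source, examples, time_limit=2.0):
    """Compile a candidate C++ program and check that it reproduces the
    example output for each example input. Candidates that fail to
    compile, crash, time out, or print the wrong answer get discarded."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.cpp")
        binary = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(cpp_source)
        build = subprocess.run(["g++", "-O2", "-o", binary, src],
                               capture_output=True)
        if build.returncode != 0:
            return False
        for example_in, example_out in examples:
            try:
                run = subprocess.run([binary], input=example_in,
                                     capture_output=True, text=True,
                                     timeout=time_limit)
            except subprocess.TimeoutExpired:
                return False
            if run.returncode != 0 or run.stdout.strip() != example_out.strip():
                return False
    return True
```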

At this point, they introduce a second model. This model generates new test inputs for the programs, and those inputs are fed into the remaining candidates. The second model does not generate the correct output corresponding to each input, because it doesn't know how to do that. Instead, the outputs are handed to a clustering algorithm, which identifies programs that generate the same output and groups them together. The 10 most popular groups are kept and the rest discarded.
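The clever part is that you don't need to know the right answers to do this grouping: programs that behave identically on the new inputs land in the same cluster. A sketch of one way to do it, where `run_program` is a helper of mine (compile and run a candidate on a single input, as above) and the real clustering details aren't public:

```python
from collections import defaultdict

def cluster_by_behaviour(programs, run_program, generated_inputs, keep=10):
    """Group candidate programs by the outputs they produce on the
    model-generated test inputs (no expected outputs needed), then keep
    the `keep` largest groups."""
    clusters = defaultdict(list)
    for code in programs:
        # The "behaviour signature" is simply the tuple of outputs the
        # program prints for each generated input.
        signature = tuple(run_program(code, t) for t in generated_inputs)
        clusters[signature].append(code)
    # "Most popular" = the clusters containing the most programs.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return ranked[:keep]
```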

At this point, they introduce a third model. This model looks at the programs and estimates a correctness score. The 10 best are selected and submitted to the coding competition's auto-grader.
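Put together, the last stage is roughly "score everything that survived and submit the top 10". A sketch, where `score_fn` stands in for the fine-tuned scoring model (which isn't publicly available) and taking the top 10 overall is my reading of "the 10 best":

```python
def pick_submissions(clusters, score_fn, max_submissions=10):
    """Flatten the surviving clusters, score every candidate with the
    scoring model, and keep the highest-scoring programs to submit."""
    candidates = [code for cluster in clusters for code in cluster]
    candidates.sort(key=score_fn, reverse=True)
    return candidates[:max_submissions]
```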

If you're wondering what the second and third models are and how they're trained: the third model is another Gemini Pro model, fine-tuned in a similar manner to the model that generates the programs. As for the second model, I don't know.

Anyway, a coding competition website called Codeforces was used for evaluation. Twelve recent contests with at least 8,000 participants each were selected, resulting in 77 problems. AlphaCode 2 solved 43% of the competition problems (vs 25% for the original AlphaCode). Mapping this performance to competition rankings puts AlphaCode 2 at the 85th percentile -- it's better than 85% of human competitors.

Commentary: I did Google Code Jam for 15 years before it was abruptly stopped this year. Although Google never gave a reason, I'm guessing it was because AI systems could solve coding problems. However, reading this makes me think that AI systems aren't quite as good at coding competitions as one might be led to believe. Part of this is because the problems used in coding competitions are very hard -- a 43% solve rate is actually very impressive, much more impressive than it sounds. The 85th-percentile ranking against humans is impressive, too, because the programmers who enter coding competitions are, I strongly suspect, above average compared with programmers overall. Having said that, holy moly, 1 million programs generated for each problem? From a whole "family" of neural network models, all fine-tuned on coding competition programs? The amount of computation power being thrown at solving coding problems is immense. Vastly more than any human brain. Humans are proving pretty hard to beat when it comes to competitive coding.

AlphaCode 2 Tech Report

#solidstatelife #ai #genai #llms #codellms #gemini