
waynerad@diasp.org

Approaching human-level forecasting with language models.

The idea here is to pit AI head-to-head against humans in forecasting competitions. They mention 5 of these: Metaculus, GJOpen, INFER, Polymarket, and Manifold. Forecasts on these sites are scored with something called a "Brier score". To keep things simple, the researchers limited their system to yes/no "binary" questions.

For a binary question, the Brier score is computed like this: one outcome is assigned the value 0 (say, some event not happening by a certain date) and the other the value 1 (the event happening). The person -- or now, language model -- making the prediction actually predicts a probability, a number between 0 and 1. Once the outcome is known, the difference between the predicted probability and the actual outcome value is computed and then squared. For multiple predictions, these squared differences are averaged. In this way, the Brier score represents the "error" in the predictions. A perfect predictor will predict "1" for every event that actually happens and "0" for every event that does not, leading to a Brier score of 0. A predictor who has no idea whether something will happen can say 0.5, which leads to a Brier score of 0.25 no matter which outcome actually occurs. It's better to do that than to predict 0 or 1 and be wrong.

This glosses over various details like how to handle when people change their predictions, how to handle multiple choice outcomes or numerical outcomes, but you get the idea. The Brier score represents your prediction error.
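To make that concrete, here's a minimal sketch of the binary Brier score as described above (the function name is mine, not from the paper):

```python
def brier_score(predictions, outcomes):
    """Mean squared difference between predicted probabilities and actual outcomes (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)

# A perfectly confident, correct forecaster scores 0.
print(brier_score([1.0, 0.0], [1, 0]))  # 0.0
# Saying 0.5 ("no idea") always costs 0.25, whatever the outcome.
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25
# Being confidently wrong is the worst case: 1.0.
print(brier_score([0.0], [1]))          # 1.0
```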

The researchers found language models are bad at predicting. With no additional information retrieval or fine-tuning, most language models do only a little better than picking at random, and the biggest and best models like GPT-4 and Claude-2 do better than chance but still much worse than humans.

For training data, they combined questions from the above-mentioned 5 forecasting competitions into a dataset of 33,664 binary questions. Here's an example showing what these binary questions look like:

"Question: Will Starship achieve liftoff before Monday, May 1st, 2023?"

"Background: On April 14th, SpaceX received a launch license for its Starship spacecraft. A launch scheduled for April 17th was scrubbed due to a frozen valve. SpaceX CEO Elon Musk tweeted: 'Learned a lot today, now offloading propellant, retrying in a few days . . . '"

"Resolution: Criteria This question resolves Yes if Starship leaves the launchpad intact and under its own power before 11:59pm ET on Sunday, April 30th."

"Key Dates: Begin Date: 2023-04-17, Close Date: 2023-04-30, Resolve Date: 2023-04-20."

The "begin date" is the date people can start making predictions. The "close date" is the last date people can make predictions. The "resolve date" is the date reality is checked to see if the prediction happened or not. But, for this example, the reason why the "resolve date" is before the "close date" is because the event occurred.

Their system consists of a retrieval system, a reasoning system, and a candidate selection system.

The retrieval system lets the system run search engine queries. It consists of 4 steps: search query generation, news retrieval, relevance filtering and ranking, and text summarization. The summarization step is there because large language models are limited by their context windows, a limitation that may matter less in the future.
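As a rough sketch of that 4-step pipeline (all function names here are hypothetical stand-ins, not the paper's actual code):

```python
def retrieve_context(question, llm, search_api, max_articles=15):
    """Hypothetical sketch of the 4-step retrieval pipeline. 'llm' is any callable
    that takes a prompt string and returns text; 'search_api' takes a query and
    returns a list of article texts. Both are stand-ins."""
    # 1. Search query generation: ask the LLM for search-engine queries.
    queries = llm(f"Generate search queries for this forecasting question:\n{question}")

    # 2. News retrieval: pull candidate articles for each query.
    articles = [a for q in queries.splitlines() for a in search_api(q)]

    # 3. Relevance filtering and ranking: score each article for relevance and
    #    keep the top-ranked ones (assumes the LLM replies with a number).
    scored = [(float(llm(f"Rate 0-10 the relevance of this article to '{question}':\n{a}")), a)
              for a in articles]
    top = [a for _, a in sorted(scored, key=lambda s: s[0], reverse=True)[:max_articles]]

    # 4. Summarization: condense each article so the whole bundle fits in the
    #    model's context window.
    return [llm(f"Summarize this article for a forecaster:\n{a}") for a in top]
```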

The reasoning system works by first prompting the large language model to rephrase the question. The model is next asked to leverage the retrieved information and its pre-training knowledge to produce arguments for why the outcome may or may not occur. Since the model can generate weak arguments, to avoid treating them all as equal, it is instructed to weigh them by importance and aggregate them accordingly. Finally, "to prevent potential bias and miscalibration, the model is asked to check if it is over- or underconfident and consider historical base rates, prompting it to calibrate and amend the prediction accordingly."

This is called reasoning by "scratchpad prompting". Since an aggregate of predictions is usually superior to individual forecasts, this process is repeated multiple times and the results are averaged.
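Here's a hedged sketch of what the scratchpad step plus the repeated sampling might look like (the prompt wording and helper names are my paraphrase, not the paper's exact prompts):

```python
import re
from statistics import median

# Paraphrase of the scratchpad structure described above, not the paper's exact prompt.
SCRATCHPAD_PROMPT = """Question: {question}

Retrieved article summaries:
{summaries}

1. Rephrase the question in your own words.
2. Using the summaries and your background knowledge, give arguments for why
   the event may occur and why it may not occur.
3. Weigh the arguments by importance and aggregate them.
4. Check whether you are over- or underconfident; consider historical base rates.
5. On the final line, output only your probability that the event occurs (0 to 1).
"""

def forecast(question, summaries, llm, n_samples=5):
    """Sample the scratchpad reasoning several times and aggregate the probabilities."""
    probs = []
    for _ in range(n_samples):
        reply = llm(SCRATCHPAD_PROMPT.format(question=question,
                                             summaries="\n".join(summaries)))
        # Parse the probability from the end of the model's reply.
        match = re.search(r"([01](?:\.\d+)?)\s*$", reply)
        if match:
            probs.append(float(match.group(1)))
    # The paper tests several ensembling methods; a median is used here
    # purely as a simple stand-in.
    return median(probs)
```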

All of this needs to be in place before fine-tuning because it's used to generate the fine-tuning data. The fine-tuning was done on a selected subset of the data: the examples where the model's forecast outperformed the human crowd. But they also discard examples where the model did too much better than the crowd. As they put it, "We seek to fine-tune our model on strong forecasts" -- hence keeping only the subset where the model beat the crowd -- but "this can inadvertently cause overconfidence in our fine-tuned model," so the examples where the model exceeds the crowd prediction by too wide a margin are discarded.
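A sketch of that selection rule under my reading of it (the cutoff value and exact criterion are assumptions, not taken from the paper):

```python
def select_finetuning_examples(examples, max_gap=0.3):
    """Keep examples where the model beat the crowd, but not by 'too much'.

    Each example is a dict with the model's probability, the crowd's probability,
    and the realized outcome (0 or 1). The max_gap cutoff is an assumed
    placeholder, not the paper's actual threshold."""
    selected = []
    for ex in examples:
        model_err = (ex["model_prob"] - ex["outcome"]) ** 2
        crowd_err = (ex["crowd_prob"] - ex["outcome"]) ** 2
        beats_crowd = model_err < crowd_err                 # model outperformed the crowd
        too_far_ahead = (crowd_err - model_err) > max_gap   # "too much better"
        if beats_crowd and not too_far_ahead:
            selected.append(ex)
    return selected
```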

"The input to the model consists of the question, description, and resolution criteria, followed by summarized articles. The target output consists of a reasoning and a prediction. Importantly, the fine-tuning input excludes the scratchpad instructions. By doing so, we directly teach the model which reasoning to apply in a given context."

In addition, they did a "hyperparameter sweep" to optimize the system's hyperparameters. Here the "hyperparameters" were the search query prompt, the summarization prompt, the number of articles to keep and rank, the reasoning prompt, and the ensembling method for combining multiple answers (they tested 5 different methods).
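Conceptually the sweep is just a grid search over those choices, something like this sketch (the candidate values, ensembling-method names, and evaluation helper are all hypothetical placeholders):

```python
from itertools import product

# Candidate values are placeholders, not the paper's actual candidates.
search_prompts    = ["search_prompt_v1", "search_prompt_v2"]
summary_prompts   = ["summary_prompt_v1", "summary_prompt_v2"]
article_counts    = [5, 10, 15]
reasoning_prompts = ["scratchpad_v1", "scratchpad_v2"]
ensemble_methods  = ["mean", "median", "trimmed_mean", "geo_mean", "weighted"]

def sweep(evaluate_brier):
    """Grid search; 'evaluate_brier' is a hypothetical helper that runs the
    full pipeline with a given configuration on a validation set and returns
    its Brier score."""
    best_score, best_config = float("inf"), None
    for config in product(search_prompts, summary_prompts, article_counts,
                          reasoning_prompts, ensemble_methods):
        score = evaluate_brier(*config)
        if score < best_score:               # lower Brier score is better
            best_score, best_config = score, config
    return best_score, best_config
```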

Anyway, the end result of all this is that the language model system had a Brier score of .179, while the crowd had .149 -- a difference of only .03 (remember, lower is better). So the system comes very close to human-level performance. If traditional "accuracy" numbers are more intuitive to you, the system's accuracy was 71.5%, versus 77.0% for the human crowd.

Approaching human-level forecasting with language models

#solidstatelife #ai #genai #llms #futurology #predictionmarkets #brierscore