#rlhf

waynerad@diasp.org

"LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models."

One of their benchmarks was "Guess My City" so I decided to try that.


Me:

Guess my city.

ChatGPT:

Sure, I'd be happy to play a game of guessing your city! Please provide some clues or descriptions about your city, such as landmarks, geographical features, climate, or any other distinctive characteristics. The more details you can provide, the better I'll be able to make an accurate guess.

Me:

Landmarks: It has a building known as the "cash register". Geographical features: it has mountains to the west. Climate: cold winters and moderate summers with occasional hot days, with a lot of variation all year.

ChatGPT:

Based on your clues, I believe your city might be Denver, Colorado.

The "cash register" building you're referring to is likely the Wells Fargo Center, known for its unique architecture that resembles a cash register. The geographical feature of mountains to the west aligns with Denver's location at the eastern foot of the Rocky Mountains. Additionally, Denver is known for its cold winters and moderate summers, with a climate that can vary greatly, which matches your description of the climate in your city.

Is this correct?

Me:

Yes

ChatGPT:

Great! I'm glad I guessed correctly. Denver is a fascinating city with its unique blend of urban and natural landscapes. If you have any more questions or need information about Denver or anything else, feel free to ask!


I also tried this with Claude, Bard, and DeepSeek.

In the interest of brevity, I'm going to skip the full transcripts of the conversations.

Claude guessed Dallas, Atlanta, and San Francisco before guessing Denver. It thought "cash register" could refer to buildings in those cities, and it had to be given a hint that my city had mountains to the west. It didn't think to ask about geographical features right off the bat, which would have prompted me to tell it that.

Bard did something totally unexpected: it ran Google Maps searches. This (surprisingly or unsurprisingly?) was unhelpful. It guessed Los Angeles because there is an "American Cash Register Company" there. It had to be told that "cash register" is a nickname.

DeepSeek (the Chinese large language model) initially just said sorry, it didn't have enough information to guess my city. Unlike the others, it didn't immediately recognize "guess my city" as a game. I had to explain that it's a game where it was supposed to ask me questions to guess my city. When I followed up with the country and geography, it responded with encyclopedic information on the geography of the western United States. Once it properly understood it was supposed to ask questions, it asked whether my city had a nickname. I said yes, "The Mile High City," and it immediately guessed Denver from there.

By the way, all the large language models gave a different name for Denver's "cash register" building (so called because it has the silhouette of a cash register if viewed from the proper angle), and I don't know which is correct because I don't know the true name of the building and can't be bothered to figure it out.

What this is all about is "evaluating capabilities enabled by reinforcement learning". As you may or may not know, what enables large language models to function as "chatbots" is not just their "predict the next token" language training (which is called self-supervised training, for historical reasons, so don't worry if the term makes no sense), but an additional technique called reinforcement learning from human feedback (RLHF). In RLHF, humans rank the model's outputs; those rankings are used to train a separate reward model, which is then flipped around and used as the reward signal for fine-tuning the original language model, teaching it to behave "helpfully". This is why ChatGPT and its ilk come across as so eager to please you. It's a complicated system, but what's important for the moment are the words "reinforcement learning".

Reinforcement learning is the field of AI that led to the systems that beat humans at the Chinese game of Go, as well as chess and shogi, including beating the best human-made chess engine, Stockfish. Reinforcement learning works by getting input from an environment along with a reward signal: for example, the screen pixels of Atari games, plus the score as the reward signal.

Anyway, these researchers got the idea that, since large language models are trained with reinforcement learning, they might design some tests looking for characteristics of reinforcement learning and see if they can find evidence of reinforcement-learning-generated behavior from large language models.
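To make the environment-plus-reward loop concrete, here's a minimal sketch (my illustration, not from the paper) using the gymnasium API; the random policy is just a placeholder for a real agent:

```python
# A reinforcement learning loop in miniature: observe, act, receive reward.
# Assumes the gymnasium package; CartPole stands in for "Atari pixels plus score".
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()  # a real agent would pick actions to maximize reward
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        break
env.close()
print("episode reward:", total_reward)
```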

Here's the list of "core capabilities that reinforcement learning can enable in large language models" that they decided to look for:

"Strategic decision making. Reinforcement learning shines in goal-directed tasks that require multi-step planning and strategic decision making. Strategic decision-making can range from simple choices like asking follow-up questions to gather information (e.g., in the 20 Questions task), to complex strategy in chess."

"Complex language. Our benchmark includes realistic language and interaction scenarios, requiring large language models to combine their knowledge from pretraining to help solve tasks during reinforcement learning finetuning. Rather than focusing entirely on causal logic and strategy found in text games, several of our tasks specifically emphasize the use of realistic language."

"Credit assignment. In reinforcement learning, rewards are often delayed relative to the action that was pivotal to the outcome. For example, a seller agent might state a particularly compelling feature of the product and then, several turns later, complete a successful sale. Reinforcement learning must determine the statements that led to the good outcome, and reinforce them."

"Partial observability. In language tasks, the state consists of the entire history of tokens, and an agent may need to examine this entire context to infer the correct state. For example, the mental states of a speaker in a dialogue (e.g., whether the buyer is impatient in a selling task), previously observed facts in a guessing game, and other hidden variables might induce partial observability."

"Trajectory stitching. In a dataset with many suboptimal trajectories, it is necessary to join optimal actions from different suboptimal trajectories together to form the most optimal trajectory. An algorithm capable of trajectory stitching should be able to learn from optimal actions taken in unsuccessful trajectories and avoid suboptimal actions that occurred in successful trajectories."

They came up with 8 "tasks", called "Maze", "Text-Based Navigation", "Wordle", "Chess", "Chess Endgames", "Twenty Questions", "Guess My City", and "Car Dealer". Yes, they really did come up with a text-based way of playing chess (there's actually a standardized notation for chess moves). They even used Stockfish to generate data. And yes, Wordle is exactly the online Wordle game you are familiar with, where you get 6 attempts to guess a hidden 5-letter word, and after each guess you're told, for each letter, whether it's in the right position, in the word but not in the right position, or not in the hidden word at all.
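That per-letter feedback rule is simple enough to sketch in a few lines (my illustration, not the paper's code; it also ignores the duplicate-letter accounting the real game does):

```python
# Simplified Wordle feedback: "green" = right letter, right spot;
# "yellow" = in the word, wrong spot; "gray" = not in the word.
def wordle_feedback(guess: str, hidden: str) -> list[str]:
    feedback = []
    for i, letter in enumerate(guess):
        if letter == hidden[i]:
            feedback.append("green")
        elif letter in hidden:
            feedback.append("yellow")
        else:
            feedback.append("gray")
    return feedback

print(wordle_feedback("crane", "cider"))
# ['green', 'yellow', 'gray', 'gray', 'yellow']
```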

They have a grid (on page 4) showing, for each of the 8 tasks, which of the 5 "capabilities" it exercises (strategic decision making, complex language, credit assignment, partial observability, and trajectory stitching). For the task I tried above, "Guess My City", the grid says it exercises more than most: the first four (strategic decision making, complex language, credit assignment, and partial observability), but maybe not the last one, trajectory stitching.

LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models

#solidstatelife #ai #genai #llms #rlhf #reinforcementlearning

waynerad@diasp.org

10 noteworthy AI research papers of 2023 that Sebastian Raschka read. I didn't read them, so I'm just going to pass along his blog post with his analyses.

1) Pythia: Insights from Large-Scale Training Runs -- noteworthy because it does indeed present insights from large-scale training runs, such as, "Does pretraining on duplicated data make a difference?"

2) Llama 2: Open Foundation and Fine-Tuned Chat Models -- noteworthy because it introduces the popular Llama 2 family of models and explains them in depth.

3) QLoRA: Efficient Finetuning of Quantized LLMs -- this is a fine-tuning technique that is less resource-intensive. LoRA stands for "low-rank adaptation" and the "Q" stands for "quantized". "Low rank" is just a fancy way of saying the added matrices have few dimensions (see the sketch after this list). The "Q" part reduces resources further by "quantizing" the matrices, which means using fewer bits, and therefore lower precision, for all the numbers.

4) BloombergGPT: A Large Language Model for Finance -- noteworthy because, well, not just because it's a relatively large LLM pretrained on a domain-specific dataset, but because, he says, it "made me think of all the different ways we can pretrain and finetune models on domain-specific data, as summarized in the figure below" (which is actually not in the paper).

5) Direct Preference Optimization: Your Language Model is Secretly a Reward Model -- noteworthy because it tackles head-on the challenge of replacing the Reinforcement Learning from Human Feedback (RLHF) technique. "While RLHF is popular and effective, as we've seen with ChatGPT and Llama 2, it's also pretty complex to implement and finicky." The replacement technique is called Direct Preference Optimization (DPO). This has actually been on my reading list for weeks and I haven't gotten around to it. Maybe I will one of these days, and then you can compare my take with his, which you can read now.

6) Mistral 7B -- noteworthy, despite the paper's brevity, because it's the base model used for the first DPO model, Zephyr 7B, which has outperformed similarly sized models and set the stage for DPO to replace RLHF. It's additionally noteworthy for its "sliding window" attention mechanism, and for the "mixture of experts" technique used in its Mixtral follow-up.

7) Orca 2: Teaching Small Language Models How to Reason -- noteworthy because it uses the famous "distillation" technique, where a large language model such as GPT-4 is used to create training data for a small model.

8) ConvNets Match Vision Transformers at Scale -- noteworthy because if you thought vision transformers relegated the old-fashioned vision technique, convolutional neural networks, to the dustbin, think again.

9) Segment Anything -- noteworthy because of the creation of the world's largest segmentation dataset to date with over 1 billion masks on 11 million images. And because in only 6 months, it has been cited 1,500 times and become part of self-driving car and medical imaging projects.

10) Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models -- this is about Emu, the text-to-video system. I'm going to be telling you all about Google's system as soon as I find the time, and I don't know if I'll ever have time to circle back and check out Emu, but I encourage you all to check it out for yourselves.
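As promised above, here's a minimal sketch of the low-rank adaptation idea behind QLoRA (my illustration, not the paper's code; QLoRA itself also quantizes the frozen weights to 4 bits, which plain NumPy doesn't capture):

```python
# Low-rank adaptation in miniature: freeze the big weight matrix W and
# learn two small matrices A and B; the effective weight is W + B @ A.
import numpy as np

d, k, r = 512, 512, 8               # layer dimensions, and the much smaller rank
W = np.random.randn(d, k)           # frozen pretrained weight
A = np.random.randn(r, k) * 0.01    # trainable, r x k
B = np.zeros((d, r))                # trainable, d x r (zero-init so training starts at W)

def adapted_forward(x):
    # Only A and B would receive gradient updates; W stays untouched.
    return x @ (W + B @ A).T

x = np.random.randn(1, k)
print(adapted_forward(x).shape)  # (1, 512)
```

Only r x k + d x r = 8,192 numbers get trained instead of the full 262,144 in W, which is where the resource savings come from.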

Ten noteworthy AI research papers of 2023

#solidstatelife #ai #genai #llms #convnets #rlhf #computervision #segmentation

waynerad@diasp.org

"Automuse: A system for generating fiction novels".

The system combines something called Plotto, a system of plot formulas, with GPT-4. They've also made an "eBook publication pipeline", so you can get the novels you generate onto your e-book reader.

"Plotto is a collection of 1,462 generic plot conflicts that can be chained together into a 'masterplot' that forms the core plot structure for the story. The rules for chaining the plot conflicts together is called the "algebra for stories".

It was originally published in -- get this -- 1928, by William Wallace Cook. This "algebra for stories" got encoded into software by a project called Plottoriffic.
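To give a feel for what "chaining plot conflicts" means, here's a purely hypothetical sketch (the conflict texts and links are invented for illustration, not Plotto's actual data):

```python
# Each conflict lists which conflicts may follow it; a "masterplot" is a
# walk through that graph until a conflict with no follow-ups is reached.
import random

conflicts = {
    1: ("A seeks to win the love of B", [2, 3]),
    2: ("A's rival spreads a damaging rumor about A", [3]),
    3: ("A undertakes a dangerous journey to prove worthy of B", []),
}

def build_masterplot(start: int) -> list[str]:
    plot, current = [], start
    while True:
        text, followups = conflicts[current]
        plot.append(text)
        if not followups:
            return plot
        current = random.choice(followups)

print(" -> ".join(build_masterplot(1)))
```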

This project, Automuse, adds the final piece by adding GPT-4.

"It's worth noting that Plotto is very much a product of its time. Plotto was written in the late 1920's and as such the information it generates is very dated and can sometimes generate things that are seen as problematic in modern sensibilities. Luckily, ChatGPT seems to sand away this roughness and is able to fabricate a better premise."

Plotto determines the premise of the novel, the major actors and their functions, the overall motivations, and the end result of the story. ChatGPT turns this into a plot summary for the novel. ChatGPT next creates a list of chapters for the novel, with a high-level summary of the events that happen in each. In actually writing the chapters, they have a technique for feeding the preceding text back in to maintain continuity, although it doesn't always maintain continuity.
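The continuity trick might look something like this sketch (my guess at the shape of it, using the OpenAI chat API; the model name and prompts are placeholders, not Automuse's actual code):

```python
# When writing chapter N, feed the tail of everything written so far back
# into the prompt. Context windows are finite, which is one reason the
# continuity sometimes breaks anyway.
from openai import OpenAI

client = OpenAI()

def write_chapter(chapter_summary: str, previous_text: str) -> str:
    context = previous_text[-4000:]  # keep only the most recent text
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are writing a novel, one chapter at a time."},
            {"role": "user", "content": f"Story so far:\n{context}\n\n"
                                        f"Write the chapter summarized as: {chapter_summary}"},
        ],
    )
    return response.choices[0].message.content
```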

"The outputs of the program have been described as 'hilarious', 'partially nonsensical', and overall they have left readers wanting more somehow."

Stable Diffusion is used to generate cover art, and a tool called Pandoc stitches everything together into an e-book.

Automuse: A system for generating fiction novels

#solidstatelife #ai #genai #llms #gpt #rlhf #fiction #novels

waynerad@diasp.org

What has AutoGPT actually accomplished? Nothing?

"Some people are reporting it has been useful as a way of generating market research, that it is good at this and faster than using the traditional GPT-4 or Bing interfaces."

"Right now, AutoGPT has a tendency to get distracted or confused or caught in a loop, to leave things half-finished, to not be that robust of an agent, and other issues like that. Positive reports seem limited to things GPT-4 or Bing can essentially do anyway, with the agent wrapper perhaps cutting down somewhat on how often you have to poke the interface with a stick to keep it pointed in a reasonable direction."

"That does not mean that all the people saying AutoGPTs are the future are wrong. AutoGPT's list of real accomplishments won't stay non-existent for long."

On AutoGPT

#solidstatelife #ai #generativemodels #nlp #llms #gpt #rlhf #autonomous