Richard Sutton interviewed by Edan Meyer. Rich Sutton literally half-wrote the book on reinforcement learning -- my reinforcement learning textbook, Reinforcement Learning: An Introduction, was written by him and Andrew Barto. I'd never seen him (or Andrew Barto) on video before, so this was interesting to see. (Full disclosure: I only read about half of the book, and I 'cheated' and didn't do all the exercises.)

The thing I found most interesting was his disagreement with the self-supervised learning approach. For those of you not up on the terminology, "self-supervised" means you take any data, mask out some piece of it, and train your neural network to "predict" the masked-out part from the part that isn't masked. The easiest way to do this is to unmask all the "past" data, mask all the "future" data, and ask the neural network to predict the "next word" or "next video frame" or "next" whatever. It's called "self-supervised" because neural network training started with paired inputs and outputs, where the "outputs" the network was to learn were written by humans, and this came to be called "supervised" learning. "Unsupervised" learning came to refer to throwing mountains of data at an algorithm and asking it to find whatever patterns are in there. So to describe this alternate mode, which is like "supervised" learning except the "correct answers" are created just by masking out input data, the term "self-supervised" was coined.
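To make that concrete, here's a tiny Python sketch (my own illustration, not anything from the interview) of how "self-supervised" training pairs can be manufactured from unlabeled data just by masking out the future:

```python
# Build next-token training pairs from one unlabeled sequence.
# The "label" is just the next element of the data itself --
# no human annotation needed.
def next_token_pairs(tokens):
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[:i]   # the unmasked "past"
        target = tokens[i]     # the masked-out "future" element to predict
        pairs.append((context, target))
    return pairs

print(next_token_pairs(["the", "cat", "sat", "on", "the", "mat"]))
# [(['the'], 'cat'), (['the', 'cat'], 'sat'), ...]
```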

I thought "self-supervised" learning was a very important breakthrough. It's what led directly to ChatGPT and all the other chatbots we know and love (we do love them right?). But Rich Sutton is kind of a downer when it comes to self-suprevised learning.

"Outside of reinforcement learning is lots of guys trying to predict the next observation, or the next video frame. Their fixation on that problem is what I mean by they've done very little, because the thing you want to predict about the world is not the next frame. You want to predict consequential things. Things that matter. Things that you can influence. And things that are happening multiple steps in the future."

"The problem is that you have to interact the world. You have to predict and control it, and you have large sensory sensory motor vectors, then the question is what is my background? Well, if I'm a supervised learning guy, I say, maybe I can apply my supervised learning tools to them. They all want to have labels, and so the labels I have is the very next data point. So I should predict that that next data point. This is is a way of thinking perfectly consistent with their background, but if you're coming from the point of reinforcement learning you think about predicting multiple steps in the future. Just as you predict value functions, predict reward, you should also predict the other events -- these things will be causal. I want to predict, what will happen if I if I drop this? Will it spill? will there be water all over? what might it feel on me? Those are not single step predictions. They involve whole sequences of actions picking things up and then spilling them and then letting them play out. There are consequences, and so to make a model of the world it's not going to be like a video frame. It's not going to be like playing out the video. You model the world at a higher level."

I talked with Rich Sutton - Edan Meyer

#solidstatelife #ai #reinforcementlearning #rl
