#sgd

waynerad@diasp.org

The full text of Simon J.D. Prince's book Understanding Deep Learning is available online for free -- though the author asks that you buy the book and write a (positive, one would hope) review on Amazon. He will make a 2nd edition if sales are good.

The book is around 500 pages and a glance at the table of contents shows it goes from fundamentals to very advanced topics: Supervised learning, shallow neural networks, deep neural networks, loss functions (maximum likelihood, univariate regression, classification, cross-entropy, etc), gradient descent, stochastic gradient descent, initialization, the backpropagation algorithm, hyperparameters, regularization, convolutional neural networks, residual networks, transformers, graph neural networks, unsupervised learning, generative adversarial networks (styleGAN, etc), normalizing flows, variational autoencoders, diffusion models, reinforcement learning, why does deep learning work? and ethics. Appendices for notation, mathematics, and probability.

Simon J.D. Prince: Understanding Deep Learning

#solidstatelife #ai #deeplearning #sgd #backpropagation #genai #gans #aieducation

waynerad@diasp.org

Neural networks with only 1/100th the number of parameters? If Kolmogorov-Arnold Networks live up to their promise.

Rethinking the central assumptions of how layers work in neural networks. Neural networks are inspired by brain neurons, where outputs from one neuron shoot down axons, cross synapses, and into dendrites of the next neuron. This idea resulted in the simplification of thinking of neurons as "nodes" on a mathematical graph and the connections between them as edges. In neural networks, the "edges" perform linear transformations, and the "nodes" perform nonlinear transformations. (When there is an edge connecting every node in one layer to every node in the next, you have what's called a "fully connected" neural network -- also known as a "multi-layer perceptron" (MLP), a term which makes no sense -- well, it does if you understand the history of neural networks going back to the 1960s, but otherwise it doesn't. But it's used everywhere in the paper, so you ought to know that "MLP" and "fully connected neural network" mean the same thing in case you decide to dive into the paper.)

Actually when I learned neural networks, I learned that they alternate between "linear transformation" layers and "activation" layers. The reason for this is that if you did only "linear transformation" layers, you could combine them all into a single "linear transformation" layer. Mathematically, this is represented as a matrix operation. If you do a matrix operation, followed by another, you can combine the two into a third matrix operation that does the equivalent of the first two. Put another way, a neural network with only linear transformations can be smashed flat into a single-layer linear transformation. In order to be able to make deeper neural networks, and neural networks capable of learning nonlinear relationships between inputs and outputs, we have to stick non-linear layers in between. People initially used fairly complex activation functions, like tanh, but now most people use ReLU, which stands for "rectified linear unit" and is a fancy way of saying we chop off the negative numbers (the function returns 0 for x < 0). That is, believe it or not, enough non-linearity for the whole system to work.
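
To make the collapsing-into-one-layer point concrete, here's a minimal numpy sketch (my own illustration, not from the paper): two stacked linear layers are exactly one combined matrix, and sticking a ReLU in between breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)           # a 4-dimensional input
W1 = rng.normal(size=(3, 4))     # first linear layer
W2 = rng.normal(size=(2, 3))     # second linear layer

# Two stacked linear layers...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer whose matrix is the product W2 @ W1.
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))   # True

# Inserting a ReLU in between breaks that equivalence, adding nonlinearity.
relu = lambda z: np.maximum(z, 0.0)
nonlinear = W2 @ relu(W1 @ x)
print(np.allclose(nonlinear, one_layer))    # False (in general)
```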

Here, though, the researchers start with the easier-to-visualize idea that "edges" represent linear transformations while "nodes" represent the non-linear "activation" functions. Even the term "activation" is inspired by the brain, where neurons only "spike" if they get sufficient input to "activate" them. Neural networks in computers, though, don't "spike" -- everything is done with continuous, floating-point numbers.

This conception of edges being linear and nodes being non-linear is important to keep in mind, because with Kolmogorov-Arnold Networks, aka KANs, we're going to flip those around.

The inspiration for this comes from the fact that there is a mathematical proof underlying neural networks, known as the universal approximation theorem. The universal approximation theorem proves that for any given mathematical function, it is possible for a neural network, mathematically modeled as alternating linear transformations and non-linear functions as described above, to approximate that mathematical function to within some error margin. It doesn't say how big a neural network you'll have to make, just that it's guaranteed to be possible if you make your neural network big enough.

Well, two mathematicians, Vladimir Arnold and Andrey Kolmogorov, proved a theorem called the Kolmogorov-Arnold Representation theorem, and the idea here is that this alternative theorem can lead to similar results as the universal approximation theorem, but in a different way.

The Kolmogorov-Arnold Representation theorem proves that for any mathematical function that takes multiple parameters, it is possible to find two sets of functions that, when combined in the right way, will give you the same output, but where each of those functions only takes a single input parameter. In other words, a function that takes, say, a 3-dimensional input can be "decomposed" into two sets of functions that take only a single (1-dimensional) input. The way these are combined is, and I'll try to describe this simply and without going into gory detail (you can look it up if you want that): for each function in the second set, you sum up the outputs of the first-set functions that feed into it, run that sum through the second-set function, and then sum up all of those results. Or to put it even more simply, you're doing a combination of "summation" and "composition", where "composition" means taking the output of one function and making it the input of another function (think "f(g(x))").
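
To make the "summation and composition" wiring concrete, here's a small sketch in code. The univariate functions below (tanh and sin) are arbitrary placeholders I picked for illustration; the theorem only says suitable functions exist, it doesn't say what they are.

```python
import numpy as np

# Structure of the Kolmogorov-Arnold representation for an n-dimensional input:
#   f(x_1, ..., x_n) = sum over q of Phi_q( sum over p of phi_{q,p}(x_p) )
# where q runs from 1 to 2n+1 and every phi_{q,p} and Phi_q takes a single
# number and returns a single number.

def ka_representation(x, inner, outer):
    """x: length-n input; inner[q][p] and outer[q] are univariate functions."""
    n = len(x)
    total = 0.0
    for q in range(2 * n + 1):
        s = sum(inner[q][p](x[p]) for p in range(n))  # inner sum over inputs
        total += outer[q](s)                          # composition, then outer sum
    return total

n = 3
inner = [[np.tanh for _ in range(n)] for _ in range(2 * n + 1)]  # placeholders
outer = [np.sin for _ in range(2 * n + 1)]                       # placeholders
print(ka_representation(np.array([0.1, 0.2, 0.3]), inner, outer))
```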

The Kolmogorov-Arnold Representation theorem doesn't tell you what these functions are, only that they are proven to exist. These researchers, though, since they are looking for a way to approximate to arbitrary accuracy, chose splines as the functions to use. If you're not familiar with splines, I have some links below. Splines are used in video games and computer-generated imagery in movies and computer-aided design (CAD) programs for manufactured objects and lots of other places to generate smooth curves.

Now think of your mental image of a neural network as nodes and edges. The splines will be the functions along the edges. The nodes will simply sum up the output of the edges. So the "edges" do the splines and represent the "composition" part of the Kolmogorov-Arnold Representation, and the "nodes" represent the summations. And now you can see, we've reversed our linearity & non-linearity: the "edges" are now the "non-linear" part of the system, and the "nodes" are now the "linear" part. (While the original Kolmogorov-Arnold Representation prescribed exactly two sets of functions, in specific quantities, this generalization allows for any number of sets of functions, which correspond to "layers" in the neural network, and any number of functions in each set, which corresponds to the "size" (number of neurons) of each layer.)
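
Here's a minimal sketch of a single KAN layer under those conventions: a learnable univariate function on every edge, plain summation at every node. I'm using Gaussian bumps as a stand-in for the B-spline basis the paper actually uses, so treat this as an illustration of the wiring, not the paper's implementation.

```python
import numpy as np

def edge_basis(x, grid):
    """Evaluate a fixed set of univariate basis functions at the scalar x."""
    return np.exp(-(x - grid) ** 2 / 0.5)           # shape: (num_basis,)

def kan_layer(x, coeffs, grid):
    """x: (d_in,), coeffs: (d_out, d_in, num_basis) -- one function per edge."""
    d_out, d_in, _ = coeffs.shape
    out = np.zeros(d_out)
    for j in range(d_out):                          # each output node...
        for i in range(d_in):                       # ...sums its incoming edges
            out[j] += coeffs[j, i] @ edge_basis(x[i], grid)
    return out

rng = np.random.default_rng(0)
grid = np.linspace(-2, 2, 8)                        # knot positions ("the grid")
coeffs = rng.normal(size=(4, 2, grid.size))         # 2 inputs -> 4 outputs
print(kan_layer(np.array([0.3, -1.2]), coeffs, grid))
```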

So that's the essence of the idea, glossing over a lot of mathematical details. In the paper they prove a theorem that this architecture can approximate any mathematical function to within a given arbitrary error margin, provided the mathematical function is continuously differentiable. So now we have our analogy with the universal approximation theorem for Kolmogorov-Arnold Networks. The "continuously differentiable" assumption is helpful, too, because it means we can learn all the parameters to the splines with backpropagation and gradient descent!

Kolmogorov-Arnold Networks have a couple more tricks up their sleeves, too. One is that the network can be made larger without throwing all the learned parameters away and starting over -- sort of. With regular neural networks, if you wanted to, say, double the size of the network, you have to make a new network twice as big and train it from scratch with your training dataset. With Kolmogorov-Arnold Networks, you can increase the precision of your splines by adding more parameters to them, without throwing away any of the parameters you've already learned. (You go from "coarse-grained spline" to "fine-grained spline".) But I say "sort of" because you can't make any arbitrary change, like change the number of layers. But it's really cool that they have some ability to expand their learning ability by adding parameters after the fact, something normal neural networks can't do at all. (They call expanding the capacity of the neural network in this manner "grid extension". It's called "grid extension" because historically the set of points defining a spline is called the "grid".)
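
Here's a rough sketch of the grid extension idea as I understand it: fit the coefficients of a finer-grid version of an edge function so that it reproduces the coarse-grid version you already learned, then keep training from there. (Same Gaussian-bump stand-in as above, not the paper's actual B-spline machinery.)

```python
import numpy as np

def basis(xs, grid):
    """Evaluate the univariate basis at each point in xs: (n_points, n_basis)."""
    return np.exp(-(xs[:, None] - grid[None, :]) ** 2 / 0.5)

rng = np.random.default_rng(0)
coarse_grid = np.linspace(-2, 2, 5)
fine_grid = np.linspace(-2, 2, 20)
coarse_coeffs = rng.normal(size=coarse_grid.size)     # pretend these were learned

xs = np.linspace(-2, 2, 200)                          # sample points
old_fn = basis(xs, coarse_grid) @ coarse_coeffs       # the coarse edge function

# Least-squares fit: fine-grid coefficients that reproduce the old function,
# so nothing already learned gets thrown away.
fine_coeffs, *_ = np.linalg.lstsq(basis(xs, fine_grid), old_fn, rcond=None)
new_fn = basis(xs, fine_grid) @ fine_coeffs
print(np.max(np.abs(new_fn - old_fn)))                # small residual
```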

Another trick up the sleeve of Kolmogorov-Arnold Networks is the lack of "catastrophic forgetting". It turns out that in regular neural networks, using the same "activation" function for all neurons in a layer causes them all to be causally linked, which means that if the neurons in one part learn a task, they will forget it when neurons in another part learn a second task. Neural networks have to be trained on all the tasks they have to learn all at once. (Companies like OpenAI have learned to "ensemble" separately trained neural networks to deal with this problem.) In Kolmogorov-Arnold Networks, if neurons in one part learn one task, neurons in another part can learn a different task, and the two won't interfere with each other. So Kolmogorov-Arnold Networks don't have the "catastrophic forgetting" problem.

They came up with an "auto-discovery" system and found that the auto-discovered KANs were smaller than the human-constructed ones. Humans tend to construct networks that look like the math formulas, but the "auto-discovery" system could find shortcuts.

They make the claim that KANs are more "interpretable" than regular neural networks. Well, maybe for the smallest KANs. For example, one of the "toy" functions that they tested on was x/y (i.e. a 2-dimensional input where the output is just one number divided by the other). They peek "under the hood" and discover what the splines are approximating is exp(log(x) - log(y)). This kind of "interpretability", I assume, would diminish rapidly once the network starts to get big. This network was a [2, 1, 1] network, meaning 2 neurons for the input, a 1-neuron hidden layer, and a 1-neuron output (only 2 layers because the input isn't counted as a "layer"). Imagine those numbers going into the hundreds: how are you going to interpret that?

They tested the system by testing its ability to learn a set of mathematical functions. These functions were: Jacobian elliptic functions, elliptic integrals, Bessel functions, Associated Legendre functions, and spherical harmonics -- and I'm not going to say any more about them because I have no idea what any of those are. They also got 27 equations from Richard Feynman's textbooks. What they do with these is show they can get better results than regular neural networks with the same number of parameters, plus their "auto-discovery" system can find substantially more compact representations of the formulas. They would like to interpret these KANs and learn more about how these math formulas could be represented more compactly.

For the big test, they get into a branch of mathematics called knot theory. The idea is to take a knot as input, and find its "signature". I don't know enough about knot theory to tell you what a knot's "signature" is. But I can tell you they compared their system with DeepMind's, and DeepMind had 78% accuracy with a 4-layer, 300-neurons-per-layer neural network, while they had 81.6% accuracy on a [17, 1, 14] (2-layer) KAN. DeepMind's neural network had 30,000 parameters, while theirs had 200.

It remains to be seen whether this algorithm can be deployed at scale and reduce the number of parameters of neural networks in the wild by a factor of 100. But it is a fascinating development nonetheless. See below for a video saying KANs are overhyped.

#solidstatelife #ai #sgd #backpropagation #mlps

https://arxiv.org/abs/2404.19756

waynerad@diasp.org

Breakthrough in neural networks that could enable them to get much deeper than current neural networks. "Towards training without depth limits: Batch normalization without gradient explosion".

So what this is about is making deep neural networks even deeper, and to do that problems with rank collapse, exploding and/or vanishing gradients have to be solved.

Before diving in, a bit of background information. The main idea of a neural network is you simulate the connections between neurons as linear functions, and you represent these mathematically as matrices. This lets you bring a whole body of mathematics known as "linear algebra" to bear on the problem of making neural networks. The columns represent your inputs, the rows represent your outputs, and the entries in the matrix are your connection "weights", which are the parameters of the neural network (or at least most of them). If you have a lot of 0s, then you have a sparsely connected neural network, and if you have every cell filled in, then you have a densely connected network.
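
A tiny numpy illustration of that picture (mine, not from the paper): the weight matrix maps a 3-dimensional input to a 2-dimensional output, and zero entries are missing connections.

```python
import numpy as np

# Rows = outputs, columns = inputs, entries = connection weights.
W_dense = np.array([[0.2, -1.0,  0.5],
                    [0.7,  0.3, -0.1]])   # every input connected to every output
W_sparse = np.array([[0.2,  0.0,  0.0],
                     [0.0,  0.3, -0.1]])  # zeros are missing connections

x = np.array([1.0, 2.0, 3.0])             # 3 inputs
print(W_dense @ x)                        # 2 outputs, densely connected
print(W_sparse @ x)                       # 2 outputs, sparsely connected
```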

The way you train your network is by calculating the difference between the output and the "correct" output -- the "error". You use these "error" values to make adjustments to your parameters so that next time around, there will be less error.

If you have multiple layers, though, you have a problem: combining multiple layers of linear transformations is equivalent to a single linear transformation that combines them all. So you insert non-linear layers in between your linear layers. These have become known as "activation" layers. Some of these are complex functions, like hyperbolic tangent (tanh), and some are simple, like ReLU (which is short for "rectified linear unit" and just means y = x for positive numbers and y = 0 for negative numbers). What matters is that these nonlinear layers are differentiable. That's because what you do in order to extend the "error correction" process into deeper layers is use the chain rule from calculus. So as long as you can calculate a "gradient" on every layer, where a "gradient" is the multi-dimensional equivalent of the "derivative" you learned in calculus, you can apply the chain rule and keep going into deeper layers. This is where we get terms like "gradient descent" (and "stochastic gradient descent"), because you "descend" the "gradient" to do your "error correction", and "backpropagation", because you propagate your "error correction" signal backwards through the layers, in the opposite direction from the "forward pass" from input to output.
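
Here's the chain rule at work on the smallest possible example I could come up with (all scalars, my own illustration, not from the paper): one linear layer, one ReLU, another linear layer, and a squared-error loss, with the error signal pushed backwards through each step.

```python
# Tiny network: loss = 0.5 * (w2 * relu(w1 * x) - y)^2, everything a scalar.
def relu(z): return max(z, 0.0)
def drelu(z): return 1.0 if z > 0 else 0.0

x, y = 2.0, 1.0        # one training example: input and "correct" output
w1, w2 = 0.5, -0.3     # the two parameters

# Forward pass
h = w1 * x             # linear layer
a = relu(h)            # activation layer
out = w2 * a           # linear layer
loss = 0.5 * (out - y) ** 2

# Backward pass: chain rule, layer by layer, from the loss back to the weights.
dloss_dout = out - y
dloss_dw2 = dloss_dout * a            # gradient for the second layer's weight
dloss_da = dloss_dout * w2
dloss_dh = dloss_da * drelu(h)        # through the activation layer
dloss_dw1 = dloss_dh * x              # gradient for the first layer's weight
print(dloss_dw1, dloss_dw2)
```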

When training neural networks, there is a technique called "batch normalization". "Normal" here is a reference to the "normal distribution" in statistics. What this does is rescale the outputs of a layer, across a batch of training examples, so that the inputs to the next layer have an average of 0 and a standard deviation of 1. This has the effect of making the inputs more "orthogonal", which you can think of as being "less parallel". When two input calculations become "parallel" and the inputs to the next layer become identical, you have in essence lost an input parameter to the next layer. I'm using the word "parallel" because parallel lines close together or overlapping are easy to visualize, but the actual word for this in linear algebra is "collinear". This loss of an input parameter is called "rank collapse", because everything has to have strange names. Ok, "rank" here is the number of independent dimensions spanned by a set of vectors (the columns of a matrix). So when your columns duplicate each other, you've in essence decreased the number of dimensions. Hence "rank collapse".
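
A minimal numpy sketch of the normalization step itself (leaving out the learnable scale and shift that full batch normalization adds back in):

```python
import numpy as np

def batch_norm(X, eps=1e-5):
    """X: (batch_size, features). Normalize each feature across the batch."""
    mean = X.mean(axis=0)
    var = X.var(axis=0)
    return (X - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=10.0, size=(32, 4))   # 32 examples, 4 features
X_norm = batch_norm(X)
print(X_norm.mean(axis=0))   # ~0 for every feature
print(X_norm.std(axis=0))    # ~1 for every feature
```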

So, orthogonal good. And "batch normalization" increases orthogonality, so people do it. Normalization is in fact done from the very start, with the initialization step, where all the parameters are initialized to random values. They're initialized to random values, but then they are normalized, and this helps keep everything orthogonal right from the start.

Paradoxically, however, batch normalization causes a problem called "exploding gradients". While batch normalization reduces the amount of variance on the forward pass, the variance can actually get amplified on the backward pass, when the backpropagation algorithm updates the parameter values. The deeper the neural network, the greater the sensitivity to small changes in the parameters. Also, there's that word "batch" in "batch normalization". Normalization is done in batches because the entire training set is usually too big. But if a batch is not statistically representative of the entire training set, then the batch normalization process itself introduces noise into the process. This noise can get amplified in the backward pass and, because the effect is cumulative across layers, has a greater effect the deeper the neural network.

Various attempts to mitigate the exploding gradients problem have been invented, but those can paradoxically cause the reverse problem -- "vanishing gradients". What to do?

What the researchers here decided to do was to try to come up with a metric to track "orthogonality" directly. The metric they invented they call the "isometry gap". The way it is calculated: take a matrix X, transpose it, and multiply that by X. The numerator is the determinant of that product, taken to the dth root, where d is the number of dimensions (columns) of X. The denominator is the trace of that same product, divided by d. The isometry gap is the negative logarithm of that ratio.
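
In code, my reading of that definition looks like the sketch below (treat it as a sketch; the paper's exact constants and conventions may differ):

```python
import numpy as np

def isometry_gap(X):
    """-log( det(X^T X)^(1/d) / (trace(X^T X) / d) ), d = number of columns of X.
    Roughly 0 when the columns are orthogonal with equal norms; it blows up
    as the columns become collinear."""
    d = X.shape[1]
    G = X.T @ X                                     # the Gram matrix
    return -np.log(np.linalg.det(G) ** (1.0 / d) / (np.trace(G) / d))

rng = np.random.default_rng(0)
X_orth = np.linalg.qr(rng.normal(size=(8, 4)))[0]             # orthonormal columns
X_coll = np.ones((8, 4)) + 1e-3 * rng.normal(size=(8, 4))     # nearly collinear columns
print(isometry_gap(X_orth))   # ~0
print(isometry_gap(X_coll))   # large
```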

That's probably not intuitive at all, so here's the intuition they describe for it: first think of the columns of X as representing a parallelogram, but with more dimensions -- the higher-dimension equivalent of a parallelogram. (This is called a "parallelepiped". Going from 2 to 3 dimensions, picture a box but allowing side lengths on some dimensions to be any length and allowing any angles.) This is formed simply by interpreting the columns of X as vectors. Now you can think of the determinant in the numerator as equivalent to the volume squared of this higher-dimension equivalent of a parallelogram.

This brings us to the trace term in the denominator. The intuition is that this represents "the sum squared-norms of the columns of X", or to put it in even more regular language, you're getting a measure of how far the columns are from the origin, except it is a combined measure for all the columns. This is a bit hard to see because, first, the word "norms" here means "magnitudes" (distances from the origin), and has nothing to do with the word "normal" we've been using until now. And second, you wouldn't (or at least I wouldn't) guess this from how it is calculated. The X-transpose-times-X operation gives you a square d-by-d matrix, where d is the number of columns of X, the same number of dimensions we've been talking about (the number of rows disappears); in linear algebra this is known as a "Gram matrix". Next, the trace operation is made by summing up the diagonal elements of the matrix. The two operations put together amount to summing up squares of elements of the original matrix X, like terms in the Pythagorean distance formula.

Anyway, the result of taking the ratio of these two terms, plus all the other details I'm skipping over -- the root d and divide by d and negative logarithm and everything -- is a number that "provides a scale invariant notion of volume and isometry."

"On the one hand, if there is any collinearity between the columns, the volume will vanish and the isometry gap will be infinity." "On the other hand, isometry gap = 0 implies X-transpose-times-X is a scaled identity matrix." So you can see here the intuition that this number = 0 means perfect orthogonality, and infinity represents "collinearity between the columns" and thus "rank collapse".

The next concept we have to introduce is the concept of a Lyapunov function. These actually come from the realm of differential equations. The idea is that a differential equation can have an equilibrium point, but the equilibrium point may be stable or unstable. For certain classes of differential equations, if you can prove there exists a Lyapunov function, then you can prove the equilibrium of the differential equation is stable.

While the mathematical formalism (linked to below) is beyond my understanding of differential equations, the intuition here is that we want to invent a neural network such that the isometry gap function is always decreasing. Lyapunov functions can be thought of as "energy levels" for the differential equation, which "wants" to go to lower energy levels. The gradient of the Lyapunov function has to relate to the vectors of the differential equation. (See the video explainer below for this intuition.) If a function can be shown to be a Lyapunov function and shown to be decreasing over iterations of a system, that means the system will stabilize. And here, it will stabilize in a state with orthogonality in the input vectors at every layer, preventing rank collapse.

At this point we've established a metric for rank collapse and can demonstrate that batch normalization prevents rank collapse, but what about the problem of exploding gradients? They say that to avoid exploding gradients, you first make the number of training examples in each batch the same as the number of dimensions, and then you make sure all the parameters are orthogonal at initialization.

They say they need two main modifications to avoid gradient explosion: first, n = d, which looks like it means the number of training examples in each batch is the same as the number of dimensions, and second, that the weights W for each layer l are random orthogonal matrices. Orthogonality here means that W-transpose-times-W gives you the identity matrix, and the matrices are drawn according to a Haar measure.
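
A standard recipe for drawing a random orthogonal matrix from the Haar measure is to QR-decompose a Gaussian matrix and fix the signs using the diagonal of R. I'm assuming that's compatible with what the authors mean, so treat this as a sketch of the initialization, not their exact procedure.

```python
import numpy as np

def haar_orthogonal(d, rng):
    """Draw a d-by-d orthogonal matrix, Haar-distributed over the orthogonal group."""
    A = rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))        # column-wise sign correction

rng = np.random.default_rng(0)
W = haar_orthogonal(6, rng)
print(np.allclose(W.T @ W, np.eye(6)))    # True: W-transpose-times-W is the identity
```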

The Haar measure comes from group theory (the mathematics of symmetry groups -- in this case "locally compact topological groups") and in looking it up I immediately landed on a video about Lie groups which was incomprehensible to me, as well as the formal definition from Wolfram (linked to below). From what I can tell, the reason group theory is being brought into this is because group theory is immensely concerned with maintaining invariant properties no matter what operations are performed. In this case, what we want to guarantee remains invariant is orthogonality. As best I can tell, the idea here is to construct orthogonal matrices in such a way that they remain (provably) orthogonal under matrix multiplication. The Haar measure over the orthogonal group enables this. They say something called Weingarten calculus is required to do the calculations.

In searching for information on Weingarten calculus, I came across a 14-page article and what is essentially a 49-page book, both of which you can download in PDF format. And both of which I found incomprehensible.

As for that bit about making the number of training examples in each batch the same as the number of dimensions, I speculate that that may have been a necessary assumption to make their mathematical proofs doable, but that it might not matter in the real world.

So, to sum up: it looks like the problems of exploding gradients and vanishing gradients have been solved. From this we should expect that in the not-so-distant future, neural networks will become incredibly deep. Wait, aren't they incredibly deep already? Well, they're going to become incredibly incredibly deep. With the problems of rank collapse and exploding and vanishing gradients solved, there will be essentially no limit to how deep a neural network can become.

Having said that, this paper seems to prove such a thing is possible but not actually do it. Yes, the researchers created bare-bones neural networks to demonstrate that it works, but the heavy lifting will have to be done by others in the future. The problems that need to be solved include: how to guarantee this system will work for a variety of activation functions. The work here only addressed linear transformations. Well, the mathematical proofs only applied to linear transformations. They did experiments with sin and tanh activation layers (appendix E starting on page 30). Their theory should be extendable to activation layers. Maybe these researchers will do that, or maybe that work will fall to others. Furthermore, this work relies on very advanced mathematics, such as Weingarten calculus, and that will have to be incorporated into industry-standard tools like PyTorch to become widely available to developers who want to make ultra-deep neural networks.

The key to it all seems to be to initialize all the parameters of the neural network such that they are both "orthogonal" for all inputs at every layer, and guaranteed to remain orthogonal throughout the lifetime of the neural network, no matter what transformations are done to the parameters during the neural network's training. Exactly how that is accomplished, I can't fully explain to you because it involves mathematics more advanced than I'm able to understand.

What are the implications of all this? Currently, neural networks are limited to a few hundred layers, and they "cheat" and have inputs that skip layers in order to achieve this. (These are called residual networks.) If it's really true that the "exploding/vanishing gradients" problem has been solved, then once that has been incorporated into the toolchain, e.g. PyTorch, then we could be seeing neural networks with thousands or millions of layers. And vastly more intelligent than today's neural networks.
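
For reference, the "cheat" looks like this in a minimal sketch (my illustration): the block's input is added straight back onto its output, so the gradient always has a short path around the layers.

```python
import numpy as np

def relu(z): return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Two layers plus a skip connection: the '+ x' is the shortcut around them."""
    return x + W2 @ relu(W1 @ x)

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
print(residual_block(x, W1, W2))
```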

If people can get the number of layers into the billions, then neural networks will be approaching the size of the brain, which has ~100 billion neurons and ~1,000 trillion synapses. But considering current neural networks are in the mere hundreds of layers and are already showing considerable intelligence, I'm guessing AI systems will exceed human intelligence well before the number of layers extends into the billions.

Just to be clear, I'm the one speculating on thousands or millions (or billions) of layers, not the authors. (They never make any such claims.) I'm speculating on, if the problem of exploding/vanishing gradients is really and truly solved, what does that imply? I was actually thinking, billions of layers means there are layers billions of connections away from sensory input. I don't think, even though the brain has ~100 billion neurons, that any of them are billions of connections away from sensory input. At least, I would be very surprised if that was the case. So I think, in practice, nobody will ever make a neural network with billions of layers.

On the other hand, no one should ever say "never". Maybe someday neural networks will have billions of layers and be "superintelligences" that are as much smarter than us humans as we are than ants.

Towards training without depth limits: Batch normalization without gradient explosion

#solidstatelife #ai #sgd #gradients #explodinggradients #vanishinggradients #linearalgebra #batchnormalization #backpropagation #weingarten