#linearalgebra

waynerad@diasp.org

Breakthrough in neural networks that could enable them to get much deeper than they currently are: "Towards training without depth limits: Batch normalization without gradient explosion".

So what this is about is making deep neural networks even deeper, and to do that, the problems of rank collapse and exploding and/or vanishing gradients have to be solved.

Before diving in, a bit of background information. The main idea of a neural network is that you simulate the connections between neurons as linear functions, and you represent these mathematically as matrices. This enables you to bring a whole body of mathematics known as "linear algebra" to bear on the problem of making neural networks. The columns represent your inputs, the rows represent your outputs, and the entries of the matrix are your connection "weights", which are the parameters of the neural network (or at least most of them). If you have a lot of 0s then you have a sparsely connected neural network, and if you have every cell filled in then you have a densely connected network.
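
Here's a tiny numpy sketch of what I mean (my own illustration, not anything from the paper): one layer is just a matrix, with rows as outputs, columns as inputs, and a 0 entry standing in for a missing connection.

```python
import numpy as np

# One layer as a matrix: rows = outputs, columns = inputs, entries = weights.
# A 0 entry means "no connection" (a sparse spot in the network).
W = np.array([[0.2, -0.5, 0.0],   # output neuron 1
              [0.0,  0.3, 0.8]])  # output neuron 2

x = np.array([1.0, 2.0, -1.0])    # three inputs

y = W @ x                         # the whole layer is just a matrix-vector product
print(y)                          # two outputs
```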

The way you train your network is by calculating the difference between the output and the "correct" output -- the "error". You use these "error" values to make adjustments to your parameters so that next time around, there will be less error.

If you have multiple layers, though, you run into problems. The first is that combining multiple layers of linear transformations is the equivalent of another linear transformation that combines them all. So you insert non-linear layers in between your linear layers. These have become known as "activation" layers. Some of these are complex functions, like hyperbolic tangent (tanh), and some are simple, like ReLU (which is short for "rectified linear unit" and just means y = x for positive numbers and y = 0 for negative numbers). What matters is that these nonlinear layers are differentiable (at least almost everywhere). Because what you do in order to extend the "error correction" process into deeper layers is you use the chain rule from calculus. So as long as you can calculate a "gradient" on every layer, where a "gradient" is the multi-dimensional equivalent of a "derivative" as you learned in calculus, you can apply the chain rule and keep going into deeper layers. This is where we get terms like "gradient descent" (and "stochastic gradient descent") (because you "descend" the "gradient" to do your "error correction"), and "backpropagation", because you propagate your "error correction" signal backwards through the layers, in the opposite direction as when you're doing your "forward pass" from input to output.
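
To make that concrete, here's a bare-bones two-layer network with a hand-rolled backward pass (my own toy sketch, not the paper's setup; the variable names are made up): the chain rule is just those few lines that push the error signal back through each layer.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)             # y = z for positive numbers, 0 for negative

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3)) * 0.1    # first linear layer (4 outputs, 3 inputs)
W2 = rng.standard_normal((2, 4)) * 0.1    # second linear layer (2 outputs, 4 inputs)

x = np.array([1.0, -0.5, 2.0])            # input
target = np.array([1.0, 0.0])             # the "correct" output

# Forward pass: linear, nonlinear ("activation"), linear.
z1 = W1 @ x
a1 = relu(z1)
y = W2 @ a1

# The "error", and the gradient of the squared error at the output.
grad_y = 2 * (y - target)

# Backward pass: the chain rule pushes the error signal back through each layer.
grad_W2 = np.outer(grad_y, a1)
grad_a1 = W2.T @ grad_y
grad_z1 = grad_a1 * (z1 > 0)              # derivative of ReLU is 0 or 1
grad_W1 = np.outer(grad_z1, x)

# Gradient descent: adjust the parameters so there's less error next time.
learning_rate = 0.1
W1 -= learning_rate * grad_W1
W2 -= learning_rate * grad_W2
```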

When training neural networks, there is a technique called "batch normalization". "Normal" here is a reference to the "normal distribution" in statistics. What this does is rescale and shift the outputs of a layer, across each batch of training examples, such that the inputs to the next layer have an average of 0 and a standard deviation of 1. This has the effect of making the inputs more "orthogonal", which you can think of as being "less parallel". When two input calculations become "parallel" and the inputs to the next layer become identical, you have in essence lost an input parameter to the next layer. I'm using the word "parallel" because parallel lines close together or overlapping are easy to visualize, but the actual word for this in linear algebra is "collinear". This loss of an input parameter is called "rank collapse", because everything has to have strange names. Ok, "rank" here refers to the number of linearly independent columns (or rows) of a matrix. So when your columns have duplicates, you've in essence decreased the number of independent dimensions. Hence "rank collapse".
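
In numpy terms, the batch normalization computation itself is just this (my sketch; real batch norm layers also learn a scale and a shift on top of this, which I'm leaving out):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4)) * 3.0 + 5.0   # a batch: 8 examples, 4 features

mean = X.mean(axis=0)                # per-feature average over the batch
std = X.std(axis=0)                  # per-feature standard deviation over the batch
X_bn = (X - mean) / (std + 1e-5)     # now each feature has average ~0, std ~1

print(X_bn.mean(axis=0))             # approximately [0, 0, 0, 0]
print(X_bn.std(axis=0))              # approximately [1, 1, 1, 1]
```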

So, orthogonal good. And "batch normalization" increases orthogonality, so people do it. Normalization is in fact done from the very start, at the initialization step: the parameters are initialized to random values and then normalized, which helps keep everything orthogonal right from the start.

Paradoxically, however, batch normalization causes a problem called "exploding gradients". While it keeps the variance under control on the forward pass, the gradients can actually get amplified on the backward pass, when the backpropagation algorithm computes the parameter updates. The deeper the neural network, the greater the sensitivity to small changes in the parameters. Also, there's that word "batch" in "batch normalization". Normalization is done in batches because the entire training set is usually too big to process at once. But if a batch is not statistically representative of the entire training set, then the batch normalization process itself introduces noise. This noise can get amplified in the backward pass and, because the effect is cumulative across layers, has a greater effect the deeper the neural network.
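
Here's a generic toy illustration of gradients compounding with depth (mine, not the paper's batch-normalization-specific analysis): each backpropagation step multiplies by another matrix, and if those matrices are even slightly "too big", the gradient norm grows exponentially with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 50
grad = rng.standard_normal(d)
grad /= np.linalg.norm(grad)                          # start with a unit-length gradient

for layer in range(1, depth + 1):
    W = rng.standard_normal((d, d)) * 1.2 / np.sqrt(d)   # slightly "too big" weights
    grad = W.T @ grad                                 # each backprop step multiplies by W^T
    if layer % 10 == 0:
        print(layer, np.linalg.norm(grad))            # grows roughly like 1.2^depth
```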

Various attempts to mitigate the exploding gradients problem have been invented, but those can paradoxically cause the reverse problem -- "vanishing gradients". What to do?

What the researchers here decided to do was to try to come up with a metric that tracks "orthogonality" directly. The metric they invented they call the "isometry gap". The way it is calculated is by taking a matrix X, transposing it, and multiplying that by X; then in the numerator you take the determinant of that product and take its dth root, where d is the number of dimensions (columns) of X; in the denominator you take the trace of that same product and divide it by d; and finally you take the negative logarithm of the whole ratio.
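
Spelled out in numpy, that description becomes something like this (my reading of the formula; iso_gap is my own name for the function):

```python
import numpy as np

# The isometry gap as described above:
# iso_gap(X) = -log( det(X^T X)^(1/d) / (trace(X^T X) / d) )
def iso_gap(X):
    d = X.shape[1]                               # number of columns ("dimensions")
    G = X.T @ X
    numerator = np.linalg.det(G) ** (1.0 / d)    # d-th root of the determinant
    denominator = np.trace(G) / d                # trace divided by d
    return -np.log(numerator / denominator)

rng = np.random.default_rng(0)
X_orth = np.linalg.qr(rng.standard_normal((6, 4)))[0]          # orthonormal columns
X_coll = rng.standard_normal((6, 1)) @ np.ones((1, 4)) \
         + 1e-6 * rng.standard_normal((6, 4))                  # nearly collinear columns

print(iso_gap(X_orth))   # ~0: as orthogonal as it gets
print(iso_gap(X_coll))   # large: close to rank collapse
```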

That's probably not intuitive at all, so here's the intuition they describe for it: first think of the columns of X as representing a parallelogram, but with more dimensions -- the higher-dimension equivalent of a parallelogram. (This is called a "parallelepiped". Going from 2 to 3 dimensions, picture a box, but allowing the sides to be any lengths and to meet at any angles.) This is formed simply by interpreting the columns of X as vectors. Now you can think of the determinant in the numerator as the squared volume of this higher-dimension equivalent of a parallelogram.

This brings us to the trace term in the denominator. The intuition is that this represents "the sum of squared norms of the columns of X", or to put it in even more regular language, you're getting a measure of how far the columns are from the origin, except it is a combined measure for all the columns. This is a bit hard to see because, first, the word "norms" here means "magnitudes" (distances from the origin), and has nothing to do with the word "normal" we've been using until now. And second, you wouldn't (or at least I wouldn't) guess this from how it is calculated. The X-transpose-times-X operation gives you a square matrix whose width and height are both the number of columns of X (this is the same number we've been calling "d" -- the number of dimensions) (the number of rows disappears) (in linear algebra, this is known as a "Gram matrix"). Next, the trace operation sums up the diagonal elements of that matrix. The two operations put together amount to summing up the squares of the elements of the original matrix X, like terms in the Pythagorean distance formula.
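
You can convince yourself of that with a couple of lines of numpy (my own sanity check, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))          # 5 rows, d = 3 columns
G = X.T @ X                              # the Gram matrix

print(G.shape)                           # (3, 3): the rows "disappear"
print(np.trace(G))                       # trace of the Gram matrix
print(np.sum(np.linalg.norm(X, axis=0) ** 2))   # sum of squared column norms
print(np.sum(X ** 2))                    # sum of squares of every entry -- same number
```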

Anyway, the result of taking the ratio of these two terms, plus all the other details I'm skipping over -- the root d and divide by d and negative logarithm and everything -- is a number that "provides a scale invariant notion of volume and isometry."

"On the one hand, if there is any collinearity between the columns, the volume will vanish and the isometry gap will be infinity." "On the other hand, isometry gap = 0 implies X-transpose-times-X is a scaled identity matrix." So you can see here the intuition that this number = 0 means perfect orthogonality, and infinity represents "collinearity between the columns" and thus "rank collapse".

The next concept we have to introduce is the concept of a Lyapunov function. These actually come from the realm of differential equations. The idea is that a differential equation can have an equilibrium point, but the equilibrium point may be stable or unstable. For certain classes of differential equations, if you can prove there exists a Lyapunov function, then you can prove the equilibrium of the differential equation is stable.
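
Here's about the simplest toy example of the idea I can think of (mine, not from the paper): for the system x <- 0.9x, which has a stable equilibrium at 0, the "energy" V(x) = ||x||^2 goes down at every step, so V is a Lyapunov function for it.

```python
import numpy as np

x = np.array([3.0, -2.0])
for step in range(5):
    print(step, np.dot(x, x))   # the "energy" V(x) = ||x||^2 keeps going down
    x = 0.9 * x                 # the system's update rule, with a stable equilibrium at 0
```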

While the mathematical formalism (linked to below) is beyond my understanding of differential equations, the intuition here is that we want to invent a neural network such that the isometry gap function is always decreasing. Lyapunov functions can be thought of as "energy levels" for the differential equation, which "wants" to go to lower energy levels. The gradient of the Lyapunov function has to relate to the vector field of the differential equation. (See the video explainer below for this intuition.) If a function can be shown to be a Lyapunov function and shown to be decreasing over iterations of a system, that means the system will stabilize. And here, it will stabilize in a state with orthogonality in the input vectors at every layer, preventing rank collapse.
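
Here's a toy simulation in the spirit of that claim (my own sketch, not the paper's exact construction; it reuses the isometry gap formula from above, rewritten with slogdet to avoid overflow): stack random linear layers with batch normalization and print the isometry gap as you go deeper. If the claim holds, the gap should stay small instead of blowing up.

```python
import numpy as np

def iso_gap(X):
    # same isometry gap formula as before, written with slogdet for numerical safety
    d = X.shape[1]
    G = X.T @ X
    sign, logdet = np.linalg.slogdet(G)
    return -(logdet / d - np.log(np.trace(G) / d))

def batch_norm(X):
    # normalize each column (feature) over the batch: mean 0, std 1
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

rng = np.random.default_rng(0)
n = d = 32                                        # n = d, as in the paper's setup
X = rng.standard_normal((n, d))

for layer in range(1, 51):
    W = rng.standard_normal((d, d)) / np.sqrt(d)  # plain Gaussian weights
    X = batch_norm(X @ W)
    if layer % 10 == 0:
        print(layer, iso_gap(X))                  # track orthogonality across depth
```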

At this point we've established a metric for rank collapse and can demonstrate that batch normalization prevents rank collapse, but what about the problem of exploding gradients? They say that to avoid exploding gradients, you first make the number of training examples in each batch the same as the number of dimensions, and then you make sure all the parameters are orthogonal at initialization.

They say they need two main modifications to avoid gradient explosion: first, n = d, which looks like it means the number of training examples in a batch is the same as the number of dimensions, and second, that the weights W for each layer l are random orthogonal matrices. Orthogonality here means W-transpose-times-W gives you the identity matrix, and "random" means the matrices are drawn according to the Haar measure.
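
In practice, drawing a random orthogonal matrix "according to the Haar measure" can be done with a QR decomposition of a Gaussian matrix plus a sign correction (this is my own sketch of a standard trick, not code from the paper; scipy.stats.ortho_group does the same thing directly):

```python
import numpy as np

def random_orthogonal(d, rng):
    # QR of a Gaussian matrix, with a sign fix so the result is Haar-distributed
    A = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))       # flip column signs so diag(R) is positive

rng = np.random.default_rng(0)
W = random_orthogonal(4, rng)

print(np.round(W.T @ W, 6))              # the identity matrix: W is orthogonal
```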

The Haar measure comes from group theory (the mathematics of symmetry groups -- in this case "locally compact topological groups"), and in looking it up I immediately landed on a video about Lie groups which was incomprehensible to me, as well as the formal definition from Wolfram (linked to below). From what I can tell, the reason group theory is being brought into this is because group theory is immensely concerned with maintaining invariant properties no matter what operations are performed. In this case, what we want to guarantee remains invariant is orthogonality. As best I can tell, the idea here is to construct orthogonal matrices in such a way that they remain (provably) orthogonal under matrix multiplication. The Haar measure over the orthogonal group enables this. They say something called Weingarten calculus is required to do the calculations.

In searching for information on Weingarten calculus, I came across a 14-page article and what is essentially a 49-page book, both of which you can download in PDF format. And both of which I found incomprehensible.

As for that bit about making the number of training examples in each batch the same as the number of dimensions, I speculate that that may have been a necessary assumption to make their mathematical proofs doable, but that it might not matter in the real world.

So, to sum up: it looks like the problems of exploding gradients and vanishing gradients have been solved. From this we should expect that in the not-so-distant future, neural networks will become incredibly deep. Wait, aren't they incredibly deep already? Well, they're going to become incredibly, incredibly deep. With the problems of rank collapse and exploding and vanishing gradients solved, there will be essentially no limit to how deep a neural network can become.

Having said that, this paper seems to prove such a thing is possible but doesn't actually do it. Yes, the researchers created bare-bones neural networks to demonstrate that it works, but the heavy lifting will have to be done by others in the future. The problems that need to be solved include how to guarantee this system will work for a variety of activation functions. The work here only addressed linear transformations -- well, the mathematical proofs only applied to linear transformations; they did experiments with sin and tanh activation layers (appendix E, starting on page 30). Their theory should be extendable to nonlinear activation layers. Maybe these researchers will do that, or maybe that work will fall to others. Furthermore, this work relies on very advanced mathematics, such as Weingarten calculus, and that will have to be incorporated into industry-standard tools like PyTorch to become widely available to developers who want to make ultra-deep neural networks.

The key to it all seems to be to initialize all the parameters of the neural network such that they are both "orthogonal" for all inputs at every layer, and guaranteed to remain orthogonal throughout the lifetime of the neural network, no matter what transformations are done to the parameters during the neural network's training. Exactly how that is accomplished, I can't fully explain to you because it involves mathematics more advanced than I'm able to understand.

What are the implications of all this? Currently, neural networks are limited to a few hundred layers, and they "cheat" and have inputs that skip layers in order to achieve this. (These are called residual networks.) If it's really true that the "exploding/vanishing gradients" problem has been solved, then once that has been incorporated into the toolchain, e.g. PyTorch, then we could be seeing neural networks with thousands or millions of layers. And vastly more intelligent than today's neural networks.
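
(For anyone who hasn't seen the "skip" trick: a residual block just adds the input back onto the layer's output, so the gradient has a path around the layer. A one-line sketch, mine, not the paper's:)

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W):
    # the "+ x" is the skip connection: the input goes around the layer and is added back
    return x + relu(W @ x)
```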

If people can get the number of layers into the billions, then neural networks will be approaching the size of the brain, which has ~100 billion neurons and ~1,000 trillion synapses. But considering current neural networks are in the mere hundreds of layers and are already showing considerable intelligence, I'm guessing AI systems will exceed human intelligence well before the number of layers extends into the billions.

Just to be clear, I'm the one speculating on thousands or millions (or billions) of layers, not the authors. (They never make any such claims.) I'm speculating on, if the problem of exploding/vanishing gradients is really and truly solved, what does that imply? I was actually thinking, billions of layers means there are layers billions of connections away from sensory input. I don't think, even though the brain has ~100 billion neurons, that any of them are billions of connections away from sensory input. At least, I would be very surprised if that was the case. So I think, in practice, nobody will ever make a neural network with billions of layers.

On the other hand, no one should ever say "never". Maybe someday neural networks will have billions of layers and be "superintelligences" that are as much smarter than us humans as we are smarter than ants.

Towards training without depth limits: Batch normalization without gradient explosion

#solidstatelife #ai #sgd #gradients #explodinggradients #vanishinggradients #linearalgebra #batchnormalization #backpropagation #weingarten