Neural networks with only 1/100th the number of parameters? If Kolmogorov-Arnold Networks live up to their promise.

Rethinking the central assumptions of how layers work in neural networks. Neural networks are inspired by brain neurons, where the output of one neuron shoots down an axon, crosses a synapse, and enters the dendrites of the next neuron. This inspired the simplification of thinking of neurons as "nodes" on a mathematical graph and the connections between them as edges. In neural networks, the "edges" perform linear transformations, and the "nodes" perform nonlinear transformations. (When every node in one layer is connected by an edge to every node in the next layer, you have what's called a "fully connected" neural network -- also known as a "multi-layer perceptron" (MLP), a term which makes no sense -- well, it does if you know the history of neural networks going back to the 1960s, but otherwise it doesn't. It's used everywhere in the paper, though, so you ought to know that "MLP" and "fully connected neural network" mean the same thing in case you decide to dive into the paper.)

Actually, when I learned neural networks, I learned that they alternate between "linear transformation" layers and "activation" layers. The reason for this is that if you did only "linear transformation" layers, you could combine them all into a single "linear transformation" layer. Mathematically, each linear layer is a matrix operation, and if you do one matrix operation followed by another, you can combine the two into a third matrix operation that does the equivalent of the first two. Put another way, a neural network with only linear transformations can be smashed flat into a single-layer linear transformation. To make deeper neural networks, and neural networks capable of learning nonlinear relationships between inputs and outputs, we have to stick non-linear layers in between. People initially used fancier activation functions, like tanh, but now most people use ReLU, which stands for "rectified linear unit", which is a fancy way of saying we chop off the negative numbers (the function returns 0 for x < 0), and that is, believe it or not, enough non-linearity for the whole system to work.
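Here's a minimal sketch of that "smashing flat" claim -- not from the paper, just plain numpy -- showing that two linear layers with nothing in between collapse into one, while a ReLU in the middle breaks the collapse:

    # Two linear layers with no activation in between collapse into a single linear layer.
    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 inputs -> 4 outputs
    W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: 4 inputs -> 2 outputs
    x = rng.normal(size=3)

    two_layers = W2 @ (W1 @ x + b1) + b2          # apply layer 1, then layer 2
    W, b = W2 @ W1, W2 @ b1 + b2                  # the "smashed flat" single layer
    print(np.allclose(two_layers, W @ x + b))     # True: the extra depth bought us nothing

    relu = lambda z: np.maximum(z, 0)             # chop off the negative numbers
    with_relu = W2 @ relu(W1 @ x + b1) + b2       # no single matrix can reproduce this
    print(with_relu)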

Here, though, the researchers start with the easier-to-visualize idea that "edges" represent linear transformations while "nodes" represent the non-linear "activation" functions. Even the term "activation" is inspired by the brain, where neurons only "spike" if they get sufficient input to "activate" them. Neural networks in computers, though, don't "spike" -- everything is done with continuous, floating-point numbers.

This conception of edges being linear and nodes being non-linear is important to keep in mind, because with Kolmogorov-Arnold Networks, aka KANs, we're going to flip those around.

The inspiration for this comes from the fact that there is a mathematical proof underlying neural networks, known as the universal approximation theorem. The universal approximation theorem proves that for any given mathematical function, it is possible for a neural network, mathematically modeled as alternating linear transformations and non-linear functions as described above, to approximate that mathematical function to within some error margin. It doesn't say how big a neural network you'll have to make, just that it's guaranteed to be possible if you make your neural network big enough.

Well, two mathematicians, Vladimir Arnold and Andrey Kolmogorov, proved a theorem called the Kolmogorov-Arnold Representation theorem, and the idea here is that this alternative theorem can lead to similar results as the universal approximation theorem, but in a different way.

The Kolmogorov-Arnold Representation theorem proves that for any continuous mathematical function that takes multiple parameters, it is possible to find two sets of functions that, when combined in the right way, will give you the same output, but where each of those functions takes only a single input parameter. In other words, a function that takes, say, a 3-dimensional input can be "decomposed" into two sets of functions that each take only a single (1-dimensional) input. The way these are combined is, and I'll try to describe this simply and without going into gory detail (you can look up the theorem if you want that): for each function in the second ("outer") set, you sum up the outputs of the functions in the first ("inner") set that pertain to it, one per input variable, then run that sum through the outer function, and then sum up all the outer functions' outputs. Or to put it even more simply, you're doing a combination of "summation" and "composition", where "composition" means taking the output of one function and making it the input of another function (think "f(g(x))").
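If you'd like to see the formula behind that word-picture, here's the standard statement of the theorem (this is textbook material, not something specific to this paper): any continuous function f of n variables on the unit cube can be written as

    f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \Big( \sum_{p=1}^{n} \phi_{q,p}(x_p) \Big)

where the phi_{q,p} are the first ("inner") set of single-variable functions and the Phi_q are the second ("outer") set. The inner sum is the composition feeding each outer function, and the outer sum is the final summation.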

The Kolmogorov-Arnold Representation theorem doesn't tell you what these functions are, only that they are proven to exist. These researchers, though, since they are looking for a way to approximate functions to arbitrary accuracy, chose splines as the functions to use. If you're not familiar with splines, I have some links below. Splines are used in video games, in computer-generated imagery in movies, in computer-aided design (CAD) programs for manufactured objects, and in lots of other places to generate smooth curves.
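If splines are new to you, here's a toy example (assuming you have scipy installed; this is just an illustration, not code from the paper). A B-spline is a smooth 1-D curve controlled by a handful of coefficients, and those coefficients are exactly the kind of thing a network can learn:

    import numpy as np
    from scipy.interpolate import BSpline

    k = 3                                          # cubic spline
    t = np.array([0, 0, 0, 0, 0.5, 1, 1, 1, 1.0])  # knot vector (the "grid")
    c = np.array([0.0, 1.0, -1.0, 2.0, 0.0])       # coefficients that shape the curve
    spline = BSpline(t, c, k)

    xs = np.linspace(0, 1, 5)
    print(spline(xs))                              # a smooth curve, evaluated at a few points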

Now go back to your mental image of a neural network as nodes and edges. The splines will be the functions along the edges. The nodes will simply sum up the outputs of the edges. So the "edges" do the splines and represent the "composition" part of the Kolmogorov-Arnold Representation, and the "nodes" represent the summations. And now you can see we've reversed our linearity and non-linearity: the "edges" are now the "non-linear" part of the system, and the "nodes" are now the "linear" part. (While the original Kolmogorov-Arnold Representation prescribes exactly two sets of functions, in specific quantities, this generalization allows for any number of sets of functions, which correspond to "layers" in the neural network, and any number of functions in each set, which corresponds to the "size" (number of neurons) of each layer.)
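To make the flip concrete, here's a rough sketch of one such layer in numpy/scipy -- my own simplification, not the authors' implementation: every edge from input i to output j gets its own 1-D spline, and every output node just adds up its incoming edges.

    import numpy as np
    from scipy.interpolate import BSpline

    def kan_layer(x, coeffs, t, k=3):
        """x: inputs in [0, 1], shape (d_in,).
        coeffs: spline coefficients, one vector per edge, shape (d_out, d_in, n_coef).
        t: knot vector (the "grid") shared by all edges."""
        d_out, d_in, _ = coeffs.shape
        out = np.zeros(d_out)
        for j in range(d_out):
            for i in range(d_in):
                edge_fn = BSpline(t, coeffs[j, i], k)   # non-linear function on the edge
                out[j] += edge_fn(x[i])                 # the node just sums (the linear part)
        return out

    rng = np.random.default_rng(0)
    t = np.array([0, 0, 0, 0, 0.5, 1, 1, 1, 1.0])
    coeffs = rng.normal(size=(2, 3, 5))                 # 3 inputs -> 2 outputs
    print(kan_layer(rng.uniform(size=3), coeffs, t))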

So that's the essence of the idea, glossing over a lot of mathematical details. In the paper they prove a theorem that this architecture can approximate any mathematical function to within a given arbitrary error margin, provided the mathematical function is continuously differentiable. So now we have our analogy to the universal approximation theorem for Kolmogorov-Arnold Networks. The "continuously differentiable" assumption is helpful, too, because it means we can learn all the parameters of the splines with backpropagation and gradient descent!

Kolmogorov-Arnold Networks have a couple more tricks up their sleeves, too. One is that the network can be made larger without throwing all the learned parameters away and starting over -- sort of. With regular neural networks, if you wanted to, say, double the size of the network, you'd have to make a new network twice as big and train it from scratch on your training dataset. With Kolmogorov-Arnold Networks, you can increase the precision of your splines by adding more parameters to them, without throwing away any of the parameters you've already learned. (You go from a "coarse-grained spline" to a "fine-grained spline".) But I say "sort of" because you can't make any arbitrary change, like changing the number of layers. Still, it's really cool that they have some ability to expand their learning capacity by adding parameters after the fact, something normal neural networks can't do at all. (They call expanding the capacity of the neural network in this manner "grid extension". It's called "grid extension" because historically the set of points defining a spline is called the "grid".)
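Here's a rough illustration of the coarse-to-fine idea (again my own sketch, not the paper's code): take the coarse spline you've already learned, pick a finer knot vector, and least-squares fit new coefficients so the finer spline starts out matching the coarse one, ready for further training.

    import numpy as np
    from scipy.interpolate import BSpline, make_lsq_spline

    k = 3
    t_coarse = np.array([0, 0, 0, 0, 0.5, 1, 1, 1, 1.0])       # coarse grid
    c_coarse = np.array([0.0, 1.0, -1.0, 2.0, 0.0])            # already-learned coefficients
    coarse = BSpline(t_coarse, c_coarse, k)

    # A finer grid: more interior knots means more coefficients to learn later.
    t_fine = np.concatenate(([0.0] * 4, np.linspace(0.2, 0.8, 4), [1.0] * 4))
    xs = np.linspace(0.0, 1.0, 200)
    fine = make_lsq_spline(xs, coarse(xs), t_fine, k)           # re-fit on the finer grid

    print(len(c_coarse), "->", len(fine.c))                     # 5 -> 8 parameters
    print(np.max(np.abs(fine(xs) - coarse(xs))))                # tiny: same curve, more knobs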

Another trick up the sleeve of Kolmogorov-Arnold Networks is the lack of "catastrophic forgetting". It turns out that in regular neural networks, using the same "activation" function for all neurons in a layer causes them all to be causally linked, which means that if neurons in one part of the network learn a task, they will forget it when neurons in another part learn a second task. Regular neural networks have to be trained on all the tasks they need to learn at once. (Companies like OpenAI have learned to "ensemble" separately trained neural networks to deal with this problem.) In Kolmogorov-Arnold Networks, if neurons in one part of the network learn one task, neurons in another part can learn a different task, and the two won't interfere with each other. So Kolmogorov-Arnold Networks don't have the "catastrophic forgetting" problem.

They came up with an "auto-discovery" system and found that the auto-discovered KANs were smaller than the human-constructed ones. Humans tend to construct networks that mirror the structure of the math formulas, but the "auto-discovery" system could find shortcuts.

They make the claim that KANs are more "interpretable" than regular neural networks. Well, maybe for the smallest KANs. For example, one of the "toy" functions they tested on was x/y (i.e. a 2-dimensional input where the output is just one number divided by the other). They peek "under the hood" and discover that what the splines are approximating is exp(log(x) - log(y)). This kind of "interpretability", I assume, would diminish rapidly once the network starts to get big. This network was a [2, 1, 1] network, meaning 2 neurons for the input, a 1-neuron hidden layer, and a 1-neuron output (only 2 layers because the input isn't counted as a "layer"). Imagine those numbers going into the hundreds; how are you going to interpret that?
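As a quick sanity check that the recovered expression really is division (and notice how it fits the sum-then-compose structure: edges apply log to each input, the node adds them, with the minus sign absorbed into one of the splines, and the next edge applies exp):

    import numpy as np

    x, y = 3.0, 7.0
    print(np.exp(np.log(x) - np.log(y)))   # 0.42857...
    print(x / y)                           # 0.42857... -- the same thing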

They tested the system on its ability to learn a set of special mathematical functions. These functions were: Jacobian elliptic functions, elliptic integrals, Bessel functions, Associated Legendre functions, and spherical harmonics -- and I'm not going to say any more about them because I have no idea what any of those are. They also got 27 equations from Richard Feynman's textbooks. What they do with these is show they can get better results than regular neural networks with the same number of parameters, plus their "auto-discovery" system can find substantially more compact representations of the formulas. They would like to interpret these KANs and learn more about how these math formulas can be represented more compactly.

For the big test, they get into a branch of mathematics called knot theory. The idea is to take a knot as input and find its "signature". I don't know enough about knot theory to tell you what a knot's "signature" is. But I can tell you they compared their system with DeepMind's: DeepMind got 78% accuracy with a 4-layer neural network with 300 neurons on each layer, while they got 81.6% accuracy with a [17, 1, 14] (2-layer) KAN. DeepMind's neural network had roughly 300,000 parameters, while theirs had about 200.

It remains to be seen whether this algorithm can be deployed at scale and reduce the number of parameters of neural networks in the wild by a factor of 100. But it is a fascinating development nonetheless. See below for a video arguing that KANs are overhyped.

#solidstatelife #ai #sgd #backpropagation #mlps

https://arxiv.org/abs/2404.19756
