A single neural network that receives input from 6 "modalities": images, text, audio, depth, thermal, and inertial measurement unit (IMU) readings.

Based on that, you might think it's taking all these different input modalities and constructing a single, unified model of reality, much like humans do. But... that's not really what's going on here.

What's actually going on is that it trains on images paired with each of the other 5 modalities. That is, images + text, images + audio, images + depth, images + thermal, and images + IMU readings.

And you might be wondering: what training does it do with these pairs?

They use something called an InfoNCE loss function, which takes embeddings computed separately for each half of an input pair and essentially runs a softmax over their similarities, pulling matching pairs together and pushing mismatched ones apart.
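Here is a minimal sketch of what an InfoNCE-style loss looks like in code, assuming you already have a batch of paired embeddings. The function name, temperature value, and PyTorch framing are mine, not taken from the ImageBind code.

```python
# Minimal sketch of an InfoNCE-style contrastive loss (PyTorch), assuming row i
# of each tensor comes from the same training pair. Illustrative, not ImageBind's code.
import torch
import torch.nn.functional as F

def info_nce(image_emb: torch.Tensor, other_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """image_emb, other_emb: (batch_size, dim) embeddings from paired inputs."""
    # Normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)

    # Similarity of every image in the batch to every "other modality" sample.
    logits = image_emb @ other_emb.t() / temperature  # (batch, batch)

    # The matching pair sits on the diagonal; everything else acts as a negative.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Softmax over each row, cross-entropy against the diagonal -- that's the
    # "softmax over similarities" in practice.
    return F.cross_entropy(logits, targets)
```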

There's that funny word "embeddings" again. A more intuitive word might be "encoding" or just "vector": you run the input through an encoder and end up with a vector that represents something meaningful about the input you started with. In this case, they use a "transformer" architecture for all the modalities. "Transformer" is another unintuitive term from the machine learning world; it means the neural network uses an "attention" mechanism. Actually probably dozens or hundreds of attention mechanisms, not just the single one our conscious minds seem to have.

In the case of images, it uses the Vision Transformer (ViT). In the case of audio, it chops the audio into 2-second pieces and makes spectrograms, which get pumped into a Vision Transformer just like images. Thermal images are images, so they get pumped straight into a Vision Transformer as well. In the case of depth, it gets converted into "disparity maps" (essentially inverse depth) "for scale invariance", which then get pumped into a transformer. In the case of IMU readings, they are broken into 5-second pieces and run through a 1D convolutional network before, you guessed it, getting pumped into a transformer.
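A rough sketch of that per-modality preprocessing is below. The sample rates, mel-bin counts, kernel sizes, and channel counts are illustrative guesses on my part, not the paper's exact settings.

```python
# Sketch of the per-modality preprocessing described above (assumes torchaudio).
import torch
import torch.nn as nn
import torchaudio

# Audio: a 2-second waveform becomes a spectrogram "image" for a ViT-style encoder.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
waveform = torch.randn(1, 2 * 16000)          # 2 seconds of mono audio
spectrogram = mel(waveform).unsqueeze(0)      # (1, 1, n_mels, time) -- image-like

# Depth: disparity is just inverse depth, which removes absolute scale.
depth = torch.rand(1, 1, 224, 224) + 0.1
disparity = 1.0 / depth

# IMU: a 5-second window of 6-axis readings goes through a 1D convolution
# before reaching the transformer (the sampling rate here is made up).
imu = torch.randn(1, 6, 1000)                 # (batch, channels, timesteps)
imu_tokens = nn.Conv1d(in_channels=6, out_channels=512, kernel_size=8, stride=8)(imu)
```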

So, it calculates a separate embedding for each input modality. Yet, by having a loss function that combines the two embeddings, it creates in essence a "joint embedding space" -- the term you see them using in the blog post. It should also be noted that the loss function requires "negative" examples. In other words, along with the embeddings for each input in your pair, you also give it embeddings from the 2nd modality that are not part of your pair and tell it, "these are negative examples." In this way the system learns in a "contrastive" manner reminiscent of CLIP (OpenAI's contrastive image-text model, which DALL-E 2 builds on).
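To make the "binding through images" idea concrete, here is a hedged sketch of a training step that reuses the info_nce function sketched earlier. The encoder objects, the dictionary layout, and the symmetric weighting are my assumptions, not ImageBind's actual code; the point is that the image tower is shared across every pairing, and the rest of the batch supplies the negatives.

```python
# Sketch of one training step across the image-X pairs, using the info_nce
# function defined above. Encoders and data loaders are placeholders.
def training_step(image_encoder, other_encoders, batches):
    """batches: dict mapping modality name -> (images, other_inputs) tensors."""
    total_loss = 0.0
    for modality, (images, others) in batches.items():
        img_emb = image_encoder(images)                 # shared image tower
        other_emb = other_encoders[modality](others)    # per-modality tower
        # Symmetric InfoNCE: images against the other modality and vice versa,
        # with the other items in the batch serving as the negative examples.
        total_loss = total_loss + 0.5 * (info_nce(img_emb, other_emb) +
                                         info_nce(other_emb, img_emb))
    return total_loss
```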

(And in case you're wondering, "NCE" stands for Noise-Contrastive Estimation; "InfoNCE" is the name given to this loss in the Contrastive Predictive Coding paper, because it relates the contrastive objective to mutual information.)

So, what is all this good for? Well, one thing you can do is classification using text labels. It turns out that even though it was trained on image + something else pairs only, it can do classification without images. That is, you can give it audio and it can classify it using text, even though it was never trained on any audio + text pairs, only image + audio pairs and image + text pairs.
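As an illustration, here is a sketch of how that text-label classification could look, assuming hypothetical audio_encoder and text_encoder objects that emit embeddings in the shared space. The names and call signatures are placeholders, not ImageBind's API.

```python
# Sketch of zero-shot audio classification using text labels in the joint space.
import torch
import torch.nn.functional as F

def classify_audio(audio_clip, labels, audio_encoder, text_encoder):
    audio_emb = F.normalize(audio_encoder(audio_clip), dim=-1)   # (1, dim)
    text_emb = F.normalize(text_encoder(labels), dim=-1)         # (num_labels, dim)
    probs = (audio_emb @ text_emb.t()).softmax(dim=-1)           # cosine sims -> softmax
    return labels[probs.argmax(dim=-1).item()]

# e.g. classify_audio(clip, ["dog barking", "rain", "engine"], audio_enc, text_enc)
```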

The other thing you can do is something they call "emergent compositionality". This is best illustrated with an example: let's say you input an image of fruits on a table and an audio clip of birds chirping. The system can retrieve an image that contains fruit and birds, say on a tree.
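Under the hood this amounts to simple arithmetic in the joint embedding space: add the two embeddings and look for the nearest image. A hedged sketch, with placeholder encoders and a precomputed, normalized gallery of image embeddings:

```python
# Sketch of "emergent compositionality" as embedding arithmetic plus retrieval.
import torch
import torch.nn.functional as F

def compose_and_retrieve(image, audio, image_encoder, audio_encoder, gallery_embs):
    # Combine the two queries in the joint space, then re-normalize.
    query = F.normalize(image_encoder(image), dim=-1) + F.normalize(audio_encoder(audio), dim=-1)
    query = F.normalize(query, dim=-1)
    sims = query @ gallery_embs.t()        # gallery_embs: (num_images, dim), normalized
    return sims.argmax(dim=-1)             # index of the best-matching gallery image
```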

There is also discussion in the paper of the possibility of using this system as a way of evaluating pretrained vision models like DALL-E 2. And maybe the methodology explored here can be used to enhance pretrained models that currently handle text and images to also handle audio.

ImageBind: Holistic AI learning across six modalities

#solidstatelife #ai #multimodal
