Generating music from text. You can give MusicLM a text description like, "The main soundtrack of an arcade game. It is fast-paced and upbeat, with a catchy electric guitar riff. The music is repetitive and easy to remember, but with unexpected sounds, like cymbal crashes or drum rolls."
Go listen to the examples now.
How does it work?
"Creating text descriptions of general audio is considerably harder than describing images. First, it is not straightforward to unambiguously capture with just a few words the salient characteristics of either acoustic scenes (e.g., the sounds heard in a train station or in a forest) or music (e.g., the melody, the rhythm, the timbre of vocals and the many instruments used in accompaniment). Second, audio is structured along a temporal dimension which makes sequence-wide captions a much weaker level of annotation than an image caption."
"To address the main challenge of paired data scarcity, we rely on MuLan, a joint music-text model that is trained to project music and its corresponding text description to representations close to each other in an embedding space."
Here's that word "embedding" that makes sense to AI researchers but not to people outside the field. Remember, "embeddings" started out as a way of representing words that captures their meaning: each word becomes a big vector, and words with similar meanings end up in the same region of a high-dimensional space. Here, the embeddings don't represent words; they represent sounds that have meaning, such as musical notes.
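To make "close to each other in an embedding space" a bit more concrete, here's a tiny sketch of how "close" is usually measured, with cosine similarity. The numbers are invented for illustration; real MuLan embeddings are much higher-dimensional and learned from data.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings (made up for illustration).
text_embedding  = np.array([0.9, 0.1, 0.0, 0.2])  # "upbeat arcade guitar riff"
matching_audio  = np.array([0.8, 0.2, 0.1, 0.3])  # a clip that fits the description
unrelated_audio = np.array([0.0, 0.9, 0.8, 0.1])  # a slow solo piano clip

print(cosine_similarity(text_embedding, matching_audio))   # high, about 0.98
print(cosine_similarity(text_embedding, unrelated_audio))  # low, about 0.10
```

During MuLan's training, text-audio pairs that belong together get pushed toward high similarity, and mismatched pairs get pushed apart.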
"This shared embedding space eliminates the need for captions at training time altogether, and allows training on massive audio-only corpora. That is, we use the MuLan embeddings computed from the audio as conditioning during training, while we use MuLan embeddings computed from the text input during inference."
"By making no assumptions about the content of the audio signal, AudioLM learns to generate realistic audio from audio-only corpora, be it speech or piano music, without any annotation."
"Casting audio synthesis as a language modeling task in a discrete representation space, and leveraging a hierarchy of coarse-to-fine audio discrete units (or tokens), AudioLM achieves both high fidelity and long-term coherence over dozens of seconds."
MusicLM is an extension of AudioLM built specifically for music, trained on a large unlabeled dataset of music. Audio is turned into 2 separate types of tokens: semantic tokens, which capture high-level concepts for modeling long-term structure, and acoustic tokens from an audio compression system called SoundStream, which capture the low-level acoustics. The semantic tokens come from a model called w2v-BERT, which has 600 million parameters. They did a weird thing where they rip the model open, extract embeddings from the 7th layer, and cluster them, to produce 25 semantic tokens per second of audio. A separate stage of the network then learns a mapping from MuLan tokens to these semantic tokens.
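That clustering step is a standard trick for turning continuous embeddings into discrete tokens: fit k-means once over a large batch of embeddings, then label each new embedding with the index of its nearest cluster centre. A minimal sketch with scikit-learn, using random vectors (and shrunken sizes) in place of real w2v-BERT layer-7 embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-ins for w2v-BERT layer-7 embeddings; the sizes are
# shrunk for speed (the real setup uses a much bigger codebook).
rng = np.random.default_rng(0)
training_embeddings = rng.normal(size=(5_000, 64))  # many frames of 64-dim toy vectors

# Fit k-means once, offline, over a large sample of embeddings.
kmeans = KMeans(n_clusters=64, n_init=1, random_state=0).fit(training_embeddings)

# Tokenising new audio: each frame's embedding is replaced by the index
# of its nearest cluster centre, so 25 frames become 25 semantic tokens,
# matching the 25-tokens-per-second rate mentioned above.
one_second_of_frames = rng.normal(size=(25, 64))
semantic_tokens = kmeans.predict(one_second_of_frames)
print(semantic_tokens)  # 25 integers, each one a discrete semantic token ID
```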
What comes out of this process is a series of audio tokens that get fed into the SoundStream decoder (rather than the encoder, which is what's used during training). The resulting audio has a 24 kHz sample rate, so not top quality, but it sounds OK.
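Putting the stages together, the generation pipeline looks roughly like the sketch below. Every function here is a hypothetical placeholder with made-up sizes; it shows the order of the stages described above, not the real API.

```python
import numpy as np

def text_to_mulan_tokens(prompt: str) -> np.ndarray:
    """Stage 1 (placeholder): embed the text prompt with MuLan and quantise it."""
    return np.random.randint(0, 1024, size=12)  # sizes are illustrative

def mulan_to_semantic_tokens(mulan_tokens: np.ndarray) -> np.ndarray:
    """Stage 2 (placeholder): map MuLan tokens to semantic tokens that
    carry the long-term structure (25 per second of audio)."""
    seconds = 30
    return np.random.randint(0, 1024, size=25 * seconds)

def semantic_to_acoustic_tokens(mulan_tokens, semantic_tokens) -> np.ndarray:
    """Stage 3 (placeholder): fill in low-level acoustic detail as
    SoundStream codec tokens."""
    return np.random.randint(0, 1024, size=(len(semantic_tokens), 8))

def soundstream_decode(acoustic_tokens: np.ndarray) -> np.ndarray:
    """Stage 4 (placeholder): the SoundStream decoder turns acoustic tokens
    back into a waveform at a 24 kHz sample rate. (The encoder is only
    needed during training, to tokenise real audio.)"""
    return np.zeros(24_000 * 30)  # 30 seconds of silence as a stand-in

prompt = "Fast-paced, upbeat arcade soundtrack with a catchy electric guitar riff"
mulan_tokens = text_to_mulan_tokens(prompt)
semantic_tokens = mulan_to_semantic_tokens(mulan_tokens)
acoustic_tokens = semantic_to_acoustic_tokens(mulan_tokens, semantic_tokens)
waveform = soundstream_decode(acoustic_tokens)
```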
MusicLM: Generating Music From Text
#solidstatelife #ai #generativemodels #music #audiogeneration