#multimodal

waynerad@diasp.org

The Scalable, Instructable, Multiworld Agent (SIMA) from DeepMind plays video games for you. You tell it what you want to do in regular language, and it goes into a 3D environment, including some provided by commercial video games, and carries out keyboard-and-mouse actions.

Before getting into how they did this, it might be worth citing some of the reasons they thought this was challenging: Video games can be open-ended, visually complex, and have hundreds of different objects. Video games are asynchronous -- there's no turn-taking like in chess or Go, or in many research environments, which stop and wait while the agent computes its next action. Each instance of a commercial video game needs its own GPU -- no running hundreds or thousands of actors per game per experiment as has historically been done in reinforcement learning. AI agents see the same screen pixels that a human player gets -- no access to internal game state, rewards, or any other "privileged information". AI agents use the same keyboard-and-mouse controls that humans do -- no handcrafted action spaces or high-level APIs.

In addition to all those challenges, they demanded their agents follow instructions in regular language, rather than simply pursuing a high score in the game, and the agents were not allowed to use simplified grammars or command sets.

"Since the agent-environment interface is human compatible, it allows agents the potential to achieve anything that a human could, and allows direct imitation learning from human behavior."

"A key motivation of SIMA is the idea that learning language and learning about environments are mutually reinforcing. A variety of studies have found that even when language is not necessary for solving a task, learning language can help agents to learn generalizable representations and abstractions, or to learn more efficiently." "Conversely, richly grounded learning can also support language learning."

I figure you're all eager to know what the games were. They were: Goat Simulator 3 (you play the goat), Hydroneer (you run a mining operation and dig for gold), No Man's Sky (you explore a galaxy of procedurally-generated planets), Satisfactory (you attempt to build a space elevator on an alien planet), Teardown (you complete heists by solving puzzles), Valheim (you try to survive in a world of Norse mythology), and Wobbly Life (you complete jobs to earn money to buy your own house).

However, before the games, they trained SIMA in research environments. Those, which you have probably never heard of, are: Construction Lab (agents are challenged to build things from construction blocks), Playhouse (a procedurally-generated house), ProcTHOR (procedurally-generated rooms, such as offices and libraries), and WorldLab (an environment with better simulated physics).

The SIMA agent itself maps visual observations and language instructions to keyboard-and-mouse actions. But it does that in several stages. For input, it takes a language instruction from you, and the pixels of the screen.

The video and language instruction both go through encoding layers before being input to a single, large, multi-modal transformer. The transformer doesn't output keyboard and mouse actions directly. Instead, it outputs a "state representation" that gets fed into a reinforcement learning network, which translates the "state" into what in reinforcement learning parlance is called a "policy". A more intuitive regular word might be "strategy". Basically this is a function that, when given input from the environment including the agent's state within the environment, will output an action. Here, the actions are the same actions a human would take with mouse and keyboard.
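To make that pipeline concrete, here's a minimal PyTorch sketch of the data flow -- encode the pixels and the instruction, fuse them in a transformer, and read a policy off the end. To be clear, this is my own illustration, not DeepMind's architecture; every module choice and size here is made up.

```python
# Minimal sketch of a SIMA-like data flow (illustrative only, not DeepMind's code).
import torch
import torch.nn as nn

class SimaLikeAgent(nn.Module):
    def __init__(self, d_model=512, n_actions=32):
        super().__init__()
        # Stand-in for a pretrained vision encoder over screen pixels.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model))
        # Stand-in for a text encoder over the language instruction.
        self.text_encoder = nn.EmbeddingBag(32000, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)  # the multimodal transformer
        self.policy_head = nn.Linear(d_model, n_actions)  # maps the "state" to action logits

    def forward(self, frames, instruction_tokens):
        img = self.image_encoder(frames).unsqueeze(1)              # (B, 1, d)
        txt = self.text_encoder(instruction_tokens).unsqueeze(1)   # (B, 1, d)
        state = self.transformer(torch.cat([img, txt], dim=1))     # the "state representation"
        return self.policy_head(state[:, -1])                      # policy logits over keyboard/mouse actions
```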

The multi-modal transformer was trained from scratch. A technique called Classifier-Free Guidance (CFG) was used, inspired by the mechanism diffusion models use to "condition" their output on the text you, the user, typed in.
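The guidance idea itself fits in a few lines: run the policy once with the language instruction and once with it blanked out, then push the output toward the instruction-conditioned version. A sketch (the guidance scale and the way the "unconditioned" pass is produced are my own assumptions):

```python
# Classifier-free guidance applied to action logits (illustrative only).
import torch

def cfg_logits(model, frames, instruction_tokens, empty_tokens, scale=2.0):
    cond = model(frames, instruction_tokens)   # logits with the language instruction
    uncond = model(frames, empty_tokens)       # logits with the instruction blanked out
    # Push the policy toward behavior that is explained by the instruction.
    return uncond + scale * (cond - uncond)
```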

Even in the research environments, it is hard to automate judging of whether an agent completed its tasks. Instructions may be such things as, "make a pile of rocks to mark this spot" or "see if you can jump over this chasm". The environment may not provide any signal indicating these have been fulfilled. There are some they can handle, though, like "move forward", "lift the green cube", and "use the knife to chop the carrots".

For commercial video games, all the agent gets is pixels on the screen, just like a human player; it has no access to the game's internal state. The games generally don't allow game state to be saved and restored, something researchers like for reproducibility.

For video games, they resorted to detecting on-screen text using OCR. They did this in particular for two games, No Man's Sky and Valheim, "which both feature a significant amount of on-screen text."
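Here's the flavor of that kind of check -- note the paper doesn't say which OCR system they used, so pytesseract below is just a stand-in, and the success phrases are made up:

```python
# Illustrative only: read on-screen text from a screenshot and look for a success message.
from PIL import Image
import pytesseract

def task_completed(screenshot_path, success_phrases=("recipe learned", "item crafted")):
    # success_phrases are hypothetical examples of on-screen text that signals completion
    text = pytesseract.image_to_string(Image.open(screenshot_path)).lower()
    return any(phrase in text for phrase in success_phrases)
```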

Why not just have people look, i.e. have humans judge whether the instructions were followed? Turns out humans were "the slowest and most expensive." They were able to get judgments from humans who were experts at the particular game an agent was playing, though.

For automated judgment, if a task contains a knife, a cutting board, and a carrot, the agent may ascertain the goal ("cut the carrot on the cutting board") without relying on the language instruction. This example illustrates the need to differentiate between following a language task and inferring the language task from "environmental affordances".

How'd SIMA do? It looks like its success rate got up to about 60% for Playhouse, but only about 30% for Valheim. That's the percentage of tasks completed. The ranking goes Playhouse, WorldLab, Satisfactory, Construction Lab, No Man's Sky, Goat Simulator 3, and Valheim.

"Note that humans would also find some of these tasks challenging, and thus human-level performance would not be 100%."

Grouped by "skill category", movement instructions ("stop", "move", "look") were the easiest, while food and resource gathering instructions ("eat", "cook", "collect", "harvest") were the hardest.

For No Man's Sky, they did a direct comparison with humans. Humans averaged 60%, while SIMA got around 30%.

How long til the AIs can beat the humans?

A generalist AI agent for 3D virtual environments

#solidstatelife #ai #genai #llms #computervision #multimodal #videogames

waynerad@diasp.org

Problem solving across 100,633 lines of code in Google Gemini 1.5 Pro.

The code is for generating some animations.

"What controls the animations on the littlest Tokyo demo?"

The model finds the demo and explains the animations are embedded within a glTF model. The video doesn't explain what glTF is -- apparently it stands for "GL Transmission Format", where "GL" in turn stands for "graphics library", as it does in "OpenGL".

"Show me some code to add a slider to control the speed of the animation. Use that kind of GUI the other demos have."

They show the code and the slider, which gets added to the scene and works.

Next, they give it a screenshot of a demo and ask where the code for it is.

There were a couple hundred demos in the system (they never say exactly how many) and it correctly finds the one that matches the image.

"How can I modify the code to make the terrain flatter?"

Gemini finds the function that generates the height and the exact line within the function to modify. It also provided an explanation of why the change worked.

For the last task they show, they use a 3D text demo that says "three.js".

"How can I change the text to say, 'goldfish' and make the mesh materials look really shiny and metallic?"

Gemini finds the correct demo and shows the precise lines in it to change, along with an explanation of how to change material properties such as metalness and roughness to get a shiny effect.

Problem solving across 100,633 lines of code | Gemini 1.5 Pro demo - Google

#solidstatelife #ai #genai #computervision #llms #multimodal #google #gemini

waynerad@diasp.org

"Generate your dating profile bio with AI". "Sign in with Google."

That's the only way to use it? Sign in with Google?

Anyway, they say it uses GPT-4-Vision. Upload screenshots from your dating apps, and GPT-4-Vision will analyze them and write a bio for you that increases your chances to get more matches.

Generate your dating profile bio with AI

#solidstatelife #ai #genai #llms #gpt #multimodal

waynerad@diasp.org

"Google is reportedly toying with the idea of using its latest Gemini AI models to analyze images from Google Photos and text from Search to put together a life story for users."

"The technology is currently being explored under 'Project Ellman', and would be powered by Google's new multimodal large language model Gemini." "The idea is to ingest different types of data from multiple sources, like photographs stored on Google Photos or public information pulled from the internet, to create a more personalized chatbot."

"Imagine opening ChatGPT but it already knows everything about your life. What would you ask it?"

Hmm. I usually ask about all sorts of other things. Do people really want to ask a chatbot questions about their own lives?

Google's Project Ellman aims for digital twin chatbot

#solidstatelife #ai #genai #llms #multimodal #gemini

waynerad@diasp.org

Gemini is Google's new multimodal LLM. Crucially, unlike OpenAI's GPT family of models, Gemini was not started as a language model and had other "modes" like images added later. Gemini was multimodal from its inception. "Multimodal" here just means it takes more than one type of input. In the case of Gemini, the input is: text, images, audio, and video. Not only that, but it can output images in addition to text.

It was trained on a large fleet of Google's TPU (tensor processing unit) accelerators across multiple data centers. Tools used include Jax, Pathways, GSPMD, XLA, and MegaScale XLA. For those not in the know, Pathways is a "large scale orchestration layer for accelerators" (by "accelerators" they mean Google's TPUs). GSPMD stands for "General and Scalable Parallelization for ML Computation Graphs" and is a parallelization system for common machine learning computations. "It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation." This brings us to Jax and XLA/MegaScale XLA. These work together. Jax is an autodifferentiation system (analogous to PyTorch and TensorFlow) designed to optimize computation for TPUs using XLA -- in fact the "AX" in "JAX" stands for "autograd and XLA". You might be wondering what the "J" stands for? "JIT" (just-in-time compiler) apparently. And what about XLA? XLA stands for "accelerated linear algebra" and is a compiler for machine learning. It compiles neural networks into machine code optimized for a given set of hardware. As for the "MegaScale" part, XLA in its original formulation did the whole compilation on one computer, and "MegaScale" XLA distributes the compilation across a whole bunch of computers, using original XLA on each computer to do the compilation of that computer's part.
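If you've never touched this stack, here's a tiny toy example of the JAX-to-XLA workflow -- trace a function, let XLA compile it, and get gradients via autodiff. Google's actual training setup (Pathways, GSPMD, MegaScale XLA across data centers) does this at a vastly larger scale, so treat this purely as an illustration of the pieces:

```python
# Toy illustration of the JAX/XLA workflow (not Google's training code).
import jax
import jax.numpy as jnp

@jax.jit  # traced once, then compiled by XLA into machine code for the available accelerator
def predict(params, x):
    w, b = params
    return jnp.tanh(x @ w + b)

# The "autograd" half of JAX: differentiate with respect to the parameters.
grad_fn = jax.grad(lambda p, x: jnp.sum(predict(p, x)))

params = (jnp.ones((4, 4)), jnp.zeros(4))
print(grad_fn(params, jnp.ones((2, 4))))  # gradients with the same structure as params
```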

Training occurred on such a large scale that a rare error called "silent data corruption" occurred about once every 2 weeks. This is about data corruption in actual hardware. SRAM has about 1 bit flip per billion hours of operation, from things like cosmic rays. So 1 undetectable error per 115,000 years. Or 1 per year if you have 115,000 machines. But wait, we're not done. Bit flips can happen in RAM (DIMMs), CPUs, GPUs, network interface cards (NICs), magnetic hard disks, flash drives, and interconnect wiring. But wait, we're not done. That's from things like cosmic rays. There are also manufacturing defects. Tiny dislocations in the placement of routing blocks within the CPU can lead to race conditions in the arrival time of electrical signals, resulting in rare but unpredictable bit-flips. A transistor may simply wear out prematurely. Which brings us back to Google's figure of one "silent data corruption" about every 2 weeks. Suffice it to say, while Google doesn't say how much computing power they threw at creating this model, it's extremely massive.
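If you want to sanity-check that arithmetic yourself:

```python
# Back-of-the-envelope check of the failure-rate arithmetic above.
hours_per_year = 24 * 365        # 8,760 hours
mtbf_hours = 1e9                 # ~1 bit flip per billion device-hours
years_per_flip = mtbf_hours / hours_per_year
print(years_per_flip)            # ~114,000 years for a single machine...
print(years_per_flip / 115_000)  # ...or roughly one flip per year across 115,000 machines
```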

The tokenizer used for text was SentencePiece, which they say works better for multiple languages than the Byte Pair Encoding tokenizer used by OpenAI's GPT family of models. The difference between these two is that Byte Pair Encoding works by iteratively merging pairs of characters, based on how frequently they occur in canonical text, until a desired vocabulary size is reached. SentencePiece, on the other hand, typically uses a unigram language model -- trained to maximize the likelihood of the training text -- to choose the tokens. Byte Pair Encoding requires a preprocessing step that breaks the input up into words beforehand. SentencePiece works by treating spaces as "just another character". For this reason, SentencePiece is supposed to work better on languages like Chinese and Japanese that don't put spaces between words. SentencePiece is also said to do a better job at handling rare words (the so-called "out-of-vocabulary" words).
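A toy example of what that looks like with the sentencepiece library -- Gemini's actual tokenizer configuration isn't public, and "corpus.txt" here is a stand-in for whatever multilingual text you train on:

```python
# Toy demonstration of SentencePiece treating whitespace as just another symbol.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="toy", vocab_size=8000, model_type="unigram")
sp = spm.SentencePieceProcessor(model_file="toy.model")
print(sp.encode("Hello world", out_type=str))  # e.g. ['▁Hello', '▁world'] -- '▁' marks the space
```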

As for how images and video are encoded, they don't say explicitly but they do say it builds on previous Google models Flamingo, CoCa, and PaLI, so that is a clue. The way the system likely works is: first there is a "preprocessing" step that does boring things like make all the frames the same size, then there is a convolutional network that extracts "features" from the image, then there is an encoding step that encodes to a "latent space" -- you can think of this as being analogous to the "tokenization" step for text -- then, before all of this is combined with the text input, there is a "quantization" step. You can think of this "quantization" as being analogous to image or video compression. Each of the models mentioned uses a different quantization algorithm, so we don't know which one Gemini uses. Flamingo uses "vector quantization", CoCa uses "mixture of experts", and PaLI uses "two-stage quantization". The important thing to understand here is that "quantization" has a "stabilizing" effect on video, as seen from the perspective of the neural network during training.
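To give a feel for the vector quantization idea specifically (again, we don't know which scheme Gemini actually uses): snap each continuous latent vector to its nearest entry in a learned codebook, which turns the image features into discrete tokens.

```python
# Illustrative vector quantization: map continuous latents to nearest codebook entries.
import torch

def vector_quantize(latents, codebook):
    # latents: (N, d) continuous image/video features; codebook: (K, d) learned entries
    dists = torch.cdist(latents, codebook)   # (N, K) pairwise distances
    indices = dists.argmin(dim=1)            # pick the closest code for each latent
    return codebook[indices], indices        # quantized vectors + discrete token ids

codebook = torch.randn(512, 64)
quantized, ids = vector_quantize(torch.randn(10, 64), codebook)
```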

If you want to know what the data was that the model was trained on, well... they don't say. At all. Sorry, you don't get to know that.

Alright. Next what Google wants to do is trumpet the capabilities of the model. They do this primarily by citing its performance on various benchmarks. There are two summary tables, one for text and one for "multimodal", in the original announcement blog post, so if you want quick summary tables, I suggest you just go look at those. The "text" section is broken out into "general", "reasoning", "math", and "code", while the "multimodal" section is broken out into "image", "video", and "audio".

Before we dive into this, I need to mention that Gemini comes in different sizes, with "Nano", "Pro", and "Ultra" variants. If you're wondering why you're going to see "Ultra" so much, it's because most of the benchmarks were run against the "Ultra" version.

The first benchmark is MMLU, which I told you all about when OpenAI advertised GPT-4's score (89%). Gemini beats that slightly with 90.04%. Human expert performance is said to be 89.8%. So GPT-4 almost reached the level of human experts and Gemini just barely passes it. If you believe the 89.8% score really deserves that many significant digits. Anyway, in case you don't remember all about MMLU, MMLU stands for "measuring massive multitask language understanding". It's a test for language models, and the basic idea is that you test it on a huge variety of stuff. There are 57 tasks in total: 15,908 questions. "The questions in the dataset were manually collected by graduate and undergraduate students from freely available sources online. These include practice questions for tests such as the Graduate Record Examination and the United States Medical Licensing Examination. It also includes questions designed for undergraduate courses and questions designed for readers of Oxford University Press books. Some tasks cover a subject, like psychology, but at a specific level of difficulty, such as 'Elementary,' 'High School,' 'College,' or 'Professional.' For example, the 'Professional Psychology' task draws on questions from freely available practice questions for the Examination for Professional Practice in Psychology, while the 'High School Psychology' task has questions like those from Advanced Placement Psychology examinations."

Google says, "Achieving high performance requires specialist knowledge across many domains (e.g. law, biology, history, etc.), alongside reading comprehension and reasoning. We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought."

Up next is GSM8K. "GSM" stands for "grade school math". "8K" because it has 8,000 questions. Google says, "We find Gemini Ultra reaches 94.4% accuracy with chain-of-thought prompting and self-consistency compared to the previous best accuracy of 92% with the same prompting technique. Similar positive trends are observed in increased difficulty math problems drawn from middle- and high-school math competitions (MATH benchmark), with the Gemini Ultra model outperforming all competitor models, reaching 53.2% using 4-shot prompting. The model also outperforms the state of the art on even harder tasks derived from American Mathematical Competitions. Smaller models perform poorly on this challenging task scoring close to random, but Gemini Ultra can solve 32% of the questions, compared to the 30% solve rate for GPT-4."

To test for coding ability, they use a benchmark called HumanEval and a new one they invented just for this comparison, which they call Natural2Code. HumanEval apparently is a benchmark that challenges the model to take a function description and produce a Python implementation. Gemini Ultra gets 74.4% on this. The purpose of Natural2Code was that they were afraid the model had encountered the HumanEval questions somewhere in its training data, so they went out of their way to invent a whole new set of questions and verify that none of them existed anywhere on the internet. (They call this "web leakage".) Gemini Ultra got 74.9% on Natural2Code, which they said is better than GPT-4. (If you're wondering about AlphaCode 2, I'll get to that in a minute.)

Up next we have multilingual tests, which consists of machine translation benchmarks, summarization benchmarks, and translated versions of common benchmarks.

For machine translation, the main benchmark is WMT 23. "WMT" just stands for "Workshop on Machine Translation" and "23" just means they used the 2023 version of the test. Mainly what this test involves is translating news stories between languages. A combination of automatic and human evaluation is used. The automatic evaluation is done with something called a BLEU score. BLEU stands for Bilingual Evaluation Understudy, and the way it works is it compares the machine-translated text to a set of high quality reference translations made by humans. "It has been shown that BLEU scores correlate well with human judgment of translation quality." Gemini Ultra got a score of 74.4, vs GPT-4's 73.8.
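If you've never computed a BLEU score, here's a quick illustration using NLTK -- the real WMT evaluation is more involved, this just shows the reference-vs-candidate comparison:

```python
# Quick illustration of a BLEU-style comparison between a candidate translation
# and a human reference.
from nltk.translate.bleu_score import sentence_bleu

references = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "sat", "on", "a", "mat"]
print(sentence_bleu(references, candidate))  # ~0.54; closer to 1.0 means closer to the reference
```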

Because the WMT 23 test focuses on "high resource" languages (Spanish, German, Russian, Japanese, etc) and "mid-resource" languages, they used different benchmarks for "low-resource" languages. These benchmarks were Flores-200, NTREX, and an internal benchmark that they said used the Quechua language (spoken in Bolivia in South America and nearby countries). They said Gemini Ultra scored 27.0 and the next best model, PaLM 2-L, got 25.3 (not even GPT-4). This was for translations into and out of English only.

For multilingual understanding, they tested it with MGSM, which is a translated variant of the math benchmark GSM8K, along with XLSum and WikiLingua. MGSM stands for "multilingual grade school math". GPT-4 got 74.5, PaLM 2-L got 74.7, and Gemini Ultra got 79.0.

XLSum stands for "large-scale multilingual abstractive summarization for 44 languages" (well, close enough). You have the BBC to thank for this one. XLSum consists of about 1 million article-summary pairs from the BBC covering 44 languages. Gemini Ultra scores 17.6 vs PaLM 2-L's 15.4. WikiLingua is the same idea except it gets its content from WikiHow, and has 18 languages. PaLM 2-L scores 50.4, winning a rare victory against Gemini Ultra, which fell short at 48.9.

Before we leave the part about purely text-based evaluations, we have to talk about AlphaCode 2. AlphaCode 2 is built on top of Gemini, but it is not the same as just chucking programming problems into the model. AlphaCode 2 uses a specialized version of Gemini Pro tuned on competitive programming data, combined with a system designed to first conduct a search over the space of possible programs, then do tailored filtering, clustering, and ranking. It was tested against programming challenges from a programming competition website called Codeforces. The same 77 problems were given to the original AlphaCode (which I'm just going to call AlphaCode 1). AlphaCode 2 solved 43%, while the original AlphaCode 1 solved 25%. Comparing this to humans, this means AlphaCode 2 is better than 85% of humans, vs 50% for AlphaCode 1.
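The shape of that pipeline is roughly the following -- every callable here is a hypothetical stand-in, and the sample count is just a placeholder, so treat it as a sketch of the idea rather than the implementation:

```python
# Very rough sketch of an AlphaCode-2-style pipeline: sample lots of candidate programs,
# filter out ones that fail the public tests, cluster the survivors by behavior, and
# submit the best-ranked representatives.
def solve(problem, sample_program, passes_public_tests, cluster_by_behavior, score_cluster,
          n_samples=100_000, max_submissions=10):
    candidates = [sample_program(problem) for _ in range(n_samples)]
    passing = [c for c in candidates if passes_public_tests(c, problem)]
    clusters = cluster_by_behavior(passing)               # group programs with identical behavior
    ranked = sorted(clusters, key=score_cluster, reverse=True)
    return [cluster[0] for cluster in ranked[:max_submissions]]
```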

Some other tidbits worth mentioning: there's a graph showing that performance improves as you go from Nano to Pro to Ultra. In fact they have a Nano-1 and a Nano-2, with model sizes of 1.8 billion and 3.25 billion parameters. (Wait, did they ever say the sizes of the Pro and Ultra models?) They also say the "Nano" models are not trained from scratch but "distilled" from the larger Gemini models (in the interest of time, I'm going to skip describing how the "distillation" process works) and are further quantized down to 4 bits (normally neural networks use 16-bit floating point numbers). They are intended to run on mobile phones and other small devices.

Anyway, these four (Nano-1, Nano-2, Pro, and Ultra) are evaluated on "factuality", "long-context", "math/science", "summarization", "reasoning", and "multilinguality". Every time you step up to a larger model, there's a marked improvement in all six of these areas.

They also do what they call a "long-context" test. They place key text at the beginning of the text, then add long filler text, then ask it to remember the information at the beginning. The Ultra model retrieves the correct information 98% of the time, and this is something that also improves with model size.
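The construction of that kind of probe is simple enough to sketch -- the filler text and phrasing Google actually used aren't public, so everything here is made up:

```python
# Sketch of a "needle in a haystack" long-context probe (illustrative only).
def build_long_context_probe(needle, filler_sentence, approx_tokens=500_000):
    # Counting words as a rough proxy for tokens.
    repeats = approx_tokens // max(1, len(filler_sentence.split()))
    filler = " ".join([filler_sentence] * repeats)
    return f"{needle}\n\n{filler}\n\nWhat was the key fact stated at the very beginning?"

prompt = build_long_context_probe(
    needle="The key fact is: the access code is 7261.",
    filler_sentence="The quick brown fox jumps over the lazy dog.")
```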

For subjective human performance evaluations, they decided to do a comparison with PaLM 2. The way this test works is, you give the same prompt to both models and then show them to humans, without telling them which response came from which model, and ask them to pick which one they like better. For "creative writing", Gemini Pro was preferred 65.0% of the time, for "instruction following", Gemini Pro was preferred 59.2% of the time, and for safety, Gemini Pro was preferred 68.5% of the time.

Alrighty, now let's get to the stuff you've probably been waiting for this whole time: the multi-modal stuff.

For image understanding, they threw a battery of tests at it: MMMU, TextVQA, DocVQA, ChartQA, InfographicVQA, MathVista, AI2D, and VQAv2.

MMMU stands for "massive multi-discipline multimodal understanding". It covers 30 subjects across 6 disciplines, including art, business, health & medicine, science, humanities & social science, and tech & engineering, with 183 subfields. The questions were manually collected by a team of 50 college students from various disciplines and subjects, drawing from online sources, textbooks, and lecture materials. The test is all image-based with a deliberately unpredictable mixture of diagrams, tables, plots and charts, photographs, chemical structures, paintings, medical images, sheet music, geometric figures, pathology images, microscopic images, comics, and more, all interleaved with text. Gemini Ultra beat GPT-4V, with 59.4% for Gemini Ultra and 56.8% for GPT-4V.

TextVQA is, as its name suggests, a text + visual question & answer benchmark. It was originally designed 4 years ago with the idea of making computer vision systems that could help visually impaired people by describing their surroundings, including the text content of their surroundings. Gemini Ultra beat GPT-4V with 82.3% for Gemini Ultra and 78% for GPT-4V. Oh, and Google PaLI-3, fine-tuned, beat GPT-4 and was the prior best model, but Gemini Ultra beat that, too.

DocVQA is, as its name implies, a document question & answer benchmark, except this time the documents are images only. Gemini Ultra 90.9%, GPT-4V 88.4%.

ChartQA, you get the idea, it's for charts. It's different from MMMU, though, which can have charts interleaved with text, in that it interrogates you directly on the meaning of the charts and nothing else, while MMMU quizzes you on general knowledge including both the text and charts. "Q1: Which year has the most divergent opinions about Brazil's economy?" "Answer: 2015" "Q2: What is the peak value of the orange line?" "Answer: 87". Gemini Ultra 80.8%, GPT-4V 78.5%.

InfographicVQA, same but for infographics. "How many companies have more than 10K delivery workers?" "Answer: 2". "Who has better coverage in Toronto, Canada post or Amazon?" "Answer: Canada Post". "In which cities did Canada Post get maximum media coverage?" "Answer: Vancouver, Montreal". Gemini Ultra 80.3%, GPT-4V 75.1%

MathVista, nice how they combined "math" with the word "vista" which means "view" in Spanish. These are math questions that involve visual elements. "Question: Which function is monotonic in range [0, pi]?" [picture of sine waves with different phases.] "Answer: (B) the blue one". Gemini Ultra 53.0%, GPT-4V 49.9%, although if you actually go to the project's website, you'll discover they rank "Human" on top with 60.3%.

In the interest of time I'm going to skip over AI2D and VQAv2 as they are 6-year-old tests, for science diagrams and natural image understanding. Google PaLI-X, fine-tuned, actually beat Gemini Ultra on AI2D, with Gemini Ultra getting 79.5% and Google PaLI-X getting 81.4%. Google PaLI-X also won on VQAv2: Gemini Ultra got 77.8% and Google PaLI-X got 86.1%.

They show off Gemini's multimodal reasoning capabilities by showing how you can give it plots and ask it to actually write code that generates the plots. Code that works when it runs. "Successfully solving this task shows the model's capability to combine several capabilities: (1) recognition of the functions depicted in the plots; (2) inverse graphics to infer the code that would have generated the subplots; (3) instruction-following to put subplots in their desired positions; and (4) abstract reasoning to infer that the exponential plot must stay in its original place, because the sine plot must move out of the way for the 3-dimensional plot."

This brings us to the video understanding tests: VATEX, VATEX ZH, YouCook2, NextQA, ActivityNet-QA, and PerceptionTest MCQA.

VATEX is video captioning. The "ZH" version asks you to caption the videos in Chinese. These benchmarks use a scoring method called CIDEr. CIDEr stands for "consensus-based image description evaluation". It's interesting in that, unlike the BLEU score, which is a fairly direct text comparison, CIDEr tokenizes the captions into n-grams, weights those n-grams by TF-IDF (so n-grams that show up in every caption count for less), and then compares the resulting vectors with cosine similarity -- similar in spirit to how those "vector databases" you've been hearing about compare embeddings. The effect is to reward captions that agree with the consensus of the human reference captions, tolerating some variation in wording, while not over-rewarding generic phrasing.
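A rough illustration of that idea using scikit-learn -- the real CIDEr metric averages over n-gram orders 1 through 4 and over multiple reference captions, so this is just the core comparison:

```python
# TF-IDF n-gram vectors + cosine similarity, the core of the CIDEr idea (simplified).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

captions = [
    "a man is slicing vegetables in a kitchen",    # reference caption
    "a man is chopping vegetables in a kitchen",   # candidate A: shares many n-grams
    "a dog runs across a field",                   # candidate B: almost no overlap
]
vectors = TfidfVectorizer(ngram_range=(1, 4)).fit_transform(captions)
print(cosine_similarity(vectors[0], vectors[1:]))  # candidate A scores much higher than B
```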

YouCook2 is captioning of cooking videos specifically, in English. NextQA, ActivityNet-QA, and Perception Test MCQA are all question answering, but for videos.

GPT-4V doesn't do video, so it's not even part of the competition here. The competitors are DeepMind Flamingo, Video-LLaVA, and SeViLA, and, spoiler: Gemini Ultra wins. I'm going to skip the scores because they're not percentages and I don't think the numbers would be meaningful to you.

The video that really shows off Gemini's video understanding abilities is the "football/soccer penalty miss analysis" video. "How could this person improve their technique?" "This soccer player needs to work on their ball-striking mechanics. Their foot is not making good contact with the ball, which is causing the ball to fly up and away from the goal. They also need to work on their body positioning. They should be leaning into the shot with their non-kicking leg in front of their kicking leg, and they should be following through with their kicking leg." Link to that below. As well as a lot of other video interaction with Gemini.

Let's move on to image generation. Unlike prior LLMs, Gemini actually outputs images, and it doesn't rely on natural language to do it -- it is an output on the same level as the text output from the model. Gemini "does not have to rely on an intermediate natural language description that can bottleneck the model's ability to express images."

"Give me two ideas that I could do with these 2 colors." [Inputs colors of yarn.] "Idea 1: How about a green avocado with pink seed?" [picture]. "Idea 2: Or a green bunny with pink ears?"

For the last set of benchmarks, we look at audio understanding. For speech recognition, speech from YouTube, Multilingual LibriSpeech, FLEURS (62 languages), and VoxPopuli (14 languages) were compared. Here, just to confuse matters, the lower the score the better. That's because the word error rate (WER) metric is an "error" measure, so lower is better. Competitors were OpenAI Whisper and Google's Universal Speech Model (USM).

Gemini Pro won on all of them. For YouTube, Gemini Pro 4.9%, USM 6.2%; for Multilingual Librespeech, Gemini Pro 4.8, Whisper 6.2; for FLEURS, Gemini Pro 7.6, USM 11.8; for VoxPopuli, Gemini Pro 9.1, USM 13.4.

"We find that Gemini Pro produces more understandable responses, particularly on rare words and proper nouns."

In the example, the "ground truth" from a human transcriber says, "The archipelago lies 120 km north of the Peninsula. The largest is King George Island, with the settlement of Villa Las Estrellas." USM transcribes the same audio as, "The archipelago lines 120 km north of peninsula. The largest is Kingurch island with the settlement of Cua Losas." Gemini Pro transcribes the same audio as, "The archipelago lies 120 km north of the Peninsula. The largest is King George Island, with the settlement of Villa Las Estrellas."

Now we get to modality combination, and this is where I have to turn things over to the videos. "What's the first step to make a veggie omelet with these ingredients?" "Crack the eggs into a bowl and whisk them."

To wrap things up, I'm going to pull out a few quotes for what they say on safety.

"We filter training data for high-risk content and to ensure all training data is sufficiently high quality. Beyond filtering, we also take steps to ensure all data collected meets Google DeepMind's best practices on data enrichment developed based on the Partnership on AI's 'Responsible Sourcing of Data Enrichment Services'."

"Instruction tuning encompasses supervised fine tuning (SFT) and reinforcement learning through human feedback (RLHF) using a reward model. We apply instruction tuning in both text and multimodal settings. Instruction tuning recipes are carefully designed to balance the increase in helpfulness with decrease in model harms related to safety and hallucinations."

"To mitigate risks of harmful text generation, we enumerate approximately 20 harm types (e.g. hate speech, providing medical advice, suggesting dangerous behavior) across a wide variety of use cases."

"We focused instruction tuning efforts on three key desired behaviors, reflecting real-world scenarios: Attribution, closed-book response generation, and hedging."

By "closed-book response generation", they mean, "If provided with a fact-seeking prompt without any given source, Gemini should not hallucinate incorrect information. These prompts can range from information-seeking prompts (e.g. 'Who is the prime minister of India?') to semi-creative prompts that may request factual information (e.g. 'Write a 500-word speech in favor of the adoption of renewable energy')."

By "hedging" they mean, "If prompted with an input such that it is 'unanswerable', Gemini should not hallucinate. Rather, it should acknowledge that it cannot provide a response by hedging."

"Note that the results produced here do not include endowing Gemini with tools or retrieval that purportedly could boost factuality". In case you were wondering.

To test these, they developed 3 test sets: a factuality set, an attribution set, and a hedging set. They claim Gemini Pro has a 3.4% error rate on the factuality set, a 59.7% success rate on the attribution set, and a 69.3% success rate on the hedging set.

"We undertake ethics and safety reviews with the Google DeepMind's Responsibility and Safety Council (RSC), an interdisciplinary group which evaluates Google DeepMind's projects, papers and collaborations against Google's AI Principles."

Introducing Gemini: our largest and most capable AI model

#solidstatelife #ai #genai #llms #gpt #multimodal

waynerad@diasp.org

A single neural network that receives input from 6 "modalities": images, text, audio, depth, thermal, and inertial measurement unit (IMU) readings.

Based on that, you might think it's taking all these different input modalities and constructing a single, unified model of reality, much like humans do. But... that's not really what's going on here.

What's going on is it is training on images paired with each of the other 5 modalities. That is, images + text, images + audio, images + depth, images + thermal, and images + inertial measurement unit (IMU) readings.

And you might be wondering what training it does based on these pairs?

They used something called an InfoNCE loss function, which takes the embeddings computed separately for each half of an input pair and essentially computes a softmax cross-entropy over their similarities.

There's that funny word "embeddings" again. A more intuitive word might be "encoding" or just "vector". It runs the input through an encoder and ends up with a vector that represents something meaningful about the input it started with. In this case, they are using a "transformer" architecture for all the modalities. "Transformer" is another unintuitive term from the machine learning world. It actually means the neural network uses an "attention" mechanism -- actually probably dozens or hundreds of attention mechanisms, not just one like our conscious minds have.

In the case of images, it uses the Vision Transformer (ViT). In the case of audio, it chops the audio into 2-second pieces and makes spectrograms, which get pumped into a Vision Transformer just like an image. Thermal images are images, so they get pumped straight into a Vision Transformer also. In the case of depth, it gets converted into "disparity maps" (essentially inverse depth) "for scale invariance", which then get pumped into a transformer. In the case of inertial measurement unit (IMU) readings, they are broken into 5-second pieces and run through a 1D convolutional network before, you guessed it, getting pumped into a transformer.

So, it calculates a separate embedding for each input modality. Yet, by having a loss function that combines the two, it creates in essence a "joint embedding space" -- the term you see them using in the blog post. It should also be noted that the loss function requires "negative" examples; in other words, while giving it the embeddings for each input in your pair, you also need to give it embeddings for the 2nd modality that are not part of your input pair and tell it, "these are negative examples". In this way the system learns in a "contrastive" manner reminiscent of CLIP (the contrastive learning technique that was the precursor to DALL-E).
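Here's a minimal PyTorch sketch of an InfoNCE-style loss between image embeddings and embeddings from a second modality, using the other pairings in the batch as the negatives -- my own illustration of the idea, not the ImageBind implementation:

```python
# InfoNCE-style contrastive loss: in-batch pairs are positives, all other pairings are negatives.
import torch
import torch.nn.functional as F

def info_nce(image_emb, other_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=1)
    other_emb = F.normalize(other_emb, dim=1)
    logits = image_emb @ other_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))          # the diagonal holds the true pairs
    # softmax cross-entropy in both directions (image->other and other->image)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```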

(And in case you're wondering, the "NCE" in "InfoNCE" stands for "noise-contrastive estimation"; the term comes from the contrastive predictive coding line of work.)

So, what is all this good for? Well, one thing you can do is classification using text labels. It turns out that even though it was trained on image + something else pairs only, it can do classification without images. That is, you can give it audio and it can classify it using text, even though it was never trained on any audio + text pairs, only image + audio pairs and image + text pairs.

The other thing you can do is something they call "emergent compositionality". This is best illustrated with an example: let's say you input an image of fruits on a table and an audio clip of birds chirping. The system can retrieve an image that contains fruit and birds, say on a tree.
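That retrieval trick is easy to sketch: add the two embeddings together and look up the nearest neighbors in a gallery of precomputed image embeddings. The encoders are assumed to have already been run; everything here is illustrative.

```python
# Sketch of "emergent compositionality" retrieval: combine two modality embeddings
# and find the closest images in a gallery of precomputed image embeddings.
import torch
import torch.nn.functional as F

def retrieve(image_emb, audio_emb, gallery_embs, k=3):
    query = F.normalize(image_emb + audio_emb, dim=0)   # combine the two modalities
    gallery = F.normalize(gallery_embs, dim=1)
    scores = gallery @ query                            # cosine similarity to every gallery image
    return scores.topk(k).indices                       # indices of the best matches

best = retrieve(torch.randn(512), torch.randn(512), torch.randn(1000, 512))
```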

There is also discussion in the paper of the possibility of using this system as a way of evaluating pretrained vision models like DALL-E 2. And maybe the methodology explored here can be used to enhance pretrained models that currently handle text and images to also handle audio.

ImageBind: Holistic AI learning across six modalities

#solidstatelife #ai #multimodal