Gemini is Google's new multimodal LLM. Crucially, unlike OpenAI's GPT family of models, Gemini did not start out as a language model with other "modes" like images bolted on later. Gemini was multimodal from its inception. "Multimodal" here just means it takes more than one type of input. In Gemini's case, the inputs are text, images, audio, and video. Not only that, but it can output images in addition to text.
It was trained on a large fleet of Google's TPU (tensor processing unit) accelerators across multiple data centers. Tools used include JAX, Pathways, GSPMD, XLA, and MegaScale XLA. For those not in the know, Pathways is a "large scale orchestration layer for accelerators" (by "accelerators" they mean Google's TPUs). GSPMD stands for "General and Scalable Parallelization for ML Computation Graphs" and is a parallelization system for common machine learning computations. "It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation." This brings us to JAX and XLA/MegaScale XLA, which work together. JAX is an automatic differentiation system (analogous to PyTorch and TensorFlow) designed to optimize computation for TPUs using XLA -- in fact the "AX" in "JAX" stands for "autograd and XLA". You might be wondering what the "J" stands for. "JIT" (just-in-time compilation), apparently. And what about XLA? XLA stands for "accelerated linear algebra" and is a compiler for machine learning: it compiles neural networks into machine code optimized for a given set of hardware. As for the "MegaScale" part, XLA in its original formulation did the whole compilation on one computer, while MegaScale XLA distributes the compilation across a whole bunch of computers, using the original XLA on each computer to compile that computer's part.
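To give you a flavor of what those "hints through a few annotations" look like, here's a minimal JAX sketch (my own toy illustration, not Gemini's training code): you write the computation as if it were for a single device, then annotate how the tensors should be laid out across a "mesh" of accelerators, and XLA/GSPMD partitions the work for you.

```
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Build a 1-D "mesh" out of whatever accelerators are available.
# On a TPU pod slice this would be thousands of chips; locally it's
# just whatever jax.devices() reports.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

@jax.jit
def predict(w, x):
    # Written exactly as you would for a single device.
    return jnp.dot(x, w)

# The "annotations": shard the batch dimension of x across the mesh,
# replicate the weights everywhere. GSPMD propagates the sharding
# through the rest of the computation.
x = jax.device_put(jnp.ones((1024, 512)),
                   NamedSharding(mesh, PartitionSpec("data", None)))
w = jax.device_put(jnp.ones((512, 256)),
                   NamedSharding(mesh, PartitionSpec(None, None)))

y = predict(w, x)  # runs partitioned across all devices in the mesh
```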
Training occurred on such a large scale that a rare error called "silent data corruption" struck about once every 2 weeks. This is corruption in the actual hardware. SRAM suffers about 1 bit flip per billion hours of operation, from things like cosmic rays. That works out to roughly 1 undetectable error per 115,000 years -- or 1 per year if you have 115,000 machines. But wait, we're not done. Bit flips can happen in RAM (DIMMs), CPUs, GPUs, network interface cards (NICs), magnetic hard disks, flash drives, and interconnect wiring. And cosmic rays aren't the only culprit: there are also manufacturing defects. Tiny dislocations in the placement of routing blocks within a CPU can lead to race conditions in the arrival time of electrical signals, resulting in rare but unpredictable bit flips. A transistor may simply wear out prematurely. Add all of that up across Google's fleet and you get a silent data corruption about once every 2 weeks. Suffice it to say, while Google doesn't say how much computing power they threw at creating this model, it's extremely massive.
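If you want to sanity-check that back-of-the-envelope arithmetic:

```
# 1 bit flip per billion hours of operation (the SRAM figure above).
hours_per_year = 24 * 365              # 8,760
years_per_flip = 1e9 / hours_per_year  # ~114,000 years for one machine
machines = 115_000
flips_per_year = machines / years_per_flip
print(years_per_flip, flips_per_year)  # ~114,155 years, ~1 flip/year fleet-wide
```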
The tokenizer used for text was SentencePiece, which they say works better for multiple languages than the Byte Pair Encoding tokenizer used by OpenAI's GPT family of models. The difference between these two is that Byte Pair Encoding works by iteratively merging pairs of characters (and then character sequences), based on how frequently they occur in the training text, until a desired vocabulary size is reached. SentencePiece's default algorithm, on the other hand, trains a simple "unigram" language model over candidate subwords -- a self-supervised approach, loosely in the spirit of the predictive learning that GPT itself uses -- and keeps the subwords that best explain the training text. Byte Pair Encoding requires a preprocessing step that breaks the input up into words beforehand. SentencePiece works by treating spaces as "just another character". For this reason, SentencePiece is supposed to work better on languages like Chinese and Japanese that don't put spaces between words. SentencePiece is also said to do a better job at handling rare words (the so-called "out-of-vocabulary" words).
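If you want to poke at the difference yourself, the sentencepiece Python package is Google's open-source implementation. Here's a minimal sketch of training a small tokenizer and encoding some text; "corpus.txt" is just a placeholder for whatever text file you have lying around:

```
import sentencepiece as spm

# Train a small tokenizer on a local text file. model_type defaults to
# "unigram" (SentencePiece's own algorithm); it can also do "bpe".
spm.SentencePieceTrainer.train(
    input="corpus.txt",            # placeholder: any plain-text file
    model_prefix="demo_tokenizer",
    vocab_size=1000,               # adjust down for a tiny corpus
)

sp = spm.SentencePieceProcessor(model_file="demo_tokenizer.model")

# Spaces are treated as "just another character" (shown as the ▁ symbol),
# so no language-specific splitting into words is needed beforehand.
print(sp.encode("Hello world, this is a test.", out_type=str))
```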
As for how images and video are encoded, they don't say explicitly, but they do say it builds on previous Google models Flamingo, CoCa, and PaLI, so that is a clue. The way the system likely works is: first there is a "preprocessing" step that does boring things like make all the frames the same size, then a vision encoder network extracts "features" from the image, then there is an encoding step that maps those features to a "latent space" -- you can think of this as being analogous to the "tokenization" step for text -- and then, before all of this is combined with the text input, there is a "quantization" step. You can think of this "quantization" as being analogous to image or video compression. Each of the models mentioned handles its visual representations a bit differently, so we don't know exactly what Gemini does. The important thing to understand here is that "quantization" has a "stabilizing" effect on video, as seen from the perspective of the neural network during training.
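Since we don't know what Gemini actually does here, treat this as flavor only: a stripped-down numpy sketch of vector quantization, where each continuous feature vector coming out of an image encoder gets snapped to its nearest entry in a "codebook" (real systems learn the codebook during training; this one is random):

```
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are feature vectors from an image encoder:
# 16 image patches, each a 64-dimensional continuous vector.
features = rng.normal(size=(16, 64))

# A codebook of 512 prototype vectors (learned in a real system).
codebook = rng.normal(size=(512, 64))

# Vector quantization: snap each feature vector to its nearest
# codebook entry (squared Euclidean distance).
distances = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = distances.argmin(axis=1)   # discrete indices, like text tokens
quantized = codebook[codes]        # the quantized representation

print(codes[:8])         # a handful of discrete "visual tokens"
print(quantized.shape)   # (16, 64)
```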
If you want to know what the data was that the model was trained on, well... they don't say. At all. Sorry, you don't get to know that.
Alright. Next what Google wants to do is trumpet the capabilities of the model. They do this primarily by citing its performance on various benchmarks. There are two summary tables, one for text and one for "multimodal", in the original announcement blog post, so if you want quick summary tables, I suggest you just go look at those. The "text" section is broken out into "general", "reasoning", "math", and "code", while the "multimodal" section is broken out into "image", "video", and "audio".
Before we dive into this, I need to mention that Gemini comes in different sizes, with "Nano", "Pro", and "Ultra" variants. If you're wondering why you're going to see "Ultra" so much, it's because most of the benchmarks were run against the "Ultra" version.
The first benchmark is MMLU, which I told you all about when OpenAI advertised GPT-4's score (86.4%). Gemini beats that with 90.04%. Human expert performance is said to be 89.8%. So GPT-4 almost reached the level of human experts and Gemini just barely passes it. If you believe the 89.8% score really deserves that many significant digits. Anyway, in case you don't remember all about MMLU, MMLU stands for "measuring massive multitask language understanding". It's a test for language models, and the basic idea is that you test it on a huge variety of stuff. There are 57 tasks in total, with 15,908 questions. "The questions in the dataset were manually collected by graduate and undergraduate students from freely available sources online. These include practice questions for tests such as the Graduate Record Examination and the United States Medical Licensing Examination. It also includes questions designed for undergraduate courses and questions designed for readers of Oxford University Press books. Some tasks cover a subject, like psychology, but at a specific level of difficulty, such as 'Elementary,' 'High School,' 'College,' or 'Professional.' For example, the 'Professional Psychology' task draws on questions from freely available practice questions for the Examination for Professional Practice in Psychology, while the 'High School Psychology' task has questions like those from Advanced Placement Psychology examinations."
Google says, "Achieving high performance requires specialist knowledge across many domains (e.g. law, biology, history, etc.), alongside reading comprehension and reasoning. We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought."
Up next is GSM8K. "GSM" stands for "grade school math". "8K" because it has 8,000 questions. Google says, "We find Gemini Ultra reaches 94.4% accuracy with chain-of-thought prompting and self-consistency compared to the previous best accuracy of 92% with the same prompting technique. Similar positive trends are observed in increased difficulty math problems drawn from middle- and high-school math competitions (MATH benchmark), with the Gemini Ultra model outperforming all competitor models, reaching 53.2% using 4-shot prompting. The model also outperforms the state of the art on even harder tasks derived from American Mathematical Competitions. Smaller models perform poorly on this challenging task scoring close to random, but Gemini Ultra can solve 32% of the questions, compared to the 30% solve rate for GPT-4."
To test for coding ability, they use a benchmark called HumanEval and a new one they invented just for this comparison that they call Natural2Code. HumanEval is a benchmark that challenges the model to take a function description and produce a Python implementation. Gemini Ultra gets 74.4% on this. The purpose of Natural2Code was that they were afraid the model had encountered the HumanEval questions somewhere in its training data (they call this "web leakage"), so they went out of their way to invent a whole new set of questions and verify that none of them existed anywhere on the internet. Gemini Ultra got 74.9% on this, which they say is better than GPT-4. (If you're wondering about AlphaCode 2, I'll get to that in a minute.)
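To make concrete what a HumanEval-style problem looks like, here's an example I made up in the same spirit: the model gets the function signature and docstring and has to produce a working body, which is then checked against unit tests.

```
def count_vowel_words(sentence: str) -> int:
    """Return the number of words in `sentence` that begin with a vowel.

    >>> count_vowel_words("An apple sits on the old table")
    4
    """
    # --- everything below is what the model is asked to generate ---
    return sum(1 for word in sentence.split()
               if word and word[0].lower() in "aeiou")
```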
Up next we have multilingual tests, which consist of machine translation benchmarks, summarization benchmarks, and translated versions of common benchmarks.
For machine translation, the main benchmark is WMT 23. "WMT" just stands for "Workshop on Machine Translation" and "23" just means they used the 2023 version of the test. Mainly what this test involves is translating news stories between languages. A combination of automatic and human evaluation is used. The automatic evaluation is done with something called a BLEU score. BLEU stands for Bilingual Evaluation Understudy, and the way it works is it compares the machine-translated text to a set of high quality reference translations made by humans. "It has been shown that BLEU scores correlate well with human judgment of translation quality." Gemini Ultra got a score of 74.4, vs GPT-4's 73.8.
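Just to illustrate the BLEU idea itself (not Google's actual evaluation pipeline), NLTK ships a simple implementation you can play with:

```
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One human reference translation and two machine-translation candidates.
reference = "the cat sat quietly on the warm mat".split()
good_candidate = "the cat sat quietly on a warm mat".split()
poor_candidate = "a animal was on some mat sitting".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences

print(sentence_bleu([reference], good_candidate, smoothing_function=smooth))
print(sentence_bleu([reference], poor_candidate, smoothing_function=smooth))
# The candidate sharing more n-grams with the reference scores much higher.
```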
Because the WMT 23 test focuses on "high-resource" languages (Spanish, German, Russian, Japanese, etc.) and "mid-resource" languages, they used different benchmarks for "low-resource" languages. These benchmarks were Flores-200, NTREX, and an internal benchmark that they said used the Quechua language (spoken in Peru, Bolivia, and other Andean countries in South America). They said Gemini Ultra scored 27.0 and the next best model, PaLM 2-L, got 25.3 (not even GPT-4). This was for translations into and out of English only.
For multilingual understanding, they tested it with MGSM (a translated variant of the math benchmark GSM8K), XLSum, and WikiLingua. MGSM stands for "multilingual grade school math". On MGSM, GPT-4 got 74.5, PaLM 2-L got 74.7, and Gemini Ultra got 79.0.
XLSum stands for "large-scale multilingual abstractive summarization for 44 languages" (well, close enough). You have the BBC to thank for this one. XLSum consists of about 1 million article-summary pairs from the BBC covering 44 languages. Gemini Ultra scores 17.6 vs PaLM 2-L's 15.4. WikiLingua is the same idea except it gets its content from WikiHow, and has 18 languages. PaLM 2-L scores 50.4, winning a rare victory against Gemini Ultra, which fell short at 48.9.
Before we leave the part about purely text-based evaluations, we have to talk about AlphaCode 2. AlphaCode 2 is built on top of Gemini, but it is not the same as just chucking programming problems into Gemini. AlphaCode 2 uses a specialized version of Gemini Pro tuned on competitive programming data, combined with a system designed to first conduct a search over the space of possible programs, then do tailored filtering, clustering, and ranking. It was tested against programming challenges from a competitive programming website called Codeforces. The same 77 problems were given to the original AlphaCode (which I'm just going to call AlphaCode 1). AlphaCode 2 solved 43% of them, while AlphaCode 1 solved 25%. Comparing this to humans, AlphaCode 2 performs better than an estimated 85% of competition participants, vs about 50% for AlphaCode 1.
Some other tidbits worth mentioning: there's a graph showing that performance improves as you go from Nano to Pro to Ultra. In fact there are two Nano models, Nano-1 and Nano-2, with 1.8 billion and 3.25 billion parameters, respectively. Wait, did they ever say the sizes of the Pro and Ultra models? They also say the "Nano" models are not trained from scratch but "distilled" from the larger Gemini models (in the interest of time, I'm going to skip describing how the "distillation" process works) and are further reduced to 4-bit precision (normally neural networks use 16-bit floating point numbers). They are intended to run on mobile phones and other small devices.
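To give a sense of what "reduced to 4 bits" means, here's a toy numpy sketch of symmetric 4-bit weight quantization. This is my own illustration; Google doesn't describe the exact scheme used for the Nano models:

```
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=(8,)).astype(np.float32)  # "full" precision

# Symmetric 4-bit quantization: 16 levels, integer values in [-8, 7].
scale = np.abs(weights).max() / 7
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # conceptually 4 bits

# At inference time the integers are rescaled back to floats on the fly.
dequantized = q.astype(np.float32) * scale

print(weights)
print(dequantized)  # close to the originals, at a fraction of the memory
```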
Anyway, these four (Nano-1, Nano-2, Pro, and Ultra) are evaluated on "factuality", "long-context", "math/science", "summarization", "reasoning", and "multilinguality". Every time you step up to a larger model, there's a marked improvement in all six of these areas.
They also do what they call a "long-context" test. They place key information at the beginning of the context, then add long filler text, then ask the model to retrieve the information from the beginning. The Ultra model retrieves the correct information 98% of the time, and this is something that also improves with model size.
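The test itself is easy to picture. Roughly, in sketch form (model.generate is a hypothetical stand-in for calling the model):

```
def long_context_retrieval_test(model, repeats=20_000):
    """Sketch of the "retrieve a fact from the start of a long context" test.

    model.generate is a hypothetical stand-in for an API call to the model.
    """
    key_fact = "The secret passphrase is 'blue-harvest-42'."
    filler = "The sky was gray and nothing much happened that day. " * repeats
    question = "What is the secret passphrase mentioned at the very beginning?"

    answer = model.generate(f"{key_fact}\n\n{filler}\n\n{question}")

    # Score: did the model dig the key fact back out of the long context?
    return "blue-harvest-42" in answer
```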
For subjective human preference evaluations, they decided to do a comparison with PaLM 2. The way this test works is, you give the same prompt to both models, show both responses to human raters without telling them which response came from which model, and ask them to pick the one they like better. For "creative writing", Gemini Pro was preferred 65.0% of the time; for "instruction following", 59.2% of the time; and for safety, 68.5% of the time.
Alrighty, now let's get to the stuff you've probably been waiting for this whole time: the multi-modal stuff.
For image understanding, they threw a battery of tests at it: MMMU, TextVQA, DocVQA, ChartQA, InfographicVQA, MathVista, AI2D, and VQAv2.
MMMU stands for "massive multi-discipline multimodal understanding". It covers 30 subjects across 6 disciplines -- art, business, health & medicine, science, humanities & social science, and tech & engineering -- with 183 subfields. The questions were manually collected by a team of 50 college students from various disciplines and subjects, drawing from online sources, textbooks, and lecture materials. The test is all image-based, with a deliberately unpredictable mixture of diagrams, tables, plots and charts, photographs, chemical structures, paintings, medical images, sheet music, geometric figures, pathology images, microscopic images, comics, and more, all interleaved with text. Gemini Ultra beat GPT-4V, with 59.4% for Gemini Ultra and 56.8% for GPT-4V.
TextVQA is, as its name suggests, a text + visual question & answer benchmark. It was originally designed 4 years ago with the idea of making computer vision systems that could help visually impaired people by describing their surroundings, including the text content of their surroundings. Gemini Ultra beat GPT-4V with 82.3% for Gemini Ultra and 78% for GPT-4V. Oh, and Google PaLI-3, fine-tuned, beat GPT-4 and was the prior best model, but Gemini Ultra beat that, too.
DocVQA is, as its name implies, a document question & answer benchmark, except this time the documents are images only. Gemini Ultra 90.9%, GPT-4V 88.4%.
ChartQA, you get the idea, it's for charts. It's different from MMMU, though, which can have charts interleaved with text, in that it interrogates you directly on the meaning of the charts and nothing else, while MMMU quizzes you on general knowledge including both the text and charts. "Q1: Which year has the most divergent opinions about Brazil's economy?" "Answer: 2015" "Q2: What is the peak value of the orange line?" "Answer: 87". Gemini Ultra 80.8%, GPT-4V 78.5%.
InfographicVQA, same but for infographics. "How many companies have more than 10K delivery workers?" "Answer: 2". "Who has better coverage in Toronto, Canada post or Amazon?" "Answer: Canada Post". "In which cities did Canada Post get maximum media coverage?" "Answer: Vancouver, Montreal". Gemini Ultra 80.3%, GPT-4V 75.1%.
MathVista, nice how they combined "math" with the word "vista" which means "view" in Spanish. These are math questions that involve visual elements. "Question: Which function is monotonic in range [0, pi]?" [picture of sine waves with different phases.] "Answer: (B) the blue one". Gemini Ultra 53.0%, GPT-4V 49.9%, although if you actually go to the project's website, you'll discover they rank "Human" on top with 60.3%.
In the interest of time I'm going to skip over AI2D and VQAv2, as they are 6-year-old tests for science diagrams and natural image understanding, respectively. Google PaLI-X, fine-tuned, actually beat Gemini Ultra on AI2D, with Gemini Ultra getting 79.5% and Google PaLI-X getting 81.4%. Google PaLI-X also won on VQAv2: Gemini Ultra got 77.8% and Google PaLI-X got 86.1%.
They show off Gemini's multimodal reasoning capabilities by showing how you can give it plots and ask it to write code that generates those plots -- code that actually works when you run it. "Successfully solving this task shows the model's capability to combine several capabilities: (1) recognition of the functions depicted in the plots; (2) inverse graphics to infer the code that would have generated the subplots; (3) instruction-following to put subplots in their desired positions; and (4) abstract reasoning to infer that the exponential plot must stay in its original place, because the sine plot must move out of the way for the 3-dimensional plot."
This brings us to the video understanding tests: VATEX, VATEX ZH, YouCook2, NextQA, ActivityNet-QA, and Perception Test MCQA.
VATEX is video captioning. The "ZH" version asks you to caption the videos in Chinese. These benchmarks use a metric called the CIDEr score. CIDEr stands for "Consensus-based Image Description Evaluation". It's interesting in that, unlike the BLEU score, which is a fairly direct text comparison, CIDEr actually takes the trouble to tokenize the captions, turn them into vectors of TF-IDF-weighted n-grams, and then use cosine similarity to compare them. This is similar in spirit to how those "vector databases" you've been hearing about work. The idea is that it recognizes consensus in meaning even when different words and phrasing are used.
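Here's a very stripped-down sketch of that core idea. The real CIDEr metric uses TF-IDF-weighted n-grams up to length 4 and averages over many reference captions, but the tokenize-then-cosine-similarity part looks like this:

```
from collections import Counter
import math

def caption_similarity(caption_a, caption_b):
    """Tokenize two captions into word counts, compare with cosine similarity."""
    a = Counter(caption_a.lower().split())
    b = Counter(caption_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(caption_similarity("a man kicks a soccer ball toward the goal",
                         "a soccer player kicks the ball at the goal"))  # ~0.73
print(caption_similarity("a man kicks a soccer ball toward the goal",
                         "several birds fly over an empty beach"))       # 0.0
```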
YouCook2 is captioning of cooking videos specifically, in English. NextQA, ActivityNet-QA, and Perception Test MCQA are all question answering, but for videos.
GPT-4V doesn't do video, so it's not even part of the competition here. The competitors are DeepMind Flamingo, Video-LLaVA, and SeViLA, and, spoiler: Gemini Ultra wins. I'm going to skip the scores because they're not percentages and I don't think the numbers would be meaningful to you.
The video that really shows off Gemini's video understanding abilities is the "football/soccer penalty miss analysis" video. "How could this person improve their technique?" "This soccer player needs to work on their ball-striking mechanics. Their foot is not making good contact with the ball, which is causing the ball to fly up and away from the goal. They also need to work on their body positioning. They should be leaning into the shot with their non-kicking leg in front of their kicking leg, and they should be following through with their kicking leg." Link to that below, along with a lot of other video interaction with Gemini.
Let's move on to image generation. Unlike prior LLMs, Gemini actually outputs images, and it doesn't rely on natural language to do it -- it is an output on the same level as the text output from the model. Gemini "does not have to rely on an intermediate natural language description that can bottleneck the model's ability to express images."
"Give me two ideas that I could do with these 2 colors." [Inputs colors of yarn.] "Idea 1: How about a green avocado with pink seed?" [picture]. "Idea 2: Or a green bunny with pink ears?"
For the last set of benchmarks, we look at audio understanding. For automatic speech recognition, performance on speech from YouTube, Multilingual LibriSpeech, FLEURS (62 languages), and VoxPopuli (14 languages) was compared. Here, just to confuse matters, the lower the score the better: the metric is word error rate (WER), which is an "error" measure, so lower is better. Competitors were OpenAI Whisper and Google's Universal Speech Model (USM).
Gemini Pro won on all of them. For YouTube, Gemini Pro 4.9%, USM 6.2%; for Multilingual LibriSpeech, Gemini Pro 4.8, Whisper 6.2; for FLEURS, Gemini Pro 7.6, USM 11.8; for VoxPopuli, Gemini Pro 9.1, USM 13.4.
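For the curious, word error rate is just edit distance over words: the minimum number of word substitutions, insertions, and deletions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. A minimal implementation:

```
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()

    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.17
```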
"We find that Gemini Pro produces more understandable responses, particularly on rare words and proper nouns."
In the example, the "ground truth" from a human transcriber says, "The archipelago lies 120 km north of the Peninsula. The largest is King George Island, with the settlement of Villa Las Estrellas." USM transcribes the same audio as, "The archipelago lines 120 km north of peninsula. The largest is Kingurch island with the settlement of Cua Losas." Gemini Pro transcribes the same audio as, "The archipelago lies 120 km north of the Peninsula. The largest is King George Island, with the settlement of Villa Las Estrellas."
Now we get to modality combination, and this is where I have to turn things over to the videos. "What's the first step to make a veggie omelet with these ingredients?" "Crack the eggs into a bowl and whisk them."
To wrap things up, I'm going to pull out a few quotes from what they say on safety.
"We filter training data for high-risk content and to ensure all training data is sufficiently high quality. Beyond filtering, we also take steps to ensure all data collected meets Google DeepMind's best practices on data enrichment developed based on the Partnership on AI's 'Responsible Sourcing of Data Enrichment Services'."
"Instruction tuning encompasses supervised fine tuning (SFT) and reinforcement learning through human feedback (RLHF) using a reward model. We apply instruction tuning in both text and multimodal settings. Instruction tuning recipes are carefully designed to balance the increase in helpfulness with decrease in model harms related to safety and hallucinations."
"To mitigate risks of harmful text generation, we enumerate approximately 20 harm types (e.g. hate speech, providing medical advice, suggesting dangerous behavior) across a wide variety of use cases."
"We focused instruction tuning efforts on three key desired behaviors, reflecting real-world scenarios: Attribution, closed-book response generation, and hedging."
By "closed-book response generation", they mean, "If provided with a fact-seeking prompt without any given source, Gemini should not hallucinate incorrect information. These prompts can range from information-seeking prompts (e.g. 'Who is the prime minister of India?') to semi-creative prompts that may request factual information (e.g. 'Write a 500-word speech in favor of the adoption of renewable energy')."
By "hedging" they mean, "If prompted with an input such that it is 'unanswerable', Gemini should not hallucinate. Rather, it should acknowledge that it cannot provide a response by hedging."
"Note that the results produced here do not include endowing Gemini with tools or retrieval that purportedly could boost factuality". In case you were wondering.
To test these, they developed 3 test sets: a factuality set, an attribution set, and a hedging set. They claim Gemini Pro has a 3.4% error rate on the factuality set, a 59.7% success rate on the attribution set, and a 69.3% success rate on the hedging set.
"We undertake ethics and safety reviews with the Google DeepMind's Responsibility and Safety Council (RSC), an interdisciplinary group which evaluates Google DeepMind's projects, papers and collaborations against Google's AI Principles."
Introducing Gemini: our largest and most capable AI model
#solidstatelife #ai #genai #llms #gpt #multimodal