Gemma is a new family of open models from Google, built from the same research and technology as Google's Gemini models.
The 2 billion parameter model is trained on 2 trillion tokens and the 7 billion parameter model on 6 trillion tokens. The training data is all text, primarily English, "from web documents, mathematics, and code." Unlike Gemini, these models are not multimodal, nor are they trained for multilingual tasks. The text is filtered to remove "unsafe" content.
The number of layers is 18 for the 2 billion model and 28 for the 7 billion model, and the number of attention heads goes from 8 to 16 going from the 2 billion to the 7 billion model.
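Just to keep the two side by side, here are those numbers as a little Python dictionary (only the figures mentioned above -- hidden sizes, vocabulary size, and everything else is left out):

```python
# The architecture numbers quoted above; everything else about the models
# (hidden sizes, head dimensions, vocabulary size, ...) is omitted here.
GEMMA_SIZES = {
    "gemma-2b": {"num_layers": 18, "num_attention_heads": 8},
    "gemma-7b": {"num_layers": 28, "num_attention_heads": 16},
}
```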
They use something called rotary positional embeddings instead of regular positional embeddings, now called absolute positional embeddings. So what this is about is: the input is a sequence of "tokens", each of which gets turned into an "embedding" -- never mind that "embedding" is a term that doesn't make much sense; I think "token vector" would be more intuitive. What these really are are vectors, where the numbers in the vector relate to the semantic meaning of the word the vector is a "token" for. These go in as a sequence, but people have found the "attention" mechanism (which, remember, is what "transformers" do, even though you would never guess that from the word "transformer") works better if it is given additional "positional" information.
Here, though, they want to give "relative" rather than "absolute" positional information. That's accomplished with a "rotation matrix", which is why the technique is called "rotary positional embeddings". I don't know how the "rotation matrix" works and I don't feel at the moment that this is an important enough detail to dwell on, so let's continue.
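(For the curious, though, the gist fits in a few lines of numpy: rotate each query/key vector by angles proportional to its position, and the dot products that attention computes then depend only on how far apart two tokens are, i.e. relative rather than absolute position. This is just a toy sketch of the general technique -- the pairing of dimensions and everything else here is illustrative, not Gemma's actual implementation.)

```python
import numpy as np

def rope(x, position, base=10000.0):
    # Treat consecutive pairs of dimensions (2i, 2i+1) of x as 2-D points
    # and rotate each pair by the angle position * base**(-2i/d).
    d = x.shape[0]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin   # standard 2-D rotation, pair by pair
    out[1::2] = x1 * sin + x2 * cos
    return out

# The payoff: the dot product of two rotated vectors depends only on the
# *difference* of their positions, not the absolute positions themselves.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
a = rope(q, position=5) @ rope(k, position=3)       # positions 5 and 3
b = rope(q, position=105) @ rope(k, position=103)   # same offset, shifted by 100
print(np.isclose(a, b))  # True
```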
The tokenizer is SentencePiece, which you may remember from the original Gemini announcement. It learns its token vocabulary directly from raw text, working at the character level, unlike OpenAI's Byte Pair Encoding tokenizer, which relies on a preprocessor that breaks the text into words ahead of time. SentencePiece is supposed to work better on languages like Chinese and Japanese that don't bother putting spaces between words. SentencePiece is also said to do a better job of handling rare words (the so-called "out-of-vocabulary" words).
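Here's roughly what using the SentencePiece library looks like -- the point being there's no separate word-splitting step, you just hand it raw text. The file name and vocabulary size are made-up placeholders for the example; this is not how Gemma's actual tokenizer was built.

```python
import sentencepiece as spm

# Learn a subword vocabulary directly from raw text (spaces and all).
# "corpus.txt" and vocab_size are placeholders for this example.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="toy", vocab_size=8000
)

sp = spm.SentencePieceProcessor(model_file="toy.model")
print(sp.encode("Tokenizers have to handle rare words gracefully.", out_type=str))
```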
They replaced the standard ReLU activation function with GeGLU. "ReLU" is short for "rectified linear unit" and is the smart person's way of saying you chop off the negative numbers and just make them 0. Weirdly enough this is enough non-linearity for neural networks to learn non-linear functions. The "GLU" part of GeGLU stands for "Gated Linear Unit": you take two linear projections of the input (each with its own learnable parameters) and multiply them together elementwise, so one projection acts as a "gate" on the other. The "Ge" part is for GELU (the "Gaussian Error Linear Unit", a smoothed-out cousin of ReLU), which is applied to the gate before the multiplication. It's a lot more complicated than ReLU, and it's been shown to improve transformer models. Why it improves transformer models, I have no idea.
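Here's a minimal numpy sketch of GeGLU under that formulation (GELU of one linear projection, multiplied elementwise by a second linear projection); the shapes are made up and bias terms are left out:

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the Gaussian Error Linear Unit.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu(x, W, V):
    # One linear projection goes through GELU and gates the other, elementwise.
    return gelu(x @ W) * (x @ V)

# Toy shapes -- nothing to do with Gemma's real dimensions.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # 4 token vectors, model dimension 8
W = rng.standard_normal((8, 16))   # projection that gets the GELU
V = rng.standard_normal((8, 16))   # projection that gets gated
print(geglu(x, W, V).shape)        # (4, 16)
```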
There's a "normalization" system that normalizes both the input and the output of each transformer sub-layer, "a deviation from the standard practice of solely normalizing one or the other." "Normalization" here means "normal" in the statistical sense -- remember your Gaussian bell curve is also called the "normal" distribution -- what it does is re-center the numbers around 0 with a standard deviation of 1.
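A quick numpy sketch of what that does to each token vector (leaving out the learnable scale parameters a real implementation has). For the record, the Gemma report says it uses RMSNorm, a variant that skips the re-centering and only rescales, so that's included too:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Re-center each row to mean 0 and rescale to standard deviation 1.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def rms_norm(x, eps=1e-6):
    # RMSNorm skips the mean subtraction and just divides by the
    # root-mean-square of each row.
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

x = np.array([[2.0, 4.0, 6.0, 8.0]])
print(layer_norm(x))  # roughly mean 0, standard deviation 1
print(rms_norm(x))    # same direction, just rescaled
```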
They claim Gemma (both the 2 billion and 7 billion models) outperforms LLaMA 2 and Mistral at question answering, reasoning, math & science, and coding. However, one tester (Matthew Berman) wasn't too impressed when he tried it out (see below), so I'm wondering if it lives up to these claims.
Gemma: Introducing new state-of-the-art open models
#solidstatelife #ai #genai #llms #gemini #openmodels