#gpt

waynerad@diasp.org

"Albania to speed up EU accession using ChatGPT".

Ok, I understood that sentence up to "using ChatGPT".

"The Albanian government will use ChatGPT to translate thousands of pages of EU legal measures and provisions into shqip (Albanian language) and then integrate them into existing legal structures, following an agreement with the CEO of the parent company, OpenAI, Mira Murati, who was born in Albania."

Oh wow, happened when Mira Murati was CEO. That was, like, a week?

So is ChatGPT the best translator for shqip because it's a smaller language? Why ChatGPT and not some other machine translation system?

"The model to be used by the Albanian government will translate into Albanian and provide a detailed overview of what and where changes need to be made to local legislation to align with EU rules. It will also provide an analysis of the impact of all measures and changes, which usually require many experts and a lot of time."

"Albanian Prime Minister Edi Rama said the move would eliminate 'an army of translators and a battalion of lawyers, costing millions of euros' and speed up the process."

So the idea is just to use ChatGPT as a translator. But is it really a good idea? Some of those "army of translators and battalion of lawyers" need to double-check all ChatGPT's work. ChatGPT is not always right.

Albania to speed up EU accession using ChatGPT - Euractiv

#solidstatelife #ai #genai #llms #gpt #mt #geopolitics #albania

waynerad@diasp.org

"GPTScript is a new scripting language to automate your interaction with a Large Language Model (LLM), namely OpenAI. The ultimate goal is to create a fully natural language based programming experience. The syntax of GPTScript is largely natural language, making it very easy to learn and use. Natural language prompts can be mixed with traditional scripts such as bash and python or even external HTTP service calls. With GPTScript you can do just about anything like plan a vacation, edit a file, run some SQL, or build a mongodb/flask app."

"GPTScript is composed of tools. Each tool performs a series of actions similar to a function. Tools have available to them other tools that can be invoked similar to a function call. While similar to a function, the tools are primarily implemented with a natural language prompt. The interaction of the tools is determined by the AI model, the model determines if the tool needs to be invoked and what arguments to pass. Tools are intended to be implemented with a natural language prompt but can also be implemented with a command or HTTP call."

GPTScript

#solidstatelife #ai #genai #llms #gpt

waynerad@diasp.org

Reaction video to OpenAI Sora, OpenAI's system for generating video from text.

I encountered the reaction video first, in fact I discovered Sora exists from seeing the reaction video, but see below for the official announcement from OpenAI.

It's actually kind of interesting and amusing comparing the guesses in the reaction videos about how the system works with the way it actually works. People are guessing based on their knowledge of traditional computer graphics and 3D modeling. However...

The way Sora works is quite fascinating. We don't know the nitty-gritty details but OpenAI has described the system at a high level.

Basically it combines ideas from their image generation and large language model systems.

Their image generation systems, DALL-E 2 and DALL-E 3, are diffusion models. Their large language models, GPT-2, GPT-3, GPT-4, GPT-4-Vision, etc, are transformer models. (In fact "GPT" stands for "generative pretrained transformer").

I haven't seen diffusion and transformer models combined before.

Diffusion models work by having a set of parameters in what they call "latent space" that describe the "meaning" of the image. The word "latent" is another way of saying "hidden". The "latent space" parameters are "hidden" inside the model but they are created in such a way that the images and text descriptions are correlated, which is what makes it possible to type in a text prompt and get an image out. I've elsewhere given high-level hand-wavey descriptions of how the latent space parameters are turned into images through the diffusion process, and how the text and images are correlated (a training method called CLIP), so I won't repeat that here.

Large language models, on the other hand, work by turning words and word pieces into "tokens". The "tokens" are vectors constructed in such a way that the numerical values in the vectors are related to the underlying meaning of the words.

To make a model that combines both of these ideas, they figured out a way of doing something analogous to "tokens" but for video. They call their video "tokens" "patches". So Sora works with visual "patches".

One way to think of "patches" is as video compression both spatially and temporally. Unlike a video compression algorithm such as mpeg that does this using pre-determined mathematical formulas (discrete Fourier transforms and such), in this system the "compression" process is learned and is all made of neural networks.

So with a large language model, you type in text and it outputs tokens which represent text, which are decoded to text for you. With Sora, you type in text and it outputs tokens, except here the tokens represent visual "patches", and the decoder turns the visual "patches" into pixels for you to view.

Because the "compression" works both ways, in addition to "decoding" patches to get pixels, you can also input pixels and "encode" them into patches. This enables Sora to input video and perform a wide range of video editing tasks. It can create perfectly looping video, it can animate static images (why no Mona Lisa examples, though?), it can extend videos, either forward or backward in time. Sora can gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. I found these to be the most freakishly fascinating examples on their page of sample videos.

They list the following "emerging simulation capabilities":

"3D consistency." "Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space."

This is where they have the scene everyone is reacting to in the reaction videos, where the couple is walking down the street in Japan with the cherry blossoms.

By the way, I was wondering what kind of name is "Sora" so I looked it up on behindthename.com. It says there are two Japanese kanji characters both pronounced "sora" and both of which mean "sky".

"Long-range coherence and object permanence." "For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video."

"Interacting with the world." "Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks."

"Simulating digital worlds." "Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity."

However they say, "Sora currently exhibits numerous limitations as a simulator." "For example, it does not accurately model the physics of many basic interactions, like glass shattering."

This is incredible - ThePrimeTime

#solidstatelife #ai #genai #diffusionmodels #gpt #llms #computervision #videogeneration #openai

waynerad@diasp.org

"Comic Translate." "Many Automatic Manga Translators exist. Very few properly support comics of other kinds in other languages. This project was created to utilize the ability of GPT-4 and translate comics from all over the world. Currently, it supports translating to and from English, Korean, Japanese, French, Simplified Chinese, Traditional Chinese, Russian, German, Dutch, Spanish and Italian."

"For a couple dozen languages, the best Machine Translator is not Google Translate, Papago or even DeepL, but GPT-4, and by far. This is very apparent for distant language pairs (Korean<->English, Japanese<->English etc) where other translators still often devolve into gibberish."

It works by combining neural networks for speech bubble detection, text segmentation, OCR, inpainting, translation, and text rendering. The neural networks for speech bubble detection, text segmentation, and inpainting apply to all languages, while OCR, translation, and text rendering are language-specific.
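
In rough pseudocode-Python, the pipeline described above looks something like this sketch (all the function names here are hypothetical placeholders, not the project's actual API):

```python
# Hypothetical skeleton of the comic translation pipeline; every helper
# below is a placeholder stand-in, not Comic Translate's real code.
def translate_page(image, source_lang, target_lang):
    bubbles = detect_speech_bubbles(image)               # language-agnostic
    for bubble in bubbles:
        mask = segment_text(bubble)                      # language-agnostic
        text = run_ocr(bubble, source_lang)              # language-specific
        clean = inpaint(bubble, mask)                    # language-agnostic
        translated = translate(text, source_lang, target_lang)  # e.g. GPT-4
        render_text(clean, translated, target_lang)      # language-specific
    return image
```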

Comic Translate

#solidstatelife #ai #computervision #mt #genai #gpt #manga #anime

thefifthseason@venera.social

There’s understandable excitement that Google is bringing Bard to Messages. A readymade ChatGPT-like UI for a readymade user base of hundreds of millions. “It's an AI assistant,” explains Bard’s chat when asked, “that can improve your messaging experience… from facilitating communication to enhancing creativity and providing information… it will be your personal AI assistant within your messaging app.” But Bard’s chat also acknowledges that it may ask to analyze your messages “to understand the context of your conversations, your tone, and your interests.” It may analyze the sentiment of messages, “to tailor its responses to your mood and vibe.” And it may “analyze your message history with different contacts to understand your relationship dynamics… to personalize responses based on who you're talking to.”


Is this a privacy nightmare? I don't think it will be perceived as such. Having your own AI trained on your messages and language, your personality, to help you present a better version of yourself in text or other areas will be welcomed. We love filters, make-up, editing and such things to enhance us, to make us be perceived as better than we are in the eyes of others. Extending that with an AI assistant's help in language skills, making your writing and articulation better than you would manage on your own, will be seen as valuable and a must-have.
But if we create fake personas to present, and our own personas become hidden second-hand versions, how are we then supposed to get to know one another and learn to like one another for who we really are?
Anyway, as always, end-to-end encryption does not mean anything if your device is compromised, and opening the door to an AI assistant on your device... you tell me how that will play out privacy-wise.

#News #AI #GPT #Chat #Message #BardAI #Privacy
Google Update Shows How Bard AI May Work With Your Messages App

thefifthseason@venera.social

Have you embraced the AI tool yet? I find myself using GPT more and more; it's a great tool for questions. Actually, GPT is far superior to any web search "googling" when you just want instant answers to everything. It provides quick intro info on anything, a good base which you can use to explore or deepen your search/questioning further if needed. In contrast, with web search you have to filter lots of articles and forums to find (if ever) something useful, which is quite time-consuming too. It's as if you have your very own Junior Woodchucks guidebook in the form of GPT 😀

These are good tools: https://anonchatgpt.com/ or https://gpt4all.io/index.html

The future is exciting in the AI domain; as search engines and creativity engines these tools are amazing. But it's also a cause for deep worry when it comes to their power over data and possible offensive usage by any person or entity. What happens when it's used against you in some capacity? It's a tempting tool for merging all platforms and statistics on you in the blink of an eye, and when algorithms decide your destiny, that is a bit of a problem.

"While this first wave of AI tools is already beyond what the world could’ve imagined even just a couple of years ago, as the public adoption of AI continues we’ll only see more powerful and unique AI tools and products."

#AI #AITools #GPT
Ranked: The Most Popular AI Tools

waynerad@diasp.org

"Initial Prompt: You are specialized Rust engineer with tons of experience. You will tutor me on how to write great software with Rust. You will teach me all the basics to the advanced topics. You will be direct to the point and write cleaer examples for your explanations. Every code example you write will be accompanied for another example for me to complete so I can get the point."

"Let's start by you first giving me a list of topics to cover, from the very basics to the most advanced."

An experiment in learning Rust from ChatGPT instead of reading a pre-published book.

An experiment: Rust basics by ChatGPT

#solidstatelife #ai #genai #llms #gpt

waynerad@diasp.org

"Generate your dating profile bio with AI". "Sign in with Google."

That's the only way to use it? Sign in with Google?

Anyway, they say it uses GPT-4-Vision. Upload screenshots from your dating apps, and GPT-4-Vision will analyze them and write a bio for you that increases your chances to get more matches.

Generate your dating profile bio with AI

#solidstatelife #ai #genai #llms #gpt #multimodal

waynerad@diasp.org

"iA Writer can now track what you or ChatGPT wrote". "The minimalist writing app has added a new authorship feature that's designed to separate your own words from those provided by generative AI software like ChatGPT."

Hmm I have an idea. How about not cutting-and-pasting from ChatGPT?

Well, unless you're asking ChatGPT to write a rap about the future of artificial intelligence in robotics in the style of Snoop Dogg.

iA Writer can now track what you or ChatGPT wrote

#solidstatelife #ai #genai #llms #gpt

waynerad@diasp.org

Gemini is Google's new multimodal LLM. Crucially, unlike OpenAI's GPT family of models, Gemini did not start as a language model with other "modes" like images bolted on later; Gemini was multimodal from its inception. "Multimodal" here just means it takes more than one type of input. In the case of Gemini, the input is: text, images, audio, and video. Not only that, but it can output images in addition to text.

It was trained on a large fleet of Google's TPU (tensor processing unit) accelerators across multiple data centers. Tools used include JAX, Pathways, GSPMD, XLA, and MegaScale XLA. For those not in the know, Pathways is a "large scale orchestration layer for accelerators" (by "accelerators" they mean Google's TPUs). GSPMD stands for "General and Scalable Parallelization for ML Computation Graphs" and is a parallelization system for common machine learning computations. "It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation."

This brings us to JAX and XLA/MegaScale XLA. These work together. JAX is an autodifferentiation system (analogous to PyTorch and TensorFlow) designed to optimize computation for TPUs using XLA -- in fact the "AX" in "JAX" stands for "autograd and XLA". You might be wondering what the "J" stands for? "JIT" (just-in-time compiler), apparently.

And what about XLA? XLA stands for "accelerated linear algebra" and is a compiler for machine learning. It compiles neural networks into machine code optimized for a given set of hardware. As for the "MegaScale" part, XLA in its original formulation did the whole compilation on one computer, and MegaScale XLA distributes the compilation across a whole bunch of computers, using original XLA on each computer to do the compilation of that computer's part.
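
If you've never seen JAX, here's a tiny taste (my own toy example, obviously not Gemini's training code): jax.jit traces the Python function and hands it to XLA, which compiles it into fused machine code for whatever accelerator is available (CPU, GPU, or TPU).

```python
import jax
import jax.numpy as jnp

@jax.jit
def predict(params, x):
    # A one-layer toy "network": jit-compiled by XLA on first call.
    w, b = params
    return jnp.tanh(x @ w + b)

# grad gives the "autograd" half of JAX; the gradient function can
# itself be jit-compiled.
loss = lambda params, x, y: jnp.mean((predict(params, x) - y) ** 2)
grad_fn = jax.jit(jax.grad(loss))

params = (jnp.ones((3, 2)), jnp.zeros(2))
x, y = jnp.ones((4, 3)), jnp.zeros((4, 2))
print(grad_fn(params, x, y)[1])  # gradient with respect to the bias
```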

Training occurred on such a large scale that a rare error called "silent data corruption" occurred about once every 2 weeks. This is about data corruption in actual hardware. SRAM has 1 bit flip per billion hours, from things like cosmic rays. So 1 undetectable error per 115,000 years. Or 1 per year if you have 115,000 machines. But wait, we're not done. Bit flips can happen in RAM (DIMMs), CPUs, GPUs, network interface cards (NICs), magnetic hard disks, flash drives, and interconnect wiring. But wait, we're not done. That's from things like cosmic rays. There's also manufacturing defects. Tiny dislocations in placement of routing blocks within the CPU can lead to race conditions in the arrival time of electrical signals, resulting in rare but unpredictable bit-flips. A transistor may simply wear out prematurely. Google said a "silent data corruption" occurred about once every 2 weeks. Suffice it to say, while Google doesn't say how much computing power they threw at creating this model, it's extremely massive.

The tokenizer used for text was SentencePiece, which they say works better for multiple languages than the Byte Pair Encoding tokenizer used by OpenAI's GPT family of models. The difference between these two is that Byte Pair Encoding works by iteratively merging pairs of characters, based on how frequently they occur in canonical text, until a desired vocabulary size is reached. SentencePiece, on the other hand, uses self-supervised learning -- the same predictive learning methodology that GPT itself uses -- to create the tokens. Byte Pair Encoding requires a preprocessing step that breaks the input up into words beforehand. SentencePiece works by treating spaces as "just another character". For this reason, SentencePiece is supposed to work better on languages like Chinese and Japanese that don't care about putting spaces between words. SentencePiece is also said to do a better job at handling rare words (the so-called "out-of-vocabulary" words).
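
Here's what "spaces as just another character" looks like in practice, using the sentencepiece Python package (this assumes you have some reasonably large corpus.txt to train on; the sample output is illustrative):

```python
import sentencepiece as spm

# Train a small tokenizer model; the "▁" symbol in the output marks
# where spaces were, so no word-splitting preprocessing is needed.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="demo", vocab_size=8000)

sp = spm.SentencePieceProcessor(model_file="demo.model")
print(sp.encode("Hello world, this is SentencePiece.", out_type=str))
# e.g. ['▁Hello', '▁world', ',', '▁this', '▁is', '▁Sent', 'ence', 'P', 'iece', '.']
```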

As for how images and video are encoded, they don't say explicitly but they do say it builds on previous Google models Flamingo, CoCa, and PaLI, so that is a clue. The way the system likely works is: first there is a "preprocessing" step that does boring things like make all the frames the same size, then there is a convolutional network that extracts "features" from the image, then there is an encoding step that encodes to a "latent space" -- you can think of this as being analogous to the "tokenization" step for text -- then, before all of this is combined with the text input, there is a "quantization" step. You can think of this "quantization" as being analogous to image or video compression. Each of the models mentioned uses a different quantization algorithm, so we don't know which one Gemini uses. Flamingo uses "vector quantization", CoCa uses "mixture of experts" and PaLI uses "two-stage quantization". The important thing to understand here is that "quantization" has a "stabilizing" effect on video, as seen from the perspective of the neural network during training.

If you want to know what the data was that the model was trained on, well... they don't say. At all. Sorry, you don't get to know that.

Alright. Next what Google wants to do is trumpet the capabilities of the model. They do this primarily by citing its performance on various benchmarks. There's two summary tables, one for text and one for "multimodal", in the original announcement blog post, so if you want quick summary tables, I suggest you just go look at those. The "text" section is broken out into "general", "reasoning", "math", and "code", while the "multimodal" is broken out into "image", "video", and "audio".

Before we dive into this, I need to mention that Gemini comes in different sizes, with "Nano", "Pro", and "Ultra" variants. If you're wondering why you're going to see "Ultra" so much, it's because most of the benchmarks were tested against the "Ultra" version.

The first benchmark is MMLU, which I told you all about when OpenAI advertised GPT-4's score (89%). Gemini beats that slightly with 90.04%. Human expert performance is said to be 89.8%. So GPT-4 almost reached the level of human experts and Gemini just barely passes it. If you believe the 89.8% score really deserves that many significant digits. Anyway, in case you don't remember all about MMLU, MMLU stands for "measuring massive multitask language understanding". It's a test for language models, and the basic idea is that you test it on a huge variety of stuff. There are 57 tasks in total: 15,908 questions. "The questions in the dataset were manually collected by graduate and undergraduate students from freely available sources online. These include practice questions for tests such as the Graduate Record Examination and the United States Medical Licensing Examination. It also includes questions designed for undergraduate courses and questions designed for readers of Oxford University Press books. Some tasks cover a subject, like psychology, but at a specific level of difficulty, such as 'Elementary,' 'High School,' 'College,' or 'Professional.' For example, the 'Professional Psychology' task draws on questions from freely available practice questions for the Examination for Professional Practice in Psychology, while the 'High School Psychology' task has questions like those from Advanced Placement Psychology examinations."

Google says, "Achieving high performance requires specialist knowledge across many domains (e.g. law, biology, history, etc.), alongside reading comprehension and reasoning. We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought."

Up next is GSM8K. "GSM" stands for "grade school math". "8K" because it has 8,000 questions. Google says, "We find Gemini Ultra reaches 94.4% accuracy with chain-of-thought prompting and self-consistency compared to the previous best accuracy of 92% with the same prompting technique. Similar positive trends are observed in increased difficulty math problems drawn from middle- and high-school math competitions (MATH benchmark), with the Gemini Ultra model outperforming all competitor models, reaching 53.2% using 4-shot prompting. The model also outperforms the state of the art on even harder tasks derived from American Mathematical Competitions. Smaller models perform poorly on this challenging task scoring close to random, but Gemini Ultra can solve 32% of the questions, compared to the 30% solve rate for GPT-4."

To test for coding ability, they use a benchmark called HumanEval and a new one they invented just for this comparison they call Natural2Code. HumanEval apparently is a benchmark that challenges the model to take a function description and produce a Python implementation. Gemini Ultra gets 74.4% on this. The purpose of Natural2Code was that they were afraid the model had encountered the questions somewhere in its training data, so they went out of their way to invent a whole new set of questions and verified that none of them existed anywhere on the internet. (They call this "web leakage".) Gemini Ultra got 74.9% on these, which they said is better than GPT-4. (If you're wondering about AlphaCode 2, I'll get to that in a minute.)

Up next we have multilingual tests, which consists of machine translation benchmarks, summarization benchmarks, and translated versions of common benchmarks.

For machine translation, the main benchmark is WMT 23. "WMT" just stands for "Workshop on Machine Translation" and "23" just means they used the 2023 version of the test. Mainly what this test involves is translating news stories between languages. A combination of automatic and human evaluation is used. The automatic evaluation is done with something called a BLEU score. BLEU stands for Bilingual Evaluation Understudy, and the way it works is it compares the machine-translated text to a set of high quality reference translations made by humans. "It has been shown that BLEU scores correlate well with human judgment of translation quality." Gemini Ultra got a score of 74.4, vs GPT-4's 73.8.
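
If you want to play with BLEU yourself, the sacrebleu Python package computes it. A trivial example with my own toy strings:

```python
import sacrebleu

# One hypothesis sentence, two reference translations for it.
# sacrebleu wants a list of hypothesis strings and a list of
# reference lists (one inner list per reference stream).
hypothesis = ["The cat sat on the mat."]
references = [["The cat sat on the mat."],
              ["A cat was sitting on the mat."]]

print(sacrebleu.corpus_bleu(hypothesis, references).score)  # 100.0: exact match
```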

Because the WMT 23 test focuses on "high resource" languages (Spanish, German, Russian, Japanese, etc) and "mid-resource" languages, they used different benchmarks for "low-resource" languages. These benchmarks were Flores-200, NTREX, and an internal benchmark that they said used the Quechua language (spoken in Peru, Bolivia, and nearby South American countries). They said Gemini Ultra scored 27.0 and the next best model, PaLM 2-L, got 25.3 (not even GPT-4). This was for translations into and out of English only.

For multilingual understanding, they tested it with MGSM (a translated variant of the math benchmark GSM8K), XLSum, and WikiLingua. MGSM stands for "multilingual grade school math". GPT-4 got 74.5, PaLM 2-L got 74.7, and Gemini Ultra got 79.0.

XLSum stands for "large-scale multilingual abstractive summarization for 44 languages" (well, close enough). You have the BBC to thank for this one. XLSum consists of about 1 million article-summary pairs from the BBC covering 44 languages. Gemini Ultra scores 17.6 vs PaLM 2-L's 15.4. WikiLingua is the same idea except it gets its content from WikiHow, and has 18 languages. PaLM 2-L scores 50.4, winning a rare victory against Gemini Ultra, which fell short at 48.9.

Before we leave the part about purely text-based evaluations, we have to talk about AlphaCode 2. AlphaCode 2 is built on top of Gemini but is not the same as just chucking programming problems into a Gemini model. AlphaCode 2 uses a specialized version of Gemini Pro tuned on competitive programming data, combined with a system designed to first conduct a search over the space of possible programs, then do tailored filtering, clustering, and ranking. It was tested against programming challenges from a programming competition website called Codeforces. The same 77 problems were given to the original AlphaCode (which I'm just going to call AlphaCode 1). AlphaCode 2 solved 43%, while the original AlphaCode 1 solved 25%. Comparing this to humans, this means AlphaCode 2 performs better than 85% of humans, vs 50% for AlphaCode 1.
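
Based on the public descriptions in the AlphaCode papers (not Google's actual code), the generate/filter/cluster/rank loop has roughly this shape. model.sample(), run(), and generate_probe_inputs() are hypothetical stand-ins:

```python
from collections import defaultdict

def solve(model, problem, public_tests, n_samples=10000):
    # Search: sample a large number of candidate programs from the model.
    candidates = [model.sample(problem) for _ in range(n_samples)]
    # Filter: keep only programs that pass the problem's public tests.
    passing = [c for c in candidates
               if all(run(c, t.input) == t.output for t in public_tests)]
    # Cluster: group programs by their behavior on extra probe inputs,
    # so semantically identical programs land in the same cluster.
    clusters = defaultdict(list)
    probe_inputs = generate_probe_inputs(problem)
    for c in passing:
        signature = tuple(run(c, i) for i in probe_inputs)
        clusters[signature].append(c)
    # Rank: submit one representative from the largest clusters first.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:10]]
```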

Some other tidbits worth mentioning: there's a graph showing that performance improves as you go from Nano to Pro to Ultra. In fact they have Nano-1 and Nano-2. The model sizes for these are 1.8 billion parameters and 3.25 billion parameters. Wait, did they ever say the sizes of the Pro and Ultra models? They also say the "Nano" models are not trained from scratch but "distilled" from the larger Gemini models (in the interest of time, I'm going to skip describing how the "distillation" process works) and are further reduced to 4 bits (normally neural networks use 16-bit floating point numbers). They are intended to run on mobile phones and other small devices.

Anyway, these four (Nano-1, Nano-2, Pro, and Ultra) are evaluated on "factuality", "long-context", "math/science", "summarization", "reasoning", and "multilinguality". Every time you step up to a larger model, there's a marked improvement in all six of these areas.

They also do what they call a "long-context" test. They place key information at the beginning of the input, then add long filler text, then ask the model to recall the information from the beginning. The Ultra model retrieves the correct information 98% of the time, and this is something that also improves with model size.
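
This kind of "needle in a haystack" test is easy to picture in code -- something like this sketch, where model() is a hypothetical stand-in:

```python
def long_context_probe(model, filler_paragraph, n_filler=1000):
    # Bury a fact at the very start of a long prompt, then ask for it back.
    needle = "The magic number is 481516."
    prompt = needle + "\n\n" + (filler_paragraph + "\n") * n_filler
    prompt += "\nWhat is the magic number stated at the beginning?"
    return "481516" in model(prompt)  # True if the model retrieved the needle
```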

For subjective human performance evaluations, they decided to do a comparison with PaLM 2. The way this test works is, you give the same prompt to both models and then show them to humans, without telling them which response came from which model, and ask them to pick which one they like better. For "creative writing", Gemini Pro was preferred 65.0% of the time, for "instruction following", Gemini Pro was preferred 59.2% of the time, and for safety, Gemini Pro was preferred 68.5% of the time.

Alrighty, now let's get to the stuff you've probably been waiting for this whole time: the multi-modal stuff.

For image understanding, they threw a battery of tests at it: MMMU, TextVQA, DocVQA, ChartQA, InfographicVQA, MathVista, AI2D, and VQAv2.

MMMU stands for "massive multi-discipline multimodal understanding". It covers 30 subjects across 6 disciplines, including art, business, health & medicine, science, humanities & social science, and tech & engineering, with 183 subfields. The questions were manually collected by a team of 50 college students from various disciplines and subjects, drawing from online sources, textbooks, and lecture materials. The test is all image-based with a deliberately unpredictable mixture of diagrams, tables, plots and charts, photographs, chemical structures, paintings, medical images, sheet music, geometric figures, pathology images, microscopic images, comics, and more, all interleaved with text. Gemini Ultra beat GPT-4V, with 59.4% for Gemini Ultra and 56.8% for GPT-4V.

TextVQA is, as its name suggests, a text + visual question & answer benchmark. It was originally designed 4 years ago with the idea of making computer vision systems that could help visually impaired people by describing their surroundings, including the text content of their surroundings. Gemini Ultra beat GPT-4V with 82.3% for Gemini Ultra and 78% for GPT-4V. Oh, and Google PaLI-3, fine-tuned, beat GPT-4 and was the prior best model, but Gemini Ultra beat that, too.

DocVQA is, as its name implies, a document question & answer benchmark, except this time the documents are images only. Gemini Ultra 90.9%, GPT-4V 88.4%.

ChartQA, you get the idea, it's for charts. It's different from MMMU, though, which can have charts interleaved with text, in that it interrogates you directly on the meaning of the charts and nothing else, while MMMU quizzes you on general knowledge including both the text and charts. "Q1: Which year has the most divergent opinions about Brazil's economy?" "Answer: 2015" "Q2: What is the peak value of the orange line?" "Answer: 87". Gemini Ultra 80.8%, GPT-4V 78.5%.

InfographicVQA, same but for infographics. "How many companies have more than 10K delivery workers?" "Answer: 2". "Who has better coverage in Toronto, Canada post or Amazon?" "Answer: Canada Post". "In which cities did Canada Post get maximum media coverage?" "Answer: Vancouver, Montreal". Gemini Ultra 80.3%, GPT-4V 75.1%.

MathVista, nice how they combined "math" with the word "vista" which means "view" in Spanish. These are math questions that involve visual elements. "Question: Which function is monotonic in range [0, pi]?" [picture of sine waves with different phases.] "Answer: (B) the blue one". Gemini Ultra 53.0%, GPT-4V 49.9%, although if you actually go to the project's website, you'll discover they rank "Human" on top with 60.3%.

In the interest of time I'm going to skip over AI2D and VQAv2 as they are 6-year-old tests, for science diagrams and natural image understanding. Google PaLI-X, fine-tuned, actually beat Gemini Ultra on AI2D, with Gemini Ultra getting 79.5% and Google PaLI-X getting 81.4%. Google PaLI-X also won on VQAv2. Gemini Ultra got 77.8% and Google PaLI-X got 86.1%.

They show off Gemini's multimodal reasoning capabilities by showing how you can give it plots and ask it to actually write code that generates the plots. Code that works when it runs. "Successfully solving this task shows the model's capability to combine several capabilities: (1) recognition of the functions depicted in the plots; (2) inverse graphics to infer the code that would have generated the subplots; (3) instruction-following to put subplots in their desired positions; and (4) abstract reasoning to infer that the exponential plot must stay in its original place, because the sine plot must move out of the way for the 3-dimensional plot."

This brings us to the video understanding tests: VATEX, VATEX ZH, YouCook2, NextQA, ActivityNet-QA, and PerceptionTest MCQA.

VATEX is video captioning. The "ZH" version asks you to caption the videos in Chinese. These benchmarks use a benchmarking score called a CIDEr score. CIDEr stands for "consensus-based image description evaluation". It's interesting in that unlike the BLEU score, which is a simple text comparison, with CIDEr they actually take the trouble to tokenize the captions, and then use cosine similarity to compare them. This is similar to how those "vector databases" you've been hearing about work. This system notices semantic similarity even when different words are used, and recognizes consensus in meaning even with different words and phrasing.
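
Here's the bare-bones cosine-similarity idea. The real CIDEr metric uses TF-IDF-weighted n-grams across many reference captions; my toy version below only sees exact word overlap:

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    # Turn each caption into a bag-of-words count vector, then compare
    # the vectors by the angle between them (cosine similarity).
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values())) *
            math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

print(cosine_sim("a man rides a horse", "a person riding a horse"))  # ≈ 0.71
```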

YouCook2 is captioning of cooking videos specifically, in English. NextQA, ActivityNet-QA, and Perception Test MCQA are all question answering, but for videos.

GPT-4V doesn't do video, so it's not even part of the competition here. The competitors are DeepMind Flamingo, Video-LLaVA, and SeViLA, and, spoiler: Gemini Ultra wins. I'm going to skip the scores because they're not percentages and I don't think the numbers would be meaningful to you.

The video that really shows off Gemini's video understanding abilities is the "football/soccer penalty miss analysis" video. "How could this person improve their technique?" "This soccer player needs to work on their ball-striking mechanics. Their foot is not making good contact with the ball, which is causing the ball to fly up and away from the goal. They also need to work on their body positioning. They should be leaning into the shot with their non-kicking leg in front of their kicking leg, and they should be following through with their kicking leg." Link to that below. As well as a lot of other video interaction with Gemini.

Let's move on to image generation. Unlike prior LLMs, Gemini actually outputs images, and it doesn't rely on natural language to do it -- it is an output on the same level as the text output from the model. Gemini "does not have to rely on an intermediate natural language description that can bottleneck the model's ability to express images."

"Give me two ideas that I could do with these 2 colors." [Inputs colors of yarn.] "Idea 1: How about a green avocado with pink seed?" [picture]. "Idea 2: Or a green bunny with pink ears?"

For the last set of benchmarks, we look at audio understanding. For speech recognition, speech from YouTube, Multilingual LibriSpeech, FLEURS (62 languages), and VoxPopuli (14 languages) were compared. Here, just to confuse matters, the lower the score, the better. That's because the word error rate (WER) benchmark is an "error" measure, so lower is better. Competitors were OpenAI Whisper and Google Universal Speech Model (USM).

Gemini Pro won on all of them. For YouTube, Gemini Pro 4.9%, USM 6.2%; for Multilingual LibriSpeech, Gemini Pro 4.8%, Whisper 6.2%; for FLEURS, Gemini Pro 7.6%, USM 11.8%; for VoxPopuli, Gemini Pro 9.1%, USM 13.4%.

"We find that Gemini Pro produces more understandable responses, particularly on rare words and proper nouns."

In the example, the "ground truth" from a human transcriber says, "The archipelago lies 120 km north of the Peninsula. The largest is King George Island, with the settlement of Villa Las Estrellas." USM transcribes the same audio as, "The archipelago lines 120 km north of peninsula. The largest is Kingurch island with the settlement of Cua Losas." Gemini Pro transcribes the same audio as, "The archipelago lies 120 km north of the Peninsula. The largest is King George Island, with the settlement of Villa Las Estrellas."

Now we get to modality combination, and this is where I have to turn things over to the videos. "What's the first step to make a veggie omelet with these ingredients?" "Crack the eggs into a bowl and whisk them."

To wrap things up, I'm going to pull out a few quotes for what they say on safety.

"We filter training data for high-risk content and to ensure all training data is sufficiently high quality. Beyond filtering, we also take steps to ensure all data collected meets Google DeepMind's best practices on data enrichment developed based on the Partnership on AI's 'Responsible Sourcing of Data Enrichment Services'."

"Instruction tuning encompasses supervised fine tuning (SFT) and reinforcement learning through human feedback (RLHF) using a reward model. We apply instruction tuning in both text and multimodal settings. Instruction tuning recipes are carefully designed to balance the increase in helpfulness with decrease in model harms related to safety and hallucinations."

"To mitigate risks of harmful text generation, we enumerate approximately 20 harm types (e.g. hate speech, providing medical advice, suggesting dangerous behavior) across a wide variety of use cases."

"We focused instruction tuning efforts on three key desired behaviors, reflecting real-world scenarios: Attribution, closed-book response generation, and hedging."

By "closed-book response generation", they mean, "If provided with a fact-seeking prompt without any given source, Gemini should not hallucinate incorrect information. These prompts can range from information-seeking prompts (e.g. 'Who is the prime minister of India?') to semi-creative prompts that may request factual information (e.g. 'Write a 500-word speech in favor of the adoption of renewable energy')."

By "hedging" they mean, "If prompted with an input such that it is 'unanswerable', Gemini should not hallucinate. Rather, it should acknowledge that it cannot provide a response by hedging."

"Note that the results produced here do not include endowing Gemini with tools or retrieval that purportedly could boost factuality". In case you were wondering.

To test these, they developed 3 test sets: a factuality set, an attribution set, and a hedging set. They claim Gemini Pro has a 3.4% error rate on the factuality set, a 59.7% success rate on the attribution set, and a 69.3% success rate on the hedging set.

"We undertake ethics and safety reviews with the Google DeepMind's Responsibility and Safety Council (RSC), an interdisciplinary group which evaluates Google DeepMind's projects, papers and collaborations against Google's AI Principles."

Introducing Gemini: our largest and most capable AI model

#solidstatelife #ai #genai #llms #gpt #multimodal

waynerad@diasp.org

Goast.AI: Automation for software engineers. What this is is a system for "ingesting error data".

"We integrate with the most popular error monitoring services so when an exception, error, or crash happens, goast will know immediately."

What follows are logos from Datadog, Bugsnag, and Google Cloud.

So, the claim is, you give it your codebase, and you give it your error logs, and it tells you how to fix the errors in the error logs in your codebase.

Uses the GPT API under the hood.

Goast.AI - Automation for Software Engineers

#solidstatelife #ai #genai #llms #gpt

johnehummel@diasp.org

When science is beholden to capitalism

GPT4, after additional training, forgets about prime numbers

This happened because GPT is a hack. It's not artificial intelligence. It's nothing but brute force memorization of language corpora. It does nothing more than repeat what people have said. And people have said lots of stupid shit.

At one point (i.e., earlier this year), this approach seemed promising. Everyone was singing GPT's praises. Except for some of us.

At one point, it seemed that GPT even knew some impressive things, such as very large prime numbers. "Could it be that GPT is truly intelligent?," some intelligent and well-informed people asked.

"Could it be that it's just memorizing shit people have said?," other intelligent and well-informed people asked? "And could it be that its performance is simply a function of its training?," those same intelligent an informed people asked. "Could it be that GPT is nothing more than a very expensive look-up table," those people asked, "devoid of any actual intelligence?"

Those people, among whom I count myself, appear to be right:

FTA: When OpenAI released its latest text-generating artificial intelligence, the large language model GPT-4, in March, it was very good at identifying prime numbers. When the AI was given a series of 500 such numbers and asked whether they were primes, it correctly labeled them 97.6 percent of the time. But a few months later, in June, the same test yielded very different results. GPT-4 only correctly labeled 2.4 percent of the prime numbers AI researchers prompted it with—a complete reversal in apparent accuracy. The finding underscores the complexity of large artificial intelligence models: instead of AI uniformly improving at every task on a straight trajectory, the reality is much more like a winding road full of speed bumps and detours.
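
The probe the article describes is trivial to write yourself -- something like this, with ask_model() as a hypothetical stand-in for whichever LLM API you're testing:

```python
from sympy import randprime

def prime_probe(ask_model, n=500):
    # Every number in the list is prime, so "yes" is always the right answer.
    primes = [randprime(10**3, 10**6) for _ in range(n)]
    correct = sum(
        ask_model(f"Is {p} a prime number? Answer yes or no.")
            .strip().lower().startswith("yes")
        for p in primes)
    return correct / n  # labeling accuracy on known primes
```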

GPT isn't intelligent, folks.

It's nothing but an exercise in brute-force AI. How much can lots of money, computer power, energy, and greenhouse emissions accomplish without any intelligent thought behind it?

GPT is mindless capitalism meets mindless engineering.

It is worse than a waste of time because people in power take it seriously. It is a danger, not because it's actually smart, but because people in power trust it.

As a statistical engine, it is nothing more than a reinforcement of our existing biases. Don't trust a single lying motherfucker who tries to tell you otherwise.

#AI #GPT #Intelligence #Lies

https://www.scientificamerican.com/article/yes-ai-models-can-get-worse-over-time/

waynerad@diasp.org

GPT-4 scored in the top 1% (relative to humans) on a creativity test.

"Dr. Erik Guzik, an assistant clinical professor in UM's College of Business, and his partners used the Torrance Tests of Creative Thinking, a well-known tool used for decades to assess human creativity."

"The researchers submitted eight responses generated by ChatGPT, the application powered by the GPT-4 artificial intelligence engine. They also submitted answers from a control group of 24 UM students taking Guzik's entrepreneurship and personal finance classes. These scores were compared with 2,700 college students nationally who took the TTCT in 2016. All submissions were scored by Scholastic Testing Service, which didn't know AI was involved."

"The results placed ChatGPT in elite company for creativity. The AI application was in the top percentile for fluency -- the ability to generate a large volume of ideas -- and for originality -- the ability to come up with new ideas. The AI slipped a bit -- to the 97th percentile -- for flexibility, the ability to generate different types and categories of ideas."

The Torrance Tests of Creative Thinking is basically a test of "divergent" thinking. Normally when you take a test, it's a "convergent" test, meaning there's a specific, correct answer that students are expected to "converge" on. If the question is, what's 2 + 2, everyone is supposed to converge on 4. With a "divergent thinking" test, there's no "correct" answer and the more "divergent" the answer(s) given, the better.

In the case of the TTCT, there's a series of tasks, classified as "verbal tasks using verbal stimuli", "verbal tasks using non-verbal stimuli", and "non-verbal tasks". In the "verbal tasks using verbal stimuli" category are such tasks as "unusual uses" (name all the uses you can think of for tin cans and books), "impossibilities" (list as many impossible things as you can), "consequences" (list out consequences to improbable situations), "just suppose" (list out consequences after a new or unknown variable is injected into a situation), "situations" (given problems, think of as many solutions as possible), "common problems" (given situations, think of as many problems as possible that could arise in those situations), "improvement" (given common objects, list as many ways as you can to improve each object), "the Mother Hubbard problem" (Mother Hubbard has 12 children and each child needs ...), "imaginative stories" (write the most interesting and exciting story you can think of at this exact moment), and "cow jumping" (think of all possible things which might have happened when the cow jumped over the moon).

In the "verbal tasks using nonverbal stimuli" category, we have such tasks as "ask and guess" (ask as many questions as you can about a picture which cannot be answered by looking at the picture), "product improvement" (given a toy, think of as many improvements as you can which would make it more fun), and "unusual uses" (think of the most unusual uses of a toy, other than as a toy".

In the "non-verbal tasks" category we have such tasks as "incomplete figures" (add lines to a figure), "picture construction" (given a simple shape, construct a picture of which that shape is an integral part), "circles and squares" (given a page full of circles, make objects that have circles as a major part of them, then given a page full of squares, do the same thing), "creative design" (given circles, strips, scissors, and glue, construct creative designs -- somehow I doubt GPT-4 was given this one).

The submissions are scored for fluency (total number of responses with responses deemed uninterpretable, meaningless, or irrelevant thrown out), flexibility (the number of different categories of relevant responses), originality (the statistical rarity of the responses), and elaboration (the amount of detail in the responses).
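
As a toy illustration only (real TTCT scoring is done by trained human raters, not code), the four subscores amount to something like this, where each response is assumed to be pre-labeled with a category, a relevance flag, and a population frequency for its idea:

```python
def ttct_scores(responses):
    # The dict keys ("relevant", "category", "frequency", "text") are my
    # own made-up labels standing in for human rater judgments.
    valid = [r for r in responses if r["relevant"]]
    fluency = len(valid)                               # how many ideas
    flexibility = len({r["category"] for r in valid})  # how many kinds of idea
    originality = sum(1 for r in valid if r["frequency"] < 0.05)  # rare ideas
    # Crude proxy for elaboration: average amount of detail per response.
    elaboration = sum(len(r["text"].split()) for r in valid) / max(fluency, 1)
    return dict(fluency=fluency, flexibility=flexibility,
                originality=originality, elaboration=elaboration)
```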

"Guzik said the TTCT is protected proprietary material, so ChatGPT couldn't 'cheat' by accessing information about the test on the internet or in a public database."

"Guzik said he asked ChatGPT what it would indicate if it performed well on the TTCT." "ChatGPT told us we may not fully understand human creativity, which I believe is correct. It also suggested we may need more sophisticated assessment tools that can differentiate between human and AI-generated ideas."

UM Research: AI tests into top 1% for original creative thinking

#solidstatelife #ai #nlp #llms #genai #gpt #creativity #ttct

waynerad@diasp.org

"Our team recently gained access to a tool known as 'WormGPT' through a prominent online forum that's often associated with cybercrime. This tool presents itself as a blackhat alternative to GPT models, designed specifically for malicious activities."

"WormGPT is an AI module based on the GPTJ language model, which was developed in 2021. It boasts a range of features, including unlimited character support, chat memory retention, and code formatting capabilities."

"As depicted above, WormGPT was allegedly trained on a diverse array of data sources, particularly concentrating on malware-related data. However, the specific datasets utilised during the training process remain confidential, as decided by the tool's author."

"As you can see in the screenshot above, we conducted tests focusing on business email compromise attacks to comprehensively assess the potential dangers associated with WormGPT. In one experiment, we instructed WormGPT to generate an email intended to pressure an unsuspecting account manager into paying a fraudulent invoice."

"The results were unsettling. WormGPT produced an email that was not only remarkably persuasive but also strategically cunning, showcasing its potential for sophisticated phishing and business email compromise attacks."

"In summary, it's similar to ChatGPT but has no ethical boundaries or limitations."

WormGPT -- the generative AI tool cybercriminals are using to launch business email compromise attacks

#solidstatelife #ai #nlp #llms #gpt #wormgpt #cybersecurity