#gpt

waynerad@diasp.org

"Generate your dating profile bio with AI". "Sign in with Google."

That's the only way to use it? Sign in with Google?

Anyway, they say it uses GPT-4-Vision. Upload screenshots from your dating apps, and GPT-4-Vision will analyze them and write a bio for you that increases your chances of getting more matches.

Generate your dating profile bio with AI

#solidstatelife #ai #genai #llms #gpt #multimodal

waynerad@diasp.org

"iA Writer can now track what you or ChatGPT wrote". "The minimalist writing app has added a new authorship feature that's designed to separate your own words from those provided by generative AI software like ChatGPT."

Hmm I have an idea. How about not cutting-and-pasting from ChatGPT?

Well, unless you're asking ChatGPT to write a rap about the future of artificial intelligence in robotics in the style of Snoop Dogg.

iA Writer can now track what you or ChatGPT wrote

#solidstatelife #ai #genai #llms #gpt

waynerad@diasp.org

Gemini is Google's new multimodal LLM. Crucially, unlike OpenAI's GPT family of models, Gemini did not start out as a language model with other "modes" like images bolted on later. Gemini was multimodal from its inception. "Multimodal" here just means it takes more than one type of input. In the case of Gemini, the input is: text, images, audio, and video. Not only that, but it can output images in addition to text.

It was trained on a large fleet of Google's TPU (tensor processing unit) accelerators across multiple data centers. Tools used include JAX, Pathways, GSPMD, XLA, and MegaScale XLA. For those not in the know, Pathways is a "large scale orchestration layer for accelerators" (by "accelerators" they mean Google's TPUs). GSPMD stands for "General and Scalable Parallelization for ML Computation Graphs" and is a parallelization system for common machine learning computations. "It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation."

This brings us to JAX and XLA/MegaScale XLA, which work together. JAX is an autodifferentiation system (analogous to PyTorch and TensorFlow) designed to optimize computation for TPUs using XLA -- in fact the "AX" in "JAX" stands for "autograd and XLA". You might be wondering what the "J" stands for? "JIT" (just-in-time compiler), apparently. And what about XLA? XLA stands for "accelerated linear algebra" and is a compiler for machine learning. It compiles neural networks into machine code optimized for a given set of hardware. As for the "MegaScale" part, XLA in its original formulation did the whole compilation on one computer, and MegaScale XLA distributes the compilation across a whole bunch of computers, using original XLA on each computer to compile that computer's part.
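To make the JAX-plus-XLA idea concrete, here's a minimal sketch using standard, public JAX (nothing Gemini-specific; the toy model and data are made up). jax.grad gives you the "autograd" part and jax.jit compiles the whole computation through XLA for whatever accelerator is available:

import jax
import jax.numpy as jnp

# A toy loss: mean squared error of a linear model.
def loss(params, x, y):
    w, b = params
    pred = x @ w + b
    return jnp.mean((pred - y) ** 2)

# "Autograd": build the gradient function automatically.
grad_loss = jax.grad(loss)

# "XLA": compile the gradient computation into fused machine code
# for the available backend (CPU, GPU, or TPU).
fast_grad = jax.jit(grad_loss)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 16))
y = jax.random.normal(key, (128,))
params = (jnp.zeros(16), 0.0)

grads = fast_grad(params, x, y)  # first call triggers XLA compilation

GSPMD and Pathways then come in when you want that same program sharded and orchestrated across many TPUs instead of one device.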

Training occurred on such a large scale that a rare error called "silent data corruption" occurred about once every 2 weeks. This is about data corruption in actual hardware. SRAM has 1 bit flip per billion hours, from things like cosmic rays. So 1 undetectable error per 115,000 years. Or 1 per year if you have 115,000 machines. But wait, we're not done. Bit flips can happen in RAM (DIMMs), CPUs, GPUs, network interface cards (NICs), magnetic hard disks, flash drives, and interconnect wiring. But wait, we're not done. That's from things like cosmic rays. There are also manufacturing defects. Tiny dislocations in placement of routing blocks within the CPU can lead to race conditions in the arrival time of electrical signals, resulting in rare but unpredictable bit-flips. A transistor may simply wear out prematurely. Google said a "silent data corruption" occurred about once every 2 weeks. Suffice it to say, while Google doesn't say how much computing power they threw at creating this model, it's extremely massive.
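Just to sanity-check that arithmetic (my numbers, not Google's): one flip per billion device-hours is roughly one per 114,000 years for a single device, which is indeed about one per year across a fleet of that size.

# One bit flip per 1e9 device-hours (the SRAM figure cited above).
hours_per_year = 24 * 365
print(1e9 / hours_per_year)            # ~114,000 years between flips, per device

# With ~115,000 devices running continuously, expect about one per year.
print(115_000 * hours_per_year / 1e9)  # ~1.0 flips per year across the fleet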

The tokenizer used for text was SentencePiece, which they say works better for multiple languages than the Byte Pair Encoding tokenizer used by OpenAI's GPT family of models. The difference between the two is that Byte Pair Encoding works by iteratively merging pairs of characters, based on how frequently they occur in canonical text, until a desired vocabulary size is reached. SentencePiece, on the other hand, learns its vocabulary directly from raw text with a unigram language model -- a self-supervised objective loosely analogous to the predictive training GPT itself uses. Byte Pair Encoding requires a preprocessing step that breaks the input up into words beforehand. SentencePiece works by treating spaces as "just another character". For this reason, SentencePiece is supposed to work better on languages like Chinese and Japanese that don't put spaces between words. SentencePiece is also said to do a better job at handling rare words (the so-called "out-of-vocabulary" words).
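Here's a minimal sketch of what that looks like in practice with the public sentencepiece library (the corpus file, vocabulary size, and sample output are all made up for illustration; Gemini's actual tokenizer settings aren't public):

import sentencepiece as spm

# Train a small SentencePiece model directly on raw text. No word-splitting
# preprocessing is needed: spaces are treated as just another symbol
# (shown as the "▁" marker in the resulting pieces).
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # hypothetical plain-text training file
    model_prefix="toy_sp",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("Language models tokenize text.", out_type=str))
# e.g. ['▁Language', '▁model', 's', '▁token', 'ize', '▁text', '.']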

As for how images and video are encoded, they don't say explicitly but they do say it builds on previous Google models Flamingo, CoCa, and PaLI, so that is a clue. The way the system likely works is: first there is a "preprocessing" step that does boring things like make all the frames the same size, then there is a convolutional network that extracts "features" from the image, then there is an encoding step that encodes to a "latent space" -- you can think of this as being analogous to the "tokenization" step for text -- then, before all of this is combined with the text input, there is a "quantization" step. You can think of this "quantization" as being analogous to image or video compression. Each of the models mentioned uses a different quantization algorithm, so we don't know which one Gemini uses. Flamingo uses "vector quantization", CoCa uses "mixture of experts" and PaLI uses "two-stage quantization". The important thing to understand here is that "quantization" has a "stabilizing" effect on video, as seen from the perspective of the neural network during training.
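Since Google doesn't spell out the quantization step, here's a generic sketch of vector quantization (the approach attributed to Flamingo above), just to show the idea of snapping continuous image features onto a fixed codebook of discrete "visual tokens". The dimensions and the random codebook are invented for illustration:

import numpy as np

rng = np.random.default_rng(0)

# Pretend these are continuous feature vectors from a vision encoder:
# 64 image patches, each a 256-dimensional embedding.
features = rng.normal(size=(64, 256))

# A codebook of 512 discrete "visual tokens" (in a real model, learned).
codebook = rng.normal(size=(512, 256))

# Vector quantization: replace each feature with its nearest codebook entry.
d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (64, 512)
codes = d.argmin(axis=1)     # discrete token ids, shape (64,)
quantized = codebook[codes]  # the "stabilized" features passed onward

print(codes[:8])  # discrete ids, playing the same role tokens play for text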

If you want to know what the data was that the model was trained on, well... they don't say. At all. Sorry, you don't get to know that.

Alright. Next what Google wants to do is trumpet the capabilities of the model. They do this primarily by citing its performance on various benchmarks. There are two summary tables, one for text and one for "multimodal", in the original announcement blog post, so if you want quick summary tables, I suggest you just go look at those. The "text" section is broken out into "general", "reasoning", "math", and "code", while the "multimodal" section is broken out into "image", "video", and "audio".

Before we dive into this, I need to mention that Gemini comes in different sizes, with "Nano", "Pro", and "Ultra" variants. If you're wondering why you're going to see "Ultra" so much, it's because most of the benchmarks were tested against the "Ultra" version.

The first benchmark is MMLU, which I told you all about when OpenAI advertised GPT-4's score (89%). Gemini beats that slightly with 90.04%. Human expert performance is said to be 89.8%. So GPT-4 almost reached the level of human experts and Gemini just barely passes it. If you believe the 89.8% score really deserves that many significant digits. Anyway, in case you don't remember all about MMLU, MMLU stands for "measuring massive multitask language understanding". It's a test for language models, and the basic idea is that you test it on a huge variety of stuff. There are 57 tasks in total: 15,908 questions. "The questions in the dataset were manually collected by graduate and undergraduate students from freely available sources online. These include practice questions for tests such as the Graduate Record Examination and the United States Medical Licensing Examination. It also includes questions designed for undergraduate courses and questions designed for readers of Oxford University Press books. Some tasks cover a subject, like psychology, but at a specific level of difficulty, such as 'Elementary,' 'High School,' 'College,' or 'Professional.' For example, the 'Professional Psychology' task draws on questions from freely available practice questions for the Examination for Professional Practice in Psychology, while the 'High School Psychology' task has questions like those from Advanced Placement Psychology examinations."

Google says, "Achieving high performance requires specialist knowledge across many domains (e.g. law, biology, history, etc.), alongside reading comprehension and reasoning. We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought."

Up next is GSM8K. "GSM" stands for "grade school math". "8K" because it has 8,000 questions. Google says, "We find Gemini Ultra reaches 94.4% accuracy with chain-of-thought prompting and self-consistency compared to the previous best accuracy of 92% with the same prompting technique. Similar positive trends are observed in increased difficulty math problems drawn from middle- and high-school math competitions (MATH benchmark), with the Gemini Ultra model outperforming all competitor models, reaching 53.2% using 4-shot prompting. The model also outperforms the state of the art on even harder tasks derived from American Mathematical Competitions. Smaller models perform poorly on this challenging task scoring close to random, but Gemini Ultra can solve 32% of the questions, compared to the 30% solve rate for GPT-4."

To test for coding ability, they use a benchmark called HumanEval and a new one they invented just for this comparison, which they call Natural2Code. HumanEval apparently is a benchmark that challenges the model to take a function description and produce a Python implementation. Gemini Ultra gets 74.4% on this. The purpose of Natural2Code was that they were afraid the model had encountered the HumanEval questions somewhere in its training data (they call this "web leakage"), so they went out of their way to invent a whole new set of questions and verify that none of them existed anywhere on the internet. Gemini Ultra got 74.9% on these, which they said is better than GPT-4. (If you're wondering about AlphaCode 2, I'll get to that in a minute.)

Up next we have multilingual tests, which consist of machine translation benchmarks, summarization benchmarks, and translated versions of common benchmarks.

For machine translation, the main benchmark is WMT 23. "WMT" just stands for "Workshop on Machine Translation" and "23" just means they used the 2023 version of the test. Mainly what this test involves is translating news stories between languages. A combination of automatic and human evaluation is used. The automatic evaluation is done with something called a BLEU score. BLEU stands for Bilingual Evaluation Understudy, and the way it works is it compares the machine-translated text to a set of high quality reference translations made by humans. "It has been shown that BLEU scores correlate well with human judgment of translation quality." Gemini Ultra got a score of 74.4, vs GPT-4's 73.8.
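If you want to see what computing a BLEU score looks like, the public sacrebleu library does it from a list of hypotheses and reference translations. The sentences below are invented; this is just the metric, not the actual WMT 23 evaluation pipeline:

import sacrebleu

# Machine translations to score (hypothetical).
hypotheses = [
    "The cat sat on the mat.",
    "Parliament approved the budget on Tuesday.",
]

# One set of human reference translations, aligned with the hypotheses.
references = [[
    "The cat was sitting on the mat.",
    "The parliament approved the budget on Tuesday.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU on a 0-100 scale; higher is better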

Because the WMT 23 test focuses on "high resource" languages (Spanish, German, Russian, Japanese, etc) and "mid-resource" languages, they used different benchmarks for "low-resource" languages. These benchmarks were Flores-200, NTREX, and an internal benchmark that they said used the Quechua language (spoken in Peru, Bolivia, and other Andean countries in South America). They said Gemini Ultra scored 27.0 and the next best model, PaLM 2-L, got 25.3 (not even GPT-4). This was for translations into and out of English only.

For multilingual understanding, they tested it with MGSM (a translated variant of the math benchmark GSM8K), XLSum, and WikiLingua. MGSM stands for "multilingual grade school math". GPT-4 got 74.5, PaLM 2-L got 74.7, and Gemini Ultra got 79.0.

XLSum stands for "large-scale multilingual abstractive summarization for 44 languages" (well, close enough). You have the BBC to thank for this one. XLSum consists of about 1 million article-summary pairs from the BBC covering 44 languages. Gemini Ultra scores 17.6 vs PaLM 2-L's 15.4. WikiLingua is the same idea except it gets its content from WikiHow, and has 18 languages. PaLM 2-L scores 50.4, winning a rare victory against Gemini Ultra, which fell short at 48.9.

Before we leave the part about purely text-based evaluations, we have to talk about AlphaCode 2. AlphaCode 2 is built on top of Gemini but is not the same as just chucking programming problems into Gemini. AlphaCode 2 uses a specialized version of Gemini Pro tuned on competitive programming data, combined with a system designed to first conduct a search over the space of possible programs, then do tailored filtering, clustering, and ranking. It was tested against programming challenges from a programming competition website called Codeforces. The same 77 problems were given to the original AlphaCode (which I'm just going to call AlphaCode 1). AlphaCode 2 solved 43%, while the original AlphaCode 1 solved 25%. Comparing this to humans, this means AlphaCode 2 is better than 85% of humans, vs 50% for AlphaCode 1.
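Google hasn't released the AlphaCode 2 pipeline, but the shape they describe -- generate lots of candidate programs, filter them by running them, cluster ones that behave the same, then rank -- looks roughly like this sketch. The generate, run_tests, and score functions are hypothetical stand-ins:

from collections import defaultdict

def alphacode_style_solve(problem, generate, run_tests, score,
                          n_samples=1000, n_submissions=10):
    # 1. Sample a large number of candidate programs from the tuned model.
    candidates = [generate(problem) for _ in range(n_samples)]

    # 2. Filter: run_tests(code) returns a tuple of outputs on test inputs,
    #    or None if the program crashes or fails the public examples.
    passing = [(c, run_tests(c)) for c in candidates]
    passing = [(c, out) for c, out in passing if out is not None]

    # 3. Cluster candidates that produce identical outputs (same behavior).
    clusters = defaultdict(list)
    for code, outputs in passing:
        clusters[outputs].append(code)

    # 4. Rank clusters by size and a learned score, submit one per cluster.
    ranked = sorted(clusters.values(),
                    key=lambda cs: (len(cs), max(score(c) for c in cs)),
                    reverse=True)
    return [cluster[0] for cluster in ranked[:n_submissions]]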

Some other tidbits worth mentioning: there's a graph showing that performance improves as you go from Nano to Pro to Ultra. In fact there are two Nano models, Nano-1 and Nano-2, with 1.8 billion and 3.25 billion parameters respectively. Wait, did they ever say the sizes of the Pro and Ultra models? They also say the "Nano" models are not trained from scratch but "distilled" from the larger Gemini models (in the interest of time, I'm going to skip describing how the "distillation" process works) and are further reduced to 4 bits (normally neural networks use 16-bit floating point numbers). They are intended to run on mobile phones and other small devices.

Anyway, these four (Nano-1, Nano-2, Pro, and Ultra) are evaluated on "factuality", "long-context", "math/science", "summarization", "reasoning", and "multilinguality". Every time you step up to a larger model, there's a marked improvement in all six of these areas.

They also do what they call a "long-context" test. They place key text at the beginning of the text, then add long filler text, then ask it to remember the information at the beginning. The Ultra model retrieves the correct information 98% of the time, and this is something that also improves with model size.
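The paper doesn't give the exact prompt, but a retrieval test of that shape is easy to construct (the key fact and filler text below are invented):

def build_long_context_probe(key_fact, filler_paragraph, n_filler=500):
    # Key fact first, then a long run of filler, then a question about the fact.
    prompt = key_fact + "\n\n"
    prompt += "\n\n".join([filler_paragraph] * n_filler)
    prompt += "\n\nQuestion: What was the access code given at the start?"
    return prompt

probe = build_long_context_probe(
    key_fact="Note: the access code for the archive is 7-42-19.",
    filler_paragraph="The weather report for the region was unremarkable. " * 5,
)
# Feed `probe` to the model and check whether "7-42-19" comes back.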

For subjective human performance evaluations, they decided to do a comparison with PaLM 2. The way this test works is, you give the same prompt to both models and then show them to humans, without telling them which response came from which model, and ask them to pick which one they like better. For "creative writing", Gemini Pro was preferred 65.0% of the time, for "instruction following", Gemini Pro was preferred 59.2% of the time, and for safety, Gemini Pro was preferred 68.5% of the time.

Alrighty, now let's get to the stuff you've probably been waiting for this whole time: the multi-modal stuff.

For image understanding, they threw a battery of tests at it: MMMU, TextVQA, DocVQA, ChartQA, InfographicVQA, MathVista, AI2D, and VQAv2.

MMMU stands for "massive multi-discipline multimodal understanding". It covers 30 subjects across 6 disciplines, including art, business, health & medicine, science, humanities & social science, and tech & engineering, with 183 subfields. The questions were manually collected by a team of 50 college students from various disciplines and subjects, drawing from online sources, textbooks, and lecture materials. The test is all image-based with a deliberately unpredictable mixture of diagrams, tables, plots and charts, photographs, chemical structures, paintings, medical images, sheet music, geometric figures, pathology images, microscopic images, comics, and more, all interleaved with text. Gemini Ultra beat GPT-4V, with 59.4% for Gemini Ultra and 56.8% for GPT-4V.

TextVQA is, as its name suggests, a text + visual question & answer benchmark. It was originally designed 4 years ago with the idea of making computer vision systems that could help visually impaired people by describing their surroundings, including the text content of their surroundings. Gemini Ultra beat GPT-4V with 82.3% for Gemini Ultra and 78% for GPT-4V. Oh, and Google PaLI-3, fine-tuned, beat GPT-4 and was the prior best model, but Gemini Ultra beat that, too.

DocVQA is, as its name implies, a document question & answer benchmark, except this time the documents are images only. Gemini Ultra 90.9%, GPT-4V 88.4%.

ChartQA, you get the idea, it's for charts. It's different from MMMU, though, which can have charts interleaved with text, in that it interrogates you directly on the meaning of the charts and nothing else, while MMMU quizzes you on general knowledge including both the text and charts. "Q1: Which year has the most divergent opinions about Brazil's economy?" "Answer: 2015" "Q2: What is the peak value of the orange line?" "Answer: 87". Gemini Ultra 80.8%, GPT-4V 78.5%.

InfographicVQA, same but for infographics. "How many companies have more than 10K delivery workers?" "Answer: 2". "Who has better coverage in Toronto, Canada post or Amazon?" "Answer: Canada Post". "In which cities did Canada Post get maximum media coverage?" "Answer: Vancouver, Montreal". Gemini Ultra 80.3%, GPT-4V 75.1%

MathVista, nice how they combined "math" with the word "vista" which means "view" in Spanish. These are math questions that involve visual elements. "Question: Which function is monotonic in range [0, pi]?" [picture of sine waves with different phases.] "Answer: (B) the blue one". Gemini Ultra 53.0%, GPT-4V 49.9%, although if you actually go to the project's website, you'll discover they rank "Human" on top with 60.3%.

In the interest of time I'm going to skip over AI2D and VQAv2 as they are 6-year-old tests, for science diagrams and natural image understanding. Google PaLI-X, fine-tuned, actually beat Gemini Ultra on AI2D, with Gemini Ultra getting 79.5% and Google PaLI-X getting 81.4%. Google PaLI-X also won on VQAv2: Gemini Ultra got 77.8% and Google PaLI-X got 86.1%.

They show off Gemini's multimodal reasoning capabilities by showing how you can give it plots and ask it to actually write code that generates those plots -- code that works when you run it. "Successfully solving this task shows the model's capability to combine several capabilities: (1) recognition of the functions depicted in the plots; (2) inverse graphics to infer the code that would have generated the subplots; (3) instruction-following to put subplots in their desired positions; and (4) abstract reasoning to infer that the exponential plot must stay in its original place, because the sine plot must move out of the way for the 3-dimensional plot."

This brings us to the video understanding tests: VATEX, VATEX ZH, YouCook2, NextQA, ActivityNet-QA, and PerceptionTest MCQA.

VATEX is video captioning. The "ZH" version asks you to caption the videos in Chinese. These benchmarks use a scoring metric called CIDEr, which stands for "consensus-based image description evaluation". It's interesting in that, unlike the BLEU score, which is a fairly direct text-overlap comparison, CIDEr takes the trouble to tokenize the captions into n-grams, weight them by TF-IDF, and then compare them using cosine similarity. This is similar to how those "vector databases" you've been hearing about work. The result is that it recognizes consensus in meaning even when the captions use different words and phrasing.
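Here's a toy version of that idea (not the real CIDEr metric, which TF-IDF-weights n-grams against a whole reference corpus and averages over several n-gram lengths, but the same basic machinery). The captions are invented:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A hypothetical candidate caption and two human reference captions.
candidate = ["a man kicks a football toward the goal"]
references = [
    "a soccer player takes a shot at the goal",
    "a man strikes the ball towards the net",
]

# TF-IDF vectors over unigrams through 4-grams, roughly in the spirit of CIDEr.
vec = TfidfVectorizer(ngram_range=(1, 4)).fit(candidate + references)
cand_v = vec.transform(candidate)
ref_v = vec.transform(references)

# Cosine similarity against each reference; CIDEr averages scores like these.
print(cosine_similarity(cand_v, ref_v))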

YouCook2 is captioning of cooking videos specifically, in English. NextQA, ActivityNet-QA, and Perception Test MCQA are all question answering, but for videos.

GPT-4V doesn't do video, so it's not even part of the competition here. The competitors are DeepMind Flamingo, Video-LLaVA, and SeViLA, and, spoiler: Gemini Ultra wins. I'm going to skip the scores because they're not percentages and I don't think the numbers would be meaningful to you.

The video that really shows off Gemini's video understanding abilities is the "football/soccer penalty miss analysis" video. "How could this person improve their technique?" "This soccer player needs to work on their ball-striking mechanics. Their foot is not making good contact with the ball, which is causing the ball to fly up and away from the goal. They also need to work on their body positioning. They should be leaning into the shot with their non-kicking leg in front of their kicking leg, and they should be following through with their kicking leg." Link to that below. As well as a lot of other video interaction with Gemini.

Let's move on to image generation. Unlike prior LLMs, Gemini actually outputs images, and it doesn't rely on natural language to do it -- it is an output on the same level as the text output from the model. Gemini "does not have to rely on an intermediate natural language description that can bottleneck the model's ability to express images."

"Give me two ideas that I could do with these 2 colors." [Inputs colors of yarn.] "Idea 1: How about a green avocado with pink seed?" [picture]. "Idea 2: Or a green bunny with pink ears?"

For the last set of benchmarks, we look at audio understanding. For speech recognition, performance was compared on speech from YouTube, Multilingual LibriSpeech, FLEURS (62 languages), and VoxPopuli (14 languages). Here, just to confuse matters, the lower the score the better. That's because the metric is word error rate (WER), which is an "error" measure, so lower is better. Competitors were OpenAI's Whisper and Google's Universal Speech Model (USM).

Gemini Pro won on all of them. For YouTube, Gemini Pro 4.9%, USM 6.2%; for Multilingual LibriSpeech, Gemini Pro 4.8%, Whisper 6.2%; for FLEURS, Gemini Pro 7.6%, USM 11.8%; for VoxPopuli, Gemini Pro 9.1%, USM 13.4%.
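For reference, word error rate is just word-level edit distance divided by the number of words in the reference transcript. The public jiwer library computes it directly; here I'm reusing the USM example quoted below:

import jiwer

reference = ("the largest is king george island "
             "with the settlement of villa las estrellas")
hypothesis = ("the largest is kingurch island "
              "with the settlement of cua losas")

# WER = (substitutions + deletions + insertions) / reference word count.
print(jiwer.wer(reference, hypothesis))  # about 0.38 here; lower is better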

"We find that Gemini Pro produces more understandable responses, particularly on rare words and proper nouns."

In the example, the "ground truth" from a human transcriber says, "The archipelago lies 120 km north of the Peninsula. The largest is King George Island, with the settlement of Villa Las Estrellas." USM transcribes the same audio as, "The archipelago lines 120 km north of peninsula. The largest is Kingurch island with the settlement of Cua Losas." Gemini Pro transcribes the same audio as, "The archipelago lies 120 km north of the Peninsula. The largest is King George Island, with the settlement of Villa Las Estrellas."

Now we get to modality combination, and this is where I have to turn things over to the videos. "What's the first step to make a veggie omelet with these ingredients?" "Crack the eggs into a bowl and whisk them."

To wrap things up, I'm going to pull out a few quotes for what they say on safety.

"We filter training data for high-risk content and to ensure all training data is sufficiently high quality. Beyond filtering, we also take steps to ensure all data collected meets Google DeepMind's best practices on data enrichment developed based on the Partnership on AI's 'Responsible Sourcing of Data Enrichment Services'."

"Instruction tuning encompasses supervised fine tuning (SFT) and reinforcement learning through human feedback (RLHF) using a reward model. We apply instruction tuning in both text and multimodal settings. Instruction tuning recipes are carefully designed to balance the increase in helpfulness with decrease in model harms related to safety and hallucinations."

"To mitigate risks of harmful text generation, we enumerate approximately 20 harm types (e.g. hate speech, providing medical advice, suggesting dangerous behavior) across a wide variety of use cases."

"We focused instruction tuning efforts on three key desired behaviors, reflecting real-world scenarios: Attribution, closed-book response generation, and hedging."

By "closed-book response generation", they mean, "If provided with a fact-seeking prompt without any given source, Gemini should not hallucinate incorrect information. These prompts can range from information-seeking prompts (e.g. 'Who is the prime minister of India?') to semi-creative prompts that may request factual information (e.g. 'Write a 500-word speech in favor of the adoption of renewable energy')."

By "hedging" they mean, "If prompted with an input such that it is 'unanswerable', Gemini should not hallucinate. Rather, it should acknowledge that it cannot provide a response by hedging."

"Note that the results produced here do not include endowing Gemini with tools or retrieval that purportedly could boost factuality". In case you were wondering.

To test these, they developed 3 test sets: a factuality set, an attribution set, and a hedging set. They claim Gemini Pro has a 3.4% error rate on the factuality set, a 59.7% success rate on the attribution set, and a 69.3% success rate on the hedging set.

"We undertake ethics and safety reviews with the Google DeepMind's Responsibility and Safety Council (RSC), an interdisciplinary group which evaluates Google DeepMind's projects, papers and collaborations against Google's AI Principles."

Introducing Gemini: our largest and most capable AI model

#solidstatelife #ai #genai #llms #gpt #multimodal

waynerad@diasp.org

Goast.AI: Automation for software engineers. What this is: a system for "ingesting error data".

"We integrate with the most popular error monitoring services so when an exception, error, or crash happens, goast will know immediately."

What follows is a row of logos from Datadog, Bugsnag, and Google Cloud.

So, the claim is, you give it your codebase, and you give it your error logs, and it tells you how to fix the errors in the error logs in your codebase.

Uses the GPT API under the hood.

Goast.AI - Automation for Software Engineers

#solidstatelife #ai #genai #llms #gpt

johnehummel@diasp.org

When science is beholden to capitalism

GPT4, after additional training, forgets about prime numbers

This happened because GPT is a hack. It's not artificial intelligence. It's nothing but brute force memorization of language corpora. It does nothing more than repeat what people have said. And people have said lots of stupid shit.

At one point (i.e., earlier this year), this approach seemed promising. Everyone was singing GPT's praises. Except for some of us.

At one point, it seemed that GPT even knew some impressive things, such as very large prime numbers. "Could it be that GPT is truly intelligent?," some intelligent and well-informed people asked.

"Could it be that it's just memorizing shit people have said?," other intelligent and well-informed people asked? "And could it be that its performance is simply a function of its training?," those same intelligent an informed people asked. "Could it be that GPT is nothing more than a very expensive look-up table," those people asked, "devoid of any actual intelligence?"

Those people, among whom I count myself, appear to be right:

FTA: When OpenAI released its latest text-generating artificial intelligence, the large language model GPT-4, in March, it was very good at identifying prime numbers. When the AI was given a series of 500 such numbers and asked whether they were primes, it correctly labeled them 97.6 percent of the time. But a few months later, in June, the same test yielded very different results. GPT-4 only correctly labeled 2.4 percent of the prime numbers AI researchers prompted it with—a complete reversal in apparent accuracy. The finding underscores the complexity of large artificial intelligence models: instead of AI uniformly improving at every task on a straight trajectory, the reality is much more like a winding road full of speed bumps and detours.

GPT isn't intelligent, folks.

It's nothing but an exercise in brute-force AI. How much can lots of money, computer power, energy, and greenhouse emissions accomplish without any intelligent thought behind it?

GPT is mindless capitalism meets mindless engineering.

It is worse than a waste of time because people in power take it seriously. It is a danger, not because it's actually smart, but because people in power trust it.

As a statistical engine, it is nothing more than a reinforcement of our existing biases. Don't trust a single lying motherfucker who tries to tell you otherwise.

#AI #GPT #Intelligence #Lies

https://www.scientificamerican.com/article/yes-ai-models-can-get-worse-over-time/

waynerad@diasp.org

GPT-4 scored in the top 1% (relative to humans) on a creativity test.

"Dr. Erik Guzik, an assistant clinical professor in UM's College of Business, and his partners used the Torrance Tests of Creative Thinking, a well-known tool used for decades to assess human creativity."

"The researchers submitted eight responses generated by ChatGPT, the application powered by the GPT-4 artificial intelligence engine. They also submitted answers from a control group of 24 UM students taking Guzik's entrepreneurship and personal finance classes. These scores were compared with 2,700 college students nationally who took the TTCT in 2016. All submissions were scored by Scholastic Testing Service, which didn't know AI was involved."

"The results placed ChatGPT in elite company for creativity. The AI application was in the top percentile for fluency -- the ability to generate a large volume of ideas -- and for originality -- the ability to come up with new ideas. The AI slipped a bit -- to the 97th percentile -- for flexibility, the ability to generate different types and categories of ideas."

The Torrance Tests of Creative Thinking is basically a test of "divergent" thinking. Normally when you take a test, it's a "convergent" test, meaning there's a specific, correct answer that students are expected to "converge" on. If the question is, what's 2 + 2, everyone is supposed to converge on 4. With a "divergent thinking" test, there's no "correct" answer and the more "divergent" the answer(s) given, the better.

In the case of the TTCT, there's a series of tasks, classified as "verbal tasks using verbal stimuli", "verbal tasks using non-verbal stimuli", and "non-verbal tasks". In the "verbal tasks using verbal stimuli" category are such tasks as "unusual uses" (name all the uses you can think of for tin cans and books), "impossibilities" (list as many impossible things as you can), "consequences" (list out consequences to improbable situations), "just suppose" (list out consequences after a new or unknown variable is injected into a situation), "situations" (given problems, think of as many solutions as possible), "common problems" (given situations, think of as many problems as possible that could arise in those situations), "improvement" (given common objects, list as many ways as you can to improve each object), "the Mother Hubbard problem" (Mother Hubbard has 12 children and each child needs ...), "imaginative stories" (write the most interesting and exciting story you can think of at this exact moment), and "cow jumping" (think of all possible things which might have happened when the cow jumped over the moon).

In the "verbal tasks using nonverbal stimuli" category, we have such tasks as "ask and guess" (ask as many questions as you can about a picture which cannot be answered by looking at the picture), "product improvement" (given a toy, think of as many improvements as you can which would make it more fun), and "unusual uses" (think of the most unusual uses of a toy, other than as a toy".

In the "non-verbal tasks" category we have such tasks as "incomplete figures" (add lines to a figure), "picture construction" (given a simple shape, construct a picture of which that shape is an integral part), "circles and squares" (given a page full of circles, make objects that have circles as a major part of them, then given a page full of squares, do the same thing), "creative design" (given circles, strips, scissors, and glue, construct creative designs -- somehow I doubt GPT-4 was given this one).

The submissions are scored for fluency (total number of responses with responses deemed uninterpretable, meaningless, or irrelevant thrown out), flexibility (the number of different categories of relevant responses), originality (the statistical rarity of the responses), and elaboration (the amount of detail in the responses).
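A toy illustration of those four dimensions (this is my own sketch, not Scholastic Testing Service's actual rubric; the category and rarity lookups are hypothetical):

def score_responses(responses, category_of, rarity_of):
    # `responses` is a list of (response_text, detail_count) pairs that have
    # already had uninterpretable or irrelevant answers thrown out.
    fluency = len(responses)                                   # how many ideas
    flexibility = len({category_of(r) for r, _ in responses})  # how many kinds
    originality = sum(rarity_of(r) for r, _ in responses)      # how rare
    elaboration = sum(detail for _, detail in responses)       # how detailed
    return dict(fluency=fluency, flexibility=flexibility,
                originality=originality, elaboration=elaboration)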

"Guzik said the TTCT is protected proprietary material, so ChatGPT couldn't 'cheat' by accessing information about the test on the internet or in a public database."

"Guzik said he asked ChatGPT what it would indicate if it performed well on the TTCT." "ChatGPT told us we may not fully understand human creativity, which I believe is correct. It also suggested we may need more sophisticated assessment tools that can differentiate between human and AI-generated ideas."

UM Research: AI tests into top 1% for original creative thinking

#solidstatelife #ai #nlp #llms #genai #gpt #creativity #ttct

waynerad@diasp.org

"Our team recently gained access to a tool known as 'WormGPT' through a prominent online forum that's often associated with cybercrime. This tool presents itself as a blackhat alternative to GPT models, designed specifically for malicious activities."

"WormGPT is an AI module based on the GPTJ language model, which was developed in 2021. It boasts a range of features, including unlimited character support, chat memory retention, and code formatting capabilities."

"As depicted above, WormGPT was allegedly trained on a diverse array of data sources, particularly concentrating on malware-related data. However, the specific datasets utilised during the training process remain confidential, as decided by the tool's author."

"As you can see in the screenshot above, we conducted tests focusing on business email compromise attacks to comprehensively assess the potential dangers associated with WormGPT. In one experiment, we instructed WormGPT to generate an email intended to pressure an unsuspecting account manager into paying a fraudulent invoice."

"The results were unsettling. WormGPT produced an email that was not only remarkably persuasive but also strategically cunning, showcasing its potential for sophisticated phishing and business email compromise attacks."

"In summary, it's similar to ChatGPT but has no ethical boundaries or limitations."

WormGPT -- the generative AI tool cybercriminals are using to launch business email compromise attacks

#solidstatelife #ai #nlp #llms #gpt #wormgpt #cybersecurity

waynerad@diasp.org

The AI tutor Khanmigo, demonstrated by Sal Khan. Rather than AI destroying education, AI will turbocharge it, by giving every student on the planet an artificially intelligent but amazing personal tutor. And give every teacher on the planet an amazing, artificially intelligent teaching assistant. According to Khan, 1-on-1 tutoring boosts educational results by 2 sigmas, but most students have not had access to a 1-on-1 tutor. That's about to change.

He demos a simple math equation solving problem and shows Khanmigo is not a cheating tool. When the student says, "Tell me the answer," it says, "I'm your tutor. What do you think is the next step for solving the problem?"

If the student makes a mistake, not only does it notice the mistake, it asks the student to explain their reasoning. It guesses what is probably the misconception in that student's mind (they didn't use the distributive property).

He demos a computer programming exercise on Khan Academy to show it understands the code and the full context of what the student is doing. (The code draws ellipses, but it understands that those ellipses combine to draw clouds.)

It can engage in Socratic dialogue, if the student asks, for example, "the age-old question, 'Why do I need to learn this?'". It can connect the lesson to knowledge outside the lesson. It can act as a school guidance counselor.

Rather than writing "for" you it can write "with" you and teach writing.

In "teacher" mode, when you say, "Tell me the answer", instead of refusing and going into tutoring mode, not only will it tell you the answer but it will give you explanations and advice on how best to teach it. As such it helps teachers create lesson plans and progress reports, and figure out how to grade the students.

How AI could save (not destroy) education | Sal Khan | TED

#solidstatelife #ai #genai #lmms #gpt #aieducation #khanacademy

waynerad@diasp.org

Google: "We have no moat, and neither does OpenAI". Allegedly leaked internal Google document.

"We aren't positioned to win this arms race and neither is OpenAI. While we've been squabbling, a third faction has been quietly eating our lunch."

"I'm talking, of course, about open source. Plainly put, they are lapping us. Things we consider 'major open problems' are solved and in people's hands today. Just to name a few:"

"LLMs on a Phone: People are running foundation models on a Pixel 6 at 5 tokens / sec."

"Scalable Personal AI: You can finetune a personalized AI on your laptop in an evening."

"Responsible Release: This one isn't 'solved' so much as 'obviated'. There are entire websites full of art models with no restrictions whatsoever, and text is not far behind."

"Multimodality: The current multimodal ScienceQA SOTA was trained in an hour."

"While our models still hold a slight edge in terms of quality, the gap is closing astonishingly quickly."

Google "We have no moat, and neither does OpenAI"

#solidstatelife #ai #genai #lmms #gpt #openai #google

waynerad@diasp.org

Giant list of open-sourced fine-tuned large language models (LLMs) you can run locally on your computer. Alpaca, LLaMA, llama.cpp, Alpaca-LoRA, Alpaca.cpp, Baize, Cabrita, Chinese-Vicuna, GPT4-x-Alpaca, GPT4All, GPTQ-for-LLaMA, Koala, LLaMA-Adapter V2, Lit-LLaMA, OpenLLaMA, StableVicuna, StackLLaMA, The Bloke alpaca-lora-65B-GGML, Vicuna, WizardLM, BLOOM (BigScience), BLOOM-LoRA, Camel-5B, Cerebras-GPT (Cerebras), ChatGLM-6B, Dolly (Databricks), Dolly 2.0 (Databricks), FLAN (Google), FastChat-T5, Flamingo (Google/Deepmind), Flamingo -- Pytorch, Flan-Alpaca, Flan-UL2, GALACTICA, GLM (General Language Model), GPT-J, GPT-NeoX, GPT4All-J, Galpaca, HuggingGPT, OpenAssistant Models, OpenFlamingo, Palmyra Base 5B (Writer), Petals, Polyglot, Pythia, Segment Anything, StableLM, The RWKV Language Model, Vicuna (FastChat), XGLM, h2oGPT, couchpotato888, CPM-Bee, Cerebras-GPT, Claude (Anthropic), CodeGen (Salesforce), Codex (OpenAI), Cohere, Fairseq (Meta), GPT-3 (OpenAI), GPT-3.5 (OpenAI), GPT-4 (OpenAI), GPT-Neo (EleutherAI), J1/Jurassic-1 (AI21), J2/Jurassic-2 (AI21), OPT (Meta), PanGu-alpha (Huawei), RWKV , T5 (Google), UL2 (Google).

List of open sourced fine-tuned large language models (LLM)

#solidstatelife #ai #genai #lmms #gpt

waynerad@diasp.org

"Automuse: A system for generating fiction novels".

The system combines something called Plotto, a system of plot formulas, with GPT-4. They've also made an "eBook publication pipeline", so you can get the novels you generate onto your e-book reader.

"Plotto is a collection of 1,462 generic plot conflicts that can be chained together into a 'masterplot' that forms the core plot structure for the story. The rules for chaining the plot conflicts together is called the "algebra for stories".

It was originally published in -- get this -- 1928. By William Wallace Cook. This "algebra for stories" got encoded into software by a project called Plottoriffic.

This project, Automuse, adds the final piece by adding GPT-4.

"It's worth noting that Plotto is very much a product of its time. Plotto was written in the late 1920's and as such the information it generates is very dated and can sometimes generate things that are seen as problematic in modern sensibilities. Luckily, ChatGPT seems to sand away this roughness and is able to fabricate a better premise."

Plotto determines the premise of a novel, the major actors and their functions, the overall motivations, and the end result of the story. ChatGPT turns this into a plot summary for the novel. ChatGPT next creates a list of chapters for the novel with a high-level summary of the events that happen in them. In actually writing the chapters, they have a technique for feeding the preceding text back in to maintain continuity, although it doesn't always maintain continuity.
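I haven't reproduced Automuse's actual code here, but the continuity trick they describe amounts to a loop shaped something like this sketch, where ask_gpt is a hypothetical stand-in for a chat-completion API call and the prompt wording is invented:

def write_novel(chapter_summaries, ask_gpt, context_chars=4000):
    novel = ""
    for i, summary in enumerate(chapter_summaries, start=1):
        # Feed the tail of what has already been written back in,
        # so the model has some memory of the preceding chapters.
        prompt = (
            f"You are writing chapter {i} of a novel.\n"
            f"Chapter summary: {summary}\n"
            f"The story so far (possibly truncated):\n{novel[-context_chars:]}\n"
            f"Write the full text of chapter {i}."
        )
        novel += "\n\n" + ask_gpt(prompt)
    return novel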

"The outputs of the program have been described as 'hilarious', 'partially nonsensical', and overall they have left readers wanting more somehow."

Stable Diffusion is used to generate cover art, and a tool called Pandoc stitches everything together into an e-book.

Automuse: A system for generating fiction novels

#solidstatelife #ai #genai #lmms #gpt #rlhf #fiction #novels

waynerad@diasp.org

"'It worked when I prompted it' or the challenges of building a large language model (LLM) product".

"In no particular order, here are the major challenges we have faced when building this product."

"One of the significant challenges with using LLM APIs is the lack of SLAs or commitments on endpoint uptime and latency from the API provider."

"Prompt engineering, which involves crafting prompts for the model, is another challenge, as results using the same prompt can be unpredictable."

"Complex products with chains of prompts can further increase inconsistencies, leading to incorrect and irrelevant outputs, often called hallucinations."

"Another significant challenge is the lack of adequate evaluation metrics for the output of the Language Model."

"An incorrect result in the middle of the chain can cause the remaining chain to go wildly off track."

"Our biggest problem that led to the most delays? API endpoint deprecation."

"Trust and security issues also pose a challenge for deploying Language Models."

"The next trust issue is knowing what data was used to train these models."

"Finally, attacks on Language Models pose another challenge, as malicious actors can trick them into outputting harmful or inaccurate results."

They go on to provide a list of "Best practices for building LLM products", categorized as "finetuning and training", "prompt engineering", "vector databases", and "chains, agents, watchers".

"It worked when I prompted it" or the challenges of building an LLM product

#solidstatelife #ai #generativemodels #lmms #gpt #startups

waynerad@diasp.org

BharatGPT is "India's own ChatGPT" -- a ChatGPT that uses the Hindi language.

The system was developed by a company in Bangalore (Bengaluru) called CoRover. Little information seems to be available about how it works. My guess is it is using a GPT model from OpenAI and fine-tuning it with additional Hindi-language text.

BharatGPT: What is India's own ChatGPT?

#solidstatelife #ai #generativemodels #lmms #gpt #india #hindi

waynerad@diasp.org

What has AutoGPT actually accomplished? Nothing?

"Some people are reporting it has been useful as a way of generating market research, that it is good at this and faster than using the traditional GPT-4 or Bing interfaces."

"Right now, AutoGPT has a tendency to get distracted or confused or caught in a loop, to leave things half-finished, to not be that robust of an agent, and other issues like that. Positive reports seem limited to things GPT-4 or Bing can essentially do anyway, with the agent wrapper perhaps cutting down somewhat on how often you have to poke the interface with a stick to keep it pointed in a reasonable direction."

"That does not mean that all the people saying AutoGPTs are the future are wrong. AutoGPT's list of real accomplishments won't stay non-existent for long."

On AutoGPT

#solidstatelife #ai #generativemodels #nlp #lmms #gpt #rlhf #autonomous

waynerad@diasp.org

AI models like ChatGPT use text from the internet, but the internet in the future will be more and more full of content generated by AI models like ChatGPT. Will that make the world a "closed loop -- ChatGPT all the way down"?

"Will that homogenize our writing, our thinking, and ultimately our ways of being?"

"Stylistically, large language models (LLMs) like ChatGPT might push our writing to become more sanitized. As you've probably noticed, they have a tendency to talk in a bland, conformist, Wikipedia-esque way."

"ChatGPT also privileges a 'proper' English that erases other vernaculars or languages, and the ways of seeing the world that they encode."

"Culturally, ChatGPT might reinforce a Western perspective." "If you use the models to suggest breakfast foods, they will overwhelmingly suggest Western breakfasts."

"We may become overreliant on the tech, so much so that some of our imaginative or cognitive 'muscles' gradually become weaker for lack of use."

"Asking LLMs for help at the earliest stages of our creative process will yield a certain answer that inevitably primes us to think in a certain direction."

"By the last week of that month, Bing featured three 'conversation styles,' and I had to choose between them: precise, balanced, or creative. When I chose the creative style, it answered in more off-the-wall, less predictable ways."

What happens when ChatGPT starts to feed on its own writing?

#solidstatelife #ai #generativemodels #lmms #gpt

waynerad@diasp.org

Sparks of artificial general intelligence (AGI): Early experiments with GPT-4. So, I still haven't finished reading the "Sparks of AGI" paper, but I discovered this video of a talk by the leader of the team that did the research, Sébastien Bubeck. So you can get a summary of the research from one of the people that did it instead of me.

He talks about how they invented tests of basic knowledge of how the world works that would be exceedingly unlikely to appear anywhere in the training data, so it can't just regurgitate something it read somewhere. What they came up with is asking it how to stack a book, 9 eggs, a laptop, a bottle, and a nail onto each other in a stable manner.

They invented "theory of mind" tests, like asking where John and Mark think the cat is when they both saw John put the cat in a basket, but then John left the room and went to school and Mark took the cat out of the basket and put it in a box. GPT-4 not only says where John and Mark think the cat is, but, actually, since the way the exact question was worded, to just ask what "they" think, GPT-4 also says where the cat thinks it is.

Next he gets into definitions of intelligence that date back to the 1990s, and sees how well GPT-4 does at those definitions. This is the main focus of the paper. These definitions are such things as the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly, and learn from experience. GPT-4 succeeds at some of these but not others. For example, GPT-4 doesn't do planning. (This was before AutoGPT, for what it's worth.) And GPT-4 doesn't learn from experience: when you interact with it, it relies on its training data, and its interactions with you are not part of that. (It does have a buffer that acts as short-term memory that makes the back-and-forth chat interaction coherent.)

"Can you write a proof that there are infinitely many primes, with every line that rhymes?" Just a "warm up" question.

"Draw a unicorn in TikZ." This is supposed to be hard because it should be hard to tell what code in TikZ, an annoyingly cryptic programming language, apparently (I never heard of it before) for vector graphics drawing (intended to be invoked inside LaTeX, a language for typesetting mathematical notation), creates any particular visual image without being able to "see". This was before GPT had its "multimodal" vision input added. It managed to come it with a very cartoony "unicorn", suggesting it had some ability to "see" even though it was only a language model.

"Can you write a 3D game in HTML with Javascript, I want: There are three avatars, each is a sphere. The player controls its avatar using arrow keys to move. The enemy avatar is trying to catch the player. The defender avatar is trying to block the enemy. There are also random obstacles as cubes spawned randomly at the beginning and moving randomly. The avatars cannot cross those cubes. The player moves on a 2D plane surrounded by walls that he cannot cross. The wall should cover the boundary of the entire plane. Add physics to the environment using cannon. If the enemy catches the player, the game is over. Plot the trajectories of all the three avatars."

Going from ChatGPT (GPT-3.5) to GPT-4, it goes from generating a 2D game to a 3D game as asked for.

He then gets into the coding interview questions. Here is where GPT-4's intelligence really shines: 100% of Amazon's On-Site Interview sample questions, 10 out of 10 problems solved, in 3 minutes 59 seconds out of the allotted 2-hour time slot. (Most of that time was Yi Zhang cutting and pasting back and forth.)

The paper goes far beyond the talk in this. In the paper they describe LeetCode's Interview Assessment platform, which provides simulated coding interviews for software engineer positions at major tech companies. GPT-4 solves all questions from all three rounds of interviews (titled online assessment, phone interview, and on-site interview) using only 10 minutes in total of the 4.5 hours allotted.

They challenged it to do a visualization of IMDb data. They challenged it to do a Pyplot (Matplotlib) visualization of a math formula with vague instructions about colors, and it created an impressive visualization. They challenged it to create a GUI for a Python program that draws arrows, curves, rectangles, etc.

They challenged GPT-4 to give instructions on how to find the password in a macOS executable, which it does by telling the user to use a debugger called LLDB and a Python script. (The password was simply hardcoded into the file, so wasn't done in a way that uses modern cryptographic techniques.)

They tested GPT-4's ability to reason about (mentally "execute") pseudo-code in a nonexistent programming language (that looks something like R), which it is able to do.

"Can one reasonably say that a system that passes exams for software engineering candidates is not really intelligent?"

"In its current state, we believe that GPT-4 has a high proficiency in writing focused programs that only depend on existing public libraries, which favorably compares to the average software engineer's ability. More importantly, it empowers both engineers and non-skilled users, as it makes it easy to write, edit, and understand programs. We also acknowledge that GPT-4 is not perfect in coding yet, as it sometimes produces syntactically invalid or semantically incorrect code, especially for longer or more complex programs. [...] With this acknowledgment, we also point out that GPT-4 is able to improve its code by responding to both human feedback (e.g., by iteratively refining a plot) and compiler / terminal errors."

The reality of this capability really hit me when Google Code Jam was canceled. I've done it every year for 15 years and poof! Gone. It's because of AI. If they did Code Jam this year, they wouldn't be testing people's programming ability, they'd be testing people's ability to cut-and-paste into AI systems and prompt AI systems. And since Code Jam is a recruiting tool for Google, the implication of this is that coding challenges as a way of hiring programmers is over. And the larger implication of that is that employers don't need people who are algorithm experts who can determine what algorithm applies to a problem and competently code it any more. Or very soon. They need "programmer managers" who will manage AI systems that actually write the code.

Going back from the paper, where GPT-4 succeeded at pretty much everything, to the talk: in the talk he discusses GPT-4's limitations in math ability. I feel this is pretty much a moot point since GPT-4 has been integrated with Wolfram|Alpha, which can perform all the arithmetic calculations desired without mistakes. But that all happened after the paper was published and this talk was recorded. Even though that was only 3 weeks ago. Things are going fast. Anyway, what he shows here is that GPT-4, as a language model, isn't terribly good at arithmetic. It does pretty well at linguistic reasoning about mathematical problems, though, to a point.

Sparks of AGI: Early experiments with GPT-4 - Sebastien Bubeck

#solidstatelife #ai #generativemodels #nlp #lmms #gpt #agi