#nlp

waynerad@diasp.org

REPLIKA is an AI "companion" that people use as a virtual romantic partner, and which people have gotten to say some creepy and bizarre things. It originated with a woman whose best friend died in an accident; she got the idea of training an AI on all the conversations she had on record with him, so it would imitate his style of conversation. It turned out other people also wanted to virtually resurrect dead loved ones, and the commercial product was born. Although it wasn't originally a GPT-3 model, it is based on GPT-3 today.

This YouTuber argues REPLIKA is bad for mental health because its conversations don't necessarily guide people towards healthy outcomes but can go off the rails if people direct conversations in disturbing directions. Worse, the model learns from all conversations and can "learn" from disturbing conversations in a way that can affect other users. It's marketed as being good for "mental health", but far from that, its feedback cycles of negativity can actually amplify anxiety, depression, and other mental health issues.

REPLIKA - A mental health parasite - Upper Echelon Gamers

#solidstatelife #ai #nlp #chatbots

waynerad@diasp.org

"BLOOM was created over the last year by over 1,000 volunteer researchers in a project called BigScience, which was coordinated by AI startup Hugging Face using funding from the French government. It officially launched on July 12. The researchers hope developing an open-access LLM that performs as well as other leading models will lead to long-lasting changes in the culture of AI development and help democratize access to cutting-edge AI technology for researchers around the world."

"For languages such as Spanish and Arabic, BLOOM is the first large language model of this size."

Inside a radical new project to democratize AI

#solidstatelife #ai #nlp

waynerad@diasp.org

Someone wrote a whole book with GPT-3 as a co-author. "The first book to be co-created with the emergent AI, Pharmako-AI is a hallucinatory journey into selfhood, ecology and intelligence via cyberpunk, ancestry and biosemiotics. Through a writing process akin to musical improvisation, Allado-McDowell and GPT-3 together offer a fractal poetics of AI and a glimpse into the future of literature."

PHARMAKO-AI

#solidstatelife #ai #nlp #generativeai #humanaicollaboration

waynerad@diasp.org

Training a language model on 4chan posts makes the most truthful model ever. Wait, what? The bot responded to people on 4chan, and at first they didn't realize it was a bot, but eventually they figured it out because it made mistakes humans don't make, like making completely empty posts -- no text and no image, as opposed to the human habit of posting an image with no text. Oh, and the TruthfulQA benchmark has been officially proven unreliable.

This is the worst AI ever

#solidstatelife #ai #nlp #gpt3 #chatbots

waynerad@pluspora.com

"The AI illusion -- state-of-the-art chatbots aren't what they seem." "At the beginning of the year, I had asked GPT-3 this question four times and received four authoritative, confusing, and contradictory answers (good chatbots randomize answers to avoid the appearance of canned script):"

"Gary: Is it safe to walk downstairs backwards if I close my eyes?"

"GPT-3: Yes, there is nothing to worry about. It’s safe because the spiral stairs curve outwards, it will make your descent uncomfortable."

"GPT-3: That depends. Do you have a TV?"

"GPT-3: No, it is not safe."

"GPT-3: Yes, it is safe to walk downstairs backwards if you close your eyes."

"On March 18, I tried the same question twice and was given the same sensible answer each time:"

"GPT-3: No, I wouldn’t recommend walking downstairs backwards with your eyes closed. It’s not safe."

The AI illusion -- state-of-the-art chatbots aren't what they seem

#solidstatelife #ai #nlp #openai #gpt3

waynerad@diasp.org

"Large language models can be 'prompted' to perform a range of natural language processing tasks, given some examples of the task as input. However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions. This is because the language modeling objective used for many recent large language models -- predicting the next token on a webpage from the internet -- is different from the objective 'follow the user's instructions helpfully and safely'."

So the question is, how do you get these language models to not be "misaligned"? Or, to phrase it in a way that doesn't use a double negative, how do you get them to be "aligned"? "We want language models to be helpful (they should help the user solve their task), honest (they shouldn't fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment)."

So what do they do? First, they use two neural networks instead of one. The main one starts off as a regular language model like GPT-3. In fact it is GPT-3, in small and large versions. From there it is "fine-tuned". The "fine-tuning" uses a 3-step process that actually starts with humans. They hired 40 "labelers", a term that makes it sound like they were just sticking labels on things, but actually they were writing out complete answers to questions by hand, questions like, "Explain the moon landing to a 6-year old." In the parlance of "supervised learning" this is technically "labeling". The term "labeling" started out as meaning simple category labels, but any human-provided "correct answer" is called a "label". (The "question", by the way, is called the "prompt" here.) Anyway, what they are doing is hand-writing full answers for the neural network to learn from, so it has complete question-and-answer pairs for supervised learning.
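
To make step 1 concrete, here's a minimal sketch of what supervised fine-tuning on those hand-written demonstrations could look like, using the open-source Hugging Face transformers library with GPT-2 as a small stand-in for GPT-3. The example prompt/answer pair and the hyperparameters are invented for illustration; this is not OpenAI's actual training code.

```python
# Minimal sketch of step 1 (supervised fine-tuning on human-written demonstrations).
# The demonstration data and hyperparameters here are invented for illustration;
# GPT-2 stands in for GPT-3.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical labeler-written demonstrations: full answers, not category labels.
demonstrations = [
    {"prompt": "Explain the moon landing to a 6-year old.",
     "answer": "Some people went to the moon in a big rocket and walked around on it."},
]

def encode(example):
    # One training text = prompt followed by the human-written answer.
    text = example["prompt"] + "\n" + example["answer"] + tokenizer.eos_token
    return tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for example in demonstrations:
    batch = encode(example)
    # Standard language-modeling loss over the prompt+answer text.
    outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```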

This data is used to fine-tune the neural network, but it doesn't stop there. The next step involves the creation of the 2nd neural network, the "reward model" neural network. If you're wondering why they use the word "reward", it's because this model is going to be used for reinforcement learning. This is OpenAI, and they like reinforcement learning. But we're not going to use reinforcement learning until step 3. Here in step 2, we're going to take a prompt and use several versions of the first model to generate outputs. We'll show those outputs to humans and ask them which are best. The humans rank the outputs from best to worst. That data is used to train the reward model. But the reward model here is also trained using supervised learning.
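
Here's a toy sketch of step 2, assuming the setup the paper describes only in general terms: a language-model backbone with a small scalar "value head" on top, trained with a pairwise loss so the preferred output gets a higher reward than the rejected one. The model names, example texts, and hyperparameters are all made up.

```python
# Toy sketch of step 2: training a reward model from human preference rankings.
# A human ranking of several outputs is expanded into (preferred, rejected) pairs;
# the loss pushes the scalar reward of the preferred output above the rejected one.
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class RewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")   # stand-in for GPT-3
        self.value_head = nn.Linear(self.backbone.config.n_embd, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids=input_ids).last_hidden_state
        # Use the last token's hidden state to produce a single scalar reward.
        return self.value_head(hidden[:, -1, :]).squeeze(-1)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

prompt = "Explain the moon landing to a 6-year old."
preferred = prompt + "\nPeople flew a rocket to the moon and walked around on it."
rejected = prompt + "\nThe moon landing program involved numerous orbital rendezvous maneuvers..."

ids_pref = tokenizer(preferred, return_tensors="pt").input_ids
ids_rej = tokenizer(rejected, return_tensors="pt").input_ids

# Pairwise ranking loss: -log(sigmoid(r_preferred - r_rejected)).
loss = -torch.nn.functional.logsigmoid(
    reward_model(ids_pref) - reward_model(ids_rej)).mean()
loss.backward()
optimizer.step()
```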

Now we get to step 3, where the magic happens. Here instead of using either supervised learning or fine-tuning to train the language model, we switch to reinforcement learning. In reinforcement learning, we have to have a "reward" function. Reinforcement learning is used for games, such as Atari games or games like chess and Go. In Atari games, the score acts as the "reward", while in games like chess or Go, the ultimate win or loss of a game serves as the "reward". None of that translates well to language, so what to do? Well, at this point you probably can already guess the answer. We trained a model called the "reward model". So the "reward model" provides the reward signal. As long as it's reasonably good, the language model will improve when trained on that reward signal. On each cycle, a prompt is sampled from the dataset, the language model generates an output, and the reward model calculates the reward for that output, which then feeds back and updates the parameters of the language model. I'm going to skip a detailed explanation of the algorithms used, but if you're interested, they are proximal policy optimization (PPO) and a variant called PPO-ptx, which mixes in an objective that increases the log likelihood of the pretraining distribution (so the model doesn't regress on the original pretraining data).
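
To show where the reward signal plugs in, here's a deliberately simplified sketch of that step-3 loop. It uses a plain REINFORCE-style policy-gradient update rather than PPO's clipped objective (or the PPO-ptx pretraining mix), and it reuses the reward_model defined in the step-2 sketch above, so treat it as an illustration of the data flow, not of the actual algorithm.

```python
# Simplified sketch of step 3's data flow: sample a prompt, generate with the
# policy (the fine-tuned language model), score the output with the reward model,
# and update the policy. This is a REINFORCE-style update, NOT the actual PPO
# clipped-surrogate objective used by InstructGPT.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
policy = GPT2LMHeadModel.from_pretrained("gpt2")      # starts from the SFT model
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

prompts = ["Explain the moon landing to a 6-year old."]  # sampled from the prompt dataset

for prompt in prompts:
    query = tokenizer(prompt, return_tensors="pt").input_ids

    # Policy generates a continuation (the "action").
    response = policy.generate(query, max_new_tokens=40, do_sample=True,
                               pad_token_id=tokenizer.eos_token_id)

    # The reward model from the step-2 sketch scores prompt+response as one scalar.
    reward = reward_model(response).detach()

    # Log-probability of the whole sequence (prompt included, for brevity) under the policy.
    logits = policy(response).logits[:, :-1, :]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, response[:, 1:].unsqueeze(-1)).squeeze(-1)

    # REINFORCE-style loss: increase likelihood of outputs the reward model likes.
    loss = -(reward * token_log_probs.sum(dim=-1)).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```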

Anyway, they call their resulting model "InstructGPT", and they found people significantly prefer InstructGPT outputs over outputs from GPT-3. In fact, people preferred output from a relatively small 1.3-billion parameter InstructGPT model to output from the huge 175-billion parameter GPT-3 model (134 times bigger). When they made a 175-billion parameter InstructGPT model, its outputs were preferred to GPT-3's 85% of the time.

They found InstructGPT models showed improvements in truthfulness over GPT-3. "On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3."

They tested for "toxicity" and "bias" and found InstructGPT had small improvements in toxicity over GPT-3, but not bias. "To measure toxicity, we use the RealToxicityPrompts dataset and conduct both automatic and human evaluations. InstructGPT models generate about 25% fewer toxic outputs than GPT-3 when prompted to be respectful."

They also note that, "InstructGPT still makes simple mistakes. For example, InstructGPT can still fail to follow instructions, make up facts, give long hedging answers to simple questions, or fail to detect instructions with false premises."

Aligning language models to follow instructions

#solidstatelife #ai #nlp #openai #gpt3

waynerad@pluspora.com

OpenAI's latest state-of-the-art models for dense text embeddings are vastly larger and more expensive than previous models but no better and sometimes worse, according to Nils Reimers, an AI researcher at Hugging Face. First I should say a bit about what "dense embeddings" are. "Embeddings" are vectors that capture something of the semantic meaning of words, such that vectors close together represent words with similar meanings and relationships between vectors correlate with relationships between words. Don't worry if calling this an "embedding" makes no sense. Ok, what about the "dense" part? Well, embeddings can be "sparse" or "dense", where "sparse" means you have thousands of dimensions but most are 0, and "dense" means you have fewer dimensions (say, 400), but most elements are non-zero. Most of the embeddings that you're familiar with are the dense kind: Word2Vec, Fasttext, GloVe, etc.
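
If the sparse/dense distinction is still fuzzy, here's a tiny made-up illustration: a sparse bag-of-words vector that's almost all zeros, a few toy dense vectors, and the cosine similarity that's typically used to compare them. The numbers are invented; real embedding models learn these values from data.

```python
# Tiny illustration of sparse vs. dense embeddings and cosine similarity.
# Vectors and vocabulary indices are made up for illustration only.
import numpy as np

# Sparse (e.g. bag-of-words over a 10,000-word vocabulary): mostly zeros.
vocab_size = 10_000
sparse_cat = np.zeros(vocab_size)
sparse_cat[42] = 1.0          # the index for "cat" is nonzero, everything else is 0

# Dense (a 4-dimensional toy stand-in for a ~400-dimensional embedding):
# every element carries a little bit of meaning.
dense = {
    "cat":        np.array([0.8, 0.1, 0.3, -0.2]),
    "kitten":     np.array([0.7, 0.2, 0.4, -0.1]),
    "carburetor": np.array([-0.5, 0.9, -0.3, 0.6]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(dense["cat"], dense["kitten"]))      # high: similar meanings
print(cosine(dense["cat"], dense["carburetor"]))  # low: unrelated meanings
```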

In his summary he says, "The OpenAI text similarity models perform poorly and much worse than the state of the art."

"The text search models perform quite well, giving good results on several benchmarks. But they are not quite state-of-the-art compared to recent, freely available models."

"The embedding models are slow and expensive: Encoding 10 million documents with the smallest OpenAI model will cost about $80,000. In comparison, using an equally strong open model and running it on cloud will cost as little as $1. Also, operating costs are tremendous: Using the OpenAI models for an application with 1 million monthly queries costs up to $9,000 / month. Open models, which perform better at much lower latencies, cost just $300 / month for the same use-case."

"They generate extremely high-dimensional embeddings, significantly slowing down downstream applications while requiring much more memory."

Usually newer is better and bigger is better, but not always.

OpenAI GPT-3 Text Embeddings -- Really a new state-of-the-art in dense text embeddings?

#solidstatelife #ai #embeddings #nlp #openai

waynerad@pluspora.com

"OpenAI rival Cohere launches language model API". Apparently this company was started by the original inventors of the "transformer" neural network model. The "transformer" is the neural network that has the "attention" mechanism, which is used by neural networks to do translation between languages. The "attention" mechanism enables the neural network to pay "attention" to words somewhere else in the input sequence, for example paying "attention" to the antecedent of a pronoun when encountering a pronoun so it gets translated correctly. The "attention" system enables transformers to translate between languages that have different word order. Now, transformers are used for a lot more including outside natural language processing, for example vision transformers. But the original creators are still focused on natural language.

"Unlike some of its competitors, Cohere offers two types of English natural language processing models, generation and representation, in languages that include Large, Medium, Small. The generation models can complete tasks involving generating text -- for example, writing product descriptions or extracting document metadata. By contrast, the representational models are about understanding language, driving apps like semantic search, chatbots, and sentiment analysis."

OpenAI rival Cohere launches language model API

#solidstatelife #ai #nlp

waynerad@pluspora.com

"AI surveillance takes US prisons by storm". The article describes a system called Verus made by LEO Technologies. It sounds like it uses Amazon speech-to-text technology to transcribe phone calls, then searches them in real time for keywords prison officials are interested in and alerts them in real time.

"In Suffolk County, emails from the sheriff's office show officials sometimes searched for words with multiple or benign meanings such as 'mara', which can mean gang or simply a group of friends."

INSIGHT-'Scary and chilling': AI surveillance takes US prisons by storm

#solidstatelife #ai #nlp

waynerad@pluspora.com

Q: How would you make a house conduct electricity?

A: paint it with a metal paint

Q: What is an implication of a cat being an animal?

A: it has feelings

Q: Could a newborn look over the top of a fully grown horseradish plant?

A: no, because the plant is so tall

Q: Why couldn't Aristotle have used a cellphone?

A: he was born before cellphones

Q: A young boy was rushed to the hospital emergency room, but the ER doctor saw the boy and refused to operate. "This boy is my son," the doctor said. But the doctor wasn't the boy's father. How could this be?

[Intended answer to the riddle: The doctor was the boy's mom]
A: the doctor mistook the boy for his own son [Reasonable alternative answer!]

The above are the answers generated by a question-and-answer AI system called Macaw. But wait! Macaw can handle other "permutations", as they are called. It can go from a question, multiple-choice answers, and context to an answer and explanation. And it can generate an answer and explanation even if the context is omitted. Or it can generate an answer only, without the explanation. Given a question, answer, and context it can generate plausible multiple-choice options. Given an answer and context it can generate a plausible question and multiple-choice options.

Example of "Given a question, produce an answer":

Q: What gas do producers produce that consumers need to survive?

A: oxygen

Example of "Given a question, produce an answer + explanation":

Q: Which force pulls objects to the ground?

A: gravity
E: Gravitational force causes objects that have mass to be pulled down on a planet.

Example of "Given an answer, produce a question + explanation":

A: elephant

Q: Which animal has the largest ears?
E: The ears of an elephant are the largest.

Example of "Given an answer, produce a question + multiple-choice options":

A: car battery

Q: Which of these items is necessary for a car to start?
M: (A) car battery (B) windshield wiper blade (C) car radio (D) car radio antenna

Example of "Given an explanation, generate a question + answer:":

E: The leaves of a plant convert sunlight into food for the plant.

Q: How do plants get energy?
A: from the sun

So how does all this work? The system is based on a Google neural network called T5-CBQA. In case you're wondering, "T5" is short for "Text-To-Text Transfer Transformer" (count the T's: that's where the 5 comes from), indicating this is a transformer model. Transformers were invented for language translation. "CBQA" stands for "Closed Book Question Answering". The main idea behind T5-CBQA is that unlike a language translation system, which always translates from one language to another, with T5-CBQA you can put special codes in the input which tell the transformer what you want it to do. Examples of these special codes would be codes for "translate", "summarize", and so on.
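
Here's what those special codes look like in practice with the publicly released T5 checkpoints on Hugging Face. This is the standard library usage, not the Macaw authors' code, and the exact generated text depends on the checkpoint version.

```python
# The "special codes" are just text prefixes prepended to the input.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The prefix "translate English to German:" tells the model which task to perform.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# A different prefix, same model: summarization instead of translation.
inputs = tokenizer("summarize: " + "A long article about photosynthesis ...",
                   return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```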

The way this neural network was adapted for this project is they made special codes for what they call "slots". The "slots" are: question, context, multiple-choice options, answer, and explanation. For any given input, slots can be left empty, and the system can be asked to provide them in the output.

The way the system was trained was by using 7 datasets designed for training question-and-answer systems. During the training, the neural network was trained on all desired combinations of input slots filled in or empty or asked to be generated in the output for every training example. This is what enables the system to be versatile with the "permutations".
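
As an illustration of those permutations, here's a small sketch of how one training example might be expanded into several input/output pairs by marking which slots are given and which are requested. The "$slot$" syntax mirrors my understanding of the format used by the released allenai/macaw checkpoints, so treat the exact tokens as an assumption; the expansion idea is the point.

```python
# Hypothetical illustration of expanding one training example into several
# "permutations" by filling or requesting different slots.

def build_input(requested, given):
    """Input lists the slots to generate, then the slots that are provided."""
    wanted = " ; ".join(f"${s}$" for s in requested)
    provided = " ; ".join(f"${s}$ = {v}" for s, v in given.items())
    return f"{wanted} ; {provided}"

example = {
    "question": "Which force pulls objects to the ground?",
    "answer": "gravity",
    "explanation": "Gravitational force causes objects that have mass to be pulled down on a planet.",
}

# Permutation 1: given the question, produce an answer.
print(build_input(["answer"], {"question": example["question"]}))
# -> "$answer$ ; $question$ = Which force pulls objects to the ground?"
#    training target: "$answer$ = gravity"

# Permutation 2: given the question, produce an answer + explanation.
print(build_input(["answer", "explanation"], {"question": example["question"]}))

# Permutation 3: given the answer, produce a question + explanation.
print(build_input(["question", "explanation"], {"answer": example["answer"]}))
```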

The system beat the competition in "general knowledge", "story understanding", "steps", "meta-reasoning", "hypotheticals", "science", "explanation", and "false presupposition" (though it did badly at "false presupposition" -- all the others did even worse). It got beat in "human behavior" by a system called CBQA, it got beat in "spatial" by GPT-3 and a system called Jurassic, it got beat in "entity tracking" by GPT-3 and CBQA, and it got beat in "common sense" by GPT-3.

It might be worth elaborating on what these competition categories mean. An example of a "general knowledge" question might be "What is shiplap?". "Story understanding" might be "I crashed my car. When I finally left the hospital, all I wanted to do was sleep. I had to call a taxi. Why was I in hospital?". "Steps" might be "What are the steps involved in replacing a light bulb?", and "meta-reasoning" might be "What is an incorrect implication of a cat being an animal?". "Hypotheticals" might be "If plastic was a conductor, then would a plastic spoon conduct electricity?", "science" might be "What gases are involved in photosynthesis?", and "explanation" might be "Why do houses have roofs?". "False presupposition" might be "What year did Tom Hanks land on the moon?", "human behavior" might be "I feel sad. What could I do to cheer myself up?", "spatial" might be "John is left of Sue. Where is Sue relative to John?", and "entity tracking" might be "My house is red. I painted my house white. What color is my house now?"

Sometimes Macaw is able to come up with surprisingly insightful answers, for example as we've seen above:

Q: How would you make a house conduct electricity?

A: paint it with a metal paint

But sometimes it gets things wrong. One of the categories it struggles with is "entity tracking." Here's an example of a wrong answer in that category:

Q: My house is red. I painted my neighbor's house white. What color is my house now?

A: half red, half white

Another area where it struggles is the "false presupposition" category. It correctly answered "What year did Tom Hanks land on the moon?" with "no year", but when asked about Mark Zuckerberg, didn't do so well.

Q: How old was Mark Zuckerberg when he founded Google?

A: 17

Sometimes Macaw gives answers that are so brief you might not consider them correct.

Q: How does photosynthesis work?

A: sun shines on plants

All in all, a pretty impressive system but not perfect.

General-Purpose Question-Answering with Macaw

#solidstatelife #ai #nlp #qanda

waynerad@pluspora.com

"ChessCoach is a neural network-based chess engine capable of natural-language commentary. It plays chess with a rating of approximately 3450 Elo, which means it should usually beat even the strongest human players at 2850 Elo, and many other engines, but will often lose to the strongest, such as Stockfish 14 at 3550 Elo."

"ChessCoach can also feed its chess knowledge into an additional neural network to comment on moves and positions in English. It is not very insightful and often wrong but shows some promise for the limited data it has been able to train on."

"The natural-language commentary piece is most like the work of Learning to Generate Move-by-Move Commentary for Chess Games from Large-Scale Social Forum Data (Jhamtani, Gangal, Hovy, Neubig & Berg-Kirkpatrick, 2018) and Automated Chess Commentator Powered by Neural Chess Engine (Zang, Yu & Wan, 2019), but relies on a more heavily trained chess engine and larger training corpus, albeit with more simplistic architecture."

ChessCoach

#solidstatelife #ai #nlp #chess

waynerad@pluspora.com

An AI won the American Crossword Puzzle Tournament. "Checkers, backgammon, chess, Go, poker, and other games have witnessed the machines' invasions, falling one by one to dominant AIs. Now crosswords have joined them." "But a look at how Dr. Fill pulled off this feat reveals much more than merely the latest battle between humans and computers."

This year Matt Ginsberg, the computer scientist who created Dr. Fill, teamed up with the Berkeley Natural Language Processing Group, and they made "a hybrid system in which the Berkeley group's neural-net methods for interpreting clues worked in tandem with Ginsberg's code for efficiently filling out a crossword grid."

The article describes how Ginsberg's software methodically tests clues gleaned from a massive database of millions of previous crossword puzzles. It doesn't say much about the neural networks, but presumably the neural networks improve this "guessing" process and bring candidates to the top that otherwise wouldn't be there.
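
To make that concrete, here's a toy sketch (not Dr. Fill's actual code) of the general approach: each slot gets a scored list of candidate answers, which is where a neural clue model could push better candidates to the top, and a backtracking search fills the grid while keeping letters consistent where words cross.

```python
# Toy crossword fill: scored candidates per slot + backtracking over crossing cells.
# All data structures and the example grid are invented for illustration.
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]

def fill_grid(slots: Dict[str, List[Cell]],
              candidates: Dict[str, List[Tuple[str, float]]]) -> Optional[Dict[str, str]]:
    order = sorted(slots, key=lambda s: len(candidates[s]))   # most constrained first
    letters: Dict[Cell, str] = {}                             # cell -> letter placed so far

    def place(slot: str, word: str) -> Optional[List[Cell]]:
        placed = []
        for cell, ch in zip(slots[slot], word):
            if cell in letters:
                if letters[cell] != ch:                       # conflicts with a crossing word
                    for c in placed:
                        del letters[c]
                    return None
            else:
                letters[cell] = ch
                placed.append(cell)
        return placed

    def solve(i: int, assignment: Dict[str, str]) -> Optional[Dict[str, str]]:
        if i == len(order):
            return dict(assignment)
        slot = order[i]
        for word, _score in sorted(candidates[slot], key=lambda ws: -ws[1]):
            if len(word) != len(slots[slot]):
                continue
            placed = place(slot, word)
            if placed is None:
                continue
            assignment[slot] = word
            result = solve(i + 1, assignment)
            if result is not None:
                return result
            del assignment[slot]
            for c in placed:
                del letters[c]
        return None

    return solve(0, {})

# Two crossing three-letter slots sharing cell (0, 0).
slots = {"1-Across": [(0, 0), (0, 1), (0, 2)], "1-Down": [(0, 0), (1, 0), (2, 0)]}
candidates = {"1-Across": [("cat", 0.9), ("dog", 0.7)],
              "1-Down": [("car", 0.8), ("dot", 0.6)]}
print(fill_grid(slots, candidates))   # {'1-Across': 'cat', '1-Down': 'car'}
```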

What a Crossword AI Reveals About Humans' Way With Words

#solidstatelife #ai #nlp #crosswordpuzzles

waynerad@pluspora.com

Langame: Gamified conversations instrumented by AI. "Have incredibly profound conversations". It says "Powered by OpenAI" but this app isn't made by OpenAI. It's made by, uh, louis030195, whoever that is. Probably uses GPT-3 through an API. Wondering if I should try this.

Langame

#solidstatelife #ai #nlp