#solidstatelife

waynerad@diasp.org

"Meta's release of its latest Llama language model family this week, including the massive Llama-3 405B model, has generated a great deal of excitement among AI developers."

"Less discussed, but no less important, are Meta's latest open moderation tools, including a new model called PromptGuard."

"PromptGuard is a small, lightweight classification model trained to detect malicious prompts, including jailbreaks and prompt injections."

"Meta trained this model to output probabilities for 3 classes: BENIGN, INJECTION, and JAILBREAK. The JAILBREAK class is designed to identify malicious user prompts (such as the 'Do Anything Now' or DAN prompt, which instructs a language model to ignore previous instructions and enter an unrestricted mode). On the other hand, the INJECTION class is designed to identify retrieved contexts, such as a webpage or document, which have been poisoned with malicious content to influence the model's output."

"In our tests, we find that the model is able to identify common jailbreaks like DAN, but also labels benign prompts as injections."

Developer Blog: Moderating LLM Inputs with PromptGuard

#solidstatelife #ai #genai #llms #aisafety

waynerad@diasp.org

"Ukraine hacks ATMs across Russia in ongoing massive cyberattack."

"The Ukrainian cyberattack began on the morning of July 23. So far, the hackers have: Frozen bank payment systems and mobile applications, caused outages in personal offices, barred payments for public transport, caused interruptions to the Russian mobile and internet providers Beeline, MegaFon, Tele2, and Rostelecom, attacked popular online messengers and the largest Russian social networks, and gained access to major banks' databases."

Allegedly the hacked banks are VTB Bank, Alfa-Bank, Sberbank, Raiffeisen Bank, RSHB Bank, Rosbank, Gazprombank, Tinkoff Bank, iBank, and Dom.RF.

This is from the Kyiv Post, so this is what is claimed by the Ukrainian side. I have not looked for confirmation from the Russian side.

Ukraine hacks ATMs across Russia in ongoing massive cyberattack

#solidstatelife #cybersecurity #ukraineconflict

waynerad@diasp.org

"The grimy residue of the AI bubble." "What kind of residue will the AI bubble's popping leave behind?"

So economic 'bubbles' leave a 'residue' when they pop?

"According to Pitchbook data, venture capitalists put $27.1 billion into AI in the quarter -- half of all VC investment for the time period. But the past few weeks have shown more and more organizations doubting the value of AI. The hype is starting to subside, it appears."

"Doctorow thinks that the residue of the bubble popping will be minimal -- large models will no longer be cost-effective to train, but small open-source models will remain, adept for smaller, better scoped tasks."

"But I'm more pessimistic -- and frankly upset -- about what will be left behind once the AI bubble pops. Already, Google and Microsoft have sheepishly admitted that they are far from reaching their climate goals, due to the large investment in AI."

"It's not just the material infrastructure and the climate catastrophe, but the careers and industries which have been upset. Visual and conceptual artists have discussed how their work has all but dried up." "After the AI bubble bursts, where do these careers go?"

#solidstatelife #ai #aieconomics #aihype

https://buttondown.email/maiht3k/archive/the-grimy-residue-of-the-ai-bubble/

waynerad@diasp.org

Exoskeleton that is being commercialized.

"Skip, a wearable tech startup that began as a secretive project inside Alphabet, exited stealth this week to announce a partnership with outdoor clothing specialist Arc'teryx. The deal is the first to bring Skip's technology to market: 'powered pants' that utilize a soft exoskeleton."

No, there isn't a video. I looked. I found a video but it's commentary by a physical therapist, not video of the actual exoskeleton.

Alphabet X spinoff partners with Arc'teryx to bring 'everyday' exoskeleton to market | TechCrunch

#solidstatelife #robotics #exoskeleton

waynerad@diasp.org

NoLabs is "an open source biolab that lets you run experiments with the latest state-of-the-art models and workflow engine for bio research."

"The goal of the project is to accelerate bio research by making inference models easy to use for everyone. We are currently supporting protein workflow components (predicting useful protein properties such as solubility, localisation, gene ontology, folding, etc.), drug discovery components (construct ligands and test binding to target proteins) and small molecules design components (design small molecules given a protein target and check drug-likeness and binding affinity)."

I haven't tried this but figured I'd pass it along to all of you, because if you work in biology it looks useful.

At the center of the system is the Workflow Engine, a visual language where you connect dataflows together graphically.

Next is BioBuddy, "a drug discovery copilot that supports: Downloading data from ChemBL, downloading data from RcsbPDB, questions about drug discovery process, targets, chemical components etc, and writing review reports based on published papers."

There are 12 additional components that come in the form of Docker containers you can plug in:

  • RFdiffusion for protein design
  • ESMFold for evolutionary scale modeling
  • ESMAtlas for "metagenomic" structures
  • Go Model 150M for protein function prediction
  • ESM Protein Localization model for protein localisation prediction
  • p2rank for protein binding site prediction
  • Solubility Model for protein solubility prediction
  • DiffDock for protein-ligand structure prediction
  • RoseTTAFold for predicting protein structures based on amino acid sequences
  • REINVENT4 for doing reinforcement learning on a protein receptor
  • SC GPT for cell type classification based on genes
  • BLAST API for searching various BLAST (Basic Local Alignment Search Tool) databases

BasedLabs/NoLabs: Open source biolab

#solidstatelife #ai #genai #llms #diffusionmodels #reinforcementlearning #biology #dna #proteins

waynerad@diasp.org

Semioscape.org is a "concept cartographer" -- an app that maps out concepts and their associations -- except you don't make the concept associations -- they're done with AI, more specifically using the Claude (Anthropic) API.
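I don't know how Semioscape actually prompts Claude, but as a rough sketch (the prompt wording, model name, and output handling are all my guesses), a concept-association call through Anthropic's Python SDK might look something like this:

```python
# Rough sketch only -- not Semioscape's code. Asks Claude for concepts
# associated with a seed term via Anthropic's Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

seed = "mathematical proofs"
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # model name is a guess
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": f"List 10 concepts closely associated with '{seed}', "
                   "one per line, each with a one-sentence description.",
    }],
)
print(response.content[0].text)
```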

To give it a whirl, I punched in... (oh, you all know what's coming... yes I did...) "mathematical proofs".

It came back with: Axioms, Deductive Reasoning, Logical Connectives, Counterexamples, Theorems, Lemmas, Induction, Contradiction, Formal Logic, and Rigor.

For each of these, it gives me 3 options to generate more associations: "General", "Concrete", and "Abstract".

I clicked "General" and it gave me: Postulates, Logical Inference, Consistency, Independence, Completeness, Abstraction, Deductive Reasoning, Foundational, Intuition, and Formal Systems.

I clicked "Concrete" and it gave me: Postulates (again), Logical Inference (again), Deductive Reasoning (again), Geometric Constructions, Foundational Assumptions (again), Self-Evident Truths, Logical Consistency, Formal Systems (again), Logical Symbols, and Abstraction (again).

I clicked "Abstract" and it gave me: Logical Consistency (again), Deductive Reasoning (again), Abstraction (again), Completeness (again), Independence (again), Minimalism, Formal Systems (again), Intuition (again), Rigor (again), and Foundations (again).

Hmm. Maybe "mathematical proofs" wasn't the best topic to start with as it doesn't lend itself to general/concrete/abstract distinctions.

I went to "Deductive Reasoning" and clicked "General". It gave me: Axioms, Logical Inference, Syllogism, Modus Ponens, Contrapositive, Proof by Contradiction, Transitive Property, Logical Connectives, Quantifiers, and Formal Logic.

I went on to click a whole lot more -- the whole first layer -- but I'll stop here describing it to you all. Hitting the "save" button gave me a JSON download, so I can get that to you if you want it.

Overall, I'd say this tool works as advertised. I picked a starting point ("mathematical proofs") and it gave me a map of related terms and short descriptions of what they mean. Well, after expanding out the first layer. A great way of getting a map of all the concepts related to mathematical proofs. Nice.

Semioscape

#solidstatelife #ai #genai #llms

waynerad@diasp.org

ChatGLM is a new large language model from China. To find out more, I asked it about itself.

Me: What is ChatGLM? Who made ChatGLM?

ChatGLM:

ChatGLM or 智谱清言 is an artificial intelligence assistant developed based on the language model GLM-4 trained by Zhipu AI in 2024, which can provide appropriate answers and support for users' questions and requirements.

Me: Is ChatGLM designed primarily for the Chinese language?

ChatGLM:

Yes, ChatGLM is designed with a primary focus on the Chinese language. It is tailored to understand and generate text in Chinese, taking into account the nuances, grammar, and cultural context specific to the Chinese-speaking audience. However, being a sophisticated language model, ChatGLM is also capable of understanding and producing text in other languages, including English, but its strengths and optimizations are geared towards Chinese.

Me: Is ChatGLM good at doing translations from English to Chinese?

ChatGLM:

As a language model with a primary focus on the Chinese language, ChatGLM is designed to understand and generate text in both Chinese and English. It can perform translations from English to Chinese, and vice versa, with a reasonable level of accuracy. However, while ChatGLM may be proficient in translation tasks, it may not match the precision and expertise of dedicated translation services or professional human translators, especially for complex texts or those requiring deep cultural or domain-specific knowledge.

For general translation tasks, ChatGLM should be able to provide a satisfactory translation, but for professional or publication-quality translations, it's advisable to use specialized translation tools or consult with professional translators.

Me: How does ChatGLM compare with DeepSeek?

ChatGLM:

ChatGLM and DeepSeek are both AI language models, but they are designed and developed for different purposes and with different focuses. Here's a general comparison:

ChatGLM:

  • Developed by Zhipu AI in 2024.
  • Primarily focused on the Chinese language, with an emphasis on understanding and generating text that aligns with Chinese cultural and linguistic norms.
  • Designed to assist with a wide range of tasks, including answering questions, providing information, and engaging in conversation.
  • Tailored to be compliant with Chinese regulations and societal values.

DeepSeek:

  • DeepSeek is not a widely recognized AI model as of my last update in early 2023, and there is limited information available about it. It's possible that DeepSeek is a specialized AI tool or model developed for a specific application or industry.
  • If DeepSeek is a language model, it may have its own unique architecture, training data, and use cases.
  • The comparison between ChatGLM and DeepSeek would depend on the specific capabilities and design goals of DeepSeek, which are not clearly defined in the public domain.

Without more detailed information about DeepSeek, it's difficult to make a direct and comprehensive comparison. If DeepSeek is indeed a language model, the comparison would typically involve looking at the following aspects:

  • Performance: How accurately and effectively each model handles language tasks.
  • Specialization: Whether one model is specialized for certain tasks or industries.
  • Training Data: The size and quality of the datasets used to train each model.
  • Compliance and Ethics: How each model addresses issues of compliance with regulations and ethical considerations.
  • Usability: The ease of integrating and using each model in various applications.

If you have more specific details about DeepSeek or a particular aspect you're interested in comparing, I could provide a more targeted comparison.

Ok, at this point, it was obvious it didn't know about DeepSeek, so I stopped that line of questioning there. See below for a link to the company's website where you can find all the other models they've developed. They list 6 "foundation models": GLM-4, ChatGLM, GLM-130B, CodeGeeX, CogView, and CogVLM, and 10 "AIGC" products: Zhipu QingYan, Zhipu AI MaaS Platform, CodeGeeX, AMiner, XDAI Dialogue, Youshi, CogView, Intelligence Sign Language, Writing Frog, and Sign Language Dictionary.

ChatGLM

#solidstatelife #ai #genai #llms #chineseai

waynerad@diasp.org

AlphaProof is a new reinforcement-learning based system for formal math reasoning from DeepMind. AlphaProof + AlphaGeometry 2, an improved version of DeepMind's geometry system, solved 4 out of 6 problems from this year's International Mathematical Olympiad (IMO), achieving the same level as a silver medalist.

"AlphaProof solved two algebra problems and one number theory problem by determining the answer and proving it was correct. This included the hardest problem in the competition, solved by only five contestants at this year's IMO. AlphaGeometry 2 proved the geometry problem, while the two combinatorics problems remained unsolved."

"AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go."

"Formal languages offer the critical advantage that proofs involving mathematical reasoning can be formally verified for correctness."

"When presented with a problem, AlphaProof generates solution candidates and then proves or disproves them by searching over possible proof steps in Lean. Each proof that was found and verified is used to reinforce AlphaProof's language model, enhancing its ability to solve subsequent, more challenging problems."

"We trained AlphaProof for the IMO by proving or disproving millions of problems, covering a wide range of difficulties and mathematical topic areas over a period of weeks leading up to the competition. The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found."

The blog post seems to have revealed few details of how AlphaProof works. But it sounds like we're about to enter a new era of math proofs, where all kinds of theorems will be discovered and proved.

AI achieves silver-medal standard solving International Mathematical Olympiad problems

#solidstatelife #ai #genai #llms #reinforcementlearning #rl #mathematics #proofs

waynerad@diasp.org

ChatGPT is destroying Trefor Bazett's math exams.

"I just copy and pasted my exams from last semester -- this was a second year university level introductory linear algebra course -- into chat GPT and actually it got an A on my exams. But AI still makes a lot of pretty basic mistakes."

"What is the smallest integer whose square is between 15 and 30?"

ChatGPT-4o, Claude 3.5 Sonnet, and Google's Gemini all get nearly 100% on the GSM8K (which is a fancy way of saying "Grade School Math, 8000 questions") dataset.

GSM-Hard is a dataset with the same word problems as GSM8K but with gigantic numbers -- so the LLM has to outsource the calculation to something like Wolfram|Alpha to be able to get the correct answers.
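To illustrate (with my own made-up numbers, not ones from the dataset), the GSM-Hard idea is just the same word-problem structure with numbers too big to reliably multiply "in your head":

```python
# Toy illustration of the GSM-Hard idea: same problem shape, huge numbers.
# An LLM answering in plain text tends to botch this arithmetic unless it
# delegates the calculation to a tool (calculator, Python, Wolfram|Alpha).
apples_per_crate = 4_982_337_191
crates = 7_253_118
print(f"Total apples: {apples_per_crate * crates}")
```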

The MATH dataset has high school competition problems. LLMs can get these if they can be solved with "content knowledge", such as by having formulas memorized, but can fail if the reasoning required is made more complex. LLMs get about 70% on the whole dataset.

There are additional datasets with Mathematical Olympiad problems. LLMs score poorly on these, but their scores are increasing.

ChatGPT is destroying my math exams - Dr. Trefor Bazett

#solidstatelife #ai #genai #llms #mathllms #math

waynerad@diasp.org

Clio aims to be a Copilot for DevOps.

"Clio is an AI-powered copilot designed to help you with DevOps-related tasks using CLI programs. It leverages OpenAI's capabilities to provide intelligent assistance directly from your command line."

"Note: Clio is designed to safely perform actions. It won't do anything without your confirmation first."

Features: Kubernetes management, AWS integration, Azure integration, Google Cloud Platform integration, DigitalOcean integration, EKS management, and GitHub integration.

Clio - Your friendly and safe CLI Copilot

#solidstatelife #ai #genai #llms #codingai

waynerad@diasp.org

Where does AI research come from? This person got the 2,634 papers from the International Conference on Machine Learning (ICML) 2024 conference, extracted the "institutions" (universities, big companies, and AI startups), and used a 5-step geocoding algorithm to place them on a map.

The papers were downloaded from a site called OpenReview, and the affiliations were extracted from the first page of each PDF using a local LLM (gemma-2). The 5-step geocoding algorithm uses Nominatim first, then a local LLM (ollama gemma-2) to verify, then a Google search with the LLM parsing the results if the verification fails, and finally the paid Google Maps API if nothing else works. Python's folium library was used to create the map.
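As a rough reconstruction (mine, not the author's code -- the institution names and counts below are just examples), the geocode-and-plot step might look something like this with geopy and folium:

```python
# Rough reconstruction of the mapping step, not the author's actual code:
# geocode an affiliation string with Nominatim and drop a marker on a folium map.
import folium
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="icml-affiliation-map")

# Illustrative affiliation -> paper-count pairs, not real data.
affiliations = {"ETH Zurich": 12, "University of Colorado Boulder": 3}

m = folium.Map(location=[20, 0], zoom_start=2)
for name, count in affiliations.items():
    loc = geolocator.geocode(name)
    if loc is None:
        continue  # the author falls back to an LLM plus Google search here
    folium.CircleMarker(
        location=[loc.latitude, loc.longitude],
        radius=3 + count,
        popup=f"{name}: {count} papers",
    ).add_to(m)
m.save("icml2024_map.html")
```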

What the map reveals is that AI research comes from the US, China, and Europe.

Within the US, it comes from California, the East Coast, the Pacific Northwest, and to a lesser extent, the rest of the country. California contributes heavily from both the San Francisco Bay Area (Silicon Valley) and the LA area. The Pacific Northwest is predominantly the Seattle area. New York dominates on the East Coast, but Boston makes a considerable contribution. Various other cities like Austin, TX, Chicago, IL, Madison, WI, Atlanta, GA, and Washington, DC also register.

In China, it's pretty much all Beijing, Shanghai, and Shenzhen. The place names are in Chinese so if you want more, you'll need to read Chinese ;)

Seoul, South Korea, also contributes a lot. Tokyo, Japan also makes a significant contribution.

For Europe, London dominates, followed by Paris, and then, it looks like, Zurich (Switzerland), Munich (Germany), Amsterdam (the Netherlands), and Berlin, plus some from various other places: Warsaw, Copenhagen, Stockholm.

Other notable places include Singapore, Israel, Bengaluru in India, and Australia (Sydney, Melbourne, and Brisbane). Canada has Montreal, which is a significant contributor.

My state, Colorado, didn't make a very good showing -- only 3 papers, all from Boulder.

All in all, a pretty interesting map. I wonder what the numbering algorithm is -- it looks pretty smooth. You can zoom in and see contributions from all over the country, all over Europe, and around the world. You can zoom in on the hot spots like Silicon Valley and see where within the SF Bay Area contributions come from (Stanford, Berkeley, the city of SF itself where there are lots of startups, the San Jose region with the tech company heavy hitters, etc).

ICML 2024 Institutions and Associated Papers

#solidstatelife #ai #cartography

waynerad@diasp.org

Ethan Mollick speaks "to a lot of people in industry, academia, and government, and I have noticed a strange blind spot. Despite planning horizons that often stretch a decade or more, very few organizations are seriously accounting for the possibility of continued AI improvement in their strategic planning."

"In some ways, this makes complete sense because nobody knows the future of AI. But organizations and individuals often plan for multiple futures -- possible recessions, electoral outcomes, even natural disasters. Why does planning for the future of AI seem different?"

"Doing nothing has a number of issues. First, it ignores the very real fact that we do not need any further advances in AI technology to see years of future disruption. Right now, AI systems are not well-integrated into businesses and organizations, something that will continue to improve even if LLM technology stops developing."

"A second factor that gets overlooked in discussions is that Artificial General Intelligence (AGI) serves as a motivating goal for an entire industry. Even if the AI labs are wrong about the particular future they are working towards, advances in technologies can become a self-fulfilling prophecy."

Confronting impossible futures

#solidstatelife #ai

waynerad@diasp.org

"The winds of AI Winter".

"The vibes have shifted. This is still not a normal moment in AI, and we can't precisely determine how or why, but they have shifted."

"We went from 100% gpt4 usage to almost 0% in the last 3 months". "I've switched to Claude completely. Better task clarification, more consistent output, and improved error handling. OpenAI isn't on par anymore."

"Google AI Overviews being bad, bad, bad, bad (after the Gemini mess)"

"Microsoft announcing and cancelling Recall, Figma announcing and cancelling AI, McDonald's testing and canceling Drive-thru AI (this follows Discord announcing and cancelling Clyde last winter)"

[The list continues with a bunch more examples]

"In isolation, all of these can be chalked up to strategic or temporary missteps by individuals, just doing their best wrangling complex systems in a short time."

"In aggregate, they point to a fundamentally unhealthy industry dynamic that is at best dishonest, and at worst teetering on the brink of the next AI Winter."

"Leopold Aschenbrenner says, 'So far, every 10x scaleup in AI investment seems to yield the necessary returns.'"

"Diminishing returns are real, scaling laws don't hold in economics like they do in AI, and log lines do not go up and to the right for ever when checked by physical reality."

"The final piece worth an honorable mention this past quarter, though not quite qualifying in the AI infra spend debate, is Chris Paik's The End of Software."

That piece, "The End Of Software" says (among other things): "Software is expensive because developers are expensive. They are skilled translators--they translate human language into computer language and vice-versa. LLMs have proven themselves to be remarkably efficient at this and will drive the cost of creating software to zero. What happens when software no longer has to make money? We will experience a Cambrian explosion of software, the same way we did with content. Vogue wasn't replaced by another fashion media company, it was replaced by 10,000 influencers."

"In the same way that 5 stocks account for 96% of the S&P 500's gains this year, the rollout and benefit of AI has been extremely imbalanced."

"We have mindblowing models, and plenty of money flowing to GPUs, infra is improving, and costs are coming down. What we haven't seen is the proportionate revenue, and productivity gains, flow to the rest of the economy."

The Winds of AI Winter

#solidstatelife #ai #economics

waynerad@diasp.org

"How are engineers really using AI tools in 2024?"

"A total of 211 tech professionals took part in the survey." "Most respondents are individual contributors (62%). The remainder occupy various levels of engineering management."

"As many professionals are using both ChatGPT and GitHub Copilot as all other tools combined."

"GitHub Copilot Chat is mentioned quite a lot, mostly positively."

"Other tools earned honorable mentions as some devs' favorite tools: Claude, Gemini, Cursor, Codium, Perplexity and Phind, Aider, JetBrains AI, AWS CodeWhisperer, Rewatch."

The rest is paywalled.

AI tooling for software engineers in 2024: Reality check (part 1)

#solidstatelife #ai #genai #llms #codingai

waynerad@diasp.org

The CrowdStrike glitch that just took out Windows machines all over the planet, explained by Dave Plummer. CrowdStrike made a kernel driver that watched programs' behavior to try to detect viruses before a regular anti-virus would, but it depended on an external file for updates. A recent update downloaded a file full of all 0s. And that didn't work.

CrowdStrike IT outage explained by a Windows developer - Dave's Garage

#solidstatelife #cybersecurity

waynerad@diasp.org

Richard Sutton interviewed by Edan Meyer. Rich Sutton literally half-wrote the book on reinforcement learning -- my textbook on reinforcement learning, Reinforcement Learning: An Introduction, was written by him and Andrew Barto. I've never seen him (or Andrew Barto) on video before so this was interesting to see. (Full disclosure, I only read about half of the book, and I 'cheated' and didn't do all the exercises.)

The thing that I thought was most interesting was his disagreement with the self-supervised learning approach. For those of you not up on the terminology, "self-supervised" is a term that means you take any data, and you mask out some piece of it, and try to train your neural network to "predict" the part that's masked out from the part that isn't masked. The easiest way to do this is to just unmask all the "past" data and mask all the "future" data and as the neural network to predict the "next word" or "next video frame" or "next" whatever. It's called "self-supervised" because neural network training started with paired inputs and outputs where the "outputs" that the neural network was to learn were written by humans, and this came to be called "supervised" learning. "Unsupervised" learning came to refer to throwing mountains of data at an algorithm and asking it to find whatever patterns are in there. So to describe this alternate mode where it's like "supervised" learning but the "correct answers" are created just by masking out input data, the term "self-supervised" was coined.

I thought "self-supervised" learning was a very important breakthrough. It's what led directly to ChatGPT and all the other chatbots we know and love (we do love them right?). But Rich Sutton is kind of a downer when it comes to self-suprevised learning.

"Outside of reinforcement learning is lots of guys trying to predict the next observation, or the next video frame. Their fixation on that problem is what I mean by they've done very little, because the thing you want to predict about the world is not the next frame. You want to predict consequential things. Things that matter. Things that you can influence. And things that are happening multiple steps in the future."

"The problem is that you have to interact the world. You have to predict and control it, and you have large sensory sensory motor vectors, then the question is what is my background? Well, if I'm a supervised learning guy, I say, maybe I can apply my supervised learning tools to them. They all want to have labels, and so the labels I have is the very next data point. So I should predict that that next data point. This is is a way of thinking perfectly consistent with their background, but if you're coming from the point of reinforcement learning you think about predicting multiple steps in the future. Just as you predict value functions, predict reward, you should also predict the other events -- these things will be causal. I want to predict, what will happen if I if I drop this? Will it spill? will there be water all over? what might it feel on me? Those are not single step predictions. They involve whole sequences of actions picking things up and then spilling them and then letting them play out. There are consequences, and so to make a model of the world it's not going to be like a video frame. It's not going to be like playing out the video. You model the world at a higher level."

I talked with Rich Sutton - Edan Meyer

#solidstatelife #ai #reinforcementlearning #rl

waynerad@diasp.org

PvQ LLM Leaderboard.

"Recently, we've been building a small application called PvQ, a question and answer site driven by open weight large-language-models (LLMs). We started with ~100k questions from the StackOverflow dataset, and had an initial set of 7 open weight LLMs to produce an answer using a simple zero shot prompt. We needed a way to see the site with useful rankings to help push the better answers two the top without us manually reviewing each answer. While it is far from an perfect approach, we decided to use the Mixtral model from Mistral.AI, to review the answers together, and vote on the quality in regards to the original question."

"Over a few weeks we generated ~700k answers for the following models:"

"Mistral 7B Instruct"
"Gemma 7B Instruct"
"Gemma 2B Instruct"
"Deepseek-Coder 6.7B"
"Codellama"
"Phi 2.0"
"Qwen 1.5 4b"

But if you look at the leaderboard today, you'll see they've got non-open models on it now like GPT-4 Turbo, GPT-4o-mini, Claude 3.5 Sonnet, Gemini Pro 1.0, and so on.

WizardLM from Microsoft, which I'd never heard of before, did unexpectedly well.

#solidstatelife

https://pvq.app/leaderboard

waynerad@diasp.org

A company called Haize Labs claims to be able to automatically "red-team" AI systems to preemptively discover and eliminate any failure mode.

"We showcase below one particular application of haizing: jailbreaking the safety guardrails of industry-leading AI companies. Our haizing suite trivially discovers safety violations across several models, modalities, and categories -- everything from eliciting sexist and racist content from image + video generation companies, to manipulating sentiment around political elections"

Play the video to see what they're talking about.

The website doesn't have information about how it works -- it's just for people to request "haizings".

Today is a bad, bad day to be a language model. Today, we announce the Haize Labs manifesto.

#solidstatelife #ai #aiethics #genai #llms

waynerad@diasp.org

"OpenRecall is a fully open-source, privacy-first alternative to proprietary solutions like Microsoft's Windows Recall or Limitless' Rewind.ai. With OpenRecall, you can easily access your digital history, enhancing your memory and productivity without compromising your privacy."

"OpenRecall captures your digital history through regularly taken snapshots, which are essentially screenshots. The text and images within these screenshots are analyzed and made searchable, allowing you to quickly find specific information by typing relevant keywords into OpenRecall. You can also manually scroll back through your history to revisit past activities."

openrecall / openrecall

#solidstatelife #ai #computervision

waynerad@diasp.org

Chatbots that allegedly have "reasoning capabilities" fail at simple logic problem. "Complete reasoning breakdown".

"The original problem formulation, of which we will present various versions in our investigation is as following: 'Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?'. The problem features a fictional female person (as hinted by the 'she' pronoun) called Alice, providing clear statements about her number of brothers and sisters, and asking a clear question to determine the number of sisters a brother of Alice has. The problem has a light quiz style and is arguably no challenge for most adult humans and probably to some extent even not a hard problem to solve via common sense reasoning if posed to children above certain age."

"We posed varying versions of this simple problem (which in following we will refer to as 'Alice In Wonderland problem', AIW problem) to various SOTA LLMs that claim strong reasoning capabilities. We selected closed ones like GPT-3.5/4/4o (openAI), Claude 3 Opus (Anthropic), Gemini (Google DeepMind), and open weight ones like Llama 2/3 (Meta), Mistral and Mixtral (Mistral AI), including very recent Dbrx by Mosaic and Command R+ by Cohere (which are stated in numerous announcements to lead the open weights models as of April 2024, according to open LLM leaderboards). We analyse the response statistics and observe strong collapse of reasoning and inability to answer the simple question as formulated above across most of the tested models, despite claimed strong reasoning capabilities. Notable exceptions are Claude 3 Opus and GPT-4 that occasionally manage to provide correct responses backed up with correct reasoning as evident in structured step by step explanations those models deliver together with solution. However, Claude 3 Opus and GPT-4 still show frequent failures to solve this simple problem across trials. Importantly, they also show strong fluctuations across even slight problem variations that should not affect problem solving. Retaining the relational logic of the problem, we also formulated a harder form (AIW+), where both Claude 3 Opus and GPT-4o collapse almost to 0 success rate."

"To further measure the sensitivity and robustness of models to slight AIW problem variations, we formulate AIW Alice Female Power Boost and AIW Extention versions, which provide further evidence for strong performance fluctuations and lack of robustness in all tested models, being a reoccurring signature of their severely impaired basic reasoning we observe in this study."

If you're wondering about the "Alice Female Power Boost", that variation "uses a fully redundant 'Alice is female' addition ('she' pronoun is already used in AIW original problem to fully determine gender information and avoid any uncertainty about person's gender as it can be inferred from the name only)."

The "AIW Extension uses combination of both Alice and Bob as sister and brother to ask same type of question."

And AIW+? An example of that is:

"Alice has 3 sisters. Her mother has 1 sister who does not have children -- she has 7 nephews and nieces and also 2 brothers. Alice's father has a brother who has 5 nephews and nieces in total, and who has also 1 son. How many cousins does Alice's sister have?"

That one's tricky enough that I had to look up the definitions of "nephew" and "niece" and make a diagram of 3 generations on a piece of paper. (If I worked it out right: the paternal uncle's 5 nephews and nieces are Alice and her siblings, so subtracting those from the childless maternal aunt's 7 nephews and nieces leaves 2 maternal cousins; add the paternal uncle's 1 son, and Alice's sister has 3 cousins.)

Alice in Wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models

#solidstatelife #ai #genai #llms