#llms

waynerad@diasp.org

"Evaluate LLMs in real time with Street Fighter III"

"A new kind of benchmark? Street Fighter III assesses the ability of LLMs to understand their environment and take actions based on a specific context. As opposed to RL models, which blindly take actions based on the reward function, LLMs are fully aware of the context and act accordingly."

"Each player is controlled by an LLM. We send to the LLM a text description of the screen. The LLM decides on the next moves its character will make. The next moves depend on its previous moves, the moves of its opponents, its power and health bars."

"Fast: It is a real time game, fast decisions are key"
"Smart: A good fighter thinks 50 moves ahead"
"Out of the box thinking: Outsmart your opponent with unexpected moves"
"Adaptable: Learn from your mistakes and adapt your strategy"
"Resilient: Keep your RPS high for an entire game"

Um... Alrighty then...

OpenGenerativeAI / llm-colosseum

#solidstatelife #ai #genai #llms

waynerad@diasp.org

"Texas will use computers to grade written answers on this year's STAAR tests."

STAAR stands for "State of Texas Assessments of Academic Readiness" and is a standardized test given to elementary through high school students. It replaced an earlier test starting in 2007.

"The Texas Education Agency is rolling out an 'automated scoring engine' for open-ended questions on the State of Texas Assessment of Academic Readiness for reading, writing, science and social studies. The technology, which uses natural language processing, a building block of artificial intelligence chatbots such as GPT-4, will save the state agency about $15 million to 20 million per year that it would otherwise have spent on hiring human scorers through a third-party contractor."

"The change comes after the STAAR test, which measures students' understanding of state-mandated core curriculum, was redesigned in 2023. The test now includes fewer multiple choice questions and more open-ended questions -- known as constructed response items."

Texas will use computers to grade written answers on this year's STAAR tests

#solidstatelife #ai #llms #technologicalunemployment

waynerad@diasp.org

Why this developer is no longer using Copilot. He feels his programming skills atrophy. He writes code by pausing to wait for Copilot to write code, and doesn't enjoy programming that way. The AI-generated code is often wrong or out-of-date and has to be fixed. And using Copilot is a privacy issue, because your code is shared with Copilot.

I thought this was quite interesting. I tried Copilot in VSCode and I figured I wasn't using it much because I'm a vim user. So I tracked down the Neovim plug-in & got it working in vim, but still found I don't use it. Now I've come to feel it's great for certain use cases and bad for others. Where it's great is writing "boilerplate" code for using a public API. You just write a comment describing what you want to do and the beginning of the function, and Copilot spits out practically all the rest of the code for your function -- no tedious hours studying the documentation from the API provider.

But that's not the use case I actually engage in in real life. Most of what I do is either making a new UI, or porting code from PHP to Go. For the new UI, AI has been helpful -- I can take a screenshot, input it to ChatGPT, and ask it how to improve the UI. (I'm going to be trying this with Google's Gemini soon but I haven't tried it yet.) When it makes suggestions, I can ask it what HTML+CSS is needed to implement those suggestions. I've found it gets better and better for about 6 iterations. But you notice, Copilot isn't part of the loop. I'm jumping into dozens of files and making small changes, and that's a use case where Copilot just isn't helpful.

For porting code from PHP to Go, I modified a full-fledged PHP parser to transpile code to Go, and this has been critical because it's important that certain things, especially strings, get ported over exactly -- no room for errors. So this system parses PHP strings using PHP's parsing rules, and outputs Go strings using Go's parsing rules, and is always 100% right. Copilot isn't part of the loop and doesn't help.
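To illustrate why exact string handling matters here (this is my own toy sketch, not the author's transpiler): in a PHP single-quoted string, "\n" is a literal backslash followed by "n", while in a Go double-quoted literal "\n" is a newline, so the backslash has to be re-escaped on the way over.

```python
def php_single_quoted_to_go(php_body: str) -> str:
    """Convert the body of a PHP single-quoted string literal into a Go
    double-quoted literal with the same runtime contents. In PHP single
    quotes, only \\' and \\\\ are escapes; \\n is two literal characters."""
    out, i = [], 0
    while i < len(php_body):
        if php_body[i] == "\\" and i + 1 < len(php_body) and php_body[i + 1] in ("'", "\\"):
            out.append(php_body[i + 1])  # PHP escape: \' -> ', \\ -> \
            i += 2
        else:
            out.append(php_body[i])
            i += 1
    decoded = "".join(out)
    # Re-escape for Go double quotes, where \ and " are special.
    return '"' + decoded.replace("\\", "\\\\").replace('"', '\\"') + '"'

php_single_quoted_to_go(r"a\nb")  # produces the Go literal "a\\nb" -- still a literal backslash-n, not a newline
```

A naive character-for-character copy would silently turn that literal backslash-n into a newline in Go, which is exactly the class of error a parser-based transpiler avoids.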

Another place I've found AI incredibly useful is debugging problems where I have no clue what the problem might be. This goes back to using other people's large systems such as the public APIs mentioned earlier. Every now and then you get cryptic error messages or some other bizarre malfunction, and endless Google searching doesn't help. I can go to ChatGPT, Gemini, Claude, Perplexity, DeepSeek (and others, but those are the main ones I've been using) and say hey, I'm getting this cryptic error message or this weird behavior, and it can give you a nice list of things you might try. That can get you unstuck when you'd otherwise be very stuck.

It's kinda funny because, obviously I'm an avid follower of what's going on in AI, and happy to try AI tools, and I constantly run across other developers who say "Copilot has made me twice as productive!" or "Copilot has made me five times as productive!" or somesuch. I've wondered if there's something wrong with me because I haven't experienced those results at all. But AI has been helpful in other areas nobody ever seems to talk about.

Why I'm no longer using Copilot - Dreams of Code

#solidstatelife #ai #genai #llms #codingllms #openai #copilot

waynerad@diasp.org

"Agentic workflow" is what's next for AI, says Andrew Ng. Agentic meaning the AI acts as an "agent". Basically, ChatGPT gives you one-off answers. You ask it to write something for you, or write some code, and you can have follow-up conversation, but each time it has to generate a new response.

In the "agentic" workflow Andrew Ng envisions, the AI agent has a work item that it works on in an iterative manner, interacting with you at each step. If you ask a human programmer to write some code, they never blast out the whole thing right off the bat. They write some code and then iterate on it until they get it right. By changing language AIs into full-fledged agents, they will be able to engage in this practice themselves. An AI agent tasked with writing code can run and test its code, it can write its own unit tests, it can engage in self-reflection, and so on.

The next step after that is multi-agent collaboration. In this case you could give it a high-level task, and one agent can do high-level planning, another can search on HuggingFace for an AI model appropriate for the task, another can write the code, and so on.

What's next for AI agentic workflows ft. Andrew Ng of AI Fund - Sequoia Capital

#solidstatelife #ai #genai #llms #andrewng

waynerad@diasp.org

"Will AI save physicians and clinicians time and from burnout?"

"Copilots for clinicians are also becoming more common. Ambient clinical documentation is a booming business. The technology allows doctors to record conversations with patients to automatically turn them into clinical notes and summaries using AI and is a major topic at Healthcare conferences like HIMSS conference this year, where more than 30,000 health and tech professionals gathered in Orlando, Florida."

"Earlier in March, Salesforce announced Einstein Copilot: Health Actions will allow doctors to book appointments, summarize patient information and send referrals by prompting AI with conversational language."

"Administrative workloads are a major problem for clinicians across the US health-care system. A survey published (via CNBC) by Athenahealth in February found that more than 90% of physicians report feeling burned out on a regular basis, largely because of the paperwork they are expected to complete."

"I used to be part of an admissions committee for a medical school. When I interviewed idealistic young people applying to medical school, 'typing' and 'filling out forms' was never once mentioned as a reason for becoming a physician."

She goes on to describe using AI for prior authorization letters that have to be written to insurance companies. These require a letter to be written to justify the use of a drug or therapy for a specific patient and to contain details of that specific patient and why that patient needs that therapy. These are frequently rejected by the insurance companies and have to be re-written over and over to eventually get approval. "A third of medical offices employ full-time staff to take care of the average 30 prior authorizations per physician per week."

On the flip side, "the insurers have started to use AI to deny claims more quickly."

Another use is referral letters from one physician to another. "Like prior authorization letters, these are pretty formulaic."

But the thing she has the most enthusiasm for is what she calls "ambient scribes". "Ambient scribes" are AI systems that listen in to the conversation between the patient and the physician and create a templated note for the medical record. "This technology allows physicians to avoid looking at a screen and typing while they're trying to connect with a patient."

"I've tried versions from multiple AI scribe companies (including TORTUS AI, which - full disclosure - I consult for) and they do an amazing job of filtering out irrelevant information and putting the information in the right spot."

"Think of the technological challenge inherent in this process: patient visits are often interrupted by clinic staff or phone calls, meander off into conversations about kids and dogs, and use abbreviations and technical jargon. They're often circular, meaning a patient will mention a symptom and the physician won't ask a follow up question about it until several minutes later. These tools produce a full transcript that uses generative AI to find the important information and put it into a form that's indistinguishable from what a physician would actually type. Many of my friends have reported that ambient scribes actually do a better job of including important details than they would have included themselves."

Will AI save physicians and clinicians time and from burnout?

#solidstatelife #ai #voicetotext #nlp #genai #llms #medicalai

waynerad@diasp.org

The Scalable, Instructable, Multiworld Agent (SIMA) from DeepMind plays video games for you. You tell it what you want to do in regular language, and it goes into a 3D environment, including some provided by commercial video games, and carries out keyboard-and-mouse actions.

Before getting into how they did this, it might be worth citing some of the reasons they thought this was challenging: Video games can be open-ended, visually complex, and have hundreds of different objects. Video games are asynchronous -- no turn taking like chess or Go, or many research environments, which stop and wait while the agent computes its next action. Each instance of a commercial video game needs its own GPU -- no running hundreds or thousands of actors per game per experiment as has been historically done in reinforcement learning. AI agents see the same screen pixels that a human player gets -- no access to internal game state, rewards, or any other "privileged information". AI agents use the same keyboard-and-mouse controls that humans do -- no handcrafted action spaces or high-level APIs.

In addition to all those challenges, they demanded their agents follow instructions in regular language, rather than simply pursuing a high score in the game, and the agents were not allowed to use simplified grammars or command sets.

"Since the agent-environment interface is human compatible, it allows agents the potential to achieve anything that a human could, and allows direct imitation learning from human behavior."

"A key motivation of SIMA is the idea that learning language and learning about environments are mutually reinforcing. A variety of studies have found that even when language is not necessary for solving a task, learning language can help agents to learn generalizable representations and abstractions, or to learn more efficiently." "Conversely, richly grounded learning can also support language learning."

I figure you're all eager to know what the games were. They were: Goat Simulator 3 (you play the goat), Hydroneer (you run a mining operation and dig for gold), No Man's Sky (you explore a galaxy of procedurally-generated planets), Satisfactory (you attempt to build a space elevator on an alien planet), Teardown (you complete heists by solving puzzles), Valheim (you try to survive in a world of Norse mythology), and Wobbly Life (you complete jobs to earn money to buy your own house).

However, before the games, they trained SIMA in research environments. Those, which you've probably never heard of, are: Construction Lab (agents are challenged to build things from construction blocks), Playhouse (a procedurally-generated house), ProcTHOR (procedurally-generated rooms, such as offices and libraries), and WorldLab (an environment with better simulated physics).

The SIMA agent itself maps visual observations and language instructions to keyboard-and-mouse actions. But it does that in several stages. For input, it takes a language instruction from you, and the pixels of the screen.

The video and language instruction both go through encoding layers before being input to a single, large, multi-modal transformer. The transformer doesn't output keyboard and mouse actions directly. Instead, it outputs a "state representation" that gets fed into a reinforcement learning network, which translates the "state" into what in reinforcement learning parlance is called a "policy". A more intuitive regular word might be "strategy". Basically this is a function that, when given input from the environment including the agent's state within the environment, will output an action. Here, the actions are the same actions a human would take with mouse and keyboard.
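A hedged sketch of that data flow, with every function name invented for illustration (the paper describes the pipeline, not this API):

```python
def sima_step(pixels, instruction, encode_video, encode_text, transformer, policy):
    """One decision step: screen pixels plus a language instruction in,
    a keyboard-and-mouse action out."""
    v = encode_video(pixels)       # visual encoder
    t = encode_text(instruction)   # language encoder
    state = transformer(v, t)      # multi-modal transformer -> state representation
    return policy(state)           # policy network -> keyboard-and-mouse action
```

The point of the sketch is the separation of stages: the transformer never emits keystrokes itself; it produces a state representation that the learned policy maps to actions.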

The multi-modal transformer was trained from scratch. They also used a recent technique called Classifier-Free Guidance (CFG), borrowed from diffusion models, where it is used to "condition" the model's output on the text you, the user, typed in.
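Classifier-Free Guidance itself is a simple formula: run the model twice, once with the conditioning (here, the language instruction) and once without, then push the output toward the conditioned result by some scale factor. A minimal sketch, with made-up inputs and a made-up scale parameter:

```python
def cfg_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance: amplify the difference the conditioning
    (e.g. the language instruction) makes, relative to the unconditioned
    output. scale = 1 reproduces the conditioned output unchanged;
    scale > 1 exaggerates the effect of the instruction."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

cfg_logits([2.0, 0.0], [1.0, 1.0], 2.0)  # [3.0, -1.0]
```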

Even in the research environments, it is hard to automate judging of whether an agent completed its tasks. Instructions may be such things as, "make a pile of rocks to mark this spot" or "see if you can jump over this chasm". The environment may not provide any signal indicating these have been fulfilled. There are some they can handle, though, like "move forward", "lift the green cube", and "use the knife to chop the carrots".

For commercial video games, all the agent gets is pixels on the screen, just like a human player, and has no access to the internal game state of the game. The games generally don't allow any game state to be saved and restored, something researchers like for reproducibility.

For video games, they resorted to detecting on-screen text using OCR. They did this in particular for two games, No Man's Sky and Valheim, "which both feature a significant amount of on-screen text."

Why not just have people look, i.e. have humans judge whether the instructions were followed? Turns out humans were "the slowest and most expensive." They were able to get judgments from humans who were experts at the particular game an agent was playing, though.

For automated judgment, if a task contains a knife, a cutting board, and a carrot, the agent may ascertain the goal ("cut the carrot on the cutting board") without relying on the language instruction. This example illustrates the need to differentiate between following a language task and inferring the language task from "environmental affordances".

How'd SIMA do? It looks like its success rate got up to about 60% for Playhouse, but only about 30% for Valheim. That's the percentage of tasks completed. The ranking goes Playhouse, WorldLab, Satisfactory, Construction Lab, No Man's Sky, Goat Simulator 3, and Valheim.

"Note that humans would also find some of these tasks challenging, and thus human-level performance would not be 100%."

Grouped by "skill category", movement instructions ("stop", "move", "look") were the easiest, while food and resource gathering instructions ("eat", "cook", "collect", "harvest") were the hardest.

For No Man's Sky, they did a direct comparison with humans. Humans averaged 60%, while SIMA had around 30%.

How long til the AIs can beat the humans?

A generalist AI agent for 3D virtual environments

#solidstatelife #ai #genai #llms #computervision #multimodal #videogames

waynerad@diasp.org

Yesterday Google Chrome asked me if I want to try "out experimental AI features?"

(It lives at chrome://settings/ai -- you might see if it's available in yours. If not, it might be on the next update.)

"These features use AI, are in early development, and won't always get it right."

I really don't want (even more of) my browser usage sent to Google's servers... but I can't resist, so I turned on the AI features to see what they do.

The features it advertises are:

"Help me write: Helps you write short-form content for things on the web, like reviews. Suggested content is based on your prompts and the content of the web page. To use this feature, right-click on a text box."

"Tab organizer: Automatically creates tab groups based on your open tabs. To use this feature, right-click on a tab and click Organize similar tabs."

To try out "Help me write", I punched in, "Today Google Chrome asks me if I want to try 'out experimental AI features?'".

After selecting "Elaborate" for "Length" and "Formal" for "Tone", I got:

"Today, Google Chrome prompted me to try out 'experimental AI features.' I clicked on the link and was directed to a page that described the features. The descriptions sounded interesting, but I am unsure what the features are or how they would improve my workflow. I am interested in learning whether anyone has tried these features and can share their evaluation of their usefulness."

I clicked on a button, not a link. Ha.

As for "Tab organizer", I clicked on a tab with a video about Boeing and clicked "Organize Similar Tabs".

It created a tab group called "Boeing Drama".

Here are the tabs it put together into the same tab group:

https://www.youtube.com/watch?v=Q8oCilY4szc - Boeing: Last Week Tonight with John Oliver (HBO)

https://www.youtube.com/watch?v=BlmYZ06F-78 - Boeing's killer plane - What went wrong? | ENDEVR Documentary

https://www.youtube.com/watch?v=NDEkH0zd3F8 - Scandal: Boeing's 737 Max disaster - Plainly Difficult

https://www.youtube.com/watch?v=UUuB0C1Nk8U - The SR-71 was much faster than the Air Force will admit - Sandboxx

https://www.youtube.com/watch?v=kLT1QEIIaB4 - Did beleaguered aircraft giant eliminate whistleblower? - Todd Grande

https://www.youtube.com/watch?v=dTmeswV3Ln0 - Boeing whistleblower found dead amid safety concerns and legal battle - dustin dailey

https://www.youtube.com/watch?v=Sdb44vY9VBw - "They Silenced Him." Boeing Whistleblower found dead after testifying | Attorney Ryan Explains

https://www.youtube.com/watch?v=OfoBxa7EoIo - The World with Yalda Hakim: Boeing whistleblower John Barnett found dead - Sky News

https://www.youtube.com/watch?v=mwAtCavQQlA - "Dead after testifying" - Was Boeing whistle blower John Barnett killed to silence him? - Valuetainment

https://www.youtube.com/watch?v=eOffvIaWNm4 - Ex-Boeing Quality Manager Warns of 737 Plane Being Back Air So Soon | TMZ Live - Jan 31, 2024

https://news.ycombinator.com/item?id=39673589 - Boeing whistleblower found dead in US (bbc.com)

https://news.ycombinator.com/item?id=39673589 - Boeing whistleblower found dead in US (bbc.com)

It included the Hacker News link twice. But I had it open twice so maybe it should have done that?

And if you look closely, you'll notice it snuck in one video that's not about Boeing. There's an SR-71 video in there. It kind of makes sense that it's in the group because it's also about aviation, but the label it came up with for the tab group wasn't "Aviation", it was "Boeing Drama". So, there's a little bit of disconnect between the clustering algorithm and the labelling algorithm.

Also, if you're thinking most of the tabs open on my machine were about Boeing, you'd be wrong. I've got 320 tabs open. So, 13 about Boeing, 307 about other topics. As a percentage, 4% about Boeing. (More on that below.)

And yes, I know you all count on me to bring you insights into the latest AI developments (lol), but I got sucked into the "Boeing news" rabbit hole. (More on that below, too).

[Experimental AI](chrome://settings/ai)

#solidstatelife #ai #genai #llms #googlechrome

waynerad@diasp.org

"Ema, a 'Universal AI employee,' emerges from stealth with $25M."

"Meet Ema, a universal AI employee that boosts productivity across every role in your organization. She is simple to use, trusted, and accurate."

[Insert joke here about how saying things like that won't make people worry about their jobs.]

"Ema's the missing operating system that makes Generative AI work at an enterprise level. Using proprietary Generative Workflow Engine, Ema automates complex workflows with a simple conversation. She is trusted, compliant and keeps your data safe. EmaFusion model combines the outputs from the best models (public large language models and custom private models) to amplify productivity with unrivaled accuracy. See how Ema can transform your business today."

"They say Ema (the company) has already quietly amassed customers while still in stealth, including Envoy Global, TrueLayer, and Moneyview."

"Ema's Personas operate on our patent-pending Generative Workflow Engine (GWE), which goes beyond simple language prediction to dynamically map out workflows with a simple conversation. Our platform offers Standard Personas for common enterprise roles such as Customer Service Specialists (CX), Employee Assistant (EX), Data Analyst, Sales Assistant etc. and allows for the rapid creation of Specialized Personas tailored to rapidly automate unique workflows. No more waiting for months to build Gen AI apps that work!"

"To address accuracy issues and computational costs inherent in current Gen AI applications, Ema leverages our proprietary "fusion of experts" model, EmaFusion, that exceeds 2 Trillion parameters. EmaFusion intelligently combines many large language models (over 30 today and that number keeps growing), such as Claude, Gemini, Mistral, Llama2, GPT4, GPT3.5, and Ema's own custom models. Furthermore, EmaFusion supports integration of customer developed private models, maximizing accuracy at the most optimal cost for every task."

Oh, and "Ema" stands for "enterprise machine assistant".

Ema "taps into more than 30 large language models."

"As for what Ema can do, these businesses are using it in applications that range from customer service -- including offering technical support to users as well as tracking and other functions -- through to internal productivity applications for employees. Ema's two products -- Generative Workflow Engine (GWE) and EmaFusion -- are designed to "emulate human responses" but also evolve with more usage with feedback."

They also say, "Pre-integrated with hundreds of apps, Ema is easy to configure and deploy."

What are those integrations? They said some of those integrations are: Box, Dropbox, Google Drive, OneDrive, SharePoint, Clear Books, FreeAgent, FreshBooks, Microsoft Dynamics 365, Moneybird, NetSuite, QuickBooks Online, Sage Business Cloud, Sage Intacct, Wave Financial, Workday, Xero, Zoho Books, Aha!, Asana, Azure DevOps, Basecamp, Bitbucket, ClickUp, Dixa, Freshdesk, Freshservice, Front, GitHub Issues, GitLab, Gladly, Gorgias, Height, Help Scout, Hive, Hubspot Ticketing, Intercom, Ironclad, Jira, Jira Service Management, Kustomer, Linear, Pivotal Tracker, Rally, Re:amaze, Salesforce Service Cloud, ServiceNow, Shortcut, SpotDraft, Teamwork, Trello, Wrike, Zendesk, Zoho BugTracker, Zoho Desk, Accelo, ActiveCampaign, Affinity, Capsule, Close, Copper, HubSpot, Insightly, Keap, Microsoft Dynamics 365 Sales, Nutshell, Pipedrive, Pipeliner, Salesflare, Salesforce, SugarCRM, Teamleader, Teamwork CRM, Vtiger, Zendesk Sell, Zoho CRM, ApplicantStack, Ashby, BambooHR, Breezy, Bullhorn, CATS, ClayHR, Clockwork, Comeet, Cornerstone TalentLink, EngageATS, Eploy, Fountain, Freshteam, Greenhouse, Greenhouse - Job Boards API, Harbour ATS, Homerun, HR Cloud, iCIMS, Infinite BrassRing, JazzHR, JobAdder, JobScore, Jobsoid, Jobvite, Lano, Lever, Oracle Fusion - Recruiting Cloud, Oracle Taleo, Personio Recruiting, Polymer, Recruitee, Recruiterflow, Recruitive, Sage HR, SAP SuccessFactors, SmartRecruiters, TalentLyft, TalentReef, Teamtailor, UKG Pro Recruiting, Workable, Workday, Zoho Recruit, ActiveCampaign, Customer.io, getResponse, Hubspot Marketing Hub, Keap, Klaviyo, Mailchimp, MessageBird, Podium, SendGrid, Sendinblue, 7Shifts, ADP Workforce Now, AlexisHR, Altera Payroll, Azure Active Directory, BambooHR, Breathe, Ceridian Dayforce, Charlie, ChartHop, ClayHR, Deel, Factorial, Freshteam, Google Workspace, Gusto, Hibob, HRAlliance, HR Cloud, HR Partner, Humaans, Insperity Premier, IntelliHR, JumpCloud, Justworks, Keka, Lano, Lucca, Namely, Nmbrs, Officient, Okta, OneLogin, 
OysterHR, PayCaptain, Paychex, Paycor, PayFit, Paylocity, PeopleHR, Personio, PingOne, Proliant, Rippling, Sage HR, Sapling, SAP SuccessFactors, Sesame, Square Payroll, TriNet, UKG Dimensions, UKG Pro, UKG Ready, Workday, and Zenefits.

Ema, a 'Universal AI employee,' emerges from stealth with $25M

#solidstatelife #ai #genai #llms #aiagents #technologicalunemployment

waynerad@diasp.org

Devin, "the first AI software engineer".

You put it in the "driver's seat" and it does everything for you. Or at least that's the idea.

"Benchmark the performance of LLaMa".

Devin builds the whole project, uses the browser to pull up API documentation, runs into an unexpected error, adds a debugging print statement, uses the error in the logs to figure out how to fix the bug, then builds and deploys a website with full styling as visualization.

See below for reactions.

Introducing Devin, the first AI software engineer - Cognition

#solidstatelife #ai #genai #llms #codingai #technologicalunemployment

waynerad@diasp.org

"Albania to speed up EU accession using ChatGPT".

Ok, I understood that sentence up to "using ChatGPT".

"The Albanian government will use ChatGPT to translate thousands of pages of EU legal measures and provisions into shqip (Albanian language) and then integrate them into existing legal structures, following an agreement with the CEO of the parent company, OpenAI, Mira Murati, who was born in Albania."

Oh wow, happened when Mira Murati was CEO. That was, like, a week?

So is ChatGPT the best translator for shqip because it's a smaller language? Why ChatGPT and not some other machine translation system?

"The model to be used by the Albanian government will translate into Albanian and provide a detailed overview of what and where changes need to be made to local legislation to align with EU rules. It will also provide an analysis of the impact of all measures and changes, which usually require many experts and a lot of time."

"Albanian Prime Minister Edi Rama said the move would eliminate 'an army of translators and a battalion of lawyers, costing millions of euros' and speed up the process."

So the idea is just to use ChatGPT as a translator. But is it really a good idea? Some of that "army of translators and battalion of lawyers" will still need to double-check all of ChatGPT's work. ChatGPT is not always right.

Albania to speed up EU accession using ChatGPT - Euractiv

#solidstatelife #ai #genai #llms #gpt #mt #geopolitics #albania

waynerad@diasp.org

OpenAI is apparently partnering with a robotics company, Figure, on its Figure 01 robot. The robot seems to do a remarkable job of integrating language with vision and body movements. They say it works entirely using neural networks: no teleoperation, no hand-coded logic, nothing scripted ahead of time.

"Can I have something to eat?"

"Sure thing."

"Great. Can you explain why you did what you just did while you pick up this trash?"

"On it! So I gave you the apple because it's the only edible item I could provide you with from the table."

"Great. So, based on the scene right now, where do you think the dishes in front of you go next?"

"The dishes on the table, like that plate and cup, are likely to go into the drying rack next."

"Great. Can you put them there?"

"Of course."

OpenAI's 'AGI Robot' Develops SHOCKING NEW ABILITIES | Sam Altman Gives Figure 01 Get a Brain - Wes Roth

#solidstatelife #ai #openai #figure01 #genai #llms #robotics

waynerad@diasp.org

"Shares of Teleperformance plunged 23% on Thursday, after the French call center and office services group missed its full-year revenue target."

"Investors have been spooked by the potential impact of artificial intelligence on its business model, as companies become more able to tap into the technology directly for their own benefit."

Call center group Teleperformance falls 23%; CEO insists AI cannot replace human staff

#solidstatelife #ai #genai #llms #technologicalunemployment

waynerad@diasp.org

"Magic School: Make your school magic with AI".

"The most used and loved AI platform for educators worldwide. Over a million educators embracing AI with MagicSchool and growing."

A million? Really? And I only just now heard of this? What does this "Magic School" do? (Is it me or does "Magic School" always feel like it ought to be followed with "Bus"?)

They claim "60+ AI tools & growing (lesson planning, differentiation, writing feedback, IEP writing, and more!)".

IEP evidently stands for "Individualized Education Program". Apparently a thing in the "special education" field.

They also claim to have a chatbot called "Raina". "Raina" has customizations in all 60+ tools.

I got curious what the 60+ tools were so I found a list.

5E Model Science Lesson Plan Generator ("5E" stands for "Engage, Explore, Explain, Elaborate, Evaluate")
AI-Resistant Assignment Suggestions
Academic Content Generator
Accommodation Suggestion Generator
Assignment Scaffolder
BIP Suggestion Generator ("BIP" stands for "Behavior Intervention Plan")
Choice Board Generator (UDL) ("UDL" stands for "Universal Design for Learning")
Class Newsletter Tool
Clear Directions
Coach's Sports Practice Generator
Colleague Song Generator
Common Misconception Generator
Conceptual Understanding Generator
Custom Chatbot
DOK Questions Generator ("DOK" stands for "Depth of Knowledge")
Data Table Analysis Generator
Decodable Text Generator
E-mail Family Tool
E-mail Responder Tool
Exemplar & Non-Exemplar (examples -- the AI writes examples)
IEP Generator
Informational Text Generator
Lesson Plan Generator
Letter of Recommendation
Make it Relevant!
Math Spiral Review Generator
Math Story Word Problems
Multi-Step Assignment Generator
Multiple Choice Quiz Generator
Multiple Explanations for Complex Concepts
Professional Email Tool
Project Based Learning (PBL) Generator
Reading Quiz Generator
Report Card Comments
Restorative Reflection Generator
Rubric Generator
SAT Reading Practice Test Generator
SAT Reading Questions Custom
Science Lab Generator
Social Stories Generator
Student Work Feedback Tool
Syllabus Generator
Teacher Observation Tool
Team Builder / Ice Breaker
Text Analysis Assignment Generator
Text Dependent Questions
Text Leveler Tool
Text Proofreader Tool
Text Rewriter Tool
Text Scaffolder Tool
Text Summarizer Tool
Text Translator Tool
Thank You Note Generator
Three Dimensional Science Assessment Generator
Unit Plan Generator
Vocab List Generator
Vocabulary Based Text Generator
YouTube Video Question Generator
YouTube Video Summarizer

Magic School: Make your school magic with AI

#solidstatelife #ai #education #genai #llms

waynerad@diasp.org

Approaching human-level forecasting with language models.

The idea here is to pit AI head-to-head against humans in forecasting competitions. They mention 5 of these: Metaculus, GJOpen, INFER, Polymarket, and Manifold. The way they are scored is with something called a "Brier score". To keep things simple, they limited their system to only yes/no "binary" questions.

When dealing with "binary" questions, the way the Brier score is computed is: one outcome is assigned the value 0 (say, some event not happening by a certain date) and the other the value 1 (the event happening). The person -- or now, language model -- making the prediction actually predicts a probability -- a number between 0 and 1. Once the outcome is known, the difference between the predicted probability and the actual outcome number is computed and then squared. For multiple predictions, these numbers are all averaged. In this way, the Brier score represents the "error" in the predictions.

A perfect predictor will predict "1" for every event that actually happens, and "0" for every event that does not happen, leading to a Brier score of 0. If the predictor does not know whether something will happen, they can say 0.5, which will lead to a Brier score of 0.25 no matter which outcome actually happens. It's better to do that than to predict 0 or 1 and be wrong.

This glosses over various details like how to handle when people change their predictions, how to handle multiple choice outcomes or numerical outcomes, but you get the idea. The Brier score represents your prediction error.
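For binary questions, the Brier score described above is just the mean squared error between predicted probabilities and 0/1 outcomes. A minimal sketch in Python (not the paper's code, just the arithmetic):

```python
def brier_score(predictions, outcomes):
    """Mean squared difference between predicted probabilities (each in
    [0, 1]) and actual binary outcomes (each 0 or 1). Lower is better."""
    assert len(predictions) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)

# A perfect forecaster scores 0.
print(brier_score([1.0, 0.0], [1, 0]))  # 0.0
# Always answering 0.5 ("no idea") scores 0.25 regardless of outcomes.
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25
# Being confidently wrong is the worst case: 1.0.
print(brier_score([0.0], [1]))          # 1.0
```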

The researchers found language models are bad at predicting. With no additional information retrieval or fine-tuning, most language models do only a little better than picking at random, and the biggest and best models like GPT-4 and Claude-2 do better than chance but still much worse than humans.

For the dataset that they trained the model on, they used the above-mentioned 5 forecasting competitions and combined data from all of them to get a dataset of 33,664 binary questions. Here's an example showing what these binary questions look like:

"Question: Will Starship achieve liftoff before Monday, May 1st, 2023?"

"Background: On April 14th, SpaceX received a launch license for its Starship spacecraft. A launch scheduled for April 17th was scrubbed due to a frozen valve. SpaceX CEO Elon Musk tweeted: 'Learned a lot today, now offloading propellant, retrying in a few days . . . '"

"Resolution: Criteria This question resolves Yes if Starship leaves the launchpad intact and under its own power before 11:59pm ET on Sunday, April 30th."

"Key Dates: Begin Date: 2023-04-17, Close Date: 2023-04-30, Resolve Date: 2023-04-20."

The "begin date" is the date people can start making predictions. The "close date" is the last date people can make predictions. The "resolve date" is the date reality is checked to see if the prediction happened or not. But, for this example, the reason why the "resolve date" is before the "close date" is because the event occurred.

Their system consists of a retrieval system, a reasoning system, and a candidate selection system.

The retrieval system enables the system to do search engine searches. It consists of 4 steps: search query generation, news retrieval, relevance filtering and ranking, and text summarization. The summarization step is because large language models are limited by their context window, and that may be less of a limitation in the future.
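Those four retrieval steps could be sketched roughly as below. This is a hypothetical illustration, not the authors' code: `llm` and `search_engine` are stand-in callables, and the prompts are made up.

```python
def retrieve_evidence(question, llm, search_engine, top_k=5):
    """Sketch of a 4-step retrieval pipeline: query generation, news
    retrieval, relevance ranking, and summarization. `llm` takes a prompt
    string and returns text; `search_engine` takes a query and returns a
    list of article texts. Both are hypothetical stand-ins."""
    # 1. Search query generation: ask the model for search queries.
    queries = llm(f"Generate search queries for: {question}").splitlines()
    # 2. News retrieval: run each query against a news search engine.
    articles = [a for q in queries for a in search_engine(q)]
    # 3. Relevance filtering and ranking: have the model rate each
    #    article, then keep the top_k most relevant ones.
    rated = [(float(llm(f"Rate 0-10 the relevance of this article to "
                        f"'{question}':\n{a}")), a) for a in articles]
    top = [a for _, a in sorted(rated, reverse=True)[:top_k]]
    # 4. Summarization: compress each article so the downstream
    #    reasoning prompt fits in the model's context window.
    return [llm(f"Summarize for forecasting:\n{a}") for a in top]
```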

The reasoning system works by first prompting the large language model to rephrase the question. The model is next asked to leverage the retrieved information and its pre-training knowledge to produce arguments for why the outcome may or may not occur. Since the model can generate weak arguments, to avoid treating them all as equal, it is instructed to weigh them by importance and aggregate them accordingly. Finally, "to prevent potential bias and miscalibration, the model is asked to check if it is over- or underconfident and consider historical base rates, prompting it to calibrate and amend the prediction accordingly."

This is called reasoning by "scratchpad prompting". Since the aggregate of predictions is usually superior to individual forecasts, this is repeated multiple times and the average is used.
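That repeat-and-aggregate step might look like the sketch below, where `sample_prediction` stands in for one complete retrieve-reason-predict pass. The paper tested several ensembling methods; mean and median here are just illustrative choices, not necessarily theirs.

```python
from statistics import mean, median

def ensemble_prediction(sample_prediction, n_samples=5, aggregate=mean):
    """Run the full scratchpad reasoning pipeline several times and
    combine the resulting probabilities. `sample_prediction` is a
    hypothetical callable returning one probability in [0, 1]."""
    preds = [sample_prediction() for _ in range(n_samples)]
    return aggregate(preds)
```

Swapping `aggregate=median` makes the ensemble more robust to a single wildly over- or underconfident sample.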

All of this needs to be in place before fine-tuning because it's used to generate the fine-tuning data. Fine-tuning was done on a subset of the data: the questions where the model outperformed the human crowd, since they "seek to fine-tune our model on strong forecasts." But they discard examples where the model beat the crowd by too much, because "this can inadvertently cause overconfidence in our fine-tuned model."

"The input to the model consists of the question, description, and resolution criteria, followed by summarized articles. The target output consists of a reasoning and a prediction. Importantly, the fine-tuning input excludes the scratchpad instructions. By doing so, we directly teach the model which reasoning to apply in a given context."

In addition they did a "hyperparameter sweep" where they tried to optimize the hyperparameters. The "hyperparameters" were the search query prompt, the summarization prompt, the number of articles to keep and rank, the reasoning prompt, and the ensembling method for combining multiple answers (they tested 5 different algorithms).

Anyway, the end result of all this is that the large language model had a Brier score of .179, while the crowd had .149, a difference of only .03. So the system is very close to human accuracy. If traditional "accuracy" numbers are more intuitive to you, they reported 71.5% accuracy for their system, versus 77.0% for the human crowd.

Approaching human-level forecasting with language models

#solidstatelife #ai #genai #llms #futurology #predictionmarkets #brierscore

waynerad@diasp.org

Porn star Riley Reid made an AI chatbot of herself.

Alrighty, so those of you keeping lists of "things that haven't happened yet," you can cross that off the list.

The article is paywalled, if you really need to read it.

I fell in (and out of) love with Riley Reid's AI porn bot

#solidstatelife #ai #genai #llms #chatbots

waynerad@diasp.org

""Klarna AI assistant handles two-thirds of customer service chats in its first month."

"Klarna Bank is an online financial services in Sweden."

"The AI assistant has had 2.3 million conversations, two-thirds of Klarna's customer service chats."

"It is doing the equivalent work of 700 full-time agents."

"It is on par with human agents in regard to customer satisfaction score."

"It is more accurate in errand resolution, leading to a 25% drop in repeat inquiries."

"Customers now resolve their errands in less than 2 mins compared to 11 mins previously."

"It's available in 23 markets, 24/7 and communicates in more than 35 languages."

"It's estimated to drive a $40 million USD in profit improvement to Klarna in 2024."

Klarna AI assistant handles two-thirds of customer service chats in its first month

#solidstatelife #ai #genai #llms

waynerad@diasp.org

"Even LLMs need education -- quality data makes LLMs overperform."

In other words, textbooks are all you need?

The idea is that instead of making a huge language model, you zero in on the best possible training data -- which for a large language model means textbooks, or "textbook-like data" -- and even create your own, called "synthetic data".

These researchers developed "a data set of toddler-level stories called TinyStories that could be used to create models of less than ten million parameters that still produced comprehensible outputs. They trained a whole LLM from the ground up in a single day only using a single GPU -- probably less than $100 worth of compute time. The stories it produced were grammatically correct, maintained consistency, and showed reasoning."

"If you were to ask someone to learn how to build a rocket ship just by searching the internet, you'd likely not have great results. Sure, there may be some good resources and communities that ahem get you off the ground. But there's also a lot of cruft out there -- anyone can put something on the internet and there's nobody to vet it."

"If you instead gave someone a textbook on rocketry, they'd at least know how to start, what the concepts are, and how to move towards an answer."

Even LLMs need education -- quality data makes LLMs overperform

#solidstatelife #ai #genai #llms