#agenticai

waynerad@diasp.org

Grunty is a "self-hosted desktop app to have AI control your computer, powered by the new Claude computer use capability. Allow Claude to take over your laptop and do your tasks for you (or at least attempt to, lol). Written in Python, using PyQt."

"If it wipes your computer, sends weird emails, or orders 100 pizzas... that's on you."

Grunty

#solidstatelife #ai #llms #genai #agenticai #anthropic

waynerad@diasp.org

"Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku."

For safety reasons, the last thing we'd allow an AI to do is take full control over a computer, looking at the screen and typing keys and moving the mouse and doing mouse clicks, just like a human, enabling it to do literally everything on a computer a human can do. Oh wait...

"Available today on the API, developers can direct Claude to use computers the way people do -- by looking at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta. At this stage, it is still experimental -- at times cumbersome and error-prone. We're releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time."

"Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company have already begun to explore these possibilities, carrying out tasks that require dozens, and sometimes even hundreds, of steps to complete. For example, Replit is using Claude 3.5 Sonnet's capabilities with computer use and UI navigation to develop a key feature that evaluates apps as they're being built for their Replit Agent product."

But unlike me, everyone else seems to be reacting very positively.

"It doesn't get said enough: Not only is Claude the most capable LLM, but they also have the best character. Great work Claude and Team!"

"Just imagine the accessibility possibilities. For those with mobility or visual impairments, Claude can assist with tasks by simply asking, like helping in usage with apps and systems that often lack proper accessibility features."

That's a good point, actually.

Still, you might want to run it in a VM for now?

"Wow, this is going to be quite game-changing!"

"Impressive to see Claude navigating screens like a human! Though still in beta, this could be a game-changer for automating tedious tasks. Can't wait to see how it develops!"

"What I found particularly noteworthy in this demo was that the information wasn't copied from the CRM, but typed letter by letter. Purely speculating, but perhaps because there are rare cases where websites do not accept copied input, which often also affects password managers."

"This is RPA-like functionality. Wow, Will this be a game-changer?"

RPA stands for Robotic Process Automation.

"What are the security implications of this? Could a bad actor use this to ask Claude to go into other people's computers and access their confidential information?"

Ok, at least one person besides me is feeling a little worried.

"That's epic, you guys have the best AI. This company is something special."

"Computer Use is truly a pivotal advancement. Enabling AI to interact with computers like humans do is a significant leap towards AGI. Exciting times ahead!"

"Looks like Siri on screen awareness but two (or more) years early and available for use now (but meanwhile, on server.) WOW. Well done guys."

"Absolutely incredible -- Super excited to build with this & see what others build!"

"Immediately prompting: 'Do all my work'"

If Claude can do all your work, why will you get paid?

"This could be huge for companies struggling with legacy systems and modernization."

"This is one more pivotal point in AI's evolution. In 2025, more innovation and use cases will emerge, and human involvement is slowly being eliminated. It looks like a small improvement, but it's huge at its core and will significantly impact how AI will be used in a few years. Kudos Claude Team!"

Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

#solidstatelife #genai #llms #agenticai #anthropic

waynerad@diasp.org

"BabyAGI 2o is an exploration into creating the simplest self-building autonomous agent. Unlike its sibling project BabyAGI 2, which focuses on storing and executing functions from a database, BabyAGI 2o aims to iteratively build itself by creating and registering tools as required to complete tasks provided by the user. As these functions are not stored, the goal is to integrate this with the BabyAGI 2 framework for persistence of tools created."

The naming might be confusing. OpenAI came out with a model called "o1", and the name "2o" might get you thinking this BabyAGI is using the "o" model. That's not the case.

This is a variant of BabyAGI 2 that automatically installs anything it likes and runs code generated by LLMs, continuously trying to update itself and its tools in order to accomplish whatever task you give it. It works with a variety of LLMs -- it uses a system called LiteLLM that lets you choose between more than 100 LLMs. It tries to do everything without human intervention, so when errors happen, it will try to learn from them and continue iterating towards task completion.
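
To give a feel for the "self-building" part, here is a heavily simplified sketch of that loop, using LiteLLM's completion() for the model call. The prompt format, the tools dict, and the exec-based "registration" are my own illustration, not the actual BabyAGI 2o code (which, among other things, uses real function calling and installs packages on demand):

```python
# Heavily simplified sketch of a self-building agent loop (illustrative only).
# Running LLM-generated code with exec() is exactly why you'd want a VM/sandbox.
from litellm import completion  # LiteLLM: one interface to 100+ LLMs

tools = {}  # name -> callable, grows as the agent creates tools

def ask_llm(prompt, model="gpt-4o-mini"):
    resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def run(task, max_iters=10):
    history = f"Task: {task}\n"
    for _ in range(max_iters):
        reply = ask_llm(history + f"Existing tools: {list(tools)}\n"
                        "Either write a Python function defining a new tool, "
                        "or reply 'DONE: <result>' if the task is complete.")
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        try:
            namespace = {}
            exec(reply, namespace)                  # run the generated code
            for name, obj in namespace.items():
                if callable(obj):
                    tools[name] = obj               # "register" the new tool
            history += f"\nCreated/updated tools: {list(tools)}"
        except Exception as err:
            history += f"\nError: {err}"            # feed the error back, keep iterating
    return "Stopped after max_iters without finishing."
```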

babyagi-2o

#solidstatelife #ai #genai #llms #agenticai

waynerad@diasp.org

PaperQA2: Superhuman scientific literature search.

"To evaluate AI systems on retrieval over the scientific literature, we first generated LitQA2, a set of 248 multiple choice questions with answers that require retrieval from scientific literature. LitQA2 questions are designed to have answers that appear in the main body of a paper, but not in the abstract, and ideally appear only once in the set of all scientific literature."

"We generated large numbers of questions about obscure intermediate findings from very recent papers, and then excluded any questions where either an existing AI system or a human annotator could answer the question using an alternative source."

These were generated by human experts.

"When answering LitQA2 questions, models can refuse to answer." "Some questions are intended to be unanswerable."

After creating LitQA2, the researchers then turned their attention to creating an AI system that could score highly on it.

"Retrieval-augmented generation provides additional context to the LLM (e.g., snippets from research papers) to ground the generated response. As scientific literature is quite large, identifying the correct snippet is a challenge. Strategies like using metadata or hierarchical indexing can improve retrieval in this setting, but finding the correct paper for a task often requires iterating and revising queries. Inspired by PaperQA, PaperQA2 is a retrieval-augmented-generation agent that treats retrieval and response generation as a multi-step agent task instead of a direct procedure. PaperQA2 decomposes retrieval-augmented generation into tools, allowing it to revise its search parameters and to generate and examine candidate answers before producing a final answer. PaperQA2 has access to a 'Paper Search' tool, where the agent model transforms the user request into a keyword search that is used to identify candidate papers. The candidate papers are parsed into machine readable text, and chunked for later usage by the agent. PaperQA2 uses the state-of-the-art document parsing algorithm, Grobid, that reliably parses sections, tables, and citations from papers. After finding candidates, PaperQA2 can use a 'Gather Evidence' tool that first ranks paper chunks with a top-k dense vector retrieval step, followed by an LLM reranking and contextual summarization step. Reranking and contextual summarization prevents irrelevant chunks from appearing in the retrieval-augmented generation context by summarizing and scoring the relevance of each chunk, which is known to be critical for retrieval-augmented generation. The top ranked contextual summaries are stored in the agent's state for later steps. PaperQA2's design differs from similar retrieval-augmented generation systems like Perplexity, Elicit, or Mao et al. which deliver retrieved chunks without substantial transformation in the context of the user query. While reranking and contextual summarization is more costly than retrieval without a contextual summary, it allows PaperQA2 to examine much more text per user question."

"Once the PaperQA2 state has summaries, it can call a 'Generate Answer' tool which uses the top ranked evidence summaries inside a prompt to an LLM for the final response to the asked questions or assigned task. To further improve recall, PaperQA2 adds a new 'Citation Traversal' tool that exploits the citation graph as a form of hierarchical indexing to add additional relevant sources."

Got that? Ok, that's quite a lot, so to summarize: the system consists of 5 'agents': 1. a paper search agent, 2. a gather evidence agent, 3. a citations traversal agent, 4. a generate answer agent, and 5. the PaperQA2 'manager' agent, which directs all the other agents, takes the question as input, and outputs the final answer.
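
To make that concrete, here's a rough Python sketch of the two middle steps -- gathering evidence (retrieve, rerank, summarize) and generating the final answer -- the pieces you can express without a paper-search backend. The function names and the toy word-overlap "retrieval" are my own illustration, not the real paper-qa package's API, and llm here is assumed to be any text-in/text-out callable:

```python
# Illustrative sketch of PaperQA2's "Gather Evidence" and "Generate Answer"
# steps as described above; not the actual paper-qa API.

def gather_evidence(question, chunks, llm, k=30, top_n=5):
    """Top-k retrieval, then LLM reranking + contextual summarization."""
    # Stand-in for dense vector retrieval: rank chunks by crude word overlap.
    q_words = set(question.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)[:k]
    scored = []
    for chunk in ranked:
        # Contextual summary: summarize the chunk *in relation to the question*
        # and have the LLM score its relevance, so irrelevant chunks drop out.
        reply = llm(f"Summarize this passage only as it relates to the question "
                    f"'{question}'. End with a line 'RELEVANCE: <1-10>'.\n\n{chunk}")
        try:
            score = float(reply.rsplit("RELEVANCE:", 1)[1].strip())
        except (IndexError, ValueError):
            score = 0.0
        scored.append((score, reply))
    scored.sort(key=lambda pair: pair[0], reverse=True)   # LLM reranking
    return [summary for _, summary in scored[:top_n]]     # kept in agent state

def generate_answer(question, evidence, llm):
    """Final answer grounded only in the gathered evidence summaries."""
    context = "\n\n".join(evidence)
    return llm(f"Using only the evidence below, answer the question.\n\n"
               f"Evidence:\n{context}\n\nQuestion: {question}")
```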

Here's an example question: "What effect does bone marrow stromal cell-conditioned media have on the expression of the CD8a receptor in cultured OT-1 T cells?"

PaperQA2 answer: "The effect of bone marrow stromal cell conditioned media (SCM) on the expression of the CD8a receptor in cultured OT-1 T cells is reported to have no significant impact. According to the study by Kellner et al (2024)."

The referenced paper says, on page 24: "OT-1 T-cells are primary mouse CD8+ T cells that specifically recognize the ovalbumin peptide residues 257-264 (SIINFEKL) presented on a class I major histocompatibility complex. OT-1 transgenic mice were housed and bred at the Division of Animal Resources at Emory University following all Institutional Animal Care and Use Committee protocols. OT-1 T-cells were magnetically isolated from spleens using CD8a+ negative selection according to the manufacturer's protocol (Miltenyi Biotec, # 130-104-075). Purified OT-1 T-cells were resuspended in unconditioned medium (UCM), bone marrow stromal cell conditioned media (SCM), or adipocyte conditioned medium (ACM) at 1x10^6 cells/mL in a 24 well plate and incubated at 37 degrees C, 5% CO2 for 24h prior to being seeded on tension probes. Images were taken 15 minutes after seeding OT-1 T-cells on tension probes. Fluorescence intensity values of individual cells were quantified from at least 10 images in FIJI software."

No, I didn't figure out what "SIINFEKL" stands for. I asked Google what it stands for and its "AI Overview" gave me a blatantly wrong answer (ironically?). One paper referred to it as "the well-known ovalbumin epitope SIINFEKL" -- but it's not well known enough to have a Wikipedia page or a Science Direct summary page saying what it stands for and giving a basic description of what it is. By the way, the term "epitope" means the part of a molecule that activates an immune system response, especially the part of the immune system that adapts new responses, primarily T cell and B cell receptors. Stromal cells of various types exist throughout the body, but the question here refers specifically to bone marrow stromal cells. These are "progenitor" cells that produce bone and cartilage cells, as well as cells that function as part of the immune system, such as cells that produce chemokines, cytokines, IL-6, G-CSF, GM-CSF, CXCL12, IL7, and LIF (for those of you familiar with the immune system -- if you're not, I'm not going on another tangent to explain what those are), though from what I can tell they don't produce T-cells or B-cells. T-cells and B-cells are produced in bone marrow, but not from stromal cells.

"OT-1" refers to a strain of transgenic mice sold by The Jackson Laboratory. CD8a is a gene that is expressed in T cells.

Anyway, let's get back to talking about PaperQA2.

So how did PaperQA2 do?

"We evaluate two metrics: precision, the fraction of questions answered correctly when a response is provided, and accuracy, the fraction of correct answers over all questions."

"In answering LitQA2 questions, PaperQA2 parsed and utilized an average of 14.5 papers per question. Running PaperQA2 on LitQA2 yielded a precision of 85.2% , and an accuracy of 66.0%, with the system choosing 'insufficient information' in 21.9% of answers."

"To compare PaperQA2 performance to human performance on the same task, human annotators who either possessed a PhD in biology or a related science, or who were enrolled in a PhD program, were each provided a subset of LitQA2 questions and a performance-related financial incentive of $3-12 per question to answer as many questions correctly as possible within approximately one week, using any online tools and paper access provided by their institutions. Under these conditions, human annotators achieved 64.3% +/- 15.2% precision on LitQA2 and 63.1% +/- 16.0% accuracy. PaperQA2 thus achieved superhuman precision on this task and did not differ significantly from humans in accuracy."

For "precision": PaperQA2 85.2%, Perplexity Pro 69.7%, human 64.3%, Geminin Pro 1.5 51.7%, GPT-4o 44.6%, GPT-4-Turbo 43.6%, Elicit 40.9%, Claude Sonnet 3.5 37.7%, Cloude Opus 23.6%.

For "accuracy": PaperQA2 66.0%, human 63.1%, Perplexity Pro 52.0%, Elicit 25.9%, GPT-4o 20.2%, GTP-4-Turbo 13.7%, Gemini Pro 1.5 12.1%, Claude Sonnet 3.5 8.1%, Claude Opus 5.2%.

PaperQA2: Superhuman scientific literature search

#solidstatelife #ai #genai #llms #agenticai

waynerad@diasp.org

Eric Schmidt gave a talk at Stanford Business School that was so censored it took me about 2 seconds to find it on YouTube -- oh wait, it's gone from YouTube. I guess it really is censored after all. And it's subtitled in Chinese, suggesting this talk is of interest to Chinese people. Er, was subtitled in Chinese. I guess it's gone now. Anyway, it was Eric Schmidt answering audience questions, moderated by Erik Brynjolfsson.

Anyway, Eric Schmidt has 2 predictions:

  1. He predicts that LLMs will soon have 1-million-token context windows, which he says is 20 books, but I estimated 1,500 pages, which is more like 3 books.

  2. The next thing he predicts is "text-to-action" AIs. You give it text, and it does the actions you ask for. How it does this is by writing code (e.g. Python) and then running it. This is also called "agentic" AI.
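
A toy sketch of the "text-to-action" pattern he's describing follows. This is my illustration, not anything from the talk or any particular product, and exec-ing model-generated code like this is exactly what you'd sandbox in real life:

```python
# Toy "text-to-action" loop: the model writes Python, we run it.
# Illustrative only -- in practice, run the generated code in a VM/container.
from litellm import completion

def text_to_action(request, model="gpt-4o-mini"):
    reply = completion(model=model, messages=[{
        "role": "user",
        "content": "Reply with only Python code (no prose, no markdown) that "
                   "does the following: " + request,
    }])
    code = reply.choices[0].message.content
    exec(code, {})   # the "action": running the generated program

text_to_action("Write a file named notes.txt containing today's date.")
```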

There's a chemistry lab where knowledge from experiments is fed back into the AI, which it uses to plan the next experiments, which are carried out overnight (by humans? by robotics?), and this is accelerating knowledge in chemistry and materials science. I don't remember the name, but if you watch the video, eh, oh wait.

He envisions a future where, for example, TikTok gets banned, and you could go to an LLM and say, "Make me a TikTok clone", and you can just repeat that over and over and over until you hit upon a clone that "goes viral".

A few other points of note: He said he oscillates between thinking open source models and closed source models will win. It seems like only closed source is possible because of the huge amount of money involved. But then open source catches up and he flips to thinking the other way.

With regard to China, he says the US is ahead and has to stay ahead. Because of the huge amounts of money and expertise involved, only a few countries can compete -- the US and China and maybe a few others -- but not the EU, because Brussels screwed them. Everyone else just lives in the AI world the giants are creating. With regard to national security, countries will align themselves with the US or China, with the EU, South Korea, Japan, etc., in our camp.

Brrrrrp! I found a video with clips from the Stanford talk with commentary (from Matthew Berman) that seems to have not been taken down. He (Berman) focuses on things in the talk I didn't mention, like how CUDA locks people into Nvidia and that's responsible for Nvidia's disproportionately high market cap, and how people at Google aren't working 80-hour weeks any more but he thinks they should be.

#solidstatelife #ai #genai #llms #agenticai

https://www.youtube.com/watch?v=7PMUVqtXS0A

waynerad@diasp.org

The CEO of Bumble, the dating app, Whitney Wolfe Herd, says you'll soon be able to create an AI dating "concierge" that goes on the dating app in your place and does everything you would do, like swiping and small talk, and since other people will do the same thing, AI dating concierges will interact with other AI dating concierges to find matches.

Sounds like a perfect way to find compatible couples... or an episode of Black Mirror?

Will you let AI date for you? Bumble says it's the future | Vantage with Palki Sharma - Firstpost

#solidstatelife #ai #genai #llms #agenticai #relationships