#solidstatelife

waynerad@diasp.org

"Chipotle's testing an avocado-peeling robot and an automated bowl assembly line."

The video is just a short animated GIF.

There's also a short video of the "Augmented Makeline" preparing a burrito bowl.

My take: Progress is slow in robotics. Robotics' "ChatGPT moment" still hasn't happened.

Chipotle's testing an avocado-peeling robot and an automated bowl assembly line

#solidstatelife #robotics

waynerad@diasp.org

"At a glance, SocialAI -- which is billed as a pure 'AI Social Network' -- looks like Twitter, but there's one very big twist on traditional microblogging: There are no other human users here. Just you."

"In a nutshell, SocialAI lets you share your thoughts with an infinite supply of ever-available AI-powered bots that can endlessly chat back."

"Think about it: No remark you post to SocialAI will ever be greeted with silence nor fail to engage en masse. You simply can't get ghosted. The app's faux users exist to hang on your every word -- leveraging programmed enthusiasm to chip canned commentary into your replies (even the sarcastic, snarky, and pessimistic bots can't resist joining these continuous scroll comment pile-ons)."

Alrighty then.

SocialAI offers a Twitter-like diary where AI bots respond to your posts | TechCrunch

#solidstatelife #ai #genai #llms #chatbots #socialnetworking

waynerad@diasp.org

I heard the judge at Google's antitrust trial left to coach Kamala Harris for her Presidential debate, but after looking for more information on (ironically?) Google, I discovered it wasn't the judge -- it was Google's lead defense attorney. That's a lot less weird, but still a little weird.

"Karen Dunn, a litigator at Paul Weiss, opened Google's defense in a federal court case targeting its digital ad business. Shortly after, she reportedly helped Harris prepare for her debate with Donald Trump in Philadelphia."

What are y'all's predictions for how these antitrust cases are going to turn out for Google? Business as usual, or a radical change in Google's business practices?

Who is Karen Dunn? Key figure behind Kamala Harris' debate preparations

#solidstatelife #domesticpolitics #monopoly #antitrust

waynerad@diasp.org

"The Safe C++ project adds new technology for ensuring memory safety, and isn't just a reiteration of best practices."

"Safe C++ prevents users from writing unsound code. This includes compile-time intelligence like borrow checking to prevent use-after-free bugs and initialization analysis for type safety.'"

"Sean Baxter, creator of the Circle compiler, said that rewriting a project in a different programming language is costly, so the aim here is to make memory safety more accessible by providing the same soundness guarantees as Rust at a lower cost. 'With Safe C++, existing code continues to work as always,' he explained. 'Stakeholders have more control for incrementally opting in to safety.'"

The empire of C++ strikes back with Safe C++ blueprint - The Register

#solidstatelife #computerscience #programminglanguages

waynerad@diasp.org

"3GPP Release-18 Physical Layer Enhancements for IoT-NTN".

"The advent of mega satellite constellations has paved the way for bringing cellular connectivity to mobile broadband as well as Internet of Things (IoT) devices in unconnected or remote regions via satellites, thus complementing the terrestrial cellular coverage. With this motivation, the 3rd Generation Partnership Project (3GPP) introduced Non-Terrestrial Networks (NTN) support for IoT technologies including Long-Term Evolution (LTE) for Machine-Type Communications (LTE-MTC) and Narrowband Internet of Things (NB-IoT) in Release-17, which are collectively known as IoT-NTN. Release-17 IoT-NTN focused on only the essential changes to the existing IoT specification to enable IoT operation in satellite scenarios."

So in case you didn't follow the techspeak, what this is about is being able to have electronic devices in the most remote parts of the world, and as long as they have electric power, they can get on the internet and transmit and receive data by communicating through satellites. They need particular electronic circuits that speak what have become known as "internet of things" (IoT) protocols -- protocols originally designed to carry internet protocol (TCP/IP) traffic over low-power, low-bandwidth radio links.

Here we find an organization that develops standards for terrestrial cellular phone networks (3GPP) getting in on the act and developing standards for this internet of things using satellites, which they call IoT-NTN ("NTN" for "non-terrestrial networks").

When you run IoT communication over cellular networks (before getting to satellites -- regular cellular networks), that goes by the name "LTE-MTC", which as you see above, stands for "Long-Term Evolution Machine Type Communication". Don't ask me how cellular network standards came to be called "Long-Term Evolution", but that's what they're called. Tack on "Machine Type Communication" and now you're talking about machine-to-machine instead of human-to-human communication. How does this have to be modified to get it to work through satellites? That's what 3GPP's Release-17 answers. But this is Release-18.

Finishing out the abstract:

"IoT-NTN was further evolved in Release-18 and several performance enhancements were introduced. In this article, an overview of the Release-18 physical layer enhancements for IoT-NTN is provided. Specifically, a new feature related to disabling Hybrid Automatic Repeat reQuest (HARQ) feedback is described which helps mitigate the impact of HARQ stalling on throughput. Enhancements related to improving Global Navigation Satellite System (GNSS) operation are also discussed that enable the user equipment (UE) to maintain its uplink (UL) synchronization if its GNSS position becomes invalid during an ongoing connection. In addition, the performance evaluation of IoT-NTN technologies is presented in the context of International Mobile Telecommunications-2020 (IMT-2020) satellite performance requirements related to connection density."

So Release-17 came out in 2021, and Release-18 just came out, and Release-18 ties down some loose ends from Release-17. If you're wondering if "Global Navigation Satellite System" (GNSS) refers to the GPS system, the answer is... yes, but GNSS is actually a generic term for all GPS-type systems -- the US system is called GPS, Russia has a system called GLONASS (from "Globalnaya Navigatsionnaya Sputnikovaya Sistema"), China has a system called BeiDou, and the European Union has a system called Galileo. Japan has a system called the Quasi-Zenith Satellite System (QZSS), but rather than function as its own independent system, it's just a 4-satellite system that supplements the US GPS system around Japan.

Another loose end is HARQ, which stands for Hybrid Automatic Repeat reQuest. HARQ is a combination of an error-correction protocol and a system for requesting messages get repeated. With error-correction codes, additional bits are added to messages to enable errors to be detected and corrected -- within limits. If there are too many errors, however, the error-correction system gives up and the system requests the whole message be repeated. The "hybrid" part of "HARQ" is about how the HARQ system is a hybrid system that handles both -- error correction and message repeating.
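To make the "hybrid" idea concrete, here's a toy Python sketch: forward error correction first, retransmission as the fallback. The repetition-plus-parity code here is my own minimal stand-in for illustration, not the actual (far stronger) coding scheme 3GPP uses:

```python
# Toy hybrid ARQ: a 3x repetition code corrects single bit errors,
# a parity checksum detects when correction failed, and a failed
# checksum signals that a retransmission should be requested.

def encode(bits):
    """Repeat each bit 3 times and append a simple parity checksum."""
    payload = [b for bit in bits for b in (bit, bit, bit)]
    checksum = sum(bits) % 2
    return payload + [checksum]

def decode(received):
    """Majority-vote each triple, then verify the checksum.
    Returns (bits, True) on success, or (None, False) to signal
    that the whole message should be retransmitted."""
    *payload, checksum = received
    bits = []
    for i in range(0, len(payload), 3):
        triple = payload[i:i + 3]
        bits.append(1 if sum(triple) >= 2 else 0)
    if sum(bits) % 2 != checksum:
        return None, False   # too many errors: request a repeat
    return bits, True

msg = [1, 0, 1, 1]
codeword = encode(msg)

# A single bit flip inside one triple is corrected by majority vote.
corrupted = codeword[:]
corrupted[0] ^= 1
bits, ok = decode(corrupted)
assert ok and bits == msg

# Two flips in the same triple defeat the vote; the checksum catches
# it and the decoder asks for a retransmission instead.
badly_corrupted = codeword[:]
badly_corrupted[0] ^= 1
badly_corrupted[1] ^= 1
assert decode(badly_corrupted) == (None, False)
```

Real HARQ goes further: rather than discarding a failed transmission, the receiver combines it with the retransmission (chase combining or incremental redundancy) to squeeze information out of both copies.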

The paper is paywalled, so I had to go searching on the internet to find out more. See below.

3GPP Release-18 Physical Layer Enhancements for IoT-NTN

#solidstatelife #communication #cellular #satellite #gps #iot #iotntn

waynerad@diasp.org

"Runway ML and Lionsgate are partnering to explore the use of AI in film production."

"Lionsgate and Runway have entered into a first-of-its-kind partnership centered around the creation and training of a new AI model, customized on Lionsgate's proprietary catalog. Fundamentally designed to help Lionsgate Studios, its filmmakers, directors and other creative talent augment their work, the model generates cinematic video that can be further iterated using Runway's suite of controllable tools."

And so it begins? No mention of how there won't be any loss of jobs.

Runway partners with Lionsgate | Runway News

#solidstatelife #ai

waynerad@diasp.org

"Kalshi, a US-regulated prediction market platform, won its federal lawsuit against the Commodity Futures Trading Commission (CFTC) over a plan to offer contracts on which party will control each house of Congress after the November election."

"Although the CFTC could appeal, Kalshi, which had been locked out of this year's election betting boom while the case was pending, can now grab a sliver of that action in the last two months before the election."

The political prediction market has been dominated by Polymarket.

Kalshi cleared to offer congressional prediction markets in victory against CFTC

#solidstatelife #cryptocurrency

waynerad@diasp.org

AI can't cross this line on a graph and we don't know why.

The graph has the "error" that the neural net is trying to minimize as part of its training (also called the "loss") on the vertical axis.

On the horizontal axis, it has the amount of computing power thrown at the training process.

When switched to a log-log graph -- logarithmic on both axes -- a straight line emerges.

This is actually one of 3 observed neural network scaling laws. The other two look at model size and dataset size, and see a similar pattern.
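The straight line on a log-log plot is the signature of a power law: loss = a * compute^(-b), so log(loss) = log(a) - b*log(compute), which is linear with slope -b. A quick numerical check (the coefficients here are made up purely for illustration, not fitted to any real model):

```python
import math

a, b = 2.5, 0.05                        # hypothetical scaling coefficients
compute = [10**k for k in range(3, 9)]  # training compute (arbitrary units)
loss = [a * c**(-b) for c in compute]   # power-law loss curve

# In log-log space, the slope between every pair of adjacent points
# is the same constant (-b) -- that's the straight line on the graph.
slopes = [
    (math.log(loss[i + 1]) - math.log(loss[i]))
    / (math.log(compute[i + 1]) - math.log(compute[i]))
    for i in range(len(compute) - 1)
]
assert all(abs(s + b) < 1e-9 for s in slopes)
```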

Have we discovered some fundamental law of nature, like the ideal gas law in chemistry, or is this an artifact of the particular methods we are using now to train neural networks?

You might think someone knows but no one knows.

That didn't stop this YouTuber from making some good animations of the graphs and various concepts in neural network training, such as cross-entropy. The video introduces the interesting idea that language may have a certain inherent entropy.
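For the curious, cross-entropy -- the "loss" on the vertical axis -- is the average number of bits the model's predicted distribution needs to encode the actual next token. A tiny worked example (the distributions are invented for illustration):

```python
import math

def cross_entropy(true_probs, model_probs):
    """Average bits to encode samples from true_probs using a code
    optimized for model_probs."""
    return -sum(p * math.log2(q)
                for p, q in zip(true_probs, model_probs) if p > 0)

true = [0.5, 0.25, 0.25]

# When the model matches the true distribution, cross-entropy equals
# the distribution's own entropy: 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits.
assert abs(cross_entropy(true, true) - 1.5) < 1e-9

# Any mismatched model costs extra bits -- the loss can never dip
# below the true entropy, which is one intuition for a floor that
# scaling can approach but not cross.
uniform = [1/3, 1/3, 1/3]
assert cross_entropy(true, uniform) > 1.5
```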

The best theory as to why the scaling laws hold tries to explain it in terms of neural networks learning high-dimensional manifolds.

AI can't cross this line and we don't know why. - Welch Labs

#solidstatelife #ai #llms #genai #deeplearning #neuralnetworks #scalinglaws

waynerad@diasp.org

Alexis Conneau, OpenAI's research lead for GPT-4o/GPT-5, has left OpenAI to start a new company to create "Her", as in, from the movie. (Alrighty then.)

Career update: After an amazing journey at @OpenAI building #Her, I’ve decided to start a new company

#solidstatelife #ai #genai #llms #chatbots #her

waynerad@diasp.org

"FastHTML: The fastest, most powerful way to create an HTML app".

"FastHTML apps are just Python code, so you can use FastHTML with the full power of the Python language and ecosystem. FastHTML's functionality maps 1:1 directly to HTML and HTTP, but allows them to be encapsulated using good software engineering practices -- so you'll need to understand these foundations to use this library fully."

In the section on "Getting help from AI" it says:

"Because FastHTML is newer than most LLMs, AI systems like Cursor, ChatGPT, Claude, and Copilot won't give useful answers about it. To fix that problem, we've provided an LLM-friendly guide that teaches them how to use FastHTML. To use it, add this link for your AI helper to use:

/llms-ctx.txt
"

I wonder if we're going to see more of this kind of thing for new tech.

FastHTML: The fastest, most powerful way to create an HTML app

#solidstatelife #ai #genai #llms

waynerad@diasp.org

"Oracle is designing a data center that would be powered by three small nuclear reactors" alrighty then.

The data center will require more than a gigawatt of electricity.

The article says the location has been chosen and building permits obtained, but the nuclear reactor designs have not been revealed.

"Small modular nuclear reactors are new designs that promise to speed the deployment of reliable, carbon-free energy as power demand rises from data centers, manufacturing and the broader electrification of the economy. Generally, these reactors are 300 megawatts or less, about a third the size of the typical reactor in the current US fleet."

Oracle is designing a data center that would be powered by three small nuclear reactors

#solidstatelife #energy #nuclear

waynerad@diasp.org

"The struggle over digital infrastructure": Commentary by Robin Berjon, deputy director of the IPFS Foundation. InterPlanetary File System (IPFS) is a protocol for a peer-to-peer distributed file system.

"By and large, the Internet seems more akin to a failed state, at best an undergoverned space where fragments of essential infrastructure are provided at the whim of local warlords."

"The cyberlibertarian vision of yesteryears is at the root of the myriad problems confronting global digital governance today."

The cyberlibertarian vision of yesteryear was that bad, huh?

"One important property of the Internet is its adhesion to the end-to-end principle, one formulation of which is: 'Nothing should be done in the network that can be efficiently done in an end-system.' This may seem somewhat abstract, but we can see the end-to-end principle at work in other infrastructural systems. For instance, if you invent a new type of light bulb or toaster, you don't need to change the lighting or toasting functions of the electrical grid."

"Seen from 2024, this might strike readers as obvious but it wasn't always so. Before the Internet emerged as the more successful alternative, it was competing for funding and attention with a much more telecoms-centric model. Under the telecoms model, intelligence resides in the network rather than at the edges."

"Crucially, intermediary capture does not only describe the world we had when networking was dominated by telecoms operators, it is also an apt description of the world we now have in which many infrastructure layers of our digital spaces have been captured by tech platforms."

"Global majority countries (as well as most global minority ones) are entirely right to feel colonized by Google and other platforms, in a very literal sense. The Internet is 'the infrastructure of all infrastructure' and having one's critical infrastructure controlled and exploited by a foreign entity is colonial. But the way forward does not lie with a reversal to the telecoms model."

Ok. What to do, then?

The Struggle Over Digital Infrastructure | TechPolicy.Press

#solidstatelife #geopolitics

waynerad@diasp.org

"Simpler solution for disabling the DCM telematics -- Silencing Antennas"

"We just bought our 2023 Toyota Tacoma TRD Off Road at the end of November and in reading through the manual found out that we could contact Toyota to opt out of data reselling (to insurance companies and advertisers) but couldn't actually disable our Data Communication Module (DCM)/Telematics module from connecting to the internet via user-accessible menus from the truck."

So you can opt out of the data reselling but you can't opt out of the data collection.

This got my attention because it made me wonder, is this kind of data collection happening for all or almost all new cars these days? Do essentially all new cars have "an internet connected computer sitting on my CAN Bus"?

"After tearing apart my passenger dash by removing the scuff plate, cowl side trim board, instrument side panel, passenger knee airbag, glove compartment plate, and finally the glove box/'instrument lower panel assembly' with a trim tool, a philips, and a 10 mm socket I was able to see the DCM module on the far right with the three antenna connectors."

The post goes on to describe modifications to the vehicle. They're specific to this vehicle, so unless you have the same truck and feel comfortable doing your own maintenance, you might not be interested in what follows.

"If my understanding of the Toyota documentation is correct, this should continue to run happily with the DCM/telematics module believing it is out of cell coverage range, and then just overwriting the oldest events in the internal memory in the vain hope it someday hears a cell tower again. The nice part of this approach is should I (or the dealer) ever need to undo this mod, it's completely reversible. Since the three radios were properly terminated into a 50-ohm terminator, there won't be any damage to the transmitting or receiving side of the DCM module, and there also won't be any damage to the wiring on the D37 connector either."

"Altogether, this only took about an hour and a half to do once I figured out the right combination of connectors to use."

Simpler solution for disabling the DCM telematics - Silencing Antennas

#solidstatelife #surveillance

waynerad@diasp.org

"Oura has acquired metabolic health startup Veri."

"Blood sugar levels are foundational to the Veri platform. The Finnish company notes, 'Veri does more than show you blood sugar levels. We help you stabilize your levels by providing the insight and guidance you need to find the right foods and habits for you.'"

"Oura CEO Tom Hale tells TechCrunch that, according to an internal survey, 97% of its users are 'really interested in understanding how nutrition affects their health.' The more surprising stat, however, is that 13% of those surveyed have been wearing a continuous glucose monitor prior to the recent increased availability of the devices."

Oura has acquired metabolic health startup Veri

#solidstatelife #medicaldevices

waynerad@diasp.org

"Today I read yet again someone suggesting that using ChatGPT to rewrite code from one programming language to another is a great idea. I disagree: a programming language is an opinionated way on how to better achieve a certain task and switching between world views without understanding how and why they do things the way they do is a recipe for inefficient code at best and weird bugs at worse."

"I decided to test my theory with Google's Gemini - I've seen students using it in their actual coding (probably because it's free) making it a fair choice. I asked the following:"

"Convert the following code from Python to Elixir:"

The code looks equivalent, but it's only equivalent for the normal case -- if the input is bad, the Python and Elixir code behave differently. I think this is a good example of how LLMs can produce translations that look correct but are subtly wrong.
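The post's actual code isn't quoted here, so as a hypothetical stand-in, here's a one-line Python function whose "obvious" Elixir port agrees on normal input but diverges on negative numbers: Python's `//` floors toward negative infinity, while Elixir's `div/2` truncates toward zero.

```python
def bucket(value, size):
    # Python's // is floor division: it rounds toward negative infinity.
    return value // size

assert bucket(7, 2) == 3     # normal case: Python and an Elixir port agree
assert bucket(-7, 2) == -4   # Python floors to -4...
# ...but the naive Elixir translation, div(value, size), truncates
# toward zero: div(-7, 2) == -3. Same-looking code, different answer.
```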

#solidstatelife #ai #genai #llms #codingai

https://7c0h.com/blog/new/therac_25_and_llms.html

waynerad@diasp.org

"Why do you need a time series database inside a car?"

That's a good question. Sometimes I wonder if we've passed the point where further complexification of cars yields much benefit. But let's continue.

"As automotive intelligence progresses, vehicles generate increasing amounts of time-series data from various sources. This leads to high costs in data collection, transmission, and storage. GreptimeDB's Integrated Vehicle-Cloud Solution addresses these issues by leveraging the advanced computational capabilities of modern on-vehicle devices. Unlike traditional vehicle-cloud coordination where vehicles are mere data collectors, this new approach treats them as full-fledged servers capable of running complex tasks locally. The evolution from 32-bit microcontrollers to powerful chip modules like Qualcomm's 8155 or 8295 has enabled intelligent vehicles to perform edge computing efficiently, reducing transmission costs and improving overall efficiency."

"GreptimeDB is a cloud-native time-series database built on a highly scalable foundation. However, we did not initially anticipate it running on edge devices such as vehicles, which has presented significant challenges."

"The first challenge is resource usage constraints. GreptimeDB runs in the vehicle's cockpit domain controller and must minimize CPU and memory usage to avoid interfering with infotainment systems."

"The second concern is robustness; GreptimeDB collects critical diagnostic metrics from the CAN bus, so any crashes could result in data loss."

"CAN" here stands for "controller area network" and is a data bus inside vehicles that replaces masses of wires that go directly from components to other components -- it allows any "electronic control unit" (ECU) connected to the bus to communicate with any other.

"Lastly, unlike servers in datacenters, vehicle-based GreptimeDB operates under various conditions -- frequent power cycles, fluctuating ADAS data rates due to changing road traffic, etc. -- and needs to adapt while remaining stable and efficient."

"ADAS" stands for "advanced driver-assistance systems".

How to build a TSDB Running inside a car

#solidstatelife #databases #timeseries

waynerad@diasp.org

PaperQA2: Superhuman scientific literature search.

"To evaluate AI systems on retrieval over the scientific literature, we first generated LitQA2, a set of 248 multiple choice questions with answers that require retrieval from scientific literature. LitQA2 questions are designed to have answers that appear in the main body of a paper, but not in the abstract, and ideally appear only once in the set of all scientific literature."

"We generated large numbers of questions about obscure intermediate findings from very recent papers, and then excluded any questions where either an existing AI system or a human annotator could answer the question using an alternative source."

These were generated by human experts.

"When answering LitQA2 questions, models can refuse to answer." "Some questions are intended to be unanswerable."

After creating LitQA2, the researchers then turned their attention to creating an AI system that could score highly on it.

"Retrieval-augmented generation provides additional context to the LLM (e.g., snippets from research papers) to ground the generated response. As scientific literature is quite large, identifying the correct snippet is a challenge. Strategies like using metadata or hierarchical indexing can improve retrieval in this setting, but finding the correct paper for a task often requires iterating and revising queries. Inspired by PaperQA, PaperQA2 is a retrieval-augmented-generation agent that treats retrieval and response generation as a multi-step agent task instead of a direct procedure. PaperQA2 decomposes retrieval-augmented generation into tools, allowing it to revise its search parameters and to generate and examine candidate answers before producing a final answer. PaperQA2 has access to a 'Paper Search' tool, where the agent model transforms the user request into a keyword search that is used to identify candidate papers. The candidate papers are parsed into machine readable text, and chunked for later usage by the agent. PaperQA2 uses the state-of-the-art document parsing algorithm, Grobid, that reliably parses sections, tables, and citations from papers. After finding candidates, PaperQA2 can use a 'Gather Evidence' tool that first ranks paper chunks with a top-k dense vector retrieval step, followed by an LLM reranking and contextual summarization step. Reranking and contextual summarization prevents irrelevant chunks from appearing in the retrieval-augmented generation context by summarizing and scoring the relevance of each chunk, which is known to be critical for retrieval-augmented generation. The top ranked contextual summaries are stored in the agent's state for later steps. PaperQA2's design differs from similar retrieval-augmented generation systems like Perplexity, Elicit, or Mao et al. which deliver retrieved chunks without substantial transformation in the context of the user query. 
While reranking and contextual summarization is more costly than retrieval without a contextual summary, it allows PaperQA2 to examine much more text per user question."

"Once the PaperQA2 state has summaries, it can call a 'Generate Answer' tool which uses the top ranked evidence summaries inside a prompt to an LLM for the final response to the asked questions or assigned task. To further improve recall, PaperQA2 adds a new 'Citation Traversal' tool that exploits the citation graph as a form of hierarchical indexing to add additional relevant sources."

Got that? Ok, that's quite a lot, so to summarize: The system consists of 5 'agents': 1. Paper search agent, 2. Gather evidence agent, 3. Citations traversal agent, 4. Generate answer agent, and 5. PaperQA2 agent which is the 'manager' agent directing all the other agents and that inputs the question and outputs the final answer.
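The manager-plus-tools control flow described above can be sketched in a few lines of Python. Every function here is a stand-in of my own invention (not PaperQA2's actual code): the point is only the shape of the loop, in which the manager can revise its search and re-gather evidence before committing to an answer.

```python
def paper_search(query):
    """Stand-in: would hit a literature search API and return text chunks."""
    return [f"chunk about {query}"]

def gather_evidence(chunks, question):
    """Stand-in: would rank and contextually summarize chunks against
    the question. Returns (summary, relevance score) pairs."""
    return [(chunk, 0.9) for chunk in chunks]

def generate_answer(summaries, question):
    """Stand-in: would prompt an LLM with the top-ranked summaries."""
    return f"answer to {question!r} based on {len(summaries)} summaries"

def manager_agent(question, max_rounds=3):
    """The 'manager' decides which tool to call next: it can keep
    searching and gathering until the evidence looks good enough,
    then calls the answer tool."""
    evidence = []
    for _ in range(max_rounds):
        chunks = paper_search(question)
        evidence += gather_evidence(chunks, question)
        if any(score > 0.8 for _, score in evidence):
            break   # enough high-relevance evidence collected
    return generate_answer(evidence, question)

print(manager_agent("What does SCM do to CD8a expression?"))
```

(The real system also has the Citation Traversal tool, and the "insufficient information" refusal path, omitted here for brevity.)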

Here's an example question: "What effect does bone marrow stromal cell-conditioned media have on the expression of the CD8a receptor in cultured OT-1 T cells?"

PaperQA2 answer: "The effect of bone marrow stromal cell conditioned media (SCM) on the expression of the CD8a receptor in cultured OT-1 T cells is reported to have no significant impact. According to the the study by Kellner et al (2024)."

The referenced paper says, on page 24: "OT-1 T-cells are primary mouse CD8+ T cells that specifically recognize the ovalbumin peptide residues 257-264 (SIINFEKL) presented on a class I major histocompatibility complex. OT-1 transgenic mice were housed and bred at the Division of Animal Resources at Emory University following all Institutional Animal Care and Use Committee protocols. OT-1 T-cells were magnetically isolated from spleens using CD8a+ negative selection according to the manufacturer's protocol (Miltenyi Biotec, # 130-104-075). Purified OT-1 T-cells were resuspended in unconditioned medium (UCM), bone marrow stromal cell conditioned media (SCM), or adipocyte conditioned medium (ACM) at 1x10^6 cells/mL in a 24 well plate and incubated at 37 degrees C, 5% CO2 for 24h prior to being seeded on tension probes. Images were taken 15 minutes after seeding OT-1 T-cells on tension probes. Fluorescence intensity values of individual cells were quantified from at least 10 images in FIJI software."

No, I didn't figure out what "SIINFEKL" stands for. I asked Google what it stands for and its "AI Overview" gave me a blatantly wrong answer (ironically?). One paper referred to it as "the well-known ovalbumin epitope SIINFEKL" -- but it's not well known enough to have a Wikipedia page or a Science Direct summary page saying what it stands for and giving a basic description of what it is. By the way, the term "epitope" means the part of a molecule that activates an immune system response, especially the part of the immune system that adapts new responses, primarily T cell and B cell receptors.

Stromal cells of various types exist throughout the body, but the question here refers specifically to bone marrow stromal cells. These are "progenitor" cells that produce bone and cartilage cells, as well as cells that function as part of the immune system, such as cells that produce chemokines, cytokines, IL-6, G-CSF, GM-CSF, CXCL12, IL7, and LIF (for those of you familiar with the immune system -- if you're not, I'm not going on another tangent to explain what those are), though from what I can tell they don't produce T-cells or B-cells. T-cells and B-cells are produced in bone marrow, but not from stromal cells.

"OT-1" refers to a strain of transgenic mice sold by The Jackson Laboratory. CD8a is a gene that is expressed in T cells.

Anyway, let's get back to talking about PaperQA2.

So how did PaperQA2 do?

"We evaluate two metrics: precision, the fraction of questions answered correctly when a response is provided, and accuracy, the fraction of correct answers over all questions."

"In answering LitQA2 questions, PaperQA2 parsed and utilized an average of 14.5 papers per question. Running PaperQA2 on LitQA2 yielded a precision of 85.2% , and an accuracy of 66.0%, with the system choosing 'insufficient information' in 21.9% of answers."

"To compare PaperQA2 performance to human performance on the same task, human annotators who either possessed a PhD in biology or a related science, or who were enrolled in a PhD program, were each provided a subset of LitQA2 questions and a performance-related financial incentive of $3-12 per question to answer as many questions correctly as possible within approximately one week, using any online tools and paper access provided by their institutions. Under these conditions, human annotators achieved 64.3% +/- 15.2% precision on LitQA2 and 63.1% +/- 16.0% accuracy. PaperQA2 thus achieved superhuman precision on this task and did not differ significantly from humans in accuracy."

For "precision": PaperQA2 85.2%, Perplexity Pro 69.7%, human 64.3%, Geminin Pro 1.5 51.7%, GPT-4o 44.6%, GPT-4-Turbo 43.6%, Elicit 40.9%, Claude Sonnet 3.5 37.7%, Cloude Opus 23.6%.

For "accuracy": PaperQA2 66.0%, human 63.1%, Perplexity Pro 52.0%, Elicit 25.9%, GPT-4o 20.2%, GTP-4-Turbo 13.7%, Gemini Pro 1.5 12.1%, Claude Sonnet 3.5 8.1%, Claude Opus 5.2%.

PaperQA2: Superhuman scientific literature search

#solidstatelife #ai #genai #llms #agenticai

waynerad@diasp.org

"Hasbro's CEO thinks D&D's adoption of AI Is inevitable."

"If you look at a typical D&D player... I play with probably 30 or 40 people regularly. There's not a single person who doesn't use AI somehow for either campaign development or character development or story ideas. That's a clear signal that we need to be embracing it. We need to do it carefully, we need to do it responsibly, we need to make sure we pay creators for their work, and we need to make sure we're clear when something is AI-generated."

Yet...

"Wizards of the Coast at large, at least so far, has been keen to emphasize that Dungeons & Dragons is a game about human creativity, made by actual people for actual people to play."

Hasbro's CEO thinks D&D's adoption of AI Is inevitable

#solidstatelife #ai #genai #llms #games #rpg

waynerad@diasp.org

OpenAI has created a new large language model that they call "o1", which has been "trained with reinforcement learning to perform complex reasoning." "o1 thinks before it answers -- it can produce a long internal chain of thought before responding to the user."

"OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).

To put that in perspective, GPT-4o was at the 11th percentile for Codeforces programming competition coding. So between GPT-4o and o1, the improvement was from the 11th percentile to the 89th.

For AIME 2024, the improvement from GPT-4o to o1 was from 13.4% to 83.3%.

For GPQA, for biology it's 63.2 to 68.4, for chemistry it's 43.0 to 65.6, for physics it's 68.6 to 94.2.

Quoting further:

"Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them."

"Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn't working. This process dramatically improves the model's ability to reason."

"Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles." "We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios."

"To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework(opens in a new window). We found that chain of thought reasoning contributed to capability improvements across our evaluations. Of particular note, we observed interesting instances of reward hacking."

"We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to 'read the mind' of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users."

Learning to Reason with LLMs | OpenAI

#solidstatelife #ai #genai #llms #chainofthought