PaperQA2: Superhuman scientific literature search.
"To evaluate AI systems on retrieval over the scientific literature, we first generated LitQA2, a set of 248 multiple choice questions with answers that require retrieval from scientific literature. LitQA2 questions are designed to have answers that appear in the main body of a paper, but not in the abstract, and ideally appear only once in the set of all scientific literature."
"We generated large numbers of questions about obscure intermediate findings from very recent papers, and then excluded any questions where either an existing AI system or a human annotator could answer the question using an alternative source."
These were generated by human experts.
"When answering LitQA2 questions, models can refuse to answer." "Some questions are intended to be unanswerable."
After creating LitQA2, the researchers then turned their attention to creating an AI system that could score highly on it.
"Retrieval-augmented generation provides additional context to the LLM (e.g., snippets from research papers) to ground the generated response. As scientific literature is quite large, identifying the correct snippet is a challenge. Strategies like using metadata or hierarchical indexing can improve retrieval in this setting, but finding the correct paper for a task often requires iterating and revising queries. Inspired by PaperQA, PaperQA2 is a retrieval-augmented-generation agent that treats retrieval and response generation as a multi-step agent task instead of a direct procedure. PaperQA2 decomposes retrieval-augmented generation into tools, allowing it to revise its search parameters and to generate and examine candidate answers before producing a final answer. PaperQA2 has access to a 'Paper Search' tool, where the agent model transforms the user request into a keyword search that is used to identify candidate papers. The candidate papers are parsed into machine readable text, and chunked for later usage by the agent. PaperQA2 uses the state-of-the-art document parsing algorithm, Grobid, that reliably parses sections, tables, and citations from papers. After finding candidates, PaperQA2 can use a 'Gather Evidence' tool that first ranks paper chunks with a top-k dense vector retrieval step, followed by an LLM reranking and contextual summarization step. Reranking and contextual summarization prevents irrelevant chunks from appearing in the retrieval-augmented generation context by summarizing and scoring the relevance of each chunk, which is known to be critical for retrieval-augmented generation. The top ranked contextual summaries are stored in the agent's state for later steps. PaperQA2's design differs from similar retrieval-augmented generation systems like Perplexity, Elicit, or Mao et al. which deliver retrieved chunks without substantial transformation in the context of the user query. While reranking and contextual summarization is more costly than retrieval without a contextual summary, it allows PaperQA2 to examine much more text per user question."
"Once the PaperQA2 state has summaries, it can call a 'Generate Answer' tool which uses the top ranked evidence summaries inside a prompt to an LLM for the final response to the asked questions or assigned task. To further improve recall, PaperQA2 adds a new 'Citation Traversal' tool that exploits the citation graph as a form of hierarchical indexing to add additional relevant sources."
Got that? OK, that's quite a lot, so to summarize: the system consists of five 'agents': 1. a paper-search agent, 2. a gather-evidence agent, 3. a citation-traversal agent, 4. a generate-answer agent, and 5. the PaperQA2 agent, the 'manager' that directs the other four, takes the question as input, and produces the final answer.
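Here's a toy rendering of how that manager loop fits together. The tool names mirror the paper, but the control flow, the `AgentState` fields, and the `llm`/`tools` interfaces are my own hypothetical illustration, not PaperQA2's actual code:

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    chunks: list[str] = field(default_factory=list)     # parsed, chunked papers
    summaries: list[str] = field(default_factory=list)  # ranked contextual summaries
    answer: str | None = None


def paperqa2_agent(question: str, llm, tools, max_steps: int = 10) -> str:
    """`llm` picks the next tool; `tools` maps tool names to callables."""
    state = AgentState()
    for _ in range(max_steps):
        # The agent model inspects the question and its current state and
        # chooses which tool to call next (or decides it is done).
        tool_name = llm.choose_tool(question, state)  # hypothetical interface

        if tool_name == "paper_search":
            # Rewrite the question as a keyword query, fetch and chunk papers.
            state.chunks += tools["paper_search"](question)
        elif tool_name == "citation_traversal":
            # Follow citations of already-found papers to widen recall.
            state.chunks += tools["citation_traversal"](state.chunks)
        elif tool_name == "gather_evidence":
            # Dense retrieval + LLM rerank/summarize (see the sketch above).
            state.summaries = tools["gather_evidence"](question, state.chunks)
        elif tool_name == "generate_answer":
            # Draft an answer from the top summaries; the agent can examine it
            # and keep searching if it looks unsupported.
            state.answer = tools["generate_answer"](question, state.summaries)
        elif tool_name == "done":
            break

    return state.answer or "insufficient information"
```

The point of the loop is that search, evidence gathering, and answering aren't a fixed one-pass pipeline: the agent can go back and revise its search or gather more evidence after seeing a candidate answer.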
Here's an example question: "What effect does bone marrow stromal cell-conditioned media have on the expression of the CD8a receptor in cultured OT-1 T cells?"
PaperQA2 answer: "The effect of bone marrow stromal cell conditioned media (SCM) on the expression of the CD8a receptor in cultured OT-1 T cells is reported to have no significant impact. According to the study by Kellner et al (2024)."
The referenced paper says, on page 24: "OT-1 T-cells are primary mouse CD8+ T cells that specifically recognize the ovalbumin peptide residues 257-264 (SIINFEKL) presented on a class I major histocompatibility complex. OT-1 transgenic mice were housed and bred at the Division of Animal Resources at Emory University following all Institutional Animal Care and Use Committee protocols. OT-1 T-cells were magnetically isolated from spleens using CD8a+ negative selection according to the manufacturer's protocol (Miltenyi Biotec, # 130-104-075). Purified OT-1 T-cells were resuspended in unconditioned medium (UCM), bone marrow stromal cell conditioned media (SCM), or adipocyte conditioned medium (ACM) at 1x10^6 cells/mL in a 24 well plate and incubated at 37 degrees C, 5% CO2 for 24h prior to being seeded on tension probes. Images were taken 15 minutes after seeding OT-1 T-cells on tension probes. Fluorescence intensity values of individual cells were quantified from at least 10 images in FIJI software."
No, I didn't figure out what "SIINFEKL" stands for. I asked Google what it stands for and its "AI Overview" gave me a blatantly wrong answer (ironically?). One paper referred to it as "the well-known ovalbumin epitope SIINFEKL" -- but it's not well known enough to have a Wikipedia page or a Science Direct summary page saying what it stands for and giving a basic description of what it is. By the way, the term "epitope" means the part of a molecule that the immune system recognizes and responds to, especially the adaptive part of the immune system, primarily T cell and B cell receptors. Stromal cells of various types exist throughout the body, but the question here refers specifically to bone marrow stromal cells. These are "progenitor" cells that produce bone and cartilage cells, as well as cells that function as part of the immune system, such as cells that produce chemokines and cytokines, including IL-6, G-CSF, GM-CSF, CXCL12, IL-7, and LIF (for those of you familiar with the immune system -- if you're not, I'm not going on another tangent to explain what those are), though from what I can tell they don't produce T-cells or B-cells. T-cells and B-cells are produced in bone marrow, but not from stromal cells.
"OT-1" refers to a strain of transgenic mice sold by The Jackson Laboratory. CD8a is a gene that is expressed in T cells.
Anyway, let's get back to talking about PaperQA2.
So how did PaperQA2 do?
"We evaluate two metrics: precision, the fraction of questions answered correctly when a response is provided, and accuracy, the fraction of correct answers over all questions."
"In answering LitQA2 questions, PaperQA2 parsed and utilized an average of 14.5 papers per question. Running PaperQA2 on LitQA2 yielded a precision of 85.2% , and an accuracy of 66.0%, with the system choosing 'insufficient information' in 21.9% of answers."
"To compare PaperQA2 performance to human performance on the same task, human annotators who either possessed a PhD in biology or a related science, or who were enrolled in a PhD program, were each provided a subset of LitQA2 questions and a performance-related financial incentive of $3-12 per question to answer as many questions correctly as possible within approximately one week, using any online tools and paper access provided by their institutions. Under these conditions, human annotators achieved 64.3% +/- 15.2% precision on LitQA2 and 63.1% +/- 16.0% accuracy. PaperQA2 thus achieved superhuman precision on this task and did not differ significantly from humans in accuracy."
For "precision": PaperQA2 85.2%, Perplexity Pro 69.7%, human 64.3%, Geminin Pro 1.5 51.7%, GPT-4o 44.6%, GPT-4-Turbo 43.6%, Elicit 40.9%, Claude Sonnet 3.5 37.7%, Cloude Opus 23.6%.
For "accuracy": PaperQA2 66.0%, human 63.1%, Perplexity Pro 52.0%, Elicit 25.9%, GPT-4o 20.2%, GTP-4-Turbo 13.7%, Gemini Pro 1.5 12.1%, Claude Sonnet 3.5 8.1%, Claude Opus 5.2%.