ChatGPT is destroying Trefor Bazett's math exams.

"I just copy and pasted my exams from last semester -- this was a second year university level introductory linear algebra course -- into ChatGPT and actually it got an A on my exams. But AI still makes a lot of pretty basic mistakes."

"What is the smallest integer whose square is between 15 and 30?"
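The catch in a question like this is presumably the word "integer": negative numbers count too, so the answer is not 4. A quick brute-force check in Python, as a sketch (the search range of -10 to 10 is my own arbitrary choice, not something from the video):

```python
# Brute-force check: which integers have a square strictly between 15 and 30?
# The range -10..10 is an assumed search window, wide enough to cover every
# integer whose square is below 30.
candidates = [n for n in range(-10, 11) if 15 < n * n < 30]

# "Smallest" means the minimum value, so negative integers win.
smallest = min(candidates)

print(candidates, smallest)  # -> [-5, -4, 4, 5] -5
```

Since neither 15 nor 30 is a perfect square, reading "between" as inclusive or exclusive gives the same result.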

GPT-4o, Claude 3.5 Sonnet, and Google's Gemini all get nearly 100% on the GSM8K dataset (which is a fancy way of saying "Grade School Math, 8K": roughly 8,500 grade-school word problems).

GSM-Hard is a dataset with the same word problems as GSM8K but with gigantic numbers -- so the LLM has to outsource the calculation to something like Wolfram|Alpha to be able to get the correct answers.
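To illustrate what "outsourcing the calculation" can look like, here is a minimal sketch in Python. It assumes the LLM's job is only to turn the word problem into an arithmetic expression, and an exact evaluator does the rest. The `evaluate` helper and the example expression are hypothetical, made up for this post. This is not how Wolfram|Alpha or any particular tool-calling setup works.

```python
import ast
import operator

# Hypothetical helper: exact integer arithmetic on an expression string.
# Only +, -, *, // and unary minus are supported in this sketch.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}

def evaluate(expression: str) -> int:
    """Evaluate an integer arithmetic expression without using eval()."""
    def walk(node):
        # Integer literal (bool is excluded on purpose).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        # Binary operation from the allowed table.
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        # Unary minus, e.g. "-5".
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

# Made-up example: the model would emit this expression, not the final number.
total = evaluate("7234567 * 8912345 + 123")
print(total)
```

Python integers have arbitrary precision, so the size of the numbers stops mattering once the expression itself is right.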

The MATH dataset has high school competition problems. LLMs can solve these when "content knowledge" is enough, such as having formulas memorized, but can fail when the required reasoning gets more complex. LLMs get about 70% on the whole dataset.

There are additional datasets with Mathematical Olympiad problems. LLMs score poorly on these, but their scores are increasing.

ChatGPT is destroying my math exams - Dr. Trefor Bazett

#solidstatelife #ai #genai #llms #mathllms #math