Chatbots that allegedly have "reasoning capabilities" fail at a simple logic problem. "Complete reasoning breakdown," as the authors put it.
"The original problem formulation, of which we will present various versions in our investigation is as following: 'Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?'. The problem features a fictional female person (as hinted by the 'she' pronoun) called Alice, providing clear statements about her number of brothers and sisters, and asking a clear question to determine the number of sisters a brother of Alice has. The problem has a light quiz style and is arguably no challenge for most adult humans and probably to some extent even not a hard problem to solve via common sense reasoning if posed to children above certain age."
"We posed varying versions of this simple problem (which in following we will refer to as 'Alice In Wonderland problem', AIW problem) to various SOTA LLMs that claim strong reasoning capabilities. We selected closed ones like GPT-3.5/4/4o (openAI), Claude 3 Opus (Anthropic), Gemini (Google DeepMind), and open weight ones like Llama 2/3 (Meta), Mistral and Mixtral (Mistral AI), including very recent Dbrx by Mosaic and Command R+ by Cohere (which are stated in numerous announcements to lead the open weights models as of April 2024, according to open LLM leaderboards). We analyse the response statistics and observe strong collapse of reasoning and inability to answer the simple question as formulated above across most of the tested models, despite claimed strong reasoning capabilities. Notable exceptions are Claude 3 Opus and GPT-4 that occasionally manage to provide correct responses backed up with correct reasoning as evident in structured step by step explanations those models deliver together with solution. However, Claude 3 Opus and GPT-4 still show frequent failures to solve this simple problem across trials. Importantly, they also show strong fluctuations across even slight problem variations that should not affect problem solving. Retaining the relational logic of the problem, we also formulated a harder form (AIW+), where both Claude 3 Opus and GPT-4o collapse almost to 0 success rate."
"To further measure the sensitivity and robustness of models to slight AIW problem variations, we formulate AIW Alice Female Power Boost and AIW Extention versions, which provide further evidence for strong performance fluctuations and lack of robustness in all tested models, being a reoccurring signature of their severely impaired basic reasoning we observe in this study."
If you're wondering about the "Alice Female Power Boost", that variation "uses a fully redundant 'Alice is female' addition (the 'she' pronoun is already used in the original AIW problem to fully determine gender information and avoid any uncertainty about the person's gender, as otherwise it could be inferred from the name only)."
The "AIW Extension uses combination of both Alice and Bob as sister and brother to ask same type of question."
And AIW++? An example of that is:
"Alice has 3 sisters. Her mother has 1 sister who does not have children -- she has 7 nephews and nieces and also 2 brothers. Alice's father has a brother who has 5 nephews and nieces in total, and who has also 1 son. How many cousins does Alice's sister have?"
That one's tricky enough that I had to look up the definitions of "nephew" and "niece" and make a diagram of 3 generations on a piece of paper.
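If you'd rather not draw the diagram yourself, the counting can be done in a few lines. This reflects my reading of the family structure (in particular, that the paternal uncle's fifth nephew/niece must be the child of an otherwise unmentioned sibling of Alice's father), so treat the final number accordingly:

```python
# Maternal side: the childless aunt's 7 nephews/nieces are the children
# of her siblings -- Alice's mother and the mother's 2 brothers.
mothers_children = 1 + 3                  # Alice plus her 3 sisters
maternal_cousins = 7 - mothers_children   # the maternal uncles' children: 3

# Paternal side: the uncle's 5 nephews/nieces are the children of HIS
# siblings, 4 of whom are Alice and her sisters; the remaining 1 must
# belong to another sibling of Alice's father. The uncle's own son is
# a cousin too (his child, not his nephew).
paternal_cousins = (5 - mothers_children) + 1   # 1 + 1 = 2

# Alice's sister shares all of Alice's aunts and uncles, hence all cousins.
print(maternal_cousins + paternal_cousins)  # 5, by my count
```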