Join the Community

22,958

Expert opinions

43,831

Total members

445

New members (last 30 days)

197

New opinions (last 30 days)

28,974

Total comments


Challenging the Notion That LLMs Can't Reason: A Case Study with Einstein's Puzzle

10 November 2024

Erica Andersen

Marketing

smartR AI

Introduction

A recent Apple publication argued that Large Language Models (LLMs) cannot effectively reason. While there is some merit to this claim regarding out-of-the-box performance, this article demonstrates that with proper application, LLMs can indeed solve complex reasoning problems.

The Initial Experiment: Einstein's Puzzle

We set out to test LLM reasoning capabilities using Einstein's puzzle, a logic problem with five houses, each differing in several characteristics, and 15 clues from which the solver must deduce who owns the fish. Our initial tests with leading LLMs showed mixed results:

· OpenAI's model correctly guessed the answer, but without clear reasoning

· Claude provided an incorrect answer

· When we modified the puzzle with new elements (cars, hobbies, drinks, colors, and jobs), both models failed significantly

Tree of Thoughts Approach and Its Challenges

We implemented our Tree of Thoughts approach, where the model would:

1. Make guesses about house arrangements

2. Use critics to evaluate rule violations

3. Feed this information back for the next round

However, this revealed several interesting failures in reasoning:

Logic Interpretation Issues

The critics often struggled with basic logical concepts. For example, when evaluating the rule "The Plumber lives next to the Pink house," we received this response, which treats living in the Pink house as if it were the same as living next to it:

"The Plumber lives in House 2, which is also the Pink house. Since the Plumber lives in the Pink house, it means that the Plumber lives next to the Pink house, which is House 1 (Orange)."
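The same check is straightforward to make deterministic. The following is a minimal sketch in Python, not the critic we used: the function name `lives_next_to` is hypothetical, and the house numbers are taken from the quoted response. It shows that a Plumber living in the Pink house does not satisfy "next to the Pink house".

```python
def lives_next_to(house_a: int, house_b: int) -> bool:
    """Return True only when the two house numbers differ by exactly one."""
    return abs(house_a - house_b) == 1


# Placement taken from the quoted critic response (house numbers as quoted).
plumber_house = 2
pink_house = 2
orange_house = 1

# The rule under evaluation: the Plumber lives next to the Pink house.
rule_holds = lives_next_to(plumber_house, pink_house)
print(rule_holds)  # False: the same house is not a neighbouring house

# The Plumber is adjacent to the Orange house, which is not what the rule asks.
print(lives_next_to(plumber_house, orange_house))  # True
```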

Bias Interference

The models sometimes inserted unfounded biases into their reasoning. For instance:

"The Orange house cannot be in House 1 because the Plumber lives there and the Plumber does not drive a Porsche."

The models also made assumptions about what music Porsche drivers would listen to, demonstrating how internal biases can interfere with pure logical reasoning.

A Solution Through Code Generation

While direct reasoning showed limitations, we discovered that LLMs could excel when used as code generators. We asked SCOTi to write MiniZinc code to solve the puzzle, resulting in a well-formed constraint programming solution. The key advantages of this approach were:

1. Each rule could be cleanly translated into code statements

2. The resulting code was highly readable

3. MiniZinc could solve the puzzle efficiently
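To illustrate why handing the rules to a search procedure is reliable, here is a minimal Python sketch that brute-forces the classic puzzle. It is not the generated MiniZinc model. It assumes the commonly circulated wording of the 15 clues, which may differ from the exact text we gave the models, and the name `solve` is hypothetical. Each clue becomes one comparison, and the search prunes as soon as a clue fails.

```python
from itertools import permutations

HOUSES = range(5)  # positions 0..4, left to right


def next_to(a: int, b: int) -> bool:
    """Two houses are neighbours when their positions differ by one."""
    return abs(a - b) == 1


def solve() -> list[str]:
    """Return the fish owner's nationality for every assignment meeting all 15 clues."""
    owners = []
    for red, green, white, yellow, blue in permutations(HOUSES):
        if white - green != 1:  # green is immediately left of white
            continue
        for brit, swede, dane, norwegian, german in permutations(HOUSES):
            # Brit in red; Norwegian in first house and next to blue
            if brit != red or norwegian != 0 or not next_to(norwegian, blue):
                continue
            for tea, coffee, milk, beer, water in permutations(HOUSES):
                # Dane drinks tea; green house drinks coffee; centre house drinks milk
                if dane != tea or green != coffee or milk != 2:
                    continue
                for pall_mall, dunhill, blends, bluemaster, prince in permutations(HOUSES):
                    # yellow smokes Dunhill; BlueMaster drinks beer; German smokes Prince;
                    # Blends smoker has a neighbour who drinks water
                    if (yellow != dunhill or bluemaster != beer or german != prince
                            or not next_to(blends, water)):
                        continue
                    for dogs, birds, cats, horses, fish in permutations(HOUSES):
                        # Swede keeps dogs; Pall Mall keeps birds;
                        # Blends next to cats; horses next to Dunhill
                        if (swede != dogs or pall_mall != birds
                                or not next_to(blends, cats)
                                or not next_to(horses, dunhill)):
                            continue
                        nationality = {brit: "Brit", swede: "Swede", dane: "Dane",
                                       norwegian: "Norwegian", german: "German"}
                        owners.append(nationality[fish])
    return owners


fish_owners = solve()
print(fish_owners)  # ['German']
```

A constraint solver such as MiniZinc explores the same space far more cleverly than nested loops, but the principle is the same: the rules are stated once and checked mechanically, so there is no room for the misreadings seen in the critic responses.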

Example of Clear Rule Translation

The MiniZinc code demonstrated elegant translation of puzzle rules into constraints. For instance:

% Statement 11: The man who enjoys Music lives next to the man who drives Porsche
% Note: /\ means AND in MiniZinc
constraint exists(i,j in 1..5)(abs(i-j) == 1 /\ hobbies[i] == Music /\ cars[j] == Porsche);
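For readers unfamiliar with MiniZinc, the constraint above reads almost the same in Python. This is an illustrative sketch only: the function name `music_next_to_porsche` and the sample hobby and car lists are hypothetical and are not taken from the puzzle we used. The `exists` quantifier becomes `any`, and `/\` becomes `and`.

```python
def music_next_to_porsche(hobbies: list[str], cars: list[str]) -> bool:
    """True if some house with the Music hobby is adjacent to a house with the Porsche."""
    n = len(hobbies)
    return any(
        abs(i - j) == 1 and hobbies[i] == "Music" and cars[j] == "Porsche"
        for i in range(n)
        for j in range(n)
    )


# Hypothetical assignments, one entry per house from left to right.
hobbies = ["Chess", "Music", "Golf", "Art", "Hiking"]
cars_adjacent = ["Porsche", "Ford", "Toyota", "BMW", "Honda"]
cars_same_house = ["Ford", "Porsche", "Toyota", "BMW", "Honda"]

print(music_next_to_porsche(hobbies, cars_adjacent))    # True: houses 1 and 2
print(music_next_to_porsche(hobbies, cars_same_house))  # False: same house is not "next to"
```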

If you would like to get the full MiniZinc code, please DM me.

Implications and Conclusions

This experiment reveals several important insights about LLM capabilities:

1. Direct reasoning with complex logic can be challenging for LLMs

2. Simple rule application works well, but performance degrades when multiple steps of inference are required

3. LLMs excel when used as agents to generate code for solving logical problems

4. The combination of LLM code generation and traditional constraint solving tools creates powerful solutions

The key takeaway is that while LLMs may struggle with certain types of direct reasoning, they can be incredibly effective when properly applied as components in a larger system. This represents a significant advancement in software development capabilities, demonstrating how LLMs can be transformative when used strategically rather than as standalone reasoning engines.

This study reinforces the view that LLMs are best understood as transformational software components rather than complete reasoning systems. Their impact on software development and problem-solving will continue to evolve as we better understand how to leverage their strengths while working around their limitations.

External

This content is provided by an external author without editing by Finextra. It expresses the views and opinions of the author.


Comments: (2)

A Finextra member

21 November 2024

The phrase "transformational software components" is very insightful.
We first need to understand the boundaries of LLM AI's capabilities to find the right way to use it.


Oliver King-Smith CEO at smartR AI

21 November 2024

Yes, understanding the strengths and weaknesses of AI models is important.

