What Classic Logic Puzzles Reveal About Local LLMs

Artificial Intelligence · Jun 29, 2026 · 6 min read

Large language models have become remarkably capable over the past few years. Every new release promises stronger reasoning, better coding abilities, and more human-like conversation. Judging purely by benchmark scores, it's easy to assume that today's local models are approaching the capabilities of much larger frontier models.

That assumption made me curious.

Rather than comparing benchmark numbers or reading another leaderboard, I wanted to run a few simple experiments on my own machine. I loaded Mistral 7B locally through Ollama on my Linux setup and decided to test something that required no internet access, no factual recall, and no external knowledge. The only thing the model needed was logical reasoning.
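For readers who want to repeat the setup, here is a minimal sketch of sending a puzzle prompt to a local Ollama server from Python. It is an illustration, not a record of how the experiments above were run: the address is Ollama's default local endpoint, the "mistral" model tag and the temperature of 0 are assumptions, and build_request and ask are hypothetical helper names.

```python
import json
import urllib.request

# Assumptions: Ollama's default local address and the "mistral" model tag.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "mistral"


def build_request(prompt: str, temperature: float = 0.0) -> urllib.request.Request:
    """Build a non-streaming generate request for a single prompt."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "options": {"temperature": temperature},  # assumed setting, for repeatability
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt)) as reply:
        return json.loads(reply.read())["response"]
```

Calling ask with the full puzzle text only works while the Ollama server is running and the model has already been pulled.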

Instead of asking trivia questions or generating code, I turned to classic logic puzzles.

These puzzles have existed for decades. Humans solve them using careful planning, state tracking, and elimination. Every piece of information required to solve them is already present in the prompt, which makes them a useful way to test reasoning without depending on factual recall.

"A model can generate fluent language without necessarily generating correct reasoning."

Why Logic Puzzles?

Logic puzzles force a model to maintain an internal representation of the problem while respecting multiple constraints simultaneously.

Unlike writing or summarization, there is very little room for approximation. One incorrect intermediate state often causes the entire solution to collapse.

For this experiment, I selected three classic puzzles that gradually increase in complexity.

Wolf, Goat, and Cabbage
Missionaries and Cannibals
The 12 Counterfeit Coin Puzzle

Each puzzle evaluates a different aspect of reasoning. Together, they provide an interesting way to explore where a local 7B model succeeds and where it begins to struggle.

Experiment One: Wolf, Goat, and Cabbage

The first puzzle is deceptively simple. A farmer must transport a wolf, a goat, and a cabbage across a river using a boat that can carry only himself and one additional item. Throughout the journey, the wolf cannot be left alone with the goat, and the goat cannot be left alone with the cabbage.

Mistral immediately generated what looked like a perfectly reasonable solution. The explanation was fluent, confident, and well structured. However, reading each step carefully revealed that the reasoning quickly fell apart.

The model allowed impossible moves, violated the puzzle's constraints, and even concluded that the puzzle required only five crossings instead of the correct seven.

The response sounded intelligent, but it wasn't logically consistent.

Mistral 7B Response

To solve this problem, we can use a strategy that involves moving the items in a
specific order to ensure safety and minimize the number of crossings. Here's how:

1. First crossing: The farmer takes the cabbage and goes across the river.
2. Second crossing: He comes back and picks up the goat.
3. Third crossing: The farmer returns, leaves the goat, and takes the wolf.
4. Fourth crossing: The farmer comes back and picks up the goat again.
5. Fifth crossing: Finally, the farmer goes back one last time to get the cabbage.

This solution requires a minimum of 5 crossings and ensures the safety of every move according to the given rules.

There are several problems with this response.

The very first move leaves the wolf alone with the goat.
The model assumes the goat can safely remain with the cabbage later in the solution.
It never verifies the intermediate states.
It concludes that the puzzle requires only five crossings, whereas the optimal solution requires seven.
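These claims are easy to check mechanically. The sketch below is my own illustration, not part of the original experiment: it models the puzzle as a state graph and runs a breadth-first search for the shortest safe plan. The names is_safe, neighbours, and shortest_plan are hypothetical, and a state is the set of items on the left bank plus the farmer's side.

```python
from collections import deque

ITEMS = frozenset({"wolf", "goat", "cabbage"})
FORBIDDEN = ({"wolf", "goat"}, {"goat", "cabbage"})


def is_safe(left: frozenset, farmer_on_left: bool) -> bool:
    """A state is unsafe if a forbidden pair shares the bank without the farmer."""
    unattended = (ITEMS - left) if farmer_on_left else left
    return not any(pair <= unattended for pair in FORBIDDEN)


def neighbours(state):
    """Yield (cargo, next_state) for every safe crossing from this state."""
    left, farmer_on_left = state
    here = left if farmer_on_left else ITEMS - left
    for cargo in [None, *sorted(here)]:  # cross alone or with one item
        new_left = set(left)
        if cargo is not None:
            if farmer_on_left:
                new_left.discard(cargo)
            else:
                new_left.add(cargo)
        nxt = (frozenset(new_left), not farmer_on_left)
        if is_safe(*nxt):
            yield cargo, nxt


def shortest_plan():
    """Breadth-first search, so the first plan found has the fewest crossings."""
    start, goal = (ITEMS, True), (frozenset(), False)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        for cargo, nxt in neighbours(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [cargo]))
    return None


plan = shortest_plan()
print(len(plan))  # 7

# Mistral's first move takes the cabbage, leaving wolf and goat unattended.
print(is_safe(frozenset({"wolf", "goat"}), farmer_on_left=False))  # False
```

Because the search explores states level by level, no plan with fewer than seven crossings can exist under these rules, and the only safe opening move is to take the goat.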

The interesting part wasn't that the answer was incorrect. It was how convincing the explanation initially appeared. If someone wasn't already familiar with the puzzle, the response would seem entirely reasonable.

Experiment Two: Missionaries and Cannibals

The second experiment was significantly more demanding.

Three missionaries and three cannibals must cross a river using a boat that carries at most two people. At no point can cannibals outnumber missionaries on either bank while at least one missionary is present. Solving the puzzle requires tracking how many missionaries and cannibals are on each bank, and where the boat is, after every move.

Unlike the previous puzzle, this one demands careful state management over a much longer sequence of decisions.
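To give a sense of how small the search space is for a program, here is a sketch of a breadth-first search over the puzzle's states. It is an illustration under my own encoding, not something the model was given: a state is (missionaries on the left, cannibals on the left, boat on the left), and bank_ok, valid, and min_crossings are hypothetical names.

```python
from collections import deque

TOTAL = 3  # three missionaries and three cannibals
# Possible boat loads as (missionaries, cannibals): one or two people.
LOADS = [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]


def bank_ok(m: int, c: int) -> bool:
    """A bank is fine if it has no missionaries or they are not outnumbered."""
    return m == 0 or m >= c


def valid(m_left: int, c_left: int) -> bool:
    """Counts must stay in range and both banks must satisfy the rule."""
    in_range = 0 <= m_left <= TOTAL and 0 <= c_left <= TOTAL
    return (
        in_range
        and bank_ok(m_left, c_left)
        and bank_ok(TOTAL - m_left, TOTAL - c_left)
    )


def min_crossings():
    """Breadth-first search for the fewest crossings from start to goal."""
    start, goal = (TOTAL, TOTAL, True), (0, 0, False)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (m, c, boat_left), depth = queue.popleft()
        if (m, c, boat_left) == goal:
            return depth
        sign = -1 if boat_left else 1  # leaving the left bank lowers its counts
        for dm, dc in LOADS:
            nxt = (m + sign * dm, c + sign * dc, not boat_left)
            if valid(nxt[0], nxt[1]) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None


print(min_crossings())  # 11
```

The shortest solution takes eleven crossings, so any answer has to stay consistent across at least eleven state updates.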

Initially, the response looked structured and methodical. The model carefully listed the number of missionaries and cannibals on each bank after every crossing. At first glance, everything appeared to be progressing correctly.

Then the inconsistencies began to appear.

Mistral 7B Response

Initial State

Left Bank:
Missionaries: 3
Cannibals: 3

Right Bank:
Missionaries: 0
Cannibals: 0

1. First crossing: The boat carries one missionary and one cannibal to the right bank
(Leaves 2 missionaries, 2 cannibals on the left; 1 missionary, 1 cannibal on the right;
Boat on left). This move is valid because the number of missionaries on both banks
remains equal or less than the number of cannibals.

Left Bank:
Missionaries: 2
Cannibals:  2

Right Bank:
Missionaries: 1
Cannibals: 1

2. Second crossing: The boat carries one missionary from the left bank to the right bank
(Leaves 1 missionary, 2 cannibals on the left; 2 missionaries, 1 cannibal on the right;
Boat on left). This move is valid because there are still more cannibals than
missionaries on the left bank.

Left Bank:
Missionaries: 1
Cannibals: 2

Right Bank:
Missionaries: 3
Cannibals: 1

Notice something strange.

The puzzle starts with only three missionaries.

Yet after the second crossing, the model reports:

Left Bank: 1 missionary
Right Bank: 3 missionaries

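That totals four missionaries, even though only three existed at the beginning of the puzzle. The reported left bank, with one missionary and two cannibals, also breaks the puzzle's central rule.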
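This kind of slip is easy to catch automatically. The sketch below is an illustration of my own: it takes the bank counts a model reports after a crossing and checks them against two invariants, namely that the totals are conserved and that cannibals never outnumber missionaries on a bank where a missionary is present. Report and problems are hypothetical names, and the numbers are copied from the two states quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Bank counts as stated by the model after a crossing."""
    left_m: int
    left_c: int
    right_m: int
    right_c: int


def problems(report: Report, total: int = 3) -> list[str]:
    """Return every invariant the reported state violates."""
    issues = []
    missionaries = report.left_m + report.right_m
    cannibals = report.left_c + report.right_c
    if missionaries != total:
        issues.append(f"missionaries sum to {missionaries}, expected {total}")
    if cannibals != total:
        issues.append(f"cannibals sum to {cannibals}, expected {total}")
    banks = (
        ("left", report.left_m, report.left_c),
        ("right", report.right_m, report.right_c),
    )
    for name, m, c in banks:
        if m > 0 and c > m:
            issues.append(f"cannibals outnumber missionaries on the {name} bank")
    return issues


after_first = Report(left_m=2, left_c=2, right_m=1, right_c=1)
after_second = Report(left_m=1, left_c=2, right_m=3, right_c=1)

print(problems(after_first))   # []
print(problems(after_second))
# ['missionaries sum to 4, expected 3',
#  'cannibals outnumber missionaries on the left bank']
```

A check like this does not solve the puzzle, but it flags the exact step where a generated solution stops describing a possible world.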

As the solution continued, the inconsistencies became even more obvious. The boat traveled without passengers, impossible states appeared, and the model eventually lost track of the entire puzzle state.

This wasn't simply a careless arithmetic mistake. It demonstrated a much deeper limitation. As the reasoning chain became longer, the model struggled to maintain an accurate internal representation of the world it had created.

Experiment Three: The 12 Counterfeit Coin Puzzle

For the final experiment, I picked a puzzle that is often used to demonstrate structured logical reasoning.

Among twelve identical-looking coins, exactly one is counterfeit and is either heavier or lighter than the others. Using only three weighings on a balance scale, the objective is to identify both the counterfeit coin and whether it is heavier or lighter.

Unlike the previous two puzzles, this one isn't simply about keeping track of state. It requires building a complete decision tree where every possible outcome of every weighing must eventually converge to a unique answer.
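A quick counting argument shows why three weighings is such a tight budget. The sketch below is illustrative arithmetic rather than a solution: it counts the possible answers and compares them with the number of leaves a weighing tree can have. The name leaves is hypothetical.

```python
COINS = 12
OUTCOMES_PER_WEIGHING = 3  # left pan heavier, right pan heavier, or balanced

# Each coin could be the heavy fake or the light fake.
hypotheses = COINS * 2


def leaves(weighings: int) -> int:
    """Maximum number of distinct answers a tree of this depth can separate."""
    return OUTCOMES_PER_WEIGHING ** weighings


# Smallest depth whose leaf count covers every hypothesis.
min_weighings = 0
while leaves(min_weighings) < hypotheses:
    min_weighings += 1

print(hypotheses)     # 24
print(leaves(2))      # 9
print(leaves(3))      # 27
print(min_weighings)  # 3
```

This is only a lower bound. It shows that two weighings can never be enough and that three leave just three spare leaves, so every weighing has to split the remaining possibilities almost evenly.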

When I gave the puzzle to Mistral, I immediately noticed something interesting.

Instead of guessing randomly, the model actually recognized the overall strategy. It divided the coins into groups, considered multiple possible outcomes, and attempted to recursively narrow down the search space.

At first glance, this looked far more promising than the previous experiments.

Mistral 7B Response

1) Weigh 6 coins against 6 coins.

2) If balanced:
   Divide both sets into groups of four.

3) If unbalanced:
   Divide both sets into groups of three.

...

The number of weighings required varies between 3 and 5.

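The opening looks reasonable at first glance, although weighing 6 against 6 can never balance when exactly one of the twelve coins is counterfeit, so the "if balanced" branch is unreachable.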

However, reading further reveals that the solution never actually converges.

Instead of systematically eliminating possibilities, the model repeatedly subdivides the search space until several branches require four or even five weighings. That directly violates the original constraint of solving every possible case in at most three weighings.

Even more importantly, many branches never uniquely identify both the counterfeit coin and whether it is heavier or lighter.
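One way to see why the branches cannot converge is to count how many possibilities survive the first weighing. The sketch below is my own illustration: it enumerates all 24 hypotheses (each coin heavy or light) and groups them by the outcome of an opening weighing. The names outcome and split are hypothetical, and the 4 against 4 opening is included only as the textbook comparison, not as something the model proposed.

```python
from collections import Counter

COINS = range(12)
HYPOTHESES = [(coin, kind) for coin in COINS for kind in ("heavy", "light")]


def outcome(left: set, right: set, hypothesis) -> str:
    """Which pan goes down (or 'balanced') if this hypothesis is true."""
    coin, kind = hypothesis
    if coin in left:
        return "left" if kind == "heavy" else "right"
    if coin in right:
        return "right" if kind == "heavy" else "left"
    return "balanced"


def split(left: set, right: set) -> Counter:
    """Count the hypotheses consistent with each outcome of one weighing."""
    return Counter(outcome(left, right, h) for h in HYPOTHESES)


six_v_six = split(set(range(0, 6)), set(range(6, 12)))
four_v_four = split(set(range(0, 4)), set(range(4, 8)))

print(dict(six_v_six))    # {'left': 12, 'right': 12}
print(dict(four_v_four))  # {'left': 8, 'right': 8, 'balanced': 8}

# Two remaining weighings can separate at most 3 * 3 = 9 hypotheses.
print(max(six_v_six.values()) <= 9)    # False
print(max(four_v_four.values()) <= 9)  # True
```

After 6 against 6, either outcome leaves twelve candidates, which is more than the nine that two further weighings can tell apart, so that opening cannot finish within three weighings. Passing the nine-or-fewer test is necessary but not sufficient on its own; the later weighings still have to be chosen carefully.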

Unlike the previous puzzles, this wasn't a case of losing track of the current state. Instead, the model understood the overall structure of the problem but failed to complete the logical framework required to solve every possible scenario.

In many ways, this was the most interesting failure of the three.

A Pattern Started To Emerge

Although the puzzles were completely different, the failures all looked surprisingly similar.

Whenever the model needed to maintain several interacting constraints over multiple reasoning steps, consistency gradually disappeared. The explanations remained fluent, the structure remained organized, and the confidence never changed, but the logical accuracy steadily declined.

That distinction became the biggest takeaway from the experiment.
Language fluency and logical reasoning are not the same capability.
A model can produce beautifully written explanations while simultaneously making impossible decisions.
The more intermediate reasoning steps required, the more likely the model was to lose consistency somewhere along the way.

Was My Hardware The Limitation?

Initially, I wondered whether my hardware was influencing the results.

After all, I was running a relatively small model locally rather than using one of the massive cloud-hosted frontier models.

After reviewing every experiment, I don't think the hardware was responsible for these failures.

Hardware limitations typically appear in very different ways:

Slow inference
High memory consumption
Longer generation times
Inability to load larger models

None of those issues appeared during testing.

Responses were generated quickly, consistently, and without any resource constraints. The bottleneck wasn't computational performance. It was the reasoning capability of the model itself.

Of course, a larger model, such as one with 70B parameters, would likely reason better, but that improvement would come from the model itself rather than from faster hardware.

Where Mistral 7B Actually Excels

These experiments shouldn't be interpreted as evidence that Mistral 7B is a poor language model.

Quite the opposite.

I continue using it almost every day because it performs exceptionally well for many practical tasks.

Some of the workflows where it consistently performs well include:

Summarizing long documents
Writing technical documentation
Brainstorming ideas
Explaining unfamiliar concepts
Generating code drafts
Reviewing structured data
Powering lightweight local AI automations

These tasks rely much more heavily on language understanding than on maintaining long chains of logical reasoning.

That's exactly where Mistral performs best.

Final Thoughts

Before running these experiments, I often found myself treating fluent language as a proxy for intelligence. If a model could explain an idea clearly, write confidently, and produce well-structured responses, it was easy to assume that the underlying reasoning was equally strong. These logic puzzles reminded me that the two capabilities are fundamentally different.

Across all three experiments, Mistral 7B consistently produced responses that sounded thoughtful, organized, and convincing. Yet beneath that polished language were logical inconsistencies that became increasingly apparent as the reasoning chain grew longer. The model wasn't failing because it lacked knowledge of the puzzles. It was struggling to maintain a consistent internal representation while satisfying multiple interacting constraints.

That doesn't make Mistral 7B a bad model. In fact, it remains one of my favorite open-source models for local AI workflows. I continue to use it for summarizing documents, brainstorming ideas, writing technical documentation, generating code drafts, and building lightweight automations, where its speed and language capabilities make it incredibly useful.

What these experiments changed was not my opinion of the model, but my understanding of where it performs best. For tasks that involve strict constraints, multi-step planning, or exhaustive logical reasoning, human verification is still essential. The biggest takeaway wasn't that Mistral failed; it was understanding exactly where it succeeds. In many ways, that's a far more valuable benchmark than any leaderboard score.