2026 . 04 . 01

Strawberries, Sisters, and Carwashes: What Viral LLM Failures Actually Reveal

By Lloyd Rowat

A few simple questions broke the internet's confidence in AI. How many r's in "strawberry"? How many sisters does Alice's brother have? Should you walk or drive to the carwash? Every human gets these right instantly. For a while, the models that could pass bar exams and write working software couldn't.

These aren't random bugs. Each one fails for a different reason, and together they tell you a lot about what's actually happening under the hood.

Strawberry

"How many r's in strawberry?" Three. Models kept saying two.

This one comes down to tokenization. Before a model sees your text, a tokenizer breaks it into chunks. Not characters. Not words. Something in between. "Strawberry" might become "straw" + "berry" or "str" + "aw" + "berry". The model never sees the individual letters. They're gone before the question even gets processed.
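To see why the letters vanish, here is a toy sketch of that split. The vocabulary and greedy longest-match rule are invented for illustration; real tokenizers (BPE and friends) are more involved, but the shape is the same: the model receives token IDs, not characters.

```python
# Toy subword vocabulary -- invented for illustration, not a real tokenizer's.
VOCAB = {"straw": 1001, "berry": 1002, "str": 1003, "aw": 1004}

def toy_tokenize(word):
    """Greedy longest-match split, the rough shape of BPE-style tokenization."""
    tokens = []
    i = 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        match = max(
            (piece for piece in VOCAB if word.startswith(piece, i)),
            key=len,
            default=word[i],  # fall back to a single character
        )
        tokens.append(match)
        i += len(match)
    return tokens

pieces = toy_tokenize("strawberry")
print(pieces)                          # ['straw', 'berry']
print([VOCAB.get(p) for p in pieces])  # [1001, 1002] -- what the model actually sees
```

By the time the question reaches the model, "strawberry" is two opaque IDs. There is no "r" left to count.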

The model doesn't know it can't see them. It's seen plenty of examples of people counting letters, so it pattern-matches to an answer. Confidently. Incorrectly.

This one got solved, and the fix wasn't a reasoning breakthrough. It was infrastructure. Some models moved to byte-level or character-aware tokenization, so the individual letters actually survive to reach the model. Others kept their tokenizers but learned to work around the gap: chain-of-thought that spells out the word letter by letter, or tool use that hands counting off to code execution. The model didn't get smarter at counting. It either got better inputs or learned when to delegate.
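The delegation fix is easy to see in code. When the model hands the count off to a code-execution tool, the tool operates on actual characters, so tokenization no longer matters. A minimal sketch (the tool-calling plumbing is assumed; only the counting step is shown):

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a single character -- trivial for code,
    invisible to a model that only sees subword tokens."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
```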

Alice's Sisters

"Alice has 4 brothers and 1 sister. How many sisters does Alice's brother have?" Two. Alice plus her sister. Models said one.

The trick is that Alice is a sister to her brothers. The problem never states that explicitly. It's implied by her name and by the structure of the family. The model has to connect Alice → female → sister to her brothers → add one to the count. Each step is trivial. Chaining them together is where things fall apart.

The model sees "1 sister" in the problem and echoes it. It's not building a family tree in its head. It's predicting the next token, and the most salient number in the input is 1. That's enough to get it wrong.

This is the one that feels like it should be easy. All the information is right there. But "right there" for a human who automatically builds a mental model of the family is different from "right there" for a model that's predicting text. Having the information and correctly structuring it are two very different things.

Chain-of-thought reasoning largely solved this one. When models are prompted (or trained) to think step by step before answering, they build exactly the kind of structure they were skipping: "Alice is female. Alice has brothers. Therefore Alice is a sister to her brothers. Count her." Forcing the model to externalize its reasoning turns an implicit relationship into explicit tokens it can work with. The information was always there. The model just needed to walk through it instead of jumping to the number.
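The chain the model needs to walk is short enough to write out directly. A sketch of the explicit version, with the family structure hard-coded from the puzzle (the function name is mine, not any standard API):

```python
def sisters_of_a_brother(alice_is_female: bool, alices_sisters: int) -> int:
    """From a brother's point of view: Alice's sisters, plus Alice herself
    if she is female. This is the step the model skips when it echoes '1'."""
    return alices_sisters + (1 if alice_is_female else 0)

# Alice has 4 brothers and 1 sister; Alice is female.
print(sisters_of_a_brother(alice_is_female=True, alices_sisters=1))  # 2
```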

The Carwash

"The carwash is a 15-minute drive away. Should you drive or walk?" You drive. You need the car at the carwash. That's the whole point.

Models would recommend walking. It's not far. Walking is healthier. Better for the environment. All the standard advice for "should I walk or drive a short distance?", which is the question the model thinks it's answering.

It's seen thousands of walk-vs-drive questions in training. Short distance almost always favors walking. That's a very strong pattern. The model latches onto "15 minutes," matches it against everything it's learned about short trips, and gives the popular answer. What it misses is the one constraint that makes this different from every other walk-or-drive question: the car needs to be there.

Any human gets this because we have a physical model of what a carwash does. The model doesn't. It has statistical associations. And those associations say walk.
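What the missing check looks like can be sketched as a rule. The distance heuristic alone reproduces the model's answer; one extra constraint flips it. The function and parameter names are invented for illustration, not a claim about how any model works internally:

```python
def choose_mode(trip_minutes: int, destination_needs_car: bool) -> str:
    """Pick walk vs. drive. The distance heuristic matches the statistical
    prior the model learned; the constraint check is the step it skips."""
    if destination_needs_car:
        return "drive"  # the car must be there -- overrides the heuristic
    return "walk" if trip_minutes <= 20 else "drive"

print(choose_mode(15, destination_needs_car=False))  # walk -- the usual short-trip answer
print(choose_mode(15, destination_needs_car=True))   # drive -- the carwash case
```

The hard part, of course, is that no one hands the model a `destination_needs_car` flag. Deriving that constraint from "carwash" is exactly the world modeling that's missing.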

This is the hardest class of failure to fix, and it's still not reliably solved. Tokenization problems got better tokenizers. Relationship tracking got chain-of-thought. But overriding a strong statistical prior with situational common sense? That requires something closer to actual world modeling.

The potential paths forward are interesting but unproven. Retrieval-augmented generation could inject domain-specific constraints. System prompts can prime models to look for implicit physical requirements. Multimodal training, models that have seen videos of carwashes and understand what physically happens there, might build the intuition that text alone can't provide. Structured reasoning frameworks that force models to identify unstated constraints could catch the gap. But none of these are a clean fix yet. The carwash problem is a window into the hardest open question in AI: how do you get from statistical association to genuine understanding of how the world works?

The Pattern

Each of these breaks at a different layer:

Strawberry breaks at the input layer. Tokenization destroys the characters before the model ever sees them.

Alice breaks at the reasoning layer. The model has every fact it needs but never structures them into relationships.

The carwash breaks at the world-model layer. A strong statistical prior overrides the one constraint that actually matters.

The common thread is simple: LLMs are pattern matchers. Incredibly sophisticated ones. At scale, pattern matching looks and feels like reasoning most of the time. These problems are the cases where it doesn't.

Why It Matters

These went viral because they're funny. They're also useful.

If your task needs character-level precision (counting, sorting, exact string operations), don't trust the model to do it raw. Give it tools. Let it write code. The best results come from models that know when to stop guessing and start computing.
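Concretely, "let it write code" means routing character-level subtasks to a few lines of ordinary string handling instead of the model's token stream. The operations named above, done the reliable way (illustrative examples, not a prescribed toolkit):

```python
text = "strawberry"

# Counting: exact, character-level.
print(text.count("r"))        # 3

# Sorting: deterministic, no token boundaries involved.
print("".join(sorted(text)))  # aberrrstwy

# Exact string operations: reverse, substring search.
print(text[::-1])             # yrrebwarts
print(text.find("berry"))     # 5
```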

If your task requires tracking entities across relationships, spell it out. Don't assume the model built the same mental model you did. It's predicting text, not building a knowledge graph.

And if the correct answer is the uncommon one, the edge case, the exception to the rule, pay close attention. That's where the carwash problem shows up in real work. Business logic with unusual constraints. Specs with non-obvious requirements. Any situation where "usually" doesn't apply.

The questions are trivial. The lessons aren't. The better you understand why models get these wrong, the better you get at catching the non-trivial version of the same mistake in your own work.