Sebastian Beyer · 16 June 2026 · AI · Software Engineering · Coding Agents · Code Quality · AI Coding

Why AI Coding Fails at Taste

Summary

Sebastian Beyer of ex-nihilo explains why AI coding produces slop in high-taste codebases. When the real test of quality is a maintainer's implicit judgment about elegance, minimalism, and the right abstraction, the verifier sits outside the agent's loop, so the agent guesses. He sorts coding tasks on two axes: how expensive verification is, and how close the task sits to the training distribution. AI does well on common, cheaply verified work like CRUD apps and worse on rare, taste-heavy systems work. The practical fix is making taste explicit through dependency checks, architecture tests, and worked examples.

In the first post, I argued that AI can code when verification is cheap. If the agent can run tests, see errors, obey executable constraints, and loop until a clear handoff gate is green, AI coding works surprisingly well.

But the same idea explains the opposite case. AI coding fails when verification is expensive, and one of the most expensive forms of verification is taste.

Not taste as in "I prefer tabs over spaces." I mean the deeper kind of engineering taste: elegance, minimalism, architectural judgment, knowing which abstraction is too much, knowing which solution is technically correct but spiritually wrong for a codebase.

This is where a lot of very good programmers look at AI-generated code and see slop. And I think they are often right. But the conclusion is not simply "AI cannot code." The better conclusion is:

AI struggles when the verifier is a human's implicit taste.

The Hidden Oracle Problem

A lot of code quality is easy to verify. Does the test pass? Does the form submit? Does the API return the expected shape? Does the database constraint match the application validation? Does the browser test complete the user flow? Does the linter allow this import?

These are nice problems for an AI agent. Not necessarily easy, but the success condition is at least visible. The agent can try something, run the check, observe the failure, and fix the mistake.
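The try, check, observe, fix cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular agent is implemented: `loop_until_green`, `check`, `fix`, and `max_fixes` are hypothetical names, and the "agent" here is just a callable that reacts to an error message.

```python
from typing import Callable, Optional


def loop_until_green(
    check: Callable[[], Optional[str]],
    fix: Callable[[str], None],
    max_fixes: int = 5,
) -> bool:
    """Run `check`; while it fails, hand the error to `fix` and try again.

    `check` returns None when the gate is green, or an error message.
    Returns True if the gate is green after at most `max_fixes` fixes.
    """
    for _ in range(max_fixes):
        error = check()
        if error is None:
            return True
        fix(error)  # the agent observes the failure and changes something
    return check() is None


# Toy demo: two visible failures that the "agent" resolves one at a time.
failures = ["NameError: total", "AssertionError: expected 3"]


def check() -> Optional[str]:
    return failures[0] if failures else None


def fix(error: str) -> None:
    failures.remove(error)


print(loop_until_green(check, fix))  # True
```

In a real setup `check` would typically wrap a test or lint command and `fix` would be a model call. The point of the sketch is only that the loop needs a check it can actually run.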

Now compare that to a different kind of judgment:

This is too much abstraction.
This is not elegant enough.
This solution works, but it is not how this project thinks.
This adds a concept that should not exist.
This is correct, but ugly.
This is five lines too many.
This feels like slop.

These judgments may be completely valid. But they are expensive to verify, because the verifier is often a specific human with specific taste and deep context. The AI cannot run make george-likes-this. That is the problem.

If the pass/fail condition is "the maintainer likes it and would have written it that way himself," then the maintainer is the oracle, and the oracle is not inside the loop. So the agent guesses. And when it guesses wrong, the output feels like slop.

Working Code Is Not Always Good Code

This distinction matters. A piece of AI-generated code can be working code and still be bad code. It can pass the tests and still introduce the wrong abstraction. It can solve the immediate task and still make the codebase worse. It can be technically correct and still violate the style, spirit, or long-term direction of the project.

This is why some great programmers are unimpressed by AI coding. They are not only asking:

Does this work?

They are asking:

Is this the right expression of the idea?

That is a much harder target, especially in codebases with a strong aesthetic. Some projects have a very particular idea of what good code looks like. Minimal dependencies. Small abstractions. No unnecessary indirection. Directness. Mechanical sympathy. A certain density. A certain rhythm.

If those rules are not written down, not demonstrated in enough examples, and not executable in checks, the AI will mostly approximate them. Sometimes that approximation is good enough. Sometimes it is not. And when it is not, the maintainer sees the gap immediately.

The Tinygrad-Type Problem

Take a project like tinygrad, or really any codebase with a strong taste around minimalism and elegance. A maintainer may reject AI-generated code not because it fails to run, but because it does not fit the project. It may add the wrong abstraction, solve the problem in a way that is too generic, or miss a simpler formulation. It may be correct in the boring sense but wrong in the aesthetic sense.

This is not irrational. In fact, this kind of judgment is often what separates a great codebase from a mediocre one. But it is also exactly the kind of judgment that is hard to outsource to an AI agent unless you make the criteria explicit. And even then, it is hard.

You can write style guides. You can add examples. You can tell the agent to prefer minimal changes, avoid new abstractions, and inspect nearby code before editing. You can add custom lint rules and architecture tests. All of that helps. But there will still be a gap between "this follows the written rules" and "this is what the best maintainer of this project would have written." That gap is taste. And taste is expensive.
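As one concrete case of a custom lint rule, a judgment like "this is five lines too many" can be partly turned into a check. The sketch below uses Python's standard `ast` module to flag functions that exceed a line budget. The function name `long_functions` and the default limit of 20 lines are assumptions made up for this example, and a line count is only a crude proxy for the taste it stands in for.

```python
import ast


def long_functions(source: str, max_lines: int = 20) -> list[str]:
    """Return one finding per function whose definition spans more than `max_lines`."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno/end_lineno cover the `def` line through the last body line.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                findings.append(f"{node.name}: {length} lines (limit {max_lines})")
    return findings


example = "def add(a, b):\n    total = a + b\n    return total\n"
print(long_functions(example, max_lines=2))  # ['add: 3 lines (limit 2)']
```

A rule like this closes only the written-down part of the gap. It can say a function is too long, but not that it is the wrong function.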

Why CRUD Apps Are Easier

This is also why AI often looks much better in boring product software than in highly aesthetic systems work. A CRUD app gives the model a lot of advantages: the patterns are common, the desired behavior is usually easy to describe, and the tests are relatively straightforward. The UI can be checked with browser automation, the database behavior can be asserted, and the API response can be compared to an expected shape.

In other words, the task is close to the center of the training distribution, and verification is relatively cheap. That is the dream setup for AI coding.

Build a form. Add a table. Create an endpoint. Add a background job. Validate input. Send an email. Render a dashboard. Add a Stripe webhook. Write a migration. Connect a model to a controller. None of that is trivial in a real system, but the shape of the work is familiar. The model has seen a lot of it, and the environment can often verify it.

Now compare that to something far from the center. A reverse-engineered USB-to-eSATA driver. A niche compiler optimization. A tiny tensor operation in a minimal ML framework. A weird browser bug. A legacy enterprise integration with undocumented behavior. A custom protocol from 2007 that only works if you accidentally violate the spec in exactly the same way as the original vendor did.

This is a different world. There is less training signal, more unknowns, and fewer examples. The success condition may not even be known at the start. The AI can still help, but the work becomes exploratory, and exploration is expensive.

Unknown Unknowns

Reverse engineering is a good example of where AI coding gets much harder. In normal feature work, you often know what success looks like before you begin: the user can upload a file, the payment is recorded, the page renders, the scheduled job runs.

But in reverse engineering, you may not even know what you need to know yet. You have to discover the constraints first. You run experiments, inspect behavior, form hypotheses, find out which assumptions are false, and uncover the shape of the problem before you can solve it.

That is not impossible for an AI agent. AI can actually be quite useful here. It can automate experiments, summarize logs, generate hypotheses, write probing scripts, compare outputs, and keep track of what has been tried. But it is no longer the same loop as:

Write the feature, run the tests, fix the failures.

It is research. And research has a higher cost of verification because the success condition itself is moving.

This is another reason people talk past each other when discussing AI coding. One person is using AI to ship a SaaS feature. Another is trying to use AI for a weird systems problem with incomplete information. They both say "coding," but they are not doing the same thing.

Two Dimensions That Matter

I currently think about AI coding tasks on two dimensions.

First: how expensive is verification? Can the agent cheaply check whether the result is correct, or does it need a human oracle?

Second: how close is the task to the center of the training distribution? Is this a common pattern with many examples, or a rare problem with little prior signal?

AI coding works best when the task is common and verification is cheap. It works worst when the task is rare and verification is expensive. That sounds obvious once stated, but most debates about AI coding ignore it. People compare their experiences as if "coding" were one thing. It is not.

A Rails CRUD feature and a minimal tensor compiler optimization are both coding, but they are not the same kind of work. The first may have cheap verification and lots of training signal. The second may have expensive verification and depend heavily on taste. Of course the AI looks different in those environments.
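The two dimensions can be written down as a tiny data model, which makes the quadrants easier to see. This is only an illustration of the framing above: the names `Verification`, `Distribution`, `Task`, and `expected_fit` are hypothetical, and placing the two example tasks in opposite corners is my reading of the paragraph, not a measurement. The post only ranks the best and worst corners, so the other two are labelled "mixed".

```python
from dataclasses import dataclass
from enum import Enum


class Verification(Enum):
    CHEAP = "cheap"          # the agent can run the check itself
    EXPENSIVE = "expensive"  # needs a human oracle


class Distribution(Enum):
    COMMON = "common"  # many examples in the training data
    RARE = "rare"      # little prior signal


@dataclass(frozen=True)
class Task:
    name: str
    verification: Verification
    distribution: Distribution


def expected_fit(task: Task) -> str:
    """Map a task to the corner of the two-axis grid it falls into."""
    cheap = task.verification is Verification.CHEAP
    common = task.distribution is Distribution.COMMON
    if cheap and common:
        return "best case"
    if not cheap and not common:
        return "worst case"
    return "mixed"


crud = Task("Rails CRUD feature", Verification.CHEAP, Distribution.COMMON)
tensor = Task("minimal tensor compiler optimization",
              Verification.EXPENSIVE, Distribution.RARE)
for task in (crud, tensor):
    print(f"{task.name}: {expected_fit(task)}")
```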

Making Taste More Explicit

So what can we do about this? One answer is to accept that some taste will remain human. I think that is true. But another answer is to make more of the taste explicit.

If you do not want new dependencies, add a dependency check. If you want minimal changes, instruct the agent to show the diff and justify every touched file. If certain abstractions are forbidden, write that down. If some architectural boundaries must not be crossed, add architecture tests. If code should follow local patterns, make the agent inspect nearby files before editing. If a style is important, include examples of accepted and rejected solutions.
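The first and fourth items, a dependency check and an architecture test, are the easiest to make executable. Here is a minimal sketch using only Python's standard `ast` module. The allowlist, the forbidden set, and the function names are all assumptions for illustration. A real project would run something like this over its files in CI, or use an existing tool built for the purpose.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level package names of all absolute imports in `source`."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            modules.add(node.module.split(".")[0])
    return modules


def unapproved_dependencies(source: str, approved: set[str]) -> set[str]:
    """Dependency check: imports that are not on the allowlist."""
    return imported_modules(source) - approved


def boundary_violations(source: str, forbidden: set[str]) -> set[str]:
    """Architecture test: imports this layer must never reach for."""
    return imported_modules(source) & forbidden


# Hypothetical policy for a "domain" layer: small allowlist, no web framework.
APPROVED = {"json", "pathlib", "myapp"}
DOMAIN_MUST_NOT_IMPORT = {"flask", "sqlalchemy"}

domain_source = "import json\nimport requests\nfrom flask import request\n"
print(sorted(unapproved_dependencies(domain_source, APPROVED)))      # ['flask', 'requests']
print(sorted(boundary_violations(domain_source, DOMAIN_MUST_NOT_IMPORT)))  # ['flask']
```

The code is only parsed, never executed, so the check can run on a proposed diff before anything is installed. That makes it cheap enough to sit inside the agent's loop.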

This will not fully encode taste. But it moves some of the hidden standard into the visible environment, and every piece of taste you can externalize reduces the cost of verification. That is the real opportunity. Not to pretend AI has perfect judgment, but to stop leaving all judgment as an implicit human vibe.

The Honest Middle Position

Here is what I think the honest position looks like. AI is often good at producing working code when the task is common and verification is cheap. AI is much worse at producing excellent code when excellence depends on hidden taste, rare context, or expensive exploration.

That does not make the pessimists wrong, and it does not make the optimists delusional. They are often operating in different environments. If your world is mostly product work with good tests, fast checks, and clear constraints, AI coding can feel like a superpower. If your world is mostly high-taste systems work where the real verifier is a maintainer's intuition, AI coding can feel like a slop machine. Both experiences are real. The mistake is turning either one into a universal law.

For me, the useful question is not "Can AI code?" It is: what kind of code, in what environment, with what verifier? That question is less dramatic. But it is much more useful.

Who This Is For

Teams shipping with coding agents who get clean, passing code that still somehow feels wrong, and want to understand why.
Maintainers of high-taste codebases who keep rejecting AI contributions and cannot fully explain the rule they are enforcing.
Engineering leads deciding which parts of their work to hand to agents and which to keep human.
Anyone caught in the "AI writes slop" vs. "AI ships products" argument who suspects both sides are describing different work.

Frequently Asked Questions

Why does AI write working code that experienced engineers still call slop?
Because "working" and "good" are verified differently. Tests, linters, and type checks verify working code cheaply. Whether the code uses the right abstraction and fits the project's style is judged by a human's implicit taste, which the agent cannot run as a check. When that human is the oracle and sits outside the loop, the agent guesses, and the guesses sometimes miss.

What is the difference between working code and good code?
Working code passes the tests and does the task. Good code also uses the right abstraction, follows the project's style and direction, and avoids making the codebase worse. AI-generated code can be fully working and still be bad: technically correct, but the wrong expression of the idea.

Why is AI better at CRUD apps than at systems work?
CRUD work is close to the center of the training distribution and is cheap to verify. The patterns are common, the behavior is easy to describe, and tests, browser automation, and API assertions can confirm it. Niche systems work, like a compiler optimization or a reverse-engineered driver, has little training signal and expensive verification, so the agent has less to lean on.

Can you make taste executable?
Not fully, but you can externalize a lot of it. Dependency checks, architecture tests, diff justifications, custom lint rules, and examples of accepted and rejected solutions move hidden standards into the visible environment. A gap will remain between "follows the written rules" and "what the best maintainer would have written," but every rule you externalize lowers the cost of verification.

What are the two dimensions for judging an AI coding task?
First, how expensive verification is: can the agent check correctness itself, or does it need a human oracle? Second, how close the task is to the center of the training distribution: a common pattern with many examples, or a rare problem with little prior signal. AI does best on common, cheaply verified tasks and worst on rare, expensively verified ones.

Why is reverse engineering especially hard for AI agents?
In normal feature work you usually know what success looks like before you start. In reverse engineering you have to discover the constraints first by running experiments and forming hypotheses. The success condition keeps moving, which makes verification expensive. AI can assist by automating experiments and tracking what has been tried, but it is research, not the write-test-fix loop.
