Sebastian Beyer · 11 June 2026
Tags: AI, Software Engineering, Coding Agents, Verification, AI Coding

AI Can Code When Verification Is Cheap

Summary

Sebastian Beyer of ex-nihilo argues that the debate over whether AI can write good code misses the point. AI produces slop when verification is expensive, slow, or depends on vague human judgment. When verification is cheap—fast tests, linters, diff checks, and a clear handoff gate like make check—agents can loop until constraints pass and ship real production software. Projects like Zedl and Nodl were built this way at ex-nihilo. The quality ceiling is set less by the model than by the verification environment around it.

Here are some thoughts that have bugged me for a while.

There is this annoying debate around whether AI can actually write good code.

Some very good software engineers say it cannot. Or at least that it can only help you in toy examples, greenfield projects, prototypes, throwaway scripts, things like that. Once the codebase becomes real, the argument goes, AI starts producing slop. It makes things look like they work, but underneath you slowly get an unmaintainable mess.

And honestly, I get where that argument comes from.

At the same time, I also see the opposite. I see individuals and small teams shipping real software with AI-generated code. Not just demos. Not just weekend projects. Real products. Real users. Real money. Some people claim they barely write code themselves anymore. Some do not even review every line anymore.

I am somewhere in that second camp.

Two of my own projects, Zedl and Nodl (GitHub), were created in that operating mode. Sometimes I find myself pushing code into production for a SaaS product with actual paying customers without having looked at every line of code, and sometimes without manually trying every path myself.

That sounds insane if you come from the classical software engineering mindset.

But it does not feel insane to me, because I am not trusting the AI in the abstract. I am trusting a process around the AI.

So how does this fit together?

How can very talented programmers look at AI coding and conclude that it produces mostly slop, while other people are shipping mostly AI-generated products and not constantly blowing things up?

I think the answer is not "AI can code" or "AI cannot code".

The better answer is:

AI can code when verification is cheap.

And AI looks bad when verification is expensive, fuzzy, hidden, or depends on constraints that were never made explicit.

That is the distinction that made the whole thing click for me.

The Cost of Verification

The hot topic in AI coding right now is the loop.

People seem to have rediscovered that you can let an AI agent work on a problem, run commands, observe the result, fix mistakes, and keep going until it reaches some goal. This is a big improvement over the old way of working with AI:

You prompt it.
It writes some code.
You manually check what it did.
You find something wrong.
You paste the error back.
It tries again.
Repeat until either it works or you lose patience.

That workflow is exhausting because the expensive part is still on you.

You are the one reviewing the code line by line. You are the one clicking around the UI. You are the one opening and closing the browser. You are the one copying stack traces. You are the one deciding whether the thing is actually done.

The AI may write the code, but you are still carrying the verification cost.

And that cost is high.

It costs your attention. It costs your judgment. It costs your time. It also breaks your flow, because every time the AI needs you to check something, the loop stops and waits for a human oracle.

I think this is the wrong way to use AI for serious coding.

If you want AI to produce quality with today's models, you need to bring the cost of verification down. Way down.

That means:

The AI should create or update automated tests.
Those tests should run after every meaningful change.
Linter rules and static checks should be enforced.
The checks should be fast enough that the agent can run them constantly.
The AI should see the error messages itself.
Interfaces that are expensive or fragile should be mocked where possible.
The agent should only hand work back after it has run the relevant checks and they pass.

This is the important shift:

Do not ask the AI to "please be careful".

Give it a system where carelessness fails fast.

A Small Example: Migrating Away From PostCSS

A concrete example.

A software engineer I know was refactoring a large codebase to migrate away from postcss.

He ran into a situation where the AI reintroduced postcss in a part of the code where it had already been removed, even though the AI had been told not to do that.

His manual review caught the mistake.

The normal conclusion would be:

See? You cannot trust AI. A human has to review everything.

But I think that is the wrong lesson.

The problem is not that the AI made a mistake. Humans also make mistakes. The problem is that the mistake was caught by expensive human attention instead of cheap executable verification.

If the rule is "do not introduce new references to postcss", then that should not live only in a prompt. And it should not live only in someone's memory during code review.

It should be a check.

For example: add a CI guard that looks at the git diff and fails if any newly added line contains postcss.

Not the whole repository, because during a migration old references may still exist. The new diff. The new code. The thing the agent is currently changing.

And the failure message should not be vague. It should tell the agent exactly what happened:

File X introduces postcss on line N. This migration forbids new postcss references. Remove it before completing the task.
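To make this less abstract, here is a minimal sketch of such a diff guard in Python. It is one possible shape, not the check from the migration described above. The base ref origin/main, the function names, and the case-insensitive match are my assumptions. It parses unified diff output from git and only looks at added lines.

```python
import re
import subprocess

FORBIDDEN = "postcss"
# Hunk headers look like "@@ -12,3 +14,4 @@"; the group is the first new-file line.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def find_violations(diff_text, forbidden=FORBIDDEN):
    """Return (path, line_number, text) for every added line mentioning `forbidden`."""
    violations = []
    path, new_line_no, in_hunk = None, 0, False
    for line in diff_text.split("\n"):
        if line.startswith("diff --git"):
            path, in_hunk = None, False
        elif line.startswith("@@"):
            match = HUNK_HEADER.match(line)
            if match:
                new_line_no, in_hunk = int(match.group(1)), True
        elif not in_hunk:
            # File headers sit between "diff --git" and the first hunk.
            if line.startswith("+++ "):
                target = line[4:].strip()
                if target == "/dev/null":
                    path = None
                else:
                    path = target[2:] if target.startswith("b/") else target
        elif line.startswith("+"):
            if forbidden in line[1:].lower():
                violations.append((path, new_line_no, line[1:].strip()))
            new_line_no += 1
        elif line.startswith(" ") or not line:
            new_line_no += 1  # context line: exists in the new file too
        # Removed lines ("-") do not advance the new-file line counter.
    return violations


def format_failure(path, line_no, forbidden=FORBIDDEN):
    """Build a failure message the agent can act on without asking a human."""
    return (
        f"{path} introduces {forbidden} on line {line_no}. "
        f"This migration forbids new {forbidden} references. "
        "Remove it before completing the task."
    )


def main(base="origin/main"):
    """Diff the working tree against `base` (an assumed ref); return an exit code."""
    diff = subprocess.run(
        ["git", "diff", "--unified=0", "--no-color", base],
        capture_output=True, text=True, check=True,
    ).stdout
    violations = find_violations(diff)
    for path, line_no, _text in violations:
        print(format_failure(path, line_no))
    return 1 if violations else 0
```

A small entry script would call sys.exit(main()) so that CI and the agent both see a non-zero exit code. One practical wrinkle: the guard file itself mentions the forbidden word, so it would need to be excluded from the diff or take the word as an argument.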

Now the agent can run the check, see the failure, fix the mistake, and run the check again.

No human needed. No code review discussion. No "please remember not to do this". No vibes.

That is cheap verification.

And this is the pattern I think people underestimate. A lot of AI failures are not signs that AI cannot code. They are signs that the verification environment is bad.

The agent was asked to satisfy a requirement that was never turned into an executable constraint.

Prompts Are Weak. Checks Are Stronger.

This is the part I keep coming back to.

When an AI does something wrong, the default reaction is usually to improve the prompt.

"Do not use PostCSS."

"Do not break existing functionality."

"Make sure this is production ready."

"Be careful."

These instructions are not useless. But they are weak.

A prompt is a wish. A check is a constraint.

If something matters, it should eventually move out of the prompt and into the environment.

If an import must not be used, add a lint rule or diff check.

If a database constraint must match the application model, add a parity check.

If a user flow must keep working, add a browser test.

If a translation key must exist in every locale, add a locale parity check.

If a migration must not drop data accidentally, add a migration safety check.
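As one illustration from that list, here is a rough sketch of a locale parity check in Python. I am assuming translations live as nested JSON files, one per locale, in a single directory. That layout and all function names are hypothetical, not a description of any specific project.

```python
import json
from pathlib import Path


def flatten_keys(tree, prefix=""):
    """Yield dotted key paths ("nav.home") for every leaf in a nested dict."""
    for key, value in tree.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten_keys(value, dotted)
        else:
            yield dotted


def missing_keys(locales):
    """Map locale name -> sorted keys present in another locale but missing here."""
    keys_by_locale = {name: set(flatten_keys(tree)) for name, tree in locales.items()}
    all_keys = set().union(*keys_by_locale.values())
    return {
        name: sorted(all_keys - keys)
        for name, keys in keys_by_locale.items()
        if all_keys - keys
    }


def load_locales(directory):
    """Read every *.json file in `directory` (assumed layout: en.json, de.json, ...)."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.json"))
    }


def report(problems):
    """Turn the result of missing_keys into actionable failure messages."""
    return [
        f"Locale '{name}' is missing translation key '{key}'. "
        "Add it before completing the task."
        for name in sorted(problems)
        for key in problems[name]
    ]
```

The check would fail the build whenever report(missing_keys(load_locales(...))) returns anything, and print each line so the agent knows which key to add where.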

Then the agent can run into the wall by itself.

That is the whole point.

The AI does not become reliable because it suddenly becomes morally committed to your architecture. It becomes more reliable because the environment makes certain mistakes cheap to detect and hard to hand off.

The Handoff Gate

The practical version of this is simple.

Give the agent one command that represents "done".

For us, that is usually something like:

make check

The exact command does not matter. What matters is that the command contains the checks you care about.

It should run tests. It should run linting. It should run static analysis. It should run migration safety checks. It should run the project-specific things that are easy to forget and painful to catch manually.
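What sits behind such a command can be quite small. The following is a sketch of a gate runner in Python, under the assumption that every check is a command that exits non-zero on failure. The function name and the shape of the check list are mine. Which tools go into the list depends entirely on the project.

```python
import subprocess
import sys


def run_checks(checks):
    """Run each (name, argv) check and return (name, output) for every failure.

    All checks run even after one fails, so the agent sees every problem
    in a single pass instead of discovering them one by one.
    """
    failures = []
    for name, argv in checks:
        try:
            result = subprocess.run(argv, capture_output=True, text=True)
        except FileNotFoundError:
            failures.append((name, f"command not found: {argv[0]}"))
            continue
        if result.returncode != 0:
            failures.append((name, (result.stdout + result.stderr).strip()))
    return failures


def gate(checks):
    """Print every failure and return an exit code: 0 is green, 1 is red."""
    failures = run_checks(checks)
    for name, output in failures:
        print(f"[FAILED] {name}\n{output}\n")
    return 1 if failures else 0
```

A project would then call sys.exit(gate(checks)) with its own list of commands for tests, linting, static analysis and migration safety. Running everything instead of stopping at the first failure is a deliberate choice here: it costs a bit more time per run but gives the agent the full picture.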

The agent's job is not to impress you with code you admire.

The agent's job is to make that command green.

If it is red, the agent loops. If it is green, the work can be handed back.
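Written down as code, that loop is short. This is only a sketch: attempt_fix stands in for whatever hands the failure output to your coding agent, and both it and the round budget are hypothetical. Real agent tools usually run this loop internally.

```python
import subprocess


def make_check_gate():
    """Run `make check`; return (passed, combined output)."""
    result = subprocess.run(["make", "check"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def loop_until_green(run_gate, attempt_fix, max_rounds=5):
    """Keep feeding gate failures to the agent until the gate passes.

    run_gate() -> (passed, output)
    attempt_fix(output) -> None   # hypothetical hook into the coding agent
    Returns the number of fix rounds used; raises if the budget runs out.
    """
    output = ""
    for rounds_used in range(max_rounds + 1):
        passed, output = run_gate()
        if passed:
            return rounds_used  # green: work can be handed back
        if rounds_used < max_rounds:
            attempt_fix(output)  # red: the agent sees the failure itself
    raise RuntimeError(f"gate still red after {max_rounds} fix rounds:\n{output}")
```

The budget matters. An agent that cannot get the gate green after a few rounds is a signal for a human to look, not a reason to loop forever.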

Of course this does not mean every green check is automatically safe in every context. The strength of the handoff gate has to match the risk level of the change.

But the principle is still powerful:

Do not make the human the first verifier.

Make the system verify as much as possible before the human ever looks at it.

The Real Question

So, can AI code?

I think the honest answer is:

Sometimes extremely well, sometimes terribly, and the difference is often not the model alone.

The difference is the operating environment.

If you give an AI agent a vague task, no tests, no fast feedback, no ability to inspect failures, no clear constraints, and a human who only checks the result at the end, you will get slop.

If you give it a clear goal, fast tests, executable constraints, good failure messages, and permission to loop until the checks pass, you can get very useful production code.

That does not prove AI is magic.

It proves that software engineering has always been partly about building systems that catch mistakes.

AI just makes the cost of bad verification more obvious.

For me, this is the main lesson so far:

The quality of AI code is bounded less by the act of generation than by the quality of verification around it.

The future of AI coding is not just better models.

It is better loops.

Better tests. Better checks. Better constraints. Better handoff gates. Better ways to turn requirements into something the agent can actually see.

Until then, people will keep arguing.

One person will say AI writes slop.

Another person will ship a working product.

And both may be right.

They are just working with very different costs of verification.

Who This Is For

Teams experimenting with coding agents who keep hitting the same wall: the AI writes something plausible, but every change still needs a human to babysit it.
Engineering leads who want to move beyond "read every line" without lowering the quality bar.
Founders shipping with small teams who need a practical operating model, not another model benchmark debate.
Anyone migrating or refactoring large codebases where regressions are easy to introduce and expensive to catch manually.

Frequently Asked Questions

Can AI write production-quality code today? Sometimes yes, sometimes no. The difference is usually not the model alone. It is whether the agent can verify its own work cheaply: fast tests, linters, diff checks, and a clear handoff gate.

What does "cheap verification" mean in practice? Verification is cheap when the agent can run the checks itself, see clear failure messages, fix the problem, and loop again without waiting for a human. Manual code review, clicking through the UI, and copy-pasting stack traces are expensive verification.

Why do talented engineers say AI only produces slop? Often because they are working in environments where verification is expensive or fuzzy. If success depends on taste, intuition, or hidden constraints that were never made executable, the agent cannot close the loop on its own.

How would you catch an AI reintroducing a banned dependency? Do not rely on the prompt alone. Add an executable check, for example a diff guard that fails if a forbidden import appears in newly added lines, with a failure message that tells the agent exactly what to fix.

What is a handoff gate? A single command, such as make check, that runs every check you care about before work is considered done. The agent loops until that command is green. The human is not the first verifier.

How does ex-nihilo use this approach? On projects like Zedl and Nodl, the operating model is built around executable constraints: tests, static analysis, migration safety checks, and project-specific invariants encoded as checks with actionable failure messages.

Does this mean humans never review code? No. It means humans should not be the first line of defense for every mistake. The system should catch predictable failures before a human ever looks at the change. Human review still matters for risk, architecture, and things checks cannot yet encode.
