I Would Rather Work in a Messy Codebase That Works

I would rather work in a messy codebase that works than a clean, readable codebase that does not.

That might sound backwards at first. Most engineers are trained to value clean code. We like neat abstractions, clear naming, small functions, and predictable structure. I like those things too. A readable codebase is easier to navigate, easier to review, and easier to reason about.

But only if it works.

A messy codebase that works gives me something solid to stand on. I can read it. I can trace it. I can run it. Eventually, I can understand it well enough to improve it. I can take one ugly section at a time and make it more readable without losing the behavior that already matters.

A clean codebase that does not work is more frustrating.

It creates a different kind of confusion. Everything looks like it should be working, but it is not. The names make sense. The files are organized. The abstractions look intentional. The code gives off the appearance of correctness, which makes the bug feel even more unreasonable. You end up staring at clean code and thinking, “This should work,” while the system continues to fail.

That is where AI-generated code gets interesting.

When people talk about AI-generated code, the default assumption is usually that it belongs in the messy bucket. We imagine sprawling functions, duplicate logic, strange decisions, and code that feels stitched together from Stack Overflow fragments.

That can happen.

But I think the more common danger is different. AI-generated code often fits better in the clean-but-does-not-work bucket.

It can look polished. It can use good names. It can follow common patterns. It can split logic into tidy modules and produce something that feels like it came from a thoughtful engineer. At a glance, the code can seem reasonable.

The problem is that readable code is not the same thing as correct code.

A system does not work because the code looks clean. It works because the behavior is verified. The main features do what they are supposed to do. Known bugs do not regress. Important workflows are protected. Edge cases that have hurt you before are covered.

This was true before AI, but AI makes the weakness harder to ignore.

Most teams have always underinvested in automated testing. Not because engineers do not care, but because testing is often treated as a secondary concern. It is the thing people agree is important while deadlines slowly push it to the side. Teams ship features, patch bugs, clean up code, and tell themselves they will improve the test suite later.

Then later never really comes.

Before AI, teams could sometimes survive that culture because code accumulated more slowly. A developer wrote the feature, carried some mental model of it, and usually had at least some memory of why the messy parts existed. That did not make the system safe, but it gave the team a fighting chance when something broke.

AI changes the speed of accumulation.

Now it is easy to generate a lot of code very quickly. It is also easy for that code to look more mature than it is. The danger is not just that AI writes bad code. Humans write bad code too. The danger is that AI can produce code that looks clean enough to lower your suspicion while still missing the actual behavior the business depends on.

That is why the real issue is not human code versus AI code.

The real issue is whether the team has a system that checks if the product still works.

I do not fully trust other developers not to break the codebase. I do not fully trust myself not to break the codebase either. I have written bugs. I have misunderstood requirements. I have changed something in one part of a system and accidentally broken another part. That is normal software development.

So why would AI be treated differently?

Whether the code comes from a junior engineer, a senior engineer, a tired version of yourself at midnight, or an AI assistant, the rule should be the same: prove that it works.

This is why I favor strong integration tests.

I care about code quality, but correctness comes first. The most important questions are simple:

Does the main feature work?

Did we break something that used to work?

Can we refactor with confidence?

A good integration test suite gives you that confidence. It lets you clean up messy code without guessing. It gives you a behavioral contract. You can change the internal implementation, improve the design, remove duplication, and simplify the structure while knowing the important workflows still pass.
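To make the idea of a behavioral contract concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the cart shape, the `SAVE10` coupon, and both function names. The point is only that one contract check can be run against a messy implementation and a refactored one, and it does not care which it is given.

```python
from decimal import Decimal

# Hypothetical coupon table, used only by the refactored version.
COUPON_RATES = {"SAVE10": Decimal("0.10")}


def checkout_total_messy(items, coupon=None):
    # The "ugly but working" version: terse names, nested conditionals.
    t = Decimal("0")
    for i in items:
        t = t + Decimal(i["price"]) * i["qty"]
    if coupon is not None:
        if coupon == "SAVE10":
            t = t - (t * Decimal("0.10"))
    return t.quantize(Decimal("0.01"))


def checkout_total_refactored(items, coupon=None):
    # Same behavior, clearer structure.
    subtotal = sum(
        (Decimal(item["price"]) * item["qty"] for item in items),
        Decimal("0"),
    )
    discount_rate = COUPON_RATES.get(coupon, Decimal("0"))
    return (subtotal * (1 - discount_rate)).quantize(Decimal("0.01"))


def check_checkout_contract(checkout_total):
    # The contract describes what checkout must do, not how it does it.
    cart = [{"price": "19.99", "qty": 2}, {"price": "5.00", "qty": 1}]
    assert checkout_total(cart) == Decimal("44.98")
    assert checkout_total(cart, coupon="SAVE10") == Decimal("40.48")
    assert checkout_total(cart, coupon="UNKNOWN") == Decimal("44.98")
    assert checkout_total([]) == Decimal("0.00")


# The same contract is applied before and after the refactor.
check_checkout_contract(checkout_total_messy)
check_checkout_contract(checkout_total_refactored)
```

A real integration test would exercise the running system through its actual entry points rather than a single function, but the shape is the same: the check stays fixed while the implementation underneath it changes.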

That is the development loop I trust.

Ship something that works. Protect the behavior. Refactor when the code starts slowing you down. Keep the tests honest.

The key phrase is “keep the tests honest.”

If AI is helping generate application code, the test system cannot become something the AI casually edits until everything turns green. That defeats the point. The tests are the guardrail. They should represent what the system must do, not what the implementation happens to do today.

For some teams, that might mean keeping integration tests in a separate repo. For others, it might mean strict code ownership, CI-only test suites, or clear rules that AI-generated changes do not modify the behavioral test suite without human review. The specific mechanism matters less than the principle.
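One of those mechanisms could look something like the sketch below: a small check, run in CI, that blocks a change if it touches the behavioral test suite without a human sign-off. The directory name `tests/integration`, the function names, and the `human_approved` flag are all assumptions made up for this example. How a real pipeline would obtain the list of changed files or the approval status depends entirely on the tooling in use.

```python
from pathlib import PurePosixPath

# Hypothetical location of the behavioral test suite.
PROTECTED_DIRS = (PurePosixPath("tests/integration"),)


def protected_changes(changed_files):
    """Return the changed files that live under a protected directory."""
    hits = []
    for name in changed_files:
        path = PurePosixPath(name)
        if any(protected in path.parents for protected in PROTECTED_DIRS):
            hits.append(name)
    return hits


def review_gate(changed_files, human_approved):
    """Allow the change unless it edits protected tests without approval."""
    hits = protected_changes(changed_files)
    allowed = not hits or human_approved
    return allowed, hits


# Application code only: allowed.
print(review_gate(["app/checkout.py"], human_approved=False))

# Touches the contract without review: blocked.
print(review_gate(
    ["app/checkout.py", "tests/integration/test_checkout.py"],
    human_approved=False,
))
```

The gate does not forbid changing the tests. It only makes sure that a change to the contract is a deliberate decision someone signed off on.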

The code is allowed to change.

The contract should not change by accident.

This is where I think a lot of the AI coding conversation gets distracted. People focus on whether AI-generated code is elegant, whether it follows style conventions, whether it looks like something a senior engineer would write. Those things matter, but they are not the foundation.

The foundation is verification.

A clean codebase without a serious test system is not as safe as it feels. It may be readable, but readability does not guarantee that the checkout flow works, the underwriting calculation is right, the background job does not regress, or the customer-facing report still generates correctly.

Clean code can lie.

Working code with strong tests gives you leverage.

The worst case, of course, is a messy AI-generated codebase that also does not work. That is the nightmare scenario. The code is hard to read, hard to trust, and not backed by a reliable test suite. At that point, you are not engineering anymore. You are spelunking through a cave system with a flashlight that may or may not have batteries.

If you are dealing with that, I am sorry. I hope things turn around soon.

But for everyone else, I think the lesson is clear.

AI does not remove the need for engineering discipline. It changes where the discipline matters most. The value is not in pretending we can read every generated line with perfect confidence. The value is in building systems that judge code by behavior.

A messy codebase that works can be improved.

A clean codebase that does not work has already broken the first promise.

In the AI era, the teams that win will not be the teams with the prettiest code. They will be the teams with the strongest verification systems, the clearest behavioral contracts, and the discipline to make every contributor, human or AI, prove that the product still works.
