Staying vigilant in the AI coding gold rush: from generation to delivery

Being developers and managers with boots on the ground, we owe it to our customers and users to be the voice of reason and stability when deploying AI-generated assets, despite what leadership is asking of us.

I am a full-stack developer with a passion for creating intuitive and user-friendly experiences. With a strong background in frontend development and a deep understanding of backend technologies, I excel in building robust and scalable applications. My expertise spans across various frameworks and languages, allowing me to adapt to different project requirements. I am dedicated to staying updated with the latest industry trends and continuously improving my skills to deliver high-quality solutions.

For those who haven't worked inside the corporate world, the continuous tango of "look at metrics, adjust; look at metrics, adjust" is a never-ending cycle that we, as mid-level managers and directors, must both endure and champion for our reports. These metrics can be used to analyze almost anything about our personal performance and that of our teams. As the organizations we work for grow larger and more complex, context and awareness (of the application and situation) usually diminish. This awareness is replaced by a focus on delivering features and meeting financial and velocity goals.

🤔
What's ironic is "context" is one of the pillars that quality AI prompts and commands are built on. The more it knows, the better the result.

You might be asking yourself, "What metrics am I talking about?" I'm referring to productivity metrics and DORA metrics, but more specifically: lines of code generated. It would seem logical to equate lines of code written with productivity, right? The more code written, the more is delivered to the customer, which makes the application better and allows more features to be sold, which means more money. Again, on the surface this makes sense, and it has been a point of emphasis in my experience. But the more I think about it, the more I wonder: just because we are writing more code with AI, are we actually more productive? I am not entirely convinced.

Let's dive into it.


A quick story

Recently I attended a series of management meetings where AI metrics and usage were a topic of discussion. Debates about which models work in which situations, and about token spend, among other things, dominated the forum. Likely a pretty standard "corporate AI" meeting. However, one point of discussion always seems to surface again and again: usage. The overwhelmingly common sentiment across these meetings is excitement over the ever-increasing lines of code being generated by our agents. The narrative paired with this analysis is:

Our teams have never been more productive. We are writing more code than we ever have and we are shipping more features than any other point in the company's history.

Sounds like a pretty successful and celebratory conclusion, eh? Initially I thought so. Code makes our applications function, functions create features, features mean more money; so more code must be good, right? But the more I got to thinking, the more I realized there are other elements that go into code and features. Things like code reviews, security audits, unit tests, and so on. Technically you could use AI to perform these tasks, but (other than the unit tests) you really shouldn't. These things take time, more time to do right, and even more time when you include multiple people in the process.

I recognize I am just one manager/developer, one lowly manager/developer. I also recognize that my opinion is just one small grain of sand in the desert that is the AI efficiency discussion. But I cannot help my skepticism of these MASSIVE gains surrounding the deployment of AI in software development practices.

More isn't always better

As a small disclaimer, I am basing the bulk of my argument on the assumption that your company is following standard development processes, like writing unit tests, performing code reviews, and such. If you aren't doing these things and are just deploying AI-generated code to production without even the slightest review, then Godspeed to you. Hope you have good insurance when your data gets ransomed.

Now let's continue.


When we leverage AI to rapidly generate code, it's hard not to be in awe of the speed and accuracy with which it does so. Some outlets are reporting developers are seeing as much as a 55% increase in productivity, with lead time for tasks reduced by as much as half. More often than not, the code that is generated is pretty accurate and clean, especially if your agent guidelines are clear and actionable. The magic of typing a prompt and getting a working output will inevitably lull you into a result-driven stupor. It certainly has for me. The ease and luxury provided by your coding agent of choice makes it very easy to simply bypass the processes and protocols you would normally follow for your own code: unit tests, user-acceptance testing, quality assurance, security analysis, and so on. If the code works, why would I change it and why would I test it?

The "mechanism of structural decay"

If you have worked with AI long enough, you have likely noticed the trend: write, create, write, append, write, refactor. The agent's immediate response to a prompt is to write code, regardless of whether you need it to or not. According to a study done by GitClear and Sonar, there has been:

an 8-fold increase in the frequency of code blocks containing five or more duplicated lines, confirming a significant decline in code reuse. Furthermore, 2024 was the first year where the number of copy/pasted lines exceeded the number of moved (refactored) lines.

What this crazy statistic ultimately points to is ever-growing code bases that are discarding the DRY (Don't Repeat Yourself) principle that was drilled into us as junior developers. It also means our code is growing larger, which reduces efficiency and long-term maintainability. This quote from Sonar's article says it beautifully: "structural debt arises because LLMs prioritize local functional correctness over global architectural coherence and long-term maintainability." So despite the numerous additions and augmentations to our code, we could be making it worse and damning our application to the pits of volatility.
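To make the "five or more duplicated lines" idea concrete, here is a minimal sketch of how such blocks could be spotted in a single file. It is not GitClear's or Sonar's actual methodology; the function name `find_duplicate_blocks`, the whitespace normalization, and the fixed five-line window are all assumptions chosen for illustration.

```python
from collections import defaultdict


def find_duplicate_blocks(source: str, window: int = 5) -> dict[tuple[str, ...], list[int]]:
    """Return every run of `window` consecutive non-blank lines that occurs more than once.

    Keys are the normalized lines of the block. Values are the 1-based
    positions (counted among non-blank lines) where each copy starts.
    """
    # Assumption: strip whitespace and drop blank lines so indentation
    # differences do not hide a copy/paste.
    lines = [line.strip() for line in source.splitlines() if line.strip()]

    seen: dict[tuple[str, ...], list[int]] = defaultdict(list)
    for start in range(len(lines) - window + 1):
        block = tuple(lines[start:start + window])
        seen[block].append(start + 1)

    # Keep only the blocks that appear at least twice.
    return {block: starts for block, starts in seen.items() if len(starts) > 1}


# Hypothetical input: the same five assignments pasted twice.
sample = "\n".join([
    "a = 1", "b = 2", "c = 3", "d = 4", "e = 5",
    "x = 0",
    "a = 1", "b = 2", "c = 3", "d = 4", "e = 5",
])
print(list(find_duplicate_blocks(sample).values()))  # [[1, 7]]
```

Real duplication tools work across files and tolerate renamed variables, so treat this as a rough way to reason about the metric rather than a replacement for one.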

Instability at the cost of insatiable productivity

According to Google's 2025 DORA Report, a 90% increase in AI-generated code also brought about a 9% increase in reported bugs and an increase of over 150% in pull/merge request size. If you take these numbers at face value, there are a few big things that should jump out at you:

Our code bases are growing at an alarming rate. I recently reviewed a merge request in GitLab that included 388 lines of net new code (not changes or deletions). Let's assume the developer of this merge request does this four more times over the course of a week, for five merge requests in total. Over a four-week month, that single developer will have introduced 7,760 net new lines of code into the project (388 lines × 5 merge requests × 4 weeks). If that much new code is being written and merged into the repository at that rate, I would have some genuine concerns over the amount of duplicated code, and code that is just unnecessary.

The number of bugs will grow as well. On the surface, a 9% increase in reported bugs does not sound like a lot, but I believe it is so much worse than that. All developers know that not all bugs are created equal. They require time to analyze and troubleshoot, and even more time to fix, regardless of whether you use AI or not. As the code base grows, the introduction of bugs is inevitable no matter how much testing or QA is done to prevent them (unless your company's name is Linear. Those guys know how to produce a clean product; props). But I digress.

Let's go back to the example we were using above. Let's assume that for every three merge requests, one bug is introduced. In addition to this, another bug is introduced for every 5,000 lines of code. This means that, keeping to the cadence of this developer's AI-powered productivity, in one month our developer would have introduced around eight bugs. Now, as I was hinting at earlier, these issues could be very simple and, depending on your CI/CD process, a resolution could be deployed in a matter of minutes. No harm, no foul, right? However, some of these issues could be debilitating outages, causing lost progress and lost revenue for your business; all from a single developer.

🤨
Please know, the numbers that I used above are not actual numbers from me or any of my team. They are simply examples.
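The back-of-the-envelope estimate above can be sketched in a few lines. This assumes five merge requests per week and a four-week month; the function name `monthly_estimate` and its parameters are hypothetical, and the defaults are the illustrative figures from the example, not real team data.

```python
def monthly_estimate(
    lines_per_mr: int = 388,      # net new lines in the merge request I reviewed
    mrs_per_week: int = 5,        # the original MR plus four more
    weeks_per_month: int = 4,     # assumption: a four-week month
    mrs_per_bug: int = 3,         # assumption: one bug per three merge requests
    lines_per_bug: int = 5_000,   # assumption: one more bug per 5,000 new lines
) -> dict[str, float]:
    """Estimate one developer's monthly output and the bugs that ride along with it."""
    merge_requests = mrs_per_week * weeks_per_month
    new_lines = merge_requests * lines_per_mr
    bugs = merge_requests / mrs_per_bug + new_lines / lines_per_bug
    return {
        "merge_requests": merge_requests,
        "new_lines": new_lines,
        "bugs": round(bugs, 1),
    }


print(monthly_estimate())
# {'merge_requests': 20, 'new_lines': 7760, 'bugs': 8.2}
```

Changing any single assumption, such as the bug rate per merge request, shifts the result quickly, which is exactly why raw line counts are a shaky basis for celebrating productivity.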

This code needs to be tested and reviewed. If you are following typical development processes, we probably agree that code should be tested by both a human (QA) and a machine (automated unit/feature testing). All of these things take time, even testing that has been automated. Depending on the size of the code base and the test coverage you have for your application, these tests take time to run. And if you are running these tests on dedicated resources, they also cost money. Code reviews by humans also cost money. Yes, it takes time for another developer to review, test, and smoke check the code. This has a cost. But what people don't think about is the productivity that is also lost because that developer is not coding. So in essence, every code review costs twice what the reviewing developer costs during that same timeframe.

The cost of doing it right

If you work in the tech and software industry, you are likely aware of the meme: "You can have it good, fast, or cheap, but it cannot be all three." Quality, at speed, comes with a cost. This also applies to testing, but even more so when it comes to AI-generated code.

A 2025 study showed that senior engineers spend an average of 4.3 minutes evaluating AI-generated suggestions, versus 1.2 minutes for code written by humans. Meanwhile, teams that rely heavily on AI are delivering substantially more pull requests. Faros AI examined data from over 10,000 developers and found a 98% rise in PR volume. Consequently, PR review time increased by 91%, even though the code generation process itself became faster. A Qodo survey found that 68% of senior engineers report quality improvements from AI, but only 26% would ship AI-generated code without review (I'm surprised it's that high).

So let's play this out. Let's assume you have a development team that makes the following hourly rates:

  • Junior/intermediate developers: about $50 per hour (or $0.83 per minute)

  • Mid-level developers: about $100 per hour (or $1.67 per minute)

  • Senior/lead developers or architects: about $180 per hour (or $3.00 per minute)

If you were to have these developers review human-written code at their hourly rate, at 1.2 minutes per review it would cost:

  • Junior/intermediate developers: about $1 per code review

  • Mid-level developers: about $2 per code review

  • Senior/lead developers or architects: about $3.60 per code review

If these same developers were to review AI-written code at their same hourly rate, at 4.3 minutes per review it would cost:

  • Junior/intermediate developers: about $3.58 per code review

  • Mid-level developers: about $7.17 per code review

  • Senior/lead developers or architects: about $12.90 per code review
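For anyone who wants to plug in their own rates, the per-review cost is just the hourly rate divided by 60 and multiplied by the minutes spent. The sketch below uses the example rates and the 1.2 and 4.3 minute review times quoted above; the names `HOURLY_RATES`, `REVIEW_MINUTES`, and `review_cost` are hypothetical, and applying the senior-engineer review times to every level is a simplifying assumption.

```python
# Example hourly rates in USD (illustrative, not real salary data).
HOURLY_RATES = {"junior": 50, "mid": 100, "senior": 180}

# Average minutes spent per review, taken from the study quoted above.
REVIEW_MINUTES = {"human": 1.2, "ai": 4.3}


def review_cost(hourly_rate: float, minutes: float) -> float:
    """Cost in dollars of one review: per-minute rate times minutes spent."""
    return round(hourly_rate / 60 * minutes, 2)


for level, rate in HOURLY_RATES.items():
    human = review_cost(rate, REVIEW_MINUTES["human"])
    ai = review_cost(rate, REVIEW_MINUTES["ai"])
    print(f"{level}: human-written ${human:.2f}, AI-written ${ai:.2f}")
# junior: human-written $1.00, AI-written $3.58
# mid: human-written $2.00, AI-written $7.17
# senior: human-written $3.60, AI-written $12.90
```

Whatever the rate, the ratio is the same: 4.3 / 1.2 means an AI-written change costs roughly 3.6 times as much to review as a human-written one.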

It is worth mentioning that these numbers do not account for the lost productivity of the code reviewer, which I mentioned above. Another item worth considering is the time added to timelines and lead times for features that can increase your company's bottom line. You will also need to keep an eye on code bloat and potential regressions, as these can obviously affect these numbers too.

When you think about it...

Keep in mind, the numbers that I have provided are very much an approximation of your team's productivity. There are a fair number of other caveats like language, code size, complexity, current technical debt, etc. But I think you get my point when you start to think of productivity and output in dollars and cents. As I also mentioned above, I have seen the benefits in my own work, and undoubtedly you have as well. But when you combine the security risks, bloat, and code duplication that AI usage can introduce with the increased code review time, the associated costs, and the potential delays and regressions, is it worth it?

As leaders, we have a direct responsibility to make decisions and to put our teams in the best possible position to succeed. This includes selecting tools and services to use. However, when we are being pushed by our bosses and their bosses to use AI to optimize our work, I believe we owe it to our reports and the companies that we represent to at least take a second and really think about the outcomes of our decisions. We are unlikely to see the long-term effects for quite some time, but months and even years from now, are our AI-driven decisions really going to make our work better? Probably, but I am highly skeptical it will be the "slam dunk" we are being told it will be.

To put a bow on it

The AI coding gold rush is real — but so are the risks of mistaking generation for delivery. As organizations scale, context and awareness often get compressed into simplistic metrics. Lines of code and raw output from AI tools can feel like progress, but they are a poor proxy for actual customer value, reliability, and long-term maintainability.

To stay vigilant, treat AI as an amplifier, not a replacement, for disciplined software delivery. Anchor decisions in outcome-oriented metrics (customer impact, lead time, MTTR, change-failure rate), maintain strong feedback loops, and preserve human oversight across design, review, and deployment. Invest in observability, automated testing, security scanning, and provenance tracking so generated code is verifiable and safe to ship. Foster a culture that rewards quality, learning, and shared context—document assumptions, keep prompts and models auditable, and run blameless postmortems when things go wrong.
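As a rough illustration of what tracking outcome-oriented metrics could look like, here is a minimal sketch that computes change-failure rate, mean lead time, and mean time to restore from a list of deployment records. The `Deployment` record shape and all function names are hypothetical; your CI/CD or incident tooling will export something different, and the definitions here are deliberately simplified.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; adapt it to whatever your tooling exports.
    commit_time: datetime                      # when the change was committed
    deploy_time: datetime                      # when it reached production
    failed: bool = False                       # did it cause an incident or rollback?
    restored_time: Optional[datetime] = None   # when service was restored, if it failed


def change_failure_rate(deploys: list[Deployment]) -> float:
    """Fraction of deployments that caused a failure in production."""
    if not deploys:
        return 0.0
    return sum(d.failed for d in deploys) / len(deploys)


def mean_lead_time(deploys: list[Deployment]) -> Optional[timedelta]:
    """Average time from commit to production, or None with no data."""
    if not deploys:
        return None
    total = sum((d.deploy_time - d.commit_time for d in deploys), timedelta())
    return total / len(deploys)


def mean_time_to_restore(deploys: list[Deployment]) -> Optional[timedelta]:
    """Average time from a failed deployment to restored service, or None with no failures."""
    outages = [
        d.restored_time - d.deploy_time
        for d in deploys
        if d.failed and d.restored_time is not None
    ]
    if not outages:
        return None
    return sum(outages, timedelta()) / len(outages)
```

None of these numbers care how many lines were generated, which is the point: they move only when delivery to the customer actually gets faster or more reliable.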

Practical steps:

  • Prioritize outcome metrics over LOC; use DORA metrics thoughtfully and complement them with product and user metrics.

  • Require tests, code review, and CI/CD for all AI-generated code before it reaches production.

  • Capture context: rich prompts, architectural notes, and intent annotations so AI output is traceable and maintainable.

  • Automate safety checks (security, license, performance) and monitor runtime behavior post-deploy.

  • Train teams on effective prompting, model limits, and how to validate AI outputs.

In short: harness AI to accelerate engineering, but measure and manage what actually matters—delivery, quality, and customer value. Vigilance, not hype, will determine who benefits from this gold rush.