Is Vibe Coding Bad for Software Quality?
In February 2025, Andrej Karpathy coined "vibe coding" to describe something many engineers were already doing: you tell the model what you want, accept what it gives you, and keep moving. The term spread fast. So did the questions that followed it, most of them circling the same concern: does this approach actually produce software that holds up in production?
The honest answer requires data rather than opinion. What follows draws on real quality metrics from an AI-assisted industrial B2B SaaS delivery our team completed. The numbers are more instructive than the usual debate suggests.
What vibe coding actually is (and what it is not)
Vibe coding sits on a spectrum. At one end, a developer describes a feature in plain language, accepts whatever the model generates, and never reads the output closely. At the other end, an experienced engineer uses AI-assisted development as a deliberate accelerant: prompting carefully, reviewing outputs critically, running tests before committing, and catching the cases where the model built something plausible but incorrect.
Most professional AI-assisted development sits somewhere in the middle.
The term has become a shorthand for AI-assisted development in general, which makes quality conversations harder than they need to be. Vibe coding at its best is a force multiplier. Without measurement and governance, it becomes a debt accumulator. Not because AI writes poor code, but because the feedback loop that catches poor code gets skipped.
The productive question is not whether vibe coding is inherently bad for software quality. Whether a given team has the measurement and governance discipline to manage the pattern that AI-assisted code reliably produces: that is what actually determines the outcome.
What the data from a real AI-assisted delivery actually shows

In a recent industrial B2B SaaS delivery our team completed, we tracked every issue logged on the project board from the first sprint through go-live. At the point of final QA, the board held 229 total issues: 101 bugs, 122 tasks, and 6 epics.
The figures are worth sitting with:
- Bug rate.
44.1% of all issues raised were bugs. Nearly one in every two items on the project tracker was a defect.
- Defect density
The project averaged 0.86 bugs per development task. The backend engineer carried 1.18 bugs per task; the frontend engineer carried 0.68.
- Severity
19 bugs were Critical or Highest severity. All 19 reached a "Ready For Production" status before go-live.
- Traceability
Only 32 of the 101 bugs had an explicit parent task link in the tracking system. The remaining 69 were raised without a reference to the feature that generated them.
- Data-integrity catch
A bug that silently stripped leading zeros from numeric fields in CSV exports was identified during active QA testing. The bug would not have been visible to users until they ran a batch import and found their product data corrupted.
None of this was a surprise to the delivery team. It was a managed outcome.
One contrast worth noting: the DevOps engineer on the same project carried zero bugs across ten tasks. Infrastructure-as-code has fewer of the ambiguous integration surfaces where AI-generated application code tends to fail.
Where vibe-coded projects accumulate bugs (and why)
If you had to predict which module in an AI-assisted build would carry the most defects, what would you say?
In the delivery above, five of the seven most bug-dense task areas were in authentication and user management: login flows, registration, password policy enforcement, admin user CRUD operations, and approval workflows. The two remaining high-defect areas were data import/export and catalogue redesign features added late in the build.
This clustering is not coincidental. Large language models are trained on large volumes of authentication code, but most of that code is tutorial-grade. It covers the happy path, handles the common error cases, and stops there. Real production auth involves token expiry timing, concurrent session handling, permission inheritance, and state transitions that generic training data does not represent well. The model turns out to be something plausible. A basic review passes it.
The import/export bugs follow a slightly different logic. Features added near the end of a build are always the active QA frontier, regardless of whether AI is involved. The leading-zero strip described above is a classic format-handling edge case: technically wrong, not immediately visible, and destructive under specific production conditions.
The pattern across both areas is the same. AI-generated code handles core logic adequately. Edge cases, especially in high-complexity areas like authentication state, data transformation, and cross-component integration, are where it consistently falls short.
The four risks that turn vibe coding into a liability

Understanding where bugs cluster is useful. Understanding which structural risks compound them is more useful.
The complexity cliff
Bug intake in the delivery above was steady across the first four weeks, ranging from 4 to 12 bugs per week. Then 38 bugs arrived in a single week during the first full-system QA pass. The team had been moving fast. Quality debt had been accumulating invisibly. The cliff arrives when integrated testing begins, and the accumulated debt surfaces at once.
One engineer was assigned 64.4% of all bugs on the project, including every Critical and Highest severity item. This is a natural outcome of AI-assisted workflows: the developer who drove the generation sessions retains the mental context for fixing the output. Distributed code review culture does not develop naturally in vibe-coded builds, because no one reviewed the code line by line on the way in.
The traceability gap
When code is generated in large chunks rather than written task by task, bugs rarely trace back to a specific parent feature in the tracker. 68.3% of bugs in this project had no explicit task link. The audit trail broke down not from negligence, but because the workflow did not naturally produce one.
The data-integrity tail
The last bugs to surface in an AI-assisted build are often the quietest and the most damaging. Format errors, precision handling, edge-case corruption. They do not throw visible errors unless someone goes looking in the right places.
What good vibe-coding governance looks like in practice
None of these risks is exclusive to AI-assisted development. Teams without strong governance run into all of them on traditional builds, too. What vibe coding does is compress the timeline: the happy path ships fast, and the debt arrives sooner. The governance response needs to match that pace.
Five practices that keep an AI-assisted build on track:
- Weekly defect density tracking from sprint one
If you wait for QA to surface the quality picture, you will see it at the worst possible moment. A weekly bug intake figure costs nothing to track and shows the complexity cliff before it becomes a crisis.
- Explicit sign-off records for all Critical and Highest bugs
"Ready For Production" is a status whose meaning depends on where the project stands — for a pre-production build, it signals readiness; for a live system, it implies verified regression coverage. Each critical fix should carry a named reviewer, a regression test result, and a timestamp.
- Distributed ownership reviews for AI-generated modules
Before a major module is merged, a second engineer reads the output critically. Not to slow delivery, but to ensure more than one person understands what was built and can support it later.
- Production environment validation as a hard gate
The production environment must be validated before bug resolution counts as complete. Bugs cleared in a staging environment that diverges from production are not truly cleared.
- Post-production monitoring is defined before go-live
Error logging thresholds, alerting rules, and a first-week triage cadence should be agreed upon before deployment, not assembled reactively in the days after.
These are not exotic practices. They are what an AI-assisted delivery looks like when the team treats it as a real engineering process rather than a shortcut to a working prototype.
What leaders should remember?
Vibe coding produces a predictable quality profile. That is actually the good news. High bug rates in AI-assisted builds are not random or chaotic. They cluster in specific areas, arrive at specific moments, and respond to specific governance practices. Unpredictable quality is hard to manage. Predictable quality is not.
The figure that tends to unsettle people is the bug rate. In our delivery, 44.1% of all tracked issues were bugs. The number is not the problem. Not having the number is the problem. Teams that do not measure defect density during an AI-assisted build are not producing cleaner software. They are just finding out later.
Rather than distributing QA effort evenly across all features, use risk-based testing to focus on areas where defects would have the greatest impact or are most likely to occur. Consider factors such as business criticality, security requirements, technical complexity, and historical defect trends when defining test coverage.
The four structural risks (the complexity cliff, concentration risk, the traceability gap, and data-integrity edge cases) are all avoidable. Not with process overhead, but with a small number of habits applied consistently from sprint one. The governance burden is light. The cost of skipping it is not.
Frequently asked questions
Is vibe coding bad for production software?
Not inherently. The key variable is governance. AI-assisted development produces a recognisable quality profile, including higher bug rates and edge-case clustering in complex feature areas. Teams that measure defect density, track bug intake weekly, and apply structured QA can deliver production-ready software. Teams that treat vibe coding as a substitute for engineering discipline tend to find out the hard way what they skipped.
What bug rate should I expect from an AI-assisted build?
Based on our delivery data, a bug rate above 40% of total issues is achievable in a professionally managed AI-assisted project. The figure sounds high, but context matters. Most of those bugs were identified and resolved before go-live. A more useful measure is defect density per task (we saw 0.86 on average) combined with the severity distribution. A high bug rate with a managed severity profile is a process outcome. An unknown bug rate is the actual risk.
How do you run QA on a vibe-coded project?
The same way you run QA on any project, with sharper attention on two areas: high-complexity areas (authentication and user management being a common example, though the specific hotspots vary by project), where AI-generated code consistently yields edge-case failures, and data-handling features, where format and precision bugs surface late. Weekly defect density tracking, explicit sign-off records for critical bugs, and a validated production environment are the minimum viable governance framework.
What are the biggest risks of vibe coding for a B2B product?
The four that matter most are the complexity cliff (a spike in bug intake when integrated QA begins), concentration risk (one engineer holding all the context and all the critical bugs), the traceability gap (bugs not linked to parent features in the tracker), and data-integrity edge cases that only become visible under specific production conditions. Read our related article on the bus factor of vibe-coded projects for a deeper look at concentration risk.
Can vibe-coded software be delivered to a professional standard?
Yes. The evidence from our delivery is that professionally governed AI-assisted development produces software with a higher bug intake than traditional development, but with comparable resolution rates and severity profiles when QA is applied with rigour. The development model is different. The quality bar does not have to be.
Share and subscribe to our blog
How can we help you ?


