The Invisible Cliff: AI Development and Architectural Debt
AI coding tools are accelerating how fast we accumulate architectural debt while quieting the signals that used to warn us about it.
If you’ve been in software engineering long enough, you’ve probably lived some version of the following: a system ships features reliably for months, then a change that should take a day takes a week. Maybe a colleague mentions, “I don’t understand why this is so hard.” After months of relatively quiet on-call shifts, pagers start going off and the customer support team gets hammered with user-reported bugs. The next thing you know, you’re knee-deep in a large system rewrite that’s months over budget, and the sales team is starting to panic.
Engineering teams have been hitting these cliffs for decades, and the healthy ones have figured out how to proactively identify and mitigate them. What’s concerning is that AI coding tools have the potential to accelerate how fast we approach these cliffs while suppressing many of the signals we rely on to see them coming.
Complexity theory
The concern rests on the following premises.
Software systems are complex systems. Complex systems exhibit emergent behavior, nonlinear responses, and outcomes only understandable in retrospect. The same change can produce different results depending on context. Past experience becomes unreliable as the system grows. Unless you’re doing explicit work to reduce it, complexity increases continuously. Quality degrades over time not from negligence, but as an inherent property of systems adapting to changing environments.
Software systems at scale conform to the definition of a complex system quite well, and this has been a useful paradigm for understanding evolution and maintainability of these systems.
Complex systems fail suddenly, not gradually. Charles Perrow’s Normal Accidents (1984) showed that in tightly coupled complex systems, major failures aren’t anomalies but inevitabilities. They start small and cascade rapidly. Richard Cook’s How Complex Systems Fail (1998) sharpened the point: complex systems are continuously degraded by latent faults masked by overlapping defenses. Catastrophe happens not when something new breaks but when several pre-existing small faults align in a way the defenses can’t absorb. Sidney Dekker’s Drift into Failure (2011) found that systems don’t lurch toward failure, they drift through a long series of locally rational decisions until they cross a boundary that was difficult to identify ahead of time. Marten Scheffer’s Critical Transitions in Nature and Society (2009) describes how systems absorb stress smoothly across a wide operating range and then collapse abruptly once a threshold is crossed, often irreversibly.
Software researchers have observed the same shape in codebases. The “Crisis Model” describes a recurring pattern where architectural debt accumulates silently until adding new value becomes so cumbersome that a large, expensive refactor is forced under duress. Bug and incident rates stay flat for long stretches before spiking seemingly overnight as different parts of the system fail in unison for reasons that look unrelated on the surface.
The difference between organizations that are periodically shocked by these complex-system collapses and those that aren’t is not that the latter simply produce less debt. It’s that they detect problems early enough to intervene, well before the precipice of dramatic failure.
AI tooling suppresses traditional warning signs. AI-driven development didn’t create these problems, but it can mute the signals teams use to catch them early.
Early detection has historically come from a handful of overlapping feedback signals, most of them informal. For example:
Engineers feel friction when trying to make a change.
Defect density rises in structurally degraded areas.
The cost of each additional feature starts climbing.
What most of these signals have in common is that they derive from the human experience of working with the system. The people writing the code develop an intuitive feel they use to advocate for regular refactoring and upkeep efforts.
AI doesn’t just have the ability to accelerate the rate at which structural debt accumulates. By reducing the level of hands-on human interaction with the code, it unintentionally suppresses our intuitive warning signals. Accelerated architectural drift combined with quieter warnings is a dangerous combination.
Where the signals are going quiet
We’re losing proprioception. Proprioception is the intuitive sense that tells you where your body is in space without looking. Experienced engineers develop something analogous for codebases they spend time in: a felt sense of where the structure resists, which changes will create problems, and when something is off in a way they can’t yet articulate. Luciano Nooijen, a lead developer at Companion Group, uses the German Fingerspitzengefühl (intuitive situational awareness) to describe what is eroding.
Good test coverage papers over structural issues. Defect rate used to be a fairly reliable lagging indicator of architectural problems. When a team starts getting peppered with bug reports it often triggers a deeper investigation which uncovers more fundamental issues. AI suppresses this signal by making it dramatically easier to generate comprehensive regression suites. This is valuable, but it’s also a way to accidentally hide architectural issues for longer. The system can appear stable from the outside while becoming progressively more complex internally. This is essentially putting a tighter lid on the pot so you don’t notice it boiling until it overflows.
Ox Security’s analysis of 300 open-source projects found AI-generated code to be highly functional but systematically lacking in architectural judgment.[^1] Sonar’s 2026 State of Code survey of 1,149 professional developers found that 88% have experienced at least one negative impact of AI on technical debt, with 53% specifically attributing it to AI generating code that “looked correct but was unreliable.”[^2] Functional correctness and structural health are becoming less reliably correlated.
Refactoring cycles are slowing. The clearest empirical signal of these issues shows up in how teams are no longer maintaining their codebases as coherent structures. GitClear’s analysis of 211 million lines of code between 2020 and 2024 found that refactoring activity collapsed from 24% of all changed lines in 2021 to under 10% in 2024. Copy-pasted code rose from 8.3% to 12.3%. 2024 was the first year where code duplication exceeded refactoring activity. Code churn (code committed and then rewritten within two weeks) nearly doubled, from 3.1% to 5.7%.[^3] These are the traces of refactoring-as-continuous-practice giving way to accumulation.
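To make the churn definition concrete, here is a minimal sketch of how a team might compute a comparable ratio on its own history. It is not GitClear’s methodology, which isn’t detailed here. The record shape (a commit date plus an optional rewrite date per line), the `churn_rate` name, and the reading of “within two weeks” as 14 days inclusive are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: "within two weeks" is read as 14 days, inclusive.
CHURN_WINDOW = timedelta(days=14)


def churn_rate(lines):
    """Fraction of lines rewritten within CHURN_WINDOW of being committed.

    `lines` is an iterable of (committed_on, rewritten_on) pairs, one per
    line of code, where rewritten_on is None if the line was never rewritten.
    This record shape is hypothetical; a real pipeline would have to derive
    it from version-control history.
    """
    lines = list(lines)
    if not lines:
        return 0.0
    churned = sum(
        1
        for committed_on, rewritten_on in lines
        if rewritten_on is not None
        and rewritten_on - committed_on <= CHURN_WINDOW
    )
    return churned / len(lines)


if __name__ == "__main__":
    history = [
        (date(2024, 1, 1), date(2024, 1, 10)),  # rewritten after 9 days
        (date(2024, 1, 1), date(2024, 3, 1)),   # rewritten much later
        (date(2024, 1, 1), None),               # never rewritten
    ]
    print(f"churn rate: {churn_rate(history):.2f}")
```

The absolute number matters less than its trend: a ratio that climbs quarter over quarter is the kind of signal that used to arrive as felt friction.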
The counterarguments
The most common pushback is something like: “The problem isn’t AI. Organizations were already bad at monitoring architectural health.” Right, but the argument is that AI exacerbates it. The 2025 DORA Report puts it cleanly:
AI doesn’t independently improve or degrade software delivery. It amplifies whatever conditions already exist.[^4]
The second pushback is that today’s limitations are a transient artifact of immature tooling. Models will get better at architecture, and the problem will dissolve on its own.
Even granting that, the likely bottleneck is context, not raw model capability. No near-future model is going to fit a real production codebase into a context window and reason about its global architecture from first principles. You still need intentional information compression: architectural summaries, dependency graphs, code maps, structured ADR indexes, project memory. Tooling is already moving this direction (codebase indexing in Cursor, project memory in Claude Code, Sourcegraph code intelligence).
Either humans maintain these compressed artifacts or they get auto-extracted from the codebase. An auto-extracted representation without human review that feeds back into the model to guide future generation creates a closed loop: the model learns “this is how this codebase is structured” and reinforces the existing direction, drift included. Judgment about architectural intent still requires a human regardless of model capability. And the deeper concern still stands: a more capable model doesn’t restore the suppressed feedback loops, it deepens their suppression by making the absorbed friction even less visible.
The third pushback claims “architectural debt” is a concept inherited from a world where humans had to read, reason about, and maintain code at the level of internal structure, and that world is ending. Regression test coverage that used to be uneconomic is now cheap, so external behavior can be verified to a depth that wasn’t previously reachable. AI can navigate code that humans would find unmaintainable. Treat the system as a black box, verify behavior comprehensively, let AI tend the internals, and stop worrying about “purity.”
I’m skeptical of this argument for three reasons.
First, regression tests verify behavior you thought to test. The whole point about complex systems is that they can exhibit emergent properties that interact in ways nobody anticipated. The more complex the internals, the harder it is to predict these emergent scenarios.
Second, the black-box view doesn’t eliminate architectural judgment, it relocates it without specifying where the judgment comes from. AI can spot patterns and known anti-patterns, but it cannot a priori tell you that the current structure is wrong relative to where the product is going, because product direction is intent. The architecture still lives in the model’s learned patterns, in the curated context, or in the implicit conventions of the codebase. Mark Heath has made the related point that even granting the black-box framing, AI agents themselves operate less effectively against degraded structure: their reasoning depends on the same structural legibility human reasoning does, just compressed differently.
Third, claims of this nature often have a survivorship problem. The messy systems people point to that ran for decades are the ones that didn’t collapse, didn’t get rewritten under duress, didn’t kill their companies. Some early studies find no statistically significant maintainability difference between AI-assisted and unassisted code, but those measurements look at code as authored, not at codebases as they evolve under sustained mixed-authorship pressure over years.[^5] “Nothing has gone wrong yet” is not evidence the strategy is working, it is the common precondition of drift-into-failure stories in the complex-systems literature.
A version of this third objection worth treating separately: YAGNI has been good advice for thirty years and remains good advice. But YAGNI presupposes a working feedback loop. The principle works because when you actually do need the structural change, the friction of not having it shows up clearly enough that you go make it. You feel the missing abstraction the second time you copy something. You feel the wrong module boundary the third time a change touches both sides of it. YAGNI is a bet on responsive refactoring, not a bet on ignoring structure. When the friction of a missing abstraction is being absorbed by an AI that’s quietly generating the seventeenth copy or routing around the wrong boundary, the “you’ll know when you need it” mechanism stops working.
Replacing the signals we’ve lost
The right response to this risk isn’t to abandon or even necessarily slow AI adoption. It’s to double down on architectural evolution as a deliberate practice rather than an emergent byproduct. Below are a few strategies teams can try that might help mitigate this risk.
Treat architectural evolution as first-class work
Structural health competes for attention against feature velocity in every engineering organization. Without intentional prioritization and explicit organizational standing, it loses. The suppression of our standard early warning signals requires engineering leaders to be even more adamant that there is sufficient investment in system health work.
Frame it as a capability, not a debt. Technical debt implies a liability to manage while structural health implies a capability to maintain. Liabilities get deferred while capabilities get invested in.
Clarify ownership. Architectural evolution needs an accountable person or group who can justify spending their capacity on this problem. This could be a traditional architect, a staff engineer, an architecture guild, etc. Expecting this work to happen organically without explicit ownership is how this need gets crowded out by roadmap features.
Surface metrics. Whatever instrumentation you build, make sure at least one structural-health metric shows up in the same operational reviews as sprint velocity, reliability, deployment frequency, etc. The specific metric matters less than ensuring it earns standing time in the same forums where feature progress is reported.
Make architectural intent explicit
If architectural decisions happen implicitly, AI absorbs the friction and the architecture drifts in the dark. If they happen explicitly, both humans and AI have something to align against. This intent is codified in two types of artifacts: prescriptive and retrospective.
Prescriptive artifacts define the intended shape of the system prior to implementation. This architectural guidance might take the form of RFCs, technical design docs, architecture diagrams, or any similar output. These serve two audiences: engineers who need to know the direction, and AI tooling that needs that same direction in machine-readable form. The artifacts don’t have to be heavy; they just need to exist and be current.
Retrospective artifacts (such as ADRs) capture key structural decisions as written records produced at the time of implementation. The important part of these implementation logs is that they capture not just what changed but why, since the rationale isn’t easily derivable from the diff itself.
Together these artifacts constitute a continuously maintained model of the intended vs actual architecture.
Instrument the system so drift is visible
Each item below replaces a signal AI muted.
Architectural fitness functions. Encode structural rules as automated tests that fail the build when violated. Open-source libraries cover most major language ecosystems: ArchUnit for the JVM, NetArchTest for .NET, Dependency Cruiser for Node and TypeScript. These let teams encode rules like “domain layer must not depend on infrastructure layer” and enforce them in CI. These rules come from the architecture guidance in the previous section and the enforcement makes drift fail-loud rather than fail-quiet. Dependency Cruiser also generates visual dependency graphs alongside enforcement, making drift legible to non-engineers over time. Nx and other monorepo tooling ship with module boundary enforcement built in, which means teams may already have access without additional investment.
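As an illustration of the idea rather than a substitute for the libraries above, here is a minimal sketch of a fitness function built only on Python’s standard `ast` module. The layer names (`domain`, `infrastructure`), the `FORBIDDEN` rule table, and the helper names are hypothetical, and the sketch assumes each layer is a top-level package and that modules use absolute imports.

```python
import ast

# Hypothetical rule table: each key is a layer (top-level package) and the
# value is the set of layers it must not import from.
FORBIDDEN = {"domain": {"infrastructure"}}


def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a module's source."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) stay inside the package; skip them.
            packages.add(node.module.split(".")[0])
    return packages


def find_violations(modules: dict[str, str]) -> list[str]:
    """Check every module against FORBIDDEN.

    `modules` maps a dotted module name (e.g. "domain.orders") to its source.
    """
    violations = []
    for name, source in sorted(modules.items()):
        layer = name.split(".")[0]
        banned = FORBIDDEN.get(layer, set()) & imported_packages(source)
        for target in sorted(banned):
            violations.append(f"{name} must not import from {target}")
    return violations


if __name__ == "__main__":
    sample = {
        "domain.orders": "from infrastructure.db import Session\n",
        "domain.pricing": "import decimal\n",
        "infrastructure.db": "from domain.orders import Order\n",
    }
    for violation in find_violations(sample):
        print(violation)
```

Wrapped in a test that asserts `find_violations(...)` returns an empty list, a rule like this fails CI the moment generated code reaches across the boundary, regardless of whether a human or an AI wrote it.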
Temporal codebase analysis. One valuable raw signal data source is git history. Mature commercial platforms like CodeScene, building on Adam Tornhill’s behavioral code analysis research (Software Design X-Rays), do hotspot detection, temporal coupling, knowledge distribution, and organizational bottleneck identification, all from git history. Because the input is the commit log, these approaches are language-agnostic and immediately applicable. On the open-source side, Varves takes a focused temporal-snapshot approach for TypeScript codebases, checking out the repository at monthly intervals and producing a dashboard tracking complexity, coupling, churn concentration, and code health over time.
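To show how much signal sits in the commit log alone, here is a rough sketch of raw hotspot and temporal-coupling counts. It does not reproduce CodeScene’s or Varves’ analysis. It assumes the input text comes from something like `git log --name-only --pretty=format:@@commit`, where the `@@commit` marker is an arbitrary choice, and the function names and the `min_shared` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Assumed marker line that starts each commit in the log text.
COMMIT_MARKER = "@@commit"


def parse_commits(log_text: str) -> list[set[str]]:
    """Split log text into one set of changed file paths per commit."""
    commits: list[set[str]] = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(COMMIT_MARKER):
            commits.append(set())
        elif line and commits:
            commits[-1].add(line)
    return commits


def hotspots(commits: list[set[str]]) -> Counter:
    """Count how many commits touched each file."""
    return Counter(path for files in commits for path in files)


def temporal_coupling(commits: list[set[str]], min_shared: int = 2) -> dict:
    """Count commits in which each pair of files changed together.

    Only pairs that co-changed at least `min_shared` times are returned.
    """
    pairs = Counter()
    for files in commits:
        pairs.update(combinations(sorted(files), 2))
    return {pair: count for pair, count in pairs.items() if count >= min_shared}


if __name__ == "__main__":
    log = (
        "@@commit\nbilling/invoice.py\nbilling/tax.py\n\n"
        "@@commit\nbilling/invoice.py\nbilling/tax.py\nui/cart.py\n\n"
        "@@commit\nbilling/invoice.py\n"
    )
    commits = parse_commits(log)
    print(hotspots(commits).most_common(3))
    print(temporal_coupling(commits))
```

Raw counts like these are noisy (large formatting commits inflate both numbers), so a real analysis would likely filter or weight commits. Even so, files that keep changing together across a supposed module boundary are a reasonable place to start asking questions.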
Periodic human-only implementation. One way to regain some proprioception is to have engineers implement a defined set of representative tasks without AI assistance at regular intervals as a diagnostic exercise. If those tasks feel dramatically harder than expected, that’s useful data. Nooijen’s experience supports the approach: when he tried working without AI after months of heavy use, the gap between expected and actual experience was immediate.
Closing
The velocity gains from AI coding tools are real, and the teams that adopt them well will outpace those that don’t. The question is whether that adoption compounds into durable advantage or accumulates into a Crisis Model cliff. None of this is an argument against leveraging AI tooling for increased velocity. It’s an argument for instrumenting that velocity so its structural cost is visible in real time rather than in retrospect.
The organizations that sustain their advantage are the ones whose architecture can absorb each round of new requirements without toppling over.
Notes
[^1]: Ox Security, analysis of 300 open-source projects on the architectural quality of AI-generated code.
[^2]: Sonar, 2026 State of Code Survey, 1,149 professional developers surveyed on AI tooling’s impact on technical debt.
[^3]: GitClear, analysis of 211 million lines of code from 2020 through 2024, tracking refactoring, duplication, and churn.
[^4]: 2025 DORA Report on AI as an amplifier of existing engineering practice.
[^5]: Early empirical study finding no statistically significant maintainability difference between AI-assisted and unassisted code.

