
AI Democratization and Its Discontents

On the feedback loop that AI’s democratization left behind

In the previous post I argued that research is not engineering carried out at a slower speed, and that the two activities have different goals, different tolerances for failure, and a different relationship with uncertainty.

Here I want to explore a companion problem, one that has been on my mind for a while. It has to do with what happens to a technology once it stops being the property of the few who understand it deeply and becomes available to the many who simply want to use it. We have come to call this democratization, and it is, on balance, a wonderful thing. But it carries a discontent, in the sense Sigmund Freud gave the word in his 1930 essay Civilization and Its Discontents, where he argued that the very things that make civilization possible are also a permanent source of unease. I mean something far less dramatic, of course: not a defect we could engineer away, but something intrinsic to the bargain itself. That discontent is what I want to examine here.

From science to infrastructure

Every transformative technology eventually goes through the same passage. It begins as the exclusive territory of the people who build it, and it ends as something used by people who have no need to understand it at all. Electricity is the textbook example. Nobody wiring a lamp today gives a thought to Maxwell’s equations, and nobody should have to. The complexity gets encapsulated behind an interface, and a new layer of practice grows on top of it, with its own logic, its own professionals, and its own common sense, largely disconnected from the underlying science. That, in W. Brian Arthur’s terms, is the creation of a domain: the moment a technology stops being a scientific object and becomes an infrastructure on which a whole new world of practice can be built.

Let me bring this closer to home, to something I lived through for a large part of my career: speech recognition.

In the 1980s, speech recognition was the exclusive domain of people who understood it from the ground up. The work demanded a mastery of signal processing and acoustic feature extraction: LPC coefficients, filterbank parameters, cepstral features. It demanded fluency in probability theory and statistics, in hidden Markov models, in mixtures of Gaussian densities, and in the famous fundamental equation of speech recognition that ties together statistical acoustic and language models through Bayes’ theorem. This was hard science, done by hard scientists, at places like CSELT in Italy, where I began my own career, LIMSI in France, Philips Research in Germany, Bell Laboratories, IBM Research, Carnegie Mellon University, MIT, SRI, and a handful of others. They were all places characterized by what we would call research. And yet, for all that sophistication, the systems those scientists produced were brittle, error-prone, and frankly not very good outside carefully controlled conditions.
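For readers who have not met it, that equation is usually written along the following lines. The notation is an assumption chosen for illustration: A stands for the sequence of acoustic observations and W for a candidate word sequence.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid A)
        \;=\; \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        \;=\; \arg\max_{W} P(A \mid W)\, P(W)
```

Here P(A | W) is the acoustic model and P(W) is the language model. The denominator P(A) can be dropped because it does not depend on W and so does not change which word sequence wins.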

Then something interesting happened. In the mid-1990s, some of us who had spent years on the theory decided it was time to put it to work. I was part of that move myself, joining SpeechWorks, which, like Nuance, was founded with a specific and bounded goal: to replace the telephone-based customer care systems of the day. Those systems meant either expensive human operators or the maddening “press 1 for billing, press 2 for technical support” DTMF menus that everyone hated. The technology was imperfect, and we knew it. But the business case was compelling anyway. And, crucially, those of us who built and worked at those companies understood in our bones exactly where the technology worked and where it did not, because we had spent years watching it fail in informative ways.

By the early 2000s an entire industry, the IVR industry, had grown up around that foundation. It was not built by the researchers who understood the acoustic models from the inside, or not just by them. It was built mostly by a new layer of practitioners: voice user interface designers, IVR platform specialists, people fluent in voice browser standards and in the art of call flow design, who had no need to understand the recognizer’s internals in order to build something that worked within its scope. In Arthur’s terms this was not yet a full domain, the way electricity eventually became a domain spanning motors, lighting, and appliances, an entire built environment organized around a single phenomenon. It was something narrower: a domain for one sector, customer care, with its own specialized companies (ASR engine vendors, platform providers, VUI consultancies), its own conventions, and, most importantly, its own well-understood failure mode, the humble “press zero for an operator.” The domain was small, but it was real, and it worked because everyone in it, even without research-level understanding, had absorbed where the edges were. “Press zero” existed because somebody, somewhere, had had to design for the recognizer’s known error rate, and the scope was small enough that this design work was tractable.

This is how translation happens, and it is a genuinely good thing. The translator figures I keep returning to in this series, Aldus Manutius, George Westinghouse, Steve Jobs, did not need deep expertise in the underlying science of their moment. What they had instead was a kind of informed abstraction: enough understanding to build something useful, and enough distance from the original complexity to see applications that the original inventors never could.

AI is going through exactly this passage right now, faster and more broadly than any technology before it. People who could not have written a line of code five years ago are shipping products. Domain experts, doctors, lawyers, teachers, small business owners, are building tools tailored to problems they understand far better than any AI lab ever could. This is democratization doing precisely what it is supposed to do: populating the translation phase with the messy, distributed experimentation out of which real applications eventually emerge. Anyone who has watched a field for decades, as I have, should recognize this moment for what it is. And it is, let me be clear, the good part. Never before in the history of technology has something of this magnitude reached this many people, this fast. So, where is the bad part? 

Where the analogy breaks

The difference this time is not about the fact of democratization at all. It is about two things: the span of the domain, which is no longer a narrow use case like the voice-ification of call centers, and what happens to the failures.

Let me stay with speech recognition, because the contrast is sharper there. When we began deploying ASR into call centers, the failures were visible, local, and bounded. The system either understood the caller or it did not, and everyone involved, the deployer, the caller, the platform team, could tell which had happened. The error was legible. It could be measured, reported, and routed back to someone who knew what to do with it. The feedback loop was slow and imperfect, but it functioned, because the failure announced itself. A caller who got confused by the interface, or who phrased a request in an unexpected way, or who stumbled over a badly designed prompt, produced a recognizable signal: silence, repetition, audible frustration, a request for an agent. Those signals were data. And the solution, typically, was smart voice user interface design and patient usability testing: improving the flows, limiting the variability of speech, raising recognition accuracy in the hard cases like numbers and spelled-out words. All of this was possible mostly because the domain, customer care, was small and manageable.

AI’s failure mode is different, and it is different in a specific and consequential way. An LLM is a probabilistic black box wrapped in a deterministic-looking interface. A product built on top of an LLM can fail at some low but irreducible rate, producing outputs that are subtly wrong rather than obviously broken. A legal summarization tool that quietly omits a liability clause; a medical assistant that states the wrong drug interaction with complete confidence; a sales analysis that consistently understates churn by a few points: none of these announce themselves the way a crashed program or a garbled phone call would. And the failure is not always in the output. Sometimes it is the user who does not know how to phrase the question and receives a confident wrong answer; sometimes it is an interface that gives no hint that the system has wandered outside its reliable zone; sometimes a badly posed query returns something plausible but beside the point. The people on the receiving end, the end users, the customers, usually have no way of knowing that any of this is happening, and no channel back to anyone who could diagnose it. And the domain, unlike call centers, is practically unbounded: document compliance, sales, customer attrition, and an endless list of others. It is virtually impossible to make a single AI work equally well across such a myriad of use cases.

And there is a further complication, one that sets this situation apart from the systems of the past. The statistical models of the previous generation were imperfect, but they were legible. They were assembled from designed parts, an acoustic model, a language model, a confidence score, and when something went wrong an expert could open the machine, probe its components, and usually localize the failure to a cause. A modern large language model grants no such access. There are no separate modules for syntax or meaning to inspect, only billions of weights, and a confident error arrives with no internal account of why it was made. So even the team that genuinely wants to close the feedback loop confronts a difficulty the earlier generation never faced: the system cannot tell them what went wrong, and often cannot be compelled to explain itself.

There is a deeper reason these failures stay buried, and it points to why one corner of the AI world is, supposedly, thriving while others struggle. To catch an error, you need a way to verify the output, and for most uses of AI no such verification exists. When a model drafts a contract clause, summarizes a diagnosis, or estimates a market, there is no quick, objective test that tells you whether the answer is right; someone has to know the domain well enough to judge, and often no one checks at all. Code is the great exception. A program either compiles or it does not, the tests pass or they fail, and the verdict comes back in seconds, cheaply and without argument. That is, I think, the real reason AI has been so transformative for software: not because writing code is easier than writing prose, but because code carries its own feedback loop, built in. It is verifiable in a way that most of what we ask AI to do simply is not. Where the output can be checked, the loop closes on its own; where it cannot, the error has no way of making itself known.
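To make the point about code concrete, here is a minimal sketch of what a built-in feedback loop looks like. The function name verify and the shape of the test cases are hypothetical, chosen only for illustration. It also assumes the candidate code is trusted, since exec on untrusted model output would need a sandbox.

```python
def verify(candidate_source: str, function_name: str, cases) -> tuple[bool, str]:
    """Check a candidate function against (args, expected) test cases.

    Illustrative only: exec() runs arbitrary code, so real use on
    model-generated source would require sandboxing.
    """
    # First verdict: does it compile at all?
    try:
        code = compile(candidate_source, "<candidate>", "exec")
    except SyntaxError as exc:
        return False, f"does not compile: {exc.msg}"

    # Second verdict: do the tests pass?
    namespace: dict = {}
    try:
        exec(code, namespace)
        func = namespace[function_name]
        for args, expected in cases:
            actual = func(*args)
            if actual != expected:
                return False, (
                    f"{function_name}{args} returned {actual!r}, "
                    f"expected {expected!r}"
                )
    except Exception as exc:
        return False, f"raised {type(exc).__name__}: {exc}"

    return True, "all tests passed"
```

A contract clause or a diagnosis summary has no equivalent of the cases argument: there is nothing cheap and objective to compare the output against, which is the asymmetry described above.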

The builder, for their part, often does not know either. Not for any lack of skill, but because nothing in the path to building with these tools today requires the direct, repeated exposure to a model failing in informative ways that years of research apprenticeship used to guarantee. When something goes wrong, the natural instinct is to look at the prompt, the data, the interface, all the places where a fix is straightforward and within reach. And for most builders that is where the toolkit ends. Adjusting a prompt is democratic; fine-tuning and reinforcement learning, the interventions that can actually address a deeper structural limitation in the model, require exactly the kind of technical depth that the democratization wave did not carry along with it. To be fair, this is not true of everyone. There are teams, a minority, with genuine intimacy with the models they build on, people who came up through the research world or have lived close enough to it to recognize a structural failure when they see one, and they do close the loop. But for the great majority of practitioners now building with these tools, that intimacy is precisely what the democratization did not hand them. And so the failure goes nowhere. It does not become a signal. It simply accumulates, quietly, as diffuse harm.

The word “bug” is doing a lot of damage

Underneath all of this lies a more basic problem, and it is one I have watched play out before.

The early speech recognition deployments ran into a version of it decades ago. The system would occasionally misrecognize an utterance, and the people running the deployment, smart and capable people, would promptly file it as a bug. Something to be fixed, patched, eliminated. What they were really looking at was the system working exactly as designed: a statistical model producing its expected error rate. The “fix” was no fix at all, because nothing was broken. The behavior was structural, not incidental. I remember more than one customer who simply could not accept that an isolated recognition error was not, in itself, a defect to be reported and corrected. And in most cases they did not ask the question that mattered, how often does this happen, but were affronted by the single instance, as though one error were proof of a broken machine.

LLMs have inherited this problem and made it worse, in part because of a single seductive word: hallucination.

“Hallucination” frames the phenomenon as a malfunction, as something that happens when the system goes wrong, rather than as something the system does a certain percentage of the time as part of how it works. But a large language model that produces a confidently stated inaccuracy is not doing anything categorically different from what every statistical machine learning system has always done. It is wrong some fraction of the time, in ways that do not announce themselves. We never called a speech recognizer’s misrecognitions “hallucinations”, nor the errors inevitably produced by any statistical classifier. We called them errors. We measured their rate. We built our systems around the expectation that the rate was nonzero and would stay nonzero.

By giving LLM inaccuracies a special and evocative name, we have made them seem exotic, almost pathological, when in fact they are nothing of the sort: they are the expected output of a probabilistic system, of precisely the kind every ML practitioner has worked with for decades. This is, after all, what an LLM is built to do: it is trained to predict the next token, not to be correct, so an occasional confident falsehood is not a betrayal of its purpose but a direct consequence of it. And the framing matters enormously, because if you believe you are dealing with a bug, you go looking for the bug. If instead you understand that you are dealing with an intrinsic error rate, you build around it, with confidence thresholds, human review, and fallback paths, the whole apparatus that the speech and natural language world assembled over years of hard-won experience with exactly this kind of system. And here is what strikes me: very rarely, in my meetings with AI practitioners, do I hear anyone talk about error rates, accuracy, precision, or recall, about validation and test sets, or about the statistical significance of a result. The vocabulary that my field used to live by, the basic measures of how often a system is right and how often it is wrong, has quietly fallen out of the conversation, and that absence is itself a symptom of the problem I am describing.
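A minimal sketch of what building around an error rate can look like, in the spirit of the confidence thresholds and fallback paths just mentioned. Everything here is hypothetical: the names ModelOutput, route, and evaluate_threshold, the threshold values, and above all the assumption that the system exposes a usable confidence score, which many LLM APIs do not provide directly.

```python
from dataclasses import dataclass


@dataclass
class ModelOutput:
    text: str
    confidence: float  # assumed to be a score in [0, 1]; an assumption, not a given


def route(output: ModelOutput, accept_at: float = 0.90, review_at: float = 0.60) -> str:
    """Decide what to do with an output instead of trusting it blindly."""
    if output.confidence >= accept_at:
        return "accept"
    if output.confidence >= review_at:
        return "human_review"
    return "fallback"  # the equivalent of "press zero"


def evaluate_threshold(scored, threshold: float) -> tuple[float, float]:
    """Measure a threshold on a labeled validation set.

    scored: list of (confidence, is_correct) pairs.
    Returns (coverage, error rate among accepted outputs).
    """
    if not scored:
        raise ValueError("validation set is empty")
    accepted = [is_correct for confidence, is_correct in scored if confidence >= threshold]
    coverage = len(accepted) / len(scored)
    error_rate = accepted.count(False) / len(accepted) if accepted else 0.0
    return coverage, error_rate
```

The second function is the part that tends to go missing: a threshold is only meaningful once it has been measured against labeled data, so that the trade-off between how much gets accepted and how often the accepted outputs are wrong is a number rather than a guess.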

This connects directly back to the feedback problem. If the people deploying these systems experience an inaccuracy as a “bug”, they report it as a bug, which means it goes into a queue to be “fixed”, which means it never reaches the place where the actual signal it carries, an empirical error rate, under these conditions, for this task, could be measured and designed around. The miscategorization is not merely semantic. It is the very mechanism by which the feedback loop gets short-circuited before it can even begin.
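As a sketch of the alternative, here is one way to turn individual reports into an error rate per task, with a rough uncertainty range attached. The function names and the (task, is_error) record shape are hypothetical. The Wilson score interval is my choice of a standard method for a binomial proportion, not something prescribed by any particular tool.

```python
import math
from collections import defaultdict


def wilson_interval(errors: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = errors / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def error_rates(reports):
    """Aggregate (task, is_error) records into a rate and interval per task."""
    counts = defaultdict(lambda: [0, 0])  # task -> [errors, trials]
    for task, is_error in reports:
        counts[task][0] += int(is_error)
        counts[task][1] += 1
    return {
        task: (errors / trials, wilson_interval(errors, trials))
        for task, (errors, trials) in counts.items()
    }
```

The point of the interval is the question my old customers rarely asked: three errors in a hundred outputs is not one bug reported three times, it is an estimate of a rate, and the range around it says how much more data is needed before designing around it.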

The one piece that has not caught up

And yet none of this is an argument against democratization, nor a claim that today’s builders are somehow less capable or less talented than earlier generations. Not at all. The democratization of AI is a positive development, and it is poised to speed up the very process of domain building I have been describing. The tools really are amazing, the people using them really are capable, and the explosion of applications is real progress, not an illusion.

What has not kept pace is something more specific: the mechanism of feedback. The old path into a technical field required years of hands-on work with systems that failed regularly and visibly. That was never by design; it was simply the nature of the work. But those years built something that no course or tutorial can deliver, namely an instinct for failure. You developed a feel for where a system’s reliable zone ended, for how to design around a nonzero error rate, for when to trust an output and when to question it. That instinct was not taught. It was accumulated, slowly, through repetition and failure.

It is worth being concrete about how that feedback actually worked in the world I knew, because the contrast with today is sharper than it first appears. Companies like SpeechWorks, Nuance, and IBM did not simply sell a recognition engine and walk away. We collected call recordings from every deployment. Every misrecognition, every caller repetition, every “press zero” was logged, analyzed, and fed back into the engine. Production failures became training data. On top of that, we sent people in: VUI designers and tuning engineers and integrators who would sit with the customer, listen to the failure recordings, redesign the call flows, and adjust the confidence thresholds. The failure was not merely logged; it was diagnosed, by people who understood what they were looking at. And the deployments carried measurable, contractual commitments, around recognition accuracy, task completion, and the all-important containment rate, the proportion of calls handled without a human agent. If the numbers were bad, somebody was accountable. That accountability created a direct, unavoidable incentive to understand failures rather than ignore them.
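For concreteness, the metrics named above reduce to very simple arithmetic over call logs. The sketch below uses a hypothetical CallRecord shape; the field names are illustrative and not taken from any real platform.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    call_id: str
    transferred_to_agent: bool  # True if the caller ended up with a human
    task_completed: bool        # True if the caller's goal was achieved


def containment_rate(calls: list[CallRecord]) -> float:
    """Proportion of calls handled without a human agent."""
    if not calls:
        raise ValueError("no calls to measure")
    return sum(not c.transferred_to_agent for c in calls) / len(calls)


def task_completion_rate(calls: list[CallRecord]) -> float:
    """Proportion of calls in which the caller's task was completed."""
    if not calls:
        raise ValueError("no calls to measure")
    return sum(c.task_completed for c in calls) / len(calls)
```

The arithmetic is trivial. What made it matter was that the logs existed for every call, and that someone was contractually on the hook for the result.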

Almost none of that infrastructure exists in the typical AI deployment today. Most LLM-based products are built on top of foundation models reached through an API. The builder has no visibility into the model’s internals, no way to feed production failures back into it, and no professional services layer with the depth to diagnose what is really going wrong. The feedback loop we had in the ASR world rested on ownership of the stack, or at least on deep access to it. Today’s builder, more often than not, has a prompt and an API key. The failures pile up in production, and they have nowhere to go.

So it is not that anything was taken away. It is that something which used to travel for free now has to be built deliberately, and right now, mostly, it is not being built.

Where it adds up

The result is a kind of fog: a growing mass of AI-mediated decisions and outputs in the world, of genuinely unknown reliability, with no aggregate mechanism for discovering which of them are unreliable until something breaks badly enough to reach the news.

This, I think, is the deeper structure beneath what I have called the demo-driven economy. A demo is exactly the artifact one would expect to flourish in such an environment: cheap to produce, instantly legible as “AI working”, and entirely disconnected from the validation that the missing feedback infrastructure ought to have supplied but does not. The demo becomes a substitute for evidence, not because anyone is being dishonest, but because evidence is expensive to produce and the market has quietly stopped requiring it.

General AI deployment does not yet have a domain in the sense the IVR world eventually did, and I would argue it cannot have one in the same way, precisely because the scope is not narrow. There is no single sector whose conventions could absorb “this output is wrong some percentage of the time” the way customer care absorbed it for ASR. If anything, the closest analog, voice-driven customer care built on LLMs, may well be where this kind of domain construction happens first, and for the very same reason it happened the first time: a sector bounded enough, and stakes high enough, that someone is finally forced to invent the equivalent of “press zero”. But for AI in general the domain is not built yet. We are still at the stage where the IVR industry was before anyone had thought of the escape hatch: plenty of capable builders, plenty of activity, and no shared convention yet for what to do when the system is, predictably, wrong.

Where this shows up as a headline

If all this sounds abstract, there is a very concrete place where it surfaces: the stubbornly high failure rate of agentic AI projects.

The numbers, drawn from several independent sources, tell a remarkably consistent story. A report from MIT’s Project NANDA, published in August 2025 and based on 150 interviews with leaders, a survey of 350 employees, and an analysis of 300 public AI deployments, found that roughly 95 percent of generative AI pilot programs fail to deliver any measurable financial return. Gartner, in a June 2025 prediction resting on a poll of more than 3,400 organizations, forecast that over 40 percent of agentic AI projects specifically will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. Deloitte’s research from the same period, drawing on a survey of 3,235 senior leaders, found that only 11 percent of organizations are actually running agentic AI in production. And perhaps the most telling figure of all: HBR research found that while 86 percent of organizations plan to increase their investment in agentic AI, only 6 percent trust the agents to handle core, end-to-end business processes on their own. The gap between ambition and trust has become the defining tension of 2026.

The Gartner analyst behind the prediction put it bluntly: most of these projects are early-stage experiments and proofs of concept, driven by hype and frequently misapplied. That diagnosis maps almost exactly onto the argument I have been making here. The tools work; what is missing is the infrastructure to know when they fail, and to repair them.

The controlled benchmarks tell the same story, and without any of the governance framing. The APEX-Agents benchmark, published in 2026 and testing agents on the kind of complex tasks that occupy financial analysts, management consultants, and lawyers, found that the best model met all of a task’s requirements only about 24 percent of the time. And the CLEAR framework, a 2026 study evaluating enterprise agentic AI systems, found a 37 percent gap between lab benchmark scores and real-world deployment performance.

Why does an agent fail so much more dramatically than a single answer? Most explanations are either too generic (“the tech isn’t ready”) or too narrow (blame the orchestration framework, blame the prompting strategy). A recent Medium article by the AI strategist Celine Xu puts it well: agents rarely fail because they are unintelligent. They fail because we misunderstand what they are. They are not magic decision-makers, and they are not deterministic software; they are adaptive systems operating inside architectures that we design. And when those architectures lack observability, even a perfectly functioning agent becomes untrustworthy, because you cannot trace its reasoning, and if you cannot trace its reasoning you cannot tell when it has gone astray.

That observability gap is the broken feedback loop in its purest form. Xu describes one production system whose backend stored only the user’s input and the agent’s final output, and none of the steps in between: no planner decisions, no tool calls, no execution states. On the first interaction everything worked. On the second, the behavior diverged. Same user, same request, a different result, and no record of why. In the ASR world, by contrast, every misrecognition was logged and every “press zero” was a data point; the failure was visible, local, and traceable. In most agentic deployments today the failure is invisible by design, not because anyone intended it to be, but because the feedback infrastructure was simply never built.
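A minimal sketch of the missing piece, assuming nothing about any particular agent framework: record every intermediate step, not only the input and the final output, so that two runs can be compared afterwards. The names AgentTrace, TraceEvent, and first_divergence, and the event kinds, are hypothetical.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    kind: str      # e.g. "input", "planner_decision", "tool_call", "output"
    payload: dict
    timestamp: float = field(default_factory=time.time)


@dataclass
class AgentTrace:
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list = field(default_factory=list)

    def record(self, kind: str, **payload) -> None:
        """Append one step of the run; call this at every decision point."""
        self.events.append(TraceEvent(kind, payload))

    def to_json(self) -> str:
        """Serialize the whole run so it can be stored next to the final output."""
        return json.dumps(asdict(self), default=str)


def first_divergence(a: AgentTrace, b: AgentTrace):
    """Index of the first step where two runs differ, or None if they match."""
    for i, (x, y) in enumerate(zip(a.events, b.events)):
        if (x.kind, x.payload) != (y.kind, y.payload):
            return i
    if len(a.events) != len(b.events):
        return min(len(a.events), len(b.events))
    return None
```

With a record like this, the "same user, same request, different result" case stops being a mystery and becomes a question with an answer: the two runs agreed up to some step and parted ways at the next one.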

There is also a compounding logic here that is worth stating plainly. A single LLM answer that is wrong some percentage of the time is a bounded thing: one bad answer, one bad draft, and a human reading it has a built-in fallback, the habit of checking before acting. An agent removes that fallback at the worst possible moment. It takes a sequence of actions, each feeding the next, so the intrinsic error rate does not merely add up across steps; it compounds. A small per-step chance of a subtly wrong output can become, over several steps, a chain that is confidently and coherently wrong by the end, in a way far harder to catch than any single mistake. The benchmark numbers above are simply what that compounding looks like when you measure it. The CLEAR study found exactly this: accuracy that held at 60 percent on a single run fell to 25 percent across eight consecutive runs.
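A back-of-the-envelope sketch of that compounding, under two simplifying assumptions made purely for illustration: every step succeeds independently with the same probability, and a single bad step spoils the whole chain. Real agents can sometimes recover from a bad step, so this is an idealized picture rather than a prediction. The function names are hypothetical.

```python
def chain_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


def steps_until_below(per_step_success: float, target: float) -> int:
    """Smallest number of steps at which chain success drops below `target`."""
    if not 0.0 < per_step_success < 1.0:
        raise ValueError("per_step_success must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    steps = 1
    while per_step_success ** steps >= target:
        steps += 1
    return steps
```

With a 98 percent per-step success rate, a 50-step task finishes cleanly only a little over a third of the time under these assumptions, and the chain drops below even odds at 35 steps. A five-step demo at the same rate succeeds about nine times in ten.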

And the word “bug” does its damage here too, only at a larger scale. When an agentic project disappoints, the reflex is to reach for engineering fixes: better orchestration, a different framework, more tool-calling scaffolding, all of which treat the failure as a configuration problem. But if a meaningful share of the failure is the base model’s ordinary error rate compounding across steps, no amount of orchestration tuning will touch it. It is not the wiring. It is the thing the wiring is built on. This is also why agentic demos can look so spectacular and still collapse in deployment: a demo is a short performance, a handful of steps at most, and that is precisely where compounding error has not yet had the chance to accumulate. Stretch the same system over the dozens or hundreds of steps a real task demands, and the errors begin to chain.

Agentic AI, almost by definition, is the least scoped application of these models, the one furthest from possessing any sector-specific “press zero”. So I do not think its high failure rate is a phenomenon separate from everything else in this post. It is the same phenomenon in its sharpest form: the point at which “missing feedback” and “no domain built yet” stop being abstractions and start showing up as a number that everyone is suddenly asking about.

What this means, and where I think it leads

If there is a constructive thread to pull from all this, it runs through the translation layer itself, through the intermediaries who sit between raw model capability and deployed product. That, I believe, is where the missing feedback infrastructure will have to be rebuilt. Not by re-erecting the gates around the underlying technology, which is neither possible nor desirable, but by someone in that middle layer taking deliberate responsibility for the validation work that used to happen on its own, as a byproduct of the slow path into the field.

This is not a small ask. It means treating the questions “how do we know this works, at what rate does it fail, and what happens when it does” as first-class parts of the product rather than as afterthoughts bolted on once something has gone wrong in production. It is, in truth, closer to the discipline of a serious engineering organization than to the instincts of a research lab or the incentives of a demo-driven market. Which is really just another way of saying that this is precisely the translation problem the AI field finds itself in the middle of, whether it has noticed or not.

So, do we still need the kind of deep expertise that democratization has made it possible to do without? I think we do, only in a new place. We need it less at the point of building, where the tools have done their wonderful, liberating work, and more at the point of knowing, the unglamorous, indispensable work of finding out whether what we have built actually works. The democratization has already happened, and we should celebrate it. The discontent is simply the part of the job that no one has yet been assigned. I am optimistic, as I usually am, that we will assign it. But only if we first stop calling the error a bug, and start calling it what it is.
