AI Democratization and Its Discontents

On the feedback loop that AI’s democratization left behind

What happens to a technology once it stops being the property of the few who understand it deeply and becomes available to the many who simply want to use it? We have come to call this democratization, and it is, on balance, a wonderful thing. But it carries a discontent, in the sense Sigmund Freud gave the word in his 1930 essay Civilization and Its Discontents: not a defect we could engineer away, but something intrinsic to the bargain itself. That discontent is what I want to examine here.

From science to infrastructure

Every transformative technology eventually goes through the same passage. It begins as the exclusive territory of the people who build it, and it ends as something used by people who have no need to understand it at all. Electricity is the textbook example: nobody wiring a lamp today gives a thought to Maxwell’s equations, and nobody should have to. The complexity gets encapsulated behind an interface, and a new layer of practice grows on top of it, with its own logic and its own professionals, largely disconnected from the underlying science. That, in W. Brian Arthur’s terms, is the creation of a domain: the moment a technology stops being a scientific object and becomes an infrastructure on which a whole new world of practice can be built.

Let me bring this closer to home, to something I lived through for much of my career: speech recognition.

In the 1980s, speech recognition belonged to people who understood it from the ground up: signal processing and acoustic features, hidden Markov models and mixtures of Gaussian densities, the fundamental equation that ties statistical acoustic and language models together through Bayes’ theorem, and a thicket of mathematical subtleties that most people outside the field never encounter. This was hard science, done by hard scientists, at places like CSELT in Italy, where I began my own career, LIMSI in France, Philips Research, Bell Laboratories, IBM Research, and a handful of others. And yet, for all that sophistication, the systems they produced were brittle, error-prone, and not very good outside carefully controlled conditions.
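For readers who have not met it, the fundamental equation mentioned above is usually written in the textbook form below. The symbols are notation assumed here for illustration: A stands for the observed acoustic evidence and W for a candidate word sequence.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid A)
        \;=\; \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        \;=\; \arg\max_{W} \underbrace{P(A \mid W)}_{\text{acoustic model}} \;
                            \underbrace{P(W)}_{\text{language model}}
```

The denominator P(A) drops out because it does not depend on W. The recognizer returns the most probable word sequence, not a guaranteed correct one, so a nonzero error rate is part of the formulation itself.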

And then something interesting happened. In the mid-1990s, some of us who had spent years on the theory decided it was time to put it to work. I was part of that move myself, joining SpeechWorks, one of the companies, along with Nuance, founded to replace the telephone customer care systems of the day, the expensive human operators and the maddening “press 1 for billing, press 2 for technical support” menus that everyone hated. The technology was imperfect, and we knew it, but the business case was compelling anyway. By the early 2000s an entire industry had grown up around that foundation, built not by the researchers who understood the acoustic models from the inside, but by a new layer of practitioners: voice user interface designers, platform engineers, and system integrators. These were people who understood call flows, voice browser standards, and deployment logistics, but had no need to understand the recognizer’s internals to build something that worked within its scope. In Arthur’s terms this was not yet a full domain, the way electricity became a domain spanning motors, lighting, and appliances. It was something narrower: a domain for one sector, customer care, with its own conventions and, most importantly, its own well-understood failure mode, the humble “press zero for an operator.” It worked because everyone in it, even without research-level understanding, had absorbed where the edges were.

This is how translation happens, and it is a genuinely good thing. The translator figures I keep returning to in this series, Aldus Manutius, George Westinghouse, Steve Jobs, did not need deep expertise in the science of their moment. What they had instead was a kind of informed abstraction: enough understanding to build something useful, and enough distance from the original complexity to see applications the inventors never could.

AI is going through exactly this passage now, faster and more broadly than any technology before it. People who could not have written a line of code five years ago are shipping products; doctors, lawyers, teachers, and small business owners are building tools for problems they understand far better than any AI lab ever could. This is democratization doing precisely what it is supposed to do, populating the translation phase with the messy, distributed experimentation out of which real applications emerge. And it is, let me be clear, the good part. Never before in the history of technology has something of this magnitude reached this many people, this fast. So, where is the bad part?

Where the analogy breaks

The difference this time is not about the fact of democratization at all. It is about two things: the span of the domain, no longer a narrow use case like the voice-ification of call centers, and what happens to the failures.

Let me stay with speech recognition, because the contrast is sharp. When we deployed ASR into call centers, the failures were visible, local, and bounded. The system either understood the caller or it did not, and everyone involved could tell which had happened. The error was legible: it could be measured, reported, and routed back to someone who knew what to do with it. A confused caller produced a recognizable signal, silence, repetition, a request for an agent, and those signals were data. The fix was patient voice interface design and usability testing, and it was tractable mostly because the domain, customer care, was small and manageable.

AI’s failure mode is different, and consequentially so. An LLM is a probabilistic black box wrapped in a deterministic-looking interface. LLM-based products can fail at some low but irreducible rate, producing outputs that are subtly wrong rather than obviously broken: a legal summary that quietly omits a liability clause, a medical assistant that states the wrong drug interaction with complete confidence, a financial forecast that gets the number wrong and nobody notices. None of these announce themselves the way a crashed program or a garbled phone call would. The people on the receiving end usually have no way of knowing it is happening, and no channel back to anyone who could diagnose it. And the domain, unlike call centers, is practically unbounded: document compliance, sales, attrition, and an endless list of others. It is virtually impossible to make a single AI work equally well across such a myriad of uses.

There is a further complication, one that sets this apart from the systems of the past. The statistical models of the previous generation were imperfect, but they were legible. They were assembled from designed parts, an acoustic model, a language model, a confidence score, and when something went wrong an expert could open the machine, probe the components, and usually localize the failure to a cause. A modern large language model grants no such access: there are no separate modules for syntax or meaning to inspect, only billions of weights, and a confident error arrives with no internal account of why it was made. So even the team that genuinely wants to close the loop faces a difficulty the earlier generation never did: the system cannot tell them what went wrong, and often cannot be compelled to explain itself.

There is also a deeper reason these failures stay buried. To catch an error you need a way to verify the output, and for most uses of AI no such verification exists; someone has to know the domain well enough to judge, and often no one checks at all. Code is the great exception: a program compiles or it does not, the tests pass or they fail, and the verdict comes back in seconds. That, I think, is the real reason AI has been so transformative for software, not because writing code is easier than writing prose, but because code carries its own feedback loop, built in. Where the output can be checked, the loop closes on its own; where it cannot, the error has no way of making itself known.

The builder often does not know either, not for any lack of skill, but because nothing in the path to building with these tools requires the repeated exposure to a model failing in informative ways that years of research apprenticeship used to guarantee. When something goes wrong, the instinct is to adjust the prompt, the data, the interface, the places where a fix is straightforward. Fine-tuning and reinforcement learning, the interventions that address deeper structural limits, require exactly the technical depth the democratization wave did not carry along with it. A minority of teams, those with real intimacy with the models they build on, do close the loop. But for most, that intimacy is precisely what democratization did not hand them, and so the failure goes nowhere. It does not become a signal. It simply accumulates, quietly, as diffuse harm.

The word “bug” is doing a lot of damage

Underneath all of this lies a more basic problem, one I first ran into while working on the early speech recognition deployments. The system would misrecognize an utterance, and the capable people running the deployment would file it as a bug, something to be fixed, patched, eliminated. What they were really looking at was the system working exactly as designed, a statistical model producing its expected error rate. The fix was no fix at all, because nothing was broken. I remember more than one customer who could not accept that an isolated recognition error was not, in itself, a defect to be corrected, and who did not ask the question that mattered, how often does this happen, but was affronted by the single instance, as though one error were proof of a broken machine.

LLMs have inherited this problem and made it worse, in part because of a single seductive word: hallucination. The word frames the phenomenon as a malfunction, something that happens when the system goes wrong, rather than something it does a certain percentage of the time as part of how it works. But a model producing a confident inaccuracy is doing nothing categorically different from what every statistical machine learning system has always done. This is, after all, what an LLM is built to do: it is trained to predict the next token, not to be correct, so an occasional confident falsehood is not a betrayal of its purpose but a direct consequence of it. We never called a speech recognizer’s misrecognitions “hallucinations,” nor the errors of any statistical classifier. We called them errors, we measured their rate, and we built around the expectation that the rate was nonzero and would stay nonzero.

And here is what strikes me in my meetings with AI practitioners: very rarely do I hear anyone talk about error rates, accuracy, precision, recall, validation and test sets, or the statistical significance of a result. The vocabulary my field once lived by has quietly fallen out of the conversation, and that absence is itself a symptom. For if you experience an inaccuracy as a bug, you report it as a bug, and it goes into a queue to be fixed rather than to the place where the signal it carries, an empirical error rate, for this task, under these conditions, could be measured and designed around. The miscategorization is not merely semantic. It is the mechanism by which the feedback loop gets short-circuited before it can even begin.
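To make the vocabulary concrete, here is a minimal sketch of what treating an inaccuracy as a rate, rather than a bug, can look like. It assumes a hand-reviewed sample of outputs for one task. The function names and the sample counts are hypothetical, and the Wilson score interval is one common choice of confidence interval, not the only one.

```python
import math


def wilson_interval(errors: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error rate estimated from a reviewed sample.

    z = 1.96 corresponds to roughly 95 percent confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= errors <= trials:
        raise ValueError("errors must be between 0 and trials")
    p = errors / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical review: 3 wrong outputs found among 200 sampled for one task.
    low, high = wilson_interval(errors=3, trials=200)
    print(f"observed error rate 1.5%, plausible range {low:.1%} to {high:.1%}")

    # Zero observed errors is still not evidence of a zero error rate.
    low, high = wilson_interval(errors=0, trials=100)
    print(f"0 errors in 100: upper bound still about {high:.1%}")
```

The second call is the mirror image of the affronted customer: a single error does not prove the machine is broken, and a clean sample of a hundred does not prove it is not.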

The one piece that has not caught up

And yet none of this is an argument against democratization, nor a claim that today’s builders are less capable than earlier generations. Not at all. The democratization of AI is a positive development, poised to speed up the very process of domain building I have been describing. The tools are amazing, the people using them are capable, and the explosion of applications is real progress.

What has not kept pace is the mechanism of feedback. The old path into a technical field required years of hands-on work with systems that failed regularly and visibly. That was never by design; it was the nature of the work. But those years built something no course can deliver, an instinct for failure: a feel for where a system’s reliable zone ends, for how to design around a nonzero error rate, for when to trust an output and when to question it. That instinct was not taught. It was accumulated, slowly, through repetition and failure.

It is worth being concrete about how that feedback actually worked, because the contrast with today is sharp. Companies like SpeechWorks, Nuance, and IBM did not simply sell an engine and walk away. Every misrecognition, every caller repetition, every “press zero” was logged, analyzed, and fed back into the system; production failures became training data. We sent people in, designers and tuning engineers who sat with the customer, listened to the recordings, and adjusted the flows and the confidence thresholds. And the contracts carried measurable commitments, recognition accuracy, task completion, the all-important containment rate. If the numbers were bad, somebody was accountable, and that accountability created a direct incentive to understand failures rather than ignore them.
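A rough sketch of two of the mechanisms described above, confidence thresholds and the containment rate, follows. It is an illustration under stated assumptions, not a reconstruction of any vendor's system: the threshold values, the three routing outcomes, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    hypothesis: str    # what the recognizer thinks the caller said
    confidence: float  # recognizer confidence score in [0, 1]


def route(turn: Turn, accept_at: float = 0.80, confirm_at: float = 0.50) -> str:
    """Decide what to do with one recognition result.

    The thresholds are illustrative; in practice they were tuned per
    deployment from logged calls.
    """
    if turn.confidence >= accept_at:
        return "accept"    # proceed with the hypothesis
    if turn.confidence >= confirm_at:
        return "confirm"   # ask the caller: "Did you say ...?"
    return "operator"      # the "press zero" fallback


def containment_rate(call_outcomes: list[str]) -> float:
    """Fraction of calls completed without a transfer to a human operator."""
    if not call_outcomes:
        raise ValueError("need at least one call outcome")
    contained = sum(1 for outcome in call_outcomes if outcome != "operator")
    return contained / len(call_outcomes)


if __name__ == "__main__":
    print(route(Turn("billing", 0.92)))
    print(route(Turn("billing", 0.61)))
    print(route(Turn("billing", 0.20)))
    print(containment_rate(["completed", "operator", "completed", "completed"]))
```

The design point is that the fallback is part of the product: the low-confidence branch exists because the error rate was assumed to be nonzero from the start.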

Almost none of that infrastructure exists in the typical AI deployment today. Most products are built on foundation models reached through an API. The builder has no visibility into the model’s internals, no way to feed production failures back into it, and no service layer with the depth to diagnose what is going wrong. Today’s builder, more often than not, has a prompt and an API key, and the failures pile up in production with nowhere to go. So it is not that anything was taken away. It is that something which used to travel for free now has to be built deliberately, and right now, mostly, it is not.

Where it adds up

The result is a kind of fog: a growing mass of AI-mediated decisions and outputs of genuinely unknown reliability, with no aggregate mechanism for discovering which of them are unreliable until something breaks badly enough to reach the news.

This is the deeper structure beneath what I have called the demo-driven economy. A demo is exactly the artifact one would expect to flourish here: cheap to produce, instantly legible as “AI working,” and entirely disconnected from the validation the missing feedback infrastructure ought to have supplied. The demo becomes a substitute for evidence, not because anyone is being dishonest, but because evidence is expensive to produce and the market has quietly stopped requiring it.

General AI deployment does not yet have a domain in the sense the IVR world eventually did, and I would argue it cannot have one in the same way, precisely because the scope is not narrow. There is no single sector whose conventions could absorb “this output is wrong some percentage of the time” the way customer care absorbed it for ASR. The closest analog, voice-driven customer care built on LLMs, may well be where this domain construction happens first, for the same reason it happened the first time: a sector bounded enough, and stakes high enough, that someone is finally forced to invent the equivalent of “press zero.” But for AI in general the domain is not built yet.

Where this shows up as a headline

If all this sounds abstract, there is a very concrete place where it surfaces: the stubbornly high failure rate of agentic AI projects.

The numbers, from several independent sources, tell a consistent story. A report from MIT’s Project NANDA, published in August 2025, found that roughly 95 percent of generative AI pilot programs fail to deliver any measurable financial return. Gartner has forecast that over 40 percent of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear value, and inadequate risk controls. Deloitte found that only 11 percent of organizations are actually running agentic AI in production. And perhaps the most telling figure of all: HBR research found that while 86 percent of organizations plan to increase their investment in agentic AI, only 6 percent trust the agents to handle core, end-to-end processes on their own. The gap between ambition and trust has become the defining tension of 2026.

The Gartner analyst behind the prediction put it bluntly: most of these projects are early-stage experiments driven by hype and frequently misapplied. That maps almost exactly onto the argument I have been making. The tools work; what is missing is the infrastructure to know when they fail, and to repair them.

The controlled benchmarks say the same thing without any governance framing. The APEX-Agents benchmark found that the best model met all of a complex task’s requirements only about 24 percent of the time, and the CLEAR framework, a 2026 study of enterprise agentic systems, found a 37 percent gap between lab benchmark scores and real-world deployment performance.

Why does an agent fail so much more dramatically than a single answer? A recent Medium article by the AI strategist Celine Xu puts it well: agents rarely fail because they are unintelligent, but because we misunderstand what they are. They are not deterministic software; they are adaptive systems operating inside architectures we design, and when those architectures lack observability, even a perfectly functioning agent becomes untrustworthy, because you cannot trace its reasoning and so cannot tell when it has gone astray. Xu describes one production system that stored only the user’s input and the agent’s final output, with none of the steps in between. On the first interaction everything worked; on the second, the behavior diverged, same request, different result, and no record of why. In the ASR world every misrecognition was logged and traceable. In most agentic deployments today the failure is invisible by design, not because anyone intended it, but because the feedback infrastructure was never built.
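As a sketch of what the missing record could look like, the snippet below keeps every intermediate step of a run, not only the input and the final output, and finds the first step at which two runs of the same request diverge. All class, field, and function names are hypothetical, and a real deployment would more likely use an existing tracing tool than a hand-rolled one.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class StepRecord:
    index: int
    action: str   # e.g. a tool name or "llm_call"
    inputs: dict  # what the step was given
    output: str   # what the step produced
    timestamp: float = field(default_factory=time.time)


@dataclass
class RunTrace:
    request: str
    steps: list = field(default_factory=list)

    def record(self, action: str, inputs: dict, output: str) -> None:
        """Append one intermediate step to the trace."""
        self.steps.append(StepRecord(len(self.steps), action, inputs, output))

    def to_json(self) -> str:
        """Serialize the whole run so it can be stored and inspected later."""
        return json.dumps(asdict(self), indent=2)


def first_divergence(a: RunTrace, b: RunTrace) -> Optional[int]:
    """Index of the first step where two runs differ, or None if they match."""
    for step_a, step_b in zip(a.steps, b.steps):
        same = (step_a.action, step_a.inputs, step_a.output) == (
            step_b.action, step_b.inputs, step_b.output)
        if not same:
            return step_a.index
    if len(a.steps) != len(b.steps):
        return min(len(a.steps), len(b.steps))
    return None


if __name__ == "__main__":
    run1 = RunTrace("refund order 1234")
    run1.record("lookup_order", {"id": "1234"}, "status=shipped")
    run1.record("decide", {"status": "shipped"}, "issue refund")

    run2 = RunTrace("refund order 1234")
    run2.record("lookup_order", {"id": "1234"}, "status=shipped")
    run2.record("decide", {"status": "shipped"}, "escalate to human")

    print("runs diverge at step", first_divergence(run1, run2))
```

With only the request and the final answer stored, the two runs above would look like an unexplained inconsistency; with the steps stored, the divergence can be localized to a single decision.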

There is also a compounding logic worth stating plainly. A single wrong answer is bounded, and a human reading it has a built-in fallback, the habit of checking before acting. An agent removes that fallback at the worst moment: it takes a sequence of actions, each feeding the next, so the intrinsic error rate does not merely add up, it compounds. A small per-step chance of a subtly wrong output becomes, over several steps, a chain that is confidently and coherently wrong by the end. The CLEAR study measured a closely related effect: accuracy that held at 60 percent on a single run fell to 25 percent across eight consecutive runs. This is also why agentic demos can look so spectacular and still collapse in deployment. A demo is a short performance, a handful of steps, too few for the errors to accumulate; stretch the same system over the dozens or hundreds of steps a real task demands, and they begin to chain.
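The arithmetic behind that compounding is simple enough to sketch. The model below is a deliberate simplification: it assumes every step succeeds independently with the same probability and that no step can recover from an earlier mistake. The 99 percent per-step figure is illustrative, and the function name is hypothetical.

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes independent steps with identical success probability and no
    recovery from earlier errors, so the result is p ** n.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Same per-step reliability, demo-length versus task-length chains.
    for steps in (5, 20, 100):
        print(f"{steps:>3} steps at 99% per step: "
              f"{chain_success(0.99, steps):.1%} chance of a clean run")
```

Under these assumptions a step that is right 99 percent of the time gives a clean run about 95 percent of the time over 5 steps, about 82 percent over 20, and only about 37 percent over 100. Real agents violate the independence assumption in both directions, so this is an intuition pump, not a prediction, and it is not the calculation behind the CLEAR figures.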

Agentic AI is, almost by definition, the least scoped application of these models, the one furthest from possessing any sector-specific “press zero.” So its high failure rate is not a phenomenon separate from everything else here. It is the same phenomenon in its sharpest form: the point at which “missing feedback” and “no domain built yet” stop being abstractions and start showing up as a number everyone is suddenly asking about.

What this means, and where I think it leads

If there is a constructive thread to pull from all this, it runs through the translation layer itself, the intermediaries who sit between raw model capability and deployed product. That is where the missing feedback infrastructure will have to be rebuilt, not by re-erecting the gates around the technology, which is neither possible nor desirable, but by someone in that middle layer taking deliberate responsibility for the validation work that used to happen on its own, as a byproduct of the slow path into the field.

This is not a small ask. It means treating the questions, how do we know this works, at what rate does it fail, and what happens when it does, as first-class parts of the product rather than afterthoughts bolted on once something has gone wrong. It is closer to the discipline of a serious engineering organization than to the instincts of a research lab or the incentives of a demo-driven market, which is just another way of saying that this is precisely the translation problem the AI field is in the middle of, whether it has noticed or not.

So, do we still need the kind of deep expertise that democratization has made it possible to do without? I think we do, only in a new place: less at the point of building, where the tools have done their liberating work, and more at the point of knowing, the unglamorous, indispensable work of finding out whether what we have built actually works. The democratization has already happened, and we should celebrate it. The discontent is simply the part of the job no one has yet been assigned. I am optimistic, as I usually am, that we will assign it, but only if we first stop calling the error a bug and start calling it what it is.

3 responses to “AI Democratization and Its Discontents”

Yannick

June 30, 2026

Isn’t the solution to this exactly what is now being called harness engineering? The article perfectly captures the problem: agentic errors compound quietly. The only way to fix that broken feedback loop is by building a rigid scaffolding of observability, evaluation, and guardrails around the model before it ever hits production.

1. Roberto Pieraccini
  
  July 3, 2026
  
  Yes, exactly. But that alone will not solve the problem unless, as you say, observability, evaluation, retries, and multiple hypotheses are built in. Thanks for your comment.
  