OpenAI’s Richard Ho warns that long-lived AI agents will break today’s systems, demanding global-scale infrastructure with new memory, networking, and hardware trust.
Three years ago, models answered prompts and then disappeared. Today they don’t stop working when you close the tab. They linger, collaborate, spin up tasks in the background, and talk to other agents across the network.
As Richard Ho, Head of Hardware at OpenAI, said recently at the AI Infra Summit, “We’re heading towards an agentic workflow, meaning that most of the work is being done by agents, between many agents. Many agents are long-lived, meaning you have a session with an agent… we’re going to move to a state where the agents are actually active, even if you’re not typing something in and asking something.”
The line is simple but the implications are not.
The era of one-shot prompts has already closed, he says. What replaces it are continuous, asynchronous workloads that don’t pause to await human input and don’t live in a single GPU’s memory. As he describes it, they spread, coordinate, and recombine state across machines, racks, and regions.
A background agent might run for hours or days before surfacing its results, and in that time it will have depended on dozens of others, exchanged state thousands of times, and leaned on memory and network pathways that were never designed for persistence.
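To make the shape of that workload concrete, here is a rough, purely illustrative sketch in Python (invented names, nothing from OpenAI’s stack) of a background agent that works in small steps, consults peers, and checkpoints its state off-device so the session can outlive any single machine:

```python
import asyncio
import time

# Purely illustrative: a long-lived background agent that works in small
# steps, consults peer agents, and checkpoints state off-device so the
# session survives beyond any single GPU or host. All names are invented.

class BackgroundAgent:
    def __init__(self, agent_id: str, checkpoint_store: dict):
        self.agent_id = agent_id
        self.state = {"step": 0, "partial_results": []}
        # The dict stands in for an external store (CXL pool, object store, etc.).
        self.checkpoint_store = checkpoint_store

    async def consult_peer(self, peer_id: str) -> str:
        # In a real system this would be a network call to another agent.
        await asyncio.sleep(0.01)
        return f"advice-from-{peer_id}"

    async def checkpoint(self) -> None:
        # Persist state outside device memory: continuity depends on this.
        self.checkpoint_store[self.agent_id] = dict(self.state)

    async def run(self, duration_s: float = 0.3) -> None:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            self.state["step"] += 1
            advice = await self.consult_peer(f"peer-{self.state['step'] % 3}")
            self.state["partial_results"].append(advice)
            await self.checkpoint()  # no human in the loop; it just keeps going

async def main() -> None:
    store: dict = {}
    agent = BackgroundAgent("agent-42", store)
    await agent.run()                # a stand-in for "hours or days"
    print(f"surfaced after {store['agent-42']['step']} steps")

asyncio.run(main())
```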
Scaling laws make this unavoidable, he adds.
“This is starting to look a lot more like a scaling model that actually has no end. It doesn’t seem clear to us at this point there is an end to this game.”
For the past five years, the field has lived by the observation that capability correlates with compute. More tokens trained, more flops burned, more parameters dialed in, more intelligence. The curve has held. But persistence changes the curve’s meaning: it shifts the bottleneck from clever model architectures to the physical substrate.
From where Ho sits, the constraint moves from model design to the hardware underneath, which has to sustain exponential demand without collapsing under latency tails, thermal budgets, or security breaches.
“We’ve gone from large mainframe computers back in the 60s and 70s, which shrunk all the way down to personal computer mobile. But now we’re heading back out, first of all, with warehouse scale computing, with datacenters, Google, Amazon, Azure, massive datacenter. And where we’re heading towards is now global scale computing,” Ho said.
Even the most aggressive hyperscale sites, with five-building footprints and hundreds of megawatts of power draw, are not enough.
“You need to have many, many of those campuses, and you have them spread around the world, because users are around the world.”
The effort to build them is already underway. Stargate is the symbol, but sovereign datacenter programs are multiplying, pushed by governments that don’t want their citizens’ AI workloads to transit foreign soil. The infrastructure map will be global fleets stitched together into a single machine.

That machine’s first failure mode will be memory, Ho explains.
“The agents are gonna be long lived, and your session can be long lived, meaning we’re gonna have to be offload to stuff. It can’t just be stuff that’s on the GPU today.” GPU HBM was enough for prompt-response, but persistence pulls memory out into the open: CXL-attached DRAM pools, GDDR7 offload, next-generation HBM4, and disaggregated hierarchies layered across racks all become necessary just to maintain continuity.
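What that hierarchy implies in practice is a placement policy: hot working state stays as close to the compute as possible, while cold session history gets pushed outward. The sketch below is illustrative only, with made-up capacities and latencies rather than any vendor’s numbers:

```python
from dataclasses import dataclass

# Illustrative only: a tiering policy for persisted agent state. The
# capacities and latencies below are placeholders, not vendor numbers;
# the point is that persistence forces a hierarchy beyond GPU HBM.

@dataclass
class Tier:
    name: str
    capacity_gb: float
    latency_ns: int
    used_gb: float = 0.0

    def fits(self, size_gb: float) -> bool:
        return self.used_gb + size_gb <= self.capacity_gb

# Ordered hottest to coldest: on-package HBM, a CXL-attached DRAM pool,
# then a disaggregated rack-level pool reached over the fabric.
TIERS = [
    Tier("HBM", capacity_gb=192, latency_ns=100),
    Tier("CXL-DRAM", capacity_gb=2_048, latency_ns=400),
    Tier("rack-pool", capacity_gb=65_536, latency_ns=2_500),
]

def place(session_id: str, size_gb: float, hot: bool) -> str:
    """Hot working state goes as close to compute as possible; cold
    session history is pushed outward to keep HBM free for active work."""
    candidates = TIERS if hot else reversed(TIERS)
    for tier in candidates:
        if tier.fits(size_gb):
            tier.used_gb += size_gb
            return tier.name
    raise MemoryError(f"no tier can hold {size_gb} GB for {session_id}")

print(place("agent-42/working-set", 8, hot=True))    # lands in HBM
print(place("agent-42/history", 500, hot=False))     # pushed to rack-pool
```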

Packaging, he adds, moves from today’s 2.5D interposers to co-packaged optics, because no copper interposer can carry the bandwidth across multiple chiplets at tolerable latency. “Two and a half D integration, where you have an interposer and integrate one side, that’s kind of the standard format for devices today. We’re going to have to have co-packaged or near-package optics to overcome proper limits of communication.”
He also says, “We’re going to have to be able to integrate multiple types of chiplets, and all this is gonna have to fit within a thermal envelope that we can actually cool and keep running on these massive racks.”
Thermal envelopes are already stretched. GPUs at 700 watts apiece, accelerators crammed into 60kW racks, liquid cooling lines cutting through aisles: for Ho, these are today’s biggest constraints.
Add chiplet sprawl and optical engines to every package and the cooling problem no longer belongs to the facility. It belongs to the socket. And failure here is not a matter of reduced efficiency, he says. A long-lived agent relying on offloaded state cannot survive corrupted memory or a dropped interconnect. Reliability, not performance, becomes the governing metric.
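The arithmetic behind that squeeze is simple. Assuming, purely for illustration, 700 W per accelerator plus a notional per-slot overhead for host share, optics, and cooling, a 60 kW rack budget runs out fast:

```python
# Back-of-the-envelope only: why the cooling problem moves to the socket.
# Assumed numbers: a 60 kW rack budget, 700 W per accelerator, and a
# notional 150 W per slot for host share, optics, and cooling overhead.
RACK_BUDGET_W = 60_000
ACCELERATOR_W = 700
OVERHEAD_PER_SLOT_W = 150

per_slot_w = ACCELERATOR_W + OVERHEAD_PER_SLOT_W
print(RACK_BUDGET_W // per_slot_w)   # ~70 accelerators, before any headroom
```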
Networking also got a heavy nod. “Having good, reliable network infrastructure across a wide part of the world is not something that is really well implemented today,” Ho said. “At low latency, very hard to be put and quickly at low power. Those are things I think are pretty important.” The internet is tolerant of delay. Video buffers, web pages retry, emails queue. Agents do not. If an agent is orchestrating others, a tail response delayed by 200ms doesn’t just degrade the experience; it breaks the chain of reasoning.
“Your tail latencies are going to be super important, right? If you have an agent that’s out there… and it takes time for that communication to come back, it could actually affect the results.”
The fabric for agents must therefore guarantee not just bandwidth but determinism. Port flaps, congestion storms, buffer overflows, all the normal fault modes of global networking must be handled without breaking sessions.
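One standard technique for taming tail latency is the hedged request: if the primary replica has not answered within a tight budget, the same call is fired at a backup and whichever responds first wins. The sketch below is illustrative only, with invented replica names and timings, not a description of OpenAI’s networking:

```python
import asyncio
import random

# A minimal sketch of one standard tail-taming technique: hedged requests.
# If the primary call hasn't answered within a tight budget, fire the same
# request at a second replica and take whichever returns first. Names and
# numbers are illustrative.

async def call_replica(replica: str, payload: str) -> str:
    # Simulate a service whose latency occasionally lands in the tail.
    delay = random.choice([0.01, 0.01, 0.01, 0.25])
    await asyncio.sleep(delay)
    return f"{replica}:{payload}"

async def hedged_call(payload: str, hedge_after: float = 0.05) -> str:
    primary = asyncio.create_task(call_replica("replica-a", payload))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()
    # Primary is slow: hedge to a second replica, keep whichever wins.
    backup = asyncio.create_task(call_replica("replica-b", payload))
    done, pending = await asyncio.wait({primary, backup},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_call("step-17")))
```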
“One of the things that we’ve been pushing hard on is to make sure that the networking paths in these computers haven’t been consumed, so that if there is a port flap or a heat flap somewhere, the protocols and networks are able to self-heal and to recover without necessarily having to make an interrupt of some sort and to bring down [a] run.”
Self-healing at this scale means fabric-level observability, error correction at every hop, congestion controls that don’t assume best effort, and redundancy that can sustain not just one failure but rolling faults across optical paths.
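Seen from the application edge, the same idea looks like transparent failover across redundant paths: a flapping port is absorbed by the transport, and the agent’s session never notices. A minimal sketch, with invented path names and a simulated fault:

```python
# Illustrative only: the self-healing idea as seen from the application
# edge. The session keeps its identity while the transport retries across
# redundant paths, so a single port flap never surfaces as a broken agent.

PATHS = ["optical-east", "optical-west", "terrestrial-backup"]
FLAPPING = {"optical-east"}          # pretend this link is flapping right now

class PathFlap(Exception):
    pass

def send_over(path: str, message: bytes) -> str:
    if path in FLAPPING:
        raise PathFlap(path)
    return f"delivered via {path}"

def resilient_send(message: bytes, session_id: str) -> str:
    """Try every redundant path before surfacing an error; the agent's
    session is never torn down by a single (or rolling) path fault."""
    errors = []
    for path in PATHS:
        try:
            return send_over(path, message)
        except PathFlap as flap:
            errors.append(str(flap))  # recorded for fabric observability
    raise ConnectionError(f"session {session_id}: all paths failed: {errors}")

print(resilient_send(b"agent-state-delta", "agent-42"))  # heals via optical-west
```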
And beyond failure, there is tampering. “Fiber goes around the world. You have to, kind of like, build in something that something could splice underneath your fiber under the ocean, because things happen, right?” In other words, sovereignty is not only about where the datacenter sits. It is about whether the lines between them can be trusted. Sovereign AI will only hold if the fabric itself is hardened against intrusion, whether by accident or adversary.

Which leads to trust. Ho’s definition of alignment extended past model behavior into hardware integrity. “One of the things that we don’t have enough of today is observability,” he said.
“When anomalous events occur in the networking in terms of power and kernels, it may affect results. May affect which results agents [deliver].”
Software assumes the hardware is stable. Exponential scale makes that assumption untenable. Silent data corruption in DRAM, transient faults in interconnects, undetected kernel crashes: all of these can shift agent behavior in ways indistinguishable from misalignment. Safety that stops at the software boundary is incomplete.
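A software-level analogue of the checks Ho wants pushed into hardware is end-to-end integrity on every piece of offloaded agent state, so corruption is caught before it quietly changes what an agent believes. A toy sketch, where a checksum stands in for ECC and link-level protection rather than any real product:

```python
import hashlib
import pickle

# Sketch of the observability point in software terms: every piece of
# offloaded agent state carries a checksum, so silent corruption in a
# memory pool or interconnect is detected before it feeds back into a
# reasoning loop. The checksum stands in for hardware-level checks.

def seal(state: dict) -> tuple[bytes, str]:
    blob = pickle.dumps(state)
    return blob, hashlib.sha256(blob).hexdigest()

def unseal(blob: bytes, expected_digest: str) -> dict:
    if hashlib.sha256(blob).hexdigest() != expected_digest:
        # Corrupted state must fail loudly; recover from the last good
        # checkpoint instead of silently continuing.
        raise ValueError("silent data corruption detected in offloaded state")
    return pickle.loads(blob)

blob, digest = seal({"plan": ["fetch", "summarize"], "step": 7})
print(unseal(blob, digest)["step"])        # round-trips cleanly

flipped = bytearray(blob)
flipped[3] ^= 0x01                         # simulate a single bit flip
try:
    unseal(bytes(flipped), digest)
except ValueError as err:
    print(err)
```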
“So what does it mean for alignment and safety?” Ho thinks one of the answers is that it has to be built through the hardware.
“So today, a lot of safety work is kind of in the software. It assumes that your hardware is secure. It assumes that your hardware will be maintained. It assumes that you can pull the plug on the hardware.”
Which is why Ho insisted that trust must be embedded in silicon: error correction as a default, anomaly detection as a hardware feature, tamper resistance designed into fabrics, secure-by-design networking.
If millions of persistent agents depend on hardware paths that cannot be verified, the system fails not gracefully but catastrophically. “If you embed trust [and] safety into your hardware, you scale one industry. You also scale confidence. People need to have confidence that this thing is going to be right.”
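What trust embedded in silicon could look like to a scheduler is attestation: a node is only admitted into the agent fleet if its measured firmware and fabric configuration carry a signature rooted in the device. The sketch below is deliberately simplified, with an HMAC standing in for a hardware root of trust; it is an illustration, not a real attestation protocol:

```python
import hmac
import hashlib

# Deliberately simplified illustration of the "trust in silicon" idea:
# before persistent agents are scheduled onto a node, the scheduler checks
# an attestation report signed with a key that, in real hardware, would be
# fused into the device. HMAC stands in for the device's signing mechanism.

DEVICE_ROOT_KEY = b"fused-at-manufacture"   # placeholder secret

def sign_report(measurements: str) -> str:
    return hmac.new(DEVICE_ROOT_KEY, measurements.encode(),
                    hashlib.sha256).hexdigest()

def admit_node(measurements: str, signature: str) -> bool:
    """Only schedule persistent agents on nodes whose firmware and fabric
    configuration match what the attestation report claims."""
    expected = sign_report(measurements)
    return hmac.compare_digest(expected, signature)

report = "firmware=v1.2.3;fabric=optical;ecc=on"
sig = sign_report(report)
print(admit_node(report, sig))                        # True: trusted node
print(admit_node(report.replace("on", "off"), sig))   # False: tampered report
```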
“OpenAI [is] kind of like screaming a little bit about, hey, there’s not infrastructure. There’s no infrastructure. You guys have to build more, build more capacity for wires. We have to build more capacity for switches, for racks and stuff like that. Because we do see these things, right? We’re living in exponential time. It’s hard for us to understand what exponential means, and we’re trying to ring the bell to say, you know, it’s time to build.”
In short, Ho’s argument is that long-lived, collaborating agents will not wait for five-year capex cycles. They will force memory systems off the GPU into pooled hierarchies, force packaging into optics and chiplets, force networks into levels of resilience never before attempted, and force trust down into the hardware itself.
The current datacenter model (rows of GPUs fed by local DRAM, connected by best-effort Ethernet, secured by firewalls at the edge) is not extensible to this regime.
The machine that Ho described is something else: global fleets of sovereign and hyperscale campuses, connected by hardened fabrics, filled with heterogeneous packages cooled at the socket, running agents that never sleep.
“We need a new infrastructure for this new age. We don’t have potential for agent-aware architecture and hardware, right?” Ho concluded.
