Sep 11, 2025

What Happened When AI Rewrote Meta’s Infrastructure Playbook

Nicole Hemsoth Prickett


Meta is emptying datacenters, wiring gigawatt clusters, and battling hardware sprawl as AI training jobs push GPUs, networks, and power to their limits.

Meta’s Vice President of Infrastructure, Yee Jiun Song, has spent two decades pushing the limits of datacenter design. This week, he told a packed room at the AI Infra Summit that the playbook that worked for web and mobile scale has broken under AI.

He explained that recommendation engines and LLMs now require clusters so large they consume entire facilities, with jobs so brittle a single GPU can topple them, and with hardware ecosystems so fragmented that only open standards can hold the pieces together.

Song framed the argument around three realities that Meta has been forced to confront over time. The first is the shift from CPU-based web workloads to GPU-driven AI, which has rewritten datacenter architecture. The second is the scale of modern training jobs, which has led to superclusters on the order of small cities. And the third is the chaos of hardware diversity, which is driving the need for open standards and open source frameworks to manage the complexity.

He says that when he began at Meta, infrastructure followed the familiar model of early web companies (web servers connected to databases, traffic routed through a few leased facilities, and a growing set of caching and ranking layers). But that model quickly collapsed under growth. 

This pushed Meta to design its own datacenters, stitching them together with a global backbone and edge network, all the while learning the hard lessons of distributed computing. 

As Song details, notifications misfired, messages appeared out of order, and entire datacenter failures had to be abstracted away from developers.

“The more machines you have, the more likely they are to fail, and so you have to figure out how to mask it,” Song said this week, summing up the mindset that shaped the web era.

But AI introduced a new class of really big problems. 

For years, he says, Facebook products relied on the bounded nature of the social graph (show users what their friends liked or what their communities discussed) but the rise of short-form video changed everything. Relevance could no longer be derived from human networks. Every uploaded video had to be evaluated against every user’s interests, a computationally unbounded problem that CPUs couldn’t handle. 

To tackle this, Song explained how Meta turned to embeddings and GPUs, using dense vectors to represent similarity and large compute engines to process them. “AI combines the mathematical notion of similarity in content with the computational power of GPUs to provide personalized recommendations,” he explained. It was at that moment GPUs became the center of Meta’s infrastructure.
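To make that shift concrete, here is a minimal, hypothetical sketch of embedding-based retrieval of the kind Song describes (the dimensions, candidate counts, and scoring are illustrative, not Meta’s actual pipeline): user and video embeddings live in the same vector space, similarity between them stands in for relevance, and scoring millions of candidates is dense linear algebra that maps naturally onto GPUs rather than CPUs.

```python
# A minimal sketch of embedding-based retrieval: user and item (video)
# embeddings share one vector space, and relevance is approximated by
# similarity between them. All sizes here are illustrative assumptions.
import torch

EMBED_DIM = 128          # illustrative embedding width
NUM_VIDEOS = 1_000_000   # illustrative candidate pool to score against

# In a real system these would come from trained models; here they are random.
user_embedding = torch.randn(EMBED_DIM)
video_embeddings = torch.randn(NUM_VIDEOS, EMBED_DIM)

# Cosine similarity between the user and every candidate video.
# This dense matrix-vector work is the kind of computation that
# maps well onto GPUs rather than CPUs.
scores = torch.nn.functional.cosine_similarity(
    video_embeddings, user_embedding.unsqueeze(0), dim=1
)

# The top-k most similar candidates would feed later ranking stages.
top_scores, top_ids = torch.topk(scores, k=10)
print(top_ids.tolist())
```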

The first buildouts created 4,000-GPU “AI zones” inside datacenters. Half the racks carried compute, the rest storage, all tied together with custom interconnects. For recommendation engines, the model worked. But for large language models, it fell apart. 

By 2022, competitive training jobs demanded synchronous operation across thousands of GPUs. What had once been 128-device clusters expanded to 2,000 and then 14,000 devices. “For the first time, we were regularly dealing with training jobs where we needed thousands of GPUs to run synchronously… any single straggling GPU failed the whole job,” Song said.
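For readers less familiar with synchronous training, a minimal sketch shows where that brittleness comes from (this assumes PyTorch’s DistributedDataParallel launched with torchrun; the toy model and step count are illustrative, not Meta code): every optimizer step ends in an all-reduce collective that only completes when every rank contributes its gradients, so a single missing or stalled GPU blocks the entire job.

```python
# A minimal sketch of synchronous data-parallel training with PyTorch DDP.
# Every backward pass hides an all-reduce across all ranks: the step cannot
# complete until every participating device contributes its gradients, which
# is why one slow or failed GPU stalls (or kills) the whole job.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()       # toy model for illustration
    ddp_model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 1024, device="cuda")
        loss = ddp_model(x).pow(2).mean()
        loss.backward()      # gradients are all-reduced across every rank here;
                             # if any rank is missing, this collective never returns
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```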

Engineers who once optimized for server reboots now had to make sure multi-week training runs could survive the weakest link in tens of thousands of devices.
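A rough back-of-envelope calculation (with assumed, not Meta-reported, failure rates) shows why that weakest link dominates at this scale, and why checkpointing and fast restart stop being afterthoughts:

```python
# Back-of-envelope arithmetic (illustrative numbers, not Meta's) for why the
# weakest link dominates: even a very reliable fleet has a short expected time
# between *some* failure when tens of thousands of devices must all stay
# healthy for a single synchronous job.
per_gpu_failures_per_hour = 1e-5    # assumed rate: roughly one failure per 11 years per GPU
num_gpus = 16_000                   # roughly the cluster sizes described above
run_hours = 24 * 21                 # a three-week training run

cluster_failures_per_hour = per_gpu_failures_per_hour * num_gpus
mean_hours_between_failures = 1 / cluster_failures_per_hour
expected_interruptions = cluster_failures_per_hour * run_hours

print(f"cluster-level failures per hour: {cluster_failures_per_hour:.3f}")
print(f"mean time between interruptions: {mean_hours_between_failures:.1f} h")
print(f"expected interruptions over the run: {expected_interruptions:.0f}")
# With these assumptions the job is interrupted roughly every six hours,
# which is why frequent checkpointing and fast restart become design goals.
```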

Meta’s response? Build 24,000-GPU clusters designed to fill every watt of power and every square foot of cooling capacity in a single datacenter building. And even that wasn’t enough.

The social giant went further, emptying five production datacenters that had been serving live workloads to assemble a single supercluster. Thousands of racks were uprooted and moved, new robotics and packaging systems were designed to accelerate the migration, and entire networks were rebuilt to connect the buildings. 

“We actually emptied out five production data centers… we had to redesign the loading docks, build brand new cloud robots… and even design weightless packaging for these racks to speed up the moves,” Song said. The result was a 100,000-GPU cluster, but the scale of disruption made it clear that the model was unsustainable.

And guess what? The next phase is even larger. 

Meta is building one cluster, designed to draw a full gigawatt of power, followed by yet another, spread across farmland on the scale of Manhattan.

These projects, he explains, are more like industrial complexes, with power and cooling footprints closer to energy plants than to IT facilities. Quite simply, they’re being built because nothing less will sustain the demands of modern AI training and inference.

Song says hardware diversity is another problem. NVIDIA remains dominant, but Meta also deploys AMD GPUs, its own custom accelerators, and devices from emerging vendors. Each platform introduces different interconnects, software stacks, and tuning requirements. The effect is operational drag and underutilization of expensive hardware. “The heterogeneity of the fleet is making it difficult to move workloads around, leading to underutilized hardware… what we need here are open standards and open source software,” Song said.

Without stable abstractions, every new hardware generation requires rewriting libraries and retraining operators, and that isn’t sustainable either.

Meta’s answer has been to double down on PyTorch and invest in the Open Compute Project, where rack designs are being extended to handle AI density and power requirements. The goal is to provide consistent interfaces for developers and operators, making it possible to adopt new hardware without the friction that slows deployment and wastes capacity.
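A small example of what that buys in practice (an illustrative toy model, not Meta code): when a model is written against PyTorch’s device abstraction, the same code can target an NVIDIA GPU, an AMD GPU through ROCm builds of PyTorch, or fall back to CPU, without rewriting the model itself.

```python
# A minimal sketch of what a stable framework abstraction buys: the same model
# code runs unchanged whether the backend is an NVIDIA GPU (CUDA), an AMD GPU
# (ROCm builds of PyTorch expose themselves through the same "cuda" device
# type), or a plain CPU. The model and sizes are illustrative.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():   # true for both CUDA and ROCm builds
        return torch.device("cuda")
    return torch.device("cpu")      # fallback keeps the code portable

device = pick_device()

model = torch.nn.Sequential(
    torch.nn.Linear(256, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).to(device)

batch = torch.randn(64, 256, device=device)
logits = model(batch)
print(logits.shape, device)
```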

And so Meta, a company that once set the pace for vertical integration in servers and datacenters, now depends on open source and open standards to keep moving; the pace of change has made that shift unavoidable. Chips double in performance every generation, models expand by orders of magnitude, and clusters grow from thousands to hundreds of thousands of devices in just a few years.

In that environment, no operator can afford to rely on proprietary stacks that risk stranding workloads.

“Having built out infrastructure for years, we thought we learned everything about scale, but honestly, AI is kicking our butts,” Song said. 

This reflects the pressure of running jobs so large that a single hardware fault can invalidate weeks of work, of designing buildings that draw gigawatts, of constantly re-engineering power and cooling systems to match the speed of model growth.

And this should make us ask: if the next era of AI infrastructure will be defined less by incremental efficiency and more by how far operators can push power grids, supply chains, and physical limits to keep pace, how can we future-proof?
