Sep 22, 2025

Production Inference Requires an AI-Aware Control Plane

Nicole Hemsoth Prickett


“2025 is really the start of what’s called the age of inference,” Mark Lohmeyer, VP and GM of Compute at Google Cloud, told the crowd at the recent AI Infra Summit in Santa Clara.

This comes from first-hand insight. Inside Google, Lohmeyer says, token volume grew 50X from one April to the next, then jumped again to 980 trillion tokens in June, which he translated as a novel’s worth of text for every person on Earth each month.

At that scale, the constraints are not abstract, he explains. Cost to serve determines ultimate reach, latency defines the experience, and power availability now gates physical buildouts. And at this scale, efficiency is measured across the whole system, which changes the way operators design, deploy, and tune.

Lohmeyer pointed to a more comprehensive accounting of energy, carbon, and water for real workloads.

“A single Gemini prompt is substantially more efficient than what people might have previously thought. It uses about a quarter watt-hour of power, 30 milligrams of carbon dioxide, about five drops of water.” He put that into everyday terms as about the same amount of energy as watching nine seconds of TV.
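That TV comparison is easy to sanity-check. Assuming a set that draws on the order of 100 watts (my assumption; Lohmeyer did not give a wattage), the arithmetic lines up:

```latex
0.25\,\text{Wh} = 0.25 \times 3600\,\text{J} = 900\,\text{J},
\qquad
100\,\text{W} \times 9\,\text{s} = 900\,\text{J}
```

A quarter watt-hour is the energy a roughly 100-watt television burns in about nine seconds.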

He also says that over the last year, Google drove a 33x reduction in energy per prompt through model choices, software techniques like speculative decoding and disaggregated serving, and datacenter operations that treat power as a first-class resource.
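Speculative decoding, one of the software levers he named, is worth a quick illustration. The sketch below is a generic, greedy version of the technique rather than anything specific to Gemini serving: a cheap draft model proposes a handful of tokens, the expensive target model verifies them, and the longest agreeing prefix is accepted. The `draft_model` and `target_model` callables are placeholders you would supply.

```python
def speculative_decode(prompt_tokens, draft_model, target_model, k=4, max_new_tokens=64):
    """Generic greedy speculative decoding loop (illustrative sketch).

    draft_model(tokens, n) -> list of n proposed next tokens (small, cheap model)
    target_model(tokens)   -> the single next token the large model would emit
    """
    tokens = list(prompt_tokens)
    target_len = len(prompt_tokens) + max_new_tokens
    while len(tokens) < target_len:
        proposed = draft_model(tokens, k)      # 1. draft k tokens cheaply

        accepted = []
        for tok in proposed:                   # 2. verify against the target model;
            expected = target_model(tokens + accepted)   # in practice one batched pass
            if tok == expected:
                accepted.append(tok)           # draft guessed right, keep it
            else:
                accepted.append(expected)      # mismatch: take the target's token, stop
                break
        else:
            # every draft token matched, so the target yields one bonus token
            accepted.append(target_model(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:target_len]
```

The win comes from the verification step: when the draft model is usually right, the large model effectively emits several tokens per expensive pass instead of one.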

[Slide from Lohmeyer’s presentation]

Despite nods to recent GPU and TPU innovations, the center of gravity in the talk was not chips but the control plane and the operational blueprint for inference at scale. And while many might not have expected a talk from Google to home in on load balancing, Lohmeyer says that traditional web load balancers don’t understand prompt patterns or context locality, which is why they struggle to meet the demands of AI workloads.

Google’s answer is GKE Inference Gateway, now generally available, which performs AI-aware routing using live signals such as pending request queue length and KV-cache utilization before selecting a node.
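The routing idea itself is simple to sketch. The snippet below is purely illustrative, not the gateway’s actual API or scoring function: it ranks model-server replicas by live signals such as pending queue depth and KV-cache utilization instead of round-robin, and the field names and weights are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queue_depth: int             # requests already waiting on this model server
    kv_cache_utilization: float  # fraction of KV-cache memory in use, 0.0 to 1.0

def pick_replica(replicas, queue_weight=1.0, cache_weight=10.0):
    """Pick the replica with the lowest load score.

    A plain web load balancer would round-robin across replicas; an AI-aware
    one folds in inference-specific signals so a node with a deep queue or a
    nearly full KV cache stops receiving new prompts before it tips over.
    """
    def score(r):
        return queue_weight * r.queue_depth + cache_weight * r.kv_cache_utilization
    return min(replicas, key=score)

pool = [
    Replica("pool-a-0", queue_depth=12, kv_cache_utilization=0.91),
    Replica("pool-a-1", queue_depth=3,  kv_cache_utilization=0.40),
]
print(pick_replica(pool).name)  # -> pool-a-1
```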

[Slide from Lohmeyer’s presentation]

Two built-ins matter for operators. “Prefix-aware routing” keeps multi-turn chat and document analysis on the accelerator pools that already hold the right context. “Disaggregated serving” separates prefill from decode so each stage can scale and be tuned independently.
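Prefix-aware routing lends itself to the same kind of sketch. The toy router below fingerprints the shared prompt prefix (a system prompt plus earlier turns, say) and keeps follow-up requests on whichever pool already served that prefix, so the cached prefill is reused; the real gateway’s mechanism is more sophisticated, and every name here is invented for illustration.

```python
import hashlib

class PrefixAwareRouter:
    """Toy prefix-affinity router: requests sharing a prompt prefix land on
    the same accelerator pool, so its KV cache for that prefix is reused."""

    def __init__(self, pools):
        self.pools = pools          # e.g. ["pool-a", "pool-b"]
        self.prefix_owner = {}      # prefix fingerprint -> owning pool

    def route(self, prompt, prefix_len=64):
        # Toy fixed-length fingerprint; a real system would track cache blocks.
        fingerprint = hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
        if fingerprint not in self.prefix_owner:
            # First sighting of this prefix: place it on the pool that owns
            # the fewest prefixes so load stays roughly balanced.
            counts = {p: 0 for p in self.pools}
            for owner in self.prefix_owner.values():
                counts[owner] += 1
            self.prefix_owner[fingerprint] = min(self.pools, key=lambda p: counts[p])
        return self.prefix_owner[fingerprint]

router = PrefixAwareRouter(["pool-a", "pool-b"])
chat = "System: You are a helpful assistant.\nUser: Summarize this contract..."
print(router.route(chat + " follow-up question 1"))  # same pool both times,
print(router.route(chat + " follow-up question 2"))  # so the cached context is warm
```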

Staying current with best practices is its own tax, so Google wrapped patterns and measurements into a living catalog. “Think about Inference Quickstart as a database of tested inference configurations,” one that returns recommendations based on your models and priorities, backed by “the latest benchmarks that we run within Google.”
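The Quickstart idea, recommendations drawn from measured configurations rather than guesses, is easy to picture as a query over benchmark records. Everything in the sketch below, the record fields, model names, and numbers, is invented purely to show the shape of such a lookup and is not actual Google benchmark data.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkedConfig:
    model: str
    accelerator: str
    model_server: str
    tokens_per_sec: float
    p99_latency_ms: float
    cost_per_million_tokens: float

# Hypothetical catalog entries; a living catalog would be regenerated as new
# benchmark runs land, which is what keeps the recommendations current.
CATALOG = [
    BenchmarkedConfig("example-llm-8b", "accel-x", "server-1", 9000, 420, 0.11),
    BenchmarkedConfig("example-llm-8b", "accel-y", "server-1", 6500, 310, 0.16),
]

def recommend(model, priority="cost"):
    """Return the benchmarked config for `model` that best fits the stated
    priority: lowest cost per million tokens, or lowest p99 latency."""
    candidates = [c for c in CATALOG if c.model == model]
    key = ((lambda c: c.cost_per_million_tokens) if priority == "cost"
           else (lambda c: c.p99_latency_ms))
    return min(candidates, key=key)

print(recommend("example-llm-8b", priority="latency").accelerator)  # -> accel-y
```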

The reported end-to-end effects are material: “up to 96% lower latency at peak throughput,” “up to 40% higher throughput” for prefix-heavy jobs, and “up to 30% lower” cost per token when the gateway, scheduler, model server, storage, and network are aligned.

Capacity is the next area where Lohmeyer sees weaknesses. Operators either over-provision and eat the waste, or under-provision and stall teams and customers. Google’s Dynamic Workload Scheduler tries to map its levers to how people actually work. For time-flexible jobs, a flex-start mode lets users queue up an experiment or request, which runs as soon as the resources become available anywhere across Google Cloud, based on the policies users set.

For time-critical launches, “it allows you to book capacity like you book a hotel room,” with a guarantee that the resources will be there when needed, he says.
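The two modes boil down to two different contracts, which a small sketch makes concrete. This is a conceptual model of flex-start versus calendar-style reservation, not the Dynamic Workload Scheduler API; the class and field names are invented.

```python
from collections import deque
from datetime import datetime, timedelta

class ToyCapacityScheduler:
    """Conceptual model of two capacity contracts (not the real DWS API):
    flex-start queues work until accelerators free up, while a reservation
    books a fixed number of accelerators for a specific window in advance."""

    def __init__(self, total_accelerators):
        self.total = total_accelerators
        self.in_use = 0
        self.flex_queue = deque()   # (job_name, accelerators_needed)
        self.reservations = []      # (start, end, accelerators)

    def submit_flex(self, job, needed):
        """Time-flexible work: no start-time promise, just 'as soon as possible'."""
        self.flex_queue.append((job, needed))

    def reserve(self, start, end, needed):
        """Time-critical work: guarantee the window up front or refuse it."""
        overlapping = sum(n for s, e, n in self.reservations if s < end and start < e)
        if overlapping + needed > self.total:
            raise RuntimeError("requested window cannot be guaranteed")
        self.reservations.append((start, end, needed))

    def tick(self):
        """Called whenever capacity changes: start any queued flex jobs that fit."""
        started = []
        while self.flex_queue and self.in_use + self.flex_queue[0][1] <= self.total:
            job, needed = self.flex_queue.popleft()
            self.in_use += needed
            started.append(job)
        return started

sched = ToyCapacityScheduler(total_accelerators=16)
sched.submit_flex("ablation-run", needed=8)              # runs whenever room appears
launch = datetime(2025, 10, 1)
sched.reserve(launch, launch + timedelta(days=3), 16)    # booked like a hotel room
print(sched.tick())                                      # -> ['ablation-run']
```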

Because tokens do not live in one region, the last mile matters. Cloud WAN is the connective tissue that links models to data and to users across regions, other clouds, and on-prem, and because it rides Google’s global network, Google claims a 40 percent lift in application experience and 40 percent lower TCO versus bespoke WANs.

“Maybe a simple way to think about it is, it’s a blueprint, a blueprint for state-of-the-art inference,” he said, tailored to the operator’s environment and goals.

Read straight through, the path is practical. Put an AI-aware front door on the service so routing respects context and stages. Treat configuration as a living catalog fed by current benchmarks rather than folklore.

Express business intent to the scheduler so capacity lines up with real timelines, whether that means queuing flexible work globally or booking resources like a room when dates cannot slip. Keep models and data where the accelerators sit, and keep the network simple enough that latency behaves when traffic spikes.

The promise is a service that lowers unit cost, tightens tail latency, and makes datacenter capacity feel dependable at the exact moment a user presses enter.
