Cloud Blog: Fast and efficient AI inference with new NVIDIA Dynamo recipe on AI Hypercomputer

Source URL: https://cloud.google.com/blog/products/compute/ai-inference-recipe-using-nvidia-dynamo-with-ai-hypercomputer/
Source: Cloud Blog
Title: Fast and efficient AI inference with new NVIDIA Dynamo recipe on AI Hypercomputer

Feedly Summary: As generative AI becomes more widespread, it’s important for developers and ML engineers to be able to easily configure infrastructure that supports efficient AI inference, i.e., using a trained AI model to make predictions or decisions based on new, unseen data. While GPUs excel at training models, traditional GPU-based serving architectures struggle with the “multi-turn” nature of inference, characterized by back-and-forth conversations where the model must maintain context and understand user intent. Further, deploying large generative AI models can be both complex and resource-intensive.
At Google Cloud, we’re committed to providing customers with the best choices for their AI needs. That’s why we are excited to announce a new recipe for disaggregated inferencing with NVIDIA Dynamo, a high-performance, low-latency platform for a variety of AI models. Disaggregated inference separates out model processing phases, offering a significant leap in performance and cost-efficiency.
Specifically, this recipe makes it easy to deploy NVIDIA Dynamo on Google Cloud’s AI Hypercomputer, including Google Kubernetes Engine (GKE), the vLLM inference engine, and A3 Ultra GPU-accelerated instances powered by NVIDIA H200 GPUs. By running the recipe on Google Cloud, you can achieve higher performance and greater inference efficiency while meeting your AI applications’ latency requirements. You can find this recipe, along with other resources, in our growing AI Hypercomputer resources repository on GitHub.
Let’s take a look at how to deploy it.
The two phases of inference
LLM inference is not a monolithic task; it’s a tale of two distinct computational phases. First is the prefill (or context) phase, where the input prompt is processed. Because this stage is compute-bound, it benefits from access to massive parallel processing power. Following prefill is the decode (or generation) phase, which generates a response, token by token, in an autoregressive loop. This stage is bound by memory bandwidth, requiring extremely fast access to the model’s weights and the KV cache. 
In traditional architectures, these two phases run on the same GPU, creating resource contention. A long, compute-heavy prefill can block the rapid, iterative decode steps, leading to poor GPU utilization, higher inference costs, and increased latency for all users.
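As a rough back-of-envelope illustration (the hardware and model numbers below are ballpark assumptions, not figures from this post), you can see why the two phases stress different resources on an H200-class GPU:

awk 'BEGIN {
  params = 70e9      # Llama-3.3-70B parameter count (assumption for the example)
  bytes  = 2         # bytes per weight at BF16
  hbm_bw = 4.8e12    # ~4.8 TB/s HBM3e bandwidth per H200 (approximate)
  flops  = 9.9e14    # ~990 dense BF16 TFLOPS per H200 (approximate)
  prompt = 2048      # example prompt length in tokens

  # Decode: every generated token must stream the full weight set from HBM.
  printf "decode floor: ~%.0f ms per token (memory-bound)\n", 1000 * params * bytes / hbm_bw

  # Prefill: roughly 2*params FLOPs per token, with all prompt tokens processed in parallel.
  printf "prefill: ~%.2f s of matrix math for the prompt (compute-bound)\n", 2 * params * prompt / flops
}'

Even at this level of approximation, decode spends its time waiting on memory while prefill spends its time on matrix multiplications, which is exactly the mismatch a shared GPU has to juggle.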

A specialized, disaggregated inference architecture
Our new solution tackles this challenge head-on by disaggregating, or physically separating, the prefill and decode stages across distinct, independently managed GPU pools.
Here’s how the components work in concert:

A3 Ultra instances and GKE: The recipe uses GKE to orchestrate separate node pools of A3 Ultra instances, powered by NVIDIA H200 GPUs. This creates specialized resource pools — one optimized for compute-heavy prefill tasks and another for memory-bound decode tasks.

NVIDIA Dynamo: Acting as the inference server, NVIDIA Dynamo’s modular front end and KV cache-aware router process incoming requests. Dynamo then pairs GPUs from the prefill and decode GKE node pools and orchestrates workload execution between them, transferring the KV cache generated in the prefill pool to the decode pool to begin token generation.

vLLM: Running on pods within each GKE pool, the vLLM inference engine helps ensure best-in-class performance for the actual computation, using innovations like PagedAttention to maximize throughput on each individual node.

This disaggregated approach allows each phase to scale independently based on real-time demand, helping to ensure that compute-intensive prompt processing doesn’t interfere with fast token generation. Dynamo supports popular inference engines including SGLang, TensorRT-LLM and vLLM. The result is a dramatic boost in overall throughput and maximized utilization of every GPU.
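The recipe drives this setup for you, but to make the idea concrete, a sketch like the following shows the shape of the infrastructure: two dedicated A3 Ultra node pools on an existing GKE cluster, one labeled for prefill and one for decode. The cluster name, region, pool names, labels, and the a3-ultragpu-8g machine type are assumptions for illustration; the recipe documents the exact configuration.

# Hypothetical node pools for the two phases; all values below are placeholders.
gcloud container node-pools create dynamo-prefill-pool \
  --cluster=my-gke-cluster --region=us-central1 \
  --machine-type=a3-ultragpu-8g --num-nodes=1 \
  --node-labels=dynamo-role=prefill

gcloud container node-pools create dynamo-decode-pool \
  --cluster=my-gke-cluster --region=us-central1 \
  --machine-type=a3-ultragpu-8g --num-nodes=1 \
  --node-labels=dynamo-role=decode

With the pools labeled this way, prefill and decode workers can be pinned to their own hardware through standard Kubernetes node selectors and scaled independently.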

Experiment with Dynamo Recipes for Google Cloud
The reproducible recipe walks through the steps to deploy disaggregated inference with NVIDIA Dynamo on A3 Ultra (H200) VMs on Google Cloud, using GKE for orchestration and vLLM as the inference engine. The single-node recipe demonstrates disaggregated inference on one A3 Ultra node, with four GPUs for prefill and four GPUs for decode. The multi-node recipe demonstrates disaggregated inference with one A3 Ultra node for prefill and one A3 Ultra node for decode, serving the Llama-3.3-70B-Instruct model.
Future recipes will add support for additional NVIDIA GPU-based instances (e.g., A4, A4X) and inference engines, with expanded model coverage.
The recipe highlights the following key steps (a rough end-to-end sketch follows the list):

Perform initial setup – This sets up environment variables and secrets; it needs to be done only once.

Install Dynamo Platform and CRDs – This sets up the various Dynamo Kubernetes components; it also needs to be done only once.

Deploy inference backend for a specific model workload – This deploys vLLM/SGLang as the inference backend for Dynamo disaggregated inference for a specific model workload. Repeat this step for every new model inference workload deployment.

Process inference requests – Once the model is deployed for inference, incoming queries are processed to provide responses to users.
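To make the flow concrete, here is a minimal end-to-end sketch of what these steps can look like on the command line. Every namespace, secret name, chart path, and manifest file below is a placeholder assumption rather than a value taken from the recipe; the recipe itself documents the exact commands.

# 1. One-time setup: environment variables and secrets (for example, a Hugging Face token).
export NAMESPACE=dynamo
kubectl create namespace "${NAMESPACE}"
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" -n "${NAMESPACE}"

# 2. One-time setup: install the Dynamo CRDs and platform components (chart paths are placeholders).
helm install dynamo-crds ./dynamo-crds-chart -n "${NAMESPACE}"
helm install dynamo-platform ./dynamo-platform-chart -n "${NAMESPACE}"

# 3. Per-model deployment: apply the disaggregated serving manifest for the chosen model,
#    which creates the prefill workers, decode workers, and the frontend.
kubectl apply -f llama-3.3-70b-disagg.yaml -n "${NAMESPACE}"

# 4. Serve traffic: wait for the pods to come up, then send OpenAI-compatible requests
#    to the frontend (see the curl example later in this post).
kubectl get pods -n "${NAMESPACE}" -w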

Once the server is up, you will see the prefill and decode workers along with the frontend pod, which acts as the primary interface for serving requests.
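For example, a quick check might look like the following; the namespace variable and the frontend service name are assumptions, and the recipe defines the actual resource names.

# List the Dynamo pods; the frontend, prefill worker(s), and decode worker(s)
# should all reach the Running state.
kubectl get pods -n "${NAMESPACE}"

# If you are testing from outside the cluster, port-forward the frontend so the
# curl example below can reach it on localhost:8000.
kubectl port-forward -n "${NAMESPACE}" service/dynamo-frontend 8000:8000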

We can verify that everything works as intended by sending a request to the server, as shown below. The response is generated and truncated to max_tokens.

curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "what is the meaning of life ?"
      }
    ],
    "stream": false,
    "max_tokens": 30
  }' | jq -r '.choices[0].message.content'

The question of the meaning of life is a complex and deeply philosophical one that has been debated by scholars, theologians, philosophers, and scientists for
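If you would rather watch tokens arrive as they are generated, the same request can be sent with streaming enabled. This variation is not part of the recipe's example and assumes the frontend follows standard OpenAI-compatible streaming behavior.

# Same request with "stream": true; the server returns server-sent events
# ("data: {...}" chunks) instead of a single JSON body.
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "what is the meaning of life ?"}],
        "stream": true,
        "max_tokens": 30
      }'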

Get started today
By moving beyond the constraints of traditional serving, the new disaggregated inference recipe represents the future of efficient, scalable LLM inference. It enables you to right-size resources for each specific task, unlocking new levels of performance and significant cost savings for your most demanding generative AI applications. We are excited to see how you will use this recipe to build the next wave of AI-powered services. We encourage you to try out our Dynamo Disaggregated Inference Recipe, which provides a starting point with recommended configurations and easy steps. We hope you have fun experimenting and share your feedback!

AI Summary and Description: Yes

Summary: The text discusses a new recipe from Google Cloud for implementing disaggregated inference using NVIDIA Dynamo, specifically targeting the efficient deployment of large generative AI models. It addresses significant challenges in traditional GPU architectures while providing insights into improving performance and cost-effectiveness for AI inference tasks.

Detailed Description:
The text outlines an innovative approach to optimizing AI inference on Google Cloud using NVIDIA’s Dynamo platform. This method allows for improved resource allocation by separating the compute-heavy and memory-bound phases of inference into distinct GPU pools. Key elements include:

– **Traditional Challenges**:
– Traditional GPU serving architectures face issues with multi-turn inference, leading to resource contention and higher latency.
– Both the prefill (context processing) and decode (response generation) phases operate on the same GPU, which can lead to inefficiencies.

– **Disaggregated Inference Architecture**:
– **NVIDIA Dynamo** serves as the inference server and manages workload distribution between GPU pools optimized for each phase.
– **Technical Breakdown**:
– **A3 Ultra Instances and GKE**: Utilizes Google Kubernetes Engine to manage separate pools for compute-bound prefill and memory-bound decode tasks.
– **KV Cache Transfer**: The cache generated during prefill is efficiently transferred to support rapid token generation during decoding.
– **vLLM**: Utilizes advanced techniques like PagedAttention to ensure high performance in computation.

– **Implementation Benefits**:
– The disaggregated approach allows for independent scaling of each phase based on demand, enhancing overall GPU utilization.
– Supports various inference engines, fostering versatility in handling multiple AI models.

– **Deployment Steps**:
– Initial setup, installation of the Dynamo platform, deployment of the inference backend, and processing of user queries are well-structured and documented.
– Recipes are provided for single-node and multi-node setups, illustrating the practical applications of the disaggregated inference architecture.

– **Future Innovations**:
– Plans to expand recipe support for additional NVIDIA GPUs and inference engines are highlighted, indicating ongoing investment in enhancing performance and flexibility.

This text is highly relevant for AI, cloud computing, and infrastructure security professionals as it emphasizes advancements in AI model deployment practices that can significantly enhance security and efficiency. The shift towards disaggregated inference not only improves the performance of AI applications but also sets a new standard for computational resource management in cloud environments.