Source URL: https://cloud.google.com/blog/topics/developers-practitioners/vllm-performance-tuning-the-ultimate-guide-to-xpu-inference-configuration/
Source: Cloud Blog
Title: vLLM Performance Tuning: The Ultimate Guide to xPU Inference Configuration
Feedly Summary: Additional contributors include Hossein Sarshar, Ashish Narasimham, and Chenyang Li.
Large Language Models (LLMs) are revolutionizing how we interact with technology, but serving these powerful models efficiently can be a challenge. vLLM has rapidly become the primary choice for serving open source large language models at scale, but using vLLM is not a silver bullet. Teams that are serving LLMs for downstream applications have stringent latency and throughput requirements that necessitate a thorough analysis of which accelerator to run on and what configuration offers the best possible performance.
This guide provides a bottom-up approach to determining the best accelerator for your use case and optimizing your vLLM configuration to achieve the best and most cost-effective results possible.
Note: This guide assumes that you are familiar with xPUs, vLLM, and the underlying features that make it such an effective serving framework.
Prerequisites
Before we begin, ensure you have:
A Google Cloud Project with billing enabled.
The gcloud command-line tool installed and authenticated.
Basic familiarity with Linux commands and Docker.
A Hugging Face account, a read token, and access to the Gemma 3 27B model.
Gathering Information on Your Use Case
Choosing the right accelerator can feel like an intimidating process because each inference use case is unique. There is no a priori ideal setup from a cost/performance perspective; we can't say model X should always be run on accelerator Y.
The following considerations need to be taken into account to best determine how to proceed:
What model are you using?
Our example model is google/gemma-3-27b-it. This is a 27-billion parameter instruction-tuned model from Google’s Gemma 3 family.
What is the precision of the model you’re using?
We will use bfloat16 (BF16).
Note: Model precision determines the number of bytes used to store each model weight. Common options are float32 (4 bytes), float16 (2 bytes), and bfloat16 (2 bytes). Many models are now also available in quantized formats like 8-bit, 4-bit (e.g., GPTQ, AWQ), or even lower. Lower precision reduces memory requirements and can increase speed, but may come with a slight trade-off in accuracy.
Workload characteristics: How many requests/second are you expecting?
We are targeting support for 100 requests/second.
What is the average sequence length per request?
Input Length: 1500 tokens
Output Length: 200 tokens
The total sequence length per request is therefore 1500 + 200 = 1700 tokens on average.
What is the maximum total sequence length we will need to be able to handle?
Let's say in this case it is 2000 total tokens.
What is the GPU Utilization you’ll be using?
The gpu_memory_utilization parameter in vLLM controls how much of the xPU's VRAM vLLM may use; whatever remains after the model weights and activations are allocated is pre-allocated for the KV cache. By default this is 0.90 in vLLM, but we generally want to set it as high as possible to maximize performance without causing OOM issues, which is exactly how our auto_tune.sh script works (as described in the "Benchmarking, Tuning and Finalizing Your vLLM Configuration" section of this post).
What is your prefix cache rate?
This will be determined from application logs, but we’ll estimate 50% for our calculations.
Note: Prefix caching is a powerful vLLM optimization that reuses the computed KV cache for shared prefixes across different requests. For example, if many requests share the same lengthy system prompt, the KV cache for that prompt is calculated once and shared, saving significant computation and memory. The hit rate is highly application-specific. You can estimate it by analyzing your request logs for common instruction patterns or system prompts.
What is your latency requirement?
The end-to-end latency from request to final token should not exceed 10 seconds (P99 E2E). This is our primary performance constraint.
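Before moving on, it can help to see how these gathered values translate into vLLM server flags. The following is a minimal sketch rather than a tuned configuration; the --tensor-parallel-size value is a placeholder that depends on the accelerator chosen in the next section, and the remaining flags mirror the parameters above.
```
# Illustrative only: how the values gathered above map onto vLLM server flags.
# --tensor-parallel-size is a placeholder; it depends on the accelerator chosen
# in the next section (e.g., 1 for a single 80 GB GPU, 4 for a v6e-4).
vllm serve google/gemma-3-27b-it \
    --dtype bfloat16 \
    --max-model-len 2000 \
    --gpu-memory-utilization 0.95 \
    --enable-prefix-caching \
    --tensor-parallel-size 1
```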
Selecting Accelerators (xPU)
We live in a world of resource scarcity! What does this mean for your use case? It means that you could probably get the best possible latency and throughput by using the most up-to-date hardware, but as an engineer it makes no sense to do this when you can achieve your requirements at a better price/performance point.
Identifying Candidate Accelerators
We can refer to our Accelerator-Optimized Machine Family of Google Cloud Instances to determine which accelerator-optimized instances are viable candidates.
We can refer to our Cloud TPU offerings to determine which TPUs are viable candidates.
The following are examples of accelerators that can be used for our workloads, as we will see in the "Calculate Memory Requirements" section.
The following options have different Tensor Parallelism (TP) configurations required depending on the total VRAM. Please see the next section for an explanation of Tensor Parallelism.
Accelerator-optimized Options
g2-standard-48
Provides 4 accelerators with 96 GB of GDDR6
TP = 4
a2-ultragpu-1g
Provides 1 accelerator with 80 GB of HBM
TP = 1
a3-highgpu-1g
Provides 1 accelerator with 80 GB of HBM
TP = 1
TPU Options
TPU v5e (16 GB of HBM per chip)
v5litepod-8 provides 8 v5e TPU chips with 128 GB of total HBM
TP = 8
TPU v6e, aka Trillium (32 GB of HBM per chip)
v6e-4 provides 4 v6e TPU chips with 128 GB of total HBM
TP = 4
Calculate Memory Requirements
We must estimate the total minimum VRAM needed. This will tell us if the model can fit on a single accelerator or if we need to use parallelism. Memory utilization can be broken down into two main components: static memory (model weights, activations, and overhead) plus the KV cache memory.
The following tool was created to answer this question: Colab: HBM Calculator.
You can enter the information we determined above to estimate the minimum required VRAM to run our model.
Hugging Face API Key
Model Name from Hugging Face
Number of Active Parameters (billions)
The average input and output length (in tokens) for your workload.
A batch size of 1
The calculation itself is generally out of scope for this discussion, but it can be determined from the following equation:
Required xPU memory = (model_weight + non_torch_memory + pytorch_activation_peak_memory) + (kv_cache_memory_per_batch × batch_size)
where
model_weight is the number of parameters × the number of bytes per parameter (determined by the data type/precision)
non_torch_memory is a buffer for memory overhead (estimated ~1GB)
pytorch_activation_peak_memory is the memory required for intermediate activations
kv_cache_memory_per_batch is the memory required for the KV cache per batch
batch_size is the number of sequences that will be processed simultaneously by the engine
A batch size of one is not a realistic value, but it does provide us with the minimum VRAM we will need for the engine to get off the ground. You can vary this parameter in the calculator to see just how much VRAM we will need to support our larger batch sizes of 128 – 512 sequences.
In our case, we find that we need a minimum of ~57 GB of VRAM to run gemma-3-27b-it on vLLM for our specific workload.
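As a quick cross-check on that figure, the static weight term can be estimated by hand with bc (which we install later in this guide); the KV cache and activation terms are what the calculator adds on top. This is a rough sketch, not a substitute for the calculator.
```
# Back-of-the-envelope estimate of the static weight-memory term (illustrative).
PARAMS_B=27        # parameters, in billions
BYTES_PER_PARAM=2  # bfloat16
NON_TORCH_GB=1     # rough non_torch_memory buffer from the equation above

echo "$PARAMS_B * $BYTES_PER_PARAM + $NON_TORCH_GB" | bc
# Prints 55 -- roughly the weight-plus-overhead term in GB; activations and the
# batch-size-1 KV cache account for the rest of the ~57 GB minimum.
```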
Is Tensor Parallelism Required?
In this case, the answer is that parallelism is not necessarily required, but we could and should consider our options from a price/performance perspective. Why does it matter?
Very quickly: what is Tensor Parallelism? At the highest level, Tensor Parallelism is a method of splitting a large model across multiple accelerators (xPUs) so that a model too large for a single device can still fit in memory. See here for more information.
vLLM supports Tensor Parallelism (TP). With tensor parallelism, accelerators must constantly communicate and synchronize with each other over the network for the model to work. This inter-accelerator communication can add overhead, which has a negative impact on latency. This means we have a tradeoff between cost and latency in our case.
Note: Tensor parallelism is required for TPUs because of the size of this model. As mentioned above, v5e and v6e have 16 GB and 32 GB of HBM per chip respectively, so multiple chips are required to hold the model. In this guide, v6e-4 therefore pays a slight performance penalty for this communication overhead, while a single-accelerator instance would not.
Benchmarking, Tuning and Finalizing Your vLLM Configuration
Now that you have your short list of accelerator candidates, it is time to see the best level of performance we can achieve on each potential setup. We will only walk through benchmarking and tuning for an anonymized accelerator-optimized instance and Trillium (v6e) in this section, but the process is nearly identical for the other accelerators:
Launch, SSH, Update VMs
Pull vLLM Docker Image
Update and Launch Auto Tune Script
Analyze Results
Accelerator-optimized Machine Type
In your project, open the Cloud Shell and enter the following command to launch your chosen instance and its corresponding accelerator and accelerator count. Be sure to update your project ID accordingly and select a zone that supports your machine type for which you have quota.
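For example, a single-accelerator a3-highgpu-1g instance (one of the candidates above) could be created as follows. The image family, boot disk size, and zone here are illustrative assumptions; adjust them, along with the machine type, to match your choices:
```
# Illustrative only -- adjust machine type, zone, image, and disk to your needs.
gcloud compute instances create vllm-test-instance \
    --project=YOUR_PROJECT_ID \
    --zone=us-central1-a \
    --machine-type=a3-highgpu-1g \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=200GB \
    --maintenance-policy=TERMINATE
```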
SSH into the instance.
```
gcloud compute ssh vllm-test-instance --zone us-central1-a
```
Now that we’re in our running instance, we can go ahead and pull the latest vLLM Docker image and then run it interactively. A final detail — if we are using a gated model (and we are in this demo) we will need to provide our HF_TOKEN in the container:
```
# install docker
sudo apt update && sudo apt -y install docker.io

# launch container
sudo docker run --gpus=all -dit --privileged \
  --shm-size=16g --name vllm-serve \
  --entrypoint /bin/bash vllm/vllm-openai:latest

# enter container
sudo docker exec -it vllm-serve bash

# install required library
apt-get install bc

# Provide HF_TOKEN
export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXX
```
In our running container, we can now find a file called vllm-workspace/benchmarks/auto_tune/auto_tune.sh that we need to update with the information we determined above to tune our vLLM configuration for the best possible throughput and latency.
```
# navigate to correct directory
cd benchmarks/auto_tune

# update the auto_tune.sh script - use your preferred script editor
nano auto_tune.sh
```
In the auto_tune.sh script, you will need to make the following updates:
```
TAG=$(date +"%Y_%m_%d_%H_%M")
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
BASE="/vllm-workspace"
MODEL="google/gemma-3-27b-it"
SYSTEM="GPU"
TP=1
DOWNLOAD_DIR="/vllm-workspace/models"
INPUT_LEN=1500
OUTPUT_LEN=200
MAX_MODEL_LEN=2000
MIN_CACHE_HIT_PCT=50
MAX_LATENCY_ALLOWED_MS=10000
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="512 1024 2048"

LOG_FOLDER="$BASE/auto-benchmark/$TAG"
RESULT="$LOG_FOLDER/result.txt"
PROFILE_PATH="$LOG_FOLDER/profile"

echo "result file: $RESULT"
echo "model: $MODEL"

rm -rf $LOG_FOLDER
rm -rf $PROFILE_PATH
mkdir -p $LOG_FOLDER
mkdir -p $PROFILE_PATH

cd "$BASE"
```
Specify the model we will be using.
Set SYSTEM to GPU, since we selected a GPU-based instance type.
Set TP to 1, since our machine type has a single accelerator.
Specify our input and output lengths.
Specify our 50% MIN_CACHE_HIT_PCT.
Specify our latency requirement (MAX_LATENCY_ALLOWED_MS).
Update NUM_SEQS_LIST to reflect a range of common values for high performance.
Update NUM_BATCHED_TOKENS_LIST if necessary.
This will likely not be needed, but for use cases with particularly small or particularly large inputs/outputs, it may be.
Be sure to specify the BASE, DOWNLOAD_DIR, and cd "$BASE" statement exactly as shown.
Once the parameters have been updated, launch the auto_tune.sh script.
```
# launch script
bash auto_tune.sh
```
The following processes occur:
Our auto_tune.sh script downloads the required model and attempts to start a vLLM server at the highest possible gpu_utilization (0.98 by default). If a CUDA OOM occurs, we go down 1% until we find a stable configuration.
Troubleshooting Note: In rare cases, a vLLM server may start successfully during the initial gpu_utilization test but then fail with a CUDA OOM at the start of the next benchmark. Alternatively, the initial test may fail and then not spawn a follow-up server, resulting in what appears to be a hang. If either happens, edit auto_tune.sh near the very end of the file so that gpu_utilization starts at 0.95 or lower rather than at 0.98.
Then, for each permutation of num_seqs_list and num_batched_tokens, a server is spun up and our workload is simulated.
A benchmark is first run with an infinite request rate.
If the resulting P99 E2E Latency is within the MAX_LATENCY_ALLOWED_MS limit, this throughput is considered the maximum for this configuration.
If the latency is too high, the script performs a search by iteratively decreasing the request rate until the latency constraint is met. This finds the highest sustainable throughput for the given parameters and latency requirement.
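If you want to spot-check a single configuration by hand, the serving benchmark that auto_tune.sh drives can also be run directly against an already-running server. The sketch below assumes the benchmark_serving.py script bundled in the image's benchmarks directory and a server listening on port 8004; exact flag names can differ between vLLM versions.
```
# Illustrative manual benchmark against an already-running vLLM server on port 8004.
# Flag names follow the benchmark_serving.py script bundled with vLLM and may
# differ slightly between versions.
python3 benchmarks/benchmark_serving.py \
    --backend vllm \
    --model google/gemma-3-27b-it \
    --dataset-name random \
    --random-input-len 1500 \
    --random-output-len 200 \
    --num-prompts 100 \
    --request-rate 6 \
    --port 8004
```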
In our result.txt file at /vllm-workspace/auto-benchmark/$TAG/result.txt, we will find which combination of parameters is most efficient, and then we can take a closer look at that run:
```
# result.txt
max_num_seqs: 128, max_num_batched_tokens: 512, request_rate: 6, e2el: 7715.94, throughput: 4.16, goodput: 4.16
------
max_num_seqs: 128, max_num_batched_tokens: 1024, request_rate: 6, e2el: 8327.84, throughput: 4.14, goodput: 4.14
------
max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 6, e2el: 8292.39, throughput: 4.15, goodput: 4.15
------
max_num_seqs: 256, max_num_batched_tokens: 512, request_rate: 6, e2el: 7612.31, throughput: 4.17, goodput: 4.17
------
max_num_seqs: 256, max_num_batched_tokens: 1024, request_rate: 6, e2el: 8277.94, throughput: 4.14, goodput: 4.14
------
max_num_seqs: 256, max_num_batched_tokens: 2048, request_rate: 6, e2el: 8234.81, throughput: 4.15, goodput: 4.15
```

```
# bm_log_256_512_requestrate_6.txt
============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  24.01
Total input tokens:                      149900
Total generated tokens:                  20000
Request throughput (req/s):              4.17
Request goodput (req/s):                 4.17
Output token throughput (tok/s):         833.11
Total Token throughput (tok/s):          7077.26
---------------Time to First Token----------------
Mean TTFT (ms):                          142.26
Median TTFT (ms):                        124.93
P99 TTFT (ms):                           292.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.53
Median TPOT (ms):                        33.97
P99 TPOT (ms):                           37.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.53
Median ITL (ms):                         29.62
P99 ITL (ms):                            53.84
----------------End-to-end Latency----------------
Mean E2EL (ms):                          6814.22
Median E2EL (ms):                        6890.45
P99 E2EL (ms):                           7612.31
==================================================
```
Let’s look at the best-performing result to understand our position:
max_num_seqs: 256, max_num_batched_tokens: 512
These were the settings for the vLLM server during this specific test run.
request_rate: 6
This is the final request rate reached by the script's search loop. It means the script determined that sending 6 requests per second was the highest rate this server configuration could handle while keeping latency below 10,000 ms; at 7 req/s, the latency was too high.
e2el: 7612.31
This is the P99 latency that was measured when the server was being hit with 6 req/s. Since 7612.31 is less than 10000, the script accepted this as a successful run.
throughput: 4.17
This is the actual, measured output. Even though you were sending requests at a rate of 6 per second, the server could only successfully process them at a rate of 4.17 per second.
TPU v6e (aka Trillium)
Let’s do the same optimization process for TPU now. You will find that vLLM has a robust ecosystem for supporting TPU-based inference and that there is little difference between how we execute TPU benchmarking and the previously described process.
First we'll need to launch and configure networking for our TPU instance; in this case we can use Queued Resources. Back in our Cloud Shell, use the following commands to deploy a v6e-4 instance. Set the $NAME, $PROJECT, and $ZONE variables to your own values, and be sure to select a zone where v6e is available.
```
# Create instance
gcloud compute tpus queued-resources create $NAME \
  --node-id $NAME \
  --project $PROJECT \
  --zone $ZONE \
  --accelerator-type v6e-4 \
  --runtime-version v2-alpha-tpuv6e

# Create firewall rule
gcloud compute firewall-rules create open8004 \
  --project=$PROJECT \
  --direction=INGRESS \
  --priority=1000 \
  --network=default \
  --action=ALLOW \
  --rules=tcp:8004 \
  --source-ranges=0.0.0.0/0 \
  --target-tags=open8004

# Apply tag to VM
gcloud compute tpus tpu-vm update $NAME \
  --zone $ZONE \
  --project $PROJECT \
  --add-tags open8004
```
To monitor the status of your request:
```
# Monitor creation
gcloud compute tpus queued-resources list --zone $ZONE --project $PROJECT
```
Wait for the TPU VM to become active (status will update from PROVISIONING to ACTIVE). This might take some time depending on resource availability in the selected zone.
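If you prefer not to re-run the list command by hand, a simple polling loop works too; the state field path used below is an assumption and may vary across gcloud versions.
```
# Optional: poll until the queued resource reports ACTIVE.
# The state field path is an assumption and may differ across gcloud versions.
while true; do
  STATE=$(gcloud compute tpus queued-resources describe $NAME \
    --zone $ZONE --project $PROJECT --format="value(state.state)")
  echo "Current state: $STATE"
  [[ "$STATE" == "ACTIVE" ]] && break
  sleep 60
done
```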
SSH directly into the instance with the following command:
```
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --project $PROJECT
```
Now that we’re in, pull the vLLM-TPU Docker image, launch our container, and exec into the container:
```
sudo docker pull docker.io/vllm/vllm-tpu:nightly

sudo docker run -dit \
  --name vllm-serve --net host --privileged \
  --entrypoint /bin/bash vllm/vllm-tpu:nightly

sudo docker exec -it vllm-serve bash
```
Again, we will need to install a dependency, provide our HF_TOKEN and update our auto-tune script as we did above with our other machine type.
```
# Head to main working directory
cd benchmarks/auto_tune/

# install required library
apt-get install bc

# Provide HF_TOKEN
export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXX

# update auto_tune.sh with your preferred script editor and launch auto_tuner
nano auto_tune.sh
```
We will want to make the following updates to the benchmarks/auto_tune/auto_tune.sh file:
```
TAG=$(date +"%Y_%m_%d_%H_%M")
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
BASE="/workspace"
MODEL="google/gemma-3-27b-it"
SYSTEM="TPU"
TP=4
DOWNLOAD_DIR="/workspace/models"
INPUT_LEN=1500
OUTPUT_LEN=200
MAX_MODEL_LEN=2000
MIN_CACHE_HIT_PCT=50
MAX_LATENCY_ALLOWED_MS=10000
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="512 1024 2048"
```
And then execute:
```
bash auto_tune.sh
```
As auto_tune.sh executes, it determines the largest gpu_utilization value our server can run at and then cycles through the different num_seqs and num_batched_tokens combinations to determine which is most efficient.
Troubleshooting Note: Starting a vLLM engine on TPU can take longer due to a series of required compilation steps. In some cases this can exceed 10 minutes, at which point the auto_tune.sh script may kill the process. If this happens, update the start_server() function so that the for loop sleeps for 30 seconds rather than 10 seconds, as shown here:
```
start_server() {

...

  for i in {1..60}; do
    RESPONSE=$(curl -s -X GET "http://0.0.0.0:8004/health" -w "%{http_code}" -o /dev/stdout)
    STATUS_CODE=$(echo "$RESPONSE" | tail -n 1)
    if [[ "$STATUS_CODE" -eq 200 ]]; then
      server_started=1
      break
    else
      sleep 10  # UPDATE TO 30 IF VLLM ENGINE START TAKES TOO LONG
    fi
  done
  if (( ! server_started )); then
    echo "server did not start within 10 minutes. Please check server log at $vllm_log".
    return 1
  else
    return 0
  fi
}
```
The outputs are printed as the program executes, and we can also find them in log files at $BASE/auto-benchmark/$TAG. We can see in these logs that our current configurations are still able to meet our latency requirements.
Again, we can inspect our result.txt file:
```
# result.txt
max_num_seqs: 128, max_num_batched_tokens: 512, request_rate: 9, e2el: 8549.13, throughput: 5.59, goodput: 5.59
------
max_num_seqs: 128, max_num_batched_tokens: 1024, request_rate: 9, e2el: 9375.51, throughput: 5.53, goodput: 5.53
------
max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 9, e2el: 9869.43, throughput: 5.48, goodput: 5.48
------
max_num_seqs: 256, max_num_batched_tokens: 512, request_rate: 9, e2el: 8423.40, throughput: 5.63, goodput: 5.63
------
max_num_seqs: 256, max_num_batched_tokens: 1024, request_rate: 9, e2el: 9319.26, throughput: 5.54, goodput: 5.54
------
max_num_seqs: 256, max_num_batched_tokens: 2048, request_rate: 9, e2el: 9869.08, throughput: 5.48, goodput: 5.48
```
And the corresponding metrics for our best run:
```
# bm_log_256_512_requestrate_9.txt

Traffic request rate: 9.0 RPS.
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  17.75
Total input tokens:                      149900
Total generated tokens:                  20000
Request throughput (req/s):              5.63
Request goodput (req/s):                 5.63
Output token throughput (tok/s):         1126.50
Total Token throughput (tok/s):          9569.63
---------------Time to First Token----------------
Mean TTFT (ms):                          178.27
Median TTFT (ms):                        152.43
P99 TTFT (ms):                           379.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.28
Median TPOT (ms):                        37.51
P99 TPOT (ms):                           41.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.28
Median ITL (ms):                         36.27
P99 ITL (ms):                            51.39
----------------End-to-end Latency----------------
Mean E2EL (ms):                          7597.10
Median E2EL (ms):                        7692.51
P99 E2EL (ms):                           8423.40
==================================================
```
Let’s look at the best-performing result to understand our position:
max_num_seqs: 256, max_num_batched_tokens: 512
These were the settings for the vLLM server during this specific test run.
request_rate: 9
This is the final request rate reached by the script's search loop. It means the script determined that sending 9 requests per second was the highest rate this server configuration could handle while keeping latency below 10,000 ms; at 10 req/s, the latency was too high.
e2el: 8423.40
This is the P99 latency that was measured when the server was being hit with 9 req/s. Since 8423.40 is less than 10,000, the script accepted this as a successful run.
throughput: 5.63
This is the actual, measured output. Even though you were sending requests at a rate of 9 per second, the server could only successfully process them at a rate of 5.63 per second.
Calculating Performance-Cost Ratio
Now that we have tuned and benchmarked our two primary accelerator candidates, we can bring the data together to make a final, cost-based decision. The goal is to find the most economical configuration that can meet our workload requirement of 100 requests per second while staying under our P99 end-to-end latency limit of 10,000 ms.
We will analyze the cost to meet our 100 req/s target using the best-performing configuration for both the anonymized candidate and the TPU v6e.
Anonymized Accelerator-optimized Candidate
Measured Throughput: The benchmark showed a single vLLM engine achieved a throughput of 4.17 req/s.
Instances Required: To meet our 100 req/s goal, we would need to run multiple instances. The calculation is:
Target throughput ÷ throughput per instance = 100 req/s ÷ 4.17 req/s ≈ 23.98
Since we can’t provision a fraction of an instance, we must round up to 24 instances.
Estimated Cost: As of July 2025, the spot price for our anonymized machine type in us-central1 was approximately $2.25 per hour. The total hourly cost for our cluster would be: 24 instances × $2.25/hr = $54.00/hr
Note: We are using Spot instance pricing for simple cost figures; this would not be a typical provisioning pattern for this type of workload.
Google Cloud TPU v6e (v6e-4)
Measured Throughput: The benchmark showed a single v6e-4 vLLM engine achieved a higher throughput of 5.63 req/s.
Instances Required: We perform the same calculation for the TPU cluster:
Target Throughput ÷ Throughput per Instance = 100 req/s ÷ 5.63 req/s ≈ 17.76
Again, we must round up to 18 instances to strictly meet the 100 req/s requirement.
Estimated Cost: As of July 2025, the spot price for a v6e-4 queued resource in us-central1 is approximately $0.56 per chip per hour. The total hourly cost for this cluster would be:
18 instances × 4 chips × $0.56/hr = $40.32/hr
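If you expect to redo this sizing as your benchmark numbers change, the arithmetic above is easy to script; here is a small sketch using the figures from our runs:
```
# Recompute instance counts and hourly Spot cost from the benchmark figures above.
ceil() { awk -v x="$1" 'BEGIN { c = int(x); if (x > c) c++; print c }'; }

TARGET_RPS=100
GPU_RPS=4.17; GPU_HOURLY=2.25                       # anonymized candidate
TPU_RPS=5.63; TPU_HOURLY=$(echo "4 * 0.56" | bc)    # v6e-4: 4 chips per instance

GPU_INSTANCES=$(ceil "$(echo "$TARGET_RPS / $GPU_RPS" | bc -l)")
TPU_INSTANCES=$(ceil "$(echo "$TARGET_RPS / $TPU_RPS" | bc -l)")

echo "GPU: $GPU_INSTANCES instances, \$$(echo "$GPU_INSTANCES * $GPU_HOURLY" | bc)/hr"
echo "TPU: $TPU_INSTANCES instances, \$$(echo "$TPU_INSTANCES * $TPU_HOURLY" | bc)/hr"
# GPU: 24 instances, $54.00/hr
# TPU: 18 instances, $40.32/hr
```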
Conclusion: The Most Cost-Effective Choice
Let’s summarize our findings in a table to make the comparison clear.
Metric                       | Anonymized Candidate | TPU (v6e-4)
Throughput per Instance      | 4.17 req/s           | 5.63 req/s
Instances Needed (100 req/s) | 24                   | 18
Spot Instance Cost Per Hour  | $2.25 / hour         | $0.56 × 4 chips = $2.24 / hour
Spot Cost Total              | $54.00 / hour        | $40.32 / hour
Total Monthly Cost (730h)    | ~$39,400             | ~$29,400
The results are definitive. For this specific workload (serving the gemma-3-27b-it model with long contexts), the v6e-4 configuration is the winner.
Not only does the v6e-4 instance provide higher throughput than our accelerator-optimized instance, but it does so at a reduced cost. This translates to massive savings at higher scales.
Looking at the performance per dollar, the advantage is clear:
Anonymized Candidate: 4.17 req/s ÷ $2.25/hr ≈ 1.85 req/s per dollar-hour
TPU v6e: 5.63 req/s ÷ $2.24/hr ≈ 2.51 req/s per dollar-hour
The v6e-4 configuration delivers roughly 35% more throughput for every dollar spent, making it the more efficient choice for deploying this workload.
Final Reminder
This benchmarking and tuning process demonstrates the critical importance of evaluating different hardware options to find the optimal balance of performance and cost for your specific AI workload. We need to keep the following in mind when sizing these workloads:
If our workload changed (e.g., input length, output length, prefix-caching percentage, or our requirements), the outcome of this guide might be different; the anonymized candidate could outperform v6e in several scenarios depending on the workload.
If we considered the other possible accelerators mentioned above, we might find a more cost-effective approach that meets our requirements.
Finally, we covered a relatively small parameter space in our auto_tune.sh script for this example; if we searched a larger space, we might have found a configuration with even greater cost savings.
Additional Resources
The following is a collection of additional resources to help you complete the guide and better understand the concepts described.
Auto Tune ReadMe in Github
TPU Optimization Tips
Currently Supported Models for TPU on vLLM
More on Optimization and Parallelism from vLLM
Great Article on KV Cache
AI Summary and Description: Yes
Summary: The text provides a detailed guide on optimizing the use of vLLM (a serving framework for large language models) in conjunction with Google Cloud’s TPU and accelerator-optimized instances. The focus is on selecting the right hardware and fine-tuning configurations to meet specific latency and throughput requirements for serving large language models effectively. This information is especially relevant for professionals navigating cloud computing, AI model deployment, and performance optimization.
Detailed Description:
The guide outlines a systematic approach for professionals interested in deploying and optimizing large language models (LLMs) using the vLLM framework on Google Cloud. Notably, it addresses key aspects such as accelerator selection, latency considerations, and configuration optimization to enhance performance and cost efficiency.
Key Points:
– **Overview of vLLM**: vLLM simplifies serving large language models but requires careful optimization for performance, particularly concerning latency and throughput.
– **Performance Requirements**: Establishing baseline performance metrics, such as the target of handling 100 requests per second with less than 10 seconds latency.
– **Accelerator Selection**: Comprehensive evaluation of various GPU and TPU options, including:
– Accelerator-optimized Google Cloud instances (e.g., g2-standard-48, a2-ultragpu-1g, a3-highgpu-1g).
– Google Cloud TPU offerings (v5e and v6e) with different memory and performance specifications.
– **Configuration Optimization**:
– Determining model precision, expected workload characteristics, and memory requirements for efficient operation.
– Utilization of prefix caching and Tensor Parallelism to enhance serving performance.
– **Benchmarking and Tuning**:
– Steps to benchmark the vLLM server, including auto-tuning settings to identify the highest sustainable throughput while meeting latency constraints.
– A thorough walkthrough of command-line prompts to deploy instances, pull required Docker images, and run the benchmarking scripts.
– **Cost Analysis**:
– A detailed cost analysis comparing the performance and pricing of different configurations, highlighting that the TPU v6e-4 instance achieves superior performance per dollar compared to the anonymous accelerator-optimized machines.
– **Conclusions & Recommendations**: The v6e-4 TPU is recommended as the most cost-effective option for deploying the specific workload due to its performance efficiency, demonstrating the necessity of rigorous testing and tuning of machine learning infrastructures.
– **Additional Resources**: Mention of extra documentation and tips for further understanding of TPU optimization and other related subjects.
The content is imperative for AI and cloud professionals seeking to enhance their systems for deploying LLM applications, ensuring they remain competitive by optimizing resource use while maintaining high performance standards.