Source URL: https://cloud.google.com/blog/topics/developers-practitioners/rightsizing-llm-serving-on-vllm-for-gpus-and-tpus/
Source: Cloud Blog
Title: Rightsizing LLM Serving on vLLM for GPUs and TPUs
Additional contributors include Hossein Sarshar and Ashish Narasimham.
Large Language Models (LLMs) are revolutionizing how we interact with technology, but serving these powerful models efficiently can be a challenge. vLLM has rapidly become the primary choice for serving open source large language models at scale, but using vLLM is not a silver bullet. Teams that are serving LLMs for downstream applications have stringent latency and throughput requirements that necessitate a thorough analysis of which accelerator to run on and what configuration offers the best possible performance.
This guide provides a bottom-up approach to determining the best accelerator for your use case and optimizing your vLLM configuration to achieve the best, most cost-effective results possible.
Note: This guide assumes that you are familiar with GPUs, TPUs, vLLM, and the underlying features that make it such an effective serving framework.
Prerequisites
Before we begin, ensure you have:
A Google Cloud Project with billing enabled.
The gcloud command-line tool installed and authenticated.
Basic familiarity with Linux commands and Docker.
A Hugging Face account, a read token, and access to the Gemma 3 27B model.
Gathering Information on Your Use Case
Choosing the right accelerator can feel intimidating because each inference use case is unique. There is no a priori ideal setup from a cost/performance perspective; we can’t say model X should always be run on accelerator Y.
The following considerations need to be taken into account to best determine how to proceed:
What model are you using?
Our example model is google/gemma-3-27b-it. This is a 27-billion parameter instruction-tuned model from Google’s Gemma 3 family.
What is the precision of the model you’re using?
We will use bfloat16 (BF16).
Note: Model precision determines the number of bytes used to store each model weight. Common options are float32 (4 bytes), float16 (2 bytes), and bfloat16 (2 bytes). Many models are now also available in quantized formats like 8-bit, 4-bit (e.g., GPTQ, AWQ), or even lower. Lower precision reduces memory requirements and can increase speed, but may come with a slight trade-off in accuracy.
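To make the precision trade-off concrete, here is a rough back-of-the-envelope calculation (a sketch, not part of the original calculator) of weight memory for a 27-billion-parameter model at different precisions:

```bash
# Approximate weight memory for a 27B-parameter model (decimal GB)
echo "bf16/fp16: $(echo "27 * 2" | bc) GB"              # 2 bytes per parameter -> 54 GB
echo "int8:      $(echo "27 * 1" | bc) GB"              # 1 byte per parameter  -> 27 GB
echo "int4:      $(echo "scale=1; 27 * 0.5" | bc) GB"   # 0.5 bytes per parameter -> 13.5 GB
```

This excludes the KV cache and runtime overhead, which we account for later in the memory calculation.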
Workload characteristics: How many requests/second are you expecting?
We are targeting support for 100 requests/second.
What is the average sequence length per request?
Input Length: 1500 tokens
Output Length: 200 tokens
The total sequence length per request is therefore 1500 + 200 = 1700 tokens on average.
What is the maximum total sequence length we will need to be able to handle?
Let’s say in this case it is 2000 total tokens
What is the GPU memory utilization you’ll be using?
The gpu_memory_utilization parameter in vLLM controls how much of the accelerator’s memory is pre-allocated for the KV cache (after accounting for the memory allocated to the model weights). By default this is 90% in vLLM, but we generally want to set it as high as possible to optimize performance without causing OOM issues – which is exactly what the auto_tune.sh script does (as described in the “Benchmarking, Tuning and Finalizing Your vLLM Configuration" section of this post).
What is your prefix cache rate?
This will be determined from application logs, but we’ll estimate 50% for our calculations.
Note: Prefix caching is a powerful vLLM optimization that reuses the computed KV cache for shared prefixes across different requests. For example, if many requests share the same lengthy system prompt, the KV cache for that prompt is calculated once and shared, saving significant computation and memory. The hit rate is highly application-specific. You can estimate it by analyzing your request logs for common instruction patterns or system prompts.
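As a rough way to gauge prefix reuse, you can look for repeated prompt prefixes in your logs. The snippet below is a minimal sketch assuming a hypothetical prompts.txt file with one request prompt per line; it surfaces the most common 200-character prefixes and how often they occur:

```bash
# Count the most frequent 200-character prompt prefixes (hypothetical log format)
cut -c1-200 prompts.txt | sort | uniq -c | sort -rn | head
```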
What is your latency requirement?
The end-to-end latency from request to final token should not exceed 10 seconds (P99 E2E). This is our primary performance constraint.
Selecting Accelerators (GPU/TPU)
We live in a world of resource scarcity! What does this mean for your use case? You could probably get the best possible latency and throughput by using the most up-to-date hardware – but as an engineer it makes no sense to do this when you can meet your requirements at a better price/performance point.
Identifying Candidate Accelerators
We can refer to our Accelerator-Optimized Machine Family of Google Cloud Instances to determine which GPUs are viable candidates.
We can refer to our Cloud TPU offerings to determine which TPUs are viable candidates.
The following are examples of accelerators that can support our workload, as we will verify in the Calculate Memory Requirements section below.
Each option requires a different Tensor Parallelism (TP) configuration depending on its total VRAM. Please see the next section for an explanation of Tensor Parallelism.
GPU Options
L4 GPUs
g2-standard-48 instance provides 4xL4 GPUs with 96 GB of GDDR6
TP = 4
A100 GPUs
a2-ultragpu-1g instance provides 1xA100 GPU with 80 GB of HBM
TP = 1
H100 GPUs
a3-highgpu-1g instance provides 1xH100 GPU with 80 GB of HBM
TP = 1
TPU Options
TPU v5e (16 GB of HBM per chip)
v5litepod-8 provides 8 v5e TPU chips with 128GB of total HBM
TP = 8
TPU v6e aka Trillium (32 GB of HBM per chip)
v6e-4 provides 4 v6e TPU chips with 128GB of total HBM
TP = 4
Calculate Memory Requirements
We must estimate the total minimum VRAM needed. This will tell us if the model can fit on a single accelerator or if we need to use parallelism. Memory utilization can be broken down into two main components: static memory (model weights, activations, and overhead) and KV cache memory.
The following tool was created to answer this question: Colab: HBM Calculator
You can enter the information we determined above to estimate the minimum required VRAM to run our model.
Hugging Face API Key
Model Name from Hugging Face
Number of Active Parameters (billions)
The average input and output length (in tokens) for your workload.
A batch size of 1
The calculation itself is generally out of scope for this discussion, but it can be determined from the following equation:
Required GPU/TPU memory = (model_weight + non_torch_memory + pytorch_activation_peak_memory) + (kv_cache_memory_per_batch × batch_size)
where
model_weight is the number of parameters × the number of bytes per parameter (determined by the data type/precision)
non_torch_memory is a buffer for memory overhead (estimated ~1GB)
pytorch_activation_peak_memory is the memory required for intermediate activations
kv_cache_memory_per_batch is the memory required for the KV cache per batch
batch_size is the number of sequences that will be processed simultaneously by the engine
A batch size of one is not a realistic value, but it does provide us with the minimum VRAM we will need for the engine to get off the ground. You can vary this parameter in the calculator to see just how much VRAM we will need to support our larger batch sizes of 128 – 512 sequences.
In our case, we find that we need a minimum of ~57 GB of VRAM to run gemma-3-27b-it on vLLM for our specific workload.
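As a sanity check on the calculator, the same formula can be approximated with a quick script. This is a minimal sketch: the layer count, KV-head count, and head dimension below are assumptions for gemma-3-27b-it, so check the model’s config.json on Hugging Face for the authoritative values.

```bash
# Back-of-the-envelope VRAM estimate (decimal GB), batch size 1
PARAMS_B=27            # billions of parameters
BYTES_PER_PARAM=2      # bfloat16
NUM_LAYERS=62          # assumed architecture values - verify in config.json
NUM_KV_HEADS=16
HEAD_DIM=128
MAX_MODEL_LEN=2000     # maximum total sequence length from our requirements
BATCH_SIZE=1

weights_gb=$(echo "$PARAMS_B * $BYTES_PER_PARAM" | bc)                                             # ~54 GB
kv_bytes_per_token=$(echo "2 * $NUM_LAYERS * $NUM_KV_HEADS * $HEAD_DIM * $BYTES_PER_PARAM" | bc)   # K and V
kv_cache_gb=$(echo "scale=2; $kv_bytes_per_token * $MAX_MODEL_LEN * $BATCH_SIZE / 10^9" | bc)      # ~1 GB
total_gb=$(echo "scale=2; $weights_gb + 1 + 1 + $kv_cache_gb" | bc)   # +1 GB non-torch, +~1 GB activations
echo "Estimated minimum VRAM: ${total_gb} GB"                         # ~57 GB
```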
Is Tensor Parallelism Required?
In this case, the answer is that parallelism is not necessarily required, but we could and should consider our options from a price/performance perspective. Why does it matter?
Very quickly – what is Tensor Parallelism? At the highest level, Tensor Parallelism is a method of splitting a large model across multiple accelerators (GPUs/TPUs) so that a model too large for a single device’s memory can still be served. See here for more information.
vLLM supports Tensor Parallelism (TP). With tensor parallelism, accelerators must constantly communicate and synchronize with each other over the network for the model to work. This inter-accelerator communication can add overhead, which has a negative impact on latency. This means we have a tradeoff between cost and latency in our case.
Note: Tensor parallelism is required for TPUs because of the size of this model. As mentioned above, v5e and v6e have 16 GB and 32 GB of HBM per chip respectively, so multiple chips are required to hold the model. In this guide, the v6e-4 pays a slight performance penalty for this communication overhead while our 1xH100 instance does not.
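For reference, tensor parallelism is set with a single flag when launching a vLLM server. The command below is a minimal sketch using standard vLLM flags; the value of 4 matches the v6e-4 and 4xL4 options above:

```bash
# Serve the model sharded across 4 accelerators with tensor parallelism
vllm serve google/gemma-3-27b-it \
    --tensor-parallel-size 4 \
    --max-model-len 2000 \
    --gpu-memory-utilization 0.95
```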
Benchmarking, Tuning and Finalizing Your vLLM Configuration
Now that you have your short list of accelerator candidates (4xL4, 1xA100-80GB, 1xH100-80GB, TPU v5e-8, TPU v6e-4), it is time to see the best level of performance we can achieve across each potential setup. We will only cover the H100 and Trillium (v6e) benchmarking and tuning in this section – but the process is nearly identical for the other accelerators:
Launch, SSH, Update VMs
Pull vLLM Docker Image
Update and Launch Auto Tune Script
Analyze Results
H100 80GB
In your project, open the Cloud Shell and enter the following command to launch an a3-highgpu-1g instance. Be sure to update your project ID accordingly and select a zone that supports the a3-highgpu-1g machine type for which you have quota.
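The exact command will depend on your environment; a minimal sketch using a Deep Learning VM image (so NVIDIA drivers are pre-installed) might look like the following – adjust the image family, disk size, and provisioning model as needed:

```bash
# Create a single-H100 VM (illustrative values - adjust for your project and zone)
gcloud compute instances create vllm-h100-instance \
    --project=YOUR_PROJECT_ID \
    --zone=us-central1-a \
    --machine-type=a3-highgpu-1g \
    --image-family=pytorch-latest-gpu \
    --image-project=deeplearning-platform-release \
    --boot-disk-size=500GB \
    --maintenance-policy=TERMINATE
```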
SSH into the instance.
```bash
gcloud compute ssh vllm-h100-instance --zone us-central1-a
```
Now that we’re in our running instance, we can go ahead and pull the latest vLLM Docker image and then run it interactively. A final detail – if we are using a gated model (and we are in this demo) we will need to provide our HF_TOKEN in the container:
```bash
# install docker
sudo apt update && sudo apt -y install docker.io

# launch container
sudo docker run --gpus=all -dit --privileged \
    --shm-size=16g --name vllm-serve \
    --entrypoint /bin/bash vllm/vllm-openai:latest

# enter container
sudo docker exec -it vllm-serve bash

# install required library
apt-get install bc

# Provide HF_TOKEN
export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXX
```
In our running container, we can now find a file called vllm-workspace/benchmarks/auto_tune/auto_tune.sh which we will need to update with the information we determined above to tune our vLLM configuration for the best possible throughput and latency.
```bash
# navigate to correct directory
cd benchmarks/auto_tune

# update the auto_tune.sh script - use your preferred script editor
nano auto_tune.sh
```
In the auto_tune.sh script, you will need to make the following updates:
```bash
TAG=$(date +"%Y_%m_%d_%H_%M")
BASE="/vllm-workspace"
MODEL="google/gemma-3-27b-it"
SYSTEM="GPU"
TP=1
DOWNLOAD_DIR="/vllm-workspace/models"
INPUT_LEN=1500
OUTPUT_LEN=200
MIN_CACHE_HIT_PCT=50
MAX_LATENCY_ALLOWED_MS=10000
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="512 1024 2048"

LOG_FOLDER="$BASE/auto-benchmark/$TAG"
RESULT="$LOG_FOLDER/result.txt"
PROFILE_PATH="$LOG_FOLDER/profile"

echo "result file: $RESULT"
echo "model: $MODEL"

rm -rf $LOG_FOLDER
rm -rf $PROFILE_PATH
mkdir -p $LOG_FOLDER
mkdir -p $PROFILE_PATH

cd "$BASE"
```
Specify the model we will be using.
Specify that we are leveraging GPU in this case.
Tensor Parallelism is set to 1.
Specify our inputs and outputs.
Specify our 50% min_cache_hit_pct.
Specify our latency requirement.
Update our num_seqs_list to reflect a range of common values for high performance.
Update num_batched_tokens_list if necessary.
This likely will not be necessary, but it can help if a use case has particularly small or particularly large inputs/outputs.
Be sure to specify the BASE, DOWNLOAD_DIR, and cd “$BASE” statement exactly as shown.
Once the parameters have been updated, launch the auto_tune.sh script
```bash
# launch script
bash auto_tune.sh
```
The following processes occur:
Our auto_tune.sh script downloads the required model and attempts to start a vLLM server at the highest possible gpu_utilization (0.98 by default). If a CUDA OOM occurs, the script steps the value down by 1% at a time until it finds a stable configuration.
Troubleshooting Note: In rare cases, a vLLM server may be able to start during the initial gpu_utilization test but then fail with a CUDA OOM at the start of the next benchmark. Alternatively, the initial test may fail and then not spawn a follow-up server, resulting in what appears to be a hang. If either happens, edit auto_tune.sh near the very end of the file so that gpu_utilization begins at 0.95 or a lower value rather than at 0.98.
Troubleshooting Note: By default, the --profile flag is passed to the benchmark_serving.py script. In some cases this may cause the process to hang if the GPU profiler is not capable of handling the large number of requests for that specific model. You can confirm this by reviewing the logs for the current run; if the logs include the following line with an indefinite hang afterwards, you’ve run into this problem:
```
INFO 08-13 09:15:58 [api_server.py:1170] Stopping profiler...
# Extensive wait time with only a couple additional logs
```
If that is the case, simply remove the --profile flag from the benchmark_serving.py call in the auto_tune.sh script under the run_benchmark() function:
```bash
# REMOVE PROFILE FLAG IF HANG OCCURS
python3 benchmarks/benchmark_serving.py \
    --backend vllm \
    --model $MODEL \
    --dataset-name random \
    --random-input-len $adjusted_input_len \
    --random-output-len $OUTPUT_LEN \
    --ignore-eos \
    --disable-tqdm \
    --request-rate inf \
    --percentile-metrics ttft,tpot,itl,e2el \
    --goodput e2el:$MAX_LATENCY_ALLOWED_MS \
    --num-prompts 1000 \
    --random-prefix-len $prefix_len \
    --port 8004 \
    --profile &> "$bm_log"  # Remove this flag, making sure to keep the &> "$bm_log" on the argument above
```
Then, for each permutation of num_seqs_list and num_batched_tokens_list, a server is spun up and our workload is simulated.
A benchmark is first run with an infinite request rate.
If the resulting P99 E2E Latency is within the MAX_LATENCY_ALLOWED_MS limit, this throughput is considered the maximum for this configuration.
If the latency is too high, the script performs a search by iteratively decreasing the request rate until the latency constraint is met. This finds the highest sustainable throughput for the given parameters and latency requirement.
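Conceptually, that search looks something like the following simplified sketch (the real logic lives in auto_tune.sh; run_benchmark here is a hypothetical helper that returns the measured P99 E2E latency in milliseconds):

```bash
# Simplified view of the request-rate search in auto_tune.sh
MAX_LATENCY_ALLOWED_MS=10000
request_rate=20                                   # hypothetical starting rate
while (( request_rate > 0 )); do
    p99_e2el=$(run_benchmark "$request_rate")     # hypothetical helper: returns P99 E2E latency (ms)
    if (( $(echo "$p99_e2el <= $MAX_LATENCY_ALLOWED_MS" | bc -l) )); then
        echo "Highest sustainable rate: $request_rate req/s (P99 E2E ${p99_e2el} ms)"
        break
    fi
    request_rate=$(( request_rate - 1 ))          # latency too high - back off and retry
done
```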
In our result.txt file at /vllm-workspace/auto-benchmark/$TAG/result.txt, we will find which combination of parameters is most efficient, and then we can take a closer look at that run:
```
# result.txt
max_num_seqs: 128, max_num_batched_tokens: 512, request_rate: 6, e2el: 7715.94, throughput: 4.16, goodput: 4.16
------
max_num_seqs: 128, max_num_batched_tokens: 1024, request_rate: 6, e2el: 8327.84, throughput: 4.14, goodput: 4.14
------
max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 6, e2el: 8292.39, throughput: 4.15, goodput: 4.15
------
max_num_seqs: 256, max_num_batched_tokens: 512, request_rate: 6, e2el: 7612.31, throughput: 4.17, goodput: 4.17
------
max_num_seqs: 256, max_num_batched_tokens: 1024, request_rate: 6, e2el: 8277.94, throughput: 4.14, goodput: 4.14
------
max_num_seqs: 256, max_num_batched_tokens: 2048, request_rate: 6, e2el: 8234.81, throughput: 4.15, goodput: 4.15
```
```
# bm_log_256_512_requestrate_6.txt
============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  24.01
Total input tokens:                      149900
Total generated tokens:                  20000
Request throughput (req/s):              4.17
Request goodput (req/s):                 4.17
Output token throughput (tok/s):         833.11
Total Token throughput (tok/s):          7077.26
---------------Time to First Token----------------
Mean TTFT (ms):                          142.26
Median TTFT (ms):                        124.93
P99 TTFT (ms):                           292.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.53
Median TPOT (ms):                        33.97
P99 TPOT (ms):                           37.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.53
Median ITL (ms):                         29.62
P99 ITL (ms):                            53.84
----------------End-to-end Latency----------------
Mean E2EL (ms):                          6814.22
Median E2EL (ms):                        6890.45
P99 E2EL (ms):                           7612.31
==================================================
```
Let’s look at the best-performing result to understand our position:
max_num_seqs: 256, max_num_batched_tokens: 512
These were the settings for the vLLM server during this specific test run.
request_rate: 6
This is the final input from the script’s loop. It means your script determined that sending 6 requests per second was the highest rate this server configuration could handle while keeping latency below 10,000 ms. If it tried 7 req/s, the latency was too high.
e2el: 7612.31
This is the P99 latency that was measured when the server was being hit with 6 req/s. Since 7612.31 is less than 10000, the script accepted this as a successful run.
throughput: 4.17
This is the actual, measured output. Even though you were sending requests at a rate of 6 per second, the server could only successfully process them at a rate of 4.17 per second.
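As a quick consistency check (not part of the original benchmark output), the measured request throughput multiplied by our average output length roughly reproduces the output token throughput reported in the log:

```bash
# 4.17 req/s x 200 output tokens per request ~= 833 output tok/s
echo "$(echo "4.17 * 200" | bc) tok/s"
```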
TPU v6e (aka Trillium)
Let’s do the same optimization process for TPU now. You will find that vLLM has a robust ecosystem for supporting TPU-based inference and that there is little difference between how we execute our benchmarking script for GPU and TPU.
First we’ll need to launch and configure networking for our TPU instance – in this case we can use Queued Resources. Back in our Cloud Shell, use the following command to deploy a v6e-4 instance. Be sure to select a zone where v6e is available.
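The commands below reference $NAME, $PROJECT, and $ZONE; set them first. The values shown are placeholders – pick any zone where v6e capacity is available to you:

```bash
# Placeholder values - replace with your own
export NAME=vllm-v6e-4
export PROJECT=your-project-id
export ZONE=us-east5-b
```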
```bash
# Create instance
gcloud compute tpus queued-resources create $NAME \
    --node-id $NAME \
    --project $PROJECT \
    --zone $ZONE \
    --accelerator-type v6e-4 \
    --runtime-version v2-alpha-tpuv6e

# Create firewall rule
gcloud compute firewall-rules create open8004 \
    --project=$PROJECT \
    --direction=INGRESS \
    --priority=1000 \
    --network=default \
    --action=ALLOW \
    --rules=tcp:8004 \
    --source-ranges=0.0.0.0/0 \
    --target-tags=open8004

# Apply tag to VM
gcloud compute tpus tpu-vm update $NAME \
    --zone $ZONE \
    --project $PROJECT \
    --add-tags open8004
```
To monitor the status of your request:
```bash
# Monitor creation
gcloud compute tpus queued-resources list --zone $ZONE --project $PROJECT
```
Wait for the TPU VM to become active (status will update from PROVISIONING to ACTIVE). This might take some time depending on resource availability in the selected zone.
SSH directly into the instance with the following command:
```bash
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --project $PROJECT
```
Now that we’re in, pull the vLLM-TPU Docker image, launch our container, and exec into the container:
```bash
sudo docker pull docker.io/vllm/vllm-tpu:nightly

sudo docker run -dit \
    --name vllm-serve --net host --privileged \
    --entrypoint /bin/bash vllm/vllm-tpu:nightly

sudo docker exec -it vllm-serve bash
```
Again, we will need to install a dependency, provide our HF_TOKEN and update our auto-tune script as we did above with the H100.
```bash
# Head to main working directory
cd benchmarks/auto_tune/

# install required library
apt-get install bc

# Provide HF_TOKEN
export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXX

# update auto_tune.sh with your preferred script editor and launch auto_tuner
nano auto_tune.sh
```
We will want to make the following updates to the benchmarks/auto_tune/auto_tune.sh file:
```bash
TAG=$(date +"%Y_%m_%d_%H_%M")
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
BASE="/workspace"
MODEL="google/gemma-3-27b-it"
SYSTEM="TPU"
TP=4
DOWNLOAD_DIR="/workspace/models"
INPUT_LEN=1500
OUTPUT_LEN=200
MAX_MODEL_LEN=2000
MIN_CACHE_HIT_PCT=50
MAX_LATENCY_ALLOWED_MS=10000
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="512 1024 2048"
```
And then execute:
```bash
bash auto_tune.sh
```
As auto_tune.sh executes, it determines the largest gpu_utilization value our server can run with and then cycles through the different num_seqs and num_batched_tokens combinations to determine which is most efficient.
Troubleshooting Note: Starting a vLLM engine on TPU can take longer than on GPU due to a series of required compilation steps. In some cases this can exceed 10 minutes – and when that occurs, the auto_tune.sh script may kill the process. If this happens, update the start_server() function so that the for loop sleeps for 30 seconds rather than 10 seconds, as shown here:
```bash
start_server() {

    ...

    for i in {1..60}; do
        RESPONSE=$(curl -s -X GET "http://0.0.0.0:8004/health" -w "%{http_code}" -o /dev/stdout)
        STATUS_CODE=$(echo "$RESPONSE" | tail -n 1)
        if [[ "$STATUS_CODE" -eq 200 ]]; then
            server_started=1
            break
        else
            sleep 10  # UPDATE TO 30 IF VLLM ENGINE START TAKES TOO LONG
        fi
    done
    if (( ! server_started )); then
        echo "server did not start within 10 minutes. Please check server log at $vllm_log".
        return 1
    else
        return 0
    fi
}
```
The outputs are printed as our program executes and we can also find them in log files at $BASE/auto-benchmark/$TAG. We can see in these logs that our current configurations are still able to achieve our latency requirements.
Again we can inspect our result.txt file:
```
# result.txt
max_num_seqs: 128, max_num_batched_tokens: 512, request_rate: 9, e2el: 8549.13, throughput: 5.59, goodput: 5.59
------
max_num_seqs: 128, max_num_batched_tokens: 1024, request_rate: 9, e2el: 9375.51, throughput: 5.53, goodput: 5.53
------
max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 9, e2el: 9869.43, throughput: 5.48, goodput: 5.48
------
max_num_seqs: 256, max_num_batched_tokens: 512, request_rate: 9, e2el: 8423.40, throughput: 5.63, goodput: 5.63
------
max_num_seqs: 256, max_num_batched_tokens: 1024, request_rate: 9, e2el: 9319.26, throughput: 5.54, goodput: 5.54
------
max_num_seqs: 256, max_num_batched_tokens: 2048, request_rate: 9, e2el: 9869.08, throughput: 5.48, goodput: 5.48
```
And the corresponding metrics for our best run:
```
# bm_log_256_512_requestrate_9.txt

Traffic request rate: 9.0 RPS.
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  17.75
Total input tokens:                      149900
Total generated tokens:                  20000
Request throughput (req/s):              5.63
Request goodput (req/s):                 5.63
Output token throughput (tok/s):         1126.50
Total Token throughput (tok/s):          9569.63
---------------Time to First Token----------------
Mean TTFT (ms):                          178.27
Median TTFT (ms):                        152.43
P99 TTFT (ms):                           379.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.28
Median TPOT (ms):                        37.51
P99 TPOT (ms):                           41.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.28
Median ITL (ms):                         36.27
P99 ITL (ms):                            51.39
----------------End-to-end Latency----------------
Mean E2EL (ms):                          7597.10
Median E2EL (ms):                        7692.51
P99 E2EL (ms):                           8423.40
==================================================
```
Let’s look at the best-performing result to understand our position:
max_num_seqs: 256, max_num_batched_tokens: 512
These were the settings for the vLLM server during this specific test run.
request_rate: 9
This is the final input from the script’s loop. It means your script determined that sending 9 requests per second was the highest rate this server configuration could handle while keeping latency below 10,000 ms. If it tried 10 req/s, the latency was too high.
e2el: 8423.40
This is the P99 latency that was measured when the server was being hit with 9 req/s. Since 8423.40 is less than 10,000, the script accepted this as a successful run.
throughput: 5.63
This is the actual, measured output. Even though you were sending requests at a rate of 9 per second, the server could only successfully process them at a rate of 5.63 per second.
Calculating Performance-Cost Ratio
Now that we have tuned and benchmarked our two primary accelerator candidates, we can bring the data together to make a final, cost-based decision. The goal is to find the most economical configuration that can meet our workload requirement of 100 requests per second while staying under our P99 end-to-end latency limit of 10,000 ms.
We will analyze the cost to meet our 100 req/s target using the best-performing configuration for both the H100 GPU and the TPU v6e.
NVIDIA H100 80GB (a3-highgpu-1g)
Measured Throughput: The benchmark showed a single H100 vLLM engine achieved a throughput of 4.17 req/s.
Instances Required: To meet our 100 req/s goal, we would need to run multiple instances. The calculation is:
Target Throughput ÷ Throughput per Instance = 100 req/s ÷ 4.17 req/s ≈ 23.98
Since we can’t provision a fraction of an instance, we must round up to 24 instances.
Estimated Cost: As of July 2025, the spot price for an a3-highgpu-1g machine type in us-central1 is approximately $2.25 per hour. The total hourly cost for our cluster would be: 24 instances × $2.25/hr = $54.00/hr
Note: We are using Spot instance pricing for simple cost figures; this would not be a typical provisioning pattern for this type of workload.
Google Cloud TPU v6e (v6e-4)
Measured Throughput: The benchmark showed a single v6e-4 vLLM engine achieved a higher throughput of 5.63 req/s.
Instances Required: We perform the same calculation for the TPU cluster:
Target Throughput ÷ Throughput per Instance = 100 req/s ÷ 5.63 req/s ≈ 17.76
Again, we must round up to 18 instances to strictly meet the 100 req/s requirement.
Estimated Cost: As of July 2025, the spot price for a v6e-4 queued resource in us-central1 is approximately $0.56 per chip per hour. The total hourly cost for this cluster would be:
18 instances × 4 chips × $0.56/hr = $40.32/hr
Conclusion: The Most Cost-Effective Choice
Let’s summarize our findings in a table to make the comparison clear.
| Metric | H100 (a3-highgpu-1g) | TPU (v6e-4) |
| --- | --- | --- |
| Throughput per Instance | 4.17 req/s | 5.63 req/s |
| Instances Needed (100 req/s) | 24 | 18 |
| Spot Instance Cost Per Hour | $2.25 / hour | $0.56 × 4 chips = $2.24 / hour |
| Spot Cost Total | $54.00 / hour | $40.32 / hour |
| Total Monthly Cost (730h) | ~$39,400 | ~$29,400 |
The results are definitive. For this specific workload (serving the gemma-3-27b-it model with long contexts), the v6e-4 configuration is the winner.
Not only does the v6e-4 instance provide higher throughput than the a3-highgpu-1g instance, but it does so at a significantly reduced cost. This translates to massive savings at higher scales.
Looking at the performance-per-dollar, the advantage is clear:
H100: 4.17 req/s ÷ $54.00/hr ≈ 0.08 req/s per dollar-hour
TPU v6e: 5.63 req/s ÷ $40.32/hr ≈ 0.14 req/s per dollar-hour
The v6e-4 configuration delivers almost twice the performance for every dollar spent, making it the superior, efficient choice for deploying this workload.
Final Reminder
This benchmarking and tuning process demonstrates the critical importance of evaluating different hardware options to find the optimal balance of performance and cost for your specific AI workload. We need to keep the following in mind when sizing these workloads:
If our workload changed (e.g., input length, output length, prefix-caching percentage, or our requirements) the outcome of this guide may be different – H100 could outperform v6e in several scenarios depending on the workload.
If we considered the other possible accelerators mentioned above, we may find a more cost effective approach that meets our requirements.
Finally, we covered a relatively small parameter space in our auto_tune.sh script for this example – if we searched a larger space, we might find a configuration with even greater cost-savings potential.
Additional Resources
The following is a collection of additional resources to help you complete the guide and better understand the concepts described.
Auto Tune ReadMe in Github
TPU Optimization Tips
Currently Supported Models for TPU on vLLM
More on Optimization and Parallelism from vLLM
Great Article on KV Cache