Cloud Blog: How Baseten achieves 225% better cost-performance for AI inference (and you can too)

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/how-baseten-achieves-better-cost-performance-for-ai-inference/
Source: Cloud Blog
Title: How Baseten achieves 225% better cost-performance for AI inference (and you can too)

Feedly Summary: Baseten is one of a growing number of AI infrastructure providers, helping other startups run their models and experiments at speed and scale. Given the importance of those two factors to its customers, Baseten has just passed a significant milestone. 
By leveraging the latest Google Cloud A4 virtual machines (VMs) based on NVIDIA Blackwell and Google Cloud’s Dynamic Workload Scheduler (DWS), Baseten has achieved 225% better cost-performance for high-throughput inference and 25% better cost-performance for latency-sensitive inference.
Why it matters: This breakthrough in performance and efficiency enables companies to move powerful agentic AI and reasoning models out of the lab and into production affordably. For technical leaders, it provides a blueprint for building next-generation AI products — such as real-time voice AI, search, and agentic workflows — at a scale and cost-efficiency previously unattainable.
The big picture: Inference is the cornerstone of enterprise AI. As models for multi-step reasoning and decision-making demand exponentially greater compute, serving them efficiently has become the primary bottleneck. Enter Baseten, a six-year-old Series C company that partners with Google Cloud and NVIDIA to provide enterprise companies with a scalable inference platform for their proprietary models as well as open models like Gemma, DeepSeek, and Llama, with an emphasis on performance and cost efficiency. Their success hinges on a dual strategy: maximizing the potential of cutting-edge hardware and orchestrating it with a highly optimized, open software stack.
We wanted to share more about how Baseten architected its stack — and what this new level of cost-efficiency can unlock for your inference applications.

Hardware optimization with the latest NVIDIA GPUs
Baseten delivers production-grade inference by leveraging a wide range of NVIDIA GPUs on Google Cloud, from NVIDIA T4s through the recent A4 VMs (NVIDIA HGX B200). This access to the latest hardware is critical for achieving new levels of performance.

With A4 VMs, Baseten now serves three of the most popular open-source models — DeepSeek V3, DeepSeek R1, and Llama 4 Maverick — directly on their Model APIs with over 225% better cost-performance for high-throughput inference and 25% better cost-performance for latency-sensitive inference.

Beyond its production-ready Model APIs, Baseten offers added flexibility through NVIDIA B200-powered dedicated deployments for customers who want to run their own custom AI models with the same reliability and efficiency.
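To make this concrete, here is a minimal sketch of what calling one of these hosted models can look like, assuming (as is common for inference providers) an OpenAI-compatible endpoint. The base URL and model slug below are illustrative assumptions, not details from this post; Baseten’s documentation has the current values.

```python
# Hedged sketch: assumes Baseten's Model APIs accept the OpenAI protocol, so
# a stock OpenAI client can point at them. The base_url and model slug are
# assumptions for illustration -- check Baseten's docs for current values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",               # issued in your Baseten account
    base_url="https://inference.baseten.co/v1",   # assumed Model APIs endpoint
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",              # hypothetical model slug
    messages=[{"role": "user", "content": "Summarize KV caching in one line."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```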

Advanced software for peak performance
Baseten’s approach is rooted in coupling the latest accelerated hardware with leading and open-source software to extract the most value possible from every chip. This integration is made possible with Google Cloud’s AI Hypercomputer, which includes a broad suite of advanced inference frameworks, including NVIDIA’s open-source software stack — NVIDIA Dynamo and TensorRT-LLM — as well as SGLang and vLLM.
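As a rough illustration of the open frameworks in that stack, the sketch below runs offline batch inference with vLLM. It shows the generic serving pattern only; Baseten’s production configuration is not described in this post, and the model ID is just an example.

```python
# Illustrative only: offline batch inference with vLLM, one of the open
# frameworks named above. Generic pattern, not Baseten's production setup.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # needs a GPU + HF access
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```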

Using TensorRT-LLM, Baseten optimizes and compiles custom LLMs for one of its largest AI customers, Writer. This has boosted their throughput by more than 60% for Writer’s Palmyra LLMs. The flexibility of TensorRT-LLM also enabled Baseten to develop a custom model builder that speeds up model compilation.
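For a sense of the compile-then-serve flow TensorRT-LLM enables, here is a minimal sketch assuming its high-level LLM API from recent releases. Writer’s Palmyra builds go through Baseten’s custom model builder, which is not public; this shows only the general shape of the workflow.

```python
# Hedged sketch of TensorRT-LLM's compile-then-serve flow via its high-level
# LLM API (available in recent releases). Constructing LLM() triggers engine
# compilation for the target GPU -- the step Baseten's custom model builder
# speeds up. Not Writer's actual build pipeline; model ID is an example.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # compiles an engine

outputs = llm.generate(
    ["What does engine compilation buy you?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```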

To serve reasoning models like DeepSeek R1 and Llama 4 on NVIDIA Blackwell GPUs, Baseten uses NVIDIA Dynamo. The combination of NVIDIA’s HGX B200 and Dynamo dramatically lowered latency and increased throughput, propelling Baseten to the top GPU performance spot on OpenRouter’s LLM ranking leaderboard.

The team leverages techniques such as kernel fusion, memory hierarchy optimization, and custom attention kernels to increase tokens per second, reduce time to first token, and support longer context windows and larger batch sizes — all while maintaining low latency and high throughput.
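Baseten’s attention and fusion kernels are custom, but the core idea of kernel fusion can be illustrated with torch.compile, which can fuse a chain of elementwise operations into a single GPU kernel so intermediate tensors never round-trip through device memory. This toy example is an analogy for the technique, not Baseten’s implementation.

```python
# Toy analogy for kernel fusion (not Baseten's custom kernels): eagerly, the
# bias add and GELU below launch separate kernels; torch.compile's Inductor
# backend can fuse them into one, cutting memory traffic between ops.
import torch
import torch.nn.functional as F

def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    return F.gelu(x + bias)

fused = torch.compile(bias_gelu)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, device=device)
print(fused(x, b).shape)  # same numerics as eager, fewer kernel launches on GPU
```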

Building a backbone for high availability and redundancy
For mission-critical AI services, resilience is non-negotiable. Baseten runs globally across multiple clouds and regions, which requires infrastructure that can absorb ad hoc demand and outages. Flexible consumption models, such as the Dynamic Workload Scheduler within the AI Hypercomputer, give Baseten on-demand-like access to capacity at a better price, allowing it to scale up on Google Cloud when other clouds suffer outages.
“Baseten runs globally across multi-clouds and Dynamic Workload Scheduler has saved us more than once when we encounter a failure,” said Colin McGrath, head of infrastructure at Baseten. “Our automated system moves affected workloads to other resources, including Google Cloud Dynamic Workload Scheduler, and within minutes everyone is up and running again. It is impressive — by the time we’re paged and check in, everything is back and healthy. This is amazing and would not be possible without DWS. It has been the backbone for us to run our business.”
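For readers on GKE, requests like this are expressed through the Kubernetes ProvisioningRequest API (“queued provisioning”). The sketch below is a heavily hedged assumption of what such a request can look like; the API group, version, class name, and pod template name should all be verified against current GKE documentation, and this is not Baseten’s actual failover automation.

```python
# Heavily hedged sketch: on GKE, Dynamic Workload Scheduler capacity is
# requested through the Kubernetes ProvisioningRequest API ("queued
# provisioning"). The group/version, class name, and pod template name are
# assumptions -- verify against current GKE docs. Not Baseten's automation.
from kubernetes import client, config

config.load_kube_config()

provisioning_request = {
    "apiVersion": "autoscaling.x-k8s.io/v1beta1",
    "kind": "ProvisioningRequest",
    "metadata": {"name": "b200-batch-capacity", "namespace": "default"},
    "spec": {
        "provisioningClassName": "queued-provisioning.gke.io",  # assumed DWS class
        "podSets": [{
            "count": 2,
            "podTemplateRef": {"name": "inference-pod-template"},  # hypothetical
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="autoscaling.x-k8s.io",
    version="v1beta1",
    namespace="default",
    plural="provisioningrequests",
    body=provisioning_request,
)
```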

[Figure: Baseten’s scalable inference platform architecture]

Unlocking new AI applications for end-users
Baseten’s collaboration with Google Cloud and NVIDIA demonstrates how cutting-edge hardware, combined with the flexible, scalable infrastructure of Google Cloud’s AI Hypercomputer, can solve the most pressing challenges in AI inference.
This unique combination enables end-users across industries to bring new applications to market, such as powering agentic workflows in financial services, generating real-time audio and video content in media, and accelerating document processing in healthcare. And it’s all happening at a scale and cost that was previously unattainable.
You can easily get started with Baseten’s platform through the Google Cloud Marketplace, or read more about their technical architecture in their own post.

AI Summary and Description: Yes

Summary: The text discusses Baseten’s advancements in AI infrastructure, specifically its improved cost-performance for high-throughput and latency-sensitive inference achieved through collaborations with Google Cloud and NVIDIA. The milestone positions Baseten as a notable player in the AI sector, making the deployment of complex AI models more scalable and affordable.

Detailed Description: The article provides an overview of Baseten’s achievements in enhancing AI model deployment efficiencies through innovative infrastructure partnerships. Key points include:

– **Infrastructure and Performance**:
  – Baseten utilized Google Cloud’s A4 VMs leveraging NVIDIA Blackwell, achieving significant cost-performance improvements.
  – It reported 225% better cost-performance for high-throughput inference and 25% for latency-sensitive tasks.

– **AI Model Deployment**:
  – The advancements enable companies to transition powerful AI models from lab environments to production efficiently and affordably.
  – Baseten serves several well-known open-source models (e.g., DeepSeek V3, Llama 4 Maverick) with a highly optimized software architecture.

– **Collaboration with Technology Leaders**:
  – Baseten collaborates with Google Cloud and NVIDIA, merging cutting-edge hardware and innovative software tools for optimal AI performance.
  – Key technologies such as NVIDIA’s Dynamo and TensorRT-LLM are integrated to enhance model optimization and throughput.

– **Scalability and Resilience**:
  – The infrastructure is designed for high availability and can adapt to varying workloads, ensuring resilience against outages.
  – The Dynamic Workload Scheduler automatically shifts workloads across clouds, providing flexibility and maintaining operational continuity.

– **Broader Implications for AI Applications**:
  – Baseten’s infrastructure can power diverse applications, from real-time media generation to document processing in healthcare.
  – The architecture helps organizations bring innovative AI applications to market at previously unattainable scale.

Overall, Baseten represents a compelling case of how infrastructural advancements can accelerate AI application deployment while ensuring cost efficiency and operational resilience, vital for professionals working in AI and cloud computing security spaces.