Source URL: https://cloud.google.com/blog/products/containers-kubernetes/run-openais-new-gpt-oss-model-at-scale-with-gke/
Source: Cloud Blog
Title: Run OpenAI’s new gpt-oss model at scale with Google Kubernetes Engine
Feedly Summary: It’s exciting to see OpenAI contribute to the open ecosystem with the release of their new open weights model, gpt-oss. In keeping with our commitment to provide the best platform for open AI innovation, we’re announcing immediate support for deploying gpt-oss-120b and gpt-oss-20b on Google Kubernetes Engine (GKE). To help customers make informed decisions while deploying their infrastructure, we’re publishing detailed benchmarks of gpt-oss-120b on Google Cloud accelerators. You can access them here.

This continues our support for a broad and diverse ecosystem of models, from Google’s own Gemma family, to models like Llama 4, and now, OpenAI’s gpt-oss. We believe that offering choice and leveraging the best of the open community is critical for the future of AI.

Run demanding AI workloads at scale

The gpt-oss models are large and require significant computational power, typically multiple NVIDIA H100 Tensor Core GPUs for optimal performance. This is where Google Cloud and GKE shine. GKE is designed to handle large-scale, mission-critical workloads, providing the scalability and performance needed to serve today’s most demanding models. With GKE, you can leverage Google Cloud’s advanced infrastructure, including both GPU and TPU accelerators, to power your generative AI applications.
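As a rough illustration of the infrastructure involved, the sketch below provisions a GKE cluster with an A3 node pool carrying NVIDIA H100 80GB GPUs. This is not a recipe from the post itself: the cluster name, zone, and node counts are placeholders, and the machine type and driver options should be verified against current GKE documentation.

```
# Illustrative sketch only: names, zone, and sizing are placeholders.
gcloud container clusters create gpt-oss-demo \
  --zone=us-central1-a \
  --num-nodes=1

# A3 high-GPU machines bundle 8 NVIDIA H100 80GB GPUs per node.
gcloud container node-pools create h100-pool \
  --cluster=gpt-oss-demo \
  --zone=us-central1-a \
  --machine-type=a3-highgpu-8g \
  --accelerator=type=nvidia-h100-80gb,count=8,gpu-driver-version=latest \
  --num-nodes=1
```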
Get started in minutes with GKE Inference Quickstart

To make deploying gpt-oss as simple as possible, we have made optimized configurations available through our GKE Inference Quickstart (GIQ) tool. GIQ provides validated, performance-tuned deployment recipes that let you serve state-of-the-art models with just a few clicks. Instead of manually configuring complex YAML files, you can use our pre-built configurations to get up and running quickly.
GKE Inference Quickstart provides benchmarking and quick-start capabilities to ensure you are running with the best possible performance. You can learn more about how to use it in our official documentation.
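Before applying a recipe, it can help to confirm that your cluster actually exposes GPU capacity. A minimal check, assuming the standard GKE setup in which NVIDIA GPUs surface as the nvidia.com/gpu extended resource:

```
# List nodes with their allocatable GPU counts; GKE exposes NVIDIA GPUs
# as the extended resource "nvidia.com/gpu" (the dot is escaped for kubectl).
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```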
You can also deploy the new OpenAI gpt-oss model via the gcloud CLI: set up access to the weights from the OpenAI organization on Hugging Face, then use the gcloud CLI to deploy the model on a GKE cluster with the appropriate accelerators. For example:
```
gcloud alpha container ai profiles manifests create \
  --model=openai/gpt-oss-20b \
  --model-server=vllm \
  --accelerator-type=nvidia-h100-80gb \
  --target-ntpot-milliseconds=200
```
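From there, one way to wire things together, as a hedged sketch: it assumes the command emits Kubernetes manifests on stdout, that the generated workload reads a Hugging Face token from a secret named hf-secret, and that the resulting service is named vllm-service — all placeholders to check against the generated manifest. vLLM itself serves an OpenAI-compatible API.

```
# Assumption: the generated recipe reads a Hugging Face token from this secret.
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token="${HF_TOKEN}"

# Assumption: manifests are written to stdout; capture and apply them.
gcloud alpha container ai profiles manifests create \
  --model=openai/gpt-oss-20b \
  --model-server=vllm \
  --accelerator-type=nvidia-h100-80gb \
  --target-ntpot-milliseconds=200 > gpt-oss.yaml
kubectl apply -f gpt-oss.yaml

# vLLM exposes an OpenAI-compatible API; the service name is a placeholder.
kubectl port-forward service/vllm-service 8000:8000 &
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Hello!"}]}'
```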
Our commitment to open models

Our support for gpt-oss is part of a broader, systematic effort to bring the most popular open models to GKE as soon as they are released, while also giving customers detailed benchmarks to make informed choices when deploying their infrastructure. Get started with the new OpenAI gpt-oss model on GKE today.
AI Summary and Description: Yes
Summary: The announcement highlights OpenAI’s release of the gpt-oss model and Google Cloud’s immediate support for deploying it on Google Kubernetes Engine (GKE). This move aims to facilitate infrastructure deployment for AI workloads, enabling users to leverage powerful computing resources effectively.
Detailed Description: The text discusses OpenAI’s gpt-oss model and its integration into Google Cloud’s infrastructure, specifically through GKE. Key points include:
- **Release of gpt-oss Models**: OpenAI has introduced the gpt-oss models, named gpt-oss-120b and gpt-oss-20b, contributing to an open ecosystem of AI models.
- **Deployment on GKE**:
  - Google Cloud is immediately supporting these models, emphasizing the platform’s commitment to open AI innovation.
  - GKE is designed to manage large-scale, mission-critical workloads, which is essential for running demanding AI models.
- **Computational Requirements**:
  - The gpt-oss models are resource-intensive, likely necessitating multiple NVIDIA H100 Tensor Core GPUs for effective performance.
- **Performance and Scalability**:
  - Google Cloud’s infrastructure supports both GPU and TPU accelerators, providing the scalability and performance needed for generative AI applications.
- **GKE Inference Quickstart (GIQ)**:
  - Google offers GIQ, which provides optimized configurations and validated deployment recipes to simplify the deployment process.
  - Users can quickly start serving state-of-the-art models without manual YAML configuration.
- **Benchmarking and CLI Deployment**:
  - Customers receive detailed benchmarks to assist in making informed decisions regarding model deployment on their infrastructure.
  - Users can deploy the gpt-oss models via the gcloud CLI with a straightforward command.
- **Commitment to Open Models**:
  - Google Cloud’s ongoing support for popular open models like gpt-oss reflects a systematic effort to integrate and provide benchmarks for these resources.
This announcement provides essential insights into how organizations can effectively deploy and scale AI workloads using Google Cloud, reinforcing the significance of infrastructure and support for open model ecosystems in AI development.