Cloud Blog: Announcing Gemma 3 on Vertex AI

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/announcing-gemma-3-on-vertex-ai/
Source: Cloud Blog
Title: Announcing Gemma 3 on Vertex AI

Feedly Summary: Today, we're sharing that the new Gemma 3 model is available on Vertex AI Model Garden, giving you immediate access for fine-tuning and deployment. You can quickly adapt Gemma 3 to your use case using Vertex AI's pre-built containers and deployment tools.
In this post, you’ll learn how to fine-tune Gemma 3 on Vertex AI and deploy it as a production-ready endpoint.
Gemma 3 on Vertex AI: PEFT and vLLM deployment
Tuning and deploying large language models can be computationally expensive and time-consuming. That’s why we’re excited to announce Gemma 3 support for Parameter-Efficient Fine-Tuning (PEFT) and optimized deployment using vLLM on Vertex AI Model Garden. 
Gemma 3 fine-tuning allows you to achieve performance gains with significantly fewer computational resources compared to full fine-tuning. Our vLLM-based deployment is easy to use and fast. vLLM's optimized inference engine maximizes throughput and minimizes latency, ensuring a responsive and scalable endpoint for your Gemma 3 applications on Vertex AI.
Let’s look at how you can fine-tune and deploy your Gemma 3 model on Vertex AI.


Fine-tuning Gemma 3 on Vertex AI
In Vertex AI Model Garden, you can fine-tune and deploy Gemma 3 using PEFT (LoRA) from Hugging Face in only a few steps. Before you run the notebook, make sure you complete all of the initial setup steps it describes.
Fine-tuning Gemma 3 on Vertex AI for your use case requires a custom dataset. The recommended format is a JSONL file, where each line is a valid JSON string. Here’s an example inspired by the timdettmers/openassistant-guanaco dataset:

code_block
<ListValue: [StructValue([(‘code’, ‘{“text": "### Human: Hola### Assistant: \\u00a1Hola! \\u00bfEn qu\\u00e9 puedo ayudarte hoy?"}’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e3772b8dfd0>)])]>

Each JSON object has a text key, which should match train_column; its value is a single training example as a string. You can upload your dataset to Google Cloud Storage (preferred) or to Hugging Face Datasets.
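
If your dataset lives locally, a minimal sketch of uploading the JSONL file to Cloud Storage with the google-cloud-storage client is shown below (the bucket name, object path, and local file name are placeholders):

code_block
# Minimal sketch: upload a local JSONL training file to Cloud Storage.
# Assumes `pip install google-cloud-storage` and authenticated credentials
# (e.g. `gcloud auth application-default login`). Names are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-training-data-bucket")  # placeholder bucket
blob = bucket.blob("gemma3/train.jsonl")             # destination object path
blob.upload_from_filename("train.jsonl")             # local JSONL file

print(f"Uploaded to gs://{bucket.name}/{blob.name}")
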
Choose the Gemma 3 variant that best suits your needs. For example, to use the 1B parameter model:

code_block
base_model_id = "gemma-3-1b-pt"

You have the flexibility to customize model parameters and job arguments. Let's explore some key settings. LoRA (Low-Rank Adaptation) is a PEFT technique that significantly reduces the number of trainable parameters. The following parameters control LoRA's behavior: lora_rank controls the dimensionality of the update matrices (a smaller rank means fewer trainable parameters), lora_alpha scales the LoRA updates, and lora_dropout adds regularization. The following settings are a reasonable starting point.

code_block
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05
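
The Model Garden training container applies these values for you, so the following is only a conceptual sketch of how they map onto a Hugging Face PEFT LoraConfig (it assumes the peft and transformers libraries and access to the gated google/gemma-3-1b-pt checkpoint):

code_block
# Conceptual sketch only: how the LoRA settings above map to Hugging Face PEFT.
# Assumes `pip install peft transformers` and Hugging Face access to the checkpoint.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=16,               # lora_rank: dimensionality of the low-rank update matrices
    lora_alpha=32,      # scales the LoRA updates
    lora_dropout=0.05,  # regularization on the LoRA layers
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-pt")
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only a small fraction of weights are trainable
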

When fine-tuning large language models (LLMs), precision is a key consideration, impacting both memory usage and performance. Lower precision training, such as 4-bit quantization, reduces the memory footprint. However, this can come with a slight performance trade-off compared to higher precisions like 8-bit or float16. The train_precision parameter dictates the numerical precision used during the training process. Choosing the right precision involves balancing resource limitations with desired model accuracy.

code_block
finetuning_precision_mode = "4bit"
train_precision = "bfloat16"
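
Again, the training container handles quantization internally. As an illustration of what these two settings mean, here is a hedged sketch of loading a model in 4-bit with bfloat16 compute using Hugging Face Transformers and bitsandbytes (the checkpoint name and library setup are assumptions):

code_block
# Illustrative sketch of 4-bit weights with bfloat16 compute, mirroring
# precision_mode="4bit" and train_precision="bfloat16" above.
# Assumes `pip install transformers bitsandbytes` and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit to cut memory
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-pt",
    quantization_config=bnb_config,
    device_map="auto",
)
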

Optimizing model performance involves tuning training parameters that affect speed, stability, and capability. Essential parameters include per_device_train_batch_size, which determines the batch size per GPU; larger sizes accelerate training but demand more memory. gradient_accumulation_steps lets you simulate larger batch sizes by accumulating gradients over several smaller batches, a memory-efficient alternative at the cost of longer training time. The learning_rate dictates the optimization step size: a rate that is too high can lead to divergence, while a rate that is too low can slow down convergence. The lr_scheduler_type adjusts the learning rate throughout training, for example through linear or cosine decay, fostering better convergence and accuracy. Finally, the total training duration is defined by either max_steps, which specifies the total number of training steps, or num_train_epochs, with max_steps taking precedence if both are specified. Below is the full training recipe from the official notebook.

code_block
train_job_args = [
    "--config_file=vertex_vision_model_garden_peft/deepspeed_zero2_8gpu.yaml",
    "--task=instruct-lora",
    "--input_masking=True",
    "--pretrained_model_name_or_path=gg-hf-g/gemma-3-1b",
    "--train_dataset=timdettmers/openassistant-guanaco",
    "--train_split=train",
    "--train_column=text",
    "--output_dir=gs://your-adapter-repo",
    "--merge_base_and_lora_output_dir=gs://merged-model-repo",
    "--per_device_train_batch_size=1",
    "--gradient_accumulation_steps=4",
    "--lora_rank=16",
    "--lora_alpha=32",
    "--lora_dropout=0.05",
    "--max_steps=-1",
    "--max_seq_length=4096",
    "--learning_rate=5e-05",
    "--lr_scheduler_type=cosine",
    "--precision_mode=4bit",
    "--train_precision=bfloat16",
    "--gradient_checkpointing=True",
    "--num_train_epochs=1.0",
    "--attn_implementation=eager",
    "--optimizer=paged_adamw_32bit",
    "--warmup_ratio=0.01",
    "--report_to=tensorboard",
    "--logging_output_dir=gs://your-logs-repo",
    "--save_steps=10",
    "--logging_steps=10",
    "--train_template=openassistant-guanaco",
    "--huggingface_access_token=your-token",
    "--eval_dataset=timdettmers/openassistant-guanaco",
    "--eval_column=text",
    "--eval_template=openassistant-guanaco",
    "--eval_split=test",
    "--eval_steps=10",
    "--eval_metric_name=loss,perplexity,bleu",
    "--metric_for_best_model=perplexity",
]
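
As a quick sanity check on this recipe, the effective batch size per optimizer step is the per-device batch size times the gradient accumulation steps times the number of GPUs. Assuming the deepspeed_zero2_8gpu.yaml config runs on 8 GPUs, as its name suggests:

code_block
# Effective batch size implied by the recipe above.
# num_gpus = 8 is an assumption based on the deepspeed_zero2_8gpu.yaml file name.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_gpus = 8

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 32 sequences per optimizer step
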

Finally, create and run the CustomContainerTrainingJob to start the fine-tuning job.

code_block
train_job = aiplatform.CustomContainerTrainingJob(
    display_name=job_name,
    container_uri=TRAIN_DOCKER_URI,
    labels=labels,
)

train_job.run(
    args=train_job_args,
    replica_count=replica_count,
    machine_type=training_machine_type,
    accelerator_type=training_accelerator_type,
    accelerator_count=per_node_accelerator_count,
    boot_disk_size_gb=500,
    service_account=SERVICE_ACCOUNT,
    base_output_dir=base_output_dir,
    sync=False,
    **dws_kwargs,
)

train_job.wait_for_resource_creation()
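
Because run() is called with sync=False, it returns as soon as the job is submitted. If you prefer to block until training finishes and then inspect the outcome from the notebook, here is a small sketch using the SDK's wait() and state helpers (behavior assumed from the google-cloud-aiplatform SDK):

code_block
# Hedged sketch: block on the asynchronous training job and check its final state.
train_job.wait()        # blocks until the training job completes
print(train_job.state)  # e.g. PIPELINE_STATE_SUCCEEDED on success
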

You can monitor the fine-tuning progress using TensorBoard. Once the job is complete, you can upload the tuned model to the Vertex AI Model Registry and deploy it as an endpoint for inference. Let's dive into deployment next.
Deploying Gemma 3 on Vertex AI
Deploying Gemma 3 on Vertex AI requires only three steps as described in this notebook. 
First, you need to provision a dedicated endpoint for your Gemma 3 model. This provides a scalable and managed environment for hosting your model. You use the create method to set the endpoint name (display_name) and to enable dedicated resources for your model (dedicated_endpoint_enabled).

code_block
from google.cloud import aiplatform as vertex_ai

endpoint = vertex_ai.Endpoint.create(
    display_name="gemma3-endpoint",
    dedicated_endpoint_enabled=True,
)

Next, register the Gemma 3 model within the Vertex AI Model Registry. Think of the Model Registry as a central hub for managing your models. It keeps track of different versions of your Gemma 3 model (in case you make improvements later), and is the central place from which you’ll deploy.

code_block
vllm_serving_image_uri = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01"

env_vars = {
    "MODEL_ID": "google/gemma-3-1b-it",
    "DEPLOY_SOURCE": "notebook",
    "HF_TOKEN": "your-hf-token",
}

vllm_args = [
    "python",
    "-m",
    "vllm.entrypoints.api_server",
    "--host=0.0.0.0",
    "--port=8080",
    "--model=gs://vertex-model-garden-restricted-us/gemma3/gemma-3-1b-it",
    "--tensor-parallel-size=1",
    "--swap-space=16",
    "--gpu-memory-utilization=0.95",
    "--max-model-len=32768",
    "--dtype=auto",
    "--max-loras=1",
    "--max-cpu-loras=8",
    "--max-num-seqs=256",
    "--disable-log-stats",
    "--trust-remote-code",
    "--enforce-eager",
    "--enable-lora",
    "--enable-chunked-prefill",
    "--enable-prefix-caching",
]

model = aiplatform.Model.upload(
    display_name="gemma-3-1b",
    serving_container_image_uri=vllm_serving_image_uri,
    serving_container_args=vllm_args,
    serving_container_ports=[8080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    serving_container_environment_variables=env_vars,
    serving_container_shared_memory_size_mb=(16 * 1024),
    serving_container_deployment_timeout=7200,
    model_garden_source_model_name="publishers/google/models/gemma3",
)

This step involves a few important configurations, including the serving container used to deploy Gemma 3.
To serve Gemma 3 on Vertex AI, use the Vertex AI Model Garden pre-built vLLM Docker image for fast and efficient model serving. The vLLM arguments define how vLLM serves Gemma 3: --tensor-parallel-size lets you spread the model across multiple GPUs if you need extra compute, --gpu-memory-utilization controls how much GPU memory to use, and --max-model-len sets the maximum length of text the model can process at once. You also have advanced settings like --enable-chunked-prefill and --enable-prefix-caching to optimize performance, especially when dealing with longer pieces of text.
There is also some deployment configuration Vertex AI requires to serve the model, including the port the serving container listens on (8080 in our case), the URL path for prediction requests (e.g., "/generate"), and the URL path for health checks (e.g., "/ping"), which allows Vertex AI to monitor the model's status.
Finally, use upload() to take this configuration (the serving container, your model-specific settings, and instructions for how to run the model) and bundle it into a single, manageable unit within the Vertex AI Model Registry. This makes deployment and version control much easier.
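
If you want to confirm the registration, here is a brief sketch that lists models in the registry matching the display name used above (the filter string is an assumption based on that name):

code_block
# Optional check: confirm the upload landed in the Vertex AI Model Registry.
# The filter assumes the display_name "gemma-3-1b" used in the upload above.
models = aiplatform.Model.list(filter='display_name="gemma-3-1b"')
for m in models:
    print(m.resource_name, m.version_id)
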
Now you’re ready to deploy the model. To deploy the registered model to the endpoint, use the deploy method as shown below.

code_block
model.deploy(
    endpoint=endpoint,
    machine_type="a3-highgpu-2g",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    deploy_request_timeout=1800,
)

This is where you choose the computing power for your deployment, including the type of virtual machine (machine_type, e.g. "a3-highgpu-2g"), the kind of accelerator (accelerator_type, e.g. "NVIDIA_L4" GPUs), and how many accelerators to use (accelerator_count).
Deploying the model takes some time, and you can monitor the status of the deployment in Cloud Logging. Once the endpoint is running, you can use the ChatCompletion API to call the model and integrate it into your applications, as shown below.

code_block
import google.auth
import google.auth.transport.requests
import openai

creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

user_message = "How is your day going?"
max_tokens = 50
temperature = 1.0
stream = False

# Replace these placeholders with your dedicated endpoint DNS name
# and your endpoint resource name.
your_dedicated_endpoint = "your-dedicated-endpoint"
your_endpoint_name = "your-endpoint-name"
BASE_URL = f"https://{your_dedicated_endpoint}/v1beta1/{your_endpoint_name}"

client = openai.OpenAI(base_url=BASE_URL, api_key=creds.token)

model_response = client.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": user_message}],
    temperature=temperature,
    max_tokens=max_tokens,
    stream=stream,
)

print(model_response)
# I'm doing well, thanks for asking! ...

Depending on the Gemma model you deploy, you can use the ChatCompletion API to call the model with multimodal inputs (images). You can find more in the “Deploy Gemma 3 4B, 12B and 27B multimodal models with vLLM on GPU” section of the model card notebook.
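
As an illustration only (the exact payload is described in that notebook), here is a hedged sketch of a multimodal request against one of the larger variants, reusing the client from the previous snippet; the image URL is a placeholder:

code_block
# Hedged sketch: multimodal ChatCompletion request for the 4B/12B/27B variants.
# Reuses `client` from the previous snippet; the image URL is a placeholder.
multimodal_response = client.chat.completions.create(
    model="",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
    max_tokens=100,
)
print(multimodal_response.choices[0].message.content)
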
What’s next?
Visit the Gemma 3 model card on Vertex AI Model Garden to get started today. For a deeper understanding of the model’s architecture and performance, check out this developer guide on Gemma 3.

AI Summary and Description: Yes

Summary: The text describes the availability and deployment process of the Gemma 3 model on Vertex AI Model Garden, emphasizing the benefits of using Parameter-Efficient Fine-Tuning (PEFT) and vLLM deployment for optimized performance and reduced computational costs. This is highly relevant for professionals in AI, cloud, and infrastructure security, as it provides insights into deploying advanced AI models securely and efficiently.

Detailed Description:
The Gemma 3 model, now accessible via Vertex AI Model Garden, is designed for fine-tuning and deployment within AI applications. Here’s a breakdown of the major points discussed in the text, highlighting its significance for AI and cloud computing professionals:

- **Model Introduction**: Gemma 3 is a large language model (LLM) available for immediate use and can be adapted for specific use cases using Vertex AI.

- **Parameter-Efficient Fine-Tuning (PEFT)**:
  - The introduction of PEFT, particularly using the LoRA (Low-Rank Adaptation) technique, allows for significant performance gains while utilizing fewer computational resources.
  - This method is advantageous for organizations looking to optimize costs while gaining effective model training outcomes.

- **Optimized Deployment with vLLM**:
  - The use of vLLM as a deployment method maximizes throughput and minimizes latency, providing a fast and responsive endpoint for applications.
  - Performance characteristics such as memory usage and processing speed can significantly affect user experience.

- **Fine-Tuning Procedure**:
  - The text outlines steps to fine-tune Gemma 3 using custom datasets and various parameters that can optimize the training process.
  - Key parameters include batch sizes, training precision, learning rates, and gradient accumulation strategies, which are crucial for effective model performance.

- **Deployment Setup**:
  - Details on provisioning dedicated endpoints, registering models, and configuring Docker containers for deployment are provided.
  - Fine-tuning progress is monitored via TensorBoard to track training and confirm the model's readiness.

- **Custom Container Training Job**:
  - Instructions for setting up a custom container training job are specified, allowing for flexible resource management when preparing the model for production.

- **Monitoring and Support**:
  - Deployment status can be monitored through Cloud Logging, highlighting the importance of operational transparency for ongoing security and performance evaluations.

- **ChatCompletion API Integration**:
  - The text describes how to integrate the deployed model using the ChatCompletion API, which can handle multimodal inputs, reflecting the model's versatility and its application in diverse scenarios.

**Practical Implications**:
- **Resource Management**: Utilizing techniques like PEFT helps organizations efficiently manage resources while achieving high-performance outcomes with AI models.
- **Scalability and Security**: The procedures outlined for endpoint management and dockerized deployment align with best practices in cloud security, ensuring scalable and secure AI solutions.
- **Ongoing Monitoring**: Continuous monitoring and the ability to adapt deployments in real time are integral for maintaining compliance with security standards in AI operations.

Overall, the insights from this text are crucial for AI, cloud, and infrastructure security professionals looking to implement advanced AI solutions while ensuring compliance with performance standards.