Cloud Blog: Understanding Calendar mode for Dynamic Workload Scheduler: Reserve ML GPUs and TPUs

Source URL: https://cloud.google.com/blog/products/compute/dynamic-workload-scheduler-calendar-mode-reserves-gpus-and-tpus/
Source: Cloud Blog
Title: Understanding Calendar mode for Dynamic Workload Scheduler: Reserve ML GPUs and TPUs

Feedly Summary: Organizations need ML compute resources that can accommodate bursty peaks and periodic troughs. That means the consumption models for AI infrastructure need to evolve to be more cost-efficient, provide term flexibility, and support rapid development on the latest GPU and TPU accelerators.
Calendar mode is currently available in preview as the newest feature of Dynamic Workload Scheduler. This mode provides short-term ML capacity — up to 90 days of reserved capacity — without requiring long-term commitments. 
Calendar mode extends the capabilities of Compute Engine future reservations to provide co-located GPU and TPU capacity that’s a good fit for model training, fine-tuning, experimentation and inference workloads. 
Similar to a flight or hotel booking experience, Calendar mode makes it easy to search for and reserve ML capacity. Simply define your resource type, number of instances, expected start date and duration, and in a few seconds, you’ll be able to see the available capacity and reserve it. Once the capacity reservation is confirmed and delivered to your project, you can consume it via Compute Engine, Google Kubernetes Engine (GKE), Vertex AI custom training, and Google Batch.
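
As a concrete illustration of the Compute Engine path, the sketch below creates a VM that targets a delivered reservation through reservation affinity, using the google-cloud-compute Python client. This is a minimal sketch, not the post's own example: the project, zone, machine type, and reservation name are placeholder assumptions.

```python
# Minimal sketch, assuming the google-cloud-compute client library and a
# Calendar mode reservation already delivered to the project.
# All names and values below are hypothetical placeholders.
from google.cloud import compute_v1

PROJECT, ZONE = "my-project", "us-central1-a"  # placeholders

instance = compute_v1.Instance(
    name="dws-training-vm",
    machine_type=f"zones/{ZONE}/machineTypes/a3-highgpu-8g",  # example GPU machine type
    disks=[
        compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                source_image="projects/debian-cloud/global/images/family/debian-12",
            ),
        )
    ],
    network_interfaces=[
        compute_v1.NetworkInterface(network="global/networks/default")
    ],
    # Target the delivered reservation explicitly instead of on-demand capacity.
    reservation_affinity=compute_v1.ReservationAffinity(
        consume_reservation_type="SPECIFIC_RESERVATION",
        key="compute.googleapis.com/reservation-name",
        values=["my-calendar-reservation"],
    ),
)

compute_v1.InstancesClient().insert(
    project=PROJECT, zone=ZONE, instance_resource=instance
).result()  # blocks until the VM is created
```

GKE, Vertex AI custom training, and Google Batch offer analogous ways to point workloads at a reservation; the details vary by product.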
What customers are saying
Over the past year, early access customers have used Calendar mode to reserve ML compute resources for a variety of use cases, from drug discovery to training new models.

“To accelerate drug discovery, Schrödinger relies on large-scale simulations to identify promising, high-quality molecules. Reserving GPUs through Google Cloud’s DWS Calendar Mode provides us the crucial flexibility and assurance needed to cost-effectively scale our compute environment for critical, time-sensitive projects." – Shane Brauner, EVP/CIO, Schrödinger

"For Vilya, Dynamic Workload Scheduler has delivered on two key fronts: affordability and performance. The cost efficiency received was a significant benefit, and the reliable access to GPUs has empowered our teams to complete projects much faster, and it’s been invaluable for our computationally intensive tasks. It’s allowed us to be more efficient and productive without breaking the budget." – Patrick Salveson, co founder and CTO

"Databricks simplifies the deployment and management of machine learning models, enabling fine tuning and real-time inference for scalable production environments. DWS Calendar Mode alleviated the burden of GPU capacity planning and provided seamless access to the latest generation GPU hardware for dynamic demand for testing and ongoing training." – Ravi Gadde, Sr. Director, Serverless Platform

Using Calendar mode
With these concepts and use cases under our belts, let’s take a look at how to find and reserve capacity via the Google Cloud console. Navigate to Cloud console -> Compute Engine -> Reservations. Then, on the Future reservations tab, click Create future reservation. Selecting a supported GPU or TPU exposes the Search for capacity section, which shows the available capacity for your chosen start date and duration.

Proceed to the Advanced settings to choose whether the reservation should be shared across multiple projects. The final step is to name the reservation upon creation.
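
For teams that prefer automation over the console, the same request can be expressed against the future reservations API. The following is a minimal sketch using the google-cloud-compute Python client: field names mirror the public futureReservations resource, Calendar-mode-specific options may differ, and every value (project, zone, dates, counts, machine type) is an illustrative placeholder.

```python
# Minimal sketch, assuming the google-cloud-compute client library.
# Field names follow the public futureReservations resource; all values
# here are hypothetical placeholders, not values from the post.
from google.cloud import compute_v1

PROJECT, ZONE = "my-project", "us-central1-a"  # placeholders

future_reservation = compute_v1.FutureReservation(
    name="my-calendar-reservation",
    # Expected start date and duration (here, 14 days).
    time_window=compute_v1.FutureReservationTimeWindow(
        start_time="2025-07-01T00:00:00Z",
        duration=compute_v1.Duration(seconds=14 * 24 * 3600),
    ),
    # Resource type and number of instances.
    specific_sku_properties=compute_v1.FutureReservationSpecificSKUProperties(
        total_count=4,
        instance_properties=compute_v1.AllocationSpecificSKUAllocationReservedInstanceProperties(
            machine_type="a3-highgpu-8g",  # example GPU machine type
        ),
    ),
    # Advanced settings: optionally share the capacity with another project.
    share_settings=compute_v1.ShareSettings(
        share_type="SPECIFIC_PROJECTS",
        project_map={
            "my-other-project": compute_v1.ShareSettingsProjectConfig(
                project_id="my-other-project"
            )
        },
    ),
)

compute_v1.FutureReservationsClient().insert(
    project=PROJECT, zone=ZONE, future_reservation_resource=future_reservation
).result()  # blocks until the request is submitted
```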

The reservation is approved within minutes, and the capacity can be consumed at the specified start time once the reservation reaches the Fulfilled status.
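
A hedged sketch of checking that status programmatically, continuing the placeholder names from the example above:

```python
# Minimal sketch: poll the future reservation until its procurement
# status reaches FULFILLED. Names continue the placeholders above.
import time

from google.cloud import compute_v1

client = compute_v1.FutureReservationsClient()
while True:
    fr = client.get(
        project="my-project",
        zone="us-central1-a",
        future_reservation="my-calendar-reservation",
    )
    # In this client, Compute Engine enum fields are plain strings,
    # e.g. "APPROVED" and later "FULFILLED".
    print("Procurement status:", fr.status.procurement_status)
    if fr.status.procurement_status == "FULFILLED":
        break
    time.sleep(60)  # re-check every minute
```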
Get started today
Calendar mode with AI Hypercomputer makes finding, reserving, consuming, and managing capacity easy for ML workloads. Get started today with Calendar mode for TPUs, and contact your account team for GPU access in Compute Engine, GKE, or Slurm. To learn more, see the Calendar mode documentation and Dynamic Workload Scheduler pricing.

AI Summary and Description: Yes

Summary: The text discusses the introduction of “Calendar mode” as a feature of Dynamic Workload Scheduler in Google Cloud, enabling organizations to reserve ML compute resources flexibly and cost-effectively. This innovation is particularly relevant for AI and machine learning workloads, allowing users to efficiently manage GPU and TPU resources in response to varying demand.

Detailed Description: The provided content elaborates on the evolving requirements for AI infrastructure, particularly the need for flexible, cost-efficient ML compute resources that accommodate fluctuating demand. Calendar mode in Dynamic Workload Scheduler (DWS) addresses these needs by allowing organizations to reserve ML capacity for short-term use (up to 90 days) without long-term commitments.

Key points include:

– **Cost-Efficiency & Flexibility**:
  – Organizations can reserve GPU/TPU resources without committing to long-term contracts, accommodating both peak and low-demand periods.
  – This model supports the rapid development of AI applications and machine learning projects while maintaining budgetary control.

– **Ease of Use**:
  – The process for reserving resources is simplified, akin to booking flights or hotels. Users can define parameters such as resource type, number of instances, start date, and duration, allowing for a quick view of available capacity.

– **Real-World Applications**:
  – The customer quotes illustrate successful applications of Calendar mode:
    – **Schrödinger**: Uses the feature for drug discovery simulations, emphasizing the need for flexibility and cost-effectiveness.
    – **Vilya**: Highlights affordability and improved project completion times due to reliable access to GPUs.
    – **Databricks**: Benefits from a reduced GPU capacity-planning burden and access to the latest-generation GPU hardware for dynamic ML demands.

– **User Guidance**:
  – The text briefly outlines the steps to reserve ML capacity via the Google Cloud console, showcasing the intuitive interface and capabilities for managing reservations across projects.

Overall, the introduction of Calendar mode presents a significant advancement for organizations looking to optimize their AI and ML resource management, reflecting a broader trend toward greater flexibility and efficiency in cloud computing environments. For security and compliance professionals, understanding such developments is crucial in evaluating the impact of cloud resource management solutions on organizational security posture and operational efficiency.