Source URL: https://cloud.google.com/blog/products/compute/managed-slurm-and-other-cluster-director-enhancements/
Source: Cloud Blog
Title: New Cluster Director features: Simplified GUI, managed Slurm, advanced observability
Feedly Summary: In April, we released Cluster Director, a unified management plane that makes deploying and managing large-scale AI infrastructure simpler and more intuitive than ever before, putting the power of an AI supercomputer at your fingertips. Today, we’re excited to release new features in preview, including an intuitive interface, a managed Slurm experience, and an observability dashboard that surfaces performance anomalies.
From complex configuration to easy creation
AI infrastructure users can spend weeks wrestling with complex configurations for compute, networking, and storage. Because distributed training workloads are highly synchronized jobs across thousands of nodes and are highly sensitive to network latency, performance bottlenecks can be difficult to diagnose and resolve. Cluster Director solves these challenges with a single, unified interface that automates the complex setup of AI and HPC clusters, integrating Google Cloud’s optimized compute, networking, and storage into a cohesive, performant, and easily managed environment.
LG Research uses Google Cloud to train their large language models, most recently Exaone 3.5. With Cluster Director, they have significantly reduced the time it takes to get a cluster running their code — from over a week to less than one day. That’s hundreds of GPU hours saved for real workloads.
“Thanks to Cluster Director, we’re able to deploy and operate large-scale, high-performance GPU clusters flexibly and efficiently, even with minimal human resources.” – Jiyeon Jung, AI Infra Sr Engineer, LG AI Research
Biomatter uses Google Cloud to scale their in silico design processes. Cluster Director has made cluster deployment and management smooth, enabling the team to dedicate more focus to the scientific challenges at the core of their work.
“Cluster Director on Google Cloud has significantly simplified the way we create, configure, and manage Slurm-based AI and HPC clusters. With an intuitive UI and easy access to GPU-accelerated instances, we’ve reduced the time and effort spent on infrastructure.” – Irmantas Rokaitis, Chief Technology Officer, Biomatter
Read on for what’s new in the latest version of Cluster Director.
Simplified cluster management across compute, network, and storage
Use a new intuitive view in the Google Cloud console to easily create, update, and delete clusters. Instead of a blank slate, you start with a choice of validated, optimized reference architectures. You can add one or more machine configurations from a range of VM families (including A3 and A4 GPUs) and specify the machine type, the number of GPUs, and the number of instances. You can choose your consumption model, selecting on-demand capacity (where supported), DWS Calendar or Flex start modes, Spot VMs for cost savings, or attaching a specific reservation for capacity assurance.
Cluster Director also simplifies networking by allowing you to deploy the cluster on a new, purpose-built VPC network or an existing one. If you create a new network, the firewall rules required for internal communication and SSH access are configured automatically, removing a common pain point. For storage, you can create and attach a new Filestore or Google Cloud Managed Lustre instance, or connect to an existing Cloud Storage bucket. These integrations help ensure that your high-performance file system is correctly mounted and available to all nodes in the cluster from the moment they launch.
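For context, on a self-managed cluster these storage mounts would otherwise be configured by hand on every node. A Filestore share, for instance, is an NFS export; a minimal manual setup might look like the following (the IP address `10.0.0.2`, share name `vol1`, and mount point are hypothetical — Cluster Director performs the equivalent setup automatically at boot):

```shell
# Hypothetical example of the manual NFS mount that Cluster Director automates.
sudo apt-get install -y nfs-common        # NFS client tools (Debian/Ubuntu images)
sudo mkdir -p /mnt/shared                 # mount point shared by all nodes

# 10.0.0.2 and "vol1" stand in for the Filestore instance IP and share name.
sudo mount -t nfs 10.0.0.2:/vol1 /mnt/shared

# Persist the mount across reboots:
echo "10.0.0.2:/vol1 /mnt/shared nfs defaults 0 0" | sudo tee -a /etc/fstab
```

Repeating this correctly and consistently across hundreds of nodes is exactly the kind of toil the automated integration removes.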
Powerful job scheduling with Managed Slurm
Cluster Director provides fault-tolerant and highly scalable job scheduling out of the box with a managed, pre-configured Slurm environment. The controller node is managed for you, and you can easily configure the login nodes, including machine type, source image, and boot-disk size. Partitions and nodesets are pre-configured based on your compute selections, but you retain the flexibility to customize them, now or in the future.
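Once the cluster is up, jobs are submitted through standard Slurm tooling from a login node. A minimal batch script for a distributed training job might look like the sketch below — the partition name `a3`, node counts, and training command are placeholders; in practice you would use the partitions Cluster Director pre-created from your compute selections:

```shell
#!/bin/bash
# Minimal Slurm batch script (illustrative; partition and command are placeholders).
#SBATCH --job-name=train-llm
#SBATCH --partition=a3        # GPU partition pre-configured by Cluster Director
#SBATCH --nodes=4             # number of nodes for the distributed job
#SBATCH --gpus-per-node=8     # GPUs requested on each node
#SBATCH --time=12:00:00       # wall-clock limit

# Launch one task set across all allocated nodes.
srun python train.py --config config.yaml
```

Submit with `sbatch train.sh`, then monitor the queue with `squeue` and partition state with `sinfo` — all standard Slurm commands that work unchanged in the managed environment.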
Topology-aware placement
To maximize performance, Cluster Director is deeply integrated with Google’s network topology. This begins at cluster creation, when VMs are placed in close physical proximity. Crucially, this intelligence is also built directly into the managed Slurm environment. The Slurm scheduler is natively topology-aware, meaning it understands the underlying physical network and automatically co-locates your job’s tasks on nodes with the lowest-latency paths between them. This combination of initial placement and ongoing topology-aware scheduling is a key performance enhancer, dramatically reducing network contention during large, distributed training jobs.
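Under the hood, Slurm’s topology awareness comes from its topology plugin. On a self-managed cluster you would describe the switch hierarchy yourself in `topology.conf`; the sketch below shows what such a mapping looks like, with illustrative switch and node names — Cluster Director generates the equivalent mapping for you from the real network layout:

```shell
# slurm.conf: enable the tree topology plugin
TopologyPlugin=topology/tree

# topology.conf: describe the physical network (names are illustrative)
# Leaf switches with their directly attached nodes:
SwitchName=leaf1  Nodes=a3-node-[0-15]
SwitchName=leaf2  Nodes=a3-node-[16-31]
# Spine switch connecting the leaf switches:
SwitchName=spine1 Switches=leaf[1-2]
```

With this in place, the scheduler prefers allocations that span as few switches as possible, and a job can make that explicit with Slurm’s `--switches` option (e.g. `sbatch --switches=1 job.sh`) to keep all tasks behind a single low-latency switch where capacity allows.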
Comprehensive visibility and insights
Cluster Director’s integrated observability dashboard provides a clear view of your cluster’s health, utilization, and performance, so you can quickly understand your system’s behavior and diagnose issues in a single place. The dashboard is designed to easily scale to tens of thousands of VMs.
Advanced diagnostics to detect performance anomalies
In distributed ML training, stragglers are the small number of faulty or slow nodes that end up gating the entire workload, since synchronized training steps can only proceed as fast as the slowest participant. Cluster Director makes it easy to quickly find and replace stragglers, avoiding performance degradation and wasted spend.
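When a straggler is identified, the standard Slurm remediation is to drain the node so no new work lands on it, swap or repair it, and then return it to service. A sketch of that workflow, with a hypothetical node name:

```shell
# Drain the suspect node: running jobs finish, but no new jobs are scheduled on it.
scontrol update NodeName=a3-node-17 State=DRAIN Reason="straggler: degraded throughput"

# List drained/down nodes with their reasons to confirm:
sinfo -R

# Once the node has been replaced or repaired, return it to the pool:
scontrol update NodeName=a3-node-17 State=RESUME
```

Cluster Director’s diagnostics shorten the hard part of this loop — identifying which node is the straggler in the first place.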
Try out Cluster Director today!
We are excited to invite you to be among the first to experience Cluster Director. To learn more and express your interest in joining the preview, talk to your Google Cloud account team or sign up here. We can’t wait to see what you will build.
AI Summary and Description: Yes
Summary: The text discusses the launch of Cluster Director, a unified management tool designed to simplify and enhance the deployment and management of large-scale AI and high-performance computing (HPC) infrastructure on Google Cloud. It highlights significant improvements in cluster creation speed and efficiency, as reported by users such as LG Research and Biomatter.
Detailed Description:
The text reveals key features and advantages of Cluster Director, which streamlines AI infrastructure management in the cloud. Here are the significant points:
– **Introduction of Cluster Director**:
– A unified management plane for deploying large-scale AI infrastructure efficiently.
– Provides powerful tools akin to an AI supercomputer at users’ fingertips.
– **Ease of Deployment**:
– Solves complex configuration challenges in compute, networking, and storage setups for AI and HPC clusters.
– Enhances distributed training workloads, which are sensitive to network latency.
– **User Testimonials**:
– **LG Research**: Reduced cluster setup time from over a week to less than a day, saving hundreds of GPU hours.
– **Biomatter**: Streamlined cluster management allows them to focus on scientific challenges.
– **Key Features**:
– **Simplified management interface**: Users can easily create, update, and delete clusters with optimized templates.
– **Flexible configurations**: Allows users to customize VM types, GPU instances, and network settings effectively.
– **Automated firewall setup**: Simplifies networking complexities.
– **Job Scheduling**:
– **Managed Slurm Environment**: Offers scalable job scheduling with automatic configuration of essential components like login nodes and pre-set partitions.
– Topology-aware scheduling improves performance by optimally placing jobs to reduce latency.
– **Integrated Observability**:
– A dashboard providing insights into the cluster’s health, utilization, and performance.
– Simplifies diagnostics with features to quickly identify and replace faulty nodes (stragglers), enhancing performance.
– **Performance Enhancements**:
– Topology-aware job placement aligns workloads to minimize network contention during large operations and is specifically designed for distributed training jobs.
– **Call to Action**: Encourages users to experience Cluster Director and participate in the preview phase for first-hand experience.
These advancements in Cluster Director not only enhance efficiency for AI infrastructure management but also have notable implications for organizations considering hybrid cloud environments and those needing to optimize heavy computational workloads. Security and compliance professionals can take note of how simplifying infrastructure management can lead to fewer human errors and minimize exposures to potential vulnerabilities.