Cloud Blog: Moloco: 10x faster model training times with TPUs on Google Kubernetes Engine

Source URL: https://cloud.google.com/blog/products/containers-kubernetes/moloco-uses-gke-and-tpus-for-ml-workloads/
Source: Cloud Blog
Title: Moloco: 10x faster model training times with TPUs on Google Kubernetes Engine

Feedly Summary: In today’s congested digital landscape, businesses of all sizes face the challenge of optimizing their marketing budgets. They must find ways to stand out amid the bombardment of messages vying for potential customers’ attention. Moreover, they grapple with rising customer acquisition costs and dwindling retention rates, impeding their profitability.
Adding to this complexity is the abundance of consumer data, which businesses often struggle to harness effectively to target the right audience. To address these challenges, companies are seeking data-driven approaches to enhance their advertising effectiveness, to help ensure their continued relevance and profitability.
Moloco offers AI-powered advertising solutions that drive user acquisition, retention, and monetization. Moloco Ads, its demand-side platform (DSP), uses customers’ unique first-party data to help them target and acquire high-value users based on real-time consumer behavior, ultimately delivering higher conversion rates and return on investment.
To meet this demand, Moloco leverages predictions from a dozen deep neural networks, while continuously designing and evaluating new models. Each day, the platform ingests 10 petabytes of data and processes bid requests at a peak rate of 10.5 million queries per second (QPS).
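To give a sense of the scale behind the quoted peak rate, the arithmetic below converts it into a daily ceiling. This is only a rough sketch: the post states a peak rate, not an average, so the result is an upper bound that assumes the peak held for a full day, which the source does not claim.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Peak bid-request rate quoted in the post (queries per second).
PEAK_QPS = 10_500_000


def max_requests_per_day(qps: int) -> int:
    """Upper bound on daily requests if the given rate were sustained all day."""
    return qps * SECONDS_PER_DAY


upper_bound = max_requests_per_day(PEAK_QPS)
print(f"At most {upper_bound:,} bid requests per day")  # 907,200,000,000
```

Actual daily traffic would be lower than this figure, since load typically falls below peak for much of the day.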
Moloco has seen tremendous growth over the last three years, with its business growing over 8X and multiple customers spending more than $50 million annually. This rapid growth required an infrastructure that could handle massive data processing and real-time ML predictions while remaining cost-effective. As Moloco’s models grew in complexity, training times increased, hindering productivity and innovation. Separately, the Moloco team realized that it also needed to optimize serving efficiency to scale low-latency ad experiences for users across the globe.

Training complex ML models with GKE
After evaluating multiple cloud providers and their solutions, Moloco opted for Google Cloud for its scalability, flexibility, and robust partner ecosystem. The infrastructure provided by Google Cloud aligned with Moloco’s requirements for handling its rapidly growing data and machine learning workloads that are instrumental to optimizing customers’ advertising performance.
Google Kubernetes Engine (GKE) was a primary reason Moloco selected Google Cloud over other cloud providers. As Moloco discovered, GKE is more than a container orchestration tool; it’s a gateway to harnessing the full potential of AI and ML. GKE provides scalability and performance optimization tools for diverse ML workloads and supports a wide range of frameworks, allowing Moloco to customize the platform to its specific needs.
GKE serves as a foundation for a unified AI/ML platform: it integrates with other Google Cloud services to provide a robust environment for the data processing and distributed computing that underpin Moloco’s complex AI and ML tasks. GKE’s ML data layer offers the high-throughput storage that read-heavy workloads require. Features like the cluster autoscaler, node auto-provisioning, and Pod autoscalers help allocate resources efficiently.
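As an illustration of how Pod autoscaling arrives at a replica count, here is a minimal sketch of the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The function name, the optional bounds, and the sample numbers are illustrative assumptions rather than Moloco’s configuration, and the sketch leaves out the tolerance band and stabilization window that the real controller applies.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 100) -> int:
    """Replica count from the basic Horizontal Pod Autoscaler rule.

    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to [min_replicas, max_replicas].
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical example: 4 Pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90.0, 60.0))  # 6

# Load drops to 30% with 6 Pods running: scale in to 3.
print(desired_replicas(6, 30.0, 60.0))  # 3
```

When the extra Pods do not fit on existing nodes, the cluster autoscaler and node auto-provisioning are the pieces that add capacity for them.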
“Scaling our infrastructure as Moloco’s Ads business grew exponentially was a huge challenge. GKE’s autoscaling capabilities enabled the engineering team to focus on development without spending a ton of effort on operations.” – Sechan Oh, Director of Machine Learning, Moloco
Shortly after migrating to Google Cloud, Moloco began using GKE for model training. However, Moloco quickly found that traditional CPUs were not competitive at its scale, in terms of both cost and velocity. GKE’s ability to autoscale on multi-host Tensor Processing Units (TPUs), Google’s specialized processors for machine learning workloads, was critical to Moloco’s success. It allowed Moloco to harness TPUs at scale, significantly improving training speed and efficiency.
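For readers unfamiliar with how a training job asks GKE for a multi-host TPU slice, the sketch below builds a Kubernetes Job manifest as a Python dictionary. The post does not say which TPU generation, topology, or job layout Moloco uses, so the accelerator type (`tpu-v4-podslice`), the topology (`2x2x4`), the four chips per host, the job name, and the container image are all assumptions chosen for illustration. The node selector keys and the `google.com/tpu` resource name follow GKE’s documented conventions. A real multi-host job also needs a headless Service so the workers can discover each other, which is omitted here.

```python
import json
from math import prod


def tpu_training_job(name: str, image: str, accelerator: str,
                     topology: str, chips_per_host: int) -> dict:
    """Build a Job manifest that runs one Pod per host of a TPU slice."""
    total_chips = prod(int(dim) for dim in topology.split("x"))
    if total_chips % chips_per_host != 0:
        raise ValueError("topology must divide evenly across hosts")
    hosts = total_chips // chips_per_host  # one Pod per TPU host
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completionMode": "Indexed",  # gives each worker a stable index
            "completions": hosts,
            "parallelism": hosts,
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "nodeSelector": {
                        "cloud.google.com/gke-tpu-accelerator": accelerator,
                        "cloud.google.com/gke-tpu-topology": topology,
                    },
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "resources": {
                            "limits": {"google.com/tpu": chips_per_host},
                        },
                    }],
                },
            },
        },
    }


# Hypothetical values: none of these come from the post.
job = tpu_training_job(
    name="example-train",
    image="example.com/trainer:latest",
    accelerator="tpu-v4-podslice",
    topology="2x2x4",
    chips_per_host=4,
)
print(json.dumps(job, indent=2))
```

With autoscaling enabled, Pods like these that cannot be scheduled are what prompt the cluster to add matching TPU nodes, and idle nodes can be removed once the job finishes.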
Moloco further leveraged GKE’s AI and ML capabilities to optimize the management of its compute resources, minimizing idle time and generating cost savings while improving performance. Notably, GKE empowered Moloco to scale its ML infrastructure to accommodate exponential business growth without straining its engineering team. This enabled Moloco’s engineers to concentrate on developing AI and ML software instead of managing infrastructure.
“The GKE team collaborated closely with us to enable auto scaling for multi host TPUs, which is a recently added feature. Their help has really enabled amazing performance on TPUs, reducing our cost per training job by 2-4 times.” – Kunal Kukreja, Senior Machine Learning Engineer, Moloco
In addition to training models on TPUs, Moloco also uses GPUs on GKE to deploy ML models into production. This lets the Moloco platform handle real-time inference requests effectively and benefit from GKE’s scalability and operational stability, enhancing performance and supporting more complex models.
Moloco collaborated closely with the Google Cloud team throughout the implementation process, leveraging their expertise and guidance. The Google Cloud team supported Moloco in implementing solutions that ensured a smooth transition and minimal disruption to operations. Specifically, Moloco worked with the Google Cloud team to migrate its ML workloads to GKE using the platform’s autoscaling and pod prioritization capabilities to optimize resource utilization and cost efficiency. Additionally, Moloco integrated Cloud TPUs into its training pipeline, resulting in significantly reduced training times for complex ML models. Furthermore, Moloco optimized its serving infrastructure with GPUs, ensuring low-latency ad experiences for its customers.
A powerful foundation for ML training and inference
Moloco’s collaboration with Google Cloud profoundly transformed its capacity for innovation.
“By harnessing Google Cloud’s solutions, such as GKE and Cloud TPU, Moloco dramatically reduced ML training times by up to tenfold.” – Sechan Oh, Director of Machine Learning, Moloco
This in turn facilitated swift model iteration and experimentation, empowering Moloco’s engineers to innovate with unprecedented speed and efficiency. Moreover, the scalability and performance of Google Cloud’s infrastructure enabled Moloco to manage increasingly intricate models and expansive datasets, to create and implement cutting-edge machine learning solutions. Notably, Moloco’s low-latency advertising experiences, bolstered by GPUs, fostered enhanced customer satisfaction and retention.
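To make the quoted figures concrete, the sketch below converts them into more familiar terms: a cost reduction of 2-4 times corresponds to saving 50-75% per training job, and a tenfold speedup multiplies the number of training runs that fit in a fixed time window. The 20-hour baseline run and the 40-hour window are hypothetical numbers chosen for illustration; the post does not report absolute training times.

```python
def cost_saving_fraction(reduction_factor: float) -> float:
    """Fraction of cost saved when cost falls by `reduction_factor` times."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return 1.0 - 1.0 / reduction_factor


def runs_per_window(window_hours: float, hours_per_run: float) -> int:
    """Number of complete sequential training runs that fit in the window."""
    return int(window_hours // hours_per_run)


# Cost per training job reduced by 2-4 times (figures from the post).
for factor in (2, 4):
    print(f"{factor}x cheaper -> {cost_saving_fraction(factor):.0%} saved")

# Hypothetical baseline: a 20-hour training run and a 40-hour window.
baseline_hours = 20.0
speedup = 10  # "up to tenfold" from the post, taken at its maximum
print(runs_per_window(40.0, baseline_hours))            # 2 runs before
print(runs_per_window(40.0, baseline_hours / speedup))  # 20 runs after
```

Because the post says "up to tenfold", the second figure is a best case rather than a typical result.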
Moloco’s success demonstrates the power of Google Cloud’s solutions to enable businesses to achieve their full potential. By leveraging GKE, Cloud TPU, and GPUs, Moloco was able to scale its infrastructure, accelerate its ML training, and deliver exceptional ad experiences to its customers. As Moloco continues to grow and innovate, Google Cloud will remain a critical partner in its success.
Meanwhile, GKE is transforming the AI and ML landscape by offering a blend of scalability, flexibility, cost-efficiency, and performance. And Google Cloud continues to invest in GKE so it can handle even the most demanding AI training workloads. For example, GKE now supports 65,000-node clusters, offering unmatched scale for training or inference. For more, watch this demo of 65,000 nodes on a single GKE cluster.

AI Summary and Description: Yes

**Summary:** The text discusses Moloco’s adoption of Google Cloud’s solutions, specifically Google Kubernetes Engine (GKE) and Cloud TPUs, to enhance its machine learning (ML) capabilities for advertising. It highlights how these technologies have enabled significant improvements in training times, scalability, and operational efficiency, thereby empowering Moloco to innovate rapidly and effectively manage complex ML workloads.

**Detailed Description:**
The provided text revolves around Moloco’s strategic decision to utilize Google Cloud for its advertising technology needs, particularly focusing on how this aligns with their requirements in the realm of machine learning and data processing. Here are the key points and their significance:

– **Business Context:**
– Businesses, which Moloco serves as customers, face challenges in optimizing marketing budgets, controlling customer acquisition costs, and retaining customers amid a crowded marketplace.
– There is a critical need for data-driven approaches, leveraging consumer data effectively for better targeting.

– **Adoption of AI-Powered Solutions:**
– Moloco has introduced AI-powered advertising solutions through its demand-side platform (DSP), utilizing first-party data for enhanced user acquisition and retention.
– The platform processes massive amounts of data, reflecting the increasing complexity and scale of their operations.

– **Infrastructure Selection:**
– After evaluating various cloud providers, Moloco chose Google Cloud for its scalability, flexibility, and supportive ecosystem.
– GKE played a central role in this choice, showcasing its capabilities that extend beyond simple container orchestration to serve as a robust foundation for AI and ML workloads.

– **Enhanced Performance Through GKE:**
– GKE not only offers scalability but also integrates well with various frameworks and services, facilitating efficient processing of large datasets.
– Specific features such as autoscalers significantly alleviate operational burdens, allowing engineers to focus more on development rather than infrastructure management.

– **Data Processing and Machine Learning:**
– Integrating Cloud TPUs into the training pipeline reduced training times by up to tenfold and cut the cost per training job by 2-4 times.
– Moloco serves its models on GPUs running on GKE to handle real-time inference requests, a crucial component of its low-latency advertising.

– **Collaboration with Google Cloud:**
– Moloco’s close collaboration with Google Cloud was pivotal in implementing these solutions, ensuring a seamless transition while minimizing disruption.
– Continued investment by Google Cloud in GKE capabilities is highlighted as a factor that will support future demands, indicative of ongoing advancements in the AI and ML landscape.

**Implications for Security and Compliance Professionals:**
– **Infrastructure Security Considerations:**
– Processing user data at this volume on cloud infrastructure such as Google Cloud raises data security and compliance questions that need examination, particularly around how user data is handled and protected.

– **AI Model Management:**
– With increasing complexity in AI models, secure management practices must be in place to ensure integrity, availability, and confidentiality throughout the machine learning lifecycle.

– **Cost vs. Security Tradeoff:**
– While seeking cost-efficiencies in cloud solutions, organizations must ensure that security measures do not become an afterthought, balancing operational effectiveness with robust data protection strategies.

In summary, the collaboration between Moloco and Google Cloud exemplifies the transformative potential of cloud-based frameworks in enhancing AI/ML capabilities for business operations, while also emphasizing the need for vigilance regarding security and compliance issues that accompany such advancements.