Cloud Blog: Distributed data preprocessing with GKE and Ray: Scaling for the enterprise

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/preprocessing-large-datasets-with-ray-and-gke/
Source: Cloud Blog
Title: Distributed data preprocessing with GKE and Ray: Scaling for the enterprise

Feedly Summary: The exponential growth of machine learning models brings with it ever-increasing datasets. This data deluge creates a significant bottleneck in the Machine Learning Operations (MLOps) lifecycle, as traditional data preprocessing methods struggle to scale. The preprocessing phase, which is critical for transforming raw data into a format suitable for model training, can become a major roadblock to productivity.
To address this challenge, in this article, we propose a distributed data preprocessing pipeline that leverages the power of Google Kubernetes Engine (GKE), a managed Kubernetes service, and Ray, a distributed computing framework for scaling Python applications. This combination allows us to efficiently preprocess large datasets, handle complex transformations, and accelerate the overall ML workflow.

The data preprocessing imperative
The data preprocessing phase in MLOps is foundational, directly impacting the quality and performance of machine learning models. Preprocessing includes tasks such as data cleaning, feature engineering, scaling, and encoding, all of which are essential for ensuring that models learn effectively from the data.
When data preprocessing involves a large number of operations, it can become a bottleneck that slows down the overall pipeline. In the following example, we walk through a preprocessing use case that includes uploading a large number of images to a Google Cloud Storage bucket. This involves up to 140,000 operations that, when executed serially, take over 8 hours to complete.
Dataset
For this example, we use a pre-crawled dataset consisting of 20,000 products.

Data preprocessing steps
The dataset has 15 columns. The columns of interest are: ‘uniq_id’, ‘product_name’, ‘description’, ‘brand’, ‘product_category_tree’, ‘image’, and ‘product_specifications’.
Besides dropping null values and duplicates, we perform the following steps on the relevant columns (a rough pandas sketch follows the list):

description: Clean up the product description by removing stop words and punctuation.

product_category_tree: Split the category tree into separate columns.

product_specifications: Parse the product specifications into key:value pairs.

image: Parse the list of image URLs, validate each URL, and download the image.
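
To make the list concrete, here is a rough pandas sketch of these steps. The column names come from the dataset described above; the file name, the stop-word source (scikit-learn’s built-in list), and the ‘ >> ’ category separator are illustrative assumptions rather than details confirmed by the post.

```python
import ast
import string

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical file name; the article does not name the CSV.
df = pd.read_csv("flipkart_products.csv")

# Drop null values and duplicates, as described above.
df = df.dropna(subset=["description", "image"]).drop_duplicates(subset=["uniq_id"])

def clean_description(text: str) -> str:
    """Lowercase, strip punctuation, and remove English stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)

df["description"] = df["description"].apply(clean_description)

# Split the category tree into separate columns, assuming values like
# '["Home >> Kitchen >> Cookware"]'.
df["product_category_tree"] = df["product_category_tree"].str.strip('[]"')
categories = df["product_category_tree"].str.split(" >> ", expand=True)
df = df.join(categories.add_prefix("category_level_"))

# Parse the image column, assumed to hold a JSON-like list of URLs.
df["image_urls"] = df["image"].apply(ast.literal_eval)
```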

Now, consider the scenario where a preprocessing task involves extracting multiple image URLs from each row of a large dataset and uploading the images to a Cloud Storage bucket. This might sound straightforward, but with a dataset that contains 20,000+ rows, each with potentially up to seven URLs, the process can become incredibly time-consuming when executed serially in Python. In our experience, such a task can take upwards of eight hours to complete!
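For concreteness, a naive serial version of that image-upload step might look like the following; the bucket name is hypothetical, and df["image_urls"] is the parsed URL list from the earlier sketch.

```python
import requests
from google.cloud import storage

bucket = storage.Client().bucket("my-preprocessed-images")  # hypothetical name

def download_and_upload(url: str, dest_name: str) -> None:
    """Fetch one image over HTTP and write it to Cloud Storage."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    bucket.blob(dest_name).upload_from_string(
        resp.content, content_type=resp.headers.get("Content-Type", "image/jpeg")
    )

# One network round trip at a time: up to ~140,000 sequential iterations.
for _, row in df.iterrows():
    for j, url in enumerate(row["image_urls"]):
        download_and_upload(url, f"images/{row['uniq_id']}_{j}.jpg")
```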
Solution: Implement parallelism for scalability
To tackle this scalability issue, we turn to parallelism. By breaking the dataset into smaller chunks and distributing the processing across multiple threads, we can drastically reduce the overall execution time. We chose to use Ray as our distributed computing platform.
Ray: Distributed computing simplified
Ray is a powerful framework designed for scaling Python applications and libraries. It provides a simple API for distributing computations across multiple workers, making it a strong choice for implementing parallel data preprocessing pipelines.
In our specific use case, we leverage Ray to distribute the Python function responsible for downloading images from URLs to Cloud Storage buckets across multiple Ray workers. Ray’s abstraction layer handles the complexities of worker management and communication, allowing us to focus on the core preprocessing logic.
Ray’s core capabilities include:

Task parallelism: Ray enables arbitrary functions to be executed asynchronously as tasks on separate Python workers, providing a straightforward way to parallelize our image download process (see the minimal example after this list).

Actor model: Ray’s “actors” offer a way to encapsulate stateful computations, making them suitable for complex preprocessing scenarios where shared state might be necessary.

Simplified scaling: Ray seamlessly scales from a single machine to a full-blown cluster, making it a flexible solution for varying data sizes and computational needs.
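
For readers new to Ray, the task-parallelism API is small; here is a minimal, self-contained example (the function body is just a placeholder, not the article’s actual preprocessing logic):

```python
import ray

ray.init()  # on GKE, this would connect to the existing Ray cluster

@ray.remote
def process_chunk(chunk_id: int) -> str:
    # Placeholder for real preprocessing work.
    return f"chunk {chunk_id} done"

# Tasks launch asynchronously; ray.get blocks until all results arrive.
futures = [process_chunk.remote(i) for i in range(4)]
print(ray.get(futures))
```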

Implementation details
We ran the data preprocessing on GKE using the accelerated platforms repository, which provides the code to build your GKE cluster and configure prerequisites, such as running Ray on the cluster, so that the preprocessing can run on the cluster as a container. The job consisted of three phases (a simplified end-to-end sketch follows the steps):
1. Dataset partitioning: We divided the large dataset into smaller chunks.
The 20,000 rows of input data were divided into 101 chunks of roughly 199 rows each. Each chunk is assigned to a Ray task, which is executed on a Ray worker.
2. Ray task distribution: We created Ray remote tasks. Ray creates and manages the workers and distributes the tasks across them.

3. Parallel data processing: The Ray tasks prepare the data and download the images to Cloud Storage concurrently.
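
Putting the three phases together, a simplified version of the job might look like the sketch below. The chunk arithmetic follows the article (20,000 rows split into 101 chunks of roughly 199 rows); the file and bucket names and the function body are illustrative assumptions, not the repository’s actual code.

```python
import ast

import numpy as np
import pandas as pd
import ray
import requests
from google.cloud import storage

ray.init()  # inside the GKE cluster, this attaches to the running Ray head

df = pd.read_csv("flipkart_products.csv")  # hypothetical file name
df["image_urls"] = df["image"].apply(ast.literal_eval)  # JSON-like URL lists

# Phase 1: partition the dataset into 101 chunks of roughly 199 rows each.
chunks = np.array_split(df, 101)

# Phase 2: a remote task per chunk; Ray schedules them onto its workers.
@ray.remote
def preprocess_chunk(chunk: pd.DataFrame, bucket_name: str) -> int:
    """Prepare one chunk and upload its images to Cloud Storage."""
    bucket = storage.Client().bucket(bucket_name)
    uploaded = 0
    for _, row in chunk.iterrows():
        for j, url in enumerate(row["image_urls"]):
            resp = requests.get(url, timeout=10)
            if resp.ok:  # validate the URL before storing the image
                blob = bucket.blob(f"images/{row['uniq_id']}_{j}.jpg")
                blob.upload_from_string(resp.content)
                uploaded += 1
    return uploaded

# Phase 3: run all chunks concurrently and wait for every task to finish.
counts = ray.get(
    [preprocess_chunk.remote(c, "my-preprocessed-images") for c in chunks]
)
print(f"uploaded {sum(counts)} images")
```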

Results
By leveraging Ray and GKE, we achieved a dramatic reduction in processing time. The preprocessing time for 20,000 rows decreased from over 8 hours to just 17 minutes, representing a speedup of approximately 23x. If the data size increases, you can adjust the batch size and use Ray autoscaling to achieve similar performance. 
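As a small worked example of that batch-size adjustment, the chunk count can be derived from a target rows-per-task figure instead of being hard-coded; the 199-row target below is taken from this run, and larger datasets simply produce more Ray tasks for the autoscaler to absorb.

```python
import math

TARGET_ROWS_PER_TASK = 199  # tune for your workload

def num_chunks(n_rows: int, batch_size: int = TARGET_ROWS_PER_TASK) -> int:
    """Partitions needed so each Ray task gets at most batch_size rows."""
    return math.ceil(n_rows / batch_size)

print(num_chunks(20_000))   # 101, matching this run
print(num_chunks(100_000))  # 503 tasks for a 5x larger dataset
```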
Data preprocessing challenges no more
Distributed data preprocessing with GKE and Ray provides a robust and scalable solution for addressing the data preprocessing challenges faced by modern ML teams. By leveraging the power of parallelism and cloud infrastructure, we can accelerate data preparation, reduce bottlenecks, and empower data scientists and ML engineers to focus on model development and innovation. To learn more, run the deployment that demonstrates this data preprocessing use case using Ray on a GKE cluster.

AI Summary and Description: Yes

Summary: The text introduces a distributed data preprocessing pipeline using Google Kubernetes Engine (GKE) and Ray to alleviate bottlenecks in the Machine Learning Operations (MLOps) lifecycle caused by traditional methods. The proposed solution emphasizes scaling through parallel processing, achieving a significant reduction in preprocessing time from over 8 hours to 17 minutes.

Detailed Description: The article discusses the challenges faced in the data preprocessing phase of the MLOps lifecycle, emphasizing how the exponential growth of datasets can hinder productivity. It highlights the need for efficient data preprocessing methods given the volume of operations involved.

– **Key Challenges**:
  – Traditional data preprocessing struggles to manage large datasets.
  – Preprocessing is critical for machine learning model performance and includes cleaning, feature engineering, scaling, and encoding.
  – Serial processing of extensive datasets can lead to significant bottlenecks, as evidenced by an example that outlines the processing of a dataset of 20,000 products requiring 140,000 operations and taking over 8 hours.

– **Proposed Solution**:
  – Implement a distributed data preprocessing pipeline using GKE and Ray.
  – **Ray’s Capabilities**:
    – Task parallelism allows asynchronous execution of tasks across multiple workers, improving efficiency.
    – The actor model in Ray handles stateful computations, essential for complex preprocessing tasks.
    – Ray’s flexibility enables seamless scaling from single machines to clusters to accommodate varying data sizes.

– **Implementation Steps**:
  – **Dataset Partitioning**: The original dataset is divided into smaller chunks (roughly 199 rows each) and processed in parallel.
  – **Task Distribution**: Ray orchestrates the creation and management of worker tasks to efficiently execute the preprocessing logic.
  – **Parallel Data Processing**: Images are downloaded to Cloud Storage via concurrent Ray tasks.

– **Results**:
  – The combination of GKE and Ray significantly reduces the preprocessing time from over 8 hours to just 17 minutes for 20,000 rows, demonstrating a 23x speedup.
  – Scalability features of Ray allow easy adjustments in batch sizes and the ability to use autoscaling for handling varying dataset volumes.

– **Conclusion**:
  – The article stresses the importance of distributed data preprocessing in overcoming challenges that modern ML teams face.
  – Utilizing parallelism and cloud infrastructure empowers data scientists and ML engineers to focus more on developing models rather than being bogged down by data preparation bottlenecks. This approach promotes innovation and efficiency in the MLOps lifecycle.

This content is highly relevant for professionals in AI and cloud-based environments looking to optimize their data preprocessing workflows and reduce time-to-insight in machine learning projects.