Source URL: https://cloud.google.com/blog/products/data-analytics/introducing-google-cloud-serverless-for-apache-spark-in-bigquery/
Source: Cloud Blog
Title: Google Cloud Serverless for Apache Spark: high-performance, unified with BigQuery
Feedly Summary: At Google Cloud, we’re committed to providing the most streamlined, powerful, and cost-effective production- and enterprise-ready serverless Spark experience. To that end, we’re thrilled to announce a significant evolution for Apache Spark on Google Cloud, with Google Cloud Serverless for Apache Spark.
Serverless Spark is now also generally available directly within the BigQuery experience. This deeply integrated experience brings the full power of Google Cloud Serverless for Apache Spark into the BigQuery unified data-to-AI platform, offering a unified developer experience in BigQuery Studio, seamless interoperability, and industry-leading price/performance.
Why Google Cloud Serverless for Apache Spark?
Apache Spark is an incredibly popular and powerful open-source engine for data processing, analytics and AI/ML. However, developers often get bogged down managing clusters, optimizing jobs, and troubleshooting, taking valuable time away from building business logic.
By simplifying your Spark experience, you can focus on deriving insights, not managing infrastructure. Google Cloud Serverless for Apache Spark (formerly Dataproc Serverless) addresses these challenges with:
On-demand Spark for reduced total cost of ownership (TCO):
Reduce TCO by up to 60% compared to alternatives.
No cluster management. Develop business logic in Spark for interactive, batch, and AI workloads, without worrying about infrastructure.
Pay only for the job’s runtime, not for environment spinup/teardown.
On-demand Spark environments, so no more long running, under-utilized clusters.
Exceptional performance:
Support for Lightning Engine (in Preview), a Spark processing engine with vectorized execution, intelligent caching, and optimized storage I/O, for up to 3.6x faster query performance on industry benchmarks*
Highly optimized BigQuery, Google Cloud Storage, and Spanner connectors
Full support (DDL, DML, schema evolution) for open data formats such as Apache Iceberg and Delta Lake
Openness and flexibility:
Full OSS compatibility for your existing Spark code and libraries
Support for Google Cloud native (BigQuery, Spanner, Bigtable), and open-source (Apache Iceberg, Apache Parquet, Delta Lake) data formats
Choice of language (Python, Java, Scala, R) and development environment (BigQuery Studio, Vertex AI Workbench, your own Jupyter or VS Code)
Gemini-powered productivity and assistance at every step:
Gemini-based PySpark code generation for developer assistance (in Preview)
Gemini Cloud Assist for troubleshooting recommendations (in Preview)
Easily distributed AI/ML:
Popular ML libraries like XGBoost, PyTorch, Transformers, and many more, all pre-packaged with Google-certified serverless Spark images, boosting productivity, improving startup times, and reducing potential security issues from custom image management
GPU acceleration for distributed training and inference workloads
Enterprise-grade security capabilities:
No SSH access to VMs
Encryption by default, including support for Customer Managed Encryption Keys (CMEK)
Custom Org Policies for setting and enforcing enterprise guardrails
End-user credential support to ensure traceability for all data access
Production ready capabilities:
Support for job isolation, so jobs do not contend for resources
Full control over Spark job configuration for Spark experts
On-demand Spark monitoring for all jobs, so you don’t have to set up your own Persistent History Server (PHS)
Easy deployment using Apache Airflow/Cloud Composer operators, or the orchestration/scheduling tool of your choice
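As an illustration of the Airflow/Cloud Composer path, a minimal DAG can submit a serverless Spark batch with the `DataprocCreateBatchOperator` from the Airflow Google provider package. This is a sketch only: the project, region, bucket, and schedule below are placeholder assumptions, not values from this post.

```python
# Hypothetical DAG sketch: submits a PySpark batch to Google Cloud Serverless
# for Apache Spark via the Airflow Google provider. Project, region, bucket,
# and schedule are placeholder values.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateBatchOperator,
)

with DAG(
    dag_id="serverless_spark_batch",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    submit_batch = DataprocCreateBatchOperator(
        task_id="submit_spark_batch",
        project_id="your-project",   # placeholder
        region="us-central1",        # placeholder
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://your-bucket/your_job.py",
            },
        },
        batch_id="daily-spark-batch-{{ ds_nodash }}",
    )
```

The same batch definition can be submitted from any scheduler that can call the Dataproc batches API, so the operator is a thin convenience rather than a hard dependency.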
A Unified Spark and BigQuery experience
Building on the power of serverless Spark, we’ve reimagined how you work with Spark and BigQuery, giving you the flexibility to use the right engine for the right job on a unified platform, with a shared notebook interface and a single copy of your data.
With the general availability of serverless Apache Spark in BigQuery, we’re bringing Apache Spark directly into the BigQuery unified data platform. This means you can now develop, run, and deploy Spark code interactively in BigQuery Studio, offering an alternative, scalable, OSS processing framework alongside BigQuery’s renowned SQL engine.
“We rely on machine learning for connecting our customers with the greatest travel experiences at the best prices. With Google Serverless for Apache Spark, our platform engineers save countless hours configuring, optimizing, and monitoring Spark clusters, while our data scientists can now spend their time on true value-added work like building new business logic. We can seamlessly interoperate between engines and use BigQuery, Spark and Vertex AI capabilities for our AI/ML workflows. The unified developer experience across Spark and BigQuery, with built-in support for popular OSS libraries like PyTorch, Tensorflow, Transforms etc., greatly reduces toil and allows us to iterate quickly.” – Andrés Sopeña Pérez, Head of Content Engineering, trivago
Key capabilities and benefits of Spark in BigQuery
Apart from all the features and benefits of Google Cloud Serverless for Apache Spark outlined above, Spark in BigQuery offers deep unification:
Unified developer experience in BigQuery Studio:
Develop SQL and Spark code side-by-side in BigQuery Studio notebooks.
Leverage Gemini-based PySpark Code Generation (Preview), which uses the intelligent context of your data to reduce hallucinations in generated code.
Use Spark Connect for remote connectivity to serverless Spark sessions.
Because Spark permissions are unified with default BigQuery roles, you can get started without needing additional permissions.
Unified data access and engine interoperability:
Powered by the BigLake metastore, Spark and BigQuery can operate on a single copy of your data, whether it’s BigQuery managed tables or open formats like Apache Iceberg. No more juggling separate security policies or data governance models across engines. Refer to the documentation on using BigLake metastore with Spark.
Additionally, all data access to BigQuery, for both native and OSS formats, is unified via the BigQuery Storage Read API. Reads from serverless Spark jobs via the Storage API are now available at no additional cost.
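For context, pointing a Spark session at BigLake metastore is typically done through Iceberg catalog properties on the session. The sketch below shows the general shape only; the catalog implementation class and property names should be verified against the current BigLake metastore documentation, and the catalog name, project, location, and warehouse bucket are placeholders.

```
# Illustrative Iceberg catalog configuration for BigLake metastore
# (property names and class are assumptions to verify against the docs)
spark.sql.catalog.blms = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.blms.catalog-impl = org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog
spark.sql.catalog.blms.gcp_project = your-project
spark.sql.catalog.blms.gcp_location = us-central1
spark.sql.catalog.blms.warehouse = gs://your-bucket/warehouse
```

With a catalog configured this way, both Spark SQL and BigQuery SQL resolve the same Iceberg tables, which is what makes the single-copy-of-data model work.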
Easy operationalization:
Collaborate with your team and integrate into your Git-based CI/CD workflows using BigQuery repositories.
Orchestrate your Spark jobs with the rest of your business logic using BigQuery Pipelines and Schedules.
In addition to functional unification, BigQuery spend-based committed use discounts (CUDs) now apply to all usage from serverless Spark jobs. For more information about serverless Spark pricing, please visit our pricing page.
How to get started with Spark in BigQuery Studio
Getting started is incredibly easy. Within BigQuery Studio, you can spin up a Spark session using one of the templates in the notebook.
Creating a default Spark session:
You can create a default Spark session with a single line of code, as shown below.
from google_spark_session.session.spark.connect import DataprocSparkSession

# This line creates a default serverless Spark session powered by
# Google Cloud Serverless for Apache Spark
spark = DataprocSparkSession.builder.getOrCreate()

# Now you can use the 'spark' variable to run your Spark code
# For example, reading a BigQuery table:
df = spark.read.format("bigquery") \
    .option("table", "your-project.your_dataset.your_table") \
    .load()
df.show()
Customizing your Spark session:
If you want to customize your session (for example, to use a different VPC network or a service account), you get full control over the session’s configuration, using existing session templates or by providing configurations inline. For detailed instructions on configuring your Spark sessions, reading from and writing to BigQuery, and more, please refer to the documentation.
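As a sketch of what inline customization can look like, the snippet below follows the general pattern of the Dataproc Spark Connect client: build a `Session` config and attach it to the builder. Treat the builder method name and `Session` field paths as assumptions to check against the documentation; the subnetwork and service account values are placeholders.

```python
# Hypothetical customization sketch: attach a Dataproc Session config to the
# session builder. Field paths and builder method name are assumptions to
# verify; subnetwork and service account are placeholder values.
from google.cloud.dataproc_v1 import Session
from google_spark_session.session.spark.connect import DataprocSparkSession

session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = "your-subnet"
session_config.environment_config.execution_config.service_account = (
    "sa@your-project.iam.gserviceaccount.com"
)

spark = (
    DataprocSparkSession.builder
    .dataprocSessionConfig(session_config)  # per the client library's pattern; verify
    .getOrCreate()
)
```

Session templates offer the same knobs declaratively, which is generally preferable when many users share one standard configuration.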
And that’s it, you are now ready to develop your business logic using the Spark session.
The bigger picture: A unified and open data cloud
With Google Cloud Serverless for Apache Spark and its new, deep integration with BigQuery, we’re breaking down barriers between powerful analytics engines, enabling you to choose the best tool for your specific task, all within a cohesive and managed environment.
We invite you to experience the power and simplicity of Google Cloud Serverless for Apache Spark and its new, deep integration with BigQuery.
Explore Google Cloud Serverless for Apache Spark
Open BigQuery Studio and try one of the Spark templates
Read the Documentation
Watch the demo
We are incredibly excited to see what you will build. Stay tuned for more innovations as we continue to enhance Google Cloud Serverless for Apache Spark and its integrations across the Google Cloud ecosystem.
* The queries are derived from the TPC-H standard and as such are not comparable to published TPC-H standard results, as these runs do not comply with all requirements of the TPC-H standard specification.
AI Summary and Description: Yes
**Summary**: The text announces the introduction of Google Cloud Serverless for Apache Spark, which offers a streamlined experience for developers by eliminating the need for cluster management and providing an integrated environment within BigQuery. This innovation significantly enhances productivity in AI/ML workflows and reduces the total cost of ownership while ensuring enterprise-grade security features.
**Detailed Description**: The announcement outlines several major points regarding Google Cloud’s new Serverless for Apache Spark:
– **Serverless Experience**:
– Removes the complexity of cluster management, allowing developers to focus on building applications rather than managing infrastructure.
– On-demand Spark environments address costs by only charging for job runtime.
– **Cost Efficiency**:
– Promises up to 60% reduction in Total Cost of Ownership (TCO) compared to traditional Spark setups.
– Users pay solely for the execution time of their jobs without incurring expenses during setup.
– **Performance Improvements**:
– Introduction of the Lightning Engine, which offers up to 3.6 times faster query processing.
– Optimized connectors to BigQuery and Google Cloud Storage.
– **Flexibility and Openness**:
– Support for a variety of programming languages (Python, Java, Scala, R).
– Compatibility with open-source data formats, enabling seamless transition for existing workloads.
– **Integration with BigQuery**:
– The service is integrated into the BigQuery platform, providing a unified experience for analytics and data processing.
– Users can develop, run, and deploy Spark applications directly within BigQuery.
– **AI/ML Enhancement**:
– Features Gemini-powered components for code generation and error troubleshooting.
– Pre-packaged machine learning libraries to simplify the usage of popular ML frameworks.
– **Security Features**:
– Enterprise-grade security measures including built-in encryption and organization policies for data access management.
– Eliminates the need for SSH access to virtual machines, enhancing operational security.
– **Operationalization**:
– Enhanced collaboration through Git-based CI/CD workflows and capabilities for job orchestration.
– **Getting Started**:
– Users can easily initiate a Spark session within BigQuery Studio with minimal coding needed.
– **Future Vision**:
– A vision of creating a unified open data cloud where users benefit from the combined capabilities of Spark and BigQuery.
– **Bigger Picture**:
– This new service represents a shift towards integrating powerful analytics tools and promoting an easier workflow for data-driven projects.
**Key Implications for Security and Compliance Professionals**:
– Emphasis on enterprise-grade security and compliance within cloud environments is crucial, as organizations increasingly rely on cloud-based services.
– The removal of cluster management reduces the opportunity for misconfigurations and vulnerabilities associated with manual settings, thus enhancing security postures.
This development not only streamlines data analytics but also aligns with current trends emphasizing cost-saving, efficiency, and enhanced security in cloud computing solutions, making it particularly relevant for professionals in information and cloud security domains.