Practical Apache Spark for Data Pipelines
Updated June 2026

Class Duration
21 hours of live training delivered over 3-5 days.

Student Prerequisites

Target Audience
Data engineers, Python developers, and data professionals who need to develop, manage, and optimize batch and streaming data pipelines with Apache Spark 4 on GCP Dataproc serverless. Teams who also need to schedule and orchestrate these pipelines can continue to Apache Airflow Programming: Developing, Configuring, and Automating Workflows.

Description
The course equips participants with practical skills to develop, manage, and optimize Apache Spark 4 pipelines on GCP Dataproc serverless.

Learning Outcomes
By the end, attendees will understand Spark batch and streaming use cases; master its execution model, Spark Connect architecture, and core data structures, including the VARIANT type for semi-structured data; and build reusable, performance-tuned pipelines for diverse data workloads.

Training Materials
Comprehensive courseware is distributed online at the start of class. All students receive a downloadable MP4 recording of the training.

Software Requirements
Students will need access to a GCP account with Dataproc serverless configured. If students are unable to configure access, a cloud environment can be provided.

Training Topics
Spark Overview
Spark Architecture and Use-Cases
Core Data Structures
Spark Execution Model
Batch and Streaming Pipelines
Advanced Features
Performance Tuning
Case Study and Wrap-up