Introduction to Spark with Python (SPK103)

Course Length: 3 days

Delivery Methods: Available as private class only

Course Overview

This Introduction to Spark with Python course provides a comprehensive overview of Apache Spark, a powerful open-source framework for big data processing. Designed for developers and data professionals, this course covers the foundational concepts of Spark, its ecosystem, and its application for large-scale data analytics using Python.

The course begins with an Introduction to Spark, exploring the motivations behind Spark, its components, and how it compares to Hadoop. You will learn how to acquire, install, and configure Spark, and get hands-on experience with the Spark Shell and SparkContext.

Next, in RDDs and Spark Architecture, you will dive into Resilient Distributed Datasets (RDDs), understanding their concepts, lifecycle, and the importance of lazy evaluation. You will learn about RDD partitioning and transformations and work with RDDs to perform various data processing tasks using functions like map and filter.
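
As a rough illustration of the lazy evaluation idea mentioned above, the sketch below uses Python's built-in map and filter, which also defer work until results are requested. This is plain Python rather than Spark code, and the names used (numbers, traced_square, calls) are made up for the illustration; the Spark RDD API itself is what the course covers.

```python
# Plain-Python analogy for lazy evaluation (an assumption for illustration, not Spark code).
calls = []  # records every time the mapping function actually runs


def traced_square(x):
    """Square x and record that the function was invoked."""
    calls.append(x)
    return x * x


numbers = range(1, 6)

# Building the pipeline does no work yet: map() and filter() return lazy iterators.
squares = map(traced_square, numbers)
even_squares = filter(lambda v: v % 2 == 0, squares)
calls_before = len(calls)  # nothing has been computed at this point

# Consuming the iterator plays the role of an "action" and triggers the computation.
result = list(even_squares)
calls_after = len(calls)
```

Spark transformations such as map and filter behave in a comparable way: they only describe a computation, and an action such as collect() or count() is what triggers it.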

The Spark SQL, DataFrames, and DataSets module introduces Spark's powerful SQL capabilities, teaching you how to create and manipulate DataFrames and DataSets, load and save data in various formats (JSON, CSV, Parquet, etc.), and execute SQL-based queries. You'll also explore the differences between DataFrames and RDDs, as well as advanced techniques like mapping and splitting.

In Shuffling Transformations and Performance, you'll learn about key Spark operations, such as grouping, reducing, and joining, and understand how shuffling impacts performance. This module also covers the Catalyst Query Optimizer and the Tungsten Optimizer, providing insights into optimizing query plans and improving performance.

Performance Tuning focuses on optimizing Spark applications, teaching you how to effectively use caching, minimize shuffling, and leverage broadcast variables and accumulators. You'll gain practical performance guidelines to maximize the efficiency of your Spark jobs.

The module on Creating Standalone Applications walks you through building and deploying Spark applications, from configuring a SparkSession to using different cluster managers like Standalone, YARN, and Mesos. You'll learn about the application lifecycle, logging, and debugging to ensure your applications run smoothly in production environments.

Finally, in Spark Streaming, you'll explore the fundamentals of real-time data processing with Spark. You'll learn the basics of Structured Streaming, set up continuous applications, and consume and process streaming data from sources like Kafka.

By the end of this Spark with Python course, you will be able to build, optimize, and deploy data processing applications using Apache Spark, covering both large-scale batch analytics and real-time data streaming.

Course Benefits

Understand the need for Spark in data processing
Understand the Spark architecture and how it distributes computations to cluster nodes
Be familiar with basic installation / setup / layout of Spark
Use the Spark shell for interactive and ad-hoc operations
Understand RDDs (Resilient Distributed Datasets), and data partitioning, pipelining, and computations
Understand and use RDD operations such as map(), filter(), and others
Understand and use Spark SQL and the DataFrame API
Understand DataFrame capabilities, including the Catalyst query optimizer and Tungsten memory/CPU optimizations
Be familiar with performance issues, and use DataFrames and Spark SQL for efficient computations
Understand Spark’s data caching and use it to avoid recomputing data that is reused
Write/run standalone Spark programs with the Spark API
Use Spark Structured Streaming to process streaming (real-time) data
Ingest streaming data from Kafka, and process via Spark Structured Streaming
Understand performance implications and optimizations when using Spark

Course Outline

Introduction to Spark
1. Overview, Motivations, Spark Systems
2. Spark Ecosystem
3. Spark vs. Hadoop
4. Acquiring and Installing Spark
5. The Spark Shell, SparkContext
RDDs and Spark Architecture
1. RDD Concepts, Lifecycle, Lazy Evaluation
2. RDD Partitioning and Transformations
3. Working with RDDs - Creating and Transforming (map, filter, etc.)
Spark SQL, DataFrames, and DataSets
1. Overview
2. SparkSession, Loading/Saving Data, Data Formats (JSON, CSV, Parquet, text ...)
3. Introducing DataFrames (Creation and Schema Inference)
4. Supported Data Formats (JSON, Text, CSV, Parquet)
5. Working with the DataFrame (untyped) Query DSL (Column, Filtering, Grouping, Aggregation)
6. SQL-based Queries
7. Mapping and Splitting (flatMap(), explode(), and split())
8. DataFrames vs. RDDs
Shuffling Transformations and Performance
1. Grouping, Reducing, Joining
2. Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
3. Exploring the Catalyst Query Optimizer (explain(), Query Plans, Issues with lambdas)
4. The Tungsten Optimizer (Binary Format, Cache Awareness, Whole-Stage Code Gen)
Performance Tuning
1. Caching - Concepts, Storage Type, Guidelines
2. Minimizing Shuffling for Increased Performance
3. Using Broadcast Variables and Accumulators
4. General Performance Guidelines
Creating Standalone Applications
1. Core API, SparkSession.Builder
2. Configuring and Creating a SparkSession
3. Building and Running Applications - sbt/build.sbt and spark-submit
4. Application Lifecycle (Driver, Executors, and Tasks)
5. Cluster Managers (Standalone, YARN, Mesos)
6. Logging and Debugging
Spark Streaming
1. Introduction and Streaming Basics
2. Streaming Introduction
3. Structured Streaming (Spark 2+)
4. Continuous Applications
5. Table Paradigm, Result Table
6. Steps for Structured Streaming
7. Sources and Sinks
8. Consuming Kafka Data
9. Kafka Overview
10. Structured Streaming - "kafka" format
11. Processing the Stream

Class Materials

Each student will receive a comprehensive set of materials, including course notes and all the class examples.

Class Prerequisites

Experience in the following is required for this Spark class:

Working knowledge of some programming language. No Python experience necessary.

Live Private Class

Private Class for your Team
Live training
Online or On-location
Customizable
Expert Instructors
