HDP Developer: Apache Spark 2.3 (DEV-343)

This course introduces the Apache Spark distributed computing engine and is suitable for developers, data analysts, architects, technical managers, and anyone who needs to use Spark in a hands-on manner. It is based on the Spark 2.x release. The course provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g. RDDs and the distributed compute engine) as well as higher-level constructs that provide a simpler and more capable interface. It includes in-depth coverage of Spark SQL, DataFrames, and DataSets, which are now the preferred programming API, including exploration of possible performance issues and strategies for optimization. The course also covers more advanced capabilities such as the use of Spark Streaming to process streaming data and integration with the Kafka server.

Student Testimonials

The instructor did a great job; from experience, this subject can be a bit dry to teach, but he was able to keep it very engaging and made it much easier to focus. Student
Excellent presentation skills, subject matter knowledge, and command of the environment. Student
Instructor was outstanding. Knowledgeable, presented well, and class timing was perfect. Student


Prerequisites


Students should be familiar with programming principles and have previous experience in software development using Scala. Previous experience with data streaming, SQL, and HDP is also helpful, but not required.

Detailed Class Syllabus


DAY 1: Scala Ramp Up, Introduction to Spark


OBJECTIVES
Scala Introduction
Working with: Variables, Data Types, and Control Flow
The Scala Interpreter
Collections and their Standard Methods (e.g. map())
Working with: Functions, Methods, and Function Literals
Define the Following as they Relate to Scala: Class, Object, and Case Class (see the sketch after this day's labs)
Overview, Motivations, Spark Systems
Spark Ecosystem
Spark vs. Hadoop
Acquiring and Installing Spark
The Spark Shell, SparkContext
LABS
Setting Up the Lab Environment
Starting the Scala Interpreter
A First Look at Spark
A First Look at the Spark Shell

DAY 2: RDDs and Spark Architecture, Spark SQL, DataFrames and DataSets


OBJECTIVES
RDD Concepts, Lifecycle, Lazy Evaluation
RDD Partitioning and Transformations
Working with RDDs Including: Creating and Transforming
An Overview of RDDs
SparkSession, Loading/Saving Data, Data Formats
Introducing DataFrames and DataSets (see the sketch after this day's labs)
Identify Supported Data Formats
Working with the DataFrame (untyped) Query DSL
SQL-based Queries
Working with the DataSet (typed) API
Mapping and Splitting
DataSets vs. DataFrames vs. RDDs
LABS
RDD Basics
Operations on Multiple RDDs
Data Formats
Spark SQL Basics
DataFrame Transformations
The DataSet Typed API
Splitting Up Data

DAY 3: Shuffling, Transformations and Performance, Performance Tuning


OBJECTIVES
Working with: Grouping, Reducing, Joining
Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
Exploring the Catalyst Query Optimizer
The Tungsten Execution Engine
Discuss Caching, Including: Concepts, Storage Levels, Guidelines
Minimizing Shuffling for Increased Performance
Using Broadcast Variables and Accumulators (see the sketch after this day's labs)
General Performance Guidelines
LABS
Exploring Group Shuffling
Seeing Catalyst at Work
Seeing Tungsten at Work
Working with Caching, Joins, Shuffles, Broadcasts, Accumulators
General Broadcast Guidelines

DAY 4: Creating Standalone Applications and Spark Streaming


OBJECTIVES
Core API, SparkSession.Builder
Configuring and Creating a SparkSession
Building and Running Applications
Application Lifecycle (Driver, Executors, and Tasks)
Cluster Managers (Standalone, YARN, Mesos)
Logging and Debugging
Introduction and Streaming Basics
Spark Streaming (Spark 1.0+)
Structured Streaming (Spark 2+)
Consuming Kafka Data (see the sketch after this day's labs)
LABS
Spark Job Submission
Additional Spark Capabilities
Spark Streaming
Spark Structured Streaming
Spark Structured Streaming with Kafka