MS-20775 - Perform Data Engineering on Microsoft HDInsight

The main purpose of the course is to give students the ability to plan and implement big data workflows on HDInsight.

In addition to their professional experience, students who attend this course should have:
  • Programming experience using R, and familiarity with common R packages
  • Knowledge of common statistical methods and data analysis best practices
  • Basic knowledge of the Microsoft Windows operating system and its core functionality
  • Working knowledge of relational databases

Detailed Class Syllabus

Module 1: Getting Started with HDInsight

What is Big Data?
Introduction to Hadoop
Working with the MapReduce function
Introducing HDInsight
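Module 1 introduces the MapReduce model that underpins Hadoop. As a rough orientation (not part of the official courseware), the sketch below simulates the map, shuffle/sort, and reduce phases of a word count in plain Python; on an HDInsight cluster, Hadoop distributes these same phases across many nodes.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit (word, 1) pairs, as a Hadoop mapper does for each input split."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group sorted pairs by key and sum counts, like a Hadoop reducer."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data on HDInsight", "big data with Hadoop"]
# sorted() stands in for Hadoop's shuffle/sort between map and reduce
counts = dict(reduce_phase(map_phase(lines)))
# counts["big"] == 2
```

Here everything runs in one process; the point of Hadoop is that the map and reduce phases can run on different machines over data too large for any single one.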

Module 2: Deploying HDInsight Clusters

Identifying HDInsight cluster types
Managing HDInsight clusters by using the Azure portal
Managing HDInsight Clusters by using Azure PowerShell

Module 3: Authorizing Users to Access Resources

Non-domain-joined clusters
Configuring domain-joined HDInsight clusters
Managing domain-joined HDInsight clusters

Module 4: Loading data into HDInsight

Storing data for HDInsight processing
Using data loading tools
Maximizing value from stored data

Module 5: Troubleshooting HDInsight

Analyze HDInsight logs
YARN logs
Heap dumps
Operations Management Suite

Module 6: Implementing Batch Solutions

Apache Hive storage
HDInsight data queries using Hive and Pig
Operationalize HDInsight

Module 7: Design Batch ETL solutions for big data with Spark

What is Spark?
ETL with Spark
Spark performance
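Module 7 covers the extract-transform-load (ETL) pattern as implemented with Spark. As a minimal illustration of the pattern itself (the field names and data here are hypothetical, and plain Python stands in for Spark's distributed DataFrames), an ETL job reads raw records, cleans and filters them, and writes the result to a target store:

```python
import csv
import io

# Hypothetical raw feed; on HDInsight this would sit in Azure Blob storage.
RAW = "id,city,temp_f\n1,Seattle,52\n2,Austin,95\n3,Oslo,30\n"

def extract(text):
    """Extract: parse CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert Fahrenheit to Celsius, keep cities above 10 C."""
    out = []
    for r in rows:
        c = round((int(r["temp_f"]) - 32) * 5 / 9, 1)
        if c > 10:
            out.append({"city": r["city"], "temp_c": c})
    return out

def load(rows):
    """Load: a dict here; a Spark job would write partitioned output files."""
    return {r["city"]: r["temp_c"] for r in rows}

warehouse = load(transform(extract(RAW)))
```

Spark's contribution is not the pattern but the scale: each stage becomes a lazy, distributed transformation over partitioned data rather than an in-memory list.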

Module 8: Analyze Data with Spark SQL

Implementing iterative and interactive queries
Perform exploratory data analysis
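Module 8 is about iterative, interactive querying with Spark SQL. As a rough stand-in (using Python's built-in sqlite3 module rather than a Spark session, which is not shown in this outline), the workflow is: register tabular data, then refine aggregate queries interactively:

```python
import sqlite3

# In-memory table standing in for a Spark DataFrame registered as a view.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [("a", 1.0), ("a", 3.0), ("b", 10.0)])

# Exploratory step: aggregate per sensor, then inspect and refine the query.
rows = con.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
```

With Spark SQL the same SQL text runs over distributed DataFrames, and intermediate results can be cached so each refinement of the query avoids recomputing the whole pipeline.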

Module 9: Analyze Data with Hive and Phoenix

Implement interactive queries for big data with Interactive Hive
Perform exploratory data analysis by using Hive
Perform interactive processing by using Apache Phoenix

Module 10: Stream Analytics

Stream Analytics
Process streaming data from Stream Analytics
Managing Stream Analytics jobs

Module 11: Implementing Streaming Solutions with Kafka and HBase

Building and deploying a Kafka cluster
Publishing, consuming, and processing data using the Kafka cluster
Using HBase to store and query data
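Module 11's publish/consume/process flow can be pictured with a toy in-memory log (this class is purely illustrative, not a Kafka API; real code would use a Kafka client library against the cluster's brokers). A Kafka topic is essentially an append-only log that consumers read from an offset:

```python
class MiniTopic:
    """In-memory stand-in for a Kafka topic: append-only log, offset reads."""
    def __init__(self):
        self.log = []

    def publish(self, message):
        """Producer side: append a message to the log."""
        self.log.append(message)

    def consume(self, offset):
        """Consumer side: return messages from offset on, plus the
        next offset to resume from (Kafka consumers commit this)."""
        return self.log[offset:], len(self.log)

topic = MiniTopic()
for event in ("click", "view", "click"):
    topic.publish(event)

batch, next_offset = topic.consume(0)           # consumer starts at offset 0
clicks = sum(1 for m in batch if m == "click")  # downstream processing step
```

The offset-based model is why Kafka consumers can replay history or resume after a crash: the log keeps the data, and the consumer only tracks its position.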

Module 12: Develop big data real-time processing solutions with Apache Storm

Persist long-term data
Stream data with Storm
Create Storm topologies
Configure Apache Storm

Module 13: Create Spark Streaming Applications

Working with Spark Streaming
Creating Spark Structured Streaming Applications
Persistence and Visualization
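Module 13 centers on Spark Streaming's micro-batch model: an unbounded stream is processed as a sequence of small batches with state carried between them. The sketch below is a plain-Python simulation of that idea (the sensor values are invented; Spark would pull batches from a source such as Kafka or Event Hubs and manage the state for you):

```python
def micro_batches(stream, batch_size):
    """Chop a stream into fixed-size micro-batches, as Spark Streaming
    chops input into per-interval batches."""
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

# Hypothetical sensor readings arriving over time.
stream = [3, 5, 2, 8, 1, 9]

running_total = 0
totals = []  # state after each micro-batch, like an updating aggregation
for batch in micro_batches(stream, batch_size=2):
    running_total += sum(batch)
    totals.append(running_total)
```

Structured Streaming generalizes this: you declare the aggregation once, and the engine maintains the running state and emits updated results per batch, which is what the persistence and visualization topics above then consume.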