Levi, Ray & Shoup, Inc.

MS-20775 - Perform Data Engineering on Microsoft HD Insight


The main purpose of the course is to give students the ability to plan and implement big data workflows on HDInsight.


Prerequisites


In addition to their professional experience, students who attend this course should have:
  • Programming experience using R, and familiarity with common R packages.
  • Knowledge of common statistical methods and data analysis best practices.
  • Basic knowledge of the Microsoft Windows operating system and its core functionality.
  • Working knowledge of relational databases.

Detailed Class Syllabus


Module 1: Getting Started with HDInsight


What is Big Data?
Introduction to Hadoop
Working with MapReduce Function
Introducing HDInsight

Module 2: Deploying HDInsight Clusters


Identifying HDInsight cluster types
Managing HDInsight clusters by using the Azure portal
Managing HDInsight Clusters by using Azure PowerShell

Module 3: Authorizing Users to Access Resources


Non-domain-joined clusters
Configuring domain-joined HDInsight clusters
Manage domain-joined HDInsight clusters

Module 4: Loading data into HDInsight


Storing data for HDInsight processing
Using data loading tools
Maximising value from stored data

Module 5: Troubleshooting HDInsight


Analyze HDInsight logs
YARN logs
Heap dumps
Operations Management Suite

Module 6: Implementing Batch Solutions


Apache Hive storage
HDInsight data queries using Hive and Pig
Operationalize HDInsight

Module 7: Design Batch ETL solutions for big data with Spark


What is Spark?
ETL with Spark
Spark performance

Module 8: Analyze Data with Spark SQL


Implementing iterative and interactive queries
Perform exploratory data analysis

Module 9: Analyze Data with Hive and Phoenix


Implement interactive queries for big data with Interactive Hive
Perform exploratory data analysis by using Hive
Perform interactive processing by using Apache Phoenix

Module 10: Stream Analytics


Stream Analytics
Process streaming data from Stream Analytics
Managing Stream Analytics jobs

Module 11: Implementing Streaming Solutions with Kafka and HBase


Building and Deploying a Kafka Cluster
Publishing, Consuming, and Processing data using the Kafka Cluster
Using HBase to Store and Query Data

Module 12: Develop big data real-time processing solutions with Apache Storm


Persist long term data
Stream data with Storm
Create Storm topologies
Configure Apache Storm

Module 13: Create Spark Streaming Applications


Working with Spark Streaming
Creating Spark Structured Streaming Applications
Persistence and Visualization