HWHDPDS - HDP Analyst Data Science

This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.

Prerequisites


Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.

Detailed Class Syllabus


Course Objectives


Recognize use cases for data science on Hadoop
Describe the Hadoop and YARN architecture
Describe the differences between supervised and unsupervised learning
Use Mahout to run a machine learning algorithm on Hadoop
Describe the data science life cycle
Use Pig to transform and prepare data on Hadoop
Write a Python script
Describe options for running Python code on a Hadoop cluster
Write a Pig User-Defined Function in Python
Use Pig streaming on Hadoop with a Python script
Use machine learning algorithms
Describe use cases for Natural Language Processing (NLP)
Use the Natural Language Toolkit (NLTK)
Describe the components of a Spark application
Write a Spark application in Python
Run machine learning algorithms using Spark MLlib
Take data science into production

Course Outline


Format:
50% Lecture/Discussion
50% Hands-on Labs
Hands-On Labs:
Lab: Setting Up a Development Environment
Demo: Block Storage
Lab: Using HDFS Commands
Demo: MapReduce
Lab: Using Apache Mahout for Machine Learning
Demo: Apache Pig
Lab: Getting Started with Apache Pig
Lab: Exploring Data with Pig
Lab: Using the IPython Notebook
Demo: The NumPy Package
Demo: The pandas Library
Lab: Data Analysis with Python
Lab: Interpolating Data Points
Lab: Defining a Pig UDF in Python
Lab: Streaming Python with Pig
Demo: Classification with scikit-learn
Lab: Computing K-Nearest Neighbor
Lab: Generating a K-Means Clustering
Lab: POS Tagging Using a Decision Tree
Lab: Using NLTK for Natural Language Processing
Lab: Classifying Text Using Naive Bayes
Lab: Using Spark Transformations and Actions
Lab: Using Spark MLlib
Lab: Creating a Spam Classifier with MLlib