


Part 1: Python for Big Data

Course Objectives

·         Master the fundamentals of writing Python scripts

·         Learn Python programming elements such as variables and flow control structures

·         Discover how to work with Python data structures such as lists and dictionaries

·         Write Python functions to facilitate code reuse

·         Use Python to handle data in files, XML, JSON and databases

·         Make your code robust by handling errors and exceptions properly

·         Work with commonly used Python libraries

·         Explore Python's object-oriented features

·         Search text using regular expressions

·         Visualize data

·         Hands-on practice: web scraping to collect data

·         Hands-on practice: web development to present information

·         Learn the data engineer roadmap


Python programming

·         Introduction to Python

·         Setup development IDE

·         Debug

·         Strings

·         Numbers

·         Control Structures

Data Structures

·         List

·         Dictionary

Data Operations

·         Flat file

·         JSON

·         XML

·         RDBMS

Regular expression

·         Matching

·         Searching

·         Searching and modifying

Commonly Used Libraries

·         NumPy

·         pandas

·         SciPy

Application Design

·         Requirements Collection

·         Configuration Design

·         Logging mechanism

·         Data migration and validation practice

Web Scraping

·         Web scraping basics

·         Web scraping advanced

·         The Yahoo! Finance Stock Quote Server

Data Visualization

·        Introduction

o   Data

o   Information

o   Knowledge

o   Data analysis and insight

·        Data Analysis and Visualization

o   Planning visualization

o   Visualization tools

·         Visualization Practice

o   Health Care

o   Sports

o   Trends over time

o   Financial and Statistical

Web Apps Development 

·         Web Frameworks

·         Building a Social Website

·         Sharing Content in Website

·         Tracking User Actions

Data Engineer roadmap


Part 2: Big Data Solution - Spark


Course overview

Data scientists, engineers, and analysts build information platforms that provide deep insight and answer previously unimaginable questions. Spark and Hadoop are transforming how they work by allowing interactive and integrative data analysis at scale.

You will learn how Spark and Hadoop enable data scientists, engineers, and analysts to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.

You will learn what data scientists, engineers, and analysts do, the problems they solve, and the tools and techniques they use. Through in-class simulations, participants apply data analysis methods to real-world challenges in different industries and, ultimately, prepare for big data application development and big data analyst roles in the field.



Part I: Fundamentals

Module 1 - Spark Introduction and Basic Programming

Introduction to Spark

What is Spark?

A Brief History of Spark

Programming with RDDs

Module 2 - Advanced Spark Programming

Spark Storage - Loading and saving data

Advanced Spark Programming          

Standalone applications

Module 3 - Spark SQL

          Linking with Spark SQL

            Using Spark SQL in Applications

            JDBC/ODBC server

            User-Defined Functions

            Spark SQL Performance

Module 4 - Spark Streaming

          Architecture and abstraction

          Input/output operations

          Streaming UI

          Performance Considerations

Module 5 - Tuning and Debugging Spark

          Configuring Spark

          Key Performance considerations

Module 6 - Running on Cluster

          Runtime Architecture

          Cluster Manager

Part II Applications

Module 7 - Machine Learning

          Designing a Machine learning system

            Building a Recommendation Engine with Spark      

MLlib Decision Trees

Module 8 – Prediction with Decision tree

          Decision tree

          Training Examples

          Preparing the data

          A First Decision tree

          Tuning Decision Trees

          Making Predictions


Module 9 – Anomaly Detection with K-means Clustering

          Anomaly Detection

            K-means clustering

            A First Take on Clustering

            Choosing k


            Feature Normalisation

            Clustering in action

Module 10 – Exploring Property Location data 

          Loading data

Variables to explore

Exploring property value

Exploring lot size

Exploring costs   

Exploring the year a property has been built      

Exploring rent and income     

Module 11 - Estimating Financial Risk through Monte Carlo Simulation

          Build model

Getting the data


Determine the factor Weights

Visualizing the results

Evaluating results

Module 12 - Interactive Data Analysis with Zeppelin



Appendix: Scala Programming Essentials



Part 3: Big Data Solution - Hadoop

Introduction to Big Data

All about Data!

Data Storage and Analysis                            

Comparison with Other Systems

Relational Database Management System

Grid Computing

Volunteer Computing

A Brief History of Hadoop


Installing Single-Node Hadoop

Prerequisites

Installation

Configuration

Standalone Mode

Pseudo-distributed Mode

SSH Configuration

Formatting the HDFS Filesystem

Starting and stopping MapReduce

Fully Distributed Mode

Creating an Eclipse Plugin for Hadoop-2.x.0


Download and install Eclipse

Install git

Download the source code for the Hadoop Plugin for Eclipse from git

Compile and create the jar

Install the plugin in Eclipse

Developing a MapReduce Application

The Configuration

Combining Resources

Variable Expansion

Setting Up the Development Environment

Managing Configuration

GenericOptionsParser, Tool, and ToolRunner

Writing a Unit Test with MRUnit



Running Locally on Test Data

Running a Job in a Local Job Runner

Testing the Driver

Running on a Cluster

Packaging a Job

Launching a Job

The MapReduce Web UI

Retrieving the Results

Debugging a Job

Hadoop Logs

Remote Debugging

Tuning a Job

Profiling Tasks

MapReduce Workflows

Decomposing a Problem into MapReduce Jobs


Apache Oozie

MapReduce Features


Built-in Counters

User-Defined Java Counters

User-Defined Streaming Counters

Sorting

Preparation

Partial Sort

Total Sort

Secondary Sort

Joins

Map-Side Joins

Reduce-Side Joins

Side Data Distribution

Using the Job Configuration

Distributed Cache

MapReduce Library Classes

Setting Up a Hadoop Cluster

Cluster Specification

Network Topology

Cluster Setup and Installation

Installing Java

Creating a Hadoop User

Installing Hadoop

Testing the Installation

SSH Configuration

Hadoop Configuration

Configuration Management

Environment Settings

Important Hadoop Daemon Properties

Hadoop Daemon Addresses and Ports

Other Hadoop Properties

User Account Creation

YARN Configuration

Important YARN Daemon Properties

YARN Daemon Addresses and Ports

Security

Kerberos and Hadoop

Delegation Tokens

Other Security Enhancements

Benchmarking a Hadoop Cluster

Hadoop Benchmarks

User Jobs

Hadoop in the Cloud

Apache Whirr

Administering Hadoop


Persistent Data Structures

Safe Mode

Audit Logging

Tools

Monitoring

Logging

Metrics

Java Management Extensions


Routine Administration Procedures

Commissioning and Decommissioning Nodes

Upgrades


Installing and Running Pig

Execution Types

Running Pig Programs

Grunt

Pig Latin Editors

An Example

Generating Examples

Comparison with Databases

Pig Latin

Structure

Statements

Expressions

Types

Schemas

Functions

Macros

User-Defined Functions

A Filter UDF

An Eval UDF

A Load UDF

Data Processing Operators

Loading and Storing Data

Filtering Data

Grouping and Joining Data

Sorting Data

Combining and Splitting Data

Pig in Practice


Parameter Substitution


Installing Hive

The Hive Shell

An Example

Running Hive

Configuring Hive

Hive Services

The Metastore

Comparison with Traditional Databases

Schema on Read Versus Schema on Write

Updates, Transactions, and Indexes

HiveQL

Data Types

Operators and Functions


Managed Tables and External Tables

Partitions and Buckets

Storage Formats

Importing Data

Altering Tables

Dropping Tables

Querying Data

Sorting and Aggregating

MapReduce Scripts

Joins

Subqueries

Views

User-Defined Functions

Writing a UDF

Writing a UDAF


HBasics

Backdrop

Concepts

Whirlwind Tour of the Data Model


Installation

Test Drive

Clients


Avro, REST, and Thrift

Example

Schemas

Loading Data

Web Queries

HBase Versus RDBMS

Successful Service


Use Case: HBase at Streamy.com

Praxis

Versions

HDFS

UI

Metrics

Schema Design


Bulk Load

Case Studies

Hadoop Usage at Last.fm

Last.fm: The Social Music Revolution

Hadoop at Last.fm

Generating Charts with Hadoop

The Track Statistics Program

Summary
