Big Data Career Fast Track (Python/Spark/Hadoop)
Part 1: Python for Big Data
Course Objective
· Master the fundamentals of writing Python scripts
· Learn Python programming elements such as variables and flow control structures
· Discover how to work with Python data structures such as lists and dictionaries
· Write Python functions to facilitate code reuse
· Use Python to handle data in files, XML, JSON, and databases
· Make code robust by handling errors and exceptions properly
· Work with commonly used Python libraries
· Explore Python's object-oriented features
· Search text using regular expressions
· Visualize data
· Practice hands-on web scraping to collect data
· Practice hands-on web development to present information
· Learn the data engineer roadmap
Outline
Python programming
· Introduction to Python
· Set up a development IDE
· Debug
· Strings
· Numbers
· Control Structure
· List
· Dictionary
· Flat file
· JSON
· XML
· RDBMS
· Regular expression matching
· Regular expression searching
· Regular expression searching and modifying
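A minimal sketch of the regular-expression topics above (the sample log line and patterns are illustrative assumptions):

import re

log_line = "2024-01-15 ERROR disk full on /dev/sda1"  # sample text, assumed

# Matching: does the line start with a date?
if re.match(r"\d{4}-\d{2}-\d{2}", log_line):
    print("line starts with a date")

# Searching: find the log level anywhere in the line
level = re.search(r"\b(ERROR|WARN|INFO)\b", log_line)
if level:
    print("level:", level.group(1))

# Searching and modifying: mask the device name
print(re.sub(r"/dev/\w+", "/dev/***", log_line))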
Commonly Used Libraries
· NumPy
· pandas
· SciPy
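An illustrative sketch of how these libraries fit together (the sample values are made up):

import numpy as np
import pandas as pd
from scipy import stats

# NumPy: fast numeric arrays
prices = np.array([10.5, 11.0, 9.8, 10.2])

# pandas: labeled tabular data built on top of NumPy
df = pd.DataFrame({"day": ["Mon", "Tue", "Wed", "Thu"], "price": prices})
print(df.describe())

# SciPy: scientific routines, e.g. a one-sample t-test
t_stat, p_value = stats.ttest_1samp(prices, popmean=10.0)
print(t_stat, p_value)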
Application Design
· Requirements Collection
· Configuration Design
· Log mechanism
· Data Migration and Validation Practice
Web Scraping
· Web Scraping basics
· Web Scraping advanced
· The Yahoo! Finance Stock Quote Server
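A minimal scraping sketch in the spirit of the exercises above, assuming the requests and BeautifulSoup libraries; the URL is a placeholder, not the actual class endpoint:

import requests
from bs4 import BeautifulSoup

url = "https://example.com/quotes"  # placeholder URL, assumed

response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every table cell on the page
for cell in soup.find_all("td"):
    print(cell.get_text(strip=True))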
Data Visualization
· Introduction
o Data
o Information
o Knowledge
o Data analysis and insight
· Data Analysis and Visualization
o Planning visualization
o Visualization tools
· Visualization Practice
o Health Care
o Sports
o Trends over time
o Financial and Statistical
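A small matplotlib sketch of a "trends over time" plot, using made-up data:

import matplotlib.pyplot as plt

# Made-up monthly values, purely for illustration
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
visits = [120, 135, 160, 150, 180, 210]

plt.plot(months, visits, marker="o")
plt.title("Site visits over time (sample data)")
plt.xlabel("Month")
plt.ylabel("Visits")
plt.tight_layout()
plt.show()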
Web Apps Development
· Web Frameworks
· Building a Social Website
· Sharing Content on a Website
· Tracking User Actions
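A minimal web-framework sketch; Flask is used here only as an illustration, and the framework actually taught may differ:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
    # A trivial page; a real social site would render templates and query a database
    return "Welcome to the demo site"

if __name__ == "__main__":
    app.run(debug=True)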
Data Engineer Roadmap
Part 2: Big Data Solution - Spark
Course overview
Data scientists, engineers, and analysts build information platforms that provide deep insight and answer previously unimaginable questions. Spark and Hadoop are transforming how they work by enabling interactive and iterative data analysis at scale.
You will learn how Spark and Hadoop enable data professionals to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.
You will learn what data scientists, engineers, and analysts do, the problems they solve, and the tools and techniques they use. Through in-class simulations, participants apply data analysis methods to real-world challenges in different industries and, ultimately, prepare for big data application development and big data analyst roles in the field.
Outline
Part I: Fundamentals
Module 1 - Spark Introduction and Basic Programming
Introduction to Spark
What is Spark?
A Brief History of Spark
Programming with RDDs
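A minimal PySpark RDD sketch of the kind of example used in Module 1; the input path is a placeholder:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Word count over a local text file (path is assumed)
lines = sc.textFile("data/sample.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, n in counts.take(10):
    print(word, n)

sc.stop()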
Module 2 - Advanced Spark Programming
Spark Storage - Loading and saving data
Advanced Spark Programming
Standalone applications
Module 3 - Spark SQL
Linking with Spark SQL
Using Spark SQL in Applications
JDBC/ODBC server
User-Defined Functions
Spark SQL Performance
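An illustrative Spark SQL sketch in PySpark, including a user-defined function; the column names and the age rule are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")

# Register a user-defined function callable from SQL
spark.udf.register("age_band",
                   lambda age: "senior" if age >= 40 else "adult",
                   StringType())

spark.sql("SELECT name, age_band(age) AS band FROM people").show()
spark.stop()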
Module 4 - Spark Streaming
Architecture and abstraction
Input/output operations
Streaming UI
Performance Considerations
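A minimal Spark Streaming (DStream) sketch that counts words from a socket; the host and port are assumptions (e.g. a source started with nc -lk 9999):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # assumed source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()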
Module 5 - Tuning and Debugging Spark
Configuring Spark
Key Performance considerations
Module 6 - Running on a Cluster
Runtime Architecture
Cluster Manager
Part II: Applications
Module 7 - Machine Learning
Designing a Machine learning system
Building a Recommendation Engine with Spark
MLlib Decision Trees
Module 8 - Prediction with Decision Trees
Decision Trees
Training Examples
Preparing the data
A First Decision tree
Tuning Decision Trees
Making Predictions
Conclusions
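A short decision-tree sketch with the PySpark ML API; the toy data, column names, and maxDepth value are assumptions:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("dtree-demo").getOrCreate()

# Toy labeled data: two features and a binary label
data = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.5, 0), (5.0, 4.0, 1), (6.0, 5.5, 1)],
    ["f1", "f2", "label"])

assembled = VectorAssembler(inputCols=["f1", "f2"],
                            outputCol="features").transform(data)
train, test = assembled.randomSplit([0.75, 0.25], seed=7)

tree = DecisionTreeClassifier(maxDepth=3).fit(train)
tree.transform(test).select("features", "label", "prediction").show()

spark.stop()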
Module 9 - Anomaly Detection with K-means Clustering
Anomaly Detection
K-means clustering
A First Take on Clustering
Choosing k
Visualization
Feature Normalisation
Clustering in action
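A compact K-means sketch with the PySpark ML API; the feature columns and the choice k=2 are illustrative assumptions (choosing k is covered above):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

# Tiny made-up dataset with two numeric features
data = spark.createDataFrame(
    [(1.0, 1.1), (0.9, 1.0), (8.0, 8.2), (8.1, 7.9)], ["x", "y"])

dataset = VectorAssembler(inputCols=["x", "y"],
                          outputCol="features").transform(data)

model = KMeans(k=2, seed=42).fit(dataset)
model.transform(dataset).select("x", "y", "prediction").show()

spark.stop()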
Module 10 - Exploring Property Location Data
Loading data
Variables to explore
Exploring property value
Exploring lot size
Exploring costs
Exploring the year a property was built
Exploring rent and income
Module 11 - Estimating Financial Risk through Monte Carlo Simulation
Building the model
Getting the data
Preprocessing
Determining the factor weights
Visualizing the results
Evaluating results
Module 12 - Interactive Data Analysis with Zeppelin
Appendix: Scala Programming Essentials
Part 3: Big Data Solution - Hadoop
Introduction to Big Data
All about Data!
Data Storage and Analysis
Comparison with Other Systems
Relational Database Management Systems
Grid Computing
Volunteer Computing
A Brief History of Hadoop
Compatibility
Installing Single-Node Hadoop
Prerequisites, Installation, Configuration, Standalone Mode
Pseudo-Distributed Mode: Configuration, SSH, Formatting the HDFS Filesystem
Starting and stopping MapReduce
Fully Distributed Mode
Creating an Eclipse Plugin for Hadoop 2.x.0
Download and install Eclipse
Install git
Download source code for Hadoop Plugin for Eclipse from git
Compile and create jar
Install the plugin into Eclipse
Developing a MapReduce Application
The Configuration API: Combining Resources, Variable Expansion
Setting Up the Development Environment: Managing Configuration; GenericOptionsParser, Tool, and ToolRunner
Writing a Unit Test with MRUnit
Mapper
Reducer
Running Locally on Test Data: Running a Job in a Local Job Runner, Testing the Driver
Running on a Cluster: Packaging a Job, Launching a Job
The MapReduce Web UI, Retrieving the Results, Debugging a Job
Hadoop Logs, Remote Debugging, Tuning a Job, Profiling Tasks
MapReduce Workflows
Decomposing a Problem into MapReduce Jobs
JobControl
Apache Oozie
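The course develops MapReduce jobs in Java, but the map and reduce steps covered above can also be sketched compactly with Hadoop Streaming and two small Python scripts (the file names mapper.py and reducer.py are assumptions):

#!/usr/bin/env python
# mapper.py -- emit a tab-separated (word, 1) pair for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- sum the counts for each word (input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The pair would be submitted with the hadoop-streaming JAR (-mapper mapper.py -reducer reducer.py, plus -input and -output paths); the exact JAR location depends on the installation.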
MapReduce Features
Counters
Built-in Counters
User-Defined Java Counters
User-Defined Streaming Counters
Sorting: Preparation, Partial Sort, Total Sort, Secondary Sort; Joins
Map-Side Joins
Reduce-Side Joins
Side Data Distribution
Using the Job Configuration, Distributed Cache, MapReduce Library Classes
Setting Up a Hadoop Cluster
Cluster Specification
Network Topology
Cluster Setup and Installation
Installing Java
Creating a Hadoop User, Installing Hadoop, Testing the Installation, SSH Configuration
Hadoop Configuration
Configuration Management
Environment Settings
Important Hadoop Daemon Properties, Hadoop Daemon Addresses and Ports, Other Hadoop Properties
User Account Creation
YARN Configuration
Important YARN Daemon Properties, YARN Daemon Addresses and Ports, Security
Kerberos and Hadoop
Delegation Tokens
Other Security Enhancements; Benchmarking a Hadoop Cluster: Hadoop Benchmarks
User Jobs
Hadoop in the Cloud
Apache Whirr
Administering Hadoop
HDFS
Persistent Data Structures
Safe Mode, Audit Logging, Tools; Monitoring: Logging, Metrics
Java Management Extensions
Maintenance
Routine Administration Procedures, Commissioning and Decommissioning Nodes, Upgrades
Pig
Installing and Running Pig
Execution Types, Running Pig Programs, Grunt
Pig Latin Editors, An Example, Generating Examples
Comparison with Databases
Pig Latin: Structure, Statements, Expressions, Types, Schemas, Functions, Macros
User-Defined Functions
A Filter UDF, An Eval UDF, A Load UDF
Data Processing Operators: Loading and Storing Data, Filtering Data
Grouping and Joining Data
Sorting Data
Combining and Splitting Data
Pig in Practice
Parallelism
Parameter Substitution
Hive
Installing Hive, The Hive Shell, An Example, Running Hive
Configuring Hive
Hive Services
The Metastore
Comparison with Traditional Databases: Schema on Read Versus Schema on Write; Updates, Transactions, and Indexes; HiveQL
Data Types
Operators and Functions
Tables
Managed Tables and External Tables
Partitions and Buckets
Storage Formats
Importing Data, Altering Tables, Dropping Tables, Querying Data
Sorting and Aggregating
MapReduce Scripts
Joins, Subqueries, Views
User-Defined Functions
Writing a UDF, Writing a UDAF
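The "MapReduce Scripts" topic above can be illustrated with a small Python script invoked from HiveQL via TRANSFORM; the column names and quality codes below are assumptions:

#!/usr/bin/env python
# strip_quality.py -- reads tab-separated (year, temperature, quality) rows
# from stdin and passes through only records that look valid
import sys

VALID_QUALITY = {"0", "1", "4", "5", "9"}  # illustrative codes

for line in sys.stdin:
    year, temperature, quality = line.rstrip("\n").split("\t")
    if temperature != "9999" and quality in VALID_QUALITY:
        print("%s\t%s" % (year, temperature))

From Hive this might be called with SELECT TRANSFORM(year, temperature, quality) USING 'strip_quality.py' AS year, temperature FROM records (the table name is assumed).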
HBase
HBasics, Backdrop, Concepts
Whirlwind Tour of the Data Model
Implementation
Installation, Test Drive, Clients
Java
Avro, REST, and Thrift
Example: Schemas, Loading Data, Web Queries
HBase Versus RDBMS
Successful Service
HBase
Use Case: HBase at Streamy.com
Praxis: Versions, HDFS
UI, Metrics
Schema Design
Counters
Bulk Load
Case Studies
Hadoop Usage at Last.fm
Last.fm: The Social Music Revolution
Hadoop at Last.fm
Generating Charts with Hadoop, The Track Statistics Program, Summary