Spark Development and Data Analysis
(More materials and details on how to participate)
Course overview
Data scientists, engineers, and analysts build information platforms to provide deep insight and answer previously unimaginable questions. Spark and Hadoop are transforming how data scientists, engineers, and analysts work by allowing interactive and integrative data analysis at scale.
You will learn how Spark and Hadoop enable data scientists, engineers, and analysts to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.
You will learn what data scientists, engineers, and analysts do, the problems they solve, and the tools and techniques they use. Through in-class simulations, participants apply data analysis methods to real-world challenges in different industries and, ultimately, prepare for big data application development and big data analyst roles in the field.
Outline
Part I Fundamentals
Module 1 - Spark Introduction and Basic Programming
Introduction to Spark
What is Spark?
A Brief History of Spark
Programming with RDDs
Module 2 - Advanced Spark Programming
Spark Storage - Loading and saving data
Advanced Spark Programming
Standalone applications
Module 3 - Spark SQL
Linking with Spark SQL
Using Spark SQL in Applications
JDBC/ODBC server
User-Defined Functions
Spark SQL Performance
Module 4 - Spark Streaming
Architecture and abstraction
Input/output operations
Streaming UI
Performance Considerations
Module 5 - Tuning and Debugging Spark
Configuring Spark
Key Performance Considerations
Module 6 - Running on a Cluster
Runtime Architecture
Cluster Manager
Part II Applications
Module 7 - Machine Learning
Designing a Machine Learning System
Building a Recommendation Engine with Spark
MLlib Decision Trees
Module 8 - Prediction with Decision Trees
Decision Trees
Training Examples
Preparing the data
A First Decision Tree
Tuning Decision Trees
Making Predictions
Conclusions
Module 9 - Anomaly Detection with K-means Clustering
Anomaly Detection
K-means clustering
A First Take on Clustering
Choosing k
Visualization
Feature Normalisation
Clustering in action
Module 10 - Exploring Property Location Data
Loading data
Variables to explore
Exploring property value
Exploring lot size
Exploring costs
Exploring the year a property was built
Exploring rent and income
Module 11 - Estimating Financial Risk through Monte Carlo Simulation
Build model
Getting the data
Preprocessing
Determining the factor weights
Visualizing the results
Evaluating results
Module 12 - Interactive Data Analysis with Zeppelin
Appendix - Scala Programming Essentials