Python Hands-On Class




Applied Data Science with Python and R: Python Part

Data science is playing an increasingly important role in the era of big data. In the current job market, posted data scientist positions outnumber applicants. One important reason is that most data analysts can only use SAS for analysis in non-Hadoop environments and do not know how to use open source tools (such as R, Python, or Scala).
In Canada, however, open source tools are growing more popular across all industries and are expected to overtake SAS for industry data analysis soon. For instance, the five big banks, telecom companies, and consulting firms are using Python, R, or Scala for big data analysis and modeling in Hadoop environments instead of SAS. SAS skills are no longer as attractive to employers as they once were; open source tools have become the required and more sought-after skills. As for programming language popularity, the TIOBE index report for 2016 shows that Python moved up three spots within the last year to claim the number 5 spot. Meanwhile, R ranked 16th, while SAS dropped to number 21. Data scientist is a newer role than data analyst, with more opportunities, better prospects, and higher pay. Search for "Data Scientist" on any job search website to see how strongly the role is in demand.
To meet the needs of the data scientist job market, this course provides the knowledge and skills that help data analysts optimize their data science learning path and successfully transition into data scientist roles. The topics come from an analysis of real requirements in data scientist job listings from the biggest tech employers. The course not only introduces you step by step to installing the Python interpreter and to data ingestion and wrangling, but also guides you end to end through developing machine learning models in Python.
The course is built around three themes designed to get you using Python for applied machine learning effectively and quickly:
Lessons: Over two sessions, learn how data can be processed in Python (Session 1: Part 1 ~ Part 4), how machine learning projects map onto Python, and the best-practice way of working through each task (Session 2: Part 5 ~ Part 16)
Projects: Tie together all of the knowledge from the lessons by working through case-study data processing and predictive modeling problems
Recipes: A catalog of standalone machine learning recipes in Python is provided as a bonus, which you can copy and paste as a starting point for your new projects
Who this course is designed for:
Anyone without prior coding or scripting experience who has a science, engineering, or finance background and aspires to become a data scientist
New graduates with a science, engineering, or finance background who would like to use Python to perform data science operations
Developers and programmers who intend to expand their knowledge and learn about data manipulation and machine learning
SAS programmers in finance, telecom, or other non-tech industries who want to transition from reporting or data cleaning into a data scientist role
After mastering what you learn in this course, you can seek and carry out a data scientist job with confidence

Course Outline

Session 1
Part 1: Introduction to the Applied Data Science with Python Course
What You Will Learn From This Course
Part 2: Python Ecosystem for Machine Learning
Python Ecosystem Installation
Jupyter Installation
SciPy – NumPy, Matplotlib, Pandas, Scikit-Learn, and statsmodels
Part 3: Python Programming Basics
Variables and Data Types
Basic Operators
Number Type Conversion
Mathematical, Random Number, Trigonometric Functions and Mathematics Constants
Working With Lists, Tuples, Strings, Sets and Dictionaries
Working With Sequences
Working With Collections
Exercises 1
Part 4: Data Ingestion and Munging
Data Loading
a. Load CSV Files with the Python Standard Library
b. Load CSV Files with NumPy
c. Load CSV Files with Pandas
Data Processing
a. Data Preprocessing with Pandas
a) Data Selection and Manipulation
b) Dealing with problematic data and missing values
c) Dealing with big datasets
d) Accessing other data formats
b. Data Preprocessing with NumPy
a) Creating NumPy Arrays
b) NumPy Fast Operation and Computations
Working with Categorical and Textual Data
a) Introducing the basics of matplotlib
b) Selected graphical examples with pandas
c) Advanced data learning representation
Exercises 2
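As a rough illustration of the data loading and missing-value topics in Part 4, the following sketch reads CSV data with the Python standard library only. The column names, the in-memory sample data, and the mean-fill strategy are assumptions made for this example and are not taken from the course material.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for a real file on disk.
raw = io.StringIO(
    "age,income,label\n"
    "25,40000,0\n"
    "47,82000,1\n"
    "33,,0\n"
)

rows = []
for record in csv.DictReader(raw):
    # Convert text fields to numbers; keep a missing income as None.
    rows.append({
        "age": int(record["age"]),
        "income": float(record["income"]) if record["income"] else None,
        "label": int(record["label"]),
    })

# One simple treatment of missing values: fill with the mean of observed values.
observed = [r["income"] for r in rows if r["income"] is not None]
mean_income = sum(observed) / len(observed)
for r in rows:
    if r["income"] is None:
        r["income"] = mean_income
```

NumPy and Pandas offer their own CSV loaders for the same task; the standard library version is shown here only because it has no dependencies.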
Session 2
Part 5: Introducing EDA 
Understand Data With Descriptive Statistics
Understand Data With Visualization
The Detection and Treatment of Outliers
a. Univariate outlier detection
b. EllipticEnvelope
c. OneClassSVM
Pre-Process Data
a. Data Transforms
b. Rescale Data
c. Standardize Data
d. Normalize Data
e. Binarize Data
Dimensionality Reduction
a. The Covariance Matrix
b. Principal Component Analysis (PCA)
c. RandomizedPCA
d. Latent Factor Analysis (LFA)
e. Linear Discriminant Analysis (LDA)
f. Latent Semantic Analysis (LSA)
g. Independent Component Analysis (ICA)
h. Kernel PCA
Exercise 3
Part 6: Feature Selection 
Univariate Selection
Recursive Feature Elimination
Stability and L1 Based Selection
Feature Importance
Exercise 4
Part 7: Resampling Methods
Train and Test Sets
K-fold Cross Validation
Leave One Out Cross Validation
Repeated Random Test-Train Splits
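To make the resampling idea in Part 7 concrete, here is a minimal sketch of how k-fold cross validation can split sample indices into train and test sets. The function name `k_fold_indices` is hypothetical, and the sketch uses contiguous folds without shuffling, which is a simplification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Example: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
```

Under this sketch, leave-one-out cross validation corresponds to calling the function with `k` equal to `n_samples`, so each test set holds exactly one sample.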
Part 8: Algorithm Evaluation Metrics
Classification Metrics
a. Classification Accuracy
b. Logarithmic Loss
c. Area under ROC Curve
d. Confusion Matrix
e. Classification Report
Regression Metrics
a. Mean Absolute Error
b. Mean Squared Error
c. R-square
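The three regression metrics listed above can be written out directly from their usual definitions. The following is a plain-Python sketch with made-up sample values; the function names are chosen for this example only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
```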
Part 9: Model Techniques Selection for Classification
Linear Machine Learning Algorithms
a. Logistic Regression
b. Linear Discriminant Analysis
Nonlinear Machine Learning Algorithms
a. K-Nearest Neighbors
b. Naive Bayes
c. Classification and Regression Trees
d. Support Vector Machines
Exercise 5
Part 10: Model Techniques Exploration for Regression
Linear Machine Learning Algorithms
a. Linear Regression
b. Ridge Regression
c. LASSO Linear Regression
d. Elastic Net Regression
Nonlinear Machine Learning Algorithms
a. K-Nearest Neighbors
b. Classification and Regression Trees
c. Support Vector Machines
Exercise 6
Part 11: Champion Model Technique Selection
How to formulate an experiment to directly compare machine learning algorithms
A reusable template for evaluating the performance of multiple algorithms 
How to report and visualize the results when comparing algorithm performance
Exercise 7
Part 12: Automating Machine Learning Workflows with Pipelines
How to use pipelines to minimize data leakage
How to construct a data preparation and modeling pipeline
How to construct a feature extraction and modeling pipeline
Exercise 8
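The data leakage point in Part 12 comes down to one rule: preparation steps should learn their parameters from the training data only and then be applied unchanged to the test data. The sketch below shows that rule with a hand-written `Standardizer` class, which is a hypothetical stand-in for this example and not a library API.

```python
from statistics import mean, pstdev


class Standardizer:
    """Hypothetical minimal scaler: learns mean and std, then rescales."""

    def fit(self, values):
        self.mean_ = mean(values)
        # Fall back to 1.0 so constant columns do not divide by zero.
        self.std_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0]

# Fit on the training split only, so the test value cannot influence
# the learned mean or standard deviation.
scaler = Standardizer().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
```

A pipeline bundles steps like this with the model so that, during cross validation, the fit happens on each training fold separately.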
Part 13: Ensemble Methods
Bagging: building multiple models from different subsamples of the training dataset
a. Bagged Decision Trees
b. Random Forest
c. Extra Trees
Boosting: building multiple models (typically of the same type), each of which learns to fix the prediction errors of the prior model in the sequence
a. AdaBoost
b. Stochastic Gradient Boosting
Voting: building multiple models (typically of differing types) and combining their predictions with simple statistics (such as the mean)
Exercise 9
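The voting idea from Part 13 can be sketched in a few lines. Here the per-model predictions are hard-coded, made-up class labels, and the predictions are combined by majority vote, which is one common choice for classification (the mean is the usual choice for regression).

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote."""
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Made-up predictions from three hypothetical models on four samples.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
```

With an odd number of models and two classes, ties cannot occur; other setups would need an explicit tie-breaking rule.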
Part 14: Algorithm Parameter Tuning
The importance of algorithm parameter tuning to improve algorithm performance
How to use a grid search algorithm tuning strategy
How to use a random search algorithm tuning strategy
Exercise 10
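A grid search, as named in Part 14, tries every combination of candidate parameter values and keeps the best-scoring one. In the sketch below the `score` function, the parameter names `alpha` and `depth`, and the candidate values are all hypothetical; in practice the score would come from cross-validating a real model.

```python
from itertools import product


def score(alpha, depth):
    """Hypothetical stand-in for a cross-validated model score."""
    return -(alpha - 0.1) ** 2 - (depth - 3) ** 2


# Candidate values for each parameter (assumed for illustration).
grid = {"alpha": [0.01, 0.1, 1.0], "depth": [2, 3, 4]}

best_params, best_score = None, float("-inf")
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    current = score(**params)
    if current > best_score:
        best_params, best_score = params, current
```

A random search follows the same loop but samples a fixed number of combinations instead of enumerating all of them.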
Part 15: Save and Load Machine Learning Models
Finalize Your Model with pickle
Finalize Your Model with joblib
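Saving and reloading a model with pickle follows a dump-then-load pattern. In this sketch a plain dictionary of coefficients stands in for a trained model, and the `predict` helper and file name are hypothetical; a real fitted estimator object would be saved the same way.

```python
import os
import pickle
import tempfile

# Hypothetical "trained model": an intercept plus linear coefficients.
model = {"intercept": 0.5, "coefficients": [1.2, -0.7]}


def predict(model, features):
    """Linear prediction using the stored intercept and coefficients."""
    return model["intercept"] + sum(
        c * x for c, x in zip(model["coefficients"], features)
    )


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    # Save the model to disk in binary mode.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load it back, as a later session or another program would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

prediction = predict(restored, [1.0, 1.0])
```

joblib provides similar `dump` and `load` functions. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.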
Part 16: Projects
Predictive Modeling Project Template
a. Use A Structured Step-By-Step Process
b. Machine Learning Project Template in Python
c. Machine Learning Project Template Steps
d. Tips For Using The Template Well
Project 1: The Hello World of Classification Machine Learning (multinomial target model)
Project 2: Regression Machine Learning Case Study Project (continuous target model)
Project 3: Binary Classification Machine Learning Case Study Project (binary target model) 
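[Instructor] Mr. Chen holds a PhD in signal processing from a university in China and was formerly a professor at a well-known Chinese university. Since coming to North America, he has worked in Database Marketing across fields including logistics, IT, supermarket retail, and telecommunications. Over that time he has held positions at different levels, from Data Analyst to Statistician to Senior Manager, giving him extensive knowledge of Database Marketing and substantial North American work experience. He has taught data analysis courses since 2004 and has helped several hundred Chinese students find their first data analysis job and continue to advance in their careers. His students now work in major North American governments, banks, telecom companies, and other Fortune Global 500 companies, where they hold important data analysis roles. Notably, as the founder of the "Data Mining" course for the Chinese community, Mr. Chen explains advanced theory in an accessible way and has earned high praise from his students.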