Three Common Types of Problem

Regression: To find the relationship between a dependent variable and one or more independent variables
Classification: To assign an observation to one of several known categories
Clustering: To group a set of objects into several clusters that are not known in advance

Regression

Model evaluation methods
R², Adjusted R², MAE (mean absolute error), MSE (mean square error), RMSE (root mean square error), AIC (Akaike information criterion), BIC (Bayesian information criterion), Residual analysis, Goodness-of-fit test, Cross validation
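
As a rough illustration of the error-based measures listed above, the sketch below computes MAE, MSE, RMSE, R² and adjusted R² by hand. The function name, the sample values and the `n_features` argument are assumptions made for this example only.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalizes R2 for the number of predictors used.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}

# Made-up observations and predictions from a one-feature model.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 8.0, 12.0]
metrics = regression_metrics(y_true, y_pred, n_features=1)
print(metrics)
```

Lower MAE, MSE and RMSE mean smaller errors, while R² and adjusted R² closer to 1 mean more of the variance is explained.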

Classification

Model evaluation methods
Accuracy, Confusion matrix, Sensitivity and specificity, ROC (receiver operating characteristic), AUC (area under the curve), Cross validation
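
The sketch below shows one way the confusion-matrix counts, accuracy, sensitivity, specificity and AUC relate to each other for a binary classifier. The labels, scores and the 0.5 threshold are invented for illustration, and AUC is computed here by pairwise ranking rather than by tracing the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up true labels and predicted scores; 0.5 is an assumed decision threshold.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.35, 0.6, 0.75]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
auc = auc_by_pairs(y_true, scores)
print(tp, tn, fp, fn, accuracy, sensitivity, specificity, auc)
```

Accuracy, sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds.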

Clustering

Model evaluation methods
Models can be evaluated externally using data that were not used for clustering but have known class labels
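
Purity is one possible external measure (chosen here as an assumption; the list above does not name a specific one). It gives each cluster the credit of its most common known label. The cluster assignments and labels below are made up.

```python
from collections import Counter, defaultdict

def purity(cluster_ids, true_labels):
    """Share of objects whose known label matches the majority label of their cluster."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    matched = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return matched / len(true_labels)

# Hypothetical clustering output and the known class labels held back for evaluation.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
true_labels = ["a", "a", "b", "b", "b", "b", "c", "c", "a", "c"]
score = purity(cluster_ids, true_labels)
print(score)
```

A purity of 1.0 would mean every cluster contains a single known class.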

General steps to build a model

1. Collect the data.
2. Prepare the data and fix issues such as missing values and outliers.
3. Use exploratory analysis to study the content of your data and select an algorithm that suits your needs.
4. Train a model using the algorithm you selected. Start with a simple model that uses only the most important variables/features.
5. Check model performance using the evaluation methods.
6. If the model is not satisfactory, choose another algorithm or introduce different variables into the existing model.
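
The six steps above can be sketched end to end with scikit-learn. Everything here is an assumption for illustration: the data are synthetic, correlation stands in for exploratory analysis, and linear regression stands in for "the algorithm you selected".

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Step 1: collect the data (synthetic here: x0 matters most, x2 is pure noise).
X = rng.normal(size=(200, 3))
y = 4.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
X[rng.choice(200, size=10, replace=False), 1] = np.nan  # simulate missing values

# Hold out a test set before any statistics are learned from the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 2: prepare the data; fill missing values with the training median.
imputer = SimpleImputer(strategy="median").fit(X_train)
X_train, X_test = imputer.transform(X_train), imputer.transform(X_test)

# Step 3: exploratory check of how strongly each feature tracks the target.
corr = [abs(np.corrcoef(X_train[:, j], y_train)[0, 1]) for j in range(X_train.shape[1])]
best = int(np.argmax(corr))

# Step 4: start simple, using only the most strongly correlated feature.
simple = LinearRegression().fit(X_train[:, [best]], y_train)

# Step 5: check performance on the held-out data.
mse_simple = mean_squared_error(y_test, simple.predict(X_test[:, [best]]))

# Step 6: if that is not satisfactory, introduce more variables and compare.
full = LinearRegression().fit(X_train, y_train)
mse_full = mean_squared_error(y_test, full.predict(X_test))
print(best, mse_simple, mse_full)
```

The comparison at the end is the loop the last step describes: keep the extra variables only if the held-out error improves.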

Popular tools for implementation

R: ML libraries including stats, glmnet, caret
Python: Popular ML packages including scikit-learn, statsmodels
Alteryx Designer: 'Drag-and-drop' interface that requires minimal coding
Microsoft Azure Machine Learning Studio: 'Drag-and-drop' interface that requires minimal coding
 

Linear Regression

Learning style: Supervised
Problem: Regression
Use case: Revenue prediction
Widely used for predicting numeric values (or quantities). It trains and predicts quickly, but it can overfit when there are many features, so proper feature selection is often needed.
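
A minimal sketch of a least-squares fit for a revenue-style problem, using NumPy. The variable names (`ad_spend`, `n_stores`) and the numbers are hypothetical, and the data are generated without noise so the recovered coefficients are easy to check.

```python
import numpy as np

# Hypothetical predictors and a revenue generated as 50 + 3*ad_spend + 2*n_stores.
ad_spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
n_stores = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
revenue = 50.0 + 3.0 * ad_spend + 2.0 * n_stores

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(ad_spend), ad_spend, n_stores])

# Ordinary least squares: minimize the squared error between X @ coef and revenue.
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)
intercept, b_ad, b_stores = coef

# Predict revenue for a new, unseen combination of inputs.
predicted = intercept + b_ad * 70.0 + b_stores * 7.0
print(intercept, b_ad, b_stores, predicted)
```

With real, noisy data the coefficients would only approximate the underlying relationship, which is where the evaluation methods for regression come in.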

Logistic Regression

Learning style: Supervised
Problem: Classification
Use case: Customer churn prediction
A generalized linear model whose dependent variable is binary (0 or 1). Mostly used to predict whether an event will occur based on the independent variables.
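
To make the idea concrete, here is a small from-scratch sketch that fits a one-feature logistic model by gradient descent. The churn data, the feature name `months_inactive`, the learning rate and the epoch count are all assumptions for illustration; in practice a library such as scikit-learn or statsmodels would normally be used.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=10000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(r * x for r, x in zip(residuals, xs)) / n
        b -= lr * sum(residuals) / n
    return w, b

# Hypothetical data: months of inactivity and whether the customer churned (1) or not (0).
months_inactive = [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]
churned = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(months_inactive, churned)
p_low = sigmoid(w * 1.0 + b)   # estimated churn probability after 1 inactive month
p_high = sigmoid(w * 9.0 + b)  # estimated churn probability after 9 inactive months
boundary = -b / w              # value of x where the predicted probability is 0.5
print(p_low, p_high, boundary)
```

The model outputs a probability, so a threshold (commonly 0.5) is still needed to turn it into a yes/no prediction.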

Decision Tree

Learning style: Supervised
Problem: Classification/Regression
Use case: Targeted advertising
It requires little data preparation and can handle both numeric and categorical data. It is easy to interpret and visualize but susceptible to overfitting.
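
The sketch below fits a shallow tree with scikit-learn and prints its rules, which is what makes trees easy to interpret. The features (`age`, `visits`) and the click labels are invented. Note that scikit-learn's tree implementation expects numeric input, so categorical columns would need to be encoded first; limiting `max_depth` is one common way to curb overfitting.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical ad data: [age, site visits] and whether the user clicked the ad.
X = [[22, 5], [25, 1], [28, 7], [35, 2], [41, 6], [52, 3]]
y = [1, 1, 1, 0, 0, 0]

# A small max_depth keeps the tree simple and less prone to overfitting.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned rules as plain text.
print(export_text(tree, feature_names=["age", "visits"]))

predictions = list(tree.predict([[24, 4], [45, 4]]))
print(predictions)
```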

Random Forest

Learning style: Supervised
Problem: Classification/Regression
Use case: Credit card fraud detection
An ensemble method that combines many decision trees. It keeps most of the advantages of a single decision tree (although it is harder to interpret), can handle many features, and usually achieves high accuracy.
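
A minimal scikit-learn sketch that trains a single tree and a forest on the same synthetic data and compares held-out accuracy. The dataset parameters and the number of trees are arbitrary choices for illustration, and the forest is not guaranteed to beat the single tree on every dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20 features, only 5 of which carry signal.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           class_sep=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A single unpruned tree versus an ensemble of 200 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)

# Importances show how much each feature contributed across the trees.
importances = forest.feature_importances_
print(tree_acc, forest_acc, importances.argmax())
```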

K-means

Learning style: Unsupervised
Problem: Clustering
Use case: Customer segmentation
This method groups objects into k clusters. The goal is for the objects in each cluster to be more similar to one another than to objects in other clusters. When k is not predetermined, methods such as the elbow method and the silhouette method can be used to find a good value of k.
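
One way to look for a good k, sketched with scikit-learn on synthetic data: record the within-cluster sum of squares (for the elbow method) and the silhouette score for several candidate values. The blob centers and the range of k are assumptions for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data drawn around three well-separated centers.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [0, 10]],
                  cluster_std=1.0, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_                          # elbow method: look for the bend
    silhouettes[k] = silhouette_score(X, model.labels_)   # higher is better

best_k = max(silhouettes, key=silhouettes.get)
print(inertias)
print(silhouettes)
print(best_k)
```

Inertia keeps falling as k grows, so the elbow method looks for the point where the drop levels off, while the silhouette method simply picks the k with the highest score.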
 

Naïve Bayes

Learning style: Supervised
Problem: Classification
Use case: Email spam filtering
A conditional probability model that assumes all features are conditionally independent of each other given the class. It trains and predicts quickly, but its precision is low on small datasets and it can suffer from the 'zero-frequency' problem.
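
The 'zero-frequency' problem can be seen in a tiny from-scratch word-count classifier: a word never seen with a class gives that class a probability of zero. Add-one (Laplace) smoothing is a common remedy and is used here as an assumption, since the text above does not name a fix. The messages and function names are invented.

```python
import math
from collections import Counter

# Hypothetical toy training set of (message, label) pairs.
train = [
    ("win money now", "spam"),
    ("win a free prize", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

def fit(docs):
    """Count words per class, messages per class, and collect the vocabulary."""
    word_counts, class_counts = {}, Counter()
    for text, label in docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def log_score(text, label, model, alpha):
    """log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    word_counts, class_counts, vocab = model
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        if word not in vocab:
            continue  # ignore words never seen in training
        numerator = word_counts[label][word] + alpha
        if numerator == 0:
            return float("-inf")  # zero frequency: one unseen word rules out the class
        score += math.log(numerator / (total_words + alpha * len(vocab)))
    return score

model = fit(train)
message = "free money meeting"
unsmoothed = {c: log_score(message, c, model, alpha=0) for c in ("spam", "ham")}
smoothed = {c: log_score(message, c, model, alpha=1) for c in ("spam", "ham")}
prediction = max(smoothed, key=smoothed.get)
print(unsmoothed, smoothed, prediction)
```

Without smoothing both classes are ruled out for this message; with smoothing the classifier can still weigh the evidence.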

K-nearest Neighbors (KNN)

Learning style: Supervised
Problem: Classification
Use case: Bank credit risk analysis
A lazy learning algorithm that requires little work at training time but can be slow at prediction time if you have a large data set.
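
A from-scratch sketch shows why KNN is 'lazy': there is no training step, and every prediction scans all stored points. The two features (assumed to be already scaled), the risk labels and the value of k are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # All the work happens at prediction time: one distance per stored point.
    distances = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical applicants described by two scaled features, with known risk labels.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = ["low", "low", "low", "high", "high", "high"]

first = knn_predict(train_X, train_y, (1.1, 1.0))
second = knn_predict(train_X, train_y, (3.0, 3.0))
print(first, second)
```

Because distances drive the vote, features on very different scales should be standardized before using KNN.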

Support Vector Machine (SVM)

Learning style: Supervised
Problem: Classification/Regression
Use case: Text classification
It uses a kernel function to map data points to a higher-dimensional space and finds a hyperplane that separates the points in that space. Well suited to data sets with many dimensions, or to cases where you know the decision boundary is not linear; training can be slow on very large data sets.
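
The effect of the kernel can be sketched with scikit-learn on data whose classes form concentric circles, so no straight line separates them. The dataset and the choice of the RBF kernel are assumptions for illustration and are not tied to the text classification use case above.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two classes arranged as an inner and an outer ring: not linearly separable.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A linear kernel looks for a separating line in the original space.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)

# An RBF kernel implicitly works in a higher-dimensional space.
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)
print(linear_acc, rbf_acc)
```

The gap between the two accuracies is the practical payoff of the kernel when the decision boundary is not linear.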
