# Data Mining Cheat Sheet by HockeyPlay21

Basic cheat sheet for popular data mining concepts based on Tan's, Steinbach's, and Kumar's book

### Data Mining Steps

 1. Data Cleaning Removal of noise and incons­istent records 2. Data Integr­ation Combing multiple sources 3. Data Selection Only data relevant for the task are retrieved from the database 4. Data Transf­orm­ation Converting data into a form more approp­riate for mining 5. Data Mining Applic­ation of intell­igent methods to extract data patterns 6. Model Evaluation Identi­fic­ation of truly intere­sting patterns repres­enting knowledge 7. Knowledge Presen­tation Visual­ization or other knowledge presen­tation techniques
Data mining could also be called Knowledge Discovery in Databases (see kdnugg­ets.com)

### Types of Attributes

 Nomial e.g., ID numbers, eye color, zip codes Ordinal e.g., rankings, grades, height Interval e.g., calendar dates, temper­atures Ratio e.g., length, time, counts

### Distance Measures

Manhattan = City Block

Jaccard coeffi­cient, Hamming, Cosine are a similarity / dissim­ilarity measures

### Model Evaluation

Kappa = (observed agreement - chance agreement) / (1- chance agreement)

Kappa = (Dreal – Drandom) / (Dperfect – Drandom), where D indicates the sum of values in diagonal of the confusion matrix

### K-Nearest Neighbor

 ```* Compute distance between two points * Determine the class from nearest neighbor list     * Take the majority vote of class labels       among the k-nearest neighbors     * Weigh the vote according to distance         * weight factor, w = 1 / d^2```

### Rule-based Classi­fic­ation

 Classify records by using a collection of “if…then…” rules Rule: (Condi­tion) --> y where: * Condition is a conjun­ction of attributes * y is the class label LHS: rule antecedent or condition RHS: rule consequent Examples of classi­fic­ation rules: (Blood Type=Warm) ^ (Lay Eggs=Yes) --> Birds (Taxable Income < 50K) ^ (Refun­d=Yes) --> Evade=No Sequential covering is a rule-based classi­fier.

### Bayesian Classi­fic­ation

p(a,b) is the probab­ility that both a and b happen.

p(a|b) is the probab­ility that a happens, knowing that b has already happened.

### Terms

 Associ­ation Analysis Min-Ap­riori, LIFT, Simpson's Paradox, Anti-m­onotone property Ensemble Methods Staking, Random Forest Decision Trees C4.5, Pessim­istic estimate, Occam's Razor, Hunt's Algorithm Model Evaluation Cross-­val­ida­tion, Bootstrap, Leave-one out (C-V), Miscla­ssi­fic­ation error rate, Repeated holdout, Strati­fic­ation Bayes Probab­ilistic classifier Data Visual­ization Chernoff faces, Data cube, Percentile plots, Parallel coordi­nates Nonlinear Dimens­ion­ality Reduction Principal compon­ents, ISOMAP, Multid­ime­nsional scaling

### Ensemble Techniques

Mani­pulate training data: bagging and boosting ensemble of “experts”, each specia­lizing on different portions of the instance space

Mani­pulate output values: error-­cor­recting output coding (ensemble of “experts”, each predicting 1 bit of the {multibit} full class label)

Meth­ods: BAGGing, Boosting, AdaBoost

### Apriori Algorithm

 ```Let k=1 Generate frequent itemsets of length 1 Repeat until no new frequent itemsets are identified     Generate length (k+1) candidate itemsets from     length k frequent itemsets     Prune candidate itemsets containing subsets     of length k that are infrequent     Count the support of each candidate by     scanning the DB     Eliminate candidates that are infrequent,     leaving only those that are frequent```

### K-means Clustering

 ```Select K points as the initial centroids repeat     Form K Clusters by assigning all points to the closest centroid     Recompute the centroid of each cluster until the centroids don't change```
Clos­eness is measured by distance (e.g., Euclid­ean), similarity (e.g., Cosine), correl­ation.

Cent­roid is typically the mean of the points in the cluster

### Hierar­chical Clustering

 Sing­le-Link or MIN Similarity of two clusters is based on the two most similar (closest / minimum) points in the different clusters Determined by one pair of points, i.e., by one link in the proximity graph. Complete or MAX Similarity of two clusters is based on the two least similar (most distant, maximum) points in the different clusters Determined by all pairs of points in the two clusters Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters
Aggl­ome­rat­ive clustering starts with points as individual clusters and merges closest clusters until only one cluster left.

Divi­sive clustering starts with one, all-in­clusive cluster and splits a cluster until each cluster only has one point.

### Dendrogram Example

Data­set: {7, 10, 20, 28, 35}

### Densit­y-Based Clustering

 ```current_cluster_label <-- 1 for all core points do     if the core point has no cluster label then         current_cluster_label <-- current_cluster_label +1         Label the current core point with the cluster label     end if     for all points in the Eps-neighborhood, except i-th the point itself do         if the point does not have a cluster label then             Label the point with cluster label         end if     end for end for```
DBSCAN is a popular algorithm

Density = number of points within a specified radius (Eps)

A point is a core point if it has more than a specified number of points (MinPts) within Eps

These are points that are at the interior of a cluster

A border point has fewer than MinPts within Eps, but is in the neighb­orhood of a core point

A noise point is any point that is not a core point or a border point

### Other Clustering Methods

 Fuzzy is a partit­ional clustering method. Fuzzy cluste­ring (also referred to as soft cluste­ring) is a form of clustering in that each data point can belong to more than one cluster. Grap­h-b­ased methods: Jarvis­-Pa­trick, Shared­-Near Neighbor (SNN, Density), Chameleon Mode­l-b­ased methods: Expect­ati­on-­Max­imi­zation

### Regression Analysis

 * Linear Regression  ­ ­| Least squares * Subset selection * Stepwise selection * Regula­rized regression  ­ ­| Ridge  ­ ­| Lasso  ­ ­| El­astic Net

### Anomaly Detection

 Anomaly is a pattern in the data that does not conform to the expected behavior (e.g., outliers, except­ions, peculi­ari­ties, surprise) Types of Anomaly  ­ ­Po­int: An individual data instance is anomalous w.r.t. the data  ­ ­Co­nte­xtual: An individual data instance is anomalous within a context  ­ ­Co­lle­ctive: A collection of related data instances is anomalous Appr­oac­hes  ­ * Graphical (e.g., boxplots, scatter plots)  ­ * Statis­tical (e.g., normal distri­bution, likeli­hood)  ­ ­ ­ | Parametric Techniques  ­ ­ ­ | Non-pa­ram­etric Techniques  ­ * Distance (e.g., neares­t-n­eig­hbor, density, cluste­ring)
Local outlier factor (LOF) is a densit­y-based distance approach

Maha­lanobis Distance is a cluste­rin­g-based distance approach

## Download the Data Mining Cheat Sheet

6 Pages
//media.cheatography.com/storage/thumb/hockeyplay21_data-mining.750.jpg

PDF (recommended)