
Machine learning - Data Mining Cheat Sheet (DRAFT) by

This is a draft cheat sheet. It is a work in progress and is not finished yet.

What is Data?

A collection of data objects and their attributes.
An attribute is a property of an object.
Also called: variable, field, characteristic, feature
A collection of attributes describes an object.
Also called: record, point, case, sample, entity, instance, observation

Types of attributes

Discrete
Finite or countably infinite set of values
Often represented as integers
Examples: zip codes, counts
Continuous
Real-valued
Represented as floating-point numbers
Examples: temperature, height, weight

Hierarchy of attribute types

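Qualitative
Nominal: categories, compared only for equality (=, !=). Examples: ID, zip code, eye color.
Ordinal: ranked categories, compared by order (<, >). Examples: grades, {low, med., high}.
Quantitative
Interval: differences are meaningful (+, -). Examples: dates, temperature (C/F).
Ratio: zero means absence, so ratios are meaningful (*, /). Examples: length, time, temperature (K).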
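The four levels differ in which operations give meaningful results. The sketch below illustrates this with standard-library Python only; the sample values and the `Level` enum are hypothetical and chosen just for illustration.

```python
from enum import IntEnum

# Nominal: only equality comparisons are meaningful.
eye_colors = ["brown", "blue", "brown"]
same_color = eye_colors[0] == eye_colors[2]

# Ordinal: order is meaningful, but distances between ranks are not.
class Level(IntEnum):  # hypothetical ranking for {low, med., high}
    LOW = 1
    MED = 2
    HIGH = 3

levels = [Level.LOW, Level.HIGH, Level.MED]
median_level = sorted(levels)[len(levels) // 2]  # the middle rank

# Interval: differences are meaningful, ratios are not.
c1, c2 = 10.0, 20.0      # temperatures in degrees Celsius
diff_c = c2 - c1         # a 10-degree difference is meaningful

# Ratio: a true zero makes ratios meaningful, so convert to Kelvin first.
k1, k2 = c1 + 273.15, c2 + 273.15
ratio_k = k2 / k1        # roughly 1.035, so 20 C is not "twice as hot" as 10 C
```

Dividing the Celsius values directly would give 2.0, which has no physical meaning because 0 °C is not the absence of temperature.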
 

Types of data sets

Record
Collection of data objects and their attributes
Example: table
Relational
Collection of data objects and the relations between them
Example: graph
Ordered
Ordered collection of data objects
Example: sequence
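As a rough illustration, the three kinds of data set map onto familiar Python structures. This is a minimal sketch with made-up values; the attribute names, node labels, and events are assumptions, not part of the cheat sheet.

```python
# Record data: one dict per data object, with shared attribute names (table rows).
records = [
    {"id": 1, "height_cm": 172.0, "grade": "high"},
    {"id": 2, "height_cm": 165.5, "grade": "low"},
]

# Relational data: objects are nodes, relations are edges (adjacency list of a graph).
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

# Ordered data: the position of each object carries information (a sequence).
sequence = ["login", "search", "purchase"]

columns = sorted({key for row in records for key in row})          # table header
edge_count = sum(len(neighbors) for neighbors in graph.values())   # number of relations
first_event = sequence[0]                                          # order matters
```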

Data quality

High-quality data
Are fit for their intended use
Correctly represent the phenomena they correspond to
Problems
Noise
Outliers
Missing values
 

Noise

Definition
Unwanted perturbation of a signal
Unwanted data
Reasons
Limits in measurement accuracy
Interference from other signals
Measurement of attributes not related to the data modeling task
Handling
Exclude noisy attributes
Remove noise by filtering
Include a model of the noise
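One simple way to remove noise by filtering is a moving average. The cheat sheet does not name a specific filter, so this choice, the function name `moving_average`, and the sample signal are assumptions made for illustration.

```python
def moving_average(signal, window=3):
    """Smooth a sequence with a centered moving average.

    Near the edges the window shrinks to the values that are available.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        chunk = signal[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

noisy = [1.0, 1.0, 4.0, 1.0, 1.0]   # a flat signal with one noisy spike
smoothed = moving_average(noisy, window=3)
```

A wider window suppresses more noise but also blurs genuine changes in the signal, so the window size is a trade-off.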

Outliers

Definition
Data objects that differ significantly from most others
Reasons
Measurement errors
Natural property of the data
Handling
Identify and exclude outliers
Model the outliers
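Identifying and excluding outliers can be sketched with the common 1.5 × IQR rule. The cheat sheet does not prescribe a detection method, so the rule, the helper name `split_outliers`, and the sample measurements are assumptions.

```python
from statistics import quantiles

def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles of the data
    iqr = q3 - q1                        # interquartile range
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inliers, outliers

measurements = [10, 12, 11, 13, 12, 11, 95]   # hypothetical readings
inliers, outliers = split_outliers(measurements)
```

Whether a flagged value should be excluded depends on its cause: a measurement error can be dropped, while a natural property of the data may be better handled by modeling it.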

Missing values

Definition
No value is stored for an attribute of a data object
Reasons
Information was not collected
Example: people decline to give their age
Attribute is not applicable
Example: annual income is not applicable to children
Handling
Eliminate data objects with missing values
Estimate missing values
Example: replace with the attribute's average
Ignore the missing value in the analysis
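The three handling strategies can be compared on a tiny data set. This is a minimal sketch; the records, the attribute names, and the use of `None` to mark a missing value are assumptions for illustration.

```python
from statistics import mean

people = [  # hypothetical records; None marks a missing value
    {"name": "A", "age": 30, "income": 52000},
    {"name": "B", "age": None, "income": 48000},
    {"name": "C", "age": 50, "income": None},
]

# 1. Eliminate data objects: keep only records with no missing attribute.
complete_rows = [p for p in people if None not in p.values()]

# 2. Estimate missing values: fill missing ages with the mean of the observed ages.
observed_ages = [p["age"] for p in people if p["age"] is not None]
mean_age = mean(observed_ages)
imputed = [{**p, "age": mean_age if p["age"] is None else p["age"]} for p in people]

# 3. Ignore the missing value: compute each statistic from the values that exist.
observed_incomes = [p["income"] for p in people if p["income"] is not None]
mean_income = mean(observed_incomes)
```

Elimination is the simplest option but discards two of the three records here, while estimation keeps every record at the cost of introducing values that were never observed.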