Cheatography

Data Science 101 Cheat Sheet (DRAFT)

A cheat sheet of data science topics.

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Linear Regression Cheat Sheet

Linear Regression Overview
Linear regression is a statis­tical technique used to model the relati­onship between a dependent variable and one or more indepe­ndent variables.

It assumes a linear relati­onship between the indepe­ndent variables and the dependent variable.
Simple Linear Regression
Simple linear regression involves a single indepe­ndent variable (x) and a dependent variable (y) related by the equation: y = mx + c, where m is the slope and c is the intercept.
Multiple Linear Regression
Multiple linear regression involves more than one indepe­ndent variable (x1, x2, x3, etc.) and a dependent variable (y) related by the equation: y = b0 + b1x1 + b2x2 + ... + bnxn, where b0 is the intercept, and b1, b2, ..., bn are the coeffi­cients.
Assump­tions of Linear Regression
Linearity: There should be a linear relati­onship between the indepe­ndent and dependent variables.

Independence: The observ­ations should be indepe­ndent of each other.

Homoscedasticity: The variance of the residuals should be constant across all levels of the indepe­ndent variables.

Normality: The residuals should be normally distri­buted.

No multic­oll­ine­arity: The indepe­ndent variables should not be highly correlated with each other.
Fitting the Model
The goal is to find the best-f­itting line that minimizes the sum of squared residuals (diffe­rences between predicted and actual values).

This is typically achieved using the method of least squares.
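As a minimal sketch of the least-squares fit for the simple (one-variable) case, assuming hypothetical data that lies exactly on y = 2x + 1; the function name is illustrative and not taken from any library:

```python
def fit_simple_linear_regression(xs, ys):
    """Return (m, c) for y = m*x + c minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution: slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, c = fit_simple_linear_regression(xs, ys)
print(m, c)  # slope 2.0, intercept 1.0
```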
Interp­reting Coeffi­cients
The intercept (b0) represents the expected value of the dependent variable when all indepe­ndent variables are zero.

The coeffi­cients (b1, b2, ..., bn) represent the change in the dependent variable associated with a one-unit change in the corres­ponding indepe­ndent variable, holding other variables constant.
Evaluating Model Perfor­mance
R-squared (R²): Indicates the proportion of variance in the dependent variable explained by the indepe­ndent variables. Higher values indicate a better fit.

Adjusted R-squared: Similar to R-squared, but adjusts for the number of predictors in the model.

Root Mean Squared Error (RMSE): The square root of the average squared difference between predicted and actual values, expressed in the units of the dependent variable. Lower values indicate better performance.

Residual Analysis: Plotting residuals to check for patterns or outliers that violate assump­tions.
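The first three metrics can be computed directly from predictions. This is a sketch with made-up values; `n_predictors` is the number of independent variables in the model, and the function name is hypothetical:

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Return (R-squared, adjusted R-squared, RMSE)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)               # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes additional predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(ss_res / n)
    return r2, adj_r2, rmse


# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
r2, adj_r2, rmse = regression_metrics(y_true, y_pred, n_predictors=1)
```

For these values R-squared is 0.975 and adjusted R-squared is 0.9625, which shows the penalty applied for the single predictor.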
Handling Nonlin­earity
Polynomial Regression: Transforming the independent variables by adding polynomial terms (e.g., x^2, x^3) to capture nonlinear relationships.

Logarithmic Transf­orm­ation: Taking the logarithm of the dependent or indepe­ndent variables to handle expone­ntial growth or decay.
Dealing with Multic­oll­ine­arity
Check Correl­ation: Identify highly correlated indepe­ndent variables using correl­ation matrices or variance inflation factor (VIF) analysis.

Remove or Combine Variables: Remove one of the highly correlated variables or combine them into a single variable.
Regula­riz­ation Techniques
Ridge Regres­sion: Adds a penalty term to the sum of squared residuals to shrink the coeffi­cients, reducing the impact of multic­oll­ine­arity.

Lasso Regres­sion: Similar to Ridge regres­sion, but with a penalty that can shrink coeffi­cients to zero, effect­ively performing feature selection.

Logistic Regression Cheat Sheet

Logistic Regression Overview
Logistic regression is a statis­tical technique used to model the relati­onship between a dependent variable and one or more indepe­ndent variables.

It is primarily used for binary classification problems, where the dependent variable takes one of two categories.
Binary Logistic Regression
Binary logistic regression involves a binary dependent variable (y) and one or more indepe­ndent variables (x1, x2, x3, etc.).

The logistic regression equation models the probab­ility of the dependent variable belonging to a specific category.
Logistic Regression Equation
The logistic regression equation is repres­ented as:
p = 1 / (1 + e^(-z)), where p is the probab­ility of the event occurring, and z is the linear combin­ation of the indepe­ndent variables and their coeffi­cients.
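A small sketch of this equation, assuming hypothetical coefficients and feature values (none of them come from a fitted model):

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    # Note: math.exp can overflow for very large negative z; fine for a sketch.
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, coefficients, intercept):
    # z is the linear combination of the features and their coefficients
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical model: z = 0.0 + 0.5*x1 - 1.0*x2
p = predict_probability([2.0, 1.0], [0.5, -1.0], 0.0)
print(p)  # z = 0, so p = 0.5
```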
Link Function
Logistic regression uses the logit (log-odds) as its link function. Its inverse, the logistic or sigmoid function, maps the linear combination of independent variables to a probability value between 0 and 1.
Estimating Coeffi­cients
Coeffi­cients are estimated using maximum likelihood estima­tion, which finds the values that maximize the likelihood of the observed data given the model.

Each coefficient is a log odds ratio: the change in the log-odds of the event occurring for a one-unit change in the independent variable, holding the other variables constant.
Interp­reting Coeffi­cients
The coeffi­cients can be expone­ntiated to obtain odds ratios, repres­enting the change in odds of the event occurring for a one-unit change in the indepe­ndent variable.

Odds ratios greater than 1 indicate a positive associ­ation, while those less than 1 indicate a negative associ­ation.
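To illustrate the odds-ratio interpretation, the sketch below checks numerically that a one-unit increase in x multiplies the odds by e raised to the coefficient. The intercept and coefficient are hypothetical values, not estimates from data:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


intercept, coefficient = -1.0, 0.7  # hypothetical fitted values

odds_at_x = odds(sigmoid(intercept + coefficient * 2.0))
odds_at_x_plus_1 = odds(sigmoid(intercept + coefficient * 3.0))

# The ratio of the two odds equals exp(coefficient), whatever x is
ratio = odds_at_x_plus_1 / odds_at_x
```

Here the ratio is about 2.01, so each additional unit of x roughly doubles the odds of the event.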
Evaluating Model Perfor­mance
Accuracy: The proportion of correctly classified instances.

Confusion Matrix: A table showing the true positives, true negatives, false positives, and false negatives.

Precision: The proportion of true positives out of all positive predic­tions (TP / (TP + FP)).

Recall (Sensi­tiv­ity): The proportion of true positives out of all actual positives (TP / (TP + FN)).

Specificity: The proportion of true negatives out of all actual negatives (TN / (TN + FP)).

F1 Score: The harmonic mean of precision and recall (2 * Precision * Recall / (Precision + Recall)), which balances the two.
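The metrics above all follow from the four confusion-matrix counts. A minimal sketch, assuming binary labels encoded as 1 (positive) and 0 (negative) and made-up predictions:

```python
def classification_metrics(y_true, y_pred):
    """Compute common metrics from binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical labels: TP = 3, TN = 3, FP = 1, FN = 1
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

The sketch does not guard against empty denominators (for example, a model that never predicts the positive class), which a real implementation would need to handle.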
Regula­riz­ation Techniques
Ridge Regression (L2 regula­riz­ation): Adds a penalty term to the loss function to shrink the coeffi­cients, reducing overfi­tting.

Lasso Regression (L1 regula­riz­ation): Similar to Ridge regression but can shrink coeffi­cients to zero, effect­ively performing feature selection.
Multiclass Logistic Regression
Multiclass logistic regression extends binary logistic regression to handle more than two catego­ries.

One-vs-Rest (OvR) or One-vs-All (OvA) is a common approach where separate binary logistic regression models are trained for each class against the rest.
Dealing with Imbalanced Data
Adjust Class Weights: Assign higher weights to the minority class to address the class imbalance during model training.

Resampling Techni­ques: Oversa­mpling the minority class or unders­ampling the majority class to create a balanced dataset.

k-Nearest Neighbors Cheat Sheet

k-Nearest Neighbors Overview
k-Nearest Neighbors is a non-pa­ram­etric and instan­ce-­based machine learning algorithm used for classi­fic­ation and regression tasks.

It predicts the class or value of a new data point based on the majority vote or average of its k nearest neighbors in the feature space.
Choosing k
The value of k represents the number of nearest neighbors to consider when making predic­tions.

A small value of k (e.g., 1) may lead to overfi­tting, while a large value of k may lead to oversi­mpl­ifi­cation and loss of local patterns.

The optimal value of k is typically determined through hyperparameter tuning using techniques like cross-validation.
Distance Metrics
Euclidean Distance: Calculates the straig­ht-line distance between two points in the feature space.

Manhattan Distance: Calculates the sum of absolute differ­ences between the coordi­nates of two points.

Other distance metrics like Minkowski, Cosine, and Hamming distance can also be used depending on the data type and problem domain.
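A short sketch of these metrics for points given as equal-length tuples. Minkowski distance generalizes the other two: p = 1 gives Manhattan distance and p = 2 gives Euclidean distance.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p):
    # p = 1 reduces to Manhattan distance, p = 2 to Euclidean distance
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```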
Feature Scaling
It's crucial to scale the features before applying k-NN, as it is sensitive to the scale of the features.

Standardization (mean = 0, standard deviation = 1) or normal­ization (scaling to a range) techniques like min-max scaling are commonly used.
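Both scaling approaches can be sketched for a single feature as follows. The values are hypothetical, and the sketch assumes the feature is not constant (otherwise the divisions fail):

```python
import math


def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical feature
z_scores = standardize(heights_cm)
scaled = min_max_scale(heights_cm)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

In practice the scaling parameters (mean, standard deviation, minimum, maximum) are usually computed on the training data only and then reused for new data.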
Handling Catego­rical Features
Catego­rical features must be encoded into numerical values before applying k-NN.

One-Hot Encoding: Creates binary dummy variables for each category, repres­enting their presence or absence.

Label Encoding: Assigns a unique numerical label to each category.
Classi­fying New Instances
For classi­fic­ation tasks, the class of a new instance is determined by the majority class among its k nearest neighbors.

Voting Mechanisms: A simple majority vote gives each neighbor equal weight; a distance-weighted vote gives closer neighbors more influence (e.g., weights proportional to the inverse of distance).
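A minimal sketch of k-NN classification with a simple majority vote, assuming Euclidean distance and a tiny hypothetical training set (the function name is illustrative):

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair each training point's distance to the query with its label, nearest first
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(points, labels, (2, 2), k=3))      # a
print(knn_classify(points, labels, (8.5, 8.5), k=3))  # b
```

This brute-force version computes the distance to every training point for each query, which is why k-NN becomes expensive on large datasets.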
Regression with k-NN
For regression tasks, the predicted value of a new instance is typically the average (mean or median) of the target values of its k nearest neighbors.
Model Evaluation
Accuracy: Proportion of correctly classified instances for classi­fic­ation tasks.

Mean Squared Error (MSE): Average of the squared differ­ences between the predicted and actual values for regression tasks.

Cross-Validation: Technique to assess the perfor­mance of the k-NN model by splitting the data into multiple folds.
Curse of Dimens­ion­ality
As the number of features increases, the feature space becomes increa­singly sparse, making k-NN less effective.

Feature selection or dimens­ion­ality reduction techniques (e.g., Principal Component Analysis) can help mitigate this issue.
Advantages and Limita­tions
Advant­ages: Simpli­city, no assump­tions about data distri­bution, and ability to capture complex patterns.

Limitations: Comput­ati­onally expensive for large datasets, sensit­ivity to feature scaling, and inability to handle missing values well.

Support Vector Machines Cheat Sheet

Support Vector Machines Overview
Support Vector Machines is a supervised machine learning algorithm used for classi­fic­ation and regression tasks.

It finds an optimal hyperplane that maximally separates or fits the data points in the feature space.
Linear SVM
Linear SVM constructs a linear decision boundary to separate data points of different classes.

It aims to maximize the margin, which is the perpen­dicular distance between the decision boundary and the nearest data points (support vectors).
Kernel Trick
The kernel trick allows SVMs to effici­ently handle non-li­nearly separable data by mapping the data to a higher­-di­men­sional feature space.

Common kernel functions include Linear, Polyno­mial, Radial Basis Function (RBF), and Sigmoid.
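The idea behind the kernel trick can be checked numerically: a kernel returns the dot product in a higher-dimensional space without constructing that space. The sketch below uses a homogeneous degree-2 polynomial kernel on 2-D inputs, chosen for illustration because its feature map is easy to write down, and adds an RBF kernel for comparison. All function names are hypothetical.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def quadratic_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: (x . y)^2."""
    return dot(x, y) ** 2


def quadratic_feature_map(x):
    """Explicit 3-D feature map whose dot product equals quadratic_kernel (2-D inputs only)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def rbf_kernel(x, y, gamma):
    """Radial Basis Function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = quadratic_kernel(x, y)
via_mapping = dot(quadratic_feature_map(x), quadratic_feature_map(y))
```

Both routes give 121 for these inputs, but the kernel never builds the mapped vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the kernel route is practical.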
Soft Margin SVM
Soft Margin SVM allows for some miscla­ssi­fic­ation in order to achieve a more flexible decision boundary.

It introduces a regula­riz­ation parameter (C) to control the trade-off between maximizing the margin and minimizing miscla­ssi­fic­ation.
Choosing the Right Kernel
Linear Kernel: Suitable for linearly separable data or when the number of features is large compared to the number of samples.

Polynomial Kernel: Suitable for problems with interm­ediate complexity and higher­-order polynomial relati­ons­hips.

RBF Kernel: Suitable for complex and non-linear relati­ons­hips; the most commonly used kernel.

Sigmoid Kernel: Based on the tanh function familiar from neural network activations; used less often than the other kernels.
Model Training and Optimi­zation
SVM training involves solving a quadratic progra­mming problem to find the optimal hyperp­lane.

The optimi­zation process can be comput­ati­onally expensive for large datasets, but various optimi­zation techniques (e.g., Sequential Minimal Optimi­zation) can improve effici­ency.
Tuning Parameters
C (Regul­ari­zation Parame­ter): Controls the trade-off between miscla­ssi­fic­ation and the width of the margin. A smaller C allows more miscla­ssi­fic­ation, while a larger C enforces stricter classi­fic­ation.

Gamma (Kernel Coeffi­cient): Influences the shape of the decision boundary. A higher gamma value leads to a more complex decision boundary.
Multi-­Class Classi­fic­ation
One-vs­-Rest (OvR) or One-vs-One (OvO) strategies can be used to extend SVM to multi-­class classi­fic­ation problems.

OvR: Trains separate binary classi­fiers for each class against the rest.

OvO: Trains a binary classifier for every pair of classes.
Handling Imbalanced Data
Class imbalance can affect SVM perfor­mance. Techniques such as resampling (under­sam­pling or oversa­mpling) and adjusting class weights can help address this issue.
Advantages and Limita­tions
Advant­ages: Effective in high-d­ime­nsional spaces, robust against overfi­tting, and suitable for both linear and non-linear classi­fic­ation.

Limitations: Comput­ati­onally intensive for large datasets, sensitive to hyperp­ara­meter tuning, and challe­nging to interpret complex models.

Decision Tree Cheat Sheet

Decision Tree Overview
Decision Trees are a supervised machine learning algorithm used for classi­fic­ation and regression tasks.

They learn a hierar­chical structure of decisions and conditions from the data to make predic­tions.
Tree Constr­uction
Decision Trees are constr­ucted through a top-down, recursive partit­ioning process called recursive binary splitting.

The algorithm selects the best feature at each node to split the data based on certain criteria (e.g., inform­ation gain, Gini impurity).
Splitting Criteria
Inform­ation Gain: Measures the reduction in entropy (or increase in inform­ation) achieved by splitting on a particular feature.

Gini Impurity: Measures the probab­ility of miscla­ssi­fying a randomly chosen element if it were labeled randomly according to the class distri­bution.
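Both criteria can be computed from the class labels at a node. A sketch using base-2 entropy and a hypothetical split of four labels into two pure child nodes:

```python
import math
from collections import Counter


def class_proportions(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini_impurity(labels):
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted_child_entropy = (
        len(left) / n * entropy(left) + len(right) / n * entropy(right)
    )
    return entropy(parent) - weighted_child_entropy


parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]  # a perfect split

print(gini_impurity(parent))                  # 0.5
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```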
Handling Continuous and Catego­rical Features
For continuous features, decision tree algorithms use threshold values to split the data.

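For categorical features, the split depends on the algorithm: some (e.g., ID3, C4.5) create a separate branch for each category, while binary-splitting algorithms (e.g., CART) divide the categories into two groups.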
Tree Pruning
Pruning is a technique used to avoid overfi­tting by reducing the complexity of the decision tree.

Pre-pruning: Setting constr­aints on tree depth, minimum samples per leaf, or maximum number of leaf nodes during tree constr­uction.

Post-pruning: Removing or collapsing branches that provide little inform­ation gain or result in minimal improv­ements in perfor­mance.
Handling Missing Values
Decision Trees can handle missing values by treating them as a separate category or by imputing missing values before tree constr­uction.
Handling Imbalanced Data
Imbalanced class distri­butions can bias the decision tree. Techniques like class weighting, unders­amp­ling, or oversa­mpling can help address this issue.
Feature Importance
Decision Trees provide feature importance scores based on how much each feature contri­butes to the overall split decisions.

Importance can be measured by the total reduction in impurity or the total information gain associated with a feature.
Ensemble Methods
Random Forest: An ensemble of decision trees where each tree is trained on a bootstrap sample of the data (a random sample drawn with replacement). It reduces overfitting and improves performance.

Gradient Boosting: Builds an ensemble by sequen­tially adding decision trees, with each tree correcting the mistakes made by the previous trees.
Advantages and Limita­tions
Advant­ages: Easy to understand and interpret, handles both numerical and catego­rical data, and can capture non-linear relati­ons­hips.

Limitations: Prone to overfi­tting, sensitive to small changes in data, and may not generalize well to unseen data if the tree structure is too complex.

Random Forest Cheat Sheet

Random Forest Overview
Random Forest is an ensemble learning algorithm that combines multiple decision trees to make predic­tions.

It is used for both classi­fic­ation and regression tasks and improves upon the individual decision trees' perfor­mance and robust­ness.
Ensemble of Decision Trees
Random Forest creates an ensemble by constr­ucting a set of decision trees on random subsets of the training data (bootstrap sampling).

Each decision tree is trained independently. The forest's prediction is the majority vote (classification) or the average (regression) of the individual tree predictions.
Random Feature Subsets
In addition to using random subsets of the training data, Random Forest also considers a random subset of features at each node for constr­ucting the decision trees.

This randomness reduces the correl­ation between trees and promotes diversity, leading to improved genera­liz­ation.
Building Decision Trees
Each decision tree in the Random Forest is constr­ucted using a subset of the training data and a subset of the available features.

Tree constr­uction follows the usual process of recursive binary splitting based on criteria like inform­ation gain or Gini impurity.
Feature Importance
Random Forest provides a measure of feature importance based on how much each feature contri­butes to the ensemble's predictive perfor­mance.

Importance is commonly calculated as the average decrease in impurity (e.g., Gini index) over all splits that use a feature, averaged across the trees.
Out-of-Bag (OOB) Error
Random Forest uses the out-of-bag samples (not included in the bootstrap sample) to estimate the model's perfor­mance without the need for cross-­val­ida­tion.

OOB error provides a good estimate of the model's genera­liz­ation perfor­mance and can be used for model evaluation and hyperp­ara­meter tuning.
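A sketch of how in-bag and out-of-bag samples arise for a single tree. It covers only the sampling step (no tree is trained), and the function names are hypothetical:

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one tree's bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def out_of_bag_indices(n_samples, in_bag):
    """Rows never drawn for this tree; they can serve as its validation set."""
    drawn = set(in_bag)
    return [i for i in range(n_samples) if i not in drawn]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_bag = bootstrap_indices(10, rng)
oob = out_of_bag_indices(10, in_bag)
```

Because sampling is done with replacement, some rows appear several times and others not at all. For large datasets roughly 37% of the rows end up out-of-bag for any given tree.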
Hyperp­ara­meter Tuning
Important hyperparameters to consider when working with Random Forests include the number of trees (n_estimators), maximum depth of each tree (max_depth), minimum samples required to split a node (min_samples_split), and maximum number of features to consider for each split (max_features).
Handling Imbalanced Data
Random Forests can handle imbalanced data by adjusting class weights during tree constr­uction or by using sampling techniques like oversa­mpling the minority class or unders­ampling the majority class.
Advantages and Limita­tions
Advant­ages: Robust to overfi­tting, can handle high-d­ime­nsional data, provides feature import­ance, and performs well on various types of problems.

Limitations: Requires more comput­ational resources than individual decision trees, can be slower to train and predict, and may not perform well on extremely imbalanced datasets.
Applic­ations
Random Forests are commonly used in various domains, including classi­fic­ation tasks such as image recogn­ition, text classi­fic­ation, fraud detection, and regression tasks like predicting housing prices or stock market trends.

Gradient Boosting Cheat Sheet

Gradient Boosting Overview
Gradient Boosting is an ensemble learning algorithm that combines multiple weak prediction models (typically decision trees) to create a strong predictive model.

It is used for both classi­fic­ation and regression tasks and sequen­tially improves the model's perfor­mance by minimizing the errors of the previous models.
Boosting Process
Gradient Boosting builds the ensemble by adding decision trees sequen­tially, with each subsequent tree correcting the mistakes of the previous ones.

The trees are built in a greedy manner, minimizing a loss function (e.g., mean squared error for regres­sion, log loss for classi­fic­ation) at each step.
Gradient Descent
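Gradient Boosting optimizes the loss function using gradient descent in function space: each step adjusts the model's predictions rather than a fixed set of parameters.

The model calculates the negative gradient of the loss function with respect to the current model's predictions and fits a new weak learner to this negative gradient (the pseudo-residuals). For squared error loss, the negative gradient is simply the residual.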
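A compact sketch of this procedure under simplifying assumptions: a single numeric feature, squared error loss (so the negative gradient is the residual), and depth-1 trees (stumps) as weak learners. The function names and data are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree: pick the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the maximum so the right side is never empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left) + sum(
            (t - right_mean) ** 2 for t in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees, learning_rate):
    base = sum(ys) / len(ys)  # initial model: the mean of the targets
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual y - F
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    return predict


# Hypothetical step-shaped data
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
predict = gradient_boost(xs, ys, n_trees=50, learning_rate=0.1)
```

With a learning rate of 0.1 each tree removes only a tenth of the remaining error, which is why many trees are needed before the predictions approach 1 and 5.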
Learning Rate and Number of Trees
The learning rate (shrinkage factor) controls the contri­bution of each tree to the ensemble. A smaller learning rate requires more trees for conver­gence but can lead to better genera­liz­ation.

The number of trees (itera­tions) determines the complexity of the model and affects both training time and the risk of overfi­tting.
Regula­riz­ation Techniques
Regula­riz­ation is applied to control the complexity of the model and avoid overfi­tting.

Tree Depth: Restri­cting the maximum depth of each tree can prevent overfi­tting and speed up training.

Tree Pruning: Applying pruning techniques to remove branches with little contri­bution to the model's perfor­mance.
Feature Subsam­pling
Gradient Boosting can use random feature subsets similar to Random Forests to introduce randomness and increase diversity among the weak learners.

It can prevent overfi­tting when dealing with high-d­ime­nsional data or datasets with a large number of features.
Handling Imbalanced Data
Techniques such as class weighting or sampling (under­sam­pling the majority class or oversa­mpling the minority class) can be applied to address imbalanced datasets during Gradient Boosting.
Hyperp­ara­meter Tuning
Important hyperparameters to consider when working with Gradient Boosting include the learning rate, number of trees, maximum depth of each tree, and row and column subsampling rates (named subsample and colsample_bytree in libraries such as XGBoost).
Early Stopping
Early stopping is a technique used to prevent overfi­tting and speed up training by monitoring the model's perfor­mance on a validation set.

Training stops when the perfor­mance on the validation set does not improve for a specified number of iterat­ions.
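The stopping rule can be sketched on a list of per-iteration validation losses. Here `patience` stands for the "specified number of iterations"; both that name and the loss values are assumptions made for illustration.

```python
def early_stopping_round(validation_losses, patience):
    """Return (best_round, stopped_round) for a non-empty sequence of validation losses.

    Training stops once the loss has failed to improve for `patience` rounds in a row.
    """
    best_loss = float("inf")
    best_round = -1
    for round_index, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_round = round_index
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, round_index


# Hypothetical validation losses per boosting round
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
best_round, stopped_round = early_stopping_round(losses, patience=3)
print(best_round, stopped_round)  # 2 5
```

The final loss of 0.5 is never reached because training has already stopped, which shows the trade-off involved in choosing the patience value.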
Applic­ations
Gradient Boosting has been succes­sfully applied to a wide range of tasks, including web search ranking, anomaly detection, click-­through rate predic­tion, and person­alized medicine.
 

Naive Bayes Cheat Sheet

Naive Bayes Overview
Naive Bayes is a probab­ilistic machine learning algorithm based on Bayes' theorem with the assumption of indepe­ndence between features.

It is primarily used for classification tasks and is efficient, simple, and often works well in practice.
Bayes' Theorem
Bayes' theorem calculates the posterior probab­ility of a class given the observed evidence.

P(Class|Features) = (P(Features|Class) * P(Class)) / P(Features)
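A worked example of the formula, using made-up probabilities for a spam filter in which the single feature is whether a message contains a given word:

```python
def posterior(prior, likelihood, evidence):
    """P(Class|Features) = P(Features|Class) * P(Class) / P(Features)."""
    return likelihood * prior / evidence


# Hypothetical probabilities
p_spam = 0.2              # P(Class = spam)
p_word_given_spam = 0.5   # P(Features | spam)
p_word_given_ham = 0.05   # P(Features | not spam)

# P(Features) via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
```

With these numbers P(Features) is 0.14 and the posterior is 0.1 / 0.14, about 0.714: seeing the word raises the probability of spam from 20% to roughly 71%.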
Assumption of Feature Indepe­ndence
Naive Bayes assumes that the features are condit­ionally indepe­ndent given the class label, which is a simpli­fying assumption to make the calcul­ations more tractable.

Despite this assumption rarely being true in reality, Naive Bayes can still perform well in practice.
Types of Naive Bayes Classi­fiers
Gaussian Naive Bayes: Assumes a Gaussian distri­bution for continuous features and estimates the mean and variance for each class.

Multinomial Naive Bayes: Suitable for discrete features, typically used for text classi­fic­ation tasks, where features represent word freque­ncies.

Bernoulli Naive Bayes: Similar to multin­omial, but assumes binary features (presence or absence).
Feature Probab­ility Estimation
For continuous features, Gaussian Naive Bayes estimates the mean and variance for each class.

For discrete features, Multin­omial Naive Bayes estimates the probab­ility of each feature occurring in each class.

For binary features, Bernoulli Naive Bayes estimates the probab­ility of each feature being present in each class.
Handling Zero Probab­ilities
The Naive Bayes classifier may encounter zero probab­ilities if a particular feature does not occur in the training set for a specific class.

To handle this, additive smoothing can be applied: a small constant is added to every count so that no probability is zero. With a constant of 1 this is known as Laplace or add-one smoothing.
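A sketch of the smoothed estimate for a multinomial feature such as a word count. The parameter names are hypothetical; `alpha` is the smoothing constant, and `alpha = 1` gives add-one smoothing.

```python
def smoothed_probability(feature_count, total_count, n_feature_values, alpha=1.0):
    """Additive smoothing: add alpha to every count before normalizing."""
    return (feature_count + alpha) / (total_count + alpha * n_feature_values)


# Hypothetical case: a word never seen in a class that has 100 word
# occurrences in total, with a vocabulary of 50 distinct words
unsmoothed = smoothed_probability(0, 100, 50, alpha=0.0)
smoothed = smoothed_probability(0, 100, 50, alpha=1.0)
print(unsmoothed)  # 0.0
print(smoothed)    # 1/150, about 0.0067
```

Without smoothing, the single zero would force the whole product of feature probabilities for that class to zero.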
Handling Continuous Features
Gaussian Naive Bayes assumes a Gaussian distri­bution for continuous features.

Continuous features can be discre­tized into bins or transf­ormed into catego­rical variables before using Naive Bayes.
Text Classi­fic­ation with Naive Bayes
Naive Bayes is commonly used for text classi­fic­ation tasks, such as spam detection or sentiment analysis.

Text data is typically prepro­cessed by tokeni­zation, removing stop words, and applying techniques like TF-IDF or Bag-of­-Words repres­ent­ation before using Naive Bayes.
Advantages and Limita­tions
Advant­ages: Simpli­city, effici­ency, and can handle high-d­ime­nsional data well.

Limitations: Strong indepe­ndence assumption may not hold in reality, and it can be sensitive to irrelevant features. It may struggle with rare or unseen combin­ations of features.
Handling Imbalanced Data
Naive Bayes can face challenges with imbalanced datasets where the class distri­bution is skewed.

Techniques like class weighting or resampling (under­sam­pling or oversa­mpling) can help alleviate the impact of imbalanced data.

Principal Component Analysis Cheat Sheet

PCA Overview
PCA is a dimens­ion­ality reduction technique used to transform a high-d­ime­nsional dataset into a lower-­dim­ens­ional space.

It identifies the principal compon­ents, which are orthogonal directions that capture the maximum variance in the data.
Variance and Covariance
PCA is based on the varian­ce-­cov­ariance matrix or the correl­ation matrix of the dataset.

Variance measures the spread of data along a specific axis, while covariance measures the relati­onship between two variables.
Steps in PCA
Standa­rdize the data: PCA works best with standa­rdized data to ensure equal importance across different variables.

Calculate the covariance matrix or correl­ation matrix: This represents the relati­onships between the variables in the dataset.

Compute the eigenv­ectors and eigenv­alues: These eigenv­ectors represent the principal compon­ents, and the corres­ponding eigenv­alues indicate the amount of variance explained by each component.

Select the desired number of principal compon­ents: Choose the top components that explain the majority of the variance in the data.

Transform the data: Project the original data onto the selected principal components to obtain the lower-­dim­ens­ional repres­ent­ation.
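The steps above can be sketched without external libraries under two stated simplifications: the data is centered but not divided by its standard deviation, and only the first principal component is found, using power iteration in place of a full eigendecomposition. Function and variable names are hypothetical.

```python
import math


def first_principal_component(rows, n_iterations=200):
    """Return (direction, eigenvalue, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Step 1 (simplified): center each feature at zero
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Step 2: sample covariance matrix
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Step 3: power iteration converges to the eigenvector with the largest eigenvalue
    # (assumes the start vector is not orthogonal to it and the covariance is non-zero)
    v = [1.0] * d
    for _ in range(n_iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    # Steps 4 and 5: keep one component and project the centered data onto it
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, eigenvalue, scores


# Hypothetical data lying exactly on the line y = 2x
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, eigenvalue, scores = first_principal_component(rows)
```

For this data the direction is proportional to (1, 2) and the eigenvalue is 5, which is the entire variance, since the points lie on a single line.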
Explained Variance and Scree Plot
Explained variance ratio indicates the proportion of variance explained by each principal component.

A scree plot visualizes the explained variance ratio for each component, helping to determine the number of components to retain.
Dimens­ion­ality Reduction and Recons­tru­ction
PCA reduces the dimens­ion­ality of the dataset by selecting a subset of principal compon­ents.

Reconstruction of the original data is possible by projecting the lower-­dim­ens­ional repres­ent­ation back into the original feature space.
Applic­ations of PCA
Dimens­ion­ality reduction: PCA can help visualize high-d­ime­nsional data, reduce noise, and eliminate redundant or correlated features.

Data compre­ssion: PCA can compress the data by retaining only the most important compon­ents.

Feature extrac­tion: PCA can extract meaningful features from complex data, facili­tating subsequent analysis.
Interp­ret­ation of Principal Components
Principal components are linear combin­ations of the original features.

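The first principal component points in the direction of greatest variance in the data; each subsequent component captures the greatest remaining variance while staying orthogonal to the previous ones.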

The magnitude of the compon­ent's loading on a particular feature indicates its contri­bution to that component.
Assump­tions and Limita­tions
PCA assumes linear relati­onships between variables and requires variables to be continuous or approx­imately contin­uous.

It may not be suitable for datasets with nonlinear relati­onships or when interp­ret­ability of individual features is essential.
Extensions to PCA
Kernel PCA: An extension that allows nonlinear transf­orm­ations of the data.

Sparse PCA: A variant that encourages sparsity in the loadings, resulting in a more interp­retable repres­ent­ation.
Implem­ent­ation and Libraries
PCA is implemented in various programming languages. Commonly used libraries include scikit-learn (Python) and caret (R); numpy (Python) provides the linear algebra routines needed to compute it directly.

Cluster Analysis Cheat Sheet

Cluster Analysis Overview
Cluster Analysis is an unsupe­rvised learning technique used to group similar objects or data points into clusters based on their charac­ter­istics or proximity.

It helps discover hidden patterns, simila­rities, or structures within the data.
Types of Cluster Analysis
Hierar­chical Cluste­ring: Builds a hierarchy of clusters by recurs­ively merging or splitting clusters based on a similarity measure.

K-means Cluste­ring: Divides the data into a predet­ermined number (k) of non-ov­erl­apping clusters by minimizing the within­-cl­uster sum of squares.

Density-based Cluste­ring: Groups data points based on density and identifies regions with higher density as clusters.

Model-based Cluste­ring: Assumes a specific statis­tical model for each cluster and estimates model parameters to assign data points to clusters.
Similarity and Distance Measures
Cluster analysis often relies on similarity or distance measures to determine the proximity between data points.

Common measures include Euclidean distance and Manhattan distance, as well as cosine similarity (a similarity rather than a distance; 1 minus cosine similarity is often used as the distance).
Hierar­chical Clustering
Agglom­erative (Botto­m-Up): Starts with each data point as a separate cluster and iterat­ively merges the closest pairs of clusters until all points belong to a single cluster.

Divisive (Top-D­own): Begins with all data points in one cluster and recurs­ively splits clusters until each data point is in its own cluster.
K-means Clustering
K-means randomly initializes k cluster centroids, assigns each data point to the nearest centroid, recalculates each centroid as the mean of its assigned points, and repeats the last two steps until convergence.

The choice of the number of clusters (k) is important and can impact the results.
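A minimal sketch of the assign-and-update loop. As a simplification, the initial centroids are passed in rather than chosen randomly, so the run is reproducible; the data and function name are hypothetical.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Return (centroids, clusters) after iterating until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups, k = 2
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, initial_centroids=[(1, 1), (8, 8)])
```

On this data the centroids settle at about (1.33, 1.33) and (8.33, 8.33). With random initialization the result can depend on the starting centroids, which is why the algorithm is often run several times.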
Densit­y-based Clustering (DBSCAN)
Densit­y-based Spatial Clustering of Applic­ations with Noise (DBSCAN) groups data points based on density and identifies core points, border points, and noise points.

It defines clusters as dense regions separated by sparser areas and does not require specifying the number of clusters in advance.
Model-­based Clustering (Gaussian Mixture Models)
Gaussian Mixture Models (GMM) assume that the data points are generated from a mixture of Gaussian distri­but­ions.

The parameters of the Gaussian distributions are estimated (typically with the Expectation-Maximization algorithm), and each data point is assigned to the cluster with the highest posterior probability.
Evaluation of Clustering
Internal Evalua­tion: Measures the quality of clustering using intrinsic criteria such as the silhouette coeffi­cient or within­-cl­uster sum of squares.

External Evalua­tion: Compares the clustering results to a known ground truth, if available, using external criteria like purity or F-measure.
Handling Missing Data and Outliers
Missing data can be handled by imputation techniques before cluste­ring.

Outliers can signif­icantly impact clustering results. Techniques like outlier detection or prepro­cessing methods can be applied to mitigate their influence.
Visual­ization of Clustering Results
Dimens­ion­ality reduction techniques like PCA or t-SNE can be used to visualize high-d­ime­nsional clustering results in lower-­dim­ens­ional space.

Scatter plots, heatmaps, or dendro­grams can provide insights into the clustering structure.

Neural Networks Cheat Sheet

Neural Network Basics
Neural networks are a class of machine learning models inspired by the human brain's structure and functi­oning.

They consist of interc­onn­ected nodes called neurons, organized in layers (input, hidden, and output).
Activation Functions
Activation functions introduce non-li­nearity to the neural network and help model complex relati­ons­hips.

Common activation functions include sigmoid, tanh, ReLU, and softmax (for multiclass classi­fic­ation).
Forward Propag­ation
Forward propag­ation is the process of passing input data through the neural network to obtain predic­tions.

Each neuron applies a weighted sum of inputs, followed by the activation function, to produce an output.
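A sketch of forward propagation through a tiny network with two inputs, two hidden ReLU neurons, and one sigmoid output. The weights and biases are arbitrary illustrative values, not trained ones.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """Each neuron computes activation(weighted sum of inputs + bias)."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


inputs = [1.0, 2.0]
# Hidden layer: one row of weights per neuron
hidden = dense_layer(inputs, [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0], relu)
# Output layer: a single sigmoid neuron
output = dense_layer(hidden, [[1.0, -1.0]], [2.0], sigmoid)

print(hidden)  # [0.0, 2.0]
print(output)  # [0.5]
```

The first hidden neuron has a negative weighted sum (-0.5), so ReLU outputs 0 for it. This is the non-linearity that the activation function introduces.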
Loss Functions
Loss functions quantify the difference between predicted outputs and true labels.

Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE).

Binary Classi­fic­ation: Binary Cross-­Ent­ropy.

Multiclass Classi­fic­ation: Catego­rical Cross-­Ent­ropy.
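These loss functions can be sketched in a few lines, assuming predicted probabilities strictly between 0 and 1 and one-hot encoded targets for the multiclass case:

```python
import math


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, p_pred)
    ) / len(y_true)


def categorical_cross_entropy(one_hot, probabilities):
    """Loss for one example: -log of the probability given to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probabilities) if t)


print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))   # 2.0
print(mean_absolute_error([1.0, 2.0], [1.0, 4.0]))  # 1.0
bce = binary_cross_entropy([1, 0], [0.5, 0.5])
cce = categorical_cross_entropy([0, 1, 0], [0.25, 0.5, 0.25])
```

Both cross-entropy values equal ln 2 (about 0.693) here, because the true class is given probability 0.5 in each case.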
Backpr­opa­gation
Backpropagation computes the gradients of the loss function with respect to the weights of the neural network.

It propagates the error from the output layer back through the previous layers using the chain rule; the weights are then adjusted through gradient descent.
Gradient Descent Optimi­zation
Gradient Descent is an optimi­zation algorithm used to minimize the loss function and update the weights iterat­ively.

Common variants include Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent, and Adam.
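
The basic update rule, w = w - learning_rate * gradient, can be sketched on a one-dimensional toy loss. The loss f(w) = (w - 3)^2 and the step settings are assumptions chosen only to make the behavior easy to check.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Toy loss f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

SGD and mini-batch variants use the same update but estimate the gradient from one sample or a small batch instead of the full dataset.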
Regula­riz­ation Techniques
Regula­riz­ation helps prevent overfi­tting and improves the genera­liz­ation of the neural network.

Common techniques include L1 and L2 regularization (L2 is often implemented as weight decay), dropout, and early stopping.
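
Two of these techniques are small enough to sketch directly. The version of dropout shown is the "inverted" form, which rescales the kept activations during training; that choice, and the function names, are assumptions for illustration.

```python
import random

def l2_penalty(weights, lam):
    """L2 term added to the loss: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`,
    and scale the survivors so the expected activation is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

penalty = l2_penalty([1.0, -2.0], lam=0.1)
dropped = dropout([1.0] * 1000, rate=0.5, rng=random.Random(0))
```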
Hyperp­ara­meter Tuning
Neural networks have various hyperp­ara­meters that need to be tuned for optimal perfor­mance.

Examples include learning rate, number of layers, number of neurons per layer, batch size, and activation functions.
Convol­utional Neural Networks (CNN)
CNNs are specia­lized neural networks commonly used for image and video processing tasks.

They consist of convol­utional layers, pooling layers, and fully connected layers, exploiting the spatial structure of data.
Recurrent Neural Networks (RNN)
RNNs are designed for sequential data processing tasks, such as natural language processing and time series analysis.

They have recurrent connec­tions that allow inform­ation to persist and flow across different time steps.
Transfer Learning
Transfer learning reuses neural network models that were pre-trained on large datasets for similar tasks to improve performance on smaller datasets.

By using pre-tr­ained models as a starting point, training time can be reduced, and genera­liz­ation can be enhanced.
Hardware Accele­ration
To speed up training and inference, specia­lized hardware like GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units) can be utilized.

Convol­utional Neural Networks Cheat Sheet

Convol­utional Neural Networks Overview
CNNs are a type of neural network specif­ically designed for processing grid-like data, such as images.

They leverage the concept of convol­ution to extract relevant features from the input data.
Convol­utional Layers
Convol­utional layers perform the main feature extraction in CNNs.

Each layer consists of multiple filters (also called kernels) that scan the input data through convol­ution operat­ions.

Convolution applies a sliding window over the input and performs elemen­t-wise multip­lic­ation and summing to produce feature maps.
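
The sliding-window operation can be sketched in plain Python. This assumes stride 1 and no padding, and, like most deep learning libraries, it does not flip the kernel (strictly speaking this is cross-correlation). The image and kernel values are made up.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; at each position multiply
    element-wise and sum to get one value of the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = conv2d(image, kernel)  # [[-4, -4], [-4, -4]]
```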
Pooling Layers
Pooling layers reduce the spatial dimensions of the feature maps, reducing comput­ational complexity and providing spatial invari­ance.

Common types of pooling include Max Pooling (selecting the maximum value in each pooling region) and Average Pooling (taking the average).
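
A minimal max pooling sketch, assuming non-overlapping square windows (stride equal to the window size) and an illustrative 4x4 feature map:

```python
def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size region."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, cols - size + 1, size)
        ]
        for i in range(0, rows - size + 1, size)
    ]

fm = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
pooled = max_pool2d(fm)  # [[6, 4], [7, 9]]
```

Average pooling follows the same pattern with the mean of each region in place of `max`.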
Activation Functions
Activation functions introduce non-li­nearity to the CNN and enable modeling complex relati­ons­hips.

ReLU (Rectified Linear Unit) is commonly used as the activation function in CNNs; it tends to speed up convergence and mitigates the vanishing gradient problem, although it does not eliminate it.
Fully Connected Layers
Fully connected layers, also known as dense layers, are tradit­ional neural network layers where each neuron is connected to every neuron in the previous layer.

They provide the final classi­fic­ation or regression output by combining the learned features.
Loss Functions
Loss functions quantify the difference between predicted outputs and true labels in CNNs.

Common loss functions include Mean Squared Error (MSE) for regression tasks and Cross-­Entropy for classi­fic­ation tasks.
Training Techniques
CNNs are typically trained using backpr­opa­gation and gradient descent optimi­zation methods.

Techniques like Dropout (randomly deactivating neurons during training) and Batch Normalization (normalizing each layer's inputs over a mini-batch to stabilize and accelerate training) are commonly used to improve generalization and performance.
Data Augmen­tation
Data augmen­tation techniques help increase the diversity of the training data by applying transf­orm­ations such as rotations, transl­ations, flips, or scaling.

This helps improve the model's ability to generalize and reduces overfi­tting.
Transfer Learning
Transfer learning leverages pretrained CNN models on large datasets and adapts them to new tasks or smaller datasets.

Pretrained models like VGGNet and ResNet are available, allowing transfer of learned features to new applic­ations.
Object Locali­zation and Detection
CNNs can be extended to perform object locali­zation and detection tasks using techniques like bounding box regression and region proposal networks (RPN).
Semantic Segmen­tation
Semantic segmen­tation assigns a label to each pixel or region in an image, allowing detailed object­-level unders­tan­ding.

Fully Convol­utional Networks (FCNs) are commonly used for semantic segmen­tation.
Hardware Accele­ration
CNNs can benefit from specia­lized hardware like GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units) for faster training and inference.

Recurrent Neural Networks Cheat Sheet

Recurrent Neural Network (RNN) Basics
RNNs are a class of neural networks designed for processing sequential data, such as time series, natural language, and speech.

They have recurrent connections that allow information to persist and flow across different time steps.
RNN Cell
The basic building block of an RNN is the RNN cell, which maintains a hidden state and takes input at each time step.

The hidden state captures the memory of past inputs and influences future predic­tions.
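
A scalar sketch of a vanilla RNN cell shows how the hidden state carries memory forward. The weights `w_x`, `w_h` and bias `b` are hypothetical values; real cells use weight matrices and learned parameters.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One time step: combine the current input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Run the cell over a sequence, starting from a zero hidden state."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# The input is non-zero only at the first step, yet later states stay non-zero
states = run_rnn([1.0, 0.0, 0.0])
```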
Vanishing and Exploding Gradients
RNNs can suffer from the vanishing gradient problem, where gradients diminish expone­ntially as they propagate through time, leading to diffic­ulties in learning long-term depend­encies.

Conversely, exploding gradients can occur when gradients grow rapidly during backpr­opa­gation.
Long Short-Term Memory (LSTM)
LSTMs are a type of RNN that address the vanishing gradient problem by using gating mechan­isms.

They introduce memory cells, input gates, output gates, and forget gates to select­ively remember or forget inform­ation.
Gated Recurrent Unit (GRU)
GRUs are another type of RNN that address the vanishing gradient problem and have a simpler archit­ecture compared to LSTMs.

They use reset and update gates to control the flow of inform­ation through the network.
Bidire­ctional RNNs
Bidire­ctional RNNs process the input sequence in both forward and backward direct­ions, capturing inform­ation from past and future contexts.

They are useful when the current prediction depends on both past and future context.
Sequen­ce-­to-­Seq­uence Models
Sequen­ce-­to-­seq­uence models, often built with RNNs, are used for tasks such as machine transl­ation, text summar­iza­tion, and speech recogn­ition.

They encode the input sequence into a fixed-size repres­ent­ation (context vector) and decode it to generate the output sequence.
Attention Mechanism
Attention mechanisms enhance the capability of RNNs by select­ively focusing on different parts of the input sequence.

They assign different weights to each input element, emphas­izing more relevant inform­ation during decoding or genera­ting.
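
One common way to compute such weights is dot-product attention: score each input against a query, normalize the scores with softmax, and take the weighted sum of the values. The sketch below assumes that variant; the query, keys, and values are made-up vectors.

```python
import math

def attention(query, keys, values):
    """Return (weights, context) for dot-product attention."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    shift = max(scores)  # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(dim)]
    return weights, context

weights, context = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

The first key matches the query more closely, so it receives the larger weight and dominates the context vector.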
Training and Backpr­opa­gation Through Time (BPTT)
RNNs are trained using BPTT, which extends backpr­opa­gation to handle sequences.

BPTT unfolds the RNN through time, allowing error gradients to be calculated and applied to update the weights.
Applic­ations of RNNs
Language modeling, text genera­tion, and sentiment analysis.

Machine transl­ation and natural language unders­tan­ding.

Speech recognition and speech synthesis.

Time series forecasting and anomaly detection.
Handling Variab­le-­Length Inputs
Techniques like padding, masking, and sequence bucketing can be used to handle inputs of different lengths in RNNs.
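
Padding and masking can be sketched together: pad every sequence to the longest length and keep a mask marking which positions are real. The pad value 0 and the token ids are assumptions for illustration.

```python
def pad_sequences(sequences, pad_value=0):
    """Right-pad sequences to equal length and build a 1/0 mask."""
    max_len = max(len(s) for s in sequences)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

padded, mask = pad_sequences([[5, 3], [7], [1, 2, 4]])
# padded: [[5, 3, 0], [7, 0, 0], [1, 2, 4]]
# mask:   [[1, 1, 0], [1, 0, 0], [1, 1, 1]]
```

The mask lets the loss or the recurrent layer ignore padded positions.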
Hardware Accele­ration
RNNs, especially LSTMs and GRUs, can benefit from specia­lized hardware like GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units) for faster training and inference.

Generative Advers­arial Networks Cheat Sheet

Generative Advers­arial Networks (GAN) Basics
GANs are a class of deep learning models composed of two compon­ents: a generator and a discri­min­ator.

The generator learns to generate synthetic data samples that resemble real data, while the discri­minator tries to distin­guish between real and fake samples.
Generator
The generator takes random noise as input and generates synthetic samples.

It typically consists of one or more layers of neural networks, often using transpose convol­utions for upsamp­ling.
Discri­minator
The discri­minator takes a sample as input and estimates the probab­ility of it being real or fake.

It typically consists of one or more layers of neural networks, often using convol­utions for feature extrac­tion.
Advers­arial Training
The generator and discri­minator are trained in an advers­arial manner.

The generator tries to generate samples that fool the discri­min­ator, while the discri­minator aims to correctly classify real and fake samples.
Loss Functions
The generator and discri­minator are trained using different loss functions.

The generator's loss function encourages the generated samples to be classified as real by the discriminator.

The discriminator's loss function penalizes misclassifying real and fake samples.
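
A sketch of the two losses, assuming the discriminator outputs a probability that a sample is real and assuming the commonly used non-saturating generator loss (other formulations exist). The inputs are lists of discriminator outputs, not actual images.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Penalize low scores on real samples and high scores on fakes."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating form: reward fakes that the discriminator
    scores as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

confused = discriminator_loss([0.5], [0.5])   # discriminator is guessing
confident = discriminator_loss([0.9], [0.1])  # discriminator is mostly right
```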
Mode Collapse
Mode collapse occurs when the generator produces limited and repetitive samples, failing to capture the diversity of the real data distri­bution.

Techniques like minibatch discri­min­ation and feature matching can help alleviate mode collapse.
Deep Convol­utional GAN (DCGAN)
DCGAN is a popular GAN archit­ecture that uses convol­utional neural networks for both the generator and discri­min­ator.

It leverages convol­utional and transpose convol­utional layers to generate and discri­minate images.
Condit­ional GAN (cGAN)
cGANs introduce additional inform­ation (such as class labels) to guide the generation process.

The generator and discri­minator take both random noise and condit­ional inform­ation as input.
Evaluation of GANs
Evaluating GANs is challenging because the adversarial training losses do not directly measure the quality or diversity of the generated samples.

Common evaluation methods include visual inspec­tion, Inception Score, Fréchet Inception Distance (FID), and Precision and Recall curves.
Unsupe­rvised Repres­ent­ation Learning
GANs can learn meaningful repres­ent­ations without explicit labels.

By training on a large unlabeled dataset, the generator can capture and generate high-level features.
Variat­ional Autoen­coder (VAE) vs. GAN
VAEs and GANs are both generative models but differ in their underlying princi­ples.

VAEs focus on learning latent repres­ent­ations and recons­tru­ction, while GANs emphasize generating realistic samples.
Applic­ations of GANs
Image synthesis and genera­tion.

Style transfer and image-­to-­image transl­ation.

Data augmen­tation and synthesis for training other models.

Text-to-image synthesis and genera­tion.
 

Transfer Learning Cheat Sheet

What is Transfer Learning?
Transfer learning is a technique in machine learning where knowledge gained from training one model is applied to another related task or dataset.

It leverages pre-tr­ained models and their learned repres­ent­ations to improve perfor­mance and reduce the need for extensive training on new datasets.
Benefits of Transfer Learning
Reduces the need for large labeled datasets for training new models.

Saves comput­ational resources and time required for training.

Helps generalize learned features to new tasks or domains.

Improves model perfor­mance, especially with limited data.
Popular Pre-tr­ained Models
Image Classi­fic­ation: VGG, ResNet, Inception, MobileNet, Effici­entNet.

Natural Language Processing: Word2Vec and GloVe (word embeddings), BERT and GPT (Transformer-based language models).
Steps for Transfer Learning
Select a pre-tr­ained model: Choose a model that was trained on a large dataset and is suitable for your task.

Remove the top layers: Remove the final layers respon­sible for task-s­pecific predic­tions.

Feature Extrac­tion: Extract features from the pre-tr­ained model by passing your dataset through the remaining layers.

Add new layers: Add new layers to the pre-tr­ained model to adapt it to your specific task.

Train the new model: Fine-tune the new layers with your labeled dataset while keeping the pre-tr­ained weights fixed or updating them with a smaller learning rate.

Evaluate and Iterate: Evaluate the perfor­mance of your model on a validation set and iterate on the archit­ecture or hyperp­ara­meters if necessary.
Transfer Learning Techniques
Feature Extrac­tion: Extract high-level features from the pre-tr­ained model and add new layers for task-s­pecific predic­tions.

Fine-tuning: Fine-tune the pre-tr­ained model's weights by updating them during training with a smaller learning rate.
Data Augmen­tation
Apply data augmen­tation techniques such as rotation, transl­ation, scaling, flipping, or cropping to increase the diversity of your training data.

Data augmen­tation helps prevent overfi­tting and improves genera­liz­ation.
Domain Adaptation
Domain adaptation is a form of transfer learning where the source and target domains differ, requiring adjust­ments to make the model generalize well.

Techniques like advers­arial training, self-t­rai­ning, or domain­-sp­ecific fine-t­uning can be used for domain adapta­tion.
Choosing Layers for Transfer
Earlier layers in a pre-tr­ained model learn low-level features like edges and textures, while later layers learn high-level features.

For small datasets, it's often beneficial to use earlier layers for transfer, as they capture more general features.
Size of Training Data
The size of the new dataset influences the amount of transfer learning required.

With limited data, it's crucial to rely more on the pre-tr­ained weights and perform minimal fine-t­uning to avoid overfi­tting.
Transfer Learning in Different Domains
Transfer learning is applicable across various domains, including computer vision, natural language proces­sing, audio proces­sing, and more.

The choice of pre-tr­ained models and the techniques used may vary based on the specific domain.
Avoiding Negative Transfer
Negative transfer occurs when the knowledge from the source task hinders the perfor­mance on the target task.

It can be mitigated by selecting a source task that is related or has shared underlying patterns with the target task.
Model Evaluation
Evaluate the perfor­mance of the transfer learning model using approp­riate metrics for your specific task, such as accuracy, precision, recall, F1-score, or mean squared error.

Reinfo­rcement Learning Cheat Sheet

Reinfo­rcement Learning Basics
Reinforcement learning (RL) is a branch of machine learning where an agent learns to interact with an environment to maximize a cumulative reward signal.

The agent learns through a trial-­and­-error process, taking actions and receiving feedback from the enviro­nment.
Key Components
Agent: The learner or decisi­on-­maker that interacts with the enviro­nment.

Environment: The external system with which the agent interacts.

State: The current repres­ent­ation of the enviro­nment at a particular time step.

Action: The decision or choice made by the agent based on the state.

Reward: The feedback signal that the agent receives from the enviro­nment after taking an action.
Markov Decision Process (MDP)
An MDP provides a mathematical framework for modeling RL problems with states, actions, rewards, and state transitions.

It assumes the Markov property, where the next state depends only on the current state and action, not on the earlier history.
Value Function
The value function estimates the expected return or cumulative reward an agent will receive from a particular state or state-­action pair.

Value functions can be repres­ented as state-­value functions (V(s)) or action­-value functions (Q(s, a)).
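
The "expected return" that value functions estimate is built from discounted returns like the one below. The discount factor `gamma` and the reward sequence are assumptions; the cheat sheet does not fix a particular discounting scheme.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward k steps ahead is weighted
    by gamma ** k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 1 + 0.5 * 1 + 0.25 * 1 = 1.75
g0 = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

V(s) is the average of such returns over trajectories starting in state s, and Q(s, a) is the same average for trajectories that start by taking action a in s.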
Policy
The policy determines the agent's behavior, mapping states to actions.

It can be determ­inistic or stocha­stic, providing the agent's action selection strategy.
Explor­ation vs. Exploi­tation
Explor­ation refers to the agent's search for new actions or states to gather more inform­ation about the enviro­nment.

Exploitation refers to the agent choosing the actions that its current knowledge estimates will yield the highest reward.
Temporal Difference (TD) Learning
TD learning updates value estimates using the TD error: the difference between the current estimate and a target formed from the observed reward plus the discounted estimate of the next state's value.

Q-learning and SARSA are popular TD learning algori­thms.
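
A tabular Q-learning update can be sketched as follows. The state names, step size `alpha`, and discount `gamma` are hypothetical. SARSA differs only in the target: it uses the value of the action actually taken in the next state instead of the maximum.

```python
def q_learning_update(q, state, action, reward, next_state,
                      alpha=0.5, gamma=0.9):
    """Move Q(state, action) toward reward + gamma * max Q(next_state, .)."""
    best_next = max(q[next_state].values())
    td_target = reward + gamma * best_next
    td_error = td_target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error

# Hypothetical two-state Q-table
q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
err = q_learning_update(q, "s0", "right", reward=1.0, next_state="s1")
```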
Policy Gradient Methods
Policy gradient methods directly optimize the policy by updating its parameters based on the gradients of expected rewards.

They use techniques like REINFORCE, Proximal Policy Optimi­zation (PPO), and Trust Region Policy Optimi­zation (TRPO).
Explor­ation Techniques
Epsilon-Greedy: Selects a random action with probability epsilon to encourage exploration, and the best-known action otherwise.

Upper Confidence Bound (UCB): Balances explor­ation and exploi­tation using an optimistic value estimate.

Thompson Sampling: Selects actions based on random samples from the posterior distri­bution of action values.
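
Of the three techniques, epsilon-greedy is the simplest to sketch. The action values and the fixed random seed below are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

rng = random.Random(0)
choices = [epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1, rng=rng)
           for _ in range(1000)]
```

With epsilon = 0.1 the agent still picks action 1 (the highest estimate) in the large majority of steps, but occasionally tries the others.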
Deep Reinfo­rcement Learning (DRL)
DRL combines RL with deep neural networks to handle high-d­ime­nsional state spaces and complex tasks.

Deep Q-Networks (DQN) and Deep Determ­inistic Policy Gradient (DDPG) are popular DRL algori­thms.
Off-Policy vs. On-Policy
Off-policy methods learn about a target policy using data collected by a different behavior policy (for example, Q-learning).

On-policy methods learn about the same policy that the agent uses to interact with the environment (for example, SARSA).
Model-­Based vs. Model-Free
Model-­based methods learn a model of the enviro­nment to plan and make decisions.

Model-free methods directly learn the optimal policy or value function without explicitly modeling the enviro­nment dynamics.

Time Series Foreca­sting Cheat Sheet

Time Series Basics
Time series data is a sequence of observ­ations collected over time, typically at regular intervals.

It exhibits temporal depend­encies, trends, season­ality, and may contain noise.
Statio­narity
Stationary time series have constant mean, variance, and autoco­var­iance over time.

Many classical forecasting models, such as ARMA, assume stationarity, so non-stationary series are often differenced or transformed first.
Trends and Season­ality
Trend refers to the long-term upward or downward movement in a time series.

Seasonality refers to patterns that repeat at fixed intervals.

Identifying and handling trends and season­ality is important for accurate foreca­sting.
Autoco­rre­lation Function (ACF) and Partial Autoco­rre­lation Function (PACF)
ACF measures the correl­ation between a time series and its lagged values.

PACF measures the correlation between a time series and its lagged values after removing the effect of the intermediate lags.

They help identify the order of autoregressive (AR) and moving average (MA) components in time series models.
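
A sketch of the sample ACF at a single lag, using the usual estimator that divides by the lag-0 sum of squares. The alternating series is a made-up example chosen so the values are easy to verify by hand.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return num / denom

series = [1.0, -1.0] * 10  # alternates every step
acf1 = autocorrelation(series, 1)  # strongly negative
acf2 = autocorrelation(series, 2)  # strongly positive
```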
Time Series Models
Autoregressive Integrated Moving Average (ARIMA): A linear model that combines AR and MA components with differencing (the integrated part) so that it can handle non-stationary time series.

Seasonal ARIMA (SARIMA): Extends ARIMA to handle seasonal time series data.

Exponential Smoothing Methods: Models that assign exponentially decreasing weights to past observations.

Prophet: An additive regression model that captures trend, season­ality, and holiday effects.

Vector Autore­gre­ssion (VAR): A multiv­ariate time series model that captures the relati­onships between variables.
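
Of the models listed above, simple exponential smoothing is small enough to sketch without a library. This assumes the basic form with no trend or seasonal terms, initialized with the first observation; `alpha` is the smoothing factor.

```python
def simple_exponential_smoothing(series, alpha):
    """Each smoothed value blends the new observation with the previous
    level, so older observations get exponentially smaller weights."""
    level = series[0]
    smoothed = [level]
    for x in series[1:]:
        level = alpha * x + (1.0 - alpha) * level
        smoothed.append(level)
    return smoothed

smoothed = simple_exponential_smoothing([10.0, 20.0, 30.0], alpha=0.5)
# [10.0, 15.0, 22.5]
```

The last smoothed value serves as the one-step-ahead forecast in this basic form.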
Machine Learning for Time Series
Regression Models: Linear regres­sion, random forest, support vector machines (SVM), or gradient boosting algorithms can be used with approp­riate feature engine­ering.

Long Short-Term Memory (LSTM) Networks: A type of recurrent neural network (RNN) suitable for modeling sequential data.

Convolutional Neural Networks (CNN): Can be applied to time series data using 1D convolutions along the time axis, or by encoding the series as an image.
Feature Engine­ering
Lagged Variables: Include lagged versions of the target variable or other relevant variables as features.

Rolling Statis­tics: Compute rolling mean, standard deviation, or other statistics over a window of observ­ations.

Seasonal Features: Extract features repres­enting day of the week, month, or other seasonal patterns.

Fourier Transform: Convert time series data to frequency domain to identify periodic compon­ents.
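
Lagged variables and rolling statistics can be sketched in plain Python (in practice a dataframe library would usually do this). Positions without enough history are filled with `None`; the `sales` values are made up.

```python
def lag_feature(series, lag):
    """Value of the series `lag` steps earlier, None where unavailable."""
    return [None] * lag + series[:len(series) - lag]

def rolling_mean(series, window):
    """Mean of the last `window` values up to and including each step."""
    out = []
    for t in range(len(series)):
        if t + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = series[t + 1 - window:t + 1]
            out.append(sum(chunk) / window)
    return out

sales = [10, 12, 14, 16, 18]
lag_1 = lag_feature(sales, 1)       # [None, 10, 12, 14, 16]
roll_3 = rolling_mean(sales, 3)     # [None, None, 12.0, 14.0, 16.0]
```

When the rolling statistic is used to predict the value at the same step, it should itself be lagged so that it does not leak the target.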
Validation and Evaluation Metrics
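Train-Validation-Test Split: Split the time series chronologically into training, validation, and test sets, without shuffling, so that the model is never evaluated on data that precedes its training data.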

Evaluation Metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and symmetric MAPE (sMAPE) are commonly used.
Cross-­Val­idation for Time Series
Time Series Cross-­Val­ida­tion: Use rolling window or expanding window techniques to simulate the real-time foreca­sting scenario.
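
An expanding-window scheme can be sketched as index splits in which every test block comes strictly after its training data. The function name and parameters are hypothetical; scikit-learn's `TimeSeriesSplit` offers a comparable splitter.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Return (train_indices, test_indices) pairs with a growing train set."""
    splits = []
    for k in range(n_splits):
        test_end = n_samples - (n_splits - 1 - k) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train = list(range(0, test_start))
        test = list(range(test_start, test_end))
        splits.append((train, test))
    return splits

splits = expanding_window_splits(n_samples=10, n_splits=3, test_size=2)
# train [0..3] -> test [4, 5]; train [0..5] -> test [6, 7]; train [0..7] -> test [8, 9]
```

A rolling window variant would also drop the oldest training indices so the training set keeps a fixed length.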
Ensemble Methods
Combine forecasts from multiple models or model configurations to improve accuracy and robustness.

Examples include model averaging, weighted averaging, and stacking.
Outliers and Anomalies
Identify and handle outliers and anomalies to prevent their influence on the foreca­sting process.

Techniques include moving averages, median filtering, or statis­tical tests.
Handling Missing Data
Imputation Techni­ques: Use interp­ola­tion, mean imputa­tion, or model-­based imputation to fill missing values.

Hyperp­ara­meter Tuning Cheat Sheet

What are Hyperp­ara­meters?
Hyperp­ara­meters are config­uration settings that are not learned from the data but are set before the training process.

They control the behavior and perfor­mance of machine learning models.
Hyperp­ara­meter Tuning Techni­ques:
Grid Search: Exhaustively evaluates every combination of hyperparameter values from a predefined grid.

Random Search: Randomly samples hyperp­ara­meters from predefined ranges, allowing more efficient explor­ation.

Bayesian Optimi­zation: Uses prior knowledge and statis­tical methods to intell­igently search the hyperp­ara­meter space.

Genetic Algori­thms: Mimics natural selection to evolve a population of hyperp­ara­meter config­ura­tions over multiple iterat­ions.

Automated Hyperp­ara­meter Tuning Libraries: Tools like Optuna, Hyperopt, or scikit­-le­arn's GridSe­archCV and Random­ize­dSe­archCV can automate the hyperp­ara­meter tuning process.
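
Grid search and random search can be sketched without any tuning library. The `validation_loss` function is a hypothetical stand-in for training a model and scoring it on validation data; the hyperparameter names and ranges are assumptions.

```python
import itertools
import random

def validation_loss(learning_rate, num_layers):
    """Hypothetical objective; a real one would train and evaluate a model."""
    return (learning_rate - 0.01) ** 2 * 1e4 + (num_layers - 3) ** 2

def grid_search(grid):
    """Evaluate every combination of the listed values."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        loss = validation_loss(**params)
        if best is None or loss < best[1]:
            best = (params, loss)
    return best

def random_search(n_trials, rng):
    """Sample configurations at random from the ranges."""
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform sample
            "num_layers": rng.randint(1, 6),
        }
        loss = validation_loss(**params)
        if best is None or loss < best[1]:
            best = (params, loss)
    return best

grid = {"learning_rate": [0.001, 0.01, 0.1], "num_layers": [1, 3, 5]}
best_params, best_loss = grid_search(grid)
rand_params, rand_loss = random_search(50, random.Random(0))
```

Sampling the learning rate on a log scale is a common choice when a hyperparameter spans several orders of magnitude.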
Hyperp­ara­meters to Consider
Learning Rate: Controls the step size during model training.

Number of Hidden Units/­Layers: Determines the complexity and capacity of neural networks.

Regularization Parame­ters: Control the trade-off between model complexity and overfi­tting.

Batch Size: Determines the number of samples processed before updating model weights.

Dropout Rate: Probab­ility of dropping out units during training to prevent overfi­tting.

Activation Functions: Choices like sigmoid, tanh, ReLU, or Leaky ReLU impact the model's non-li­nea­rity.

Optimizer: Algorithms like stochastic gradient descent (SGD), Adam, or RMSprop that update model weights during training.

Number of Trees and Tree Depth: Parameters for ensemble methods like Random Forest or Gradient Boosting models.

Kernel Type and Parame­ters: For models like Support Vector Machines (SVM) that use kernel functions.
Define Hyperp­ara­meter Ranges
Establish reasonable ranges for each hyperp­ara­meter based on prior knowledge, litera­ture, or experi­men­tation.

Consider the scale and distri­bution of values (linear, logari­thmic) that make sense for each hyperp­ara­meter.
Sequential vs. Parallel Tuning
Sequential tuning explores hyperp­ara­meter combin­ations one by one, allowing feedback from each trial to inform the next.

Parallel tuning performs multiple hyperp­ara­meter evalua­tions simult­ane­ously, making efficient use of comput­ational resources.
Evaluate and Compare Models
Define an evaluation metric (e.g., accuracy, F1-score, mean squared error) that reflects the perfor­mance of interest.

Keep a record of the perfor­mance for each hyperp­ara­meter config­uration to compare the models later.
Cross-­Val­idation
Use techniques like k-fold cross-­val­idation to estimate the genera­liz­ation perfor­mance of different hyperp­ara­meter config­ura­tions.

Avoid tuning hyperp­ara­meters on the test set to prevent overfi­tting and biased perfor­mance estima­tion.
Early Stopping
Monitor a validation metric during training and stop early if it fails to improve for a set number of epochs (the patience).

Prevents overfi­tting and saves comput­ational resources.
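
The stopping rule can be sketched over a list of per-epoch validation losses. The `patience` parameter (how many non-improving epochs to tolerate) and the loss values are assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch  # stop; keep the best checkpoint
    return len(val_losses) - 1, best_epoch

stop_epoch, best_epoch = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8])
# training stops at epoch 4; the best weights were from epoch 2
```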
Feature Selection and Dimens­ion­ality Reduction
Consider using techniques like feature selection or dimens­ion­ality reduction algorithms (e.g., PCA) as part of hyperp­ara­meter tuning.

They can influence model perfor­mance and help improve effici­ency.
Domain Knowledge
Leverage domain knowledge to guide the selection of hyperp­ara­meters.

Prior knowledge can help narrow down the search space and focus on hyperp­ara­meters likely to have a signif­icant impact.
Regularization Hyperparameters
Treat the strength of regularization, such as the L1 or L2 penalty coefficient, as a hyperparameter to tune; the penalty itself applies to the model's weights.

A well-chosen regularization strength helps control model complexity and prevent overfitting.
Docume­ntation and Reprod­uci­bility
Keep a record of the hyperp­ara­meter config­ura­tions, evaluation metrics, and other relevant details for reprod­uci­bility.

Document the lessons learned and insights gained during the hyperp­ara­meter tuning process.

Model Evaluation and Metrics Cheat Sheet

Confusion Matrix
A table that summarizes the perfor­mance of a classi­fic­ation model.

It shows the counts of true positives, true negatives, false positives, and false negatives.
Accuracy
The proportion of correct predic­tions over the total number of predic­tions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
The proportion of true positive predic­tions over the total number of positive predic­tions.

Precision = TP / (TP + FP)
Recall (Sensi­tivity or True Positive Rate)
The proportion of true positive predic­tions over the total number of actual positives.

Recall = TP / (TP + FN)
Specif­icity (True Negative Rate)
The proportion of true negative predic­tions over the total number of actual negatives.

Specificity = TN / (TN + FP)
F1-Score
The harmonic mean of precision and recall.

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
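
The confusion-matrix metrics defined so far can be computed together from the four counts. The counts in the example are made up, and returning 0.0 for an undefined ratio is an assumption (libraries differ on this).

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, specificity, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```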
Receiver Operating Charac­ter­istic (ROC) Curve
A plot of the true positive rate (sensi­tivity) against the false positive rate (1 - specif­icity) at various classi­fic­ation thresh­olds.

It illust­rates the trade-off between sensit­ivity and specif­icity.
Area Under the ROC Curve (AUC-ROC)
A measure of the overall perfor­mance of a binary classi­fic­ation model.

AUC-ROC ranges from 0 to 1, with higher values indicating better performance; 0.5 corresponds to random guessing.
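
AUC-ROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise definition directly (ties count as half); it is quadratic in the number of samples, so it is meant for illustration, not for large datasets. The labels and scores are made up.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
```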
Mean Squared Error (MSE)
The average of the squared differ­ences between predicted and actual values.

MSE = (1/n) * Σ(y_pred - y_actu­al)^2
Root Mean Squared Error (RMSE)
The square root of the mean squared error.

RMSE = √(MSE)
Mean Absolute Error (MAE)
The average of the absolute differ­ences between predicted and actual values.

MAE = (1/n) * Σ|y_pred - y_actual|
R-squared (Coeff­icient of Determ­ina­tion)
A measure of how well the regression model fits the data.

R-squared typically ranges from 0 to 1, with higher values indicating a better fit; it can be negative when the model fits worse than always predicting the mean.
Mean Absolute Percentage Error (MAPE)
The average absolute percentage difference between predicted and actual values.

MAPE = (1/n) * Σ(|y_pred - y_actual| / |y_actual|) * 100
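
The regression metrics above can be computed in one pass. The sketch assumes no actual value is zero (MAPE is undefined there) and that the actual values are not all identical (R-squared is undefined there); the sample data is made up.

```python
import math

def regression_metrics(y_actual, y_pred):
    """MSE, RMSE, MAE, MAPE (in percent), and R-squared."""
    n = len(y_actual)
    errors = [p - a for a, p in zip(y_actual, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, y_actual)) / n * 100
    mean_actual = sum(y_actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in y_actual)
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "mape": mape,
        "r2": 1.0 - ss_res / ss_tot,
    }

metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
```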
Cross-­Val­idation
A technique to assess the performance of a model on unseen data by splitting the data into multiple folds.

It helps estimate the model's generalization performance and mitigate issues like overfitting.
Bias-V­ariance Trade-off
Bias refers to the error introduced by approx­imating a real-world problem with a simplified model.

Variance refers to the model's sensit­ivity to fluctu­ations in the training data.

Balancing bias and variance is crucial for building models that generalize well.
Overfi­tting and Underf­itting
Overfi­tting occurs when a model performs well on training data but poorly on unseen data.

Underfitting occurs when a model is too simple to capture the underlying patterns in the data.

Regularization techniques and proper model complexity selection can help address these issues.
Feature Importance
Techniques like feature importance scores, permut­ation import­ance, or SHAP values help identify the most influe­ntial features in a model.
Model Selection
Compare and select models based on evaluation metrics, cross-­val­idation results, and domain­-sp­ecific consid­era­tions.

Avoid selecting models solely based on a single metric without consid­ering the context.