This is a brief cheat sheet for the machine learning process pipeline. It is by no means elaborate; it simply lists, in a structured way, the concepts I learned so I can get familiar with them. The content is from, and only from, the Practical Machine Learning course of the Data Science specialization on Coursera, so the credit goes to the awesome instructors and the relevant sources.
Most machine learning algorithms and related concepts were created by different people, so to use them you normally need the relevant individual packages. To make the job easier, a unifying framework called caret has been created. This package has almost everything you need for machine learning.
The important aspects of machine learning, in order of importance, are:
Question > Data > Features > Algorithms
Everything on the right-hand side depends heavily on the things to its left. Giving much thought to the earlier parts makes the later parts easy and effective.
Any data set contains signal and noise. The goal of machine learning, and of defining predictors, is to capture the signal while ignoring the noise.
The first thing to do is to split the data into a number of subsets.
Related to splitting there is a concept called cross validation: the training set is itself split into sub-training/sub-test sets, and the results are averaged, before the model is applied to the real test set. It helps in picking the variables to include, the type of prediction function and its parameters, and in comparing different predictors, while giving an estimate of the out-of-sample error. In caret, createDataPartition is used to split data.
Common types of cross validation:
Random subsampling / bootstrapping: createResample is used.
K-fold: createFolds is used.
Time slices (for time-series data): createTimeSlices is used for specific needs.
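A minimal splitting sketch, using the spam data from the kernlab package as an example data set:

```r
library(caret)
library(kernlab)
data(spam)

# 75% of the rows go to training, the rest to testing
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# 10-fold cross validation folds and 10 bootstrap resamples of the training set
folds     <- createFolds(y = training$type, k = 10, list = TRUE, returnTrain = TRUE)
resamples <- createResample(y = training$type, times = 10, list = TRUE)
```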
Exploratory analysis is a very important step in understanding the data and its features. It shows major trends or patterns in the data without much hassle: imbalance in outcomes/predictors, outliers, skewed variables, and groups of points that are not explained by any predictor.
Use summary, str (structure) and so on to understand basic data set properties, and featurePlot to get an idea of the top-level patterns. Then use qplot or ggplot to look further at interesting variables from the feature plot; make use of colour, regression smoothers (geom_smooth), etc. to analyse them better. cut2 in the Hmisc package is useful for cutting continuous variables into groups; tables, jittered points, density plots, etc. help too.
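A minimal exploratory sketch, using the Wage data from the ISLR package as an example data set:

```r
library(caret); library(ISLR); library(ggplot2); library(Hmisc)
data(Wage)

summary(Wage)   # basic summaries of every column
str(Wage)       # structure: variable types and sample values

# top-level patterns between a few predictors and the outcome
featurePlot(x = Wage[, c("age", "education", "jobclass")],
            y = Wage$wage, plot = "pairs")

# dig into one interesting variable with colour and a regression smoother
qplot(age, wage, colour = education, data = Wage) +
  geom_smooth(method = "lm")

# cut a continuous variable into groups and tabulate it against a factor
wageGroups <- cut2(Wage$wage, g = 3)
table(wageGroups, Wage$jobclass)
```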
Pre-processing

Most of the time pre-processing is done to make predictions more accurate. Mainly, pre-processing is a mechanism for cleaning the data and also for compressing it without losing features.

Standardizing

When certain variables show very high variability, standardizing (subtracting the mean and dividing by the standard deviation) helps to reduce it.
caret has methods and parameters to tweak how standardization is done; center and scale are the two main method arguments used.
The Box-Cox transform is a standardizing method that normalizes continuous data using maximum likelihood. Imputing missing data is also part of pre-processing; k-nearest neighbours is the most commonly used technique, available as knnImpute in caret.
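A minimal pre-processing sketch, again using the kernlab spam data as an example (column 58 is the outcome):

```r
library(caret); library(kernlab)
data(spam)

inTrain  <- createDataPartition(spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# centre, scale and Box-Cox transform the predictors; the transformation
# estimated on the training set is then applied to the test set
preObj   <- preProcess(training[, -58], method = c("center", "scale", "BoxCox"))
trainStd <- predict(preObj, training[, -58])
testStd  <- predict(preObj, testing[, -58])

# k-nearest-neighbour imputation for missing values
preImp   <- preProcess(training[, -58], method = "knnImpute")
trainImp <- predict(preImp, training[, -58])
```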
Covariate Creation
At times it helps to make new covariates to get better predictions. Raw data is used to create covariates, and at times tidy covariates are transformed into new variables as well.
Factor variables are converted into new indicator (dummy) variables using dummyVars.
Removing covariates with (near) zero variance is done with nearZeroVar.
A spline basis is another method of covariate creation; bs from the splines package, or the gam method in caret, can be used.
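A minimal covariate-creation sketch, using the ISLR Wage data as an example:

```r
library(caret); library(ISLR); library(splines)
data(Wage)

# turn a factor variable into indicator (dummy) variables
dummies <- dummyVars(wage ~ jobclass, data = Wage)
head(predict(dummies, newdata = Wage))

# identify covariates with (near) zero variance that can be dropped
nsv <- nearZeroVar(Wage, saveMetrics = TRUE)
nsv[nsv$nzv, ]

# spline basis for age: allows curved (polynomial-like) fits
bsBasis <- bs(Wage$age, df = 3)
head(bsBasis)
```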
Pre-processing with Principal Component Analysis
A weighted combination of predictors is used to make a better prediction model. This greatly helps with data compression too.
Correlated predictors are found with cor.
Principal Component Analysis (PCA): the principal components are the right singular vectors from the Singular Value Decomposition (SVD) of the scaled data. prcomp from base R, or method="pca" in caret's preProcess, can be used.
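A minimal PCA sketch on the kernlab spam data (column 58 is the outcome):

```r
library(caret); library(kernlab)
data(spam)

inTrain  <- createDataPartition(spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# find pairs of strongly correlated predictors
M <- abs(cor(training[, -58])); diag(M) <- 0
which(M > 0.8, arr.ind = TRUE)

# PCA via preProcess, keeping two components, then fit a model on them
preProc  <- preProcess(training[, -58], method = "pca", pcaComp = 2)
trainPC  <- predict(preProc, training[, -58])
modelFit <- train(x = trainPC, y = training$type, method = "glm")

# apply the SAME rotation to the test set before predicting
testPC <- predict(preProc, testing[, -58])
confusionMatrix(testing$type, predict(modelFit, testPC))
```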
All these pre-processing steps can also be configured in the train function of caret. In addition, the trControl parameter can be set (via trainControl) to control resampling.
Resampling methods such as bootstrapping (boot), cross validation (cv), repeated cross validation (repeatedcv) or leave-one-out cross validation (LOOCV) can be set there.
Further metric options can be passed to the train function call: it takes RMSE and Rsquared for continuous outcomes, and Accuracy and Kappa for categorical outcomes.
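A minimal sketch of configuring train, using the kernlab spam data as an example:

```r
library(caret); library(kernlab)
data(spam)

inTrain  <- createDataPartition(spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# 10-fold cross validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 10)

# metric = "Accuracy" or "Kappa" for factor outcomes,
# "RMSE" or "Rsquared" for numeric outcomes
modelFit <- train(type ~ ., data = training, method = "glm",
                  trControl = ctrl, metric = "Accuracy")
modelFit$finalModel
```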
Prediction models or algorithms should be interpretable, simple, accurate, fast to train and test, and scalable.
Various prediction models
Predicting with linear regression: method is set to lm.
Predicting with trees: split the data iteratively into groups and evaluate the homogeneity within each group. method is set to rpart.
The measures of impurity in this model are:
a. Misclassification Error
b. Gini index
c. Deviance/information gain.
To visually plot the tree, plot or fancyRpartPlot from the rattle package can be used.
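A minimal tree sketch, using the iris data as an example:

```r
library(caret); library(rattle)
data(iris)

inTrain  <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

modFit <- train(Species ~ ., data = training, method = "rpart")
print(modFit$finalModel)

# nicer plot of the fitted tree
fancyRpartPlot(modFit$finalModel)

predict(modFit, newdata = testing)
```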
Bootstrap aggregating (bagging): fit complex models to resampled versions of the data and average them to get a more balanced (lower-variance) and still accurate model. For example, a bagged loess fit gives a curve similar to a spline fit.
In caret, method = "bagEarth", "treebag" or "bagFDA", or your own custom bagging method, can be used.
Random forests: grow many trees on bootstrapped samples, randomly restricting the variables considered at each split, then vote or average the trees. method is set to rf.
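A minimal sketch of bagged trees and a random forest, using the iris data as an example (treebag needs the ipred package, rf the randomForest package):

```r
library(caret)
data(iris)

inTrain  <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# bagged classification trees
bagFit <- train(Species ~ ., data = training, method = "treebag")

# random forest; prox = TRUE keeps the proximity matrix
rfFit <- train(Species ~ ., data = training, method = "rf", prox = TRUE)

confusionMatrix(testing$Species, predict(rfFit, testing))
```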
Boosting: take many (possibly weak) predictors, weight them and add them up to get a stronger one. Like bagging and random forests it is an ensembling idea. A number of implementations can be used in R:
gbm: boosting with trees
mboost: model based boosting
ada: statistical boosting based on additive logistic regression
gamBoost: boosting generalized additive models
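A minimal boosting sketch with gbm, using the ISLR Wage data as an example:

```r
library(caret); library(ISLR); library(ggplot2)
data(Wage)
Wage <- subset(Wage, select = -c(logwage))  # drop the near-duplicate outcome

inTrain  <- createDataPartition(Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]
testing  <- Wage[-inTrain, ]

# gradient boosted trees via the gbm package
modFit <- train(wage ~ ., data = training, method = "gbm", verbose = FALSE)

# predicted vs observed wages on the test set
qplot(predict(modFit, testing), wage, data = testing)
```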
Model based prediction: assume the data follow a probabilistic model and use Bayes' theorem to identify the optimal classifier, as in linear discriminant analysis (lda) or naive Bayes (nb).
Regularized regression: fit a linear or generalized linear regression and then penalize (shrink) large coefficients. Lasso regularized regression can also help with model selection, since it can shrink some coefficients exactly to zero.
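A minimal lasso sketch via the glmnet method in caret, using the mtcars data as an example (the course also mentions lasso and ridge methods from other packages):

```r
library(caret)
data(mtcars)

# alpha = 1 gives the lasso penalty (alpha = 0 would be ridge);
# train() evaluates the grid of lambda values by resampling
lassoFit <- train(mpg ~ ., data = mtcars, method = "glmnet",
                  tuneGrid = expand.grid(alpha = 1,
                                         lambda = seq(0.01, 1, length = 20)))

# coefficients are shrunk toward (and some exactly to) zero
coef(lassoFit$finalModel, s = lassoFit$bestTune$lambda)
```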
Combining predictors: bagging, boosting and random forests combine similar classifiers, while model stacking and model ensembling combine different classifiers. This usually improves accuracy, at some cost in interpretability.
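A minimal model-stacking sketch on the ISLR Wage data: fit two models, then fit a combiner on their predictions.

```r
library(caret); library(ISLR)
data(Wage)
Wage <- subset(Wage, select = -c(logwage))

# build/validation split, then train/test split inside the build set
inBuild    <- createDataPartition(Wage$wage, p = 0.7, list = FALSE)
validation <- Wage[-inBuild, ]; buildData <- Wage[inBuild, ]
inTrain    <- createDataPartition(buildData$wage, p = 0.7, list = FALSE)
training   <- buildData[inTrain, ]; testing <- buildData[-inTrain, ]

mod1 <- train(wage ~ ., data = training, method = "glm")
mod2 <- train(wage ~ ., data = training, method = "rf",
              trControl = trainControl(method = "cv", number = 3))

# stack: fit a new model on the two sets of test-set predictions
pred1 <- predict(mod1, testing); pred2 <- predict(mod2, testing)
combDF  <- data.frame(pred1, pred2, wage = testing$wage)
combFit <- train(wage ~ ., data = combDF, method = "gam")

# evaluate the stacked model on the held-out validation set
combPred <- predict(combFit,
                    data.frame(pred1 = predict(mod1, validation),
                               pred2 = predict(mod2, validation)))
sqrt(mean((combPred - validation$wage)^2))  # RMSE of the combined model
```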
Forecasting and time series prediction can be complicated since the data are dependent over time. Test/train splitting (which must respect time order), correlation and extrapolation have to be handled carefully.
Unsupervised prediction: create and name clusters, then build predictors from the clusters. k-means clustering is mainly used; cl_predict from the clue package can be used to assign new data to existing clusters.
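A minimal unsupervised-prediction sketch on the iris data, ignoring the species labels while clustering:

```r
library(caret); library(clue)
data(iris)

inTrain  <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# cluster the training data without using the labels, then name the clusters
kMeans1 <- kmeans(subset(training, select = -c(Species)), centers = 3)
training$clusters <- as.factor(kMeans1$cluster)

# build a predictor for the cluster labels
modFit <- train(clusters ~ ., data = subset(training, select = -c(Species)),
                method = "rpart")

# or assign new observations to the existing clusters directly
testClusters <- cl_predict(kMeans1, newdata = subset(testing, select = -c(Species)))
table(testClusters, testing$Species)
```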
confusionMatrix is widely used to compare models and to quantify accuracy, sensitivity and specificity.
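A minimal sketch, reusing the iris tree model from above:

```r
library(caret)
data(iris)

inTrain  <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

modFit <- train(Species ~ ., data = training, method = "rpart")

# accuracy, kappa, and per-class sensitivity/specificity on the test set
confusionMatrix(predict(modFit, testing), testing$Species)
```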
There are a number of error definitions that can be used. Common error measures are mean squared error (MSE), root mean squared error (RMSE) and median absolute deviation for continuous outcomes, and sensitivity, specificity and accuracy for categorical outcomes.
Also keep apart the In Sample Error (resubstitution error), the error rate you get on the same data set you used to build the predictor, and the Out of Sample Error (generalization error), the error rate you get on new data sets; the out of sample error is the one that matters, and it is almost always larger because of overfitting.
ROC (Receiver Operating Characteristic) curves describe how good a prediction algorithm is. The Area Under the Curve (AUC) of the false-positive-rate vs true-positive-rate curve summarizes it: 1 is a perfect classifier, 0.5 is no better than random guessing, and above 0.8 is generally a good classifier.
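A minimal ROC/AUC sketch using the pROC package (the package choice is an assumption; the course only discusses the concept), on the kernlab spam data:

```r
library(caret); library(kernlab); library(pROC)
data(spam)

inTrain  <- createDataPartition(spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

modFit <- train(type ~ ., data = training, method = "glm")

# predicted probability of the "spam" class on the test set
probs <- predict(modFit, testing, type = "prob")[, "spam"]

rocObj <- roc(response = testing$type, predictor = probs)
auc(rocObj)    # 1 = perfect classifier, ~0.5 = random guessing
plot(rocObj)
```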