knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
easymlr is an open-source library designed to perform exploratory data analysis, to help with missing data imputation and to give baseline models. Also, it assists in feature selection which is a common problem when undertaking a data science or machine learning analysis. As it's name indicates, this function performs all task from eda to helping in choosing model for machine learning process very easily.
Package installation in R:
The development version can be found in Github.
install.packages("devtools") devtools::install_github("UBC-MDS/easymlr")
This package introduces a data science enthusiast, with little to no knowledge of machine learning, to the common steps required when undertaking a Supervised learning analysis. The package contains four functions that taken in data and perform a task. All functions can be used on a dataset with numerical features. Each function has their own required arguements and optional arguments, with some default parameters.
eda_analysis: The eda_analysis
function will split the original data into train and test dataset and will generate a statistical report such as correlation between the variables, number of missing data, class imbalance and type of data present in the dataset.
miss_data: The miss_data
function will impute missing data in the data frame using one of mean or median methods.
baseline_fun: The baseline_fun
function will give users a quick check of the performance of the selected sklearn models compared to a baseline model, upon which the model can be improved.
feature_select: The feature_select
function will remove redundant features using a forward selection iterative algorithm.
You can also find function descriptions and their use cases in package vignettes.
This is a basic example demonstration of how users may use the easymlr
package.
library(easymlr) library(MASS) attach(Boston) data <- Boston X <- data [,0:13] Y <- as.data.frame(data [,14]) y <- data [,14] eda_results <- eda_analysis(X) # SAMPLE OUTPUT #> eda_analysis(X) #$data_head # crim zn indus chas nox rm age dis rad tax ptratio black lstat #415 45.74610 0 18.10 0 0.693 4.519 100.0 1.6582 24 666 20.2 88.27 36.98 #463 6.65492 0 18.10 0 0.713 6.317 83.0 2.7344 24 666 20.2 396.90 13.99 #179 0.06642 0 4.05 0 0.510 6.860 74.4 2.9153 5 296 16.6 391.27 6.92 #14 0.62976 0 8.14 0 0.538 5.949 61.8 4.7075 4 307 21.0 396.90 8.26 #195 0.01439 60 2.93 0 0.401 6.604 18.8 6.2196 1 265 15.6 376.70 4.38 #426 15.86030 0 18.10 0 0.679 5.896 95.4 1.9096 24 666 20.2 7.68 24.39 imputed_data <- miss_data(data, data, 'mean') # SAMPLE OUTPUT # imputed_df[[1]][1,] # crim zn indus chas nox rm age dis rad tax ptratio black lstat medv #1 0.25915 33 9.69 1 0.538 6.208 77.7 3.1992 5 330 19.1 391.43 11.38 21.2 baseline_model <- baseline_fun(X, Y, type = 'regression') # SAMPLE OUTPUT - LINEAR REGRESSION MODEL SUMMARU # baseline_model # RMSE Rsquared MAE # 4.86746 0.7202505 3.424126 best_features <- feature_select(X, y, threshold=0.05) #> best_features #[1] "lstat" "black" "ptratio"
There are some packages that help with missing data imputation in the R-ecosystem such as Hmisc, missForest, Amelia and mice. There are also some packages that help feature selection such as CARET package. However, to our knowledge, there is no general-purpose library for performing the above task in the R ecosystem.
| Package | Minimum Supported Version |
| ------------------------------------------------------------------------- | ------------------------- |
| tidyverse | 0.8.3 |
| rlist | 0.4.6.1 |
| testthat | 2.3.1 |
| MASS | 7.3 |
Development leaders:
Ifeanyi Anene Lara Habashy Sakshi Jain Zhenrui Yu
Please ensure to adhere to the code of conduct if you would like to contrtibute to this project.
We welcome and recognize all contributions. You can see a list of current contributors in the contributors tab. If you would like to contribute, please view our contributing guidelines and get familiar with the Github flow workflow.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.