README.md
In UBC-MDS/easymlr: Easy machine learning in r

easymlr

easymlr is an open-source library designed to perform exploratory data analysis, to help with missing data imputation and to give baseline models. Also, it assists in feature selection which is a common problem when undertaking a data science or machine learning analysis. As it’s name indicates, this function performs all task from eda to helping in choosing model for machine learning process very easily.

Package installation in R:

The development version can be found in Github.

install.packages("devtools")
devtools::install_github("UBC-MDS/easymlr")

This package introduces a data science enthusiast, with little to no knowledge of machine learning, to the common steps required when undertaking a Supervised learning analysis. The package contains four functions that taken in data and perform a task. All functions can be used on a dataset with numerical features. Each function has their own required arguements and optional arguments, with some default parameters.

eda_analysis: The eda_analysis function will split the original data into train and test dataset and will generate a statistical report such as correlation between the variables, number of missing data, class imbalance and type of data present in the dataset.
miss_data: The miss_data function will impute missing data in the data frame using one of mean or median methods.
baseline_fun: The baseline_fun function will give users a quick check of the performance of the selected sklearn models compared to a baseline model, upon which the model can be improved.
feature_select: The feature_select function will remove redundant features using a forward selection iterative algorithm.

You can also find function descriptions and their use cases in package vignettes.

Vignettes

This is a basic example demonstration of how users may use the easymlr package.

library(easymlr)

library(MASS)
attach(Boston)
data <- Boston
X <- data [,0:13]
Y <- as.data.frame(data [,14])
y <- data [,14]

eda_results <- eda_analysis(X)

# SAMPLE OUTPUT
#> eda_analysis(X)
#$data_head
#        crim zn indus chas   nox    rm   age    dis rad tax ptratio  black lstat
#415 45.74610  0 18.10    0 0.693 4.519 100.0 1.6582  24 666    20.2  88.27 36.98
#463  6.65492  0 18.10    0 0.713 6.317  83.0 2.7344  24 666    20.2 396.90 13.99
#179  0.06642  0  4.05    0 0.510 6.860  74.4 2.9153   5 296    16.6 391.27  6.92
#14   0.62976  0  8.14    0 0.538 5.949  61.8 4.7075   4 307    21.0 396.90  8.26
#195  0.01439 60  2.93    0 0.401 6.604  18.8 6.2196   1 265    15.6 376.70  4.38
#426 15.86030  0 18.10    0 0.679 5.896  95.4 1.9096  24 666    20.2   7.68 24.39

imputed_data <- miss_data(data, data, 'mean')

# SAMPLE OUTPUT
# imputed_df[[1]][1,]
#     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
#1 0.25915 33  9.69    1 0.538 6.208 77.7 3.1992   5 330    19.1 391.43 11.38 21.2

baseline_model <- baseline_fun(X, Y, type = 'regression')
# SAMPLE OUTPUT - LINEAR REGRESSION MODEL SUMMARU
# baseline_model
#  RMSE     Rsquared   MAE     
#  4.86746  0.7202505  3.424126

best_features <- feature_select(X, y, threshold=0.05)
#> best_features
#[1] "lstat"   "black"   "ptratio"

There are some packages that help with missing data imputation in the R-ecosystem such as Hmisc, missForest, Amelia and mice. There are also some packages that help feature selection such as CARET package. However, to our knowledge, there is no general-purpose library for performing the above task in the R ecosystem.

R version 4.0.3

| Package | Minimum Supported Version | | ------------------------------------------------------------------------- | ------------------------- | | tidyverse | 0.8.3 | | rlist | 0.4.6.1 | | testthat | 2.3.1 | | MASS | 7.3 |

Development leaders:

Ifeanyi Anene
Lara Habashy
Sakshi Jain
Zhenrui Yu

Please ensure to adhere to the code of conduct if you would like to contrtibute to this project.

We welcome and recognize all contributions. You can see a list of current contributors in the contributors tab. If you would like to contribute, please view our contributing guidelines and get familiar with the Github flow workflow.

UBC-MDS/easymlr documentation built on March 22, 2021, 1:46 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com