
easymlr


easymlr is an open-source library designed to perform exploratory data analysis (EDA), help with missing data imputation, and fit baseline models. It also assists with feature selection, a common problem in data science and machine learning analyses. As its name indicates, the package makes these tasks easy, from EDA through to choosing a model for a machine learning workflow.

Installation

Package installation in R:

The development version can be installed from GitHub.

install.packages("devtools")
devtools::install_github("UBC-MDS/easymlr")

Features

This package introduces data science enthusiasts with little to no knowledge of machine learning to the common steps of a supervised learning analysis. The package contains four functions, each of which takes in data and performs one task: `eda_analysis`, `miss_data`, `baseline_fun` and `feature_select`. All functions can be used on a dataset with numerical features. Each function has its own required and optional arguments, some of which have default values.

You can also find function descriptions and use cases in the package vignettes.
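To illustrate what mean imputation involves, here is a minimal base R sketch. It is not the implementation behind `miss_data`; `impute_mean` and the toy data frame are hypothetical names used only for illustration.

```r
# Hypothetical helper, not part of easymlr: replace NA values in each
# numeric column with the mean of that column's observed values.
impute_mean <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Toy data with one missing value per column.
toy <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))
imputed <- impute_mean(toy)
print(imputed)
```

Non-numeric columns are left untouched in this sketch, which matches the package's stated focus on numerical features.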

Documentation

Usage Example

This is a basic example of how to use the easymlr package.

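library(easymlr)
library(MASS)  # provides the Boston housing data set

data <- Boston
X <- data[, 1:13]               # the 13 predictor columns
Y <- as.data.frame(data[, 14])  # response (medv) as a data frame
y <- data[, 14]                 # response (medv) as a numeric vector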

eda_results <- eda_analysis(X)

# SAMPLE OUTPUT
#> eda_analysis(X)
#$data_head
#        crim zn indus chas   nox    rm   age    dis rad tax ptratio  black lstat
#415 45.74610  0 18.10    0 0.693 4.519 100.0 1.6582  24 666    20.2  88.27 36.98
#463  6.65492  0 18.10    0 0.713 6.317  83.0 2.7344  24 666    20.2 396.90 13.99
#179  0.06642  0  4.05    0 0.510 6.860  74.4 2.9153   5 296    16.6 391.27  6.92
#14   0.62976  0  8.14    0 0.538 5.949  61.8 4.7075   4 307    21.0 396.90  8.26
#195  0.01439 60  2.93    0 0.401 6.604  18.8 6.2196   1 265    15.6 376.70  4.38
#426 15.86030  0 18.10    0 0.679 5.896  95.4 1.9096  24 666    20.2   7.68 24.39

imputed_data <- miss_data(data, data, 'mean')

# SAMPLE OUTPUT
# imputed_data[[1]][1,]
#     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
#1 0.25915 33  9.69    1 0.538 6.208 77.7 3.1992   5 330    19.1 391.43 11.38 21.2

baseline_model <- baseline_fun(X, Y, type = 'regression')
# SAMPLE OUTPUT - LINEAR REGRESSION MODEL SUMMARY
# baseline_model
#  RMSE     Rsquared   MAE     
#  4.86746  0.7202505  3.424126

best_features <- feature_select(X, y, threshold=0.05)
#> best_features
#[1] "lstat"   "black"   "ptratio"
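For readers who want to see where metrics such as RMSE, Rsquared and MAE come from, the following base R sketch fits a linear regression to the Boston data and computes them in-sample. This is an assumption-laden illustration, not the code behind `baseline_fun`: we do not know how that function splits or resamples the data, so the numbers need not match the sample output above.

```r
library(MASS)  # provides the Boston data set

# Fit an ordinary least-squares baseline using all 13 predictors.
fit <- lm(medv ~ ., data = Boston)
pred <- predict(fit, newdata = Boston)
actual <- Boston$medv

# In-sample error metrics.
rmse <- sqrt(mean((actual - pred)^2))  # root mean squared error
mae <- mean(abs(actual - pred))        # mean absolute error
rsq <- cor(actual, pred)^2             # squared correlation of fitted vs. observed

print(round(c(RMSE = rmse, Rsquared = rsq, MAE = mae), 3))
```

Because these metrics are computed on the same rows used for fitting, they are optimistic; a held-out split or cross-validation gives a fairer baseline.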

Relevance to the R Ecosystem

Several packages in the R ecosystem help with missing data imputation, such as Hmisc, missForest, Amelia and mice. Others, such as caret, help with feature selection. However, to our knowledge, there is no general-purpose library in the R ecosystem that performs all of the above tasks.

Dependencies

| Package   | Minimum Supported Version |
| --------- | ------------------------- |
| tidyverse | 0.8.3                     |
| rlist     | 0.4.6.1                   |
| testthat  | 2.3.1                     |
| MASS      | 7.3                       |
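If you want to check your local library against the minimum versions in the table above, a short sketch along these lines may help. The helper `meets_minimum` is hypothetical and not part of easymlr; the version strings are copied from the table as-is.

```r
# Minimum versions copied from the dependency table.
required <- c(tidyverse = "0.8.3", rlist = "0.4.6.1",
              testthat = "2.3.1", MASS = "7.3")

# Hypothetical helper: TRUE if `pkg` is installed at or above `min_version`.
meets_minimum <- function(pkg, min_version) {
  installed <- nzchar(system.file(package = pkg))
  installed && utils::packageVersion(pkg) >= min_version
}

# Named logical vector: one entry per required package.
status <- vapply(names(required),
                 function(pkg) meets_minimum(pkg, required[[pkg]]),
                 logical(1))
print(status)
```

Any package reported as `FALSE` is either missing or older than the listed minimum and can be updated with `install.packages()`.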

Contributors

Development leaders:

Ifeanyi Anene
Lara Habashy
Sakshi Jain
Zhenrui Yu

Code of conduct

Please adhere to the code of conduct if you would like to contribute to this project.

Contributors

We welcome and recognize all contributions. You can see a list of current contributors in the contributors tab. If you would like to contribute, please read our contributing guidelines and familiarize yourself with the GitHub flow workflow.



UBC-MDS/easymlr documentation built on March 22, 2021, 1:46 p.m.