Xinbin Huang, Akshi Chaudhary, Tian Qian 2018-03-17
The tidyplusR package is a data cleaning package with features for missing value treatment, data manipulation, and displaying data as a markdown table for documents. The package adds functionality on top of the existing data wrangling packages in R. Its objective is to provide a few specific functions that solve some of the pressing issues in data cleaning.
- Akshi Chaudhary: akshi8
- Tina Qian: TinaQian2017
- Xinbin Huang: xinbinhuang

You can install tidyplusR from GitHub using the following:
# install.packages("devtools")
devtools::install_github("UBC-MDS/tidyplusR")
tidyplusR provides functions in three areas:

Data Manipulation: Datatype cleansing
- typemix: takes a data frame and reports the location and types of data in the columns where types are mixed.
- cleanmix: takes the result of the typemix function, the name of the column (or a vector of column names) to clean, the type of data to work on, and whether to keep or delete that type. The output is a data frame like the original, but with the specified data type deleted from the chosen columns.

Missing Value Treatment: Basic imputation using impute
- Imputation: replace missing values in one or more columns of a dataframe, based on the method of imputation:
  - (method = "mean") replace using the mean
  - (method = "median") replace using the median
  - (method = "mode") replace using the mode

Markdown Table:
- md_new(): This function creates a bare-bones skeleton for generating a markdown table. The alignment and size of the table can be specified by the user.
- md_data(): This function converts a dataframe or matrix into a markdown table format.

This is a basic example which shows you how to solve a common problem:
This section has two functions, typemix and cleanmix.
The input for the typemix function is a data frame, and the output is a list of 3 data frames. The first is the same as the input data frame; the second gives the location and types of data in the columns that contain a mixture of types; the third is a summary of the second.
The input for the cleanmix function is the result from the typemix function, the column(s) you want to work on, the type(s) of data you want to keep or delete, and whether to keep or delete the specified instances.
library(tidyplusR)
dat <- data.frame(x1 = c(1, 2, 3, "1.2.3"),
                  x2 = c("test", "test", 1, TRUE),
                  x3 = c(TRUE, TRUE, FALSE, FALSE))
# Input data with mixed datatypes
dat
## x1 x2 x3
## 1 1 test TRUE
## 2 2 test TRUE
## 3 3 1 FALSE
## 4 1.2.3 TRUE FALSE
# Clean columns 1 and 2: values that do not match the given type
# (number for x1, character for x2) become NA
tidyplusR::cleanmix(typemix(dat), column = c(1, 2), type = c("number", "character"))
## x1 x2 x3
## 1 1 test TRUE
## 2 2 test TRUE
## 3 3 <NA> FALSE
## 4 <NA> <NA> FALSE
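To make the idea of a "type mixture" concrete, here is a minimal base R sketch of how values stored as text might be labelled by apparent type and then filtered. The helper `classify_values` is hypothetical and is not part of tidyplusR; it only illustrates the concept and makes no claim about how typemix or cleanmix are implemented.

```r
# Hypothetical helper (NOT part of tidyplusR): label each value of a
# vector as "number", "logical", "character", or NA.
classify_values <- function(x) {
  x <- as.character(x)
  out <- rep("character", length(x))
  # Values that parse cleanly as numbers
  is_num <- !is.na(suppressWarnings(as.numeric(x)))
  out[is_num] <- "number"
  # Literal TRUE/FALSE strings
  out[x %in% c("TRUE", "FALSE")] <- "logical"
  # Missing values stay missing
  out[is.na(x)] <- NA_character_
  out
}

x2 <- c("test", "test", "1", "TRUE")
types <- classify_values(x2)
print(types)

# Keep only the "character" values and set everything else to NA
cleaned <- ifelse(types == "character", x2, NA)
print(cleaned)
```

In this sketch the mixed column keeps its overall character type, and only the values of unwanted types are blanked out.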
The impute function takes a dataframe as an input for missing value treatment using mean/median/mode.
### Dummy dataframe
dat <- data.frame(x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
w=as.numeric(sample(0:50,20,TRUE)),
z=sample(letters[1:3],20,TRUE),
b = as.logical(sample(0:1,20,TRUE)),
a=sample(0:100,20,TRUE),
stringsAsFactors=FALSE)
dat[c(5,10,15),1] <- NA
dat[c(3,7),2] <- NA
dat[c(1,3,5),3] <- NA
dat[c(4,5,9),4] <- NA
dat[c(4,5,9),5] <- NA
dat[,4] <- factor(dat[,4] )
dat[c(4,5,9),6] <- NA
#Input data with missing values
dat
## x y w z b a
## 1 b b NA a FALSE 13
## 2 c a 34 a FALSE 41
## 3 c <NA> NA a FALSE 56
## 4 a b 31 <NA> NA NA
## 5 <NA> a NA <NA> NA NA
## 6 c b 0 b FALSE 35
## 7 c <NA> 42 c TRUE 19
## 8 b a 24 b FALSE 48
## 9 c a 26 <NA> NA NA
## 10 <NA> b 49 a FALSE 40
## 11 b b 20 c TRUE 98
## 12 a a 15 b FALSE 42
## 13 c b 50 c TRUE 87
## 14 b c 43 a TRUE 27
## 15 <NA> b 29 b FALSE 12
## 16 b b 30 b FALSE 52
## 17 a c 7 c FALSE 5
## 18 c a 5 c TRUE 39
## 19 b b 2 c TRUE 88
## 20 c a 13 c FALSE 38
#### Calling impute function
# Missing values replaced with method = "mode"
tidyplusR::impute(dat, method = "mode") # method can also be "median" or "mean"
## x y w z b a
## 1 b b 27.78 a FALSE 13.00
## 2 c a 34.00 a FALSE 41.00
## 3 c b 27.78 a FALSE 56.00
## 4 a b 31.00 c FALSE 40.77
## 5 c a 27.78 c FALSE 40.77
## 6 c b 0.00 b FALSE 35.00
## 7 c b 42.00 c TRUE 19.00
## 8 b a 24.00 b FALSE 48.00
## 9 c a 26.00 c FALSE 40.77
## 10 c b 49.00 a FALSE 40.00
## 11 b b 20.00 c TRUE 98.00
## 12 a a 15.00 b FALSE 42.00
## 13 c b 50.00 c TRUE 87.00
## 14 b c 43.00 a TRUE 27.00
## 15 c b 29.00 b FALSE 12.00
## 16 b b 30.00 b FALSE 52.00
## 17 a c 7.00 c FALSE 5.00
## 18 c a 5.00 c TRUE 39.00
## 19 b b 2.00 c TRUE 88.00
## 20 c a 13.00 c FALSE 38.00
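For readers who want to see what single-value imputation amounts to, the following is a minimal base R sketch for one column. The helpers `stat_mode` and `impute_column` are hypothetical names introduced here for illustration; they are not tidyplusR functions, and the package may compute its replacement values differently (the numbers above need not match this sketch).

```r
# Hypothetical helpers (NOT tidyplusR's implementation).

# Most frequent non-missing value; ties go to the value seen first.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Replace the NAs in a single vector with its mean, median, or mode.
impute_column <- function(x, method = c("mean", "median", "mode")) {
  method <- match.arg(method)
  fill <- switch(method,
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    mode   = stat_mode(x)
  )
  x[is.na(x)] <- fill
  x
}

w <- c(NA, 10, NA, 20, 60)
by_mean   <- impute_column(w, "mean")
by_median <- impute_column(w, "median")
by_mode   <- impute_column(c("b", "c", NA, "c"), "mode")
print(by_mean)
print(by_median)
print(by_mode)
```

Mean and median only make sense for numeric columns, so in this sketch the mode is the only option that also works for character data.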
md_new() can create an empty markdown table by specifying the number of columns and number of rows.
## default: ncol = 2 and nrow = 2, align = "l"
md_new()
##
## | | |
## |:---|:---|
## | | |
## | | |
## 3 by 3 table
md_new(nrow = 3, ncol = 3)
##
## | | | |
## |:---|:---|:---|
## | | | |
## | | | |
## | | | |
## different alignments:
md_new(nrow = 1, align = "c")
##
## | | |
## |:--:|:--:|
## | | |
md_new(nrow = 1, align = "r")
##
## | | |
## |---:|---:|
## | | |
## providing header
h <- c("foo", "boo")
md_new(header = h)
##
## | foo| boo|
## |:---|:---|
## | | |
## | | |
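The alignment markers in the output above (`:---`, `:--:`, `---:`) are standard markdown table syntax. As a rough illustration of how such a skeleton can be assembled, here is a base R sketch; `empty_md_table` is a hypothetical helper, not md_new() itself, and its argument names are only chosen to mirror the examples above.

```r
# Hypothetical helper (NOT tidyplusR's md_new): return the lines of an
# empty markdown table as a character vector.
empty_md_table <- function(nrow = 2, ncol = 2, align = "l") {
  # Map the alignment code to the markdown rule cell
  rule <- switch(align,
    l = ":---",
    c = ":--:",
    r = "---:",
    stop("align must be 'l', 'c', or 'r'")
  )
  make_row <- function(cells) paste0("|", paste(cells, collapse = "|"), "|")
  blank <- make_row(rep("    ", ncol))
  # Empty header row, alignment rule, then nrow empty body rows
  c(blank, make_row(rep(rule, ncol)), rep(blank, nrow))
}

centered <- empty_md_table(nrow = 1, align = "c")
writeLines(centered)
```

Returning a character vector rather than printing directly keeps the sketch easy to test and to write to a file with writeLines().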
md_data() can create a markdown table from a matrix or data frame.
md_data(mtcars, row.index = 1:3, col.index = 1:4)
##
## | |mpg|cyl|disp|hp|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
## alignment to right
md_data(mtcars, row.index = 1:3, col.index = 1:4, align = "r")
##
## | |mpg|cyl|disp|hp|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
## provide header
md_data(mtcars, row.index = 1:3, col.index = 1:4, header = c("a","b","c","d"))
##
## | |a|b|c|d|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
## do not include row names
md_data(mtcars, row.index = 1:3, col.index = 1:4, row.names = FALSE)
##
## |mpg|cyl|disp|hp|
## |---:|---:|---:|---:|
## |21|6|160|110|
## |21|6|160|110|
## |22.8|4|108|93|
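The conversion itself is mostly string pasting. The sketch below shows one way to turn a data frame into markdown table lines in base R, without row names. `df_to_md` is a hypothetical helper written for this illustration; it is not md_data() and does not support its arguments such as row.index or header.

```r
# Hypothetical helper (NOT tidyplusR's md_data): convert a data frame
# to markdown table lines, ignoring row names.
df_to_md <- function(df) {
  make_row <- function(cells) paste0("|", paste(cells, collapse = "|"), "|")
  # Right-align numeric columns, left-align everything else
  rule <- ifelse(vapply(df, is.numeric, logical(1)), "---:", ":---")
  # Convert every column to text, then paste the cells row by row
  cells <- lapply(df, as.character)
  body <- vapply(
    seq_len(nrow(df)),
    function(i) make_row(vapply(cells, `[`, character(1), i)),
    character(1)
  )
  c(make_row(names(df)), make_row(rule), body)
}

md <- df_to_md(mtcars[1:3, 1:4])
writeLines(md)
```

Choosing the alignment from the column type is an assumption of this sketch, made to mirror the right-aligned numeric columns in the output above.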
tidyplusR compared with existing R packages:

Data Manipulation
- dplyr and the tidyverse provide very powerful data wrangling tools, but with tidyplusR the user can explicitly perform string processing and datatype conversion without affecting the overall column type (which is convenient for messy data that mixes strings and numbers).

Missing Value treatment
- The EM algorithm is a very efficient and accurate method for missing value treatment. The MICE package in R does provide limited imputation using mean, mode, etc.

Markdown Table
- kable can output a dataset in the form of a markdown table, but with tidyplusR the user will have more freedom with data types and formatting.

This is an open source project. Please follow the guidelines below for contribution.
- Open an issue for any feedback and suggestions.
- For contributing to the project, please refer to Contributing for details.