TidyplusR

Xinbin Huang, Akshi Chaudhary, Tian Qian 2018-03-17

Introduction

The tidyplusR package is a data cleaning package with features for missing value treatment, data manipulation, and displaying data as markdown tables for documents. It adds functionality on top of the existing data wrangling packages in R. The objective of the package is to provide a few specific functions that solve some of the pressing issues in data cleaning.

Contributors:

Installation

You can install tidyplusR from GitHub using the following:

# install.packages("devtools")
devtools::install_github("UBC-MDS/tidyplusR")

Functions included:

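tidyplusR has three main parts, each with its own functions: datatype cleansing (typemix and cleanmix), missing value imputation (impute), and markdown tables (md_new and md_data).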

Examples

The basic examples below show how to solve common data cleaning problems:

Datatype cleansing

This section has two functions, typemix and cleanmix.

library(tidyplusR)


dat <- data.frame(x1 = c(1, 2, 3, "1.2.3"),
                  x2 = c("test", "test", 1, TRUE),
                  x3 = c(TRUE, TRUE, FALSE, FALSE))
# Input data with mixed datatypes
dat
##      x1   x2    x3
## 1     1 test  TRUE
## 2     2 test  TRUE
## 3     3    1 FALSE
## 4 1.2.3 TRUE FALSE
# Identify the datatypes, then keep numbers in column 1 and characters in column 2;
# in this example, values of any other type become NA
tidyplusR::cleanmix(typemix(dat), column = c(1, 2), type = c("number", "character"))
##     x1   x2    x3
## 1    1 test  TRUE
## 2    2 test  TRUE
## 3    3 <NA> FALSE
## 4 <NA> <NA> FALSE
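
To make the idea behind the example above concrete, here is a minimal base R sketch of type identification and cleaning on a single vector. It is not the tidyplusR implementation: the helpers classify_value and keep_type are hypothetical names introduced only for illustration, and the rule used (a value is a number if as.numeric can parse it, a logical if it reads TRUE or FALSE, and a character otherwise) is an assumption.

```r
# Hypothetical helper, not part of tidyplusR: label each value as
# "number", "logical", or "character" based on its text form.
classify_value <- function(x) {
  x <- as.character(x)
  # as.numeric() warns on non-numeric strings, so silence the warning
  is_num <- !is.na(suppressWarnings(as.numeric(x)))
  is_lgl <- x %in% c("TRUE", "FALSE")
  ifelse(is.na(x), NA_character_,
         ifelse(is_num, "number",
                ifelse(is_lgl, "logical", "character")))
}

# Hypothetical helper: keep values of the wanted type, set the rest to NA.
keep_type <- function(x, type) {
  x <- as.character(x)
  x[classify_value(x) != type] <- NA
  x
}

x1 <- c("1", "2", "3", "1.2.3")
x2 <- c("test", "test", "1", "TRUE")

classify_value(x1)          # "number" "number" "number" "character"
classify_value(x2)          # "character" "character" "number" "logical"
keep_type(x1, "number")     # "1" "2" "3" NA
keep_type(x2, "character")  # "test" "test" NA NA
```

The two cleaned vectors have NA in the same positions as columns x1 and x2 in the cleanmix output above.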

Missing value imputation

# Dummy data frame (no seed is set, so the sampled values differ between runs)
dat <- data.frame(x = sample(letters[1:3], 20, TRUE),
                  y = sample(letters[1:3], 20, TRUE),
                  w = as.numeric(sample(0:50, 20, TRUE)),
                  z = sample(letters[1:3], 20, TRUE),
                  b = as.logical(sample(0:1, 20, TRUE)),
                  a = sample(0:100, 20, TRUE),
                  stringsAsFactors = FALSE)

# Introduce missing values into every column
dat[c(5, 10, 15), 1] <- NA
dat[c(3, 7), 2] <- NA
dat[c(1, 3, 5), 3] <- NA
dat[c(4, 5, 9), 4] <- NA
dat[c(4, 5, 9), 5] <- NA
dat[, 4] <- factor(dat[, 4])
dat[c(4, 5, 9), 6] <- NA
# Input data with missing values
dat
##       x    y  w    z     b  a
## 1     b    b NA    a FALSE 13
## 2     c    a 34    a FALSE 41
## 3     c <NA> NA    a FALSE 56
## 4     a    b 31 <NA>    NA NA
## 5  <NA>    a NA <NA>    NA NA
## 6     c    b  0    b FALSE 35
## 7     c <NA> 42    c  TRUE 19
## 8     b    a 24    b FALSE 48
## 9     c    a 26 <NA>    NA NA
## 10 <NA>    b 49    a FALSE 40
## 11    b    b 20    c  TRUE 98
## 12    a    a 15    b FALSE 42
## 13    c    b 50    c  TRUE 87
## 14    b    c 43    a  TRUE 27
## 15 <NA>    b 29    b FALSE 12
## 16    b    b 30    b FALSE 52
## 17    a    c  7    c FALSE  5
## 18    c    a  5    c  TRUE 39
## 19    b    b  2    c  TRUE 88
## 20    c    a 13    c FALSE 38
# Call impute(); missing values are replaced using method = "mode"
tidyplusR::impute(dat, method = "mode")  # method can also be median or mean
##    x y     w z     b     a
## 1  b b 27.78 a FALSE 13.00
## 2  c a 34.00 a FALSE 41.00
## 3  c b 27.78 a FALSE 56.00
## 4  a b 31.00 c FALSE 40.77
## 5  c a 27.78 c FALSE 40.77
## 6  c b  0.00 b FALSE 35.00
## 7  c b 42.00 c  TRUE 19.00
## 8  b a 24.00 b FALSE 48.00
## 9  c a 26.00 c FALSE 40.77
## 10 c b 49.00 a FALSE 40.00
## 11 b b 20.00 c  TRUE 98.00
## 12 a a 15.00 b FALSE 42.00
## 13 c b 50.00 c  TRUE 87.00
## 14 b c 43.00 a  TRUE 27.00
## 15 c b 29.00 b FALSE 12.00
## 16 b b 30.00 b FALSE 52.00
## 17 a c  7.00 c FALSE  5.00
## 18 c a  5.00 c  TRUE 39.00
## 19 b b  2.00 c  TRUE 88.00
## 20 c a 13.00 c FALSE 38.00
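
As a rough illustration of imputation by a summary statistic, the base R sketch below fills missing values column by column. It is not the tidyplusR implementation: impute_column and impute_frame are hypothetical helpers, and "mode" is taken here to mean the most frequent observed value. Note that the output above fills the numeric columns w and a with 27.78 and 40.77, values that do not occur in the input, so the package evidently computes the mode of numeric columns differently from this simple rule.

```r
# Hypothetical helper, not tidyplusR's impute(): fill NA values in one column.
# "mode" is assumed to mean the most frequent observed value.
impute_column <- function(x, method = c("mode", "median", "mean")) {
  method <- match.arg(method)
  # mean and median only make sense for numeric data; fall back to mode
  if (!is.numeric(x)) method <- "mode"
  observed <- x[!is.na(x)]
  if (length(observed) == 0) return(x)  # nothing to impute from
  fill <- switch(method,
    mean   = mean(observed),
    median = median(observed),
    mode   = {
      counts <- table(observed)
      # names(counts) are strings, so look the winner up in the observed values
      observed[match(names(counts)[which.max(counts)], as.character(observed))]
    }
  )
  x[is.na(x)] <- fill
  x
}

# Hypothetical helper: apply impute_column() to every column of a data frame.
impute_frame <- function(df, method = "mode") {
  df[] <- lapply(df, impute_column, method = method)
  df
}

toy <- data.frame(x = c("a", "b", NA, "b"),
                  w = c(10, NA, 20, 60),
                  stringsAsFactors = FALSE)

impute_frame(toy, method = "mode")$x      # "a" "b" "b" "b"
impute_column(toy$w, method = "mean")     # 10 30 20 60
impute_column(toy$w, method = "median")   # 10 20 20 60
```

Falling back to the mode for non-numeric columns is a design choice of this sketch, so that a single method argument can be applied to a data frame with mixed column types.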

Markdown table

# default: ncol = 2, nrow = 2, align = "l"
md_new()
## 
## |    |    |
## |:---|:---|
## |    |    |
## |    |    |
# 3 by 3 table
md_new(nrow = 3, ncol = 3)
## 
## |    |    |    |
## |:---|:---|:---|
## |    |    |    |
## |    |    |    |
## |    |    |    |
# different alignments: centered, then right-aligned
md_new(nrow = 1, align = "c")
## 
## |    |    |
## |:--:|:--:|
## |    |    |
md_new(nrow = 1, align = "r")
## 
## |    |    |
## |---:|---:|
## |    |    |
# provide a header
h <- c("foo", "boo")
md_new(header = h)
## 
## | foo| boo|
## |:---|:---|
## |    |    |
## |    |    |
# markdown table from a data frame: rows 1 to 3 and columns 1 to 4 of mtcars
md_data(mtcars, row.index = 1:3, col.index = 1:4)
## 
## |    |mpg|cyl|disp|hp|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
# align to the right
md_data(mtcars, row.index = 1:3, col.index = 1:4, align = "r")
## 
## |    |mpg|cyl|disp|hp|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
# provide a header
md_data(mtcars, row.index = 1:3, col.index = 1:4, header = c("a","b","c","d"))
## 
## |    |a|b|c|d|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
# exclude row names
md_data(mtcars, row.index = 1:3, col.index = 1:4, row.names = FALSE)
## 
## |mpg|cyl|disp|hp|
## |---:|---:|---:|---:|
## |21|6|160|110|
## |21|6|160|110|
## |22.8|4|108|93|
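
The layout of these tables (a header row, a separator row that encodes the alignment, then the body rows) can be sketched in a few lines of base R. The helper md_skeleton below is hypothetical and is not how md_new is implemented; it only assumes the same argument names (nrow, ncol, align, header) and the four-character cell padding seen in the output above.

```r
# Hypothetical helper, not tidyplusR's md_new(): build an empty markdown
# table as a character vector with one element per line.
md_skeleton <- function(nrow = 2, ncol = 2, align = "l", header = NULL) {
  if (is.null(header)) header <- rep("", ncol)
  stopifnot(length(header) == ncol, align %in% c("l", "c", "r"))
  # Separator cell that encodes the column alignment
  rule <- switch(align, l = ":---", c = ":--:", r = "---:")
  # Join cells with pipes, left-padding each cell to four characters
  row_line <- function(cells) {
    paste0("|", paste(sprintf("%4s", cells), collapse = "|"), "|")
  }
  c(row_line(header),
    row_line(rep(rule, ncol)),
    rep(row_line(rep("", ncol)), nrow))
}

writeLines(md_skeleton(nrow = 1, header = c("foo", "boo")))
# | foo| boo|
# |:---|:---|
# |    |    |

md_skeleton(nrow = 1, align = "r")[2]
# "|---:|---:|"
```

Returning a character vector rather than printing directly keeps the sketch easy to test and lets the caller decide whether to print the lines or write them to a file.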

Existing features in the R ecosystem similar to tidyplusR

License

MIT

Contributing

This is an open source project. Please follow the guidelines below for contribution.

- Open an issue for any feedback and suggestions.
- For contributing to the project, please refer to Contributing for details.



UBC-MDS/tidyplusR documentation built on May 25, 2019, 1:36 p.m.