tidyplusR: An extension on tidyverse to perform specified tasks

TidyplusR

Xinbin Huang, Akshi Chaudhary, Tian Qian 2018-03-17

Introduction

The tidyplusR package is a data cleaning package with features for missing value treatment, data manipulation, and displaying data as markdown tables in documents. It adds a few functions on top of R's existing data wrangling packages. The objective is to provide a small set of specific functions that solve some pressing issues in data cleaning.

Contributors:

Akshi Chaudhary : akshi8
Tina Qian : TinaQian2017
Xinbin Huang: xinbinhuang

Installation

You can install tidyplusR from GitHub using the following:

# install.packages("devtools")
devtools::install_github("UBC-MDS/tidyplusR")

Functions included:

The functions in `tidyplusR` fall into three main parts:

Data Manipulation: Datatype cleansing
typemix
- Finds the columns that contain more than one type of data, such as character and numeric. The input is a data frame, and the output is a list of 3 data frames.
cleanmix
- Cleans a data frame. Once you know where the data types disagree, use this function to keep or delete one type of data in the chosen columns.
- The input is the output of typemix, the column(s) to clean (given as a vector), the type(s) of data to act on, and whether to keep or delete that type. The output is a data frame like the original, but with the unwanted type of data removed from the chosen columns.
Missing Value Treatment: Basic imputation using impute
- Imputation: replaces missing values in one or more columns of a data frame, based on the chosen method
- (method = "mean") replace using the mean
- (method = "median") replace using the median
- (method = "mode") replace using the mode
Markdown Table:
md_new(): Creates a bare-bones markdown table. The alignment and size of the table can be set by the user.
- Input: the size of the table (number of rows and number of columns)
- Output: a character vector of the markdown source.
md_data(): Converts a data frame or matrix into a markdown table.
- Input: a matrix or data frame
- Output: a character vector of the markdown source.

Examples

This is a basic example which shows you how to solve a common problem:

Datatype cleansing

This section uses two functions, typemix and cleanmix.

The input for typemix is a data frame, and the output is a list of 3 data frames. The first is the same as the input data frame. The second gives the location and type of each value in the columns that mix types. The third is a summary of the second.
The input for cleanmix is the result of typemix, the column(s) you want to work on, the type(s) of data you want to keep or delete, and whether the specified instances should be kept or deleted.

library(tidyplusR)


# Input data with mixed datatypes
dat <- data.frame(x1 = c(1, 2, 3, "1.2.3"),
                  x2 = c("test", "test", 1, TRUE),
                  x3 = c(TRUE, TRUE, FALSE, FALSE))
dat

##      x1   x2    x3
## 1     1 test  TRUE
## 2     2 test  TRUE
## 3     3    1 FALSE
## 4 1.2.3 TRUE FALSE

# Identify mixed types, then keep numbers in column 1 and characters in column 2;
# as the output below shows, values of any other type become NA
tidyplusR::cleanmix(typemix(dat), column = c(1, 2), type = c("number", "character"))

##     x1   x2    x3
## 1    1 test  TRUE
## 2    2 test  TRUE
## 3    3 <NA> FALSE
## 4 <NA> <NA> FALSE
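
To make the idea of "keeping one type within a column" concrete, the following is a minimal base R sketch. It is not tidyplusR's implementation: the helpers `classify_value()` and `keep_type()` are hypothetical names introduced here, and the rule for what counts as a "number" (anything `as.numeric()` can parse) is an assumption.

```r
# Hypothetical helper (not part of tidyplusR): label each value of a
# column by the type it looks like when read as text.
classify_value <- function(x) {
  x <- as.character(x)
  out <- rep("character", length(x))
  # as.numeric() warns on non-numeric strings; we only need the NA it returns
  is_num <- !is.na(suppressWarnings(as.numeric(x)))
  is_lgl <- x %in% c("TRUE", "FALSE")
  out[is_num] <- "number"
  out[is_lgl] <- "logical"
  out[is.na(x)] <- NA_character_
  out
}

# Hypothetical helper: keep values of one type and set the rest to NA.
keep_type <- function(x, type) {
  x <- as.character(x)
  # which() drops NA comparisons, so existing NAs are left untouched
  x[which(classify_value(x) != type)] <- NA
  x
}

x1 <- c("1", "2", "3", "1.2.3")
x2 <- c("test", "test", "1", "TRUE")

classify_value(x2)          # "character" "character" "number" "logical"
keep_type(x1, "number")     # "1" "2" "3" NA
keep_type(x2, "character")  # "test" "test" NA NA
```

The column stays a character vector throughout, which mirrors the point of the example above: individual values are removed without converting the whole column.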

Missing Value imputation

The impute function takes a data frame as input and replaces its missing values using the mean, median, or mode.

### Dummy dataframe
dat <- data.frame(x=sample(letters[1:3],20,TRUE), 
                  y=sample(letters[1:3],20,TRUE),
                  w=as.numeric(sample(0:50,20,TRUE)),
                  z=sample(letters[1:3],20,TRUE), 
                  b = as.logical(sample(0:1,20,TRUE)),
                  a=sample(0:100,20,TRUE),
                  stringsAsFactors=FALSE)

dat[c(5,10,15),1] <- NA
dat[c(3,7),2] <- NA
dat[c(1,3,5),3] <- NA
dat[c(4,5,9),4] <- NA
dat[c(4,5,9),5] <- NA
dat[,4] <- factor(dat[,4] )
dat[c(4,5,9),6] <- NA
#Input data with missing values
dat

##       x    y  w    z     b  a
## 1     b    b NA    a FALSE 13
## 2     c    a 34    a FALSE 41
## 3     c <NA> NA    a FALSE 56
## 4     a    b 31 <NA>    NA NA
## 5  <NA>    a NA <NA>    NA NA
## 6     c    b  0    b FALSE 35
## 7     c <NA> 42    c  TRUE 19
## 8     b    a 24    b FALSE 48
## 9     c    a 26 <NA>    NA NA
## 10 <NA>    b 49    a FALSE 40
## 11    b    b 20    c  TRUE 98
## 12    a    a 15    b FALSE 42
## 13    c    b 50    c  TRUE 87
## 14    b    c 43    a  TRUE 27
## 15 <NA>    b 29    b FALSE 12
## 16    b    b 30    b FALSE 52
## 17    a    c  7    c FALSE  5
## 18    c    a  5    c  TRUE 39
## 19    b    b  2    c  TRUE 88
## 20    c    a 13    c FALSE 38

# Call impute to replace missing values with method = "mode"
# (method can also be "median" or "mean")
tidyplusR::impute(dat, method = "mode")

##    x y     w z     b     a
## 1  b b 27.78 a FALSE 13.00
## 2  c a 34.00 a FALSE 41.00
## 3  c b 27.78 a FALSE 56.00
## 4  a b 31.00 c FALSE 40.77
## 5  c a 27.78 c FALSE 40.77
## 6  c b  0.00 b FALSE 35.00
## 7  c b 42.00 c  TRUE 19.00
## 8  b a 24.00 b FALSE 48.00
## 9  c a 26.00 c FALSE 40.77
## 10 c b 49.00 a FALSE 40.00
## 11 b b 20.00 c  TRUE 98.00
## 12 a a 15.00 b FALSE 42.00
## 13 c b 50.00 c  TRUE 87.00
## 14 b c 43.00 a  TRUE 27.00
## 15 c b 29.00 b FALSE 12.00
## 16 b b 30.00 b FALSE 52.00
## 17 a c  7.00 c FALSE  5.00
## 18 c a  5.00 c  TRUE 39.00
## 19 b b  2.00 c  TRUE 88.00
## 20 c a 13.00 c FALSE 38.00
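
For readers who want to see what mean, median, and mode imputation do to a single column, here is a small base R sketch. It is an illustration under stated assumptions, not the code behind `impute()`: `stat_mode()` and `impute_column()` are hypothetical helpers, and ties for the mode are broken by taking the value that appears first.

```r
# Hypothetical helper (not part of tidyplusR): most frequent non-missing value.
# Ties are resolved in favour of the value that appears first.
stat_mode <- function(x) {
  vals <- unique(x[!is.na(x)])
  # tabulate() ignores the NA positions produced by match()
  vals[which.max(tabulate(match(x, vals)))]
}

# Hypothetical helper: fill the NAs of one vector with a summary statistic.
# "mean" and "median" only make sense for numeric vectors.
impute_column <- function(x, method = c("mean", "median", "mode")) {
  method <- match.arg(method)
  fill <- switch(method,
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    mode   = stat_mode(x)
  )
  x[is.na(x)] <- fill
  x
}

w <- c(NA, 30, NA, 10, 20)
x <- c("b", "c", "c", NA, "a")

impute_column(w, "mean")    # 20 30 20 10 20
impute_column(w, "median")  # 20 30 20 10 20
impute_column(x, "mode")    # "b" "c" "c" "c" "a"
```

Applying such a helper column by column, for example with `lapply()`, gives a whole-data-frame version of the same idea.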

Markdown table

md_new() can create an empty markdown table by specifying the number of columns and number of rows.

## default: ncol = 2 and nrow = 2, alignment = "l"
md_new()

## 
## |    |    |
## |:---|:---|
## |    |    |
## |    |    |

## 3 by 3 table
md_new(nrow = 3, ncol = 3)

## 
## |    |    |    |
## |:---|:---|:---|
## |    |    |    |
## |    |    |    |
## |    |    |    |

## different alignments:
md_new(nrow = 1, align = "c")

## 
## |    |    |
## |:--:|:--:|
## |    |    |

md_new(nrow = 1, align = "r")

## 
## |    |    |
## |---:|---:|
## |    |    |

## providing header
h <- c("foo", "boo")
md_new(header = h)

## 
## | foo| boo|
## |:---|:---|
## |    |    |
## |    |    |
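
The outputs above follow a simple pattern: a header row, an alignment row, then empty body rows. The base R sketch below reproduces that pattern to show how the alignment markers work. `empty_md_table()` is a hypothetical helper written for this illustration and is not how `md_new()` is implemented.

```r
# Hypothetical helper (not part of tidyplusR): build the lines of an empty
# markdown table with a blank header row.
empty_md_table <- function(nrow = 2, ncol = 2, align = "l") {
  # Markdown encodes alignment with colons in the separator row
  rule <- switch(align,
    l = ":---",
    c = ":--:",
    r = "---:",
    stop("align must be 'l', 'c' or 'r'")
  )
  make_row <- function(cell) {
    paste0("|", paste(rep(cell, ncol), collapse = "|"), "|")
  }
  c(make_row("    "), make_row(rule), rep(make_row("    "), nrow))
}

# One body row, centred columns
writeLines(empty_md_table(nrow = 1, align = "c"))
```

Returning a character vector with one element per line keeps the result easy to print with `writeLines()` or to paste into a document.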

md_data() can create a markdown table from a matrix or data frame.

md_data(mtcars, row.index = 1:3, col.index = 1:4)

## 
## |    |mpg|cyl|disp|hp|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|

## alignment to right
md_data(mtcars, row.index = 1:3, col.index = 1:4, align = "r")

## 
## |    |mpg|cyl|disp|hp|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|

## provide header
md_data(mtcars, row.index = 1:3, col.index = 1:4, header = c("a","b","c","d"))

## 
## |    |a|b|c|d|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|

## exclude row names
md_data(mtcars, row.index = 1:3, col.index = 1:4, row.names = FALSE)

## 
## |mpg|cyl|disp|hp|
## |---:|---:|---:|---:|
## |21|6|160|110|
## |21|6|160|110|
## |22.8|4|108|93|
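
As a rough picture of the conversion, the base R sketch below turns a data frame into markdown table lines without row names. It is only an illustration: `df_to_markdown()` is a hypothetical helper, and right-aligning numeric columns while left-aligning everything else is an assumption made here, not a documented rule of `md_data()`.

```r
# Hypothetical helper (not part of tidyplusR): convert a data frame into the
# lines of a markdown table, without row names.
df_to_markdown <- function(df) {
  make_row <- function(cells) paste0("|", paste(cells, collapse = "|"), "|")
  # Assumption: numeric columns are right-aligned, all others left-aligned
  rule <- ifelse(vapply(df, is.numeric, logical(1)), "---:", ":---")
  # Convert every column to text and bind into a character matrix
  cells <- do.call(cbind, lapply(df, as.character))
  body <- vapply(seq_len(nrow(df)),
                 function(i) make_row(cells[i, ]),
                 character(1))
  c(make_row(names(df)), make_row(rule), body)
}

writeLines(df_to_markdown(head(mtcars[, 1:4], 3)))
## |mpg|cyl|disp|hp|
## |---:|---:|---:|---:|
## |21|6|160|110|
## |21|6|160|110|
## |22.8|4|108|93|
```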

Existing features in R ecosystem similar to `tidyplusR`

Data Manipulation
The dplyr and tidyverse libraries have very powerful data wrangling tools, but with tidyplusR users can explicitly perform string processing and datatype conversion without affecting the overall column type. This is convenient when the data is badly mixed, with strings and numbers in the same column.
Missing Value treatment
Base R has no built-in imputation method that uses the EM algorithm, which can be very efficient and accurate. The mice package in R does provide some simple imputation methods, such as imputation by the mean.
Markdown table in R
R has the kable() function from the knitr package, which can output a dataset as a markdown table, but with tidyplusR users have more freedom with data types and formatting.

License

MIT

Contributing

This is an open source project. Please follow the guidelines below for contribution.
- Open an issue for any feedback and suggestions.
- For contributing to the project, please refer to Contributing for details.

UBC-MDS/tidyplusR documentation built on May 25, 2019, 1:36 p.m.
