README.md

TidyPlusR: a tool for data wrangling

Contributors:

Latest

About

tidyplusR is a data cleaning package for R. It provides missing value treatment, data manipulation, and rendering of data as markdown tables for documents. The package adds functionality on top of R's existing data wrangling packages; its objective is to provide a few specific functions that solve common pain points in data cleaning.

Installation

You can install tidyplusR from GitHub with:

# install.packages("devtools")
devtools::install_github("UBC-MDS/tidyplusR")

Functions included:

tidyplusR groups its functions into three main parts: data type cleansing, missing value imputation, and markdown tables.

Example

The basic examples below show how each part handles a common problem:

Data Type Cleansing

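This part has two functions, typemix and cleanmix. As the output below shows, typemix reports the data type of each cell in columns that mix types, and cleanmix keeps the values of the requested type in the given columns and replaces the rest with NA.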

library(tidyplusR)
dat<-data.frame(x1=c(1,2,3,"1.2.3"),
                x2=c("test","test",1,TRUE),
                x3=c(TRUE,TRUE,FALSE,FALSE))

typemix(dat)
## [[1]]
##      x1   x2    x3
## 1     1 test  TRUE
## 2     2 test  TRUE
## 3     3    1 FALSE
## 4 1.2.3 TRUE FALSE
## 
## [[2]]
##          x1        x2 x3
## 1    number character NA
## 2    number character NA
## 3    number    number NA
## 4 character   logical NA
## 
## [[3]]
##   Column_ID number character logical
## 1         1      3         1       0
## 2         2      1         2       1
cleanmix(typemix(dat),column=c(1,2),type=c("number","character"))
##     x1   x2    x3
## 1    1 test  TRUE
## 2    2 test  TRUE
## 3    3 <NA> FALSE
## 4 <NA> <NA> FALSE
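
To make the idea of per-cell type detection concrete, here is a minimal base R sketch. It is not the tidyplusR implementation: `classify_cell` and `classify_frame` are hypothetical helpers written for illustration, and unlike the typemix output above they label every column, including ones that do not mix types.

```r
# Hypothetical helper (not part of tidyplusR): label one value as
# "number", "logical", or "character" based on its text form.
classify_cell <- function(value) {
  if (is.na(value)) return(NA_character_)
  text <- trimws(as.character(value))
  if (text %in% c("TRUE", "FALSE")) return("logical")
  if (!is.na(suppressWarnings(as.numeric(text)))) return("number")
  "character"
}

# Hypothetical helper: apply classify_cell to every cell of a data frame.
classify_frame <- function(df) {
  labels <- lapply(df, function(column) {
    vapply(as.character(column), classify_cell, character(1), USE.NAMES = FALSE)
  })
  as.data.frame(labels, stringsAsFactors = FALSE)
}

dat <- data.frame(x1 = c(1, 2, 3, "1.2.3"),
                  x2 = c("test", "test", 1, TRUE),
                  x3 = c(TRUE, TRUE, FALSE, FALSE),
                  stringsAsFactors = FALSE)

labels <- classify_frame(dat)
print(labels)

# A column counts as "mixed" when its cells carry more than one label.
mixed <- vapply(labels, function(l) length(unique(l)) > 1, logical(1))
print(names(mixed)[mixed])
```

Under these assumptions, columns x1 and x2 are flagged as mixed and x3 is not, which agrees with the summary table that typemix prints.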

Missing Value imputation

library(tidyverse)
## ── Attaching packages ────────

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ─────────────────
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# Dummy dataframe
dat <- data.frame(x=sample(letters[1:3],20,TRUE), 
                  y=sample(letters[1:3],20,TRUE),
                  w=as.numeric(sample(0:50,20,TRUE)),
                  z=sample(letters[1:3],20,TRUE), 
                  b = as.logical(sample(0:1,20,TRUE)),
                  a=sample(0:100,20,TRUE),
                  stringsAsFactors=FALSE)

dat[c(5,10,15),1] <- NA
dat[c(3,7),2] <- NA
dat[c(1,3,5),3] <- NA
dat[c(4,5,9),4] <- NA
dat[c(4,5,9),5] <- NA
dat[,4] <- factor(dat[,4] )
dat[c(4,5,9),6] <- NA

# Call impute(); method can also be "median" or "mean"

impute(dat,method = "mode") %>% head()
##   x y    w z     b     a
## 1 a b 34.6 a  TRUE 40.00
## 2 b c 33.0 c FALSE  1.00
## 3 c c 34.6 a  TRUE 38.00
## 4 c c 15.0 a FALSE 23.53
## 5 b b 34.6 a FALSE 23.53
## 6 c c 22.0 b FALSE 37.00
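
For readers curious what column-wise imputation involves, the following is a rough base R sketch. It is an illustration only and makes assumptions that may not match tidyplusR: `stat_mode` and `impute_sketch` are hypothetical names, numeric columns use the requested method, and every other column falls back to its most frequent value.

```r
# Hypothetical helper (not part of tidyplusR): most frequent non-missing value.
stat_mode <- function(x) {
  values <- x[!is.na(x)]
  uniq <- unique(values)
  uniq[which.max(tabulate(match(values, uniq)))]
}

# Hypothetical helper: fill NAs column by column.
# Numeric columns use the requested method; all other columns use the mode.
impute_sketch <- function(df, method = c("mode", "mean", "median")) {
  method <- match.arg(method)
  for (name in names(df)) {
    column <- df[[name]]
    missing <- is.na(column)
    # Skip columns with nothing to fill or nothing to learn from.
    if (!any(missing) || all(missing)) next
    fill <- if (is.numeric(column) && method == "mean") {
      mean(column, na.rm = TRUE)
    } else if (is.numeric(column) && method == "median") {
      median(column, na.rm = TRUE)
    } else {
      stat_mode(column)
    }
    column[missing] <- fill
    df[[name]] <- column
  }
  df
}

toy <- data.frame(g = c("a", NA, "a", "b"),
                  v = c(1, 2, NA, 6),
                  stringsAsFactors = FALSE)

print(impute_sketch(toy, method = "mean"))
```

With `method = "mean"`, the missing entry in `v` becomes 3 (the mean of 1, 2, and 6) and the missing entry in `g` becomes "a", its most frequent value.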

Markdown table

## default: ncol = 2 and nrow = 2, alignment = "l"
md_new()
## 
## |    |    |
## |:---|:---|
## |    |    |
## |    |    |
## 3 by 3 table
md_new(nrow = 3, ncol = 3)
## 
## |    |    |    |
## |:---|:---|:---|
## |    |    |    |
## |    |    |    |
## |    |    |    |
## different alignments:
md_new(nrow = 1, align = "c")
## 
## |    |    |
## |:--:|:--:|
## |    |    |
md_new(nrow = 1, align = "r")
## 
## |    |    |
## |---:|---:|
## |    |    |
## providing header
h <- c("foo", "boo")
md_new(header = h)
## 
## | foo| boo|
## |:---|:---|
## |    |    |
## |    |    |
md_data(mtcars, row.index = 1:3, col.index = 1:4)
## 
## |    |mpg|cyl|disp|hp|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
## alignment to right
md_data(mtcars, row.index = 1:3, col.index = 1:4, align = "r")
## 
## |    |mpg|cyl|disp|hp|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
## provide header
md_data(mtcars, row.index = 1:3, col.index = 1:4, header = c("a","b","c","d"))
## 
## |    |a|b|c|d|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
## exclude row names
md_data(mtcars, row.index = 1:3, col.index = 1:4, row.names = FALSE)
## 
## |mpg|cyl|disp|hp|
## |---:|---:|---:|---:|
## |21|6|160|110|
## |21|6|160|110|
## |22.8|4|108|93|
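
The table layouts above follow a simple pattern: a header row, an alignment rule, then the body rows. The sketch below builds an empty table that way in base R. It is a simplified illustration, not the package's code; `md_skeleton` is a hypothetical function whose arguments only loosely mirror md_new.

```r
# Hypothetical helper (not part of tidyplusR): build the lines of an
# empty markdown table with the given size, alignment, and optional header.
md_skeleton <- function(nrow = 2, ncol = 2, align = "l", header = NULL) {
  if (is.null(header)) header <- rep("    ", ncol)
  stopifnot(length(header) == ncol)
  # Alignment rule for one column, as used by markdown pipe tables.
  rule <- switch(align,
                 l = ":---",
                 c = ":--:",
                 r = "---:",
                 stop("align must be 'l', 'c', or 'r'"))
  row_line <- function(cells) paste0("|", paste(cells, collapse = "|"), "|")
  c(row_line(header),
    row_line(rep(rule, ncol)),
    rep(row_line(rep("    ", ncol)), nrow))
}

table_lines <- md_skeleton(nrow = 1, ncol = 2, align = "c")
cat(table_lines, sep = "\n")
```

Returning a character vector rather than printing directly keeps the helper easy to test and lets the caller decide whether to `cat()` the lines or write them to a file.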

User Scenario

Existing features in R and Python ecosystem similar to tidyplus

Branch coverage

License

MIT

Contributing

This is an open source project. Please follow the guidelines below for contribution.

- Open an issue for any feedback and suggestions.
- For contributing to the project, please refer to Contributing for details.



UBC-MDS/tidyplusR documentation built on May 25, 2019, 1:36 p.m.