wakefield is designed to quickly generate random data sets. The user passes `n` (number of rows) and predefined vectors to the `r_data_frame` function to produce a `dplyr::tbl_df` object.
To install the development version of wakefield, download the zip ball or tar ball, decompress it, and run `R CMD INSTALL` on it, or use the pacman package:
    if (!require("pacman")) install.packages("pacman")
    pacman::p_load_gh("trinker/wakefield")
    pacman::p_load(dplyr, tidyr, ggplot2)
You are welcome to:
* submit suggestions and bug reports at: https://github.com/trinker/wakefield/issues
* send a pull request on: https://github.com/trinker/wakefield/
* compose a friendly e-mail to: tyler.rinker@gmail.com
The `r_data_frame` function (random data frame) takes `n` (the number of rows) and any number of variables (columns). These columns are typically produced by a wakefield variable function. Each variable function has a preset behavior that produces a named vector of length `n`, allowing the user to lazily pass unnamed functions (optionally without call parentheses). The column name is stored in a `varname` attribute. For example, here is the `race` variable function:
    race(n=10)
    attributes(race(n=10))
When this variable is used inside `r_data_frame`, the `varname` is used as the column name. Additionally, the `n` argument is not set within the variable functions but is set once in `r_data_frame`:
r_data_frame( n = 500, race )
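As a quick check of the naming behavior described above, the following sketch reads the `varname` attribute directly and compares it with the column name that `r_data_frame` assigns. It assumes wakefield is installed and uses only the `race` and `r_data_frame` calls already shown.

```r
library(wakefield)

# A variable function returns a vector that carries its column name
# in the "varname" attribute.
x <- race(n = 10)
attr(x, "varname")

# r_data_frame is expected to pick that attribute up as the column name.
dat <- r_data_frame(n = 10, race)
names(dat)
```

If the two printed values differ on your installation, the attribute name or the naming rule may have changed between package versions.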
The power of `r_data_frame` is apparent when we use many modular variable functions:
r_data_frame( n = 500, id, race, age, sex, hour, iq, height, died )
Calling `length(variables())` gives the number of wakefield variable functions to choose from; they span R's various data types (see `?variables` for details).
    p_load(pander, xtable)
    variables("matrix", ncol=5) %>%
        xtable() %>%
        print(type = 'html', include.colnames = FALSE, include.rownames = FALSE, html.table.attributes = '')
    # matrix(c(sprintf("`%s`", vect), blanks), ncol=4) %>%
    #     pandoc.table(format = "markdown", caption = "Available variable functions.")
However, the user may also pass their own vector-producing functions or vectors to `r_data_frame`. Those with an `n` argument can have it set by `r_data_frame`:
r_data_frame( n = 500, id, Scoring = rnorm, Smoker = valid, race, age, sex, hour, iq, height, died )
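To make the "own function" case concrete, here is a minimal sketch with a user-defined generator. The function `coin_flip` is hypothetical and exists only for illustration; the sketch assumes that any function with an `n` argument is handled the same way as `rnorm` in the call above.

```r
library(wakefield)

# Hypothetical user-defined generator: takes `n` and returns a
# vector of length n.
coin_flip <- function(n) sample(c("Heads", "Tails"), n, replace = TRUE)

dat <- r_data_frame(
    n = 20,
    id,
    Coin = coin_flip   # n is supplied by r_data_frame, not here
)
dat
```

Naming the argument (`Coin = ...`) gives the column its name, since a plain function carries no `varname` attribute.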
r_data_frame( n = 500, id, age, age, age, grade, grade, grade )
While passing variable functions to `r_data_frame` without call parentheses is handy, the user may wish to set arguments. This can be done with call parentheses, as with `data.frame` or `dplyr::data_frame`:
    r_data_frame(
        n = 500,
        id,
        Scoring = rnorm,
        Smoker = valid,
        `Reading(mins)` = rpois(lambda=20),
        race,
        age(x = 8:14),
        sex,
        hour,
        iq,
        height(mean=50, sd = 10),
        died
    )
Data often contain missing values. wakefield allows the user to add a proportion of missing values per column/vector via the `r_na` (random `NA`) function. This works nicely within a dplyr/magrittr `%>%` (then) pipeline:
r_data_frame( n = 30, id, race, age, sex, hour, iq, height, died, Scoring = rnorm, Smoker = valid ) %>% r_na(prob=.4)
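For readers who want to see the idea without the package, the following base R sketch blanks out a fixed share of cells per column. The helper `sprinkle_na` is hypothetical and is not wakefield's implementation; `r_na` may select cells differently (for example, by drawing each cell independently).

```r
# Hypothetical helper, not wakefield code: set a fixed share of the
# cells in each selected column to NA.
sprinkle_na <- function(df, prob = 0.05, cols = seq_along(df)) {
    n <- nrow(df)
    for (j in cols) {
        hits <- sample.int(n, size = round(n * prob))
        df[hits, j] <- NA
    }
    df
}

set.seed(10)
toy <- data.frame(
    id    = 1:20,
    score = rnorm(20),
    group = sample(letters[1:3], 20, replace = TRUE)
)

# Leave the id column intact and blank out 40% of the other two.
out <- sprinkle_na(toy, prob = 0.4, cols = 2:3)
colMeans(is.na(out))
```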
The `r_series` function allows the user to pass a single wakefield function and dictate how many columns (`j`) to produce.
    set.seed(10)
    r_series(likert, j = 3, n=10)
Often the user wants a numeric score for Likert-type columns and similar variables. For a series with multiple factors, the `as_integer` function converts all columns to integer values. Additionally, we may want to specify column name prefixes. This can be accomplished via the variable function's `name` argument. Both of these features are demonstrated here.
    set.seed(10)
    as_integer(r_series(likert, j = 5, n=10, name = "Item"))
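The conversion from factor levels to scores can be pictured with a small base R sketch. The level order and the toy responses below are assumptions made for illustration; wakefield's own level order, and therefore the direction of the scores, may differ.

```r
# Toy Likert-style factors with an explicit level order (assumed for
# this sketch only).
lv <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")
items <- data.frame(
    Item_1 = factor(c("Agree", "Neutral", "Strongly Agree"), levels = lv),
    Item_2 = factor(c("Disagree", "Agree", "Strongly Disagree"), levels = lv)
)

# Replace every factor column with the integer position of its level.
scores <- items
scores[] <- lapply(items, as.integer)
scores
```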
`r_series` can be used within `r_data_frame` as well.
    set.seed(10)
    r_data_frame(n=100,
        id,
        age,
        sex,
        r_series(likert, 3, name = "Question")
    )

    set.seed(10)
    r_data_frame(n=100,
        id,
        age,
        sex,
        r_series(likert, 5, name = "Item", integer = TRUE)
    )
The user can also create related series via the `relate` argument in `r_series`. It allows the user to specify the relationship between columns. `relate` may be a named list of `c("operation", "mean", "sd")` or a shorthand string of the form `"fM_sd"` where:
* `f` is one of (+, -, *, /)
* `M` is a mean value
* `sd` is a standard deviation of the mean value
For example, you may use `relate = "*4_1"`. If `relate = NULL`, no relationship is generated between columns. I will use the shorthand string form here.
    r_series(grade, j = 5, n = 100, relate = "+1_6")
    r_series(age, 5, 100, relate = "+5_0")
    r_series(likert, 5, 100, name ="Item", relate = "-.5_.1")
    r_series(grade, j = 5, n = 100, relate = "*1.05_.1")
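To show how a `"fM_sd"` string breaks down into its three parts, here is a base R sketch. Both `parse_relate` and `next_column` are hypothetical helpers, not wakefield functions, and the rule that each new column is the previous column combined with a draw from `rnorm(n, mean, sd)` is an assumption about the mechanics rather than a statement of the package's internals.

```r
# Hypothetical parser for the "fM_sd" shorthand (not wakefield code).
parse_relate <- function(x) {
    pattern <- "^([-+*/])([0-9.]+)_([0-9.]+)$"
    stopifnot(grepl(pattern, x))
    list(
        operation = sub(pattern, "\\1", x),
        mean      = as.numeric(sub(pattern, "\\2", x)),
        sd        = as.numeric(sub(pattern, "\\3", x))
    )
}

# Assumed mechanics: combine the previous column, via the operation,
# with normal noise drawn using the parsed mean and sd.
next_column <- function(prev, spec) {
    noise <- rnorm(length(prev), mean = spec$mean, sd = spec$sd)
    match.fun(spec$operation)(prev, noise)
}

str(parse_relate("*1.05_.1"))

# With sd = 0 the noise is constant, so every value shifts by the mean.
next_column(c(20, 30, 40), parse_relate("+5_0"))
```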
Use the `sd` component of `relate` to adjust correlations.
    round(cor(r_series(grade, 8, 10, relate = "+1_2")), 2)
    round(cor(r_series(grade, 8, 10, relate = "+1_0")), 2)
    round(cor(r_series(grade, 8, 10, relate = "+1_20")), 2)
    round(cor(r_series(grade, 8, 10, relate = "+15_20")), 2)
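Why a larger `sd` weakens the correlation can be seen in a base R simulation. This is a sketch under the assumption that a `"+1_sd"` successor column equals the previous column plus a draw from `rnorm(n, 1, sd)`; the starting column and the helper `cor_with_next` are stand-ins invented for the example.

```r
set.seed(10)
n <- 1000
base <- rnorm(n, mean = 88, sd = 4)   # stand-in for a first grade column

# Correlation between a column and its assumed "+1_sd" successor.
cor_with_next <- function(sd) {
    cor(base, base + rnorm(n, mean = 1, sd = sd))
}

# sd = 0 adds a constant, so the columns stay perfectly correlated;
# larger sd values add more noise and pull the correlation down.
sapply(c(0, 2, 20), cor_with_next)
```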
    dat <- r_data_frame(12,
        name,
        r_series(grade, 100, relate = "+1_6")
    )

    dat %>%
        gather(Time, Grade, -c(Name)) %>%
        mutate(Time = as.numeric(gsub("\\D", "", Time))) %>%
        ggplot(aes(x = Time, y = Grade, color = Name, group = Name)) +
            geom_line(size=.8) +
            theme_bw()
The user may wish to expand a `factor` into `j` dummy-coded columns. The `r_dummy` function expands a factor into `j` columns and works similarly to the `r_series` function. The user may wish to use the original factor name as the prefix to the `j` columns. Setting `prefix = TRUE` within `r_dummy` accomplishes this.
    set.seed(10)
    r_data_frame(n=100,
        id,
        age,
        r_dummy(sex, prefix = TRUE),
        r_dummy(political)
    )
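Dummy coding itself is easy to sketch in base R. The helper `dummy_expand` below is hypothetical and is not how `r_dummy` is implemented; in particular, the `_` separator between prefix and level is an assumption for this example.

```r
# Hypothetical helper, not wakefield code: one 0/1 column per factor
# level, optionally prefixed with a variable name.
dummy_expand <- function(x, prefix = NULL) {
    x <- as.factor(x)
    out <- lapply(levels(x), function(lv) as.integer(x == lv))
    nms <- levels(x)
    if (!is.null(prefix)) nms <- paste(prefix, nms, sep = "_")
    as.data.frame(setNames(out, nms))
}

sex_values <- c("Male", "Female", "Female", "Male")
d <- dummy_expand(sex_values, prefix = "Sex")
d
```

Each row has exactly one column set to 1, which marks the level that row belongs to.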
It is helpful to see the column types and `NA`s as a visualization. The `table_heat` function (also available as the `plot` method for `tbl_df`) provides a visual glimpse of data types and missing cells.
    set.seed(10)
    r_data_frame(n=100,
        id,
        dob,
        animal,
        grade, grade,
        death,
        dummy,
        grade_letter,
        gender,
        paragraph,
        sentence
    ) %>%
        r_na() %>%
        plot(palette = "Set1")