This package was written for Lara Kammrath from Wake Forest University and her research. It contains some general functions that may apply to multiple, more general problems (section 1) and some very specific functions (section 2) that are tailored to Lara's specific tasks.
It depends on several other packages, which are installed automatically when this package is installed.
You can run the following code to install this package from GitHub:
install.packages("devtools")
devtools::install_github("kthorstmann/padis")
library(padis)
The package contains a fake example data frame, wide_example_data, which is used to illustrate the functions.
library(padis)
data <- wide_example_data
head(data)
This function can be used to select specific variables from a data set. It is primarily built on grep or dplyr::select.
subset <- select_vars(data, id = "id", prefix="var_s")
head(subset, 3)
subset <- select_vars(data, id = c("id", "id2"), prefix="var_.+3$")
head(subset, 3)
To get more information on the use of regular expressions, see ?base::regex.
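As a rough illustration of prefix-based selection with grep, here is a minimal base R sketch. The helper select_by_prefix and the toy data are hypothetical and only mimic the idea; this is not the package's select_vars implementation.

```r
# Toy data; the column names are made up for illustration.
toy <- data.frame(id = 1:3,
                  var_s_1 = c(2, 4, 6),
                  var_s_2 = c(1, 3, 5),
                  var_p_1 = c(9, 8, 7))

# Hypothetical helper, not the package's select_vars().
select_by_prefix <- function(data, id, prefix) {
  stopifnot(is.data.frame(data), all(id %in% names(data)))
  # Anchor the pattern at the start of the name so it acts as a prefix.
  hits <- grep(paste0("^", prefix), names(data), value = TRUE)
  data[, unique(c(id, hits)), drop = FALSE]
}

sel <- select_by_prefix(toy, id = "id", prefix = "var_s")
names(sel)

# A regular expression works as well: everything starting with var_ and ending in 1.
sel_regex <- select_by_prefix(toy, id = "id", prefix = "var_.+1$")
names(sel_regex)
```

Because the prefix is passed on to grep, any regular expression can be used in its place.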
gather_df <- gather_one(data=data, key_vars="id", varying = c("var_p_1", "var_p_2"), new_name="var")
head(gather_df)
# sort by id to make it visible
newdata <- gather_df[order(gather_df$id),]
head(newdata, 20)
This function is used to change variable names, since the gather function (see below, based on tidyr::gather) relies on a specific structure of item names. It is primarily used in some functions specific to the problem at hand (section 2).
swap_char("this.string", ".", "_")
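The same kind of replacement can be sketched in base R with gsub. This is an assumption about what swap_char presumably does internally, not its actual code. The point worth showing is that a dot is a regex wildcard, so it has to be treated literally.

```r
# fixed = TRUE treats "." as a literal dot rather than the regex wildcard.
swapped <- gsub(".", "_", "this.string", fixed = TRUE)
swapped

# Without fixed = TRUE, "." matches every character, so all of them are replaced.
all_replaced <- gsub(".", "_", "this.string")
all_replaced
```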
Similar to gather_one, gather_multiple turns several variables from a wide data frame into a long data frame. Several key_vars can be added; however, the new names of the resulting variables should be changed manually. The varying object should be set to NULL, and preferably all variables that should not be turned into the long format should be removed before applying this function.
gather_mtpl_df <- gather_multiple(data=data, key_vars=c("id", "id2"), varying = NULL)
head(gather_mtpl_df)
library(padis)
df <- aggregate_df(wide_example_data, id="id")
head(df, 2)
Example when doing computations only on a few variables:
# Here, we keep only a few variables, i.e. var_p_1 and var_p_2
head(wide_example_data)
df <- aggregate_df(wide_example_data, id="id", intake_var=c("var_p_1", "var_p_2"))
head(df, 2)
Missing data in the ID variable:
## we add some missings:
wide_with_missings <- wide_example_data
wide_with_missings[sample(1:nrow(wide_with_missings), 10),"id"] <- NA
wide_with_missings[sample(1:nrow(wide_with_missings), 10),"var_p_1"] <- NA
df <- aggregate_df(wide_with_missings, id="id", intake_var=c("var_p_1", "var_p_2"))
head(df, 2)
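To make the effect of missing values on a per-ID aggregation concrete, here is a small base R sketch using stats::aggregate. It is only an illustration on made-up data. How aggregate_df itself treats missing IDs or missing values is not shown in this document, so the choices below (dropping rows without an ID, ignoring missing values within an ID) are assumptions.

```r
toy <- data.frame(
  id      = c(1, 1, 2, 2, NA),
  var_p_1 = c(1, 3, NA, 4, 5),
  var_p_2 = c(2, 4, 6, 8, 10)
)

# Assumption: rows without an id cannot be assigned to anyone and are dropped.
complete_id <- toy[!is.na(toy$id), ]

# Mean per id; na.rm = TRUE ignores missing values within an id.
means <- aggregate(complete_id[c("var_p_1", "var_p_2")],
                   by = list(id = complete_id$id),
                   FUN = mean, na.rm = TRUE)
means

# The same call with another summary, here the standard deviation per id.
sds <- aggregate(complete_id[c("var_p_1", "var_p_2")],
                 by = list(id = complete_id$id),
                 FUN = sd, na.rm = TRUE)
```

With only one non-missing value in a group, sd returns NA, which is worth checking before interpreting per-participant standard deviations.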
data <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_2/Aggregating.csv")
df <- aggregate_df(data, id="PAR.ID")
head(df)
## return only the sd per participant
df <- aggregate_df(data, id="PAR.ID", out_values="sd")
head(df, 2)
# If the function is called for the first time and the file does not exist
write_to_wd(data, folder = NULL, type = "xlsx", name="example_data_wfu", overwrite=FALSE)
Section 2 contains the functions that are specifically built for the tasks at Wake Forest University and Lara Kammrath's workgroup. They may not be applicable to a broader set of problems. However, they all rely on the functions above.
This function searches for two items that share the same stem but end in two different end-strings, e.g. .r and .i:
library(padis)
# get example data
data <- imaginer_example_data
# add a random variable, here 2
data$random <- 2
head(data)
names <- match_ir(data, var_stem = "PSP1.variable")
data[1:10, names] # select the matched variables
It does not matter whether you write var_stem="PSP1.variable" or var_stem="PSP1.variable." (the difference being the dot at the end); the function works either way. Further, the defaults for the end-patterns are ".r" and ".i".
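One way such stem matching could work is sketched below in base R. The helper match_pair and its arguments end_1 and end_2 are hypothetical names chosen for this illustration; the package's match_ir may be implemented differently.

```r
# Hypothetical helper, not the package's match_ir().
match_pair <- function(data, var_stem, end_1 = ".r", end_2 = ".i") {
  stem <- sub("\\.$", "", var_stem)            # drop an optional trailing dot
  candidates <- paste0(stem, c(end_1, end_2))  # e.g. "PSP1.variable.r", "PSP1.variable.i"
  candidates[candidates %in% names(data)]      # keep only names present in the data
}

toy <- data.frame(PSP1.variable.r = 1, PSP1.variable.i = 2, random = 2)

without_dot <- match_pair(toy, "PSP1.variable")
with_dot    <- match_pair(toy, "PSP1.variable.")
without_dot
```

Stripping the trailing dot before pasting on the end-patterns is what makes both spellings of the stem equivalent in this sketch.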
This function looks for two items that have the same stem, but different end-strings. It then matches these two variables in the following way:
The first variable, with .r at the end (e.g. PSP1.variable.r), and the second variable, with .i at the end (e.g. PSP1.variable.i), are scanned. The function identifies which of the two contains a number and, if either does, puts that number into a new variable, e.g. PSP1.variable. A separate 0/1 variable, e.g. PSP.real.imagined, indicates which of the two variables the number came from.
data <- imaginer_example_data
data$PSP1.variable.r[!is.na(data$PSP1.variable.r)] <- 1
data$PSP1.variable.r[is.na(data$PSP1.variable.r)] <- 0
data$PSP1.variable.i[!is.na(data$PSP1.variable.i)] <- 1
data$PSP1.variable.i[is.na(data$PSP1.variable.i)] <- 0
data$PSP1.variable.i + data$PSP1.variable.r
head(data)
return_df <- imaginer(data, var_stem="PSP1.variable.", return_complete_data=TRUE)
head(return_df, 2)
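The matching rule described above can be sketched in a few lines of base R. This is a simplified illustration on made-up data, not the code of imaginer. In particular, the coding of the indicator (1 = value came from .r, 0 = value came from .i) and the preference for .r when both are present are assumptions.

```r
toy <- data.frame(
  PSP1.variable.r = c(5, NA, NA, 2),
  PSP1.variable.i = c(NA, 3, NA, NA)
)

r <- toy$PSP1.variable.r
i <- toy$PSP1.variable.i

# Take the .r value where present, otherwise the .i value (NA if both are missing).
toy$PSP1.variable <- ifelse(!is.na(r), r, i)

# Assumed coding: 1 = value came from .r, 0 = value came from .i, NA = neither.
toy$PSP.real.imagined <- ifelse(!is.na(r), 1, ifelse(!is.na(i), 0, NA))

toy
```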
This function merges all data up or down into several other data sets. If merge_down = FALSE, the data will be merged up, i.e. unique rows of the data frame data_from will be used for the aggregation.
The result is a list of data frames with the same length as data_list_to, which has to be a list of data frames as well. The argument select_variables allows selecting a specific set of variables based on their prefix (this relies on the function padis::select_vars).
library(padis)
## simulate data
data_long_1 <- data.frame(PAR.ID = letters[1:10], var1 = sample(1:10, 10), var2 = sample(1:10, 10))
data_long_2 <- data.frame(PAR.ID = sort(rep(letters[1:10], 3)), var1 = sort(rep(1:10, 3)), var2 = sort(rep(11:20, 3)))
data_long_2_er <- data.frame(PAR.ID = sort(rep(letters[1:10], 3)), var1 = sort(rep(1:10, 3)), var2 = sort(rep(11:20, 3)))
data_long_2_er[3, "var2"] <- 12 ## add a value that is not the same as the other values
data_long_3 <- data.frame(PAR.ID = rep(letters[1:10], 4), var1 = sample(1:40, 40), var2 = sample(1:40, 40))
data_short_2 <- data.frame(PAR.ID = rep(letters[1:10], 1), var1 = sample(1:40, 10), var2 = sample(1:40, 10))
data_short_3 <- data.frame(PAR.ID = rep(letters[1:10], 1), var1 = sample(1:40, 10), var2 = sample(1:40, 10))
## merge down, i.e. duplicate rows when going from a higher data set (e.g. level-2) to a lower data set (e.g. level-1); here, also select one additional variable from data_from (this could also be a prefix)
data_list <- merge_multiple_df(data_from = data_long_1, id_var="PAR.ID", list(data_long_2, data_long_3), merge_down = TRUE, select_variables = "var2")
purrr::map(data_list, head)
## merge up, i.e. make the longer data frame short first and merge then
data_list <- merge_multiple_df(data_long_1, id_var="PAR.ID", list(data_short_2, data_short_3), merge_down = FALSE, select_variables = "var2")
purrr::map(data_list, head)
data_list <- merge_multiple_df(data_long_2_er, id_var="PAR.ID", list(data_short_2, data_short_3), merge_down = FALSE, select_variables = "var2")
purrr::map(data_list, head)
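The difference between merging down and merging up can be sketched with base R's merge and lapply. The data frames and variable names below are invented for the illustration, and the sketch only mirrors the idea; it is not the implementation of merge_multiple_df.

```r
# Person-level data (one row per PAR.ID) and two lower-level data frames.
person <- data.frame(PAR.ID = c("a", "b"), var2 = c(10, 20))
targets <- list(
  diary     = data.frame(PAR.ID = c("a", "a", "b"), mood = c(1, 2, 3)),
  selection = data.frame(PAR.ID = c("b", "a"), score = c(7, 8))
)

# Merging down: each person-level value is repeated for every matching row.
merged_down <- lapply(targets, function(df) {
  merge(df, person, by = "PAR.ID", all.x = TRUE)
})

# Merging up: reduce a long data frame to its unique rows first, then merge.
long_from  <- data.frame(PAR.ID = c("a", "a", "b", "b"), var2 = c(10, 10, 20, 20))
short_from <- unique(long_from)
short_to   <- data.frame(PAR.ID = c("a", "b"), age = c(30, 40))
merged_up  <- merge(short_to, short_from, by = "PAR.ID", all.x = TRUE)
```

If the long data frame holds inconsistent values within one ID (as in data_long_2_er above), unique leaves more than one row for that ID, so a plain merge would duplicate rows instead of merging up cleanly.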
If variables need to be selected first, use the function select_vars to select them from a data frame, e.g.
select_df <- padis::select_vars(data=data, ids="PAR.ID", prefix="the_prefix_of_the_variables") # function is not run, this would cause an error
There are two functions to split the raw data frames:
get_selection_long: Turn raw daily data into the selection data set
transform_data: Turn raw daily data and intake data sets into the five specific data sets
First, the data sets need to be loaded (this part is not documented here).
## First, the data need to be read:
setwd("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data")
data_dy <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data/Raw Diary Selection Data.csv", sep = ";", stringsAsFactors = FALSE)
data_in <- xlsx::read.xlsx("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data/Raw Intake Sample.xlsx", 1)
head(data_dy, 3)
head(data_in, 3)
library(padis)
selection_long <- get_selection_long(data_dy)
head(selection_long, 3)
# the results will now appear (in the default mode) in the workspace
transform_data(data_dy, data_in, overwrite=TRUE)
id and a corresponding days_in var
If next_day = TRUE, each entry that has one "one-lower" match is flagged and all corresponding values from another set of variables are extracted; if next_day = FALSE, it searches for the value on the previous day
prefix: "N" for next day and "P" for previous day.
library(padis)
data <- next_day_data
# returns next day
other_day(data, next_day=TRUE)
# returns previous day
other_day(data)
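A next-day lookup of this kind could be sketched in base R as follows. The exact matching rule of other_day is not spelled out here, so this sketch rests on assumptions: a row's next-day partner is taken to be the row with the same id and days_in + 1, and the names mood, N_mood and has_next are invented for the illustration.

```r
toy <- data.frame(
  id      = c(1, 1, 1, 2, 2),
  days_in = c(1, 2, 4, 1, 2),
  mood    = c(10, 20, 40, 50, 60)
)

# Build a key per row and the key of the row one day later for the same id.
own_key  <- paste(toy$id, toy$days_in)
next_key <- paste(toy$id, toy$days_in + 1)

# Position of the next-day row, NA where no such entry exists.
idx <- match(next_key, own_key)

toy$has_next <- !is.na(idx)     # flag entries that have a next-day match
toy$N_mood   <- toy$mood[idx]   # "N" prefix for the next-day value
toy
```

Using days_in - 1 instead of days_in + 1 would give the previous-day variant, which would then carry the "P" prefix.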
For the Real.Imagined Function, I want the function to identify all the pairs of variables with matching names except for a .r or .i suffix and create new variables for all of them with a .c suffix.
library(padis)
real_imagined(real_imagined_data, ignore_double=TRUE)
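The pairing step in this request can be sketched in base R. This is an illustration on made-up data, not the code of real_imagined. It assumes that the new .c variable holds the combined value (the .r value if present, otherwise the .i value), which the request itself does not state.

```r
toy <- data.frame(
  A.r = c(1, NA, NA), A.i = c(NA, 2, NA),
  B.r = c(NA, NA, 4), B.i = c(3, NA, NA),
  C.r = c(1, 1, 1)    # no .i partner, so it is skipped
)

# Stems of all variables ending in .r and of all variables ending in .i.
r_stems <- sub("\\.r$", "", grep("\\.r$", names(toy), value = TRUE))
i_stems <- sub("\\.i$", "", grep("\\.i$", names(toy), value = TRUE))

# Only stems that occur with both suffixes form a pair.
paired <- intersect(r_stems, i_stems)

for (stem in paired) {
  r <- toy[[paste0(stem, ".r")]]
  i <- toy[[paste0(stem, ".i")]]
  # Assumption: prefer the .r value, fall back to the .i value.
  toy[[paste0(stem, ".c")]] <- ifelse(!is.na(r), r, i)
}

names(toy)
```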
This is how one could run the imaginer function over several variables at once, using a loop.
data <- read.csv2("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_2/KaiTrial.csv", sep = ",")
library(padis)
## this would be for one variable
match_ir(data, var_stem="Issue1.PSP3.ContactChat.")
return_df <- imaginer(data, var_stem="Issue1.PSP3.ContactChat.", return_complete_data=FALSE)
head(return_df)
## if you want to do this now for every variable, do:
names_data <- names(data) ## get the variable names of the data set
issue_vars <- grep(x = names_data, pattern = "Issue.+\\.i$", value = TRUE) ## get the names that contain Issue and end with .i (the dot is escaped so that it is matched literally)
stems <- stringr::str_replace_all(issue_vars, "\\.i$", "") ## replace the .i with nothing to generate the stems that we will be looking for
result_list <- list() ## named result_list to avoid masking base::list
for (i in seq_along(stems)) { ## a for loop, to apply the function to each stem that was detected beforehand
  result_list[[i]] <- imaginer(data, var_stem=stems[i], return_complete_data=FALSE)
}
purrr::map(result_list, head)
# data <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_5/SSD.PracticeWideData2.csv")
# head(data)
# gather_mtpl_df <- gather_multiple(data=data, key_vars=c("Day_UniqueID", "PAR_ID", "Day_Num"), varying = NULL, search = "\\d")
#
#
# ## change the function so that it can detect numbers in the middle as well in the variable names
# names(data)
# function(data, key_vars, varying = NULL, search = "Issue\\d_"){
#   stopifnot(is.data.frame(data))
#   if (is.null(varying)){
#     all_names <- names(data)
#     varying <- setdiff(all_names, key_vars)
#   }
#   stems <- generate_stems(varying, search = search)
#   varying_list <- purrr::map(stems, ~ grep(x=varying, pattern=., value = TRUE))
#   long_list <- purrr::map2(varying_list, stems, ~ gather_one(data=data, key_vars = key_vars,
#                                                             varying=.x, new_name=.y))
#
#   purrr::map(long_list, tail)
#
#   long_df <- as.data.frame(long_list)
#   long_re_df <- long_df[c(key_vars, stems)]
#   long_re_df
# }
#
# ## add this as error messages
#
# head(long_re_df, 100)
# any(table(data$Day_UniqueID) != 1)
# which(table(long_re_df$Day_UniqueID) > 2)
#
# sum(table(long_re_df$Day_UniqueID) == 2)
# sum(table(long_re_df$Day_UniqueID) == 4)
# table(long_re_df$Day_UniqueID)["290.1"]
# head(data)
From the e-mail:
If I have a dataset with these variable names: PAR.ID, PSP1.close, PSP2.close, PSP1.freq, PSP2.freq and I want to stack the variables, how does the function know to stack the two close variables on top of each other and the two freq variables on top of each other? I didn't see anything in the arguments in the vignette that would tell the function which variables to stack. That could be one of the things we changed in our meeting and I'm just fuzzy on the details of how the new way works. Or maybe I forgot to bring it up.
In this case, the function would need "PSP\\d." as its search argument, searching for each string that has "PSPsomenumber." The function does not work if the common elements are in the middle of the string (like "somenumberPSPnumber*_freq").
data <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_5/SSD.PracticeWideData2.csv")
head(data)
gather_mtpl_df <- gather_multiple(data=data, key_vars=c("Day_UniqueID", "PAR_ID", "Day_Num"), varying = NULL, search = "Issue\\d_")
head(gather_mtpl_df)
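To make the role of the search pattern concrete, here is a base R sketch of how a numbered prefix can be stripped to find the shared stems and how the matching columns can then be stacked. It uses the variable names from the e-mail on made-up values. It is an illustration of the idea only and does not reproduce the internals of gather_multiple or generate_stems.

```r
wide <- data.frame(PAR.ID = c("a", "b"),
                   PSP1.close = c(1, 2), PSP2.close = c(3, 4),
                   PSP1.freq  = c(5, 6), PSP2.freq  = c(7, 8))

key_vars <- "PAR.ID"
varying  <- setdiff(names(wide), key_vars)

# Remove the numbered prefix; what remains is the stem shared by a group.
stems <- unique(sub("PSP\\d\\.", "", varying))

# For each stem, stack the matching columns on top of each other.
stacked <- lapply(stems, function(s) {
  cols <- grep(paste0("\\.", s, "$"), varying, value = TRUE)
  unlist(wide[cols], use.names = FALSE)
})
names(stacked) <- stems

# Repeat the key variables once per stacked column (assumes equal group sizes).
n_rep <- length(varying) / length(stems)
long  <- data.frame(wide[rep(seq_len(nrow(wide)), times = n_rep), key_vars, drop = FALSE],
                    stacked, row.names = NULL)
long
```

The two close variables end up in one column and the two freq variables in another because they share what is left of the name after the pattern is removed.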
library(padis)
data <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_6/SelectionLevel.no16 test.csv")
ssd.agg.slct <- aggregate_df(data, id="Issue.UniqueID", intake_var=c("Sought.Mother", "Sought.Father", "Sought.Partner", "Sought.BestFriend", "Sought.Attach.c7","Sought.NonAttach.c7","Sought.Attach","Sought.NonAttach"), out_values="true")
head(ssd.agg.slct)
# check input
subset(data, Issue.UniqueID == 111805)
# check output
subset(ssd.agg.slct, Issue.UniqueID == 111805)
ssd.day <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_4/DayLevel 17.08.01.csv")
ssd.agg.day <- aggregate_df(ssd.day, id="PAR.ID", out_values="max")
head(ssd.agg.day, 3)
Working with lists makes things easier, because multiple data frames can be used at the same time.
## get the head of each data frame in the list object data_list
purrr::map(data_list, head)
## get descriptive statistics for each data frame in the list object data_list
purrr::map(data_list, psych::describe)
## or make your own function
purrr::map(data_list, function(x) {x + 1}) # in this case add one to each variable. If the variable is not numerical, then there is a Warning (and in this case, Missings arise)
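The same pattern works with base R's lapply. The sketch below, on invented data, applies a custom function to every data frame in a list but restricts the computation to numeric columns, which sidesteps the problem with non-numerical variables. The helper add_one_numeric is hypothetical.

```r
dfs <- list(
  first  = data.frame(id = c("a", "b"), x = c(1, 2)),
  second = data.frame(id = c("c", "d"), x = c(3, 4), y = c(5, 6))
)

# Hypothetical helper: add one to numeric columns only, leave the rest untouched.
add_one_numeric <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], function(col) col + 1)
  df
}

shifted <- lapply(dfs, add_one_numeric)
shifted$second
```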
# for help for regex, see: ?regex
names <- names(data)
## all containing 'Is'
grep(x = names, pattern = "Is", value = TRUE)
## all containing any number
grep(x = names, pattern = "[0-9]", value = TRUE)
## all containing '1'
grep(x = names, pattern = "1", value = TRUE)
## all containing 'D'
grep(x = names, pattern = "D", value = TRUE)
## all containing 'D' at the end
grep(x = names, pattern = "D$", value = TRUE)
## all containing 'D' at the beginning
grep(x = names, pattern = "^D", value = TRUE)
## Issue at the beginning and Bad at the end
grep(x = names, pattern = "^Issue.+Bad$", value = TRUE) # .+ = anything in between
data <- cbind(data[4], data)
data[c(1, 3:4)]