Introduction

This package was written for Lara Kammrath from Wake Forest University and her research. It contains some general functions that may apply to multiple, more general problems (section 1) and some very specific functions (section 2) that are tailored to Lara's specific tasks.

It relies on several different packages, which are installed automatically once this package is installed

How to install this package:

You can run the following code to install this package from GitHub:

install.packages("devtools")
devtools::install_github("kthorstmann/padis")
library(padis)

Section 1 - General functions

Example data frame

Th package contains a fake example data frame, wide_example_data, which is used to illustrate the functions.

library(padis)
data <- wide_example_data
head(data)

Data selection and transformation

select_vars

This function can be used to select specific variables from a data set. It is primarily build on grep or dplyr::select.

subset <- select_vars(data, id = "id", prefix="var_s")
head(subset, 3)

subset <- select_vars(data, id = c("id", "id2"), prefix="var_.+3$")
head(subset, 3)

To get more information on the use of regular expressions, see ?base::regex.

gather_one

gather_df <- gather_one(data=data, key_vars="id", varying = c("var_p_1", "var_p_2"), new_name="var")
head(gather_df)
# sort by id to make it visible
newdata <- gather_df[order(gather_df$id),] 
head(newdata, 20)

swap_char

This function is used to change variable names since the gather function (see below, based on tidyr::gather) relies on a specific strucutre of item names. It is primarily used in some functions specific to the problem at hand (section 2).

swap_char("this.string", ".", "_")

gather_multiple

Similar to gather_one, gather_multiple turns several variables from a wide data frame into a long data frame. Several key_vars can be added, however, the new names of the resulting variables should be changed manually. The varying object should be set to NULL, and preferably all variables that should not be turned into th long format should be removed before applying this function.

gather_mtpl_df <- gather_multiple(data=data, key_vars=c("id", "id2"), varying = NULL)
head(gather_mtpl_df)

aggregate_df

library(padis)
df <- aggregate_df(wide_example_data, id="id")
head(df, 2)

Example when doing computations only on a few variables:

# Here, we keep only a few variables, i.e. var_p_1 and var_p_2
head(wide_example_data)
df <- aggregate_df(wide_example_data, id="id", 
                   intake_var=c("var_p_1", "var_p_2"))
head(df, 2)

Missing data in the ID variable:

## we add some missings:

wide_with_missings <- wide_example_data
wide_with_missings[sample(1:nrow(wide_with_missings), 10),"id"] <- NA
wide_with_missings[sample(1:nrow(wide_with_missings), 10),"var_p_1"] <- NA

df <- aggregate_df(wide_with_missings, id="id", 
                   intake_var=c("var_p_1", "var_p_2"))
head(df, 2)
data <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_2/Aggregating.csv")
df <- aggregate_df(data, id="PAR.ID")
head(df)

## return only the sd per participant
df <- aggregate_df(data, id="PAR.ID", out_values="sd")
head(df, 2)

Saving data

# If the function is called for the first time and the file does not exist
write_to_wd(data, folder = NULL, type = "xlsx", name="example_data_wfu", overwrite=FALSE)

Section 2 - specific padis functions

Section 2 contains the functions that are specifically built for the tasks at Wake Forest Unviersity and Lara Kammrath's workgroup. They may not be applicable to a broader set of problems. However, they all rely on the functions above.

match_ir

This function searches for two items that have the same stem but end on two different end-strings, e.g. .r and .i:

library(padis)
# get example data
data <- imaginer_example_data
# add a random variable, here 2
data$random <- 2
head(data)
names <- match_ir(data, var_stem = "PSP1.variable")
data[1:10, names] # select the matched variables

It does not matter if you write var_stem="PSP1.variable" or var_stem="PSP1.variable." (the difference being the dot at the end), the function works either way. Further, the default for the end-pattern is ".r" and ".i".

imaginer

This function looks for two items that have the same stem, but different end-strings. It then matches these two variables in the following way:

The first variable with the .r at the end, e.g. PSP1.variable.r, and the second variable with .i at the end, e.g. PSP1.variable.i, are scanned and the function identifies which one has a number in it, and if either, put that number in the first new variable, e.g. PSP1.variable. A separate 0/1 variable, e.g. PSP.real.imagined, would indicate where the numbers in the set of variables came from.

data <- imaginer_example_data
data$PSP1.variable.r[!is.na(data$PSP1.variable.r)] <- 1
data$PSP1.variable.r[is.na(data$PSP1.variable.r)] <- 0
data$PSP1.variable.i[!is.na(data$PSP1.variable.i)] <- 1
data$PSP1.variable.i[is.na(data$PSP1.variable.i)] <- 0

data$PSP1.variable.i + data$PSP1.variable.r

head(data)
return_df <- imaginer(data, var_stem="PSP1.variable.", return_complete_data=TRUE)
head(return_df, 2)

merge_multiple_df

This function merges all data up or down into several other data sets. If merge_down = FALSE, the data will be merged up, i.e. unique rows of the dataframe data_from will be used for the aggregation.

The result is a list with data frames with the same length as data_list_to, which has to be a list of data frames as well. The argument select_variables allows selecting a specific set of variabels, based on their prefix (this is based on the function padis::select_vars).

library(padis)
## simulate data
data_long_1 <- data.frame(PAR.ID = letters[1:10],
                          var1 = sample(1:10, 10),
                          var2 = sample(1:10, 10))
data_long_2 <- data.frame(PAR.ID = sort(rep(letters[1:10], 3)),
                          var1 = sort(rep(1:10, 3)),
                          var2 = sort(rep(11:20, 3)))
data_long_2_er <- data.frame(PAR.ID = sort(rep(letters[1:10], 3)),
                          var1 = sort(rep(1:10, 3)),
                          var2 = sort(rep(11:20, 3)))
data_long_2_er[3, "var2"] <- 12 ## add a variable that is not the same as the other variables
data_long_3 <- data.frame(PAR.ID = rep(letters[1:10], 4),
                          var1 = sample(1:40, 40),
                          var2 = sample(1:40, 40))
data_short_2 <- data.frame(PAR.ID = rep(letters[1:10], 1),
                          var1 = sample(1:40, 10),
                          var2 = sample(1:40, 10))
data_short_3 <- data.frame(PAR.ID = rep(letters[1:10], 1),
                          var1 = sample(1:40, 10),
                          var2 = sample(1:40, 10))



## merge down, i. e. duplicate rows when going from a higher data set (e.g .level-2) to a lower data set (e.g. level-1), here, also select one additional variable from data_from (this could also be a prefix)
data_list <- merge_multiple_df(data_from = data_long_1, id_var="PAR.ID", list(data_long_2, data_long_3), merge_down = TRUE, select_variables = "var2")
purrr::map(data_list, head)
## merge up, i. e. make the longer data frame short first and merge then
data_list <- merge_multiple_df(data_long_1, id_var="PAR.ID", list(data_short_2, data_short_3), merge_down = FALSE, select_variables = "var2")
purrr::map(data_list, head)

data_list <- merge_multiple_df(data_long_2_er, id_var="PAR.ID", list(data_short_2, data_short_3), merge_down = FALSE, select_variables = "var2")
purrr::map(data_list, head)

If variables need to selected first, use the function select_vars to select variables from a data frame first, e.g.

select_df <- padis::select_vars(data=data, ids="PAR.ID", prefix="the_prefix_of_the_variables") # function is not run, this would cause an error

Functions for splitting the data frames

There are two functions to split the raw data frames:

First, the data sets need to be loaded (this part is not documented here)

## First, the data need to be read:

setwd("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data")
data_dy <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data/Raw Diary Selection Data.csv", sep = ";", stringsAsFactors = FALSE)
data_in <- xlsx::read.xlsx("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data/Raw Intake Sample.xlsx", 1)
head(data_dy, 3)
head(data_in, 3)

get_selection_long

library(padis)
selection_long <- get_selection_long(data_dy)
head(selection_long, 3)

transform_data

# the results will now appear (in the default mode) in the workspace
transform_data(data_dy, data_in, overwrite=TRUE)

next_day

library(padis)
data <- next_day_data

#returns next day
other_day(data, next_day=TRUE)

#returns previous day
other_day(data)

real_imagined

For the Real.Imagined Function, I want the function to identify all the pairs of variables with matching names except for a .r or .i suffix and create new variables for all of them with a .c suffix.

library(padis)
real_imagined(real_imagined_data, ignore_double=TRUE)

Example for specific data sets

Example: imaginer for several variables

This is how one could run the imaginer function over several variables at once, using a loop.

data <- read.csv2("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_2/KaiTrial.csv", sep = ",")
library(padis)

## this would be for one variable
match_ir(data, var_stem="Issue1.PSP3.ContactChat.")
return_df <- imaginer(data, var_stem="Issue1.PSP3.ContactChat.", return_complete_data=FALSE)
head(return_df)


## if you want to do this now for every variable, do:
names_data <- names(data) ## get the names of the data sets
issue_vars <- grep(x = names_data, pattern = "Issue+.+i$", value = TRUE) ## get the names that end with .i
stems <- stringr::str_replace_all(issue_vars, ".i$", "") ## replace the .i with nothing to generate the stems that we will be looking for

list <- list()
for (i in seq_along(stems)) { ## a for loop, to apply the function to each stem that was detected beforehand
  list[[i]] <- imaginer(data, var_stem=stems[i], return_complete_data=FALSE)
}
purrr::map(list, head)

Example: Gather multiple

# data <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_5/SSD.PracticeWideData2.csv")
# head(data)
# gather_mtpl_df <- gather_multiple(data=data, key_vars=c("Day_UniqueID", "PAR_ID", "Day_Num"), varying = NULL, search = "\\d")
# 
# 
# ## change the function so that it can detect numbers in the middle as well in the variable names
# names(data)
# function(data, key_vars, varying = NULL, search = "Issue\\d_"){
#   stopifnot(is.data.frame(data))
#   if (is.null(varying)){
#     all_names <- names(data)
#     varying <- setdiff(all_names, key_vars)
#   }
#   stems <- generate_stems(varying, search = search)
#   varying_list <- purrr::map(stems, ~ grep(x=varying, pattern=., value = TRUE))
#   long_list <- purrr::map2(varying_list, stems, ~ gather_one(data=data, key_vars = key_vars,
#                                                              varying=.x, new_name=.y))
#   
#   purrr::map(long_list, tail)
#   
#   long_df <- as.data.frame(long_list)
#   long_re_df <- long_df[c(key_vars, stems)]
#   long_re_df
# }
# 
# ## add this as error messages
# 
# head(long_re_df, 100)
# any(table(data$Day_UniqueID) != 1)
# which(table(long_re_df$Day_UniqueID) > 2)
# 
# sum(table(long_re_df$Day_UniqueID) == 2)
# sum(table(long_re_df$Day_UniqueID) == 4)
# table(long_re_df$Day_UniqueID)["290.1"]
# head(data)

From the e-mail:

If I have a dataset with these variable names: PAR.ID, PSP1.close, PSP2.close, PSP1.freq, PSP2.freq and I want to stack the variables, how does the function know to stack the two close variables on top of each other and the two freq variables on top of each other? I didn't see anything in the arguments in the vignette that would tell the function which variables to stack. That could be one of the things we changed in our meeting and I'm just fuzzy on the details of how the new way works. Or maybe I forgot to bring it up.

In this case, the function would need as search argument "PSP\\d.", searching for each string that has "PSPsomenumber." The function does not work if the common elements are in the middle of the string (like "somenumberPSPnumber*_freq").

data <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_5/SSD.PracticeWideData2.csv")
head(data)
gather_mtpl_df <- gather_multiple(data=data, key_vars=c("Day_UniqueID", "PAR_ID", "Day_Num"), varying = NULL, search = "Issue\\d_")
head(gather_mtpl_df)

Example: aggregate_df

library(padis)
data <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_6/SelectionLevel.no16 test.csv")

ssd.agg.slct <- aggregate_df(data, id="Issue.UniqueID", intake_var=c("Sought.Mother", "Sought.Father", "Sought.Partner", "Sought.BestFriend", "Sought.Attach.c7","Sought.NonAttach.c7","Sought.Attach","Sought.NonAttach"), out_values="true")
head(ssd.agg.slct)


# check input
subset(data, Issue.UniqueID == 111805)
# check output
subset(ssd.agg.slct, Issue.UniqueID == 111805)

aggregate_df, second example

ssd.day <- read.csv("/Users/kaihorstmann/Dropbox/01Uni/01Projekte/96collaboration/Lara Kammrath/Materials/data_4/DayLevel 17.08.01.csv")
ssd.agg.day <- aggregate_df(ssd.day, id="PAR.ID", out_values="max")
head(ssd.agg.day, 3)

Working with lists

Wotking with lists makes things easier, because multiple data frames can be used at the same time.

## get the head of each data frame in the list object list
purrr::map(data_list, head)

## get the head of each data frame in the list object list
purrr::map(data_list, psych::describe)

## or make your own function
purrr::map(data_list, function(x) {x + 1}) # in this case add one two each variable. If the variable is not numerical, then there is a Warning (and in this case, Missings arise)

Example regex

# for help for regex, see:
?regex
names <- names(data)
## all containing 'Is'
grep( x = names, pattern = "Is", value = TRUE)

## all containing any number
grep( x = names, pattern = "[0-9]", value = TRUE)

## all containing '1'
grep( x = names, pattern = "1", value = TRUE)

## all containing 'd'
grep( x = names, pattern = "D", value = TRUE)

## all containing 'd' at the end
grep( x = names, pattern = "D$", value = TRUE)

## all containing 'd' at the beginning
grep( x = names, pattern = "^D", value = TRUE)

## Issue at the beginning and Bad at the end
grep( x = names, pattern = "Issue.+.Bad", value = TRUE) #.+. = anything inbetween

data <- cbind(data[4], data)
data[c(1, 3:4)]


kthorstmann/padis documentation built on May 24, 2019, 5:01 a.m.