#Only to be run locally to link to vignette from GitHub README
rmarkdown::render("vignettes/getting_started.Rmd", output_file="doc/getting_started.html")
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This vignette introduces some of the basic functions that quickly get you from a dataset to publication-ready outputs. It uses tidyverse/dplyr-functions alongside timesaveR as that is the recommended workflow.

library(timesaveR)
library(dplyr)

Create scales

Let's create scales for health behaviours and depressive symptoms, each including some reverse coding. For that, scale items need to be included into a list of named character vectors; if any items are to be reversed, they should be added into another list of character vectors.

scales_items <- list(
  depression = c("fltdpr", "flteeff", "slprl", "wrhpp", "fltlnl", 
                 "enjlf", "fltsd", "cldgng"),
  healthy_eating = c("etfruit", "eatveg")
  )

scales_reverse <- list(
  depression = c("wrhpp", "enjlf"),
  healthy_eating = c("etfruit", "eatveg")
)

Then, the make_scales() command can be used to calculate scale scores as well as descriptives. By default, it also returns histograms showing the distribution of each item and the resulting scale, which can help to spot problems. By default, it also returns both scale scores and descriptives that can be processed separately.

scales <- make_scales(ess_health, items = scales_items, reversed = scales_reverse)

#Check descriptives, including reliability
scales$descriptives %>% round_df()

#Add scale scores to dataset
ess_health <- cbind(ess_health, scales$scores)

Report correlations and descriptive statistics

At the start of a data analysis project, we are often interested in descriptive statistics, variable distributions and correlations. As far as numerical variables are concerned, all of them can be reported in a pretty table with the report_cor_table() function. That function needs a correlation matrix as its first argument, which can usually be created with the cor_matrix() function.

ess_health %>%
  select(agea, health, depression, healthy_eating) %>%
  cor_matrix() %>%
  report_cor_table()

It is often helpful to use different variable names for display. For this, functions in the package typically accept rename tibbles that contain an old and a new column. By using the tibble::tribble notation, they can be entered more cleanly that named character vectors, and the get_rename_tribble() helper function creates most of the code.

#Use get_rename_tribbles to get most of this code
var_renames <- tibble::tribble(
  ~old,             ~new,             
   "agea",           "Age",          
   "health",         "Poor health",        
   "depression",     "Depression",    
   "healthy_eating",  "Healthy eating"
)

# If var_names are provided, only variables included in that argument 
# are included in the correlation table
ess_health %>% cor_matrix(var_names = var_renames) %>% report_cor_table()

Often, it is interesting to also include variable distributions into this initial data summary table. This can also be done with the report_cor_table() function. By default, variables with 5 distinct values or fewer are shown with a histogram, while for those with more values, a density plot is provided - but this can be changed with the plot_type argument.

ess_health %>% cor_matrix(var_names = var_renames) %>%
  report_cor_table(add_distributions = TRUE, data = ess_health)

Describe categorical variables and their relation with an outcome

Often, we are also interested in how the means of an outcome variable differ between different groups. It can be fiddly to get these tables and the pairwise significance tests done, but this function does it in a breeze.

If you want to rename variable levels as well as names, the get_rename_tribbles() function becomes a real timesaver.

# Start with get_rename_tribbles(ess_health, gndr, cntry) in the console

var_renames <- tibble::tribble(
    ~old,     ~new,     
    "gndr",   "Gender",  
    "cntry",  "Country"
)

level_renames <- tibble::tribble(
    ~var,     ~level_old, ~level_new, 
    "gndr",   "1",        "male",       
    "gndr",   "2",        "female",       
    "cntry",  "DE",       "Germany",      
    "cntry",  "FR",       "France",      
    "cntry",  "GB",       "UK"
)

Then run report_cat_vars() to get a table with the distribution of the categories and their association with the outcome variable.

report_cat_vars(ess_health, health, gndr, cntry, var_names = var_renames, 
              level_names = level_renames)

Report regression models with standardized coefficients

In psychology, it is often expected that regression models are reported with both unstandardised (B) and standardized (β) coefficients. This can be fiddly and lead to redundant information. The lm_std() function easily runs linear regression models that return standardised coefficients and the report_lm_with_std() function creates side-by-side regression tables that minimize redundancy.

At the most basic, you just need to run two models and pass them to the reporting function:

ess_health$gndr <- factor(ess_health$gndr)

#Standard lm model
mod1 <- lm(depression ~ agea + gndr + health + cntry, ess_health)

#Model with standardised coefficients
mod2 <- lm_std(depression ~ agea + gndr + health + cntry, ess_health)

report_lm_with_std(mod1, mod2)

Often, the coefficient names - particularly for dummy variables - are not particularly clear or pleasing. In such cases, they can be renamed easily - again, there is a helper function (get_coef_rename_tribble()) that creates most of the necessary code. In the new names, you can use markdown formatting, for instance to add information on reference levels.

#get_coef_rename_tribble(mod1) is the starting point.

coef_names <- tibble::tribble(
  ~old,           ~new,           
   "(Intercept)",  "*(Intercept)*", 
   "agea",         "Age",        
   "gndr2",        "Gender *(female)*",       
   "health",       "Poor health",      
   "cntryFR",      "France *(vs DE)*",     
   "cntryGB",      "UK *(vs DE)*"
)

report_lm_with_std(mod1, mod2, coef_renames = coef_names)

You can also display two nested models side-by-side and get the F-change significance test. (More models can be displayed, but not yet statistically tested against each other.) To enable model comparisons, all models need to be fitted on the same dataset, so that I will first omit missing data.

ess_health <- tidyr::drop_na(ess_health)

mod1 <- lm(depression ~ agea + gndr + health + cntry, ess_health)
mod2 <- lm_std(depression ~ agea + gndr + health + cntry, ess_health)


mod3 <- lm(depression ~ agea * gndr + eisced + health + cntry, ess_health)
mod4 <- lm_std(depression ~ agea * gndr + eisced + health + cntry, ess_health)

coef_names <- tibble::tribble(
  ~old,           ~new,           
   "(Intercept)",  "*(Intercept)*", 
   "agea",         "Age",        
   "gndr2",        "Gender *(female)*",       
   "health",       "Poor health",      
   "cntryFR",      "France *(vs DE)*",     
   "cntryGB",      "UK *(vs DE)*",
   "eisced",       "Education",
   "agea:gndr2",   "Age x Female",
)

report_lm_with_std(mod = list(mod1, mod3), mod_std = list(mod2, mod4),
                   coef_renames = coef_names, R2_change = TRUE,
                   model_names = c("Base model", "Extended model"))

Happy exploring, and please do report bugs and unexpected behaviors by opening an issue on GitHub or emailing me on lukas.wallrich@gmail.com



LukasWallrich/timesaveR documentation built on Nov. 29, 2024, 4:47 a.m.