knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>")
library(devtools)
load_all("./")

Installation

Install the ExpDesign2021 package from GitHub.

#install.packages("remotes")
#install.packages("BiocManager")
#BiocManager::install("VanAndelInstitute/ExpDesign2021", build_vignettes=TRUE)

Introduction

The recommended starting point for modeling things in the tidyverse is the tidymodels package, and the tidymodels folks suggest A ModernDive into R and the Tidyverse for beginners. So let's use Chapter 5: Basic Regression as a roadmap. We will also try to make this a bit easier.

The assignments spreadsheet

To avoid some annoyance, I wrapped the usual read_sheet and gs4_deauth rigamarole in a function called fetchAssignments or, explicitly, ExpDesign2021::fetchAssignments. (If you want to be super clear about where a function should come from, package::function(arguments) is the way to go.) Let's start by loading data from the first few weeks' assignments.
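For the curious, that wrapper boils down to something like the following sketch. The function name and sheet identifier here are placeholders; the real implementation lives in the package source. (gs4_deauth and read_sheet are real googlesheets4 functions.)

library(googlesheets4)       # provides gs4_deauth and read_sheet

# hypothetical sketch; ExpDesign2021::fetchAssignments is the real thing
fetchAssignmentsSketch <- function(sheet = "placeholder-sheet-id") {
  gs4_deauth()               # public sheet, so skip the OAuth dance
  read_sheet(sheet)          # read the sheet into a tibble
}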

# this package
library(ExpDesign2021)
library(knitr)  # for kable
library(dplyr)  # for the pipe, rename, mutate, filter, and friends

# grab some data to verify it works 
assignments <- fetchAssignments()
# note that this is equivalent to:
# fetchAssignments() -> assignments 

# verify we got it: 
kable(head(assignments)) # the first few rows
# you can also echo the result of assigning a variable:
# (fetchAssignments() -> assignments) # try it if you like

# rename the columns for brevity in the following code:
columnNames <- colnames(assignments)           # we will re-use this later; 
names(columnNames) <- c("timestamp", "ID",     # these will be the new names
                        "assignment", "start", 
                        "end", "comments")

# what's it look like now? Mostly so we can see what we're doing.
tibble(oldName=columnNames, newName=names(columnNames)) %>% kable

# now make this tidier:
fetchAssignments() %>%                         # get data, then 
  rename(all_of(columnNames)) %>%              # rename columns, then 
  select(c("timestamp", "ID", "comments")) %>% # select columns, then 
  head(2) %>%                                  # get the first two lines, then 
  kable                                        # make a table out of them. 

FooBarBazQux is the userID I (tim) chose. I'm not a student, so out it goes. Let's also add a column for time taken, and filter based on that.

library(lubridate)                             # for date handling

fetchAssignments() %>%                         # get data, then 
  rename(all_of(columnNames)) %>%              # rename columns, then 
  filter(ID != "FooBarBazQux") %>%             # exclude my entry, then  
  mutate(minutes = end - start) %>%            # add a column `minutes`, then
  filter(minutes > 0) ->                       # keep only positive times, then
    assignments                                # put the result in `assignments`

# How many entries remain? 
assignments %>% dim 
# [1] 95  7

Exploration

Let's do a bit of exploratory analysis on the tidied assignment data. Unless you are trying to hide data, a beeswarm is usually a good default. (Tip: statisticians and statistical reviewers will pounce on boxplots.)

library(ggplot2)  # ggplot
library(ggforce)  # geom_sina
library(ggthemes) # theme_tufte

# minimalist sinaplot (beeswarm):
assignments %>%                                        # feed data to...
  ggplot(aes(x = ID, y = minutes, color = ID)) +       # a plot with aesthetics
    geom_sina(show.legend = FALSE) +                   # rendered as a sinaplot
    scale_x_discrete(guide = guide_axis(angle = 60)) + # IDs at a 60° angle
    theme_tufte(base_size = 12,                        # tufte theme, 12pt font
                base_family = "sans") +                # with a sans-serif font
    theme(axis.title.x = element_blank()) +            # blank the axis title
    ggtitle("Time spent on assignments by ID")         # add a plot title  

# same but in hours:
assignments %>%               
  mutate(hours = as.numeric(as.duration(minutes), "hours")) %>% # get hours 
  ggplot(aes(x = ID, y = hours, color = ID)) +         # a plot with aesthetics
    geom_sina(show.legend = FALSE) +                   # rendered as a sinaplot
    scale_x_discrete(guide = guide_axis(angle = 60)) + # IDs at a 60° angle
    theme_tufte(base_size = 12,                        # tufte theme, 12pt font
                base_family = "sans") +                # with a sans-serif font
    theme(axis.title.x = element_blank()) +            # blank the axis title
    ggtitle("Hours spent on assignments by ID")        # add a plot title  

Let's do a little bit of modeling on the data. First, let's group assignments. (It turns out that both the comments and the assignment names could use some work.) Let's see if we can extract some rhyme or reason from the words that make them up. (For more on this topic, you can visit the tidy text mining book site.)

# a quick helper function: 
assignmentType <- function(x) {
  # grepl tests for a pattern and returns TRUE or FALSE  
  # ifelse operates as ifelse(condition, ifTRUE, ifFALSE)
  # here we nest a second ifelse into the first to mop up
  ifelse(grepl("(git|management|studio|lab)", x), 
         "lab", ifelse(grepl("(islr|statistical|modeling)", x), 
                       "ISLRv2", "EDfB"))
}
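A quick sanity check of the helper on a few made-up (hypothetical) assignment names; note that it assumes its input has already been lowercased:

# quick check on some made-up assignment names:
assignmentType(c("github setup lab", "islr chapter 2", "reading response"))
# [1] "lab"    "ISLRv2" "EDfB"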

# now tokenize by type:
library(tidytext)
assignments %>%                                          # feed data to... 
  mutate(assid = tolower(assignment)) %>%                # create a new column
  mutate(comid = tolower(comments)) %>%                  # create a new column
  mutate(atype = assignmentType(assid)) %>%              # group assignments 
  unnest_tokens(input = comments, output = ctext) %>%    # put words in `ctext`
  select(ID, assid, atype, comid, ctext, minutes) ->     # select a few columns 
    assignment_comments                                  # assign the results

# what has this done for us? Let's look:
assignment_comments %>%                   # feed the tokenized data to..
  filter(!is.na(ctext)) %>%               # only look at non-NA comments
  select(assid, atype, ctext, comid) %>%  # only look at a few columns 
  head(10) %>%                            # grab the first few lines and
  kable                                   # make a table out of them 

# what words seem to be particularly common?
assignment_comments %>%               # feed the tokenized data to...
  count(atype, ctext, sort = TRUE) -> # count comment words by atype, assign to
    atype_ctext_counts

atype_ctext_counts %>% head %>% kable

Some further questions (not necessary to answer with code just yet)

This is a first pass at restructuring a couple of free-text fields; you could read the tidytext chapter on term frequency-inverse document frequency (tf-idf) for some additional ideas if you like.
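As a quick taste of that idea (a sketch, not part of the assignment), tidytext's bind_tf_idf can be applied directly to the counts we just made, treating each assignment type as a "document":

# which words are most characteristic of each assignment type?
atype_ctext_counts %>%               # word counts by type, fed to...
  bind_tf_idf(ctext, atype, n) %>%   # tf-idf with (term, document, count)
  arrange(desc(tf_idf)) %>%          # most distinctive words first, then
  head %>%                           # grab the first few lines and
  kable                              # make a table out of them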

In any event, digesting the text of these fields made the assignments a bit easier to group. Now let's plot things again.

# a narrow sinaplot, grouped by type
assignments %>%                                        # feed data to...
  mutate(assid = tolower(assignment)) %>%              # create a new column
  mutate(atype = assignmentType(assid)) %>%            # group assignments 
  ggplot(aes(x = atype, y = minutes, color = ID)) +    # a plot with aesthetics
    geom_sina(maxwidth=0.1, show.legend = FALSE) +     # rendered as a sinaplot
    xlab("Assignment type") +                          # grouped by type
    theme_tufte(base_size = 12,                        # tufte theme, 12pt font
                base_family = "sans") +                # with a sans-serif font
    ggtitle("Time spent on assignments by type")       # add a plot title  

# the same sinaplot, wider and flipped on its side
assignments %>%                                        # feed data to...
  mutate(assid = tolower(assignment)) %>%              # create a new column
  mutate(atype = assignmentType(assid)) %>%            # group assignments 
  ggplot(aes(x = atype, y = minutes, color = ID)) +    # a plot with aesthetics
    geom_sina(maxwidth=0.3, show.legend = FALSE) +     # rendered as a sinaplot
    xlab("Assignment type") +                          # grouped by type
    coord_flip() +                                     # but left-to-right 
    theme_tufte(base_size = 12,                        # tufte theme, 12pt font
                base_family = "sans") +                # with a sans-serif font
    ggtitle("Time spent on assignments by type")       # add a plot title  

# same but in hours
assignments %>%                                        # feed data to...
  mutate(assid = tolower(assignment)) %>%              # create a new column
  mutate(atype = assignmentType(assid)) %>%            # group assignments 
  mutate(hours = as.numeric(as.duration(minutes), "hours")) %>% # get hours 
  ggplot(aes(x = atype, y = hours, color = ID)) +      # a plot with aesthetics
    geom_sina(maxwidth=0.1, show.legend = FALSE) +     # rendered as a sinaplot
    xlab("Assignment type") +                          # grouped by type
    coord_flip() +                                     # but left-to-right 
    theme_tufte(base_size = 12,                        # tufte theme, 12pt font
                base_family = "sans") +                # with a sans-serif font
    ggtitle("Hours spent on assignments by type")      # add a plot title  

Faceting and whatnot

Among its many other handy features, ggplot2 will happily split a plot into facets (sub-plots). Here's one example:

# facet it 
assignments %>%                                        # feed data to...
  mutate(assid = tolower(assignment)) %>%              # create a new column
  mutate(atype = assignmentType(assid)) %>%            # group assignments 
  mutate(hours = as.numeric(as.duration(minutes), "hours")) %>% # get hours 
  ggplot(aes(x = atype, y = hours, color = ID)) +      # a plot with aesthetics
    geom_boxplot(show.legend = FALSE) +                # with an added boxplot
    geom_point(show.legend = FALSE) +                  # and all the data points
    xlab("Assignment type") +                          # grouped by type
    facet_wrap( ~ ID) +                                # split by ID
    coord_flip() +                                     # but left-to-right 
    theme_minimal()                                    # with a minimal theme

If you have fitted a linear regression before, you probably recognize this formula notation. In R, it is ubiquitous:

y ~ x1 + x2 + x1:x2 + ...
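A handy detail: `*` is shorthand for the main effects plus their interaction, which you can verify symbolically, no data needed (we use both `+` and `*` below).

# `*` expands to main effects plus the interaction term:
attr(terms(y ~ x1 * x2), "term.labels")
# [1] "x1"    "x2"    "x1:x2"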
# setup
assignments %>%                              # feed data to...
  filter(minutes > 0) %>%                    # only keep rows with minutes > 0
  mutate(assid = tolower(assignment)) %>%    # create a new column `assid`
  mutate(atype = assignmentType(assid)) %>%  # create `atype` w/assignmentType
  mutate(hours = as.numeric(as.duration(minutes), "hours")) -> # add hours 
    hours_by_type                            # put the result here


# quick look
glimpse(hours_by_type)

# should we transform these?
hours_by_type %>%
  ggplot(aes(x = atype, y = hours, color = ID)) +    # a plot with aesthetics
    geom_sina(show.legend = FALSE) +                   # and a beeswarm style
    xlab("Assignment type") +                          # grouped by type
    theme_minimal() +                                  # save plotspace
    ggtitle("raw times") ->                            # add a title
      plot_raw                                         # put result in plot_raw 

hours_by_type %>%
  mutate(loghours = log10(hours)) %>%                  # add a log10 column
  ggplot(aes(x = atype, y = loghours, color = ID)) +   # a plot with aesthetics
    geom_sina(show.legend = FALSE) +                   # and a beeswarm style
    xlab("Assignment type") +                          # grouped by type
    theme_minimal() +                                  # save plotspace
    ggtitle("log10 scaled") ->                         # add a title 
      plot_log10                                       # put into plot_log10 

library(patchwork)                                     # for easier arrangement
plot_raw + plot_log10                                  # this is via patchwork
# personally I'm OK with not transforming these 

As promised in the introduction, let's follow ModernDive's Chapter 5: Basic Regression as a roadmap and fit some linear models.

library(moderndive) # for get_regression_table

# traditional linear model fit: y ~ x1 + x2 + ... 
fit0 <- lm(hours ~ atype, data = hours_by_type) # regress on atype only
get_regression_table(fit0) %>% kable

fit1 <- lm(hours ~ atype + ID, data = hours_by_type) # regress on atype + ID
get_regression_table(fit1) %>% kable
anova(fit0, fit1) # does it fit any better?

fit2 <- lm(hours ~ atype * ID, data = hours_by_type) # regress on interaction
get_regression_table(fit2) %>% kable
anova(fit0, fit2) # does this fit any better?

# compare:
anova(fit1, fit2) # does the interaction add anything over atype + ID?

At this point, for a fixed-effects regression, all that seems to matter (in terms of predicting the time taken to complete an assignment) is the type of assignment. What if we wanted to try to carve up the random effects (variance components)?

Mixed models (optional)

Mixed models attempt to partition variance into fixed and random components, such as a condition (fixed, mean difference) versus a measurement group (random differences). This is handy when running replicates, particularly if some are technical and some are biological, and it also opens the door to partially pooling some terms. For now, let's just introduce some tools for handling these types of models. (They aren't really covered in either EDfB or ISLRv2, but they're incredibly useful in actual practice.)

library(tidymodels)
library(broom.mixed) 
library(lme4)

# lmer == "linear mixed effects regression"
# kind of an old-school example since parsnip doesn't like mixed models yet
lmm0 <- lmer(hours ~ atype + (1 | ID), data = hours_by_type) # note syntax
lmm1 <- lmer(hours ~ atype + (atype | ID), data = hours_by_type) # note syntax

tidy(lmm0)
tidy(lmm1)
anova(lmm0, lmm1) # lmm0 is "good enough", it seems

# this is fairly traditional 
tidy(lmm0, effects = "fixed")
tidy(lmm0, effects = "fixed", conf.int=TRUE)
tidy(lmm0, effects = "fixed", conf.int=TRUE, conf.method="profile")

# this isn't
tidy(lmm0, effects = "ran_coefs")
tidy(lmm0, effects = "ran_vals", conf.int=TRUE)
tidy(lmm0, effects = "ran_pars", conf.int=TRUE)

If you visit the broom.mixed vignette, you'll find code to plot the estimates for the various terms of a (different) regression.
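Here is a minimal sketch of the same idea for our own model, built by hand with ggplot2 rather than with the vignette's exact code:

# dot-and-whisker plot of the fixed effects with confidence intervals:
tidy(lmm0, effects = "fixed", conf.int = TRUE) %>%
  ggplot(aes(x = estimate, y = term,
             xmin = conf.low, xmax = conf.high)) +
    geom_pointrange() +                               # estimate plus interval
    geom_vline(xintercept = 0, linetype = "dotted") + # reference line at zero
    theme_minimal() +                                 # save plotspace
    ggtitle("Fixed effects from lmm0")                # add a plot title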

Fitting existing data to sample from

The lab exercise timings look somewhat interesting: the observations look like a mixture, so let's treat them as one and see how well that fits. Note that we fit a mixture model below, which isn't directly related to mixed models. I didn't come up with the names...

library(mclust) # mixture model clustering, incredibly handy package

# fit 2 components
hours_by_type %>% 
  filter(atype == "lab") %>% # fit a mixture to just the lab assignments 
  select("hours") %>%        # use the hours column we set up earlier 
  Mclust(G=1:2) ->           # fit a mixture with either 1 or 2 components 
    mfit0                    # save the results for plotting

plot(mfit0, "classification") 
plot(mfit0, "density") 

# free to choose the number of components (Mclust defaults to G = 1:9)
hours_by_type %>% 
  filter(atype == "lab") %>% # fit a mixture to just the lab assignments 
  select("hours") %>%        # use the hours column we set up earlier 
  Mclust() ->                # fit a mixture, choosing G (# components) by BIC
    mfit1                    # save the results for plotting

plot(mfit1, "classification") 
plot(mfit1, "density") 

# this seems more reasonable
show(mfit1$parameters) 
# pro == proportions of observations in each group 
# mean == means of each component (group) 
# variance$sigmasq == variance of each component (group) 

params <- with(mfit1$parameters,
               tibble(proportion=pro,
                      mean=mean,
                      sd=sqrt(variance$sigmasq)))
params %>% kable 

This puts us in a position to simulate additional draws from processes similar to those that appear to have generated the assignment data. (Question: is that always a good thing? What could interfere with this approach to power/design?)
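For example, here is a minimal sketch of drawing new observations from the fitted mixture using the params table above (the function name and seed are made up; mclust's sim function offers a packaged alternative):

# draw n new "hours" observations from the fitted mixture:
rmix <- function(n, params) {
  component <- sample(nrow(params), n, replace = TRUE,
                      prob = params$proportion)  # pick a component per draw,
  rnorm(n,                                       # then sample from that
        mean = params$mean[component],           # component's normal
        sd = params$sd[component])               # distribution
}

set.seed(1234)               # arbitrary seed, for reproducibility
summary(rmix(1000, params))
# note: a Gaussian mixture can yield negative "hours"; truncation is one fix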


