knitr::opts_chunk$set(echo = TRUE) knitr::opts_chunk$set(collapse = TRUE, comment = "#>") library(devtools) load_all("./")
Install the ExpDesign2021 package from github.
#install.packages("remotes") #install.packages("BiocManager") #BiocManager::install("VanAndelInstitute/ExpDesign2021", build_vignettes=TRUE)
The recommended starting point for modeling things in the tidyverse is through the tidymodels package, and the tidymodels folks suggest A ModernDive into R and the Tidyverse for beginners. So let's use Chapter 5: Basic Regression as a roadmap. We will also try and make this a bit easier.
To avoid some annoyance, I wrapped the usual read_sheet and gs4_deauth
rigamarole in a function called fetchAssignments
or, explicitly,
ExpDesign2021::fetchAssignments. (If you want to be super clear about where
a function should come from, package::function(arguments) is the way to go.)
Let's start by loading data from the first few weeks' assignments.
# this package library(ExpDesign2021) library(knitr) # grab some data to verify it works assignments <- fetchAssignments() # note that this is equivalent to: # fetchAssignments() -> assignments # verify we got it: kable(head(assignments)) # the first few rows # you can also echo the result of assigning a variable: # (fetchAssignments() -> assignments) # try it if you like # rename the columns for brevity in the following code: columnNames <- colnames(assignments) # we will re-use this later; names(columnNames) <- c("timestamp", "ID", # these will be the new names "assignment", "start", "end", "comments") # what's it look like now? Mostly so we can see what we're doing. tibble(oldName=columnNames, newName=names(columnNames)) %>% kable # now make this tidier: fetchAssignments() %>% # get data, then rename(columnNames) %>% # rename columns, then select(c("timestamp", "ID", "comments")) %>% # select columns, then head(2) %>% # get the first two lines, then kable # make a table out of them.
FooBarBazQux is the userID I (tim) chose. I'm not a student, so out it goes. Let's also add a column for time taken, and filter based on that.
library(lubridate) # for date handling fetchAssignments() %>% # get data, then rename(columnNames) %>% # rename columns, then filter(ID != "FooBarBazQux") %>% # exclude my entry, then mutate(minutes = end - start) %>% # add a column `minutes`, then filter(minutes > 0) -> # drop negative times, then assignments # put the result in `assignments` # How many entries remain? assignments %>% dim # [1] 95 7
Let's do a bit of exploratory analysis on the tidied assignment data. Unless you are trying to hide data, a beeswarm is usually a good default. (Tip: statisticians and statistical reviewers will pounce on boxplots.)
library(ggplot2) # ggplot library(ggforce) # geom_sina library(ggthemes) # theme_tufte # minimalist sinaplot (beeswarm): assignments %>% # feed data to... ggplot(aes(x = ID, y = minutes, color = ID)) + # a plot with aesthetics geom_sina(show.legend = FALSE) + # rendered as a sinaplot scale_x_discrete(guide = guide_axis(angle = 60)) + # with IDs at 60' angle theme_tufte(base_size = 12, # tufte theme, 12pt font base_family = "sans serif") + # with sans serif text theme(axis.title.x = element_blank()) + # blank the axis title ggtitle("Time spent on assignments by ID") # add a plot title # same but in hours: assignments %>% mutate(hours = as.numeric(as.duration(minutes), "hours")) %>% # get hours ggplot(aes(x = ID, y = hours, color = ID)) + # a plot with aesthetics geom_sina(show.legend = FALSE) + # rendered as a sinaplot scale_x_discrete(guide = guide_axis(angle = 60)) + # with IDs at 60' angle theme_tufte(base_size = 12, # tufte theme, 12pt font base_family = "sans serif") + # with sans serif text theme(axis.title.x = element_blank()) + # blank the axis title ggtitle("Hours spent on assignments by ID") # add a plot title
Let's do a little bit of modeling on the data. First, let's group assignments. (It turns out that both the comments and the assignment names can use some work) Let's see if we can extract some rhyme or reason from the words making up these. (For more on this topic, you can visit the tidy text mining book site.)
# a quick helper function: assignmentType <- function(x) { # grepl tests for a pattern and returns TRUE or FALSE # ifelse operates as ifelse(condition, ifTRUE, ifFALSE) # here we nest a second ifelse into the first to mop up ifelse(grepl("(git|management|studio|lab)", x), "lab", ifelse(grepl("(islr|statistical|modeling)", x), "ISLRv2", "EDfB")) } # now tokenize by type: library(tidytext) assignments %>% # feed data to... mutate(assid = tolower(assignment)) %>% # create a new column mutate(comid = tolower(comments)) %>% # create a new column mutate(atype = assignmentType(assid)) %>% # group assignments unnest_tokens(input = comments, output = ctext) %>% # put words in `ctext` select(ID, assid, atype, comid, ctext, minutes) -> # select a few columns assignment_comments # assign the results # what has this done for us? Let's look: assignment_comments %>% # feed the tokenized data to.. filter(!is.na(ctext)) %>% # only look at non-NA comments select(assid, atype, ctext, comid) %>% # only look at a few columns head(10) %>% # grab the first few lines and kable # make a table out of them # what words seem to be particularly common? assignment_comments %>% # feed the tokenized data to... count(atype, ctext, sort = TRUE) -> # count comment words by atype, assign to atype_ctext_counts atype_ctext_counts %>% head %>% kable
This is a first pass at restructuring a couple of free-text fields; you could read the tidytext chapter on term frequency - inverse document frequency for some additional ideas if you like.
One thought: if you look at the comments quantitatively, and don't assume that NAs are wholly uninformative, is there any pattern to the results?
Another thought: can we do better in terms of extracting assignment types?
In any event, digesting the text of these fields made it a bit easier to group. Now let's plot things again.
# default sinaplot assignments %>% # feed data to... mutate(assid = tolower(assignment)) %>% # create a new column mutate(atype = assignmentType(assid)) %>% # group assignments ggplot(aes(x = atype, y = minutes, color = ID)) + # a plot with aesthetics geom_sina(maxwidth=0.1, show.legend = FALSE) + # rendered as a sinaplot xlab("Assignment type") + # grouped by type theme_tufte(base_size = 12, # tufte theme, 12pt font base_family = "sans serif") + # with sans serif text ggtitle("Time spent on assignments by type") # add a plot title # a new sinaplot assignments %>% # feed data to... mutate(assid = tolower(assignment)) %>% # create a new column mutate(atype = assignmentType(assid)) %>% # group assignments ggplot(aes(x = atype, y = minutes, color = ID)) + # a plot with aesthetics geom_sina(maxwidth=0.3, show.legend = FALSE) + # rendered as a sinaplot xlab("Assignment type") + # grouped by type coord_flip() + # but left-to-right theme_tufte(base_size = 12, # tufte theme, 12pt font base_family = "sans serif") + # with sans serif text ggtitle("Time spent on assignments by type") # add a plot title # same but in hours assignments %>% # feed data to... mutate(assid = tolower(assignment)) %>% # create a new column mutate(atype = assignmentType(assid)) %>% # group assignments mutate(hours = as.numeric(as.duration(minutes), "hours")) %>% # get hours ggplot(aes(x = atype, y = hours, color = ID)) + # a plot with aesthetics geom_sina(maxwidth=0.1, show.legend = FALSE) + # rendered as a sinaplot xlab("Assignment type") + # grouped by type coord_flip() + # but left-to-right theme_tufte(base_size = 12, # tufte theme, 12pt font base_family = "sans serif") + # with sans serif text ggtitle("Hours spent on assignments by type") # add a plot title
ggplot2 is useful in that it will happily split up plots into facets (sub-plots) among its many other handy features. Here's one example:
# facet it assignments %>% # feed data to... mutate(assid = tolower(assignment)) %>% # create a new column mutate(atype = assignmentType(assid)) %>% # group assignments mutate(hours = as.numeric(as.duration(minutes), "hours")) %>% # get hours ggplot(aes(x = atype, y = hours, color = ID)) + # a plot with aesthetics geom_boxplot(show.legend = FALSE) + # with an added boxplot geom_point(show.legend = FALSE) + # and all the data points xlab("Assignment type") + # grouped by type facet_wrap( ~ ID) + # split by ID coord_flip() + # but left-to-right theme_minimal() # and less minimal
If you fitted a linear regression to some of the data at some point, you probably recognize this formula notation. In R, it is ubiquitous:
y ~ x1 + x2 + x1:x2 + ...
# setup assignments %>% # feed data to... filter(minutes > 0) %>% # only keep rows with minutes > 0 mutate(assid = tolower(assignment)) %>% # create a new column `assid` mutate(atype = assignmentType(assid)) %>% # create `atype` w/assignmentType mutate(hours = as.numeric(as.duration(minutes), "hours")) -> # add hours hours_by_type # put the result here # quick look glimpse(hours_by_type) # should we transform these? hours_by_type %>% ggplot(aes(x = atype, y = hours, color = ID)) + # a plot with aesthetics geom_sina(show.legend = FALSE) + # and a beeswarm style xlab("Assignment type") + # grouped by type theme_minimal() + # save plotspace ggtitle("raw times") -> # add a title plot_raw # put result in plot_raw hours_by_type %>% mutate(loghours = log(hours)) %>% # add a logged column ggplot(aes(x = atype, y = loghours, color = ID)) + # a plot with aesthetics geom_sina(show.legend = FALSE) + # and a beeswarm style xlab("Assignment type") + # grouped by type theme_minimal() + # save plotspace ggtitle("log10 scaled") -> # add a title plot_log10 # put into plot_log10 library(patchwork) # for easier arrangement plot_raw + plot_log10 # this is via patchwork # personally I'm OK with not transforming these
The recommended starting point for modeling things in the tidyverse is through the tidymodels package, and the tidymodels folks suggest A ModernDive into R and the Tidyverse for beginners. So let's use Chapter 5: Basic Regression as a roadmap.
library(moderndive) # for get_regression_table # traditional linear model fit: y ~ x1 + x2 + ... fit0 <- lm(hours ~ atype, data = hours_by_type) # regress on atype only get_regression_table(fit0) %>% kable fit1 <- lm(hours ~ atype + ID, data = hours_by_type) # regress on atype + ID get_regression_table(fit1) %>% kable anova(fit0, fit1) # does it fit any better? fit2 <- lm(hours ~ atype * ID, data = hours_by_type) # regress on interaction get_regression_table(fit2) %>% kable anova(fit0, fit2) # does this fit any better? # compare: anova(fit1, fit2) # do either of these matter?
At this point, for a fixed effects regression, all that seems to matter is the type of assignment (in terms of predicting the time taken to complete it). What if we wanted to try and carve up the random effects (variance components)?
Mixed models attempt to partition variance into fixed
and random
components, such as a condition (fixed, mean difference) versus a measurement
group (random differences). This is handy when running replicates,
particularly if some are technical and some are biological, but there's also
the possibility of partially pooling some terms. For now, let's just intro
some handlers for these types of models. (These aren't really handled in
either EDfMB or ISLRv2, but they're incredibly useful in actual practice.)
library(tidymodels) library(broom.mixed) library(lme4) # lmer == "linear mixed effects regression" # kind of an old-school example since parsnip doesn't like mixed models yet lmm0 <- lmer(hours ~ atype + (1 | ID), data = hours_by_type) # note syntax lmm1 <- lmer(hours ~ atype + (atype | ID), data = hours_by_type) # note syntax tidy(lmm0) tidy(lmm1) anova(lmm0, lmm1) # lmm0 is "good enough", it seems # this is fairly traditional tidy(lmm0, effects = "fixed") tidy(lmm0, effects = "fixed", conf.int=TRUE) tidy(lmm0, effects = "fixed", conf.int=TRUE, conf.method="profile") # this isn't tidy(lmm0, effects = "ran_coefs") tidy(lmm0, effects = "ran_vals", conf.int=TRUE) tidy(lmm0, effects = "ran_pars", conf.int=TRUE)
So, if you visit the broom.mixed vignette, you'll find some code to make a plot of the estimates for various factors in a different regression.
We have something that looks somewhat interesting within the lab exercises. Suppose we treat the observations as what they look like (a mixture) and try to determine how that fits. Note that we fit a mixture model below, which isn't directly related to mixed models. I didn't come up with the names...
library(mclust) # mixture model clustering, incredibly handy package # fit 2 components hours_by_type %>% filter(atype == "lab") %>% # fit a mixture to just the lab assignments select("hours") %>% # use the hours column we set up earlier Mclust(G=1:2) -> # fit a mixture with either 1 or 2 components mfit0 # save the results for plotting plot(mfit0, "classification") plot(mfit0, "density") # free to fit 1:10 parts hours_by_type %>% filter(atype == "lab") %>% # fit a mixture to just the lab assignments select("hours") %>% # use the hours column we set up earlier Mclust() -> # fit a mixture and the number of components in it mfit1 # save the results for plotting plot(mfit1, "classification") plot(mfit1, "density") # this seems more reasonable show(mfit1$parameters) # pro == proportions of observations in each group # mean == means of each component (group) # variance$sigmasq == variance of each component (group) params <- with(mfit1$parameters, tibble(proportion=pro, mean=mean, sd=sqrt(variance$sigmasq))) params %>% kable
This puts us in a position to simulate additional draws from data generated by similar processes to those observed in the assignment data. (Question: is that always a good thing? What could interfere with this approach to power/design?)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.