knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

set.seed(1337)

Advanced statistical programming with R

Are you taking the course "Advanced statistical programming with R" in the summer term 2020? Then this page is for you. It explains what you need to do to get the credits for this module. r emo::ji("wink")

In this course, you are going work on a programming project in a group of three students. Each group needs to develop an R package, in which they implement a statistical method for the so-called location-scale regression model class, and a small simulation study to evaluate their method and implementation. In September, you need to submit a term paper about your project.

This git repository contains the asp20model package for R, which implements the location-scale regression model class and will be the basis for your package.

Please note: If you are planning to take this course, you need to preregister on Stud.IP before March 31. Follow this link to sign up: https://studip.uni-goettingen.de/dispatch.php/course/details?sem_id=5b8822a839a628e8166648a57f4f1eac.

Update on the Coronavirus: Unsurprisingly, the Coronavirus crisis makes some changes to this course necessary. All previously announced dates remain valid, but I am going to offer the introductory session and the Q&A session as YouTube livestreams. See below for the links. The intermediate and final presentations might be replaced with short written reports, depending on the further developments. I am going to make a final decision on the presentations before the Q&A session on May 4.

Important dates

Technical & statistical prerequisites

The R6 OOP system

The asp20model package uses R6 classes, so make sure to get familiar with mutable objects, inheritance, etc. See the chapter on R6 classes in [3], available online: https://adv-r.hadley.nz/r6.html.

Version control using git & GitLab

The asp20model package is hosted on https://gitlab.gwdg.de (the GitLab instance of the computing center of the university), where you are going to develop your R package as well. I created a GitLab group for this course and a GitLab project for each group of students.

Please note: To start working on your project, log in to GitLab with your student account once and send me an email. After that, I can add you to the group and the project.

GitLab is an online platform where you can host git repositories, very similar to GitHub, and git is a popular version control system for software development. Systematic, version-controlled software development is an essential component of this course, so if you are not yet familiar with git, please take a look at the git chapter in [4], available online: http://r-pkgs.had.co.nz/git.html.

Please note: Your code will only be visible to me, the project supervisors, and your fellow students. Feel free to make mistakes and ask questions!

Location-scale regression

As mentioned above, we are going to work with the location-scale regression model class in this course. So, what is location-scale regression? Let's take a step back and look at the standard linear model first. It is defined as

y_i = \boldsymbol{x}_i' \boldsymbol{\beta} + \varepsilon_i, \text{ where } \varepsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2).

From the definition, it follows that

y_i \overset{ind.}{\sim} \mathcal{N}(\boldsymbol{x}_i' \boldsymbol{\beta}, \sigma^2),

and from this representation, the model can easily be extended with a second linear predictor for the standard deviation. For this purpose, we introduce a covariate vector \$\boldsymbol{z}_i\$ and a parameter vector \$\boldsymbol{\gamma}\$:

y_i \overset{ind.}{\sim} \mathcal{N}(\boldsymbol{x}_i' \boldsymbol{\beta}, (\exp(\boldsymbol{z}_i' \boldsymbol{\gamma}))^2).

What is the \$\exp\$ function doing in this formula? Well, one thing that is a little bit tricky about predicting the standard deviation is that the standard deviation needs to be positive, while the linear predictor \$\boldsymbol{z}_i' \boldsymbol{\gamma}\$ can become negative for some choices of the parameter vector \$\boldsymbol{\gamma}\$. Hence, to ensure that our predictions for the standard deviation are valid, we introduce the \$\exp\$ function as a so-called response function for the standard deviation and define \$\sigma_i = \exp(\boldsymbol{z}_i' \boldsymbol{\gamma})\$.

Why should we care about location-scale regression? It is not uncommon to observe heteroscedastic data like in the figure below. The standard way to deal with heteroscedastic residuals in a linear model is FGLS estimation, but FGLS does not provide an interpretable description of the variance structure of the data. A location-scale regression model can fill this gap, if we have access to explanatory variables for the variance or the standard deviation of the response variable.

n <- 500
x <- runif(n)
y <- x + rnorm(n, sd = exp(-3 + 2 * x))
plot(x, y)
abline(0, 1, lwd = 2)
curve(x + 1.96 * exp(-3 + 2 * x), -0.1, 1.1, add = TRUE)
curve(x - 1.96 * exp(-3 + 2 * x), -0.1, 1.1, add = TRUE)

Working with the asp20model package

First, you need to install the asp20model package from GitLab and load it:

install.packages("devtools")
devtools::install_gitlab("asp20/asp20model", host = "gitlab.gwdg.de")
library(asp20model)

Using the data from the plot above, we can set up a location-scale regression model. Note that the first command only sets up the design matrices and the parameter vectors but does not do any kind of inference. All parameters are initialized with a value of 0. We can use the loglik() method to obtain the log-likelihood of the model in the initial state (i.e. with all parameters set to 0):

model <- LocationScaleRegression$new(y ~ x, ~ x)
model$beta
model$gamma
model$loglik()

Now, let's update the parameter values manually and see how the log-likelihood changes:

model$beta <- c(0.1, 0.1)
model$loglik()

For most inference algorithms, we also need the gradient of the log-likelihood with respect to the parameters, which we can obtain with the grad() method:

model$grad_beta()
model$grad_gamma()

Finally, the asp20model package also comes with a simple gradient descent algorithm for maximum likelihood inference. Let's apply it to our model:

gradient_descent(model)
model$beta
model$gamma
model$grad_beta()
model$grad_gamma()

Whoops, that didn't work very well! The algorithm didn't converge, the estimated parameters are far away from the true values we used to simulate the data (\$\hat{\boldsymbol{\beta}}\$ should be close to (0, 1), \$\hat{\boldsymbol{\gamma}}\$ should be close to (-3, 2)), and the gradient is not close to 0. Let's try again with a smaller step size and more iterations:

gradient_descent(model, stepsize = 1e-06, maxit = 500000)
model$beta
model$gamma
model$grad_beta()
model$grad_gamma()

These results do look better, but they also show how inefficient the gradient descent algorithm is. The algorithm took more than a quarter million iterations to converge!

Your task is to do better than me and implement a more efficient inference algorithm for the location-scale regression model class. Use my code for the gradient_descent() function as an example. Build on the LocationScaleRegression R6 class and its methods and extend them if necessary. Think about inheritance and whether your extensions could be useful for other groups. If yes, feel free to share your code on GitLab and open a pull request in the asp20model repository.

Student projects

asp20boot

asp20cv

asp20lasso

asp20boost

asp20regPrior

asp20ssPrior

asp20hMCMC

asp20iwlsMCMC

asp20plot

asp20user

Requirements to pass the course

To get the credits for this module, you need to...

You may earn "bonus points" for active exchange with other groups, preferably on GitLab via issues and pull requests. For example, the four groups working on Bayesian inference could develop a common API for the specification of priors. If you need additional features in the asp20model package, feel free to open issues and pull requests in this repository.

Readings



hriebl/asp20model documentation built on April 2, 2020, 12:05 a.m.