Overview

StatComp21064 is a simple R package developed to compare the performance of R and C++ (implemented through the R package Rcpp) for the 'Statistical Computing' course. One function is considered for this comparison, namely gibbs (generating random numbers with a Gibbs sampler); both an R version, gibbsR, and a C++ version, gibbsC, are provided. In addition, according to the requirements of the project, the package contains two further R functions, gradient.descent and EM.

The R package 'microbenchmark' can be used to benchmark the above R and C++ functions.

Benchmarking gibbsR and gibbsC

The source R code for gibbsR is as follows:

gibbsR <- function(N, thin, n, aa, bb) {
  mat <- matrix(nrow = N, ncol = 2)
  x <- y <- 0
  for (i in 1:N) {
    for (j in 1:thin) {
      # alternate draws: x | y ~ Binomial(n, y), y | x ~ Beta(x + aa, n - x + bb)
      x <- rbinom(1, n, y)
      y <- rbeta(1, x + aa, n - x + bb)
    }
    mat[i, ] <- c(x, y)
  }
  mat
}
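
A quick illustrative call (not part of the package documentation): after loading the package, or pasting the definition above, draw a short chain and inspect it; column 1 holds x and column 2 holds y. The seed is chosen arbitrarily.

set.seed(123)
res <- gibbsR(100, 10, 22, 2, 2)  # 100 draws, thinning 10, n = 22, aa = bb = 2
head(res)
colMeans(res)                     # rough summary of the x and y chains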

The above R code involves two nested loops, which can be very slow in R, even in recent versions (R 3.0.1 or later). The corresponding C++ code is as follows.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix gibbsC(int N, int thin, int n, int aa, int bb) {
  NumericMatrix mat(N, 2);
  double x = 0, y = 0;
  for(int i = 0; i < N; i++) {
    for(int j = 0; j < thin; j++) {
      x = rbinom(1, n, y)[0];
      y = rbeta(1, x + aa, n - x + bb)[0];
    }
    mat(i, 0) = x;
    mat(i, 1) = y;
  }
  return(mat);
}
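
The installed package already ships a compiled version of gibbsC. Purely as an illustration, if the C++ code above were saved to a standalone file (the file name gibbsC.cpp is an assumption here), it could also be compiled on the fly with Rcpp:

# Only needed when experimenting outside the installed package:
library(Rcpp)
sourceCpp("gibbsC.cpp")  # compiles the file and exposes gibbsC in the R session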

The R code for benchmarking gibbsR and gibbsC is as follows.

library(StatComp21064)
library(microbenchmark)
tm2 <- microbenchmark(
  vR = gibbsR(1e2, 10, 22, 2, 2),
  vC = gibbsC(1e2, 10, 22, 2, 2)
)
knitr::kable(summary(tm2)[,c(1,3,5,6)])

The results show an evident computational speed gain of C++ over R.

gradient.descent

gradient.descent implements the gradient descent method: the descent direction is the negative gradient direction, and the step size can either be kept fixed or chosen by a backtracking line search. The idea of backtracking is to draw, at the current point, a line on f whose slope is scaled down by the parameter alpha; if the loss at the new point lies below this line, the step size is kept, otherwise it is shrunk by the factor b. A small standalone sketch of this rule is given after the example below. The source R code for gradient.descent is as follows:

gradient.descent <- function(x, y, epsilon = 1e-14, maxit = 1000, stepsize = 1e-3,
                             alpha = 0.25, b = 0.8, stepmethod = "backtracking",
                             verbose = TRUE, plotLoss = TRUE){

  stopifnot(stepsize <= 1e-3)
  m <- nrow(x)
  x <- cbind(rep(1, m), x)           # add intercept column
  n <- ncol(x)
  beta <- numeric(n)
  iter <- 1
  error <- 1
  loss <- crossprod(x %*% beta - y)  # squared-error loss at the starting point

  while((error > epsilon) && (iter < maxit)){
    h <- x %*% beta
    grad <- crossprod(x, (h - y))    # gradient of the squared-error loss
    beta.new <- beta - stepsize * grad
    loss.new <- crossprod(x %*% beta.new - y)

    # backtracking: shrink the step size for the next iteration whenever the
    # Armijo condition is violated
    if(stepmethod == "backtracking")
      if(loss.new > (loss[iter] - alpha * stepsize * sum(grad * grad)))
        stepsize <- stepsize * b

    error <- max(abs(beta.new - beta))
    loss <- append(loss, loss.new)
    beta <- beta.new
    iter <- iter + 1

    if(verbose && iter %% 10 == 0)
      message(paste('Iteration:', iter))
  }

  if(plotLoss)
    plot(loss, type = "l", bty = "n")

  return(list(par = beta,
              RSE = sqrt(crossprod(x %*% beta - y) / (nrow(x) - ncol(x))),
              iter = iter))
}
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + x2 + rnorm(n, sd = 0.1)
x <- cbind(x1, x2)
gradient.descent(x, y)
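
For intuition, here is a minimal standalone sketch of the backtracking rule described above, applied to the same least-squares loss. The helper backtrack_step and the objects X, f, beta0 and grad0 are illustrative only and not part of the package.

# Shrink the step by the factor b until the Armijo condition holds.
backtrack_step <- function(f, grad, beta, stepsize, alpha = 0.25, b = 0.8) {
  while (f(beta - stepsize * grad) > f(beta) - alpha * stepsize * sum(grad * grad)) {
    stepsize <- stepsize * b  # new point lies above the alpha-scaled line: shrink
  }
  stepsize
}

# Applied to the simulated data above (the intercept column is added explicitly
# here, since gradient.descent adds it internally):
X <- cbind(1, x)
f <- function(beta) drop(crossprod(X %*% beta - y))
beta0 <- numeric(ncol(X))
grad0 <- drop(crossprod(X, X %*% beta0 - y))
backtrack_step(f, grad0, beta0, stepsize = 1e-3)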

EM

Normal mixture model: $$f(x)=\sum^C_{i=1}\pi_i\phi(x;\mu_i,\Sigma_i),$$ where $\phi(x;\mu_i,\Sigma_i)$ is the bivariate normal density with mean $\mu_i$ and covariance matrix $\Sigma_i$, and the parameters are $(\pi_i,\mu_i,\Sigma_i),i=1,\cdots,C$. Write $x_j,j=1,\cdots,n$ for the $n$ observed samples, and treat $z_{ij}=P(x_j\sim \phi(x;\mu_i,\Sigma_i)),i=1,\cdots,C,j=1,\cdots,n$ as the missing values.

In the $t$-th iteration of the EM algorithm, given $(\pi^t_i,\mu^t_i,\Sigma^t_i),i=1,\cdots,C$, the E-step is:
$$
\begin{aligned}
z^t_{kl}&=E[z_{kl}\mid\pi^t_i,\mu^t_i,\Sigma^t_i,x_j,i=1,\cdots,C,j=1,\cdots,n]\\
&=P(x_l\sim \phi(x;\mu_k,\Sigma_k)\mid\pi^t_i,\mu^t_i,\Sigma^t_i,x_l,i=1,\cdots,C)\\
&=\frac{\pi^t_k\phi(x_l;\mu^t_k,\Sigma^t_k)}{\sum^C_{i=1}\pi^t_i\phi(x_l;\mu^t_i,\Sigma^t_i)}
\end{aligned}
$$
The M-step is:
$$
\begin{aligned}
\pi^{t+1}_i&=\frac{\sum^n_{j=1}z^t_{ij}}{n}\\
\mu^{t+1}_i&=\frac{\sum^n_{j=1}z^t_{ij}x_j}{\sum^n_{j=1}z^t_{ij}}\\
\Sigma^{t+1}_i&=\frac{\sum^n_{j=1}z^t_{ij}(x_j-\mu^{t+1}_i)(x_j-\mu^{t+1}_i)^T}{\sum^n_{j=1}z^t_{ij}}
\end{aligned}
$$

An important practical point is to plot x first (if the dimension is greater than 2, the variables can be plotted pairwise, although this is not suitable for very high dimensions), judge from the plot how many normal components the data are roughly mixed from, fix C accordingly, and read an approximate mu0 off the cluster locations. This matters: if mu0 deviates too far from the true component means, the algorithm can fail with an error. The source R code for EM is as follows:

library(mvtnorm)
EM <- function(x, C, iter, mu0) {
  n <- nrow(x)
  p <- ncol(x)
  # initialize Z, mu, Sig, pi (Sig is C identity blocks stacked; assumes p = 2,
  # the bivariate case described above)
  Z <- matrix(c(rep(1, C), rep(0, (n - 1) * C)), n, C)
  Sig <- matrix(rep(c(1, 0, 0, 1), C), ncol = p, byrow = TRUE)
  pi <- rep(1 / C, C)
  mu <- mu0
  # start EM here
  for (t in 1:iter) {
    # E-step: update the responsibilities Z
    for (k in 1:n) {
      for (l in 1:C) {
        Z[k, l] <- pi[l] * mvtnorm::dmvnorm(x[k, ], mu[l, ], Sig[(p * (l - 1) + 1):(p * l), ])
      }
      Z[k, ] <- Z[k, ] / sum(Z[k, ])
    }

    # M-step 1: update the mixing proportions pi
    pi <- colMeans(Z)

    for (i in 1:C) {
      # M-step 2: update the component means mu
      mu[i, ] <- t(Z[, i]) %*% x / sum(Z[, i])

      # M-step 3: update the component covariances Sigma
      sumsig <- Z[1, i] * (x[1, ] - mu[i, ]) %*% t(x[1, ] - mu[i, ])
      for (k in 2:n) {
        sumsig <- sumsig + Z[k, i] * (x[k, ] - mu[i, ]) %*% t(x[k, ] - mu[i, ])
      }
      Sig[(p * (i - 1) + 1):(p * i), ] <- sumsig / sum(Z[, i])
    }
  }
  return(list(mu = mu, Sig = Sig))
}

As an example, apply EM to the built-in data set "faithful"; C and mu0 are read off from a scatter plot of the data (a sketch of that plot is given after the call below):

library(datasets)
x<-as.matrix(datasets::faithful)
EM(x,2,20,matrix(c(2,50,4,80),2,2,byrow=T))
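
The choice C = 2 and the mu0 above come from looking at the data first, as recommended. A minimal sketch of that check (illustrative only, not part of the package): the scatter plot of faithful shows two clusters, roughly around (2, 55) and (4.5, 80).

# Inspect the data before running EM: two clusters are visible,
# which motivates C = 2 and the starting means used above.
plot(faithful$eruptions, faithful$waiting,
     xlab = "eruptions", ylab = "waiting")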

