Overview

SC19020 is a simple R package developed to illustrate two important concepts from the Nonparametric Statistics and Statistical Computing courses. The bootstrap is an important resampling method in statistics and is very helpful for studying the properties of a data set, especially when the sample size is not large. The kernel density estimator (KDE) is an important nonparametric method for estimating the true density function of a data set when little is known about the underlying distribution, and many other quantities of interest can be derived from it. The package provides two functions: bootstrapR (generate bootstrap samples of the given data) and KDER (compute the kernel density estimate of the given data at a point).

bootstrapR

The source R code for bootstrapR is as follows:

bootstrapR <- function(X, n) {
  m <- length(X)
  out <- matrix(nrow = n, ncol = m)
  for(i in 1:n){
    # each row of out is one bootstrap sample drawn with replacement from X
    out[i,] <- sample(X, m, replace = TRUE)
  }
  return(out)
}

In order to show how to use bootstrapR, we give an example.

library(SC19020)
X <- c(96,87,88,90,85)
outer <- bootstrapR(X, 5)
outer

KDER

The source R code for KDER is as follows:

KDER <- function(X, h, x) {
  # Gaussian kernel density estimate of the data X at the point x with bandwidth h
  n <- length(X)
  out <- (1/h)*mean(dnorm((x-X)/h))
  return(out)
}
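
For reference, KDER computes the standard Gaussian kernel density estimate $$\hat{f}_{h}(x)=\frac{1}{nh} \sum_{i=1}^{n} \phi\left(\frac{x-X_{i}}{h}\right),$$ where $\phi$ is the standard normal density, $n$ is the sample size, and $h$ is the bandwidth; this is exactly what (1/h)*mean(dnorm((x-X)/h)) evaluates.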

In order to show how to use KDER, we give an example.

X <- rnorm(100,1,2)
x <- seq(-1, 1, by = 0.01)
h <- 0.1
y <- numeric(length(x))
for(i in 1:length(x)){
  y[i] <- KDER(X, h, x[i])
}
plot(x, y, type = "l")

Homework of 20190920

Question

Use knitr to produce at least 3 examples (texts, figures, tables).

Answer

1.figure

x<-c(20,30,40,45,60)
y<-c(16,20,27,40,60)
plot(x,y,type = "l",col="blue")
title(main="An example")

2.table

math<-c(95,80,74,66,84)
engl<-c(80,77,68,92,85)
phys<-c(90,82,65,74,90)
chem<-c(88,96,70,84,80)
mydata<-data.frame(math,engl,phys,chem)
mydata

3.text

We derive the nonparametric maximum likelihood estimate, $\hat{F}$ say, of a lifetime distribution $\boldsymbol{F}$ on the basis of two independent samples, one a sample of size $m$ from $\boldsymbol{F}$ and the other a sample of size $n$ from the length-biased distribution of $\boldsymbol{F}$, i.e. from $G_{F}(y) \equiv G(y)=\frac{1}{\mu} \int_{0}^{y} x d F(x),\quad y\geq 0$. We further show that $(m+n)^{1 / 2}(\hat{F}-F)$ converges weakly to a pinned Gaussian process with a simple covariance function, when $m+n \rightarrow \infty$ and $m / n \rightarrow$ constant.

Homework of 20190929

Question1

3.4 The Rayleigh density [156, Ch. 18] is $f(x)=\frac{x}{\sigma^{2}} e^{-x^{2} /\left(2 \sigma^{2}\right)}, \quad x \geq 0, \sigma>0$. Develop an algorithm to generate random samples from a Rayleigh($\sigma$) distribution. Generate Rayleigh($\sigma$) samples for several choices of $\sigma>0$ and check that the mode of the generated samples is close to the theoretical mode $\sigma$ (check the histogram).

Answer

1.basic ideas

We use the inverse transform method. The CDF of the Rayleigh distribution is $F(x)=1-e^{-\frac{x^{2}}{2 \sigma^{2}}}$. We first generate a random sample $U$ from the uniform distribution U(0,1) and then solve $x=F^{-1}(U)=\sigma\sqrt{-2\ln(1-U)}$, so that $x$ follows the Rayleigh distribution.

2.code

(1)take sigma as 1

sigma <- 1;n <- 100000
u <- runif(n)
x <- sqrt(-2*(sigma^2)*log(1-u)) # F(x) = 1-exp(-x^2/(2*sigma^2)), x>=0,sigma>0
hist(x,prob =TRUE, breaks=100,xlim=c(0,6),main=expression(f(x)==x*exp(-x^2/2)))
y <- seq(0, 6, .01)
lines(y, y/(sigma^2)*exp(-y^2/(2*(sigma^2))))

(2)take sigma as 2

sigma <- 2; n <- 100000
u <- runif(n)
x <- sqrt(-2*(sigma^2)*log(1-u)) # F(x) = 1-exp(-x^2/(2*sigma^2)), x>=0,sigma>0
hist(x,prob =TRUE,breaks=100,xlim=c(0,8), main=expression(f(x)==x/4*exp(-x^2/8)))
y <- seq(0, 8, 0.01)
lines(y, y/(sigma^2)*exp(-y^2/(2*(sigma^2))))

3.my thoughts

The inverse transform method works well for generating samples from complicated distributions, as long as the inverse of the CDF is available.

Question2

3.11 Generate a random sample of size 1000 from a normal location mixture. The components of the mixture have $N(0,1)$and $N(3,1)$distributions with mixing probabilities $p_{1}$ and $p_{2}=1-p_{1}$. Graph the histogram of the sample with density superimposed, for $p_{1}=0.75$. Repeat with different values for $p_{1}$ and observe whether the empirical distribution of the mixture appears to be bimodal. Make a conjecture about the values of $p_{1}$ that produce bimodal mixtures.

Answer

1.basic ideas

Following the method from the lecture notes, we generate the random sample and then check how well the histogram of the sample matches the population density curve.

2.code

n <- 1e4
X1 <- rnorm(n,0,1)
X2 <- rnorm(n,3,1) 
r <- sample(c(0,1),n,replace=TRUE,prob=c(0.25,0.75))
Z <- r*X1+(1-r)*X2
hist(Z,prob = TRUE,breaks = 100,xlim = c(-4,6))
x<-seq(-4,6,0.01)
y<-0.75*dnorm(x,0,1)+0.25*dnorm(x,3,1)
lines(x,y)
n <- 1e4
p <- c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9)

for (i in 1:9){
  X1 <- rnorm(n,0,1)
  X2 <- rnorm(n,3,1) 
  r <- sample(c(0,1),size=n,replace=TRUE,prob=c(p[i],1-p[i]))
  Z<- r*X1+(1-r)*X2
  hist(Z,breaks = 100,xlim = c(-4,6))
}
par(mfrow=c(1,1))

3.conjecture

When $p_{1}$ is close to 0.5, the mixture appears bimodal.

4.my thoughts

I am not yet familiar with arranging several plots on one screen or with making the plots look nicer.

Question3

3.18 Write a function to generate a random sample from a $W_{d}(\Sigma, n)$ (Wishart) distribution for $n>d+1 \geq 1$, based on Bartlett’s decomposition.

Answer

1.basic ideas

First generate a lower triangular matrix $T$ satisfying $T_{i j} \stackrel{i i d}{\sim} N(0,1)$ for $i>j$ and $T_{i i} \sim \sqrt{\chi^{2}(n-i+1)}$ for $i=1, \ldots, d$. Then obtain the Choleski factorization of the matrix $\Sigma$; by Bartlett's decomposition we can generate a random sample from $W_{d}(\Sigma, n)$.

2.code

n<-10; d<-5
X<-numeric(d)
for(i in 1:d){
  X[i]<-sqrt(rchisq(1,df=n-i+1))
}
A <- diag(X) # diagonal matrix with the chi-distributed diagonal entries
B <- matrix(0,nrow=d,ncol=d)
for(i in 1:d) # fill the strictly lower triangle with N(0,1) entries
  for(j in seq_len(i-1)){
    B[i,j]<-rnorm(1)
  }
T <- A+B # the lower triangular matrix required by Bartlett's decomposition
Sigma <- matrix(0,nrow=d,ncol=d)
for(i in 1:d) # covariance matrix of the target Wishart distribution
  for(j in 1:d){
    if (i==j)
      Sigma[i,j]<-4
    else
      Sigma[i,j]<-0.8}
L<-chol(Sigma) # Choleski factorization of Sigma (upper triangular factor)
W<-t(L)%*%T%*%t(T)%*%L # Bartlett's decomposition: W = (t(L) T)(t(L) T)' ~ W_d(Sigma, n)
W

3.my thoughts

I am not yet comfortable with writing functions or with entering matrices concisely, and complicated nested loops are still hard for me to write.
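
Since Exercise 3.18 asks for a function, a minimal sketch that wraps the steps above into one is given below; the function name rWishartBartlett and its arguments are my own choice and not part of the original answer.

rWishartBartlett <- function(n, Sigma) {
  d <- nrow(Sigma)
  T <- matrix(0, d, d)
  diag(T) <- sqrt(rchisq(d, df = n - (1:d) + 1)) # chi-distributed diagonal entries
  T[lower.tri(T)] <- rnorm(d*(d-1)/2)            # N(0,1) entries below the diagonal
  L <- t(chol(Sigma))                            # lower triangular Choleski factor of Sigma
  L %*% T %*% t(T) %*% t(L)                      # Bartlett's decomposition
}
W <- rWishartBartlett(10, Sigma) # reuses the Sigma constructed above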

Homework of 20191011

Question1

5.1 Compute a Monte Carlo estimate of $\theta=\int_{0}^{\frac{\pi}{3}} \sin t d t$ and compare your estimate with the exact value of the integral.

Answer

1.basic ideas

We can apply the Monte Carlo method and use a sample mean to approximate the integral, which gives an estimate that we can then compare with the exact value. $\theta=\int_{0}^{\frac{\pi}{3}} \sin t d t=E_{Y}\left(\frac{\pi}{3} \sin Y\right), Y \sim U\left(0, \frac{\pi}{3}\right)$

2.code

m <- 1e4
x <- runif(m, min=0, max=pi/3) #generate random numbers
theta.hat <- mean(sin(x)) * pi/3 #use sample mean to approximate the integration
print(c(theta.hat,cos(0) - cos(pi/3))) #compare the approximation and the true value

3.my thoughts

The Monte Carlo method provides a simple and accurate way to approximate complicated integrals in a general fashion.

Question2

5.10 Use Monte Carlo integration with antithetic variables to estimate $\int_{0}^{1} \frac{e^{-x}}{1+x^{2}} d x$, and find the approximate reduction in variance as a percentage of the variance without variance reduction.

Answer

1.basic ideas

We use the standard Monte Carlo method and the Monte Carlo method with antithetic variables to approximate the integral, and we compare the two results by their variances.
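
Concretely, with $g(x)=\frac{e^{-x}}{1+x^{2}}$ and $U_{i} \sim U(0,1)$, the antithetic estimator averages each draw with its reflection, $$\hat{\theta}_{\text {anti}}=\frac{2}{m} \sum_{i=1}^{m / 2} \frac{g\left(U_{i}\right)+g\left(1-U_{i}\right)}{2},$$ so only m/2 uniform numbers are needed to match the m function evaluations of the standard estimator.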

2.code

m <- 1e4
x1 <- runif(m)
x2 <- runif(m/2) #generate m/2 samples to ensure the total are the same with the 1st method
MC_standard<-numeric(m); MC_antithetic<-numeric(m/2)
MC_standard<-exp(-x1)/(1+x1^2)
MC_antithetic<-1/2*(exp(-x2)/(1+x2^2)+exp(-1+x2)/(1+(1-x2)^2))
theta.stan <- mean(MC_standard) #the standard Monte Carlo method
se.stan<- sd(MC_standard)
theta.anti <- mean(MC_antithetic) #the Monte Carlo method with antithetic variables
se.anti <- sd(MC_antithetic)
reduc_sd <- 1-sd(MC_antithetic)/sd(MC_standard) #the reduction of sd
cbind(theta.stan,theta.anti,se.stan,se.anti,reduc_sd)

3.my thoughts

The printed result shows that the Monte Carlo method with antithetic variables reduces the standard deviation by more than 86% compared with the standard Monte Carlo method.

Question3

5.15 Obtain the stratified importance sampling estimate in Example 5.13 and compare it with the result of Example 5.10.

Answer

1.basic ideas

We first choose several importance functions to estimate $\int_{0}^{1} \frac{e^{-x}}{1+x^{2}} d x$, and we can easily find that the 4th choice is the best. Then we use stratified importance sampling to estimate the integration again and compare the result with the former best one.

Because $\int_{0}^{1} \frac{e^{-x}}{1+x^{2}} d x=\int_{\alpha_0}^{\alpha_{1}} \frac{1-e^{-1}}{5\left(1+x^{2}\right)} \frac{5 e^{-x}}{1-e^{-1}} d x+\cdots+\int_{\alpha_{4}}^{\alpha_{5}} \frac{1-e^{-1}}{5\left(1+x^{2}\right)} \frac{5 e^{-x}}{1-e^{-1}} d x$, we first generate random numbers on each stratum from the distribution with density $f_{j}(x)=\frac{5e^{-x}}{1-e^{-1}}$, $\alpha_{j}<x<\alpha_{j+1}$, where $\alpha_{j}$ is the $\frac{j}{5} \times 100\%$ quantile of the distribution with density $f(x)=\frac{e^{-x}}{1-e^{-1}}, 0<x<1$. We then sum the stratum estimates to obtain an estimate of the integral and the other quantities.
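
For completeness, the stratum endpoints have a closed form obtained by inverting $F(x)=\frac{1-e^{-x}}{1-e^{-1}}$ on $(0,1)$: $$\alpha_{j}=-\ln \left(1-\frac{j}{5}\left(1-e^{-1}\right)\right), \quad j=0,1, \ldots, 5,$$ which is exactly what the code below computes.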

2.code

(1)Several possible choices of importance functions for estimating $\int_{0}^{1} \frac{e^{-x}}{1+x^{2}} d x$ by the importance sampling method are compared below.

m <- 10000
theta.hat <- se <- numeric(5)
g <- function(x) {
  exp(-x - log(1+x^2)) * (x > 0) * (x < 1) }

x <- runif(m) #using f0
fg <- g(x)
theta.hat[1] <- mean(fg)
se[1] <- sd(fg)

x <- rexp(m, 1) #using f1
fg <- g(x) / exp(-x)
theta.hat[2] <- mean(fg)
se[2] <- sd(fg)

x <- rcauchy(m) #using f2
i <- c(which(x > 1), which(x < 0))
x[i] <- 2 #to catch overflow errors in g(x)
fg <- g(x) / dcauchy(x)
theta.hat[3] <- mean(fg)
se[3] <- sd(fg)

u <- runif(m) #f3, inverse transform method
x <- - log(1 - u * (1 - exp(-1)))
fg <- g(x) / (exp(-x) / (1 - exp(-1)))
theta.hat[4] <- mean(fg)
se[4] <- sd(fg)

u <- runif(m) #f4, inverse transform method
x <- tan(pi * u / 4)
fg <- g(x) / (4 / ((1 + x^2) * pi))
theta.hat[5] <- mean(fg)
se[5] <- sd(fg)

rbind(theta.hat, se)

(2)Stratified Importance Sampling

m <- 10000 #number of replicates
k <- 5 #number of strata
r <- m/k #replicates per stratum
#compute the 0%, 20%, 40%, 60%, 80% and 100% percentiles
alpha <- numeric(6)
for(j in 1:6){
  alpha[j] <- -log(1-(1/5)*(j-1)*(1-exp(-1)))
}
x <- matrix(nrow=2000,ncol=5)
g <- function(x)(1-exp(-1))/(5*(1+x^2))*(x>0)*(x<1)
est <- numeric(5)
#the inverse transform method to generate samples 
for (j in 1:k){
  u <- runif(r)
  x[,j] <- - log(1 - (1/5)*(u+j-1) * (1 - exp(-1)))
  est[j] <- mean(g(x[,j]))
  }
theta.hat<-sum(est); se<-sd(g(x))
cbind(theta.hat,se)

3.my thoughts

We can find that the stratified importance sampling result is considerably better (smaller standard error) than the best of the plain importance sampling estimates above.

Homework of 20191018

Question

6.5 Suppose a 95% symmetric t-interval is applied to estimate a mean, but the sample data are non-normal. Then the probability that the confidence interval covers the mean is not necessarily equal to 0.95. Use a Monte Carlo experiment to estimate the coverage probability of the t-interval for random samples of $\chi^{2}(2)$ data with sample size n = 20. Compare your t-interval results with the simulation results in Example 6.4. (The t-interval should be more robust to departures from normality than the interval for variance.)

6.6 Estimate the 0.025, 0.05, 0.95, and 0.975 quantiles of the skewness $\sqrt{b_{1}}$ under normality by a Monte Carlo experiment. Compute the standard error of the estimates from (2.14) using the normal approximation for the density (with exact variance formula). Compare the estimated quantiles with the quantiles of the large sample approximation $\sqrt{b_{1}} \approx N(0,6 / n)$.

Answer of 6.5

1.basic ideas

Use a Monte Carlo experiment to estimate the coverage probability of three different confidence intervals and compare them.
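
For reference, the two types of intervals compared below are the two-sided t-interval for the mean and the one-sided upper confidence limit for the variance, $$\bar{X} \pm t_{1-\alpha / 2}(n-1) \frac{S}{\sqrt{n}} \quad \text{and} \quad \left(0,\ \frac{(n-1) S^{2}}{\chi_{\alpha}^{2}(n-1)}\right),$$ which are the limits computed in the code.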

2.code

set.seed(123)
covp <- function(m){
  n <- 20; alpha <- 0.05
  LCL_mean <- numeric(m)
  UCL_mean <- numeric(m)
  UCL_var_chisq <- UCL_var_normal <- numeric(m)
  for(i in 1:m){
    x <- rchisq(n,2) #samples from chisq with mean=2, var=4
    #the lower and upper confidence limit of mean
    LCL_mean[i] <- mean(x)-sd(x)*qt(1-alpha/2,n-1)/sqrt(n)
    UCL_mean[i] <- mean(x)+sd(x)*qt(1-alpha/2,n-1)/sqrt(n)
    #the one-side upper confidence limit of variance
    UCL_var_chisq[i] <- (n-1)*var(x)/qchisq(alpha, df=n-1)
    y <- rnorm(n) #samples from norm with mean=0, var=1
    #the one-side upper confidence limit of variance
    UCL_var_normal[i] <- (n-1)*var(y)/qchisq(alpha, df=n-1)
  }
  #coverage probability of mean from chisq distribution
  covp_mean_chisq <- mean(LCL_mean<2&2<UCL_mean)
  #coverage probability of variance from chisq distribution
  covp_var_chisq <- mean(4<UCL_var_chisq)
  #coverage probability of variance from normal distribution
  covp_var_normal <- mean(1<UCL_var_normal)
  cbind(covp_mean_chisq,covp_var_chisq,covp_var_normal)
}
m<-1e3; covp(m)
m<-1e4; covp(m)
m<-1e5; covp(m)

3.my thoughts

We can find that the first and second coverage probabilities are smaller than the nominal confidence level because the samples are not drawn from a normal distribution, yet the intervals are still built from the t and chi-square statistics, so those intervals are not theoretically justified here.

At the same time, the third coverage probability is quite close to the nominal level and gets closer as the number of replicates increases, because that confidence interval is theoretically correct.

In addition, the confidence interval for the mean performs better than the one for the variance. A likely reason is that the sample mean is approximately normal by the central limit theorem even when the data are not normal, so the t-interval is more robust to departures from normality than the chi-square interval for the variance.

Answer of 6.6

1.basic ideas

We first generate samples from a normal distribution, replicate the process, and use the mean of the sample quantiles to estimate the true quantiles of the skewness. We then compare the estimated quantiles with the quantiles of the large sample approximation.

Also, we use $\operatorname{Var}\left(\hat{x}_{q}\right)=\frac{q(1-q)}{n f\left(x_{q}\right)^{2}}$ to compute the standard error of the estimates.

2.code

set.seed(123)
n <- 50 #sample sizes
m <- 1000 #replicates to compute a sample quantile once
k <- 100 #replicates of the Monte Carlo experiment
alpha <- c(0.025,0.05,0.95,0.975) #different probabilities of quantiles
skew <- numeric(m)
se_quantile.hat <- quantile.appro <- quantile.hat <- numeric(4)
quantile <- matrix(nrow=k,ncol=4)
#we compute k sample quantiles of every probability
for(i in 1:k){
  for(j in 1:m){
    x <- rnorm(n)
    skew[j] <- mean((x-mean(x))^3)/(mean((x-mean(x))^2))^(3/2)  #sample skewness
    }
  quantile[i,] <- quantile(skew, probs=alpha) #a sample quantiles of skewness
}
#use the mean of sample quantiles to estimate the sample quantiles
quantile.hat <- colMeans(quantile)
for(i in 1:4){
#quantiles of the large sample approximation
  quantile.appro[i] <- qnorm(alpha[i],0,sqrt(6/n))
#the standard error of the estimates
  se_quantile.hat[i] <- sqrt((alpha[i]*(1-alpha[i])/(n*(dnorm(quantile.appro[i],0,sqrt(6/n))^2))))
}
cbind(alpha,quantile.hat,quantile.appro,se_quantile.hat)   

3.my thoughts

We can find that the results of the Monte Carlo method are close to those of the large sample approximation and the standard errors are small, so the Monte Carlo method provides a good tool for statistical inference.

Homework of 20191101

Question

6.7 Estimate the power of the skewness test of normality against symmetric $\operatorname{Beta}(\alpha, \alpha)$ distributions and comment on the results. Are the results different for heavy-tailed symmetric alternatives such as $t(\nu)$?

6.A Use Monte Carlo simulation to investigate whether the empirical Type I error rate of the t-test is approximately equal to the nominal significance level $\alpha$, when the sampled population is non-normal. The t-test is robust to mild departures from normality. Discuss the simulation results for the cases where the sampled population is (i) $\chi^{2}(1)$, (ii) Uniform(0,2), and (iii) Exponential(rate=1). In each case, test: $H_{0}: \mu=\mu_{0}$ vs $H_{1}: \mu \neq \mu_{0}$, where $\mu_{0}$ is the mean of $\chi^{2}(1)$, Uniform(0,2), and Exponential(1), respectively.

Discussion: If we obtain the powers for two methods under a particular simulation setting with 10,000 experiments: say, 0.651 for one method and 0.676 for another method. Can we say the powers are different at 0.05 level?

(1)What is the corresponding hypothesis test problem?

(2)What test should we use? Z-test, two-sample t-test, paired t-test or McNemar test?

(3)What information is needed to test your hypothesis?

Answer of 6.7

1.basic ideas

Use Monte Carlo methods to estimate the power of skewness test of normality against symmetric $\operatorname{Beta}(\alpha, \alpha)$ distributions and heavy-tailed symmetric alternatives such as $t(\nu)$.
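
Under $H_{0}$ (normality) the sample skewness is approximately normal, which gives the rejection rule used below: $$\sqrt{b_{1}} \stackrel{\cdot}{\sim} N\left(0, \frac{6(n-2)}{(n+1)(n+3)}\right), \qquad \text{reject } H_{0} \text{ if } \left|\sqrt{b_{1}}\right| \geq z_{1-\alpha / 2} \sqrt{\frac{6(n-2)}{(n+1)(n+3)}}.$$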

2.code

#skewness test of normality against symmetric Beta distribution
alpha<-0.05; n<-100
a<-seq(1,50,1)
# critical value for the skewness test
cv<-qnorm(1-alpha/2,0,sqrt(6*(n-2)/((n+1)*(n+3))))
# the function to compute skewness of samples
sk <- function(x) mean((x-mean(x))^3)/(mean((x-mean(x))^2))^(3/2)
pwr <- numeric(length(a))
m <- 10000 # num. repl. each sim.
for (i in 1:length(a)) {
  sktests<-numeric(m)
  for (j in 1:m) {
    x <-  rbeta(n,a[i],a[i]) # test decision is 1 (reject) or 0
    sktests[j] <- as.integer(abs(sk(x)) >= cv) 
    }
  pwr[i] <- mean(sktests) # proportion rejected, that is the power of test
}
plot(a,pwr,xlab = "parameter of Beta distribution",ylab = "power of test")

3.comments

The power of the test is generally poor, although it increases from near zero toward the nominal level as the parameter of the Beta distribution gets larger.

# skewness test of normality against heavy-tailed symmetric alternatives such as t distribution
alpha<-0.05; n<-100
a<-seq(1,30,1)
cv<-qnorm(1-alpha/2,0,sqrt(6*(n-2)/((n+1)*(n+3))))
sk <- function(x) mean((x-mean(x))^3)/(mean((x-mean(x))^2))^(3/2)
pwr <- numeric(length(a))
m <- 10000 #num. repl. each sim.
for (i in 1:length(a)) {
  sktests<-numeric(m)
  for (j in 1:m) {
    x <- rt(n,a[i]) #test decision is 1 (reject) or 0
    sktests[j] <- as.integer(abs(sk(x)) >= cv)
    }
  pwr[i] <- mean(sktests) #proportion rejected
}
plot(a,pwr,xlab = "df of t distribution",ylab = "power of test")

4.comparison and my thoughts

The power of the test against the t distribution decreases from nearly 100% toward the nominal level 5% as the degrees of freedom increase. The trends in the two cases are opposite, but both powers approach the nominal level as the parameter grows.

Answer of 6.A

1.basic ideas

Use Monte Carlo experiment to assess Type I error rate.

2.code

#H0:mu=1
m <- 1e4; n <- 1000
set.seed(123)
p.valx <- p.valy <- p.valz <- numeric(m)
for(i in 1:m){
  # generate samples from $\chi^{2}(1)$ and compute p-value
  x <- rchisq(n,1)
  p.valx[i] <- 2*(1-pt(abs(sqrt(n)*(mean(x)-1)/sd(x)),n-1))
  # generate samples from uniform(0,2) and compute p-value
  y<-runif(n,0,2)
  p.valy[i] <- 2*(1-pt(abs(sqrt(n)*(mean(y)-1)/sd(y)),n-1))
  # generate samples from Exp(1) and compute p-value
  z<-rexp(n,1)
  p.valz[i] <- 2*(1-pt(abs(sqrt(n)*(mean(z)-1)/sd(z)),n-1))
}
#print the t1es
print(c(mean(p.valx<=0.05),mean(p.valy<=0.05),mean(p.valz<=0.05)))

3.my thoughts

We can find that the t1es are quite close to the nominal level of the test, which means that even though the sampled population is non-normal, the t-test is quite robust to mild departures from normality.

Answer of discussion

(1)Take the data set "sleep" as an example: we may care about whether the new medicine improves the quality of sleep, in other words whether the means of the two sampled populations are equal. So the null hypothesis is "$\mu_{1}=\mu_{2}$", and the alternative hypothesis is "$\mu_{1}\neq\mu_{2}$".

(2)We can use a two-sample t-test or a paired t-test to test these hypotheses.

(3)We need to know whether the distributions of the sampled populations are normal, whether the variances of the two populations are equal, and whether the two samples are independent or paired. All of this information determines which test we should use.

Homework of 20191108

Question

7.6 Efron and Tibshirani discuss the scor (bootstrap) test score data on 88 students who took examinations in five subjects [84, Table 7.1], [188, Table 1.2.1]. The first two tests (mechanics, vectors) were closed book and the last three tests (algebra, analysis, statistics) were open book. Each row of the data frame is a set of scores $\left(x_{i 1}, \ldots, x_{i 5}\right)$ for the ith student. Use a panel display to display the scatter plots for each pair of test scores. Compare the plot with the sample correlation matrix. Obtain bootstrap estimates of the standard errors for each of the following estimates: $\hat{\rho}_{12}=\hat{\rho}(\mathrm{mec}, \mathrm{vec})$, $\hat{\rho}_{34}=\hat{\rho}(\mathrm{alg}, \mathrm{ana})$, $\hat{\rho}_{35}=\hat{\rho}(\mathrm{alg}, \mathrm{sta})$, $\hat{\rho}_{45}=\hat{\rho}(\mathrm{ana}, \mathrm{sta})$.

7.B Repeat Project 7.A for the sample skewness statistic. Compare the coverage rates for normal populations (skewness 0) and $\chi^{2}(5)$ distributions (positive skewness).

Answer of 7.6

1.basic ideas

Bootstrap the sample to estimate the standard errors of the different correlations.

2.code

set.seed(12345)
library(boot)
data(scor, package = "bootstrap")
scor<-as.matrix(scor)
# display the scatter plots for each pair of test scores
pairs(scor)
# compute the correlations
cor(scor)
# bootstrap estimates of the correlations and their standard errors
b.cor <- function(x,i) cor(x[i,1],x[i,2])
obj <- boot(data=scor,statistic=b.cor,R=10000)
round(c(rho12=obj$t0,se=sd(obj$t)),3)

b.cor <- function(x,i) cor(x[i,3],x[i,4])
obj <- boot(data=scor,statistic=b.cor,R=10000)
round(c(rho34=obj$t0,se=sd(obj$t)),3)

b.cor <- function(x,i) cor(x[i,3],x[i,5])
obj <- boot(data=scor,statistic=b.cor,R=10000)
round(c(rho35=obj$t0,se=sd(obj$t)),3)

b.cor <- function(x,i) cor(x[i,4],x[i,5])
obj <- boot(data=scor,statistic=b.cor,R=10000)
round(c(rho45=obj$t0,se=sd(obj$t)),3)

3.my thoughts

We can find that all of these correlations are positive. Further, the correlation between the two closed-book tests is lower than the correlations among the open-book tests.

Answer of 7B

1.basic ideas

We can use bootstrap and Monte Carlo methods to calculate the coverage rates and the proportions of misses for the sample skewness statistic from different populations.

2.code

set.seed(12345)
# the package to compute sample skewness and bootstrap
library(moments); library(boot)
n<-1e2; m<-1e3
boot.skew <- function(x,i) skewness(x[i])
ci_norm<-ci_chisq<-matrix(NA,m,2)

# Monte Carlo method to compute coverage probability
for(i in 1:m){
  # generate original sample
  U<-rnorm(n)
  V<-rchisq(n,5)
  # bootstrap to compute skewness
  out1 <- boot(U,statistic=boot.skew, R = 999)
  out2 <- boot(V,statistic=boot.skew, R = 999)
  # compute the confidence interval
  ci_norm[i,] <- boot.ci(out1,type="norm")$norm[2:3]
  ci_chisq[i,] <- boot.ci(out2,type="norm")$norm[2:3]
}
# print the results
cbind(cp_norm =mean(ci_norm[,1]<=0&ci_norm[,2]>=0),prop_left=mean(0<ci_norm[,1]),prop_right=mean(0>ci_norm[,2]))

cbind(cp_chisq =mean(ci_chisq[,1]<=sqrt(8/5)&ci_chisq[,2]>=sqrt(8/5)),prop_left=mean(sqrt(8/5)<ci_chisq[,1]),prop_right=mean(sqrt(8/5)>ci_chisq[,2]))

3.thoughts

We can find that the coverage rate for the normal distribution is higher than that for the chi-square distribution, which means the CI for the skewness works better for the normal distribution than for the chi-square distribution.

What's more, the proportions of misses on the left and on the right are almost the same for the normal distribution, which reflects the symmetry of the normal distribution.

However, for the chi-square distribution the proportion of misses on the right is clearly larger than that on the left, which reflects the asymmetry of the chi-square distribution; in fact, it is skewed to the right.

Homework of 20191115

Question

7.8 Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.

7.10 In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Repeat the analysis replacing the Log-Log model with a cubic polynomial model. Which of the four models is selected by the cross validation procedure? Which model is selected according to maximum adjusted $R^{2}$?

Answer of 7.8

1.basic ideas

Use the jackknife formulas to estimate the bias and standard error of $\hat{\theta}$.
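
The estimates used in the code are the standard jackknife formulas $$\widehat{\operatorname{bias}}_{\text{jack}}=(n-1)\left(\overline{\hat{\theta}}_{(\cdot)}-\hat{\theta}\right), \qquad \widehat{\operatorname{se}}_{\text{jack}}=\sqrt{\frac{n-1}{n} \sum_{i=1}^{n}\left(\hat{\theta}_{(i)}-\overline{\hat{\theta}}_{(\cdot)}\right)^{2}},$$ where $\hat{\theta}_{(i)}$ is the estimate computed with the $i$-th observation left out and $\overline{\hat{\theta}}_{(\cdot)}$ is the mean of these leave-one-out estimates.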

2.code

data(scor, package = "bootstrap")
lambda.hat <- eigen(cov(scor))$values
theta.hat <- max(lambda.hat)/sum(lambda.hat)
n <- nrow(scor)
theta.j <- numeric(n)
#leave one out
for (i in 1:n) {
  x <- scor [-i,]
  lambda <- eigen(cov(x))$values
  theta.j[i] <- max(lambda)/sum(lambda)
}
bias.jack <- (n-1)*(mean(theta.j)-theta.hat)
se.jack <- sqrt((n-1)^2/n*var(theta.j))
#print bias and se
round(c(bias.jack=bias.jack, se.jack=se.jack),5)

3.my thoughts

We can find that the bias is quite close to zero and the standard error is very small.

Answer of 7.10

1.basic ideas

Cross validation and adjusted R square are applied to select the best regression model

2.code

library(DAAG)
attach(ironslag)
n <- length(magnetic)
e1 <- e2 <- e3 <- e4 <- numeric(n)
#for n-fold cross validation
for (i in 1:n) {
  y <- magnetic[-i]
  x <- chemical[-i]
#Fit the model(s) using only the n−1 observations in the training set
#Compute the predicted response for the test point
#Compute the prediction error

  M1 <- lm(y ~ x) #linear model
  yhat1 <- M1$coef[1]+M1$coef[2]*chemical[i]
  e1[i] <- magnetic[i] - yhat1

  M2 <- lm(y ~ x + I(x^2)) #quadratic polynomial model
  yhat2 <- M2$coef[1]+M2$coef[2]*chemical[i]+M2$coef[3]*chemical[i]^2
  e2[i] <- magnetic[i] - yhat2

  M3 <- lm(log(y) ~ x) #exponential model
  logyhat3 <- M3$coef[1]+M3$coef[2]*chemical[i]
  yhat3 <- exp(logyhat3)
  e3[i] <- magnetic[i] - yhat3

  M4 <- lm(y ~ x + I(x^2)+I(x^3)) #cubic polynomial model
  yhat4 <- M4$coef[1]+M4$coef[2]*chemical[i]+M4$coef[3]*chemical[i]^2+M4$coef[4]*chemical[i]^3
  e4[i] <- magnetic[i] - yhat4
}
#calculate the average squared prediction error
 c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2))
#print the adjusted R squares
 c(summary(M1)$adj.r.squared, summary(M2)$adj.r.squared, summary(M3)$adj.r.squared, summary(M4)$adj.r.squared)

3.my thoughts

According to the cross validation procedure, we should choose the quadratic polynomial model. However, according to maximum adjusted R square, we should choose the cubic polynomial model.

Homework of 20191122

Question

8.3 The Count 5 test for equal variances in Section 6.4 is based on the maximum number of extreme points. Example 6.15 shows that the Count 5 criterion is not applicable for unequal sample sizes. Implement a permutation test for equal variance based on the maximum number of extreme points that applies when sample sizes are not necessarily equal.

Ex: Power comparison (distance correlation test versus ball covariance test)

Model 1: $Y=X/4+e$; Model 2: $Y=X/4\times e$

where $X\sim N\left(0_{2},I_{2}\right)$, $e\sim N\left(0_{2},I_{2}\right)$, $X$ and $e$ are independent.

Answer of 8.3

1.basic ideas

In order to overcome the problem met in Example 6.15, we can use permutation to generate samples of equal sizes and then compute the Count Five statistic.

2.code

set.seed(12345)
## count5 function
count5test <- function(x, y) {
  X <- x - mean(x)
  Y <- y - mean(y)
  outx <- sum(X > max(Y)) + sum(X < min(Y))
  outy <- sum(Y > max(X)) + sum(Y < min(X)) 
  # return 1 (reject) or 0 (do not reject H0)
  return(as.integer(max(c(outx, outy)) > 5))
}
## generate samples with unequal sizes and same variance
n1 <- 20; n2 <- 30; N <- n1 + n2
mu1 <- mu2 <- 0; sigma1 <- sigma2 <- 1 
x <- rnorm(n1,mu1,sigma1); y <- rnorm(n2,mu2,sigma2)

#use permutation to estimate t1e with sample size of 20 and 30
z <- c(x,y); R <- 9999; t <- numeric(R)
for (i in 1:R) {
  k <- sample(1:N, size = N/2, replace = FALSE)
  x1 <- z[k]; y1 <- z[-k] # a new group of X and Y with the same size
  t[i] <- count5test(x1,y1)
}
p1 <- mean(t)# t1e

#generate another group of samples with sample size of 20 and 50
n1 <- 20; n2 <- 50; N <- n1+n2
mu1 <- mu2 <- 0; sigma1 <- sigma2 <- 1
x <- rnorm(n1,mu1,sigma1); y <- rnorm(n2,mu2,sigma2)

#use permutation to estimate t1e 
z <- c(x,y); R <- 9999; t <- numeric(R)
for (i in 1:R) {
  k <- sample(1:N, size = N/2, replace = FALSE)
  x1 <- z[k]; y1 <- z[-k] # a new group of X and Y with the same size
  t[i] <- count5test(x1,y1)
}
p2 <- mean(t)# t1e

# use Monte Carlo method to estimate the empirical t1e with sample sizes of 20 and 50
p <- numeric(100)
for(j in 1:100){
  x <- rnorm(n1,mu1,sigma1); y <- rnorm(n2,mu2,sigma2)

  z <- c(x,y); R <- 9999; t <- numeric(R)
  for (i in 1:R) {
    k <- sample(1:N, size = N/2, replace = FALSE)
    x1 <- z[k]; y1 <- z[-k]
    t[i] <- count5test(x1,y1)
  }
  p[j] <- mean(t)
}
p3 <- mean(p)# empirical t1e

# print the result
round(c(p1=p1,p2=p2,p3=p3),3)

3.my thoughts

We can find that the result is quite good. When the sample sizes are 20 and 30, we improve the original value in Example 6.15 from 0.1064 to 0.051, and when the sample sizes are 20 and 50, we improve it from 0.2934 to 0.065.

I used the Monte Carlo method to repeat the simulation above with n1 = 20 and n2 = 50 and obtained an empirical t1e of 0.072, which is quite good compared with the original t1e in Example 6.15.

As a result, the method with permutation is applicable for unequal sample sizes.

Answer of Ex

1.basic ideas

Use the distance correlation test and the ball covariance test to test independence under the two different models, and use the Monte Carlo method to estimate their powers and compare them.

2.code

set.seed(12345)
library(mvtnorm); library(Ball); library(boot)
seed <- 12345; m <- 1e2
mean <- c(0,0); sigma <- matrix(c(1,0,0,1),nrow =2)
## the statistic of distance correlation test
DCOR <- function(z,ix,dims) {
  p <- dims[1]
  q <- dims[2]
  d <- p + q
  x <- as.matrix(z[ , 1:p])
  y <- as.matrix(z[ix, -(1:p)])
  n <- nrow(x)
  m <- nrow(y)
  if (n != m || n < 2) stop("Sample sizes must agree")
  if (! (all(is.finite(c(x, y)))))
    stop("Data contains missing or infinite values")
  Akl <- function(x) {
    d <- as.matrix(dist(x))
    m <- rowMeans(d); M <- mean(d)
    a <- sweep(d, 1, m); b <- sweep(a, 2, m)
    b + M
  }
  A <- Akl(x)
  B <- Akl(y)
  dCov <- mean(A * B)
  dVarX <- sqrt(mean(A * A))
  dVarY <- sqrt(mean(B * B))
  dCor <- dCov / sqrt(dVarX * dVarY)
  return(dCor)
}

N <- seq(10, 150, by=10)# different sample sizes
power11 <- power12 <- power21 <- power22 <- numeric(length(N))
alpha <- 0.05

## model 1
## Monte Carlo method
for(i in 1:length(N)){
  X1 <- rmvnorm(N[i],mean,sigma); e1 <- rmvnorm(N[i],mean,sigma)
  Y1 <- X1/4+e1 #generate samples
  z <- cbind(X1,Y1); z1 <- as.matrix(z)

  p.values <- matrix(NA,m,2)

  for(j in 1:m){
    boot.obj <- boot(data = z1, statistic = DCOR, R = 99, sim ="permutation", dims = c(2, 2))# distance correlation test with permutation
    tb <- c(boot.obj$t0, boot.obj$t)
    p.values[j,1] <- mean(tb >= tb[1])

    p.values[j,2] <- bcov.test(X1,Y1,R=99,seed=j*seed)$p.value# ball covariance test
    }
  power11[i] <- colMeans(p.values < alpha)[1]
  power12[i] <- colMeans(p.values < alpha)[2]
}
## draw the graph about powers of two testing methods with model 1
par <- par(no.readonly = TRUE)
plot(N,power11,type = "b",xlab="n",ylab="power",main="Model 1",lty=2,col=2)
lines(N,power12,type="b",lty=2,col=3)
legend("topleft", legend = c("dcor", "ball"), col = 2:3,lty = c(2,2))


## model 2 (the same process with model 1)
for(i in 1:length(N)){
  X2 <- rmvnorm(N[i],mean,sigma); e2 <- rmvnorm(N[i],mean,sigma)
  Y2 <- (X2/4)*e2
  z <- cbind(X2,Y2); z2 <- as.matrix(z)

  p.values <- matrix(NA,m,2)
  for(j in 1:m){
    boot.obj <- boot(data = z2, statistic = DCOR, R = 99, sim = "permutation", dims = c(2, 2))
    tb <- c(boot.obj$t0, boot.obj$t)
    p.values[j,1] <- mean(tb>=tb[1])

    p.values[j,2] <- bcov.test(X2,Y2,R=99,seed=j*seed)$p.value
    }
  power21[i] <- colMeans(p.values < alpha)[1]
  power22[i] <- colMeans(p.values < alpha)[2]
}

plot(N,power21,type = "b",xlab="n",ylab="power",main="Model 2",lty=2,col=2)
lines(N,power22,type="b",lty=2,col=3)
legend("topleft", legend = c("dcor", "ball"), col = 2:3,lty = c(2,2))

3.my thoughts

We can find that, in Model 1, the power of the distance correlation test is no less than that of the ball covariance test, which means the distance correlation test is more suitable for Model 1.

However, in Model 2, the power of the ball covariance test is no less than that of the distance correlation test, which means the ball covariance test is more suitable for Model 2.

In a word, for different models we should choose different testing methods in order to get the best result.

Homework of 20191129

Question

9.4 Implement a random walk Metropolis sampler for generating the standard Laplace distribution (see Exercise 3.2). For the increment, simulate from a normal distribution. Compare the chains generated when different variances are used for the proposal distribution. Also, compute the acceptance rates of each chain.

Answer of 9.4

1.basic ideas

First we can calculate that the CDF of the standard Laplace distribution is $$F(x)=\left\{\begin{array}{ll}{\frac{1}{2} e^{x},} & {x<0} \\ {1-\frac{1}{2} e^{-x},} & {x \geq 0}\end{array}\right.$$ We then implement a random walk Metropolis sampler for generating the standard Laplace distribution and compare the resulting chains.
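
Because the normal proposal $N(x_t, \sigma^2)$ is symmetric, the acceptance probability of the random walk Metropolis sampler reduces to a ratio of target densities, which is exactly the comparison made in the code below: $$\alpha\left(x_{t}, y\right)=\min \left\{1, \frac{f(y)}{f\left(x_{t}\right)}\right\}=\min \left\{1, e^{\left|x_{t}\right|-|y|}\right\}, \qquad f(x)=\frac{1}{2} e^{-|x|}.$$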

2.code

set.seed(12345)
## random walk Metropolis sampler
rw.Metropolis <- function(sigma, x0, N) {
  x <- numeric(N)
  x[1] <- x0
  u <- runif(N)
  k <- 0
  for (i in 2:N) {
    y <- rnorm(1, x[i-1], sigma)
    if (u[i] <= ((1/2)*exp(-abs(y))) / ((1/2)*exp(-abs(x[i-1]))))
      x[i] <- y else {
        x[i] <- x[i-1]
        k <- k + 1
      }
  }
  return(list(x=x, k=k))
}

## samples generated by random walk Metropolis sampler with proposal distributions of different variances
N <- 2000; sigma <- c(0.05, 0.5, 2, 16); x0 <- 25
rw1 <- rw.Metropolis(sigma[1], x0, N)
rw2 <- rw.Metropolis(sigma[2], x0, N)
rw3 <- rw.Metropolis(sigma[3], x0, N)
rw4 <- rw.Metropolis(sigma[4], x0, N)

## plots of four chains
index <- 1:2000
refline <- c(log(2*0.025), -log(2*(1-0.975)))
for(i in 1:4){
  y <- rw.Metropolis(sigma[i], x0, N)$x
  plot(index,y,type = "l",ylab="x")
  abline(h=refline,lwd=1,col=2)
  title(paste("sigma=",sigma[i]))
}

## compare the quantiles
alpha <- seq(0.05,0.95,by=0.05)
Q <- c(log(2*alpha[1:10]), -log(2*(1-alpha[11:19])))# the true quantiles of Lapalace disribution
rw <- cbind(rw1$x, rw2$x, rw3$x, rw4$x)
mc <- rw[401:N, ]
Qrw <- apply(mc,2,function(x) quantile(x, alpha))# the sample quantiles of the four chains
qq <- data.frame(round(cbind(Q, Qrw), 3))
names(qq) <- c('True','sigma=0.05','sigma=0.5','sigma=2','sigma=16')
print(qq)

## acceptance rates
print(c(acceptRate1=1-rw1$k/N, acceptRate2=1-rw2$k/N, acceptRate3=1-rw3$k/N, acceptRate4=1-rw4$k/N))

3.my thoughts

In the first plot, the increments are small and the chain behaves almost like a true random walk, although its acceptance rate is very high.

The second and third chain converge to the target distribution after a short burn-in period of less than 400 and their quantiles are close to the true distribution. Besides, they have an acceptance rate higher than 0.5.

The fourth chain converges and its quantiles are close to the true distribution, but it is inefficient because the acceptance rate is just 0.1085.

Homework of 20191206

Question

11.1 The natural logarithm and exponential functions are inverses of each other, so that mathematically $log(e^x) = e^{log(x)}=x$. Show by example that this property does not hold exactly in computer arithmetic. Does the identity hold with near equality? (See all.equal.)

11.5 Write a function to solve the equation $$\frac{2 \Gamma\left(\frac{k}{2}\right)}{\sqrt{\pi(k-1)} \Gamma\left(\frac{k-1}{2}\right)}\int_{0}^{c_{k-1}}\left(1+\frac{u^{2}}{k-1}\right)^{-k / 2} d u $$ $$= \frac{2 \Gamma\left(\frac{k+1}{2}\right)}{\sqrt{\pi k} \Gamma\left(\frac{k}{2}\right)}\int_{0}^{c_{k}}\left(1+\frac{u^{2}}{k}\right)^{-(k+1) / 2} d u $$ for $a$, where $$c_{k}=\sqrt{\frac{a^{2} k}{k+1-a^{2}}}$$ Compare the solutions with the points $A(k)$ in Exercise 11.4.

Ex. A-B-O blood type problem

Let the three alleles be A, B, and O with allele frequencies p, q, and r. The 6 genotype frequencies under HWE and complete counts are as follows. $$\begin{array}{|c|c|c|c|c|c|c|c|}\hline \text { Genotype } & {AA} & {BB} & {OO} & {AO} & {BO} & {AB} & {\text { Sum }} \\ \hline \text { Frequency } & {p^{2}} & {q^{2}} & {r^{2}} & {2 p r} & {2 q r} & {2 p q} & {1} \\ \hline \text { Count } & {n_{AA}} & {n_{BB}} & {n_{OO}} & {n_{AO}} & {n_{BO}} & {n_{AB}} & {n} \\ \hline\end{array}$$ Observed data: $n_{A \cdot}=n_{AA}+n_{AO}=28$ (A-type), $n_{B \cdot}=n_{BB}+n_{BO}=24$ (B-type), $n_{OO}=41$ (O-type), $n_{AB}=70$ (AB-type).

Use EM algorithm to solve MLE of p and q (consider missing data nAA and nBB). Show that the log-maximum likelihood values in M-steps are increasing via line plot.

Answer of 11.1

1.basic ideas

Exact mathematical equality does not always hold in computer arithmetic, because numbers are stored in a finite base-2 (floating-point) representation. We use "isTRUE" to check whether two numbers are exactly equal and "all.equal" to check whether they are nearly equal.

2.code

isTRUE(exp(log(0.1))==log(exp(0.1))); isTRUE(exp(log(0.1))==0.1); isTRUE(log(exp(0.1))==0.1)

isTRUE(all.equal(exp(log(0.1)),log(exp(0.1)))); isTRUE(all.equal(exp(log(0.1)),0.1)); isTRUE(all.equal(log(exp(0.1)),0.1))

3.conclusion

The identity $log(e^x) = e^{log(x)}=x$ does not hold exactly in computer arithmetic, however, it holds with near equality.
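
As a small follow-up (a sketch; the exact values depend on the platform's floating-point arithmetic), printing the raw differences next to the machine epsilon makes the size of the rounding error visible.

x <- 0.1
print(exp(log(x)) - x)     # raw difference, typically on the order of 1e-17 (possibly exactly 0)
print(log(exp(x)) - x)
print(.Machine$double.eps) # machine epsilon; all.equal's default tolerance is about its square root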

Answer of 11.5

1.basic ideas

We use function "uniroot" to calculate the roots of two equations in 11.4 and 11.5 with different values of parameter k and compare them.

2.code

## answer of 11.4
k <- c(4:25,100,500,1000)# different values of parameter k (df of t distribution)
solution1 <- numeric(length(k))
for(i in 1:length(k)){
# the one-dimensional nonlinear function 
  f <- function(a) {pt(sqrt(a^2*(k[i]-1)/(k[i]-a^2)),k[i]-1)-pt(sqrt(a^2*k[i]/(k[i]+1-a^2)),k[i])}
  solution1[i] <- uniroot(f,lower=0+1e-6,upper=sqrt(k[i])-1e-6)$root
}
round(solution1,4)

## answer of 11.5
k <- c(4:25,100,500,1000)
solution2 <- numeric(length(k))
for(i in 1:length(k)){
# the upper value of integrate function
  ck1 <- function(a) sqrt(a^2*(k[i]-1)/(k[i]-a^2))
# the left function of the equation 
  g1 <- function(a) 2*exp(lgamma(k[i]/2)-log(sqrt(pi*(k[i]-1)))-lgamma((k[i]-1)/2))*(integrate(function(u) {(1+u^2/(k[i]-1))^(-k[i]/2)},lower = 1e-6,upper = ck1(a)-1e-6)$value)
  ck2 <- function(a) sqrt(a^2*(k[i])/(k[i]+1-a^2))
  g2 <- function(a) 2*exp(lgamma((k[i]+1)/2)-log(sqrt(pi*k[i]))-lgamma(k[i]/2))*(integrate(function(u) {(1+u^2/(k[i]))^(-(k[i]+1)/2)},lower = 1e-6,upper = ck2(a)-1e-6)$value)

  # the one-dimensional nonlinear function
  f <- function(a) {g1(a)-g2(a)}
  solution2[i] <- uniroot(f,lower=0+1e-2,upper=sqrt(k[i])-1e-2)$root
}
round(solution2,4)

3.my thoughts

We can find that when k is not large, the roots of the two equations in 11.4 and 11.5 agree; however, when k is larger than 22, the roots in 11.4 become wrong, because the computer cannot handle the very large numbers that appear in the calculation.

As a result, we can use the identity $e^{\log(x)}=x$ to compute the ratio of two large numbers on the log scale, as is done in the code for 11.5, and thus obtain the correct answers.
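
To illustrate the point (a small sketch, not part of the original answer): gamma() overflows for moderately large arguments, while the same ratio computed on the log scale with lgamma() stays finite.

gamma(200) / gamma(199)        # Inf / Inf = NaN, because both factors overflow
exp(lgamma(200) - lgamma(199)) # computed on the log scale, returns 199 as expected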

Answer of Ex

1.basic ideas

We use the EM (Expectation–Maximization) algorithm to find maximum likelihood estimates when data are incomplete.
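
Concretely, with $r=1-p-q$, the E-step replaces the missing counts by their conditional expectations given the observed phenotype counts and the current estimates, $$E\left[n_{AA} \mid n_{A \cdot}\right]=28 \cdot \frac{p^{2}}{p^{2}+2 p r}, \qquad E\left[n_{BB} \mid n_{B \cdot}\right]=24 \cdot \frac{q^{2}}{q^{2}+2 q r},$$ and the M-step then maximizes the resulting complete-data log-likelihood over $(p, q)$.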

2.code

library("stats4")
## the log likelihood function
ll <- function(p,nAA,nBB) -{nAA*log(p[1]^2)+nBB*log(p[2]^2)+41*log((1-p[1]-p[2])^2)+(28-nAA)*log(2*p[1]*(1-p[1]-p[2]))+(24-nBB)*log(2*p[2]*(1-p[1]-p[2]))+70*log(2*p[1]*p[2])}
tol <- 1e-10# convergence condition
N <- 100# max numbers of iterations
lmlv <- numeric(N)# log-maximum likelihood values
p <- c(0.1,0.1)# initial estimate of the target parameter
par <- matrix(nrow= N ,ncol = 2); par[1,] <- p
for(i in 1:N){
  # the E step
  nAA <- 28*p[1]^2/(p[1]^2+2*p[1]*(1-p[1]-p[2]))
  nBB <- 24*p[2]^2/(p[2]^2+2*p[2]*(1-p[1]-p[2]))
  lmlv[i]<-ll(p,nAA,nBB)
  # the M step
  par[i+1,] <- optim(par=p,fn=ll,nAA=nAA,nBB=nBB)$par
  p <- par[i+1,]
  if (sum(abs(par[i+1,]-par[i,])) < tol) break
}
print(round(c(p_mle=p[1],q_mle=p[2]),5))# MLE of p and q
index <- c(1:i)
plot(index,-lmlv[index],type="b",ylab="log-maximum likelihood values",xlab="iteration numbers")

3.my thoughts

The MLE of p is 0.32735 and the MLE of q is 0.31040 by EM algorithm and we can find that the log-maximum likelihood values in M-steps are increasing via the line plot.

Homework of 20191213

11.1.2 Exercises

Question 3

Use both for loops and lapply() to fit linear models to the mtcars using the formulas stored in this list:

formulas <- list( mpg ~ disp, mpg ~ I(1 / disp), mpg ~ disp + wt, mpg ~ I(1 / disp) + wt)

Answer

attach(mtcars)
formulas <- list( mpg ~ disp, mpg ~ I(1 / disp), mpg ~ disp + wt, mpg ~ I(1 / disp) + wt)

## for loops
for(i in seq_along(formulas)){
  print(lm(formulas[[i]]))
}

## lapply
lapply(formulas, lm)

detach(mtcars)

Question 4

Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loops and lapply().

Can you do it without an anonymous function?

bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})

Answer

bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
  })
## with an anonymous function
# use for loops
for(i in seq_along(bootstraps)){
  print(lm(bootstraps[[i]]$mpg ~ bootstraps[[i]]$disp))
}
# use lapply
lapply(bootstraps, lm, formula = mpg ~ disp)

## use lapply without an anonymous function
fit <- function(x){
  lm(mpg ~ disp,data=x)
}
lapply(bootstraps, fit)

Question 5

For each model in the previous two exercises, extract $R^2$ using the function below.

rsq <- function(mod) summary(mod)$r.squared

Answer

attach(mtcars)
rsq <- function(mod) summary(mod)$r.squared
## R2 in ex3
formulas <- list(mpg ~ disp, mpg ~ I(1 / disp), mpg ~ disp + wt, mpg ~ I(1 / disp) + wt)

# for loops
for(i in seq_along(formulas)){
  print(rsq(lm(formulas[[i]])))
}
# lapply
lapply(lapply(formulas,lm),rsq)

## R2 in ex4
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
  })

# for loops
for(i in seq_along(bootstraps)){
  print(rsq(lm(bootstraps[[i]]$mpg ~ bootstraps[[i]]$disp)))
}
# lapply
lapply(lapply(bootstraps, lm, formula = mpg ~ disp),rsq)
detach(mtcars)

11.2.5 Exercises

Question 3

The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.

trials <- replicate(100, t.test(rpois(10, 10), rpois(7, 10)), simplify = FALSE)

Extra challenge: get rid of the anonymous function by using [[ directly.

Answer

trials <- replicate(
  100,
  t.test(rpois(10, 10), rpois(7, 10)),
  simplify = FALSE
)
## use sapply() and an anonymous function
sapply(trials, function(test) test$p.value)

## use [[ directly
sapply(trials, '[[', 3)

Question 7

Implement mcsapply(), a multicore version of sapply(). Can you implement mcvapply(), a parallel version of vapply()? Why or why not?

Answer

library(parallel)
cl <- makeCluster(getOption("cl.cores", 2))

bootstraps <- lapply(1:100000, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, c(1,3)]
})
## for sapply()
system.time(sapply(bootstraps, lm))
## for parSapply
parSapply(cl, 1:10, sqrt)
system.time(parSapply(cl,bootstraps, lm))

My thoughts

Since we can compute each element in any order, it’s easy to dispatch the tasks to different cores, and compute them in parallel, so we can use parSapply.

However, there is no direct way to implement a multicore version of vapply(); I guess this is because vapply() needs a pre-specified type and length of return value, which has to be checked for every element and therefore cannot simply be computed in parallel.
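
For completeness, a minimal sketch of mcsapply() built on parallel::mclapply() is given below; the function name and arguments are my own choice, and the fork-based backend means it runs in parallel on Unix-alikes only (mclapply requires mc.cores = 1 on Windows).

mcsapply <- function(X, FUN, ..., mc.cores = 2L) {
  # apply FUN over X in parallel, then simplify the result like sapply() does
  res <- parallel::mclapply(X, FUN, ..., mc.cores = mc.cores)
  simplify2array(res)
}
# example usage: mcsapply(1:10, sqrt, mc.cores = 2)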

Homework of 20191220

Question

You have already written an R function for Exercise 9.4 (page 277, Statistical Computing with R). Rewrite an Rcpp function for the same task.

Compare the generated random numbers by the two functions using qqplot.

Compare the computation time of the two functions with microbenchmark.

Comments your results.

Answer

library(Rcpp)
library(microbenchmark)
set.seed(135)

## random walk Metropolis sampler in R
rwMetropolisR <- function(sigma, x0, N) {
  x <- numeric(N)
  x[1] <- x0 # R vectors are 1-indexed
  u <- runif(N)
  k <- 0
  for (i in 2:N) {
    y <- rnorm(1, x[i-1], sigma)
    if (u[i] <= ((1/2)*exp(-abs(y))) / ((1/2)*exp(-abs(x[i-1]))))
      x[i] <- y else {
        x[i] <- x[i-1]
        k <- k + 1
      }
  }
  return(list(x=x, k=k))
}

## random walk Metropolis sampler in C++
cppFunction('List rwMetropolisC(double sigma, double x0, int N) {
            NumericVector x(N);
            x[0] = x0;
            int k = 0;
            for(int i = 1; i < N; i++){
              double u = as<double>(runif(1));
              double y = as<double>(rnorm(1, x[i-1], sigma));
              if(u <= exp(-fabs(y)+fabs(x[i-1]))) // fabs takes the absolute value of a double
              x[i] = y;
              else {
                x[i] = x[i-1];
                k += 1;
              }
            }
            return List::create(x,k);
          }')

N <- 2000; sigma <- c(0.05, 0.5, 2, 16); x0 <- 25

# plots of four chains in C++
index <- 1:2000; refline <- c(log(2*0.025), -log(2*(1-0.975)))
for(i in 1:4){
  y <- unlist(rwMetropolisC(sigma[i], x0, N)[1])
  plot(index, y, type = "l", ylab="x")
  abline(h = refline, lwd = 1, col = 2)
  title(paste("sigma=", sigma[i]))
}

# acceptance rates in C++
acceptRateC <- numeric(4)
for(i in 1:4){
  acceptRateC[i] <- 1-(as.numeric(rwMetropolisC(sigma[i], x0, N)[2]))/N
}
print(c(acceptRateC1 = acceptRateC[1], acceptRateC2 = acceptRateC[2], acceptRateC3 = acceptRateC[3], acceptRateC4 = acceptRateC[4]))

# qqplot
xR <- rwMetropolisR(2, 25, 10000)$x[1000:10000]
xC <- unlist(rwMetropolisC(2, 25, 10000)[1])[1000:10000]
qqplot(xR, xC, main="qqplot", xlab = "rwMetropolisR", ylab = "rwMetropolisC")
abline(a = 0, b = 1)

# computation time
ts <- microbenchmark(xR = rwMetropolisR(2, 25, 10000), xC = rwMetropolisC(2, 25, 10000))
summary(ts)[, c(1,3,5,6)]

Comments

For me, it's easier to solve this question in R because I'm not quite familiar with the usage of C and C++.

The qqplot shows that the random numbers generated by the two functions are very close to each other. However, the computation times show an evident speed gain of C++ over R.


