knitr::opts_chunk$set(comment = "#", warning = TRUE, eval = TRUE, message = FALSE)
library(SC19090)

Section 1: Illustration of the usage of my functions

Jonckheere-Terpstra Test

Jonckheere-Terpstra test is a nonparametric test used to test the location parameters of multi-group data. We can easily implement the test using jonckheere.test function in clinfun package. However, it will report warnings if the input data has ties. See the example below:

library(clinfun,quietly = TRUE)
medical=c(125,136,116,101,105,109,
122,114,132,120,119,127,
128,142,128,134,135,132,140,129)
gr.medical=c(rep(1,6),rep(2,6),rep(3,8))
jonckheere.test(medical,gr.medical)

Therefore, I rewrote an extended version of jonckheere.test, which can automatically detect the existence of ties in the data and call the corresponding function.

data(m)
attach(m)
jh_extend(medical,gr)

Functions for solving confidence interval

I also wrote two functions to solve the confidence interval of sample mean by methods of empirical likelihood method and nonparametric delta method. Consider such an example:

Suppose the mean value for X is zero. For samples $X_1,...,X_n$, use empirical likelihood method and nonparametric delta method to get the 95% confidence interval for $Var(X)$. Sample from normal distribution and compare the results under n=500,5000,50000.

Then, we can solve the prolem by using functions CI_ELR and CI_NDM

set.seed(7524)
n=c(500,5000,50000)
int=matrix(0,ncol=2,nrow=3)
lik=matrix(0,ncol=2,nrow=3)
rownames(int)=rownames(lik)=paste("n=",n)
colnames(int)=colnames(lik)=c("left","right")
up=c(1.23,1.07,1.07)
for (i in 1:3){
 x=rnorm(n[i])
 y=x^2 #turn var into mean
 int[i,]=CI_NDM(y,0.05)
 lik[i,]=CI_ELR(y,0.05,up[i])
}
print("95% Confidence Interval for Nonparametric Delta Method");int
print("95% Confidence Interval for Empirical Likelihood Ratio");lik

Section 2: Answers for all the homework

Homework-1 9.20

Question: Use knitr to produce texts, figures and tables.

Answers:

library(lattice)
library(dplyr)
library(knitr)

1.Texts:

\ \ This is my first homework for this year Statistical Computing. I wrote this paragraph to show the use of texts in "knitr". We always end a line with two spaces.
\ \ We use two star to have italic form: Happy and we use two stars to have bold one: Happy. What's more, if you want the delete line style,you can use tilde: ~~This is the delete line form~~
\ \ We can also display math formula in this form $a^2+b^2=c^2$

2.Figures:

n <- seq(5, 45, 5)
x <- rnorm(sum(n))
y <- factor(rep(n, n), labels=paste("n =", n))
densityplot(~ x | y,
panel = function(x, ...) {
panel.densityplot(x, col="DarkOliveGreen", ...)
panel.mathdensity(dmath=dnorm,
args=list(mean=mean(x), sd=sd(x)),
col="darkblue")
})

3.Tables:

dt <- mtcars[1:5, 1:6]
kable(dt)

Homework-2 9.29

3.4

Develop an algorithm to generate random samples from a Rayleigh(σ) distribution with different values of sigma. Using histogram to check the mode of the generated samples is close to the theoretical mode $\sigma$.

Answers:

cbPalette = c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", 
              "#D55E00", "#CC79A7")
gen_Raylei= function(sigma){
 n = 1000;
 u = runif(n);
 x = sqrt(-2*sigma^2*log(1-u));
 hist(x, prob = TRUE,ylim=c(0,0.65),main=paste("Sigma = ",sigma))
 y = seq(0,ceiling(max(x)),.01)
 lines(y,y/sigma^2*exp(-y^2/2/sigma^2),col=cbPalette[sigma],lwd=2)
}
par(mfrow=c(1,1))
for (sigma in 1:4){
 gen_Raylei(sigma)
}

I use the inverse transform method to generate the random numbers. It can be clearly seen that the mode of simulation is close to the theorectical mode $\sigma$.

3.11

Generate a random sample of size 1000 from a normal location mixture. Graph the histogram of the sample with density superimposed, for $p_1 = 0.75$. Repeat with different values for $p_1$ and observe whether the empirical distribution of the mixture appears to be bimodal. Make a conjecture about the values of $p_1$ that produce bimodal mixtures.

Answers:

gen_Mixnorm = function(p1){
 n = 1e4
 X1 = rnorm(n,0,1)
 X2 = rnorm(n,3,1)
 r = sample(c(0,1),n,prob=c(1-p1,p1),replace=TRUE)
 Z = r*X1+(1-r)*X2
 hist(Z,prob=TRUE,main=paste(p1,"X1+",1-p1,"X2"),ylim=c(0,0.8))
 y = seq(-4,6,.01)
 lines(y,p1*dnorm(y,0,1)+(1-p1)*dnorm(y,3,1),col=cbPalette[sample(seq(1:8),1,replace=F)],lwd=2)
}
par(mfrow=(c(1,1)))
gen_Mixnorm(0.25)
gen_Mixnorm(0.5)
gen_Mixnorm(0.75)
gen_Mixnorm(1)

When $p_1=0.5$, the mixture appears to be bimodal.

3.18

Write a function to generate a random sample from a $W_d(\Sigma,n)$ (Wishart) distribution for $n>d+1\geq 1$, based on Bartlett’s decomposition.

Answers:

gen_Wisha = function (d,n,Sigma){
 T=matrix(0,nrow=d,ncol=d)
 for (i in 1:d){
  for (j in 1:d){
   if (i>j) T[i,j]=rnorm(1)
   else if (i==j) T[i,i]=sqrt(rchisq(1,n-i+1))
        else next
  }
 }
 L=t(chol(Sigma)) #Choleski decomposition
 return(L%*%T%*%t(T)%*%t(L))
}
Sigma=matrix(c(3,2,2,4),nrow=2)
gen_Wisha(2,5,Sigma)

I choose the sample as a $W_2(5,\Sigma)$,$\Sigma=\bigl( \begin{smallmatrix} 3 & 2\ 2& 4 \end{smallmatrix} \bigr)$, which satisfies $5>2+1\geq 1$. The function can be further tested in more complex samples.

Homework-3 10.11

5.1

Compute a Monte Carlo estimate of $$\int_{0}^{\frac{\pi}{3}}sintdt$$ and compare your estimate with the exact value of the integral.

set.seed(1234)
m = 1e4;
x = runif(m,min=0,max=pi/3)
theta.hat = mean(sin(x)*pi/3)
re=matrix(c(theta.hat,cos(0)-cos(pi/3)),nrow=1)
re=round(re,4)
colnames(re)=c("Estimate","Exact Value")
re

The difference between estimation and exact value is 0.0006, which is acceptable.

5.10

Use Monte Carlo integration with antithetic variables to estimate $$\int_{0}^{1}\frac{e^{-x}}{1+x^2}dx$$ and find the approximate reduction in variance as a percentage of the variance without variance reduction.

set.seed(2222)
MC.Phi <- function(R = 1e4, anti = TRUE) {
  u <- runif(R/2)
  if (!anti) v <- runif(R/2) else v <- (1-u)
  u <- c(u, v)
  MC = mean(exp(-u)/(1+u^2))
  return(MC)
  }
m=1000;
MC.de=MC=numeric(m)
for (i in 1:m){
 MC.de[i]=MC.Phi(R=1000)
 MC[i]=MC.Phi(R=1000,anti=FALSE)
}
print("Approximate reduction in variance: ")
(var(MC)-var(MC.de))/var(MC)

Using antithetic variables, the variance is reduced by 97%.

5.15

Obtain the stratified importance sampling estimate in Example 5.13 and compare it with the result of Example 5.10. The integral is the same as 5.10: $$\int_{0}^{1}\frac{e^{-x}}{1+x^2}dx$$

set.seed(5678)
M = 1e4; k = 5 
r = M/k             #replicates per stratum
N = 50              #number of times to repeat the estimation
T2 = numeric(k)     #Store the results of each stratum
est = matrix(0, N, 2)
g = function(x)exp(-x)/(1+x^2)*(x>0)*(x<1) 
f1 = function(x)exp(-x)/(1-exp(-1))
f2 = function(x)exp(-x)/(exp(-(j-1)/k)-exp(-j/k))

for (i in 1:N) {
  u = runif(M);x = -log(1-u*(1-exp(-1))) #u~U[0,1]; inverse trans; 
  est[i, 1] <- mean(g(x)/f1(x))            
  for(j in 1:k){
   u = runif(r)  
   x = -log(exp(-(j-1)/k)-(exp(-(j-1)/k)-exp(-j/k))*u) 
   T2[j]<-mean(5*g(x)/f2(x))
  }
  est[i,2] <- mean(T2)
}
re = cbind(apply(est,2,mean),apply(est,2,sd))
rownames(re)=c("Important Sampling","Im Sa + Str Sa")
colnames(re)=c("theta_hat","sd")
knitr::kable(re)

It can be seen that the standard error of stratified important sampling is 1/5 of the standard error of important sampling.

Homework-4 10.18

6.5

Use a Monte Carlo experiment to estimate the coverage probability of the t-interval for random samples of $\chi^2(2)$ data with sample size $n = 20$. Compare your t-interval results with the simulation results in Example 6.4. $E(\chi^2(2))=2\quad Var(\chi^2(x))=4$

Coverage Probability: $$\frac{1}{m}\sum_{i=1}^{m}I(\bar{X}-t_{0.975,n-1}\frac{sd(X)}{\sqrt{n}}<2<\bar{X}+t_{0.975,n-1}\frac{sd(X)}{\sqrt{n}})$$ $$\frac{1}{m}\sum_{i=1}^{m}I(\frac{(n-1)Var(X)}{\chi^2_{0.05}}>4)$$

n=20;m=1000
set.seed(5557)


CP = replicate(m,expr={
 x = rchisq(n,df=2)
 d = qt(.975,n-1)*sd(x)/sqrt(n)
 c(mean(x)-d,mean(x)+d)
 })
mean(2> CP[1,]& 2< CP[2,])

UCL = replicate(m, expr ={
  x = rchisq(n,df=2)
 (n-1)*var(x)/qchisq(.05,df=n-1)
})

mean(UCL>4)

The coverage probability for t-interval is 0.938 and the coverage probability for variance interval is 0.801. It's obvious that the CP for t-interval is much closer to 0.95. Therefore, the t-interval is more robust to departures from normality than the interval for variance.

6.6

Estimate the 0.025, 0.05, 0.95, and 0.975 quantiles of the skewness $\sqrt{b_1}$ under normality by a Monte Carlo experiment. $$\sqrt(b_1)=\frac{\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^3}{(\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2)^{3/2}}$$ Compute the standard error of the estimates from (2.14) $$Var(\hat{x_q})=\frac{q(1-q)}{nf^2(x)}$$ using the normal approximation for the density (with exact variance formula) $$Var(\sqrt{b_1})=\frac{6(n-2)}{(n+1)(n+3)}$$ Compare the estimated quantiles with the quantiles of the large sample approximation. $$\sqrt{b_1}\approx N(0,\frac{6}{n})$$

set.seed(1234)
sk = function(x) {
 #computes the sample skewness coeff.
 xbar <- mean(x)
 m3 <- mean((x - xbar)^3)
 m2 <- mean((x - xbar)^2)
 return( m3 / m2^1.5 )
}
m=1000;n=100
stat = replicate(m, expr={
 x = rnorm(n)
 y = sk(x)
})
q = c(.025,.05,.95,.975)
x = quantile(stat,q)

sd_sk = function(x,q,n){
 sv = 6*(n-2)/(n+1)/(n+3)
 var_q=q*(1-q)/n/dnorm(x,mean=0,sd=sqrt(sv))
 return(sqrt(var_q))
}
print('The standard error of the estimates:')
sd_sk(x,q,n)
cv = qnorm(q,0,sqrt(6/n))
quat=rbind(x,cv,x-cv)
rownames(quat)=c('estimated quantiles','quat of large sample approx','bias (est-app)')
knitr::kable(quat)

It can be seen that the estimated quantiles are close to the approximate ones. And it should be noticed that the values of quantiles depend on n. Thus it will have different values for quantiles when choosing different n. The scoring criteria should not be based on the absolute-unchanging values of quantiles and it should be compared with the approximate ones to decide whether it's correct or not.

Homework-5 11.1

6.7 Estimate the power of the skewness test of normality against symmetric Beta$(\alpha, \alpha)$ distributions and comment on the results. Are the results different for heavy-tailed symmetric alternatives such as t$(\nu)$?

1.Select a particular value of the parameter $\theta_1\in \Theta$
2.For each replicate,indexed by j=1...m:
(a) Generate the $j^{th}$ n random sample under $\theta=\theta_1$
(b) Compute the test statistics $T_j$
(c) Set $I_j=1$ if $H_0$ is rejected at significance level $\alpha$, and otherwise set $I_j=0$
3.Compute the proportion of significant test $\hat{\pi}(\theta_1)=\frac{1}{m}\sum_{j=1}^{m}I_j$

set.seed(1234)
sk = function(x) {
  xbar = mean(x)
  m3 = mean((x - xbar)^3)
  m2 = mean((x - xbar)^2)
  return( m3 / m2^1.5 )
}

pwr_beta = function(a){
 alpha = .1;n = 30;m = 10000;N = length(a)
 pwr = numeric(N)
 cv = qnorm(1-alpha/2, 0, sqrt(6*(n-2) / ((n+1)*(n+3))))
 for (j in 1:N) { #for each a
  sktests = numeric(m)
  for (i in 1:m) { 
   x = rbeta(n, a[j], a[j])
   sktests[i] = as.integer(abs(sk(x))>= cv)
  }
  pwr[j] = mean(sktests)
 }
 se = sqrt(pwr * (1-pwr) / m) #add standard errors
 plot(a, pwr, type = "b",xlab = "a",ylab="pwr", ylim = c(0,0.12),pch=20,
 main=paste('a from',min(a),'to',max(a)))
 abline(h = .1, lty = 2)
 lines(a, pwr+se, lty = 3)
 lines(a, pwr-se, lty = 3)
 text(a[ceiling(0.3*N)],0.105,paste("alpha =",alpha))
}

a1 = c(seq(0,1,0.1))
a2 = c(seq(1,20,2),seq(20,100,5))
par(mfrow=c(1,1))
pwr_beta(a1);pwr_beta(a2)

The power of the skewness test of normality against symmetric Beta($\alpha,\alpha$) distribution is under 0.1. The behaviors for a>1 and a<1 are different. The empirical power is more and more far away from 0.1 With the increase of a when a<1. The empirical power for a>1 increases to 0.1 when a gets larger. Generally, the empirical powers for the two cases are smaller than 0.1 (confidence level).

pwr_t = function(v){
 alpha = .1;n = 30;m = 2500;N = length(v)
 pwr = numeric(N)
 cv = qnorm(1-alpha/2, 0, sqrt(6*(n-2) / ((n+1)*(n+3))))
 for (j in 1:N) { #for each v
  sktests = numeric(m)
  for (i in 1:m) { 
   x = rt(n,v[j])
   sktests[i] = as.integer(abs(sk(x))>= cv)
  }
  pwr[j] = mean(sktests)
 }
 se = sqrt(pwr*(1-pwr) / m) #add standard errors
 plot(v, pwr, type = "b",xlab = "v",ylab="pwr",pch=20,ylim=c(0,1),
 main="Power of skewness test against t(v)")
 abline(h = .1, lty = 2)
 lines(v, pwr+se, lty = 3)
 lines(v, pwr-se, lty = 3)
 text(v[ceiling(0.8*N)],0.8,paste("alpha =",alpha))
}
v=seq(1,20);pwr_t(v)

The results are different for heavy-tailed symmetric alternatives. When for t distribution, the empirical power is always bigger than 0.1 (the confidence level we set) and it decreases to 0.1 with the increase of v. When for Beta distribution, the empirical power increases monotonously to 0.1. This is because Beta($\alpha,\alpha$) is a symmetric distribution with zero skewness like normal distribution. Thus the power for Beta distribution is small and it is hard to detect the non-normality. But in heavy-tailed t distribution case, the power is large. When v grows bigger, the tail becomes lighter, and the power decrease. So, we can say that the power of skewness test is heavily influenced by heavy-tailed.

6.A Use Monte Carlo simulation to investigate whether the empirical Type I error rate of the t-test is approximately equal to the nominal significance level $\alpha$, when the sampled population is non-normal.

$$H_0: \mu=\mu_0\qquad vs\qquad H_a: \mu\ne\mu_0$$ $$E(\chi^2(1))=E(U(0,2))=E(Exp(1))=1=\mu_0$$ $$T=\frac{\sqrt{n}(\bar{X}-\mu_0)}{S}\sim t_{n-1}\qquad p\ value = 2P(T>|T^{(o)}|)$$ $$Type-I\ error: P_{H_0}(p\ value\le \alpha)$$

(1)$\chi^2(1)$

m = 1e4; n = 10; mu0 = 1 
mu.hat = mu.se = p.val1 = numeric(m)
 for(i in 1:m){
 x = rchisq(n,1)
 mu.hat[i] <- mean(x);mu.se[i] <- sd(x)
 p.val1[i] <- t.test(x,mu=1)$p.value
}
T=sqrt(n)*(mu.hat-mu0)/mu.se
p.val2=2*pt(abs(T),(n-1),lower.tail=FALSE)
print(c(mean(p.val1<=0.05),mean(p.val2<=0.05)))

When samples come from $\chi^2$ distribution, the empirical t1e is far away from nominal significance level 0.05. Thus choosing p value $\le 0.05$ under this circumstance can't ensure the type-I error under nominal significance level.

(2)$U(0,2)$

for(i in 1:m){
 x=runif(n,0,2)
 mu.hat[i] <- mean(x);mu.se[i] <- sd(x)
 p.val1[i] <- t.test(x,mu=1)$p.value
}
T=sqrt(n)*(mu.hat-mu0)/mu.se
p.val2=2*pt(abs(T),(n-1),lower.tail=FALSE)
print(c(mean(p.val1<=0.05),mean(p.val2<=0.05)))

When samples come from uniform distribution, the empirical t1e is close to nominal significance level 0.05. Therefore, under this case, choosing p value $\le 0.05$ can efficiently control the type-I error.

(3)Exp(1)

for(i in 1:m){
 x=rexp(n,1)
 mu.hat[i] <- mean(x);mu.se[i] <- sd(x)
 p.val1[i] <- t.test(x,mu=1)$p.value
}
T=sqrt(n)*(mu.hat-mu0)/mu.se
p.val2=2*pt(abs(T),(n-1),lower.tail=FALSE)
print(c(mean(p.val1<=0.05),mean(p.val2<=0.05)))

When samples come from exponential distribution, the empirical t1e is far from nominal significance level 0.05, too, but it is not that worse compared with $\chi^2$ distribution. Choosing p value $\le 0.05$ can't ensure type-I error under nominal significance level.

3. If we obtain the powers for two methods under a particular simulation setting with 10,000 experiments: say, 0.651 for one method and 0.676 for another method. Can we say the powers are different at 0.05 level?

We should use the McNemar test.

Homework-6 11.8

7.6 Use a panel display to display the scatter plots for each pair of test scores. Compare the plot with the sample correlation matrix. Obtain bootstrap estimates of the standard errors for each of the following estimates: $\rho_{12}=\rho(mec,vec), \rho_{34}=\rho(alg,ana), \rho_{35}=\rho(alg,sta), \rho_{45}=\rho(ana,sta)$ .

The standard error of $\rho$ can be estimated with the sample standard error of $\rho^*$, I carried out 2000 times of bootstrap, computed the correlation coefficient and estimated the standard error of the bootstrap correlation coefficients.

set.seed(8888)
data(scor, package = "bootstrap")
scor=data.matrix(scor)
pairs(scor,main='Scatter plots for each pair of test scores',pch=20)
cor_mat = function(x){
 n=dim(x)[1]
 one=rep(1,n);dim(one)=c(n,1);ONE=diag(n)
 Po=one%*%t(one)/n
 S=t(x)%*%(ONE-Po)%*%x/(n-1)    
 D=c();for (i in 1:dim(x)[2]) {D=c(D,S[i,i])};D=diag(D)
 R=sqrt(solve(D))%*%S%*%sqrt(solve(D))
 colnames(R)=rownames(R)=colnames(x)  
 return(R=round(R,4))
}
cor_mat(scor)

It can be seen from the scatter plot that there is a positive correlation between each two variables. The only difference is that some of the positive correlations are a little bit weak, e.g. mec v.s sta, mec v.s ana. We can clearly see that all the values in sample correlation matrix are above zero, thus the positive correlation is true. And the values for mec v.s sta is little smaller than other pairs. Therefore, the scatter plots are correspond with the sample correlation matrix.

rho_12 = function(x,i) cor(x[i,1],x[i,2])
rho_34 = function(x,i) cor(x[i,3],x[i,4])
rho_35 = function(x,i) cor(x[i,3],x[i,5])
rho_45 = function(x,i) cor(x[i,4],x[i,5])

r12_b = boot::boot(data=scor,statistic=rho_12,R=2000)
r34_b = boot::boot(data=scor,statistic=rho_34,R=2000)
r35_b = boot::boot(data=scor,statistic=rho_35,R=2000)
r45_b = boot::boot(data=scor,statistic=rho_45,R=2000)

OUT=matrix(c(sd(r12_b$t),sd(r34_b$t),sd(r35_b$t),sd(r45_b$t)),nrow=1,byrow=T)
OUT=round(OUT,4)
colnames(OUT)=c("p_12","p_34","p_35","p_45");rownames(OUT)="standard error"
knitr::kable(OUT)

The standard error estimation is showed above.

7.B Repeat Project 7.A for the sample skewness statistic. Compare the coverage rates for normal populations (skewness 0) and $\chi^2(5)$ distributions (skewness $\sqrt{\frac{8}{5}}$).

$$Empirical\ coverage\ rates:\ P(CI[1]<=sk<=CI[2])$$ $$Proportion\ miss\ on\ the\ left:\ P(skCI[2])$$

SK=0
boot_sk = function(x,i) {
  xbar = mean(x[i])
  m3 = mean((x[i] - xbar)^3)
  m2 = mean((x[i] - xbar)^2)
  return( m3 / m2^1.5 )
}
n=10;m=1000;
CI.n=CI.b=CI.p=matrix(0,m,2)
for(i in 1:m){
 x=rnorm(n)
 b = boot::boot(data=x,statistic=boot_sk, R = 1000)
 CI = boot::boot.ci(b,type=c("norm","basic","perc"))
 CI.n[i,]=CI$norm[2:3]
 CI.b[i,]=CI$basic[4:5]
 CI.p[i,]=CI$percent[4:5]
}
out=out2=out3=matrix(0,nrow=2,ncol=3)
out[1,]=c(mean(CI.n[,1]<=SK & CI.n[,2]>=SK),mean(CI.b[,1]<=SK & CI.b[,2]>=SK),
mean(CI.p[,1]<=SK & CI.p[,2]>=SK))

out2[1,]=c(mean(CI.n[,1]>SK),mean(CI.b[,1]>SK),mean(CI.p[,1]>SK)) 
#miss on the left

out3[1,]=c(mean(CI.n[,2]<SK),mean(CI.b[,2]<SK),mean(CI.p[,2]<SK)) 
#miss on the right
sk=sqrt(8/5)
Ci.n=Ci.b=Ci.p=matrix(0,m,2)
for(i in 1:m){
 x=rchisq(n,5)
 b = boot::boot(data=x,statistic=boot_sk, R = 1000)
 Ci = boot::boot.ci(b,type=c("norm","basic","perc"))
 Ci.n[i,]=Ci$norm[2:3]
 Ci.b[i,]=Ci$basic[4:5]
 Ci.p[i,]=Ci$percent[4:5]
}
out[2,]=c(mean(Ci.n[,1]<=sk & Ci.n[,2]>=sk),mean(Ci.b[,1]<=sk & Ci.b[,2]>=sk),
mean(Ci.p[,1]<=sk & Ci.p[,2]>=sk))

out2[2,]=c(mean(Ci.n[,1]>sk),mean(Ci.b[,1]>sk),mean(Ci.p[,1]>sk))
#miss on the left

out3[2,]=c(mean(Ci.n[,2]<sk),mean(Ci.b[,2]<sk),mean(Ci.p[,2]<sk))
#miss on the right

rownames(out)=rownames(out2)=rownames(out3)=c("N(0,1)","X^2(5)")
colnames(out)=colnames(out2)=colnames(out3)=c("norm","basic","perc")
print("Empirical Coverage Rates");out
print("The proportion miss on the left");out2
print("The proportion miss on the right");out3

Compared to $\chi^2$ distribution, normal distribution has coverage probability more close to nominal confidence level 0.95.The proportion miss on the left is close to the proportion miss on the right for normal distribution, while for $\chi$ distribution, the proportion miss on the right is larger than the proprtion miss on the left which also indicates the asymmetry.

Homework-7 11.15

7.8 Obtain the jackknife estimates of bias and standard error of $$\hat{\theta}=\frac{\hat{\lambda_1}}{\sum_{i=1}^{5}\hat{\lambda_i}}$$

The data has 88 samples, we leave one each time, compute the covariance matrix and estimate $\theta$. The estimate for bias is $(n-1)(\bar{\hat{\theta_{(.)}}-\hat{\theta}})$. The estimate for variance is $\frac{(n-1)^2}{n}var(\hat{\theta_{(i)}})$ and standard error is the square root of variance.

library(bootstrap,quietly=TRUE)
data(scor)
scor=data.matrix(scor)
Sigma=cov(scor)
lambda=eigen(Sigma)$values

theta_compute = function(lambda){
 return(lambda[1]/sum(lambda))
}

theta_hat = theta_compute(lambda)
n=dim(scor)[1]
theta_jack = numeric(n)
for(i in 1:n){
 Sigma_i=cov(scor[-i,])
 lambda_i=eigen(Sigma_i)$values
 theta_jack[i] = theta_compute(lambda_i)
}

bias_jack = (n-1)*(mean(theta_jack)-theta_hat)
var_jack = (n-1)^2*var(theta_jack)/n
se_jack = sqrt(var_jack)
round(c(original=theta_hat,bias_jack=bias_jack,
se_jack=se_jack),3)

Using jackknife, the estimated bias is 0.001 and the estimated standard eroor for $\hat{\theta}$ is 0.05.

7.10 In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Repeat the analysis replacing the Log-Log model with a cubic polynomial model. Which of the four models is selected by the cross validation procedure? Which model is selected according to maximum adjusted $R^2$?

The four models we consider is
1. Linear: $Y=\beta_0+\beta_1 X+\epsilon$
2. Quadratic: $Y=\beta_0+\beta_1 X +\beta_2 X^2+\epsilon$
3. Exponential: $\log(Y)=\log(\beta_0)+\beta_1 X+\epsilon$
4. Cubic: $Y=\beta_0+\beta_1 X +\beta_2 X^2+\beta_3 X^3+\epsilon$

We use the mean of squared prediction error as the criteria for cross validation approach. Denote $\hat{y_{k.}}$ as the prediction when omitting sample k, $e_k$ as the difference between $y_k$ and $\hat{y_{k.}}$. Then the mean of squared prediction error is $\frac{1}{n}\sum_{k=1}^{n}e_k^2$.
The formula for adjusted $R^2$ is $R^2_{adj}=1-[\frac{(1-R^2)(n-1)}{n-k-1}]$. n is the number of samples and k is the number of non-constant variables in the model. The adjusted $R^2$ increase only if the new term improves the model more than would be expected by chance. It decrease when a predictor improves the model by less than expected by chance.

data(ironslag,package="DAAG")
attach(ironslag)
n = length(magnetic) 
e1 = e2 = e3 = e4 = numeric(n)
for (k in 1:n) {
 y = magnetic[-k]
 x = chemical[-k]

 J1 = lm(y ~ x) 
 yhat1 = J1$coef[1] + J1$coef[2] * chemical[k]
 e1[k] = magnetic[k] - yhat1

 J2 = lm(y ~ x + I(x^2))
 yhat2 = J2$coef[1] + J2$coef[2] * chemical[k] +
 J2$coef[3] * chemical[k]^2
 e2[k] = magnetic[k] - yhat2


 J3 = lm(log(y) ~ x)
 logyhat3 = J3$coef[1] + J3$coef[2] * chemical[k]
 yhat3 = exp(logyhat3)
 e3[k] = magnetic[k] - yhat3


 J4 = lm(y ~ x + I(x^2)+ I(x^3))
 yhat4 = J4$coef[1] + J4$coef[2] * chemical[k] +
 J4$coef[3] * chemical[k]^2 + J4$coef[4] * chemical[k]^3
 e4[k] = magnetic[k] - yhat4


}

round(c(spe_1=mean(e1^2), spe_2=mean(e2^2), 
spe_3=mean(e3^2), spe_4=mean(e4^2)),4)

According to the prediction error criterion, Model 2, the quadratic model, with the smallest prediction error, would be the best fit for the data.

y = magnetic
x = chemical
L1 = lm(y ~ x) 
L2 = lm(y ~ x + I(x^2))
L3 = lm(log(y) ~ x)
L4 = lm(y ~ x + I(x^2)+ I(x^3))
round(c(adj_R2_1=summary(L1)$adj.r.squared,
        adj_R2_2=summary(L2)$adj.r.squared,
        adj_R2_3=summary(L3)$adj.r.squared,
        adj_R3_3=summary(L4)$adj.r.squared),4)

According to the adjusted $R^2$ , also the Model 2, with the biggest value, would be the best fit the data. Particularly, we can see that the $R^2_{adj}$ for model 4 is smaller than model 2, which means that adding a cubic term in the model does not improve the model.

The fitted regression equation for Model 2 is:

L2 = lm(magnetic ~ chemical + I(chemical^2))
coef=round(L2$coef,3)
paste("Magnetic =",coef[1],coef[2],"*Chemical +" ,coef[3],"*Chemical^2")

Homework-8 11.22

8.3 Implement a permutation test for equal variance based on the maximum number of extreme points that applies when sample sizes are not necessarily equal.

We use function count5num to compute the maximum number of extreme points, then we use function count5_test to do the permutation and compute p value. We reject null hypothesis if the permutation number larger than observed number of extreme points.

library(boot,quietly = TRUE)
count5num <- function(z,ix,n1) {
 z = z[ix]
 x = z[1:n1]
 y = z[-(1:n1)]
 X <- x - mean(x)
 Y <- y - mean(y)
 outx <- sum(X > max(Y)) + sum(X < min(Y))
 outy <- sum(Y > max(X)) + sum(Y < min(X))
 # return the number of extreme points
 return(max(c(outx, outy)))
}
count5_test =function(y,n1){ #Permutation version
 boot.obj = boot(data = y, statistic = count5num
 ,R = 2000,sim = "permutation",n1=n1)
 tb = c(boot.obj$t0, boot.obj$t)
 p.value = mean(tb>=tb[1]) #larger t favors alternative
 return(as.integer(p.value<0.05))
}
count5test = function(x, y) {
 X = x - mean(x)
 Y = y - mean(y)
 outx = sum(X > max(Y)) + sum(X < min(Y))
 outy = sum(Y > max(X)) + sum(Y < min(X))
 # return 1 (reject) or 0 (do not reject H0)
 return(as.integer(max(c(outx, outy)) > 5))
}
n1=20;n2=30;m=500
a1 = mean(replicate(m, expr={ #Count 5 criterion
 x = rnorm(n1);y <- rnorm(n2)
 x = x - mean(x) #centered by sample mean
 y = y - mean(y)
 count5test(x, y)
}))
a2 = mean(replicate(m, expr={ #Permutation test
 x1 = rnorm(n1);x2 = rnorm(n2)
 y=c(x1,x2)
 count5_test(y,n1)
}))
round(c(c5_t1e=a1,pt_t1e=a2),3)

When $n_1=20,n_2=30$, we can see that the empirical tyep-I error of permutation test is smaller than count-5 test when example sizes are different. Thus, using permutation test can better control type-I error than count 5 criterion.

n2=50;m=500
a1 = mean(replicate(m, expr={ #Count 5 criterion
 x = rnorm(n1);y <- rnorm(n2)
 x = x - mean(x) #centered by sample mean
 y = y - mean(y)
 count5test(x, y)
}))
a2 = mean(replicate(m, expr={ #Permutation test
 x1 = rnorm(n1);x2 = rnorm(n2)
 y=c(x1,x2)
 count5_test(y,n1)
}))
round(c(c5_t1e=a1,pt_t1e=a2),3)

This also make sense when $n_2=50$, thus we can come to the conclusion that using permutation test can better control type-I error than count 5 criterion.

2. Power comparison (distance correlation test versus ball covariance test) X ∼ $N(0_2, I_2)$, $e ∼ N(0_2, I_2)$, X and e are independent

Model 1: $Y = \frac{X}{4} + e$

library(snowfall,quietly=TRUE)
Akl = function(x) {
 d = as.matrix(dist(x))
 m = rowMeans(d); M = mean(d)
 a = sweep(d, 1, m); b = sweep(a, 2, m)
 return(b+M)
}
dCov = function(x, y) {
 x = as.matrix(x); y = as.matrix(y)
 n = nrow(x); m = nrow(y)
 if (n != m || n < 2) stop("Sample sizes must agree")
 if (! (all(is.finite(c(x, y)))))
 stop("Data contains missing or infinite values")
 A = Akl(x); B = Akl(y)
 sqrt(mean(A * B))
}
ndCov2 = function(z, ix, dims) {
 #dims contains dimensions of x and y
 p = dims[1]
 q = dims[2]
 d = p + q
 x = z[ , 1:p] #leave x as is
 y = z[ix, -(1:p)] #permute rows of y
 return(nrow(z) * dCov(x, y)^2)
}
dcov.test= function(z){
 boot.obj = boot(data = z, statistic = ndCov2, R = 999,
 sim = "permutation", dims = c(2, 2))
 tb = c(boot.obj$t0, boot.obj$t)
 return(p.cor = mean(tb>=tb[1]))
}
run_cop1 = function(n){
for (i in 1:10){
 x = rmvnorm(n,mu=c(0,0),sigma=diag(1,2))
 e = rmvnorm(n,mu=c(0,0),sigma=diag(1,2))
 y = x/4+e
 z = cbind(x,y)
 p1.values[i,1] = dcov.test(z)
 p1.values[i,2] = bcov.test(z[,1:2],z[,3:4],R=999,seed=i+1)$p.value
}
 return(colMeans(p1.values<0.1))
}
p1.values=matrix(0,ncol=2,nrow=10)
sfInit(parallel = TRUE, cpus = 10)
sfLibrary(mixtools)
sfLibrary(Ball)
sfLibrary(boot)
sfExport("Akl","dCov","ndCov2","dcov.test")
sfExport("p1.values")
n = seq(50,200,20)
result1 = sfLapply(n, run_cop1)  #Model 1
result1=matrix(unlist(result1),ncol=2,byrow=T)
plot(n,result1[,1],type="b",pch=20,ylab="Power",main="Model 1",ylim=c(0,1.1))
lines(n,result1[,2],type="b",lty=2)
legend(140,0.3,c("Distance","Ball"),lty=1:2)
t1=t(result1);colnames(t1)=as.character(n);rownames(t1)=c("Distance","Ball")
knitr::kable(t1)

Model 2: $Y = \frac{X}{4} \cdot e$

run_cop2 = function(n){
for (i in 1:10){
 x = rmvnorm(n,mu=c(0,0),sigma=diag(1,2))
 e = rmvnorm(n,mu=c(0,0),sigma=diag(1,2))
 y = x/4*e
 z = cbind(x,y)
 p2.values[i,1] = dcov.test(z)
 p2.values[i,2] = bcov.test(z[,1:2],z[,3:4],R=999,seed=i+1)$p.value
}
 return(colMeans(p2.values<0.1))
}
p2.values=matrix(0,ncol=2,nrow=10)
sfExport("p2.values")
n = seq(50,200,20)
result2 = sfLapply(n, run_cop2)  #Model 2
result2=matrix(unlist(result2),ncol=2,byrow=T)
plot(n,result2[,1],type="b",pch=20,ylab="Power",main="Model 2",ylim=c(0,1.1))
lines(n,result2[,2],type="b",lty=2)
legend(140,0.3,c("Distance","Ball"),lty=1:2)
t2=t(result2);colnames(t2)=as.character(n);rownames(t2)=c("Distance","Ball")
knitr::kable(t2)

From the curves and the tables, we can see that in Model 1, distance correlation test has larger power, but in Model 2, Ball covariance test has larger power.

Homework-9 11.29

9.4 Implement a random walk Metropolis sampler for generating the standard Laplace distribution (see Exercise 3.2). For the increment, simulate from a normal distribution. Compare the chains generated when different variances are used for the proposal distribution. Also, compute the acceptance rates of each chain.

The standard Laplace distribution has density $f(x)=\frac{1}{2}e^{-|x|},x\in\mathbb{R}$, it is proportional to $e^{-|x|}$
$$r(X_t,Y)=\frac{f(Y)}{f(X_t)}=e^{|X_t|-|Y|}$$ Random Walk Metropolis: $g(r|s)=g(|s-r|)$

set.seed(8888)
f = function(x) {
  return(exp(-abs(x)))
}
random.M = function(sigma, x0, N){
 x = numeric(N)
 x[1] = x0
 u = runif(N)
 rej = 0
 for (i in 2:N) {
  y = rnorm(1, x[i-1], sigma)
  if (u[i] <= (f(y) / f(x[i-1]))) x[i] = y 
  else {x[i] = x[i-1];rej=rej+1}
 }
 return(list(x=x, rej=rej))
}
N=2000
sigma = c(.05, .5, 2, 16)
x0=20
R1=random.M(sigma[1],x0,N)
R2=random.M(sigma[2],x0,N)
R3=random.M(sigma[3],x0,N)
R4=random.M(sigma[4],x0,N)
#number of candidate points rejected
Rej=cbind(R1$rej, R2$rej, R3$rej, R4$rej)
Rej= round((N-Rej)/N,3)
rownames(Rej)="Accept rates";colnames(Rej)=paste("sigma",sigma)
knitr::kable(Rej)
par(mfrow=c(1,1))
plot(1:N,R1$x,type='l',ylab='N',main=paste("sigma=",sigma[1]))
plot(1:N,R2$x,type='l',ylab='N',main=paste("sigma=",sigma[2]))
plot(1:N,R3$x,type='l',ylab='N',main=paste("sigma=",sigma[3]))
plot(1:N,R4$x,type='l',ylab='N',main=paste("sigma=",sigma[4]))
library(rmutil,quietly=TRUE)
p = c(.05, seq(.1, .9, .1), .95)
Q = qlaplace(p)
R_all = cbind(R1$x, R2$x, R3$x, R4$x)
MC = R_all[501:N, ]
QR = apply(MC, 2, function(x) quantile(x, p))
colnames(QR)=paste("sigma",sigma)
knitr::kable(round(cbind(real_q=Q, QR),3))
#xtable::xtable(round(cbind(real_q=Q, QR),3))
real=rlaplace(N)
par(mfrow=c(1,1))
qqplot(MC[,1],real,pch=20,col='red',main=paste("sigma=",sigma[1]));
qqline(MC[,1],distribution=qlaplace,lwd=1.9)
qqplot(MC[,2],real,pch=20,col='red',main=paste("sigma=",sigma[2]));
qqline(MC[,2],distribution=qlaplace,lwd=1.9)
qqplot(MC[,3],real,pch=20,col='red',main=paste("sigma=",sigma[3]));
qqline(MC[,3],distribution=qlaplace,lwd=1.9)
qqplot(MC[,4],real,pch=20,col='red',main=paste("sigma=",sigma[4]));
qqline(MC[,4],distribution=qlaplace,lwd=1.9)

We can see that when $\sigma = 0.05$, the acceptance rate is 0.979, really close to 1 and the behavior of the chain was like a random walk, without any convergent trends. When $\sigma=0.5$, the chain converged slowly. When $\sigma=2$, the chain converged quickly with a lower acceptance rate: 0.524. When $\sigma=16$, we can see the acceptance rate was small and thus the simulation was very inefficient.

As for the qqplot, the quantiles of simulation 1 were far from theoretical ones. For simulation 2 and 3, the simulated quantiles are extremely close to theoretical ones. When $\sigma=16$, the simulation was still close to theory, but acceptance rates were much lower. Together with acceptance rates and convergent speeds, I thought simulation 3 ($\sigma=2$) was the best one. Besides, the variance of proposed distribution greatly influenced the simulation effect.

P.s: If I choose $x_0=0$, then it won't have the sharp decrease at the beginning. Considering we only focus on the later part of the chain and we have the burn-in process, the value of $x_0$ doesn't matter much.

Homework-10 12.6

11.1 The natural logarithm and exponential functions are inverses of each other, so that mathematically log(exp x) = exp(log x) = x. Show by example that this property does not hold exactly in computer arithmetic. Does the identity hold with near equality? (See all.equal.)

a=3
log(exp(a))==exp(log(a));isTRUE(all.equal(log(exp(a)),exp(log(a))))

a=5
log(exp(a))==exp(log(a));isTRUE(all.equal(log(exp(a)),exp(log(a))))

We can see that if choosing a=3/5, the property does not hold, but it holds with near equality.

11.5 Write a function to solve the equation for $a$

$$\frac{2\Gamma(\frac{k}{2})}{\sqrt{\pi (k-1)} \Gamma(\frac{k-1}{2})}\int_0^{c_{k-1}}(1+\frac{u^2}{k-1})^{-k/2}du$$ $$=\frac{2\Gamma(\frac{k+1}{2})}{\sqrt{\pi k} \Gamma(\frac{k}{2})}\int_0^{c_{k}}(1+\frac{u^2}{k})^{-(k+1)/2}du$$ where $$c_k=\sqrt{\frac{a^2k}{k+1-a^2}}$$ Compare the solutions with the points $A(k)$ in Exercise 11.4

I define function f to compute one side of the equation and solve the equation in function solve_eq. For Exercise 11.4, I define function Sk to compute one end of the curve and find the intersection points using solve_ist. I found the roots lying between 1 and 2 by attempting, thus I set the lower=1 and upper=2 in uniroot. There are two ways to compute Exercise 11.5, one uses numerical integration and the other uses the probability function of t distribution directly. I listed the both here and commented on one.

f = function(k,a){
 ck=sqrt(a^2*k/(k+1-a^2))
 int=integrate(function(u){(1+u^2/k)^(-(k+1)/2)},0,ck)$value
 mul=2/sqrt(pi*k)*exp(lgamma((k+1)/2)-lgamma(k/2))
 mul*int
}
# f = function(k,a){
#   ck=sqrt(a^2*k/(k+1-a^2))
#   pt(ck,df=k)
# }
solve_eq = function(k){
  out=uniroot(function(a){f(k,a)-f(k-1,a)},lower=1,upper=2)
  out$root
}
Root=matrix(0,2,5)
k=c(4,25,100,500,1000)
for (i in 1:length(k)){
  Root[2,i]=round(solve_eq(k[i]),4)
}
Root[1,]=k;rownames(Root)=c('k','root')
Root
Sk = function(k,a){
 ck=sqrt(a^2*k/(k+1-a^2))
 pt(ck,df=k,lower.tail=FALSE)
}

solve_ist = function(k){
  o=uniroot(function(a){Sk(k,a)-Sk(k-1,a)},lower=1,upper=2)
  o$root
}
Ist=matrix(0,2,5)
k=c(4,25,100,500,1000)
for (i in 1:length(k)){
  Ist[2,i]=round(solve_ist(k[i]),4)
}
Ist[1,]=k;rownames(Ist)=c('k','ist')
Ist

The results showed that the solutions to equation in 11.5 were the same as the points in Exercise 11.4.

ABO Blood Type Problem: The frequencies of A, B, O is p, q, r. Observed data: $n_A=28,n_B=24,n_{OO}=41,n_{AB}=70$.

$$Observed\ data\ likelihood:\ L(p,q|...)=(p^2+2pr)^{n_{A.}}(q^2+2qr)^{n_{B.}}(r^2)^{n_{OO}}(2pq)^{n_{AB}}$$ $$Complete\ data\ likelihood:\ L(p,q|...)=(p^2)^{n_{AA}}(q^2)^{n_{BB}}(r^2)^{n_{OO}}(2pr)^{n_{AO}}(2qr)^{n_{BO}}(2pq)^{n_{AB}}$$ Then take log and use $n_{A.}-n_{AA}$,$n_{B.}-n_{BB}$ to replace $n_{AO}$ and $n_{BO}$.

(1) Use EM algorithm to solve MLE of p and q (consider missing data $n_{AA}$ and $n_{BB}$)

$$\hat{p_1}=\frac{n_{AB}+n_{A.}+n_{A.}\frac{\hat{p_0}^2}{\hat{p_0}^2+2\hat{p_0}\hat{r_0}}}{2n}$$ $$\hat{q_1}=\frac{n_{AB}+n_{B.}+n_{B.}\frac{\hat{q_0}^2}{\hat{q_0}^2+2\hat{q_0}\hat{r_0}}}{2n}$$

set.seed(5678)
log_like = function(p,q,r){
 p_t=p^2/(p^2+2*p*r);q_t=q^2/(q^2+2*q*r)
 l=n_A*log(2*p*r)+n_B*log(2*q*r)+2*n_OO*log(r)+n_AB*log(2*p*q)+
   p_t*n_A*log(p/2/r)+q_t*n_B*log(q/2/r)
 l
}
N = 10000 
B = c(.5,.4,.1) 
tol = .Machine$double.eps^0.5
n_A=28;n_B=24;n_OO=41;n_AB=70
n=n_A+n_B+n_OO+n_AB
B.new=B;Log_l=log_like(B[1],B[2],B[3])
for (j in 1:N) {
 p_tmp=B[1]^2/(B[1]^2+2*B[1]*B[3])
 B[1]=(n_AB+n_A+n_A*p_tmp)/n/2
 q_tmp=B[2]^2/(B[2]^2+2*B[2]*B[3])
 B[2]=(n_AB+n_B+n_B*q_tmp)/n/2
 B[3]=1-B[1]-B[2]
 if (sum(abs(B-B.new)/B.new)<tol) break
 B.new = B;Log_l=c(Log_l,log_like(B[1],B[2],B[3]))
}
print(list(pqr = B, iter = j, tol = tol))

(2) Show that the log-maximum likelihood values in M-steps are increasing via line plot.
$$Log-max\ likelihood:\ l(p,q|...)=n_{A.}log(2pr)+n_{B.}log(2qr)+2n_{OO}log(r)+n_{AB}log(2pq)$$ $$+\frac{p^2}{p^2+2pr}n_{A.}log(\frac{p}{2r})+\frac{q^2}{q^2+2qr}n_{B.}log(\frac{q}{2r})$$ All p,q,r use the estimated values in M-steps.

plot(1:j,Log_l,type='l',xlab='iteration',ylab='log-likelihood')

We can see that the log-maximum likelihood values in M-steps are increasing.

Homework-11 12.13

(Advanced R p204) 3.Use both for loops and lapply() to fit linear models to mtcars using the formulas stored in this list: formulas <- list(mpg ~ disp, mpg ~ I(1 / disp),mpg ~ disp + wt,mpg ~ I(1 / disp) + wt).

attach(mtcars)
formulas=list(
 mpg ~ disp,
 mpg ~ I(1 / disp),
 mpg ~ disp + wt,
 mpg ~ I(1 / disp) + wt
)
#Loop
n=length(formulas)
out=vector("list",n)
for (i in 1:n){
 out[[i]]=lm(formulas[[i]],data=mtcars)
}
out
#lapply
lapply(formulas,lm)

Lapply greatly simplified the code.

4.Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function? bootstraps <- lapply(1:10, function(i) {rows <- sample(1:nrow(mtcars), rep = TRUE);mtcars[rows, ]})

bootstraps = lapply(1:10, function(i) {
 rows = sample(1:nrow(mtcars), rep = TRUE)
 mtcars[rows, ]
})
oUt=vector("list",10)
for (i in 1:10) oUt[[i]]=lm(mpg ~ disp,data=bootstraps[[i]])
oUt[1:3]#Only print results 1-3  
Out=lapply(seq_along(bootstraps), function(i){
 lm(mpg ~ disp,data=bootstraps[[i]])})
Out[1:3]#Only print resuls 1-3
OUT=lapply(bootstraps,lm,formula = mpg ~ disp)#No anonymous function
OUT[1:3]

5.For each model in the previous two exercises, extract $R^2$ using the function below.rsq <- function(mod) summary(mod)$r.squared

rsq = function(mod) summary(mod)$r.squared
unlist(lapply(out,rsq)) #Model in Ex3
unlist(lapply(Out,rsq)) #Model in Ex4

(Advanced R p214) 3.The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial. Extra challenge: get rid of the anonymous function by using [[ directly.

trials=replicate(100,t.test(rpois(10, 10),rpois(7, 10)),simplify = FALSE)
sapply(trials,function(mod){mod$p.value})
sapply(trials, "[[", "p.value") #No anonymous function

Using sapply greatly simplified the code.

7.Implement mcsapply(), a multicore version of sapply(). Can you implement mcvapply(), a parallel version of vapply()? Why or why not?

library(parallel)
cores = detectCores()
cluster = makePSOCKcluster(cores)
mcsapply = function(cluster,X,FUN,...){
 res=parLapply(cluster,X,FUN,...) #Use parLapply in Windows
 simplify2array(res)
}
#Example of Parallelisation
boot_i = function(i){
 r_adj = function(x,id) {
  x = x[id,]     #bootstrap sample
  res = lm(mpg~wt+disp,data=x)
  summary(res)$adj.r.squared
}
 as.numeric(boot::boot(mtcars,r_adj,1)$t)
}
system.time(sapply(1:500,boot_i))
system.time(mcsapply(cluster,1:500,boot_i))
#In fact, in Windows, we have function *parSapply* directly.

We can see that using parallelisation, the system time for running bootstrap decrease sharply. I can implement mcvapply, it is just the same as mcsapply, just with additional argument specifying the output type.

Homework-12 12.20

1.You have already written an R function for Exercise 9.4 (page 277, Statistical Computing with R). Rewrite an Rcpp function for the same task.

```{eval=FALSE}

include

include

using namespace Rcpp;

//[[Rcpp::export]] double f(double x) { return exp(-fabs(x)); }

//[[Rcpp::export]] List randomMHC (double sigma, double x0, int N) { NumericVector x(N); x[0] = x0; //index start from 0 NumericVector u = runif(N); int rej = 0; for (int i = 1; i < N;i++ ) { NumericVector y = rnorm(1, x[i-1], sigma); if (u[i] <= (f(y[0]) / f(x[i - 1]))) { x[i] = y[0]; } else { x[i] = x[i-1]; rej ++; } } return List::create(Named("x") = x, Named("rej") = rej); }

I rewrite the random Metropolis-Hastings Rcpp function named *randomMHC* as above. For Exercise 9.4, one needs to use different sigma, but in fact the mechanism is the same. I just take sigma=2 as an example here.  

```r
library(Rcpp)
set.seed(1278)
N=2000;sigma=2;x0=20
C=randomMHC(sigma,x0,N) #Samples generated by Cpp function
paste('Rej=',C$rej)
plot(1:N,C$x,type='l',xlab='N',ylab='Samples',main='RW Metropolis sampler for generating Laplace dist')

Using sourceCpp we can see that the Cpp function really works in R.

2.Compare the generated random numbers by the two functions using qqplot

set.seed(1278)
f = function(x) {
  return(exp(-abs(x)))
}
random.M = function(sigma, x0, N){
 x = numeric(N)
 x[1] = x0
 u = runif(N)
 rej = 0
 for (i in 2:N) {
  y = rnorm(1, x[i-1], sigma)
  if (u[i] <= (f(y) / f(x[i-1]))) x[i] = y 
  else {x[i] = x[i-1];rej=rej+1}
 }
 return(list(x=x, rej=rej))
}
R=random.M(sigma,x0,N) #Samples generated by R function
library(rmutil,quietly=TRUE)
p = c(.05, seq(.1, .9, .1), .95)
real=rlaplace(N) #Real examples of Laplace distribution 
Ex = cbind(C$x, R$x)
Mc = Ex[501:N, ] #burn in the first 500 samples
QR = apply(Mc, 2, function(x) quantile(x, p))
# par(mfrow=c(1,3))
# qqplot(Mc[,1],real,pch=20,col='red',main='Samples generated by Cpp');
# qqline(Mc[,1],distribution=qlaplace,lwd=1.9)
# qqplot(Mc[,2],real,pch=20,col='red',main='Samples generated by R');
# qqline(Mc[,2],distribution=qlaplace,lwd=1.9)
qqplot(Mc[,1],Mc[,2],xlab='Samples generated by Cpp',ylab='Samples generated by R',col='red',pch=20)
abline(a=0,b=1,col='black')

Th dots of qqplot are located on the diagonal lines. Therefore, with fixed seed, we can see that samples generated by two functions are the same. There is no any difference in generating samples between the two functions. (You can run the commented codes to see that both of the samples are close to the real distribution)

3.Campare the computation time of the two functions with microbenchmark.

library(microbenchmark)
time = microbenchmark(MHC=randomMHC(sigma,x0,N),MHR=random.M(sigma,x0,N))
summary(time)[,c(1,3,5,6)]

We can see that using Cpp function, the running time is much shorter than using R function, almost 1/15 of before. Therefore, using Rcpp method can greatly shorter the consuming time for calculations.



XixuHu/SC19090 documentation built on Jan. 2, 2020, 8:40 p.m.