Overview

StatComp18017 is a simple R package including my homework.

The R package 'microbenchmark' can be used to benchmark R and C++ functions in this package.
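
As a minimal illustration (the two functions compared here are defined on the spot and merely stand in for whichever R/C++ pair one wants to benchmark):

library(microbenchmark)
slow_sum <- function(x) { s <- 0; for (v in x) s <- s + v; s }  # plain R loop, for comparison only
x <- rnorm(1e4)
microbenchmark(loop = slow_sum(x), builtin = sum(x), times = 100)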


Question 1

Most often, we need to split the plotting device into several regions, or to know how the device is split into the windows we need. The function split.screen() can do this, but it has drawbacks: it is not compatible with every layout tool and cannot be used across several devices. So how can we control the way the device is split, and how do we code it?

Answer 1

First, we generate a matrix with the desired structure and then use the function layout() to split the device; layout.show() displays the resulting regions. Examples follow.

mat<- matrix(1:4,2,2)
mat
layout (mat)
layout.show(4)

It is also possible to combine layout() and matrix() in a single call to shorten the code and reach the same goal, as the examples below show.

layout (matrix(1:6,3,2))
layout.show(6)
m<- matrix(c(1:3,3),2,2)
layout (m)
layout.show(3)

In addition, the widths and heights arguments of layout() let us change the relative width and height of each region, so the device can be split into many shapes, even quite complicated or irregular ones.

m<- matrix(c(1,1,2,1),2,2)
layout (m,widths = c(2,1),heights = c(1,2))
layout.show(2)

These examples show how to split the device with the help of layout() and matrix().


Question 2

In data analysis, plots are indispensable for forming an intuitive impression of the data, so knowing how to produce the relevant plots is essential. Here are some simple examples.

Answer 2

The simplest example of plot() follows.

x<-rnorm(10)
y<-rnorm(10)
plot(x,y)

In fact, by adjusting a few graphical parameters we can change the appearance of the plot and make it both more attractive and easier to read.

plot(x,y,xlab = "Ten random values",ylab = "Ten other values", xlim = c(-2,2), ylim = c(-2,2), pch=22, col="red", bg="yellow", bty="7",tcl=0.4, main="How to customize a plot with R", las=1,cex=1.5)

The plot() function is only one of the basic graphing tools; more advanced functions such as par() are needed to modify the graphical settings used by plot().

opar <- par(no.readonly = TRUE)  # save the current graphical settings so they can be restored
par(bg="lightyellow", col.axis="blue", mar=c(4,4,2.5,0.25))
plot(x,y,xlab = "Ten random values",ylab = "Ten other values", xlim = c(-2,2), ylim = c(-2,2), pch=22, col="red", bg="yellow",bty="7", tcl=-0.25, las=1, cex=1.5)
title("How to customize a plot with R(bias)",font.main=3, adj=1)
par(opar)

Question 3

In the information age, processing large data sets is common, so we often need looping constructs to uncover patterns. R provides its own distinctive loop-replacement functions, which are worth studying. So how do these iteration tools work?

Answer 3

In R, the apply() family of functions lets us avoid writing explicit loops, as the following example shows.

x<-rnorm(10,-5,0.1)
y<-rnorm(10,5,2)
X<-cbind(x,y)
apply(X, 2,mean)
apply(X, 2, sd)

Furthermore, the function lapply(), which has a syntax similar to apply(), works on list objects and returns a list.

forms<-list(y~x,y~poly(x,2))
lapply(forms,lm)

In more flexible situations, sapply() is a good alternative: it accepts vectors or matrices as its main argument and simplifies the result to a vector or matrix where possible. A short illustration is given below.
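
As a minimal sketch, reusing the vectors x, y and the matrix X created above:

sapply(list(x = x, y = y), mean)              # simplifies the result to a named numeric vector
apply(X, 2, quantile, probs = c(0.25, 0.75))  # extra arguments are passed on to quantile()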



Question 5.4

Write a function to compute a $ Monte Carlo $ estimate of the $ Beta(3, 3) $ cdf, and use the function to estimate $F(x)$ for $x = 0.1, 0.2, \ldots, 0.9$. Compare the estimates with the values returned by the $ pbeta $ function in R.

Answer

x <- seq(.1,0.9, length = 9) 
m <- 10000 
u <- runif(m) 
cdf <- numeric(length(x)) 
for (i in 1:length(x)) { 
  g <- x[i]^3 * u^2 * (1 - x[i]*u)^2  # integrand after the substitution t = x*u
  cdf[i] <- 30*mean(g) 
}

Now the estimates $\hat{F}(x)$ for the nine values of x are stored in the vector cdf. Compare the estimates with the values $F(x)$ computed by the pbeta function.

Phi <- pbeta(x,3,3) 
print(round(rbind(x, cdf, Phi), 3)) 

Results for several values of x are shown, compared with the Beta(3,3) cdf values returned by pbeta. The Monte Carlo estimates are not always very close to the pbeta values because such estimates become worse in the extreme upper tail of the distribution.


Question 5.9

The $ Rayleigh $ density [156, (18.76)] is $$f(x)=\frac{x}{\sigma^2}e^{-x^2/(2\sigma^2)}, \quad x \ge 0,\ \sigma>0.$$ Implement a function to generate samples from a $ Rayleigh(\sigma) $ distribution, using antithetic variables. What is the percent reduction in variance of $\frac{X+X'}{2}$ compared with $\frac{X_1+X_2}{2}$ for independent $X_1$, $X_2$?

Answer

MC.Phi <- function(x, R = 10000, antithetic = TRUE) {
  u <- runif(R/2) 
  sigma <- 4
  if (!antithetic) v <- runif(R/2) else 
    v <- 1 - u 
  u <- c(u, v) 
  cdf <- numeric(length(x)) 
  for (i in 1:length(x)) { 
    g <- x[i]^2*u/sigma^2 * exp(-(u* x[i])^2 / (2*sigma^2) )
    cdf[i] <- mean(g)  
  } 
  cdf 
}

A comparison of estimates obtained from a single Monte Carlo experiment is below.

x <- seq(1,10, length=10) 
sigma <- 4
Phi <- 1 - exp(-x^2/(2*sigma^2))  # exact Rayleigh(sigma) cdf, for comparison with the MC estimates
set.seed(123)
MC1 <- MC.Phi(x, anti = FALSE) 
set.seed(123)
MC2 <- MC.Phi(x) 
print(round(rbind(x, MC1, MC2, Phi), 5))
m <- 1000 
MC1 <- MC2 <- numeric(m) 
x <- 2
for (i in 1:m) { 
  MC1[i] <- MC.Phi(x, R = 1000, anti = FALSE) 
  MC2[i] <- MC.Phi(x, R = 1000)
  }
print(sd(MC1)) 
print(sd(MC2)) 
print((var(MC1) - var(MC2))/var(MC1)) 

The antithetic variable approach achieved approximately a 99.5% reduction in variance at x = 2.0.


Question 5.13

Find two importance functions $f_1$ and $f_2$ that are supported on $(1,\infty)$ and are 'close' to $$g(x)=\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}, \quad x > 1.$$ Which of your two importance functions should produce the smaller variance in estimating $$\int_1^\infty \frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}\,dx$$ by importance sampling? Explain.

Answer

m <- 10000 
theta.hat <- se <- numeric(2) 
g <- function(x) { 
 x^2/sqrt(2*pi)*exp(-x^2/2) * (x >1) 
 }

x <- rnorm(m) #using f1
fg <- sqrt(2*pi)*g(x)/exp(-x^2/2)
theta.hat[1] <- mean(fg) 
se[1] <- sd(fg)

x <- rchisq(m,4) #using f2
fg <- g(x)/dchisq(x, df = 4)  # importance weight g(x)/f2(x)
theta.hat[2] <- mean(fg) 
se[2] <- sd(fg)

rbind(theta.hat, se) 

We choose $f_1$ as the standard normal distribution and $f_2$ as the chi-squared distribution with 4 degrees of freedom. The estimates (labeled theta.hat) of the integral of g(x) over $(1,\infty)$ and the corresponding standard errors se for the simulation using each importance function are computed above. From the results we see that $f_1$ produces the smaller variance. The code for the comparison plots follows.

x <- seq(0, 1, .01)
w <- 2 
f1 <- exp(-x^2/2)/sqrt(2*pi)
f2 <- dchisq(x, df = 4)  # chi-squared(4) density
g <- x^2/sqrt(2*pi)*exp(-x^2/2)

#figure (a) 
plot(x, g, type = "l", main = "", ylab = "", ylim = c(0,2), lwd = w) 
lines(x, f1, lty = 2, lwd = w) 
lines(x, f2, lty = 3, lwd = w) 
legend("topright", legend = c("g", 0:2), lty = 1:3, lwd = w, inset = 0.02)

#figure (b) 
plot(x, g, type = "l", main = "", ylab = "", ylim = c(0,3.2), lwd = w, lty = 2) 
lines(x, g/f1, lty = 3, lwd = w) 
lines(x, g/f2, lty = 4, lwd = w)
legend("topright", legend = c(0:2), lty = 2:4, lwd = w, inset = 0.02)

Question 5.14

Obtain a Monte Carlo estimate of $$\int_1^\infty \frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}\,dx$$ by importance sampling.

Answer

We use the standard normal distribution as the importance function to estimate the integral of g(x), and in one run we obtain the result 0.3868715.

m <- 10000 
theta.hat <- se <- numeric(1) 
g <- function(x) { 
  x^2/sqrt(2*pi)*exp(-x^2/2) * (x >1) 
 }
x <- rnorm(m) 
fg <- sqrt(2*pi)*g(x)/exp(-x^2/2)
theta.hat <- mean(fg) 
theta.hat

Question 6.9

Let $ X $ be a non-negative random variable with $\mu = E[X] < \infty$. For a random sample $ x_{1},...,x_{n} $ from the distribution of $ X $, the $ Gini $ ratio is defined by $$ G = \frac{1}{2n^2\mu}\sum_{j=1}^{n}\sum_{i=1}^{n}|x_{i}- x_{j}|.$$ The $ Gini $ ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that $ G $ can be written in terms of the order statistics $ x_{(i)} $ as $$ G = \frac{1}{n^2\mu}\sum_{i=1}^{n}(2i-n-1)x_{(i)}.$$ If the mean is unknown, let $ \hat{G} $ be the statistic $ G $ with $\mu$ replaced by $ \overline {x} $. Estimate by simulation the mean, median and deciles of $ \hat{G} $ if $ X $ is standard lognormal. Repeat the procedure for the uniform distribution and $ Bernoulli(0.1) $. Also construct density histograms of the replicates in each case.

Answer 6.9

We use the ordered sample to estimate the target quantities for the statistic $G$. The basic idea is: (1) generate a random sample of size n, $x_1, x_2, \ldots, x_n$, iid from the distribution of $X$; (2) sort the sample in increasing order to obtain the order statistics $x_{(1)} < x_{(2)} < \cdots < x_{(n)}$; (3) compute $\hat{G}$ for each of many replicates and summarize the replicates by their mean, median and deciles.

X is standard lognormal

First, we simulate the mean,median and deciles of $ \hat{G} $ if $ X $ is standard lognormal.

n <- 20  ## numbers of samples
m <- 1000  ## experiment times
G <- numeric(m) 
for (i in 1:m) {
  x <- sort(rlnorm(n)) ## X is standard lognormal
  mu <- mean(x)  ## let the mu be the mean values of the samples
  # vectorized computation using the order statistics: sum of (2i - n - 1) * x_(i)
  G[i] <- sum((2*(1:n) - n - 1) * x) / (mu * n^2)
}
Ghat <- mean(G) ## compute the estimate values(mean) of G
MGhat <- median(G) ## compute the estimate values(median) of G
DGhat <- quantile(G, probs = seq(0.1, 0.9, 0.1)) ## compute the estimated deciles of G
c(Ghat,MGhat,DGhat)
hist(G,prob=TRUE) ## density histogram of sample

X has the uniform distribution

Then, we simulate the mean,median and deciles of $ \hat{G} $ if $ X $ is uniform.

n <- 20  ## numbers of samples
m <- 1000  ## experiment times
G <- numeric(m) 
for (i in 1:m) {
  x <- sort(runif(n)) ## X has the uniform distribution
  mu <- mean(x)  ## let the mu be the mean values of the samples
  # vectorized computation using the order statistics
  G[i] <- sum((2*(1:n) - n - 1) * x) / (mu * n^2)
}
Ghat <- mean(G) ## compute the estimate values(mean) of G
MGhat <- median(G) ## compute the estimate values(median) of G
DGhat <- quantile(G, probs = seq(0.1, 0.9, 0.1)) ## compute the estimated deciles of G
c(Ghat,MGhat,DGhat)
hist(G,prob=TRUE) ## density histogram of sample

X has the Bernoulli(0.1) distribution

Last, we simulate the mean,median and deciles of $ \hat{G} $ if $ X $ is Bernoulli(0.1).

n <- 20  ## numbers of samples
m <- 1000  ## experiment times
G <- numeric(m) 
for (i in 1:m) {
  x <- sort(rbinom(n,size=1,prob=0.1)) ## X has the Bernoulli(0.1) distribution
  mu <- mean(x)  ## let the mu be the mean values of the samples
  # vectorized computation using the order statistics
  G[i] <- sum((2*(1:n) - n - 1) * x) / (mu * n^2)
}
Ghat <- mean(G) ## compute the estimate values(mean) of G
MGhat <- median(G) ## compute the estimate values(median) of G
DGhat <- quantile(G,probs = seq(0,1,0.1),na.rm = TRUE) ## compute the estimate values(deciles) of G
c(Ghat,MGhat,DGhat)
hist(G,prob=TRUE) ## density histogram of sample

Question 6.10

Construct an approximate 95% confidence interval for the $ Gini $ ratio $ \gamma = E[G] $ if $ X $ is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a $ Monte Carlo $ experiment.

Answer 6.10

From the code below, we get the estimate of G and a 95% CI. According to the relevant theory, the CI is $$\left[\bar{G}-t_{1-\frac{\alpha}2}(n-1)\frac{sd(G)}{\sqrt{n}},\ \bar{G}+t_{1-\frac{\alpha}2}(n-1)\frac{sd(G)}{\sqrt{n}}\right],$$ where $\bar{G}$ is the mean of the simulated replicates; the true value $RG = E[G]$, computed from the specific distribution of $X$, is used later to assess the coverage.

n <- 20  ## numbers of samples
m <- 1000  ## experiment times
G <- numeric(m) 
for (i in 1:m) {
  x <- sort(rlnorm(n)) ## X is standard lognormal
  mu <- mean(x)  ## let the mu be the mean values of the samples
  G[i] <- sum((2*(1:n) - n - 1) * x) / (mu * n^2)  # vectorized computation of G-hat
}
mean(G) ## compute the estimate values(mean) of G
alpha <- 0.05 
sigma <- 1  ## shape parameter of the standard lognormal
EG <- 2*pnorm(sigma/sqrt(2))-1 ## true Gini ratio of the lognormal, used as the real mean value
  UCL <- mean(G)+sd(G)*qt(1-alpha/2, df = n-1)/sqrt(n) ## compute the upper confidence limit
  LCL <- mean(G)-sd(G)*qt(1-alpha/2, df = n-1)/sqrt(n) ## compute the lower confidence limit
c(UCL,LCL)

Then we estimate the coverage rate by repeating the whole procedure and counting how often the interval covers the true value EG.

n <- 20  ## numbers of samples
m <- 1000  ## experiment times
G <- numeric(m) 
UCL=numeric(1000)
LCL=numeric(1000)
for(ii in 1:1000) {
for (i in 1:m) {
  x <- sort(rlnorm(n)) ## X is standard lognormal
  mu <- mean(x)  ## let the mu be the mean values of the samples
  G[i] <- sum((2*(1:n) - n - 1) * x) / (mu * n^2)
}
mean(G) ## compute the estimate values(mean) of G
alpha <- 0.05 
sigma <- 1
EG <- 2*pnorm(sigma/sqrt(2))-1 ## true Gini ratio of the lognormal
  UCL[ii] <- mean(G)+sd(G)*qt(1-alpha/2, df = n-1)/sqrt(n) ## compute the upper confidence limit
  LCL[ii] <- mean(G)-sd(G)*qt(1-alpha/2, df = n-1)/sqrt(n) ## compute the lower confidence limit
}
sum(EG<UCL & EG>LCL)  #count the number of intervals that satisfy the conditions
mean(UCL>EG & EG>LCL) ## compute the coverage rate

Question 6.B

Tests for association based on $ Pearson $ product moment correlation $ \rho $, $ Spearman's $ rank correlation coefficient $ \rho_{s} $, or $ Kendall's $ coefficient $ \tau $, are implemented in $ cor.test $. Show (empirically) that the nonparametric tests based on $ \rho_{s} $ or $ \tau $ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution $ (X,Y) $ such that $ X $ and $ Y $ are dependent) such that at least one of the nonparametric tests has better empirical power than the correlation test against this alternative.

Answer 6.B
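
A minimal sketch of one way to carry out this comparison is given below. The sample size n = 30, the correlation 0.2 for the bivariate normal case, and the exponentiated-margin alternative are illustrative choices, not prescribed by the exercise.

set.seed(1)
m <- 1000; n <- 30; alpha <- 0.05
# empirical power of the three tests for a given data generator
pow <- function(gen) {
  rej <- replicate(m, {
    xy <- gen()
    c(pearson  = cor.test(xy[, 1], xy[, 2], method = "pearson")$p.value,
      spearman = cor.test(xy[, 1], xy[, 2], method = "spearman")$p.value,
      kendall  = cor.test(xy[, 1], xy[, 2], method = "kendall")$p.value) < alpha
  })
  rowMeans(rej)
}
# bivariate normal with correlation 0.2
gen.normal <- function() {
  x <- rnorm(n); y <- 0.2*x + sqrt(1 - 0.2^2)*rnorm(n); cbind(x, y)
}
# dependent alternative: exponentiate one margin; the ranks (hence rho_s and tau)
# are unchanged, but the Pearson correlation is attenuated by the skewness
gen.alt <- function() {
  x <- rnorm(n); y <- 0.2*x + sqrt(1 - 0.2^2)*rnorm(n); cbind(x, exp(y))
}
pow(gen.normal)  # the Pearson test is expected to reject most often here
pow(gen.alt)     # a rank-based test can have higher power here

Under the bivariate normal model the Pearson test typically has the highest power, while under the transformed alternative the rank-based tests tend to do better, which is the pattern the exercise asks us to exhibit.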


Question 7.1

Compute a $jackknife$ estimate of the bias and the standard error of the correlation statistic in $Example 7.2$ (p202).

Answer 7.1

Ideas

If $\hat\theta$ is a smooth (plug-in) statistic, then $\hat\theta_{(i)} = t(F_{n-1}(x_{(i)}))$, and the jackknife estimate of bias is $$\widehat{bias}_{jack}=(n-1)(\overline{\hat\theta}_{(\cdot)}-\hat\theta),$$ where $\overline{\hat\theta}_{(\cdot)}=\frac1n\sum_{i=1}^{n}\hat\theta_{(i)}$ is the mean of the estimates from the leave-one-out samples, and $\hat\theta=\hat\theta(x)$ is the estimate computed from the original observed sample.

A $jackknife$ estimate of standard error is $$\widehat{se}_{jack}=\sqrt{\frac{n-1}{n}\sum_{i=1}^{n}(\hat\theta_{(i)}-\overline{\hat\theta}_{(\cdot)})^2},$$ for smooth statistics $\hat\theta.$

library(bootstrap) #for the law data
data(law, package = "bootstrap") 
n <- nrow(law) # number of observations in the sample
theta.hat <- cor(law$LSAT,law$GPA) #compute the correlation between law$LSAT and law$GPA
print(sprintf('theta.hat=%f',theta.hat))
#compute the jackknife replicates, leave-one-out estimates 
theta.jack <- numeric(n) 
for (i in 1:n) 
  theta.jack[i] <- cor(law$LSAT[-i],law$GPA[-i])
bias <- (n - 1) * (mean(theta.jack) - theta.hat)
print(sprintf('bias=%f',bias)) #jackknife estimate of bias 
se <- sqrt((n-1) * mean((theta.jack - mean(theta.jack))^2)) 
print(sprintf('se=%f',se)) #jackknife estimate of standard error

Explain the results we get

The correlation statistic is the correlation between $LSAT$ and $GPA$. We use $\theta$ to denote the target correlation and $\hat\theta$ for its estimator, and we compute $\hat\theta=0.776374$. We then use the jackknife to estimate its bias and standard error. The final results are $\widehat{bias}_{jack}=-0.006474$ and $\widehat{se}_{jack}=0.142519.$


Question 7.5

Refer to $Exercise 7.4$(p206). Compute 95% bootstrap confidence intervals for the mean time between failures $\frac1\lambda$ by the standard normal, basic, percentile, and $BCa$ methods. Compare the intervals and explain why they may differ.

Answer 7.5

Ideas

1,The standard normal bootstrap confidence interval : If $\hat\theta$ is a sample mean and the sample size is large, then the Central Limit Theorem implies that $$Z=\frac{\hat\theta-E[\hat\theta]}{se(\hat\theta)}$$ is approximately standard normal. Hence, if $\hat\theta$ is unbiased for $\theta$, then an approximate $100(1-\alpha)$% CI for $\theta$ is the $Z$-interval $$[\hat\theta-z_\frac{\alpha}{2}se(\hat\theta),\hat\theta+z_\frac{\alpha}{2}se(\hat\theta)]$$ where $$z_\frac{\alpha}{2} =\Phi^{-1}(1-\frac{\alpha}{2}).$$

2,The basic bootstrap confidence interval : The basic bootstrap confidence interval transforms the distribution of the replicates by subtracting the observed statistic.The $100(1-\alpha)$% confidence limits for the basic bootstrap confidence interval are $$(2\hat\theta-\hat\theta_{1-\frac{\alpha}{2}},2\hat\theta-\hat\theta_\frac{\alpha}{2}).$$

3,The percentile bootstrap confidence interval : Suppose that $\hat\theta^{(1)},\cdots,\hat\theta^{(B)}$ are the bootstrap replicates of the statistic $\hat\theta$. From the ecdf of the replicates, compute the $\frac\alpha2$ quantile $\hat\theta_{\frac\alpha2}$ and the $1-\frac\alpha2$ quantile $\hat\theta_{1-\frac\alpha2}$; these quantiles are the confidence limits.

4,The better bootstrap confidence interval (BCa) : a $100(1-\alpha)$% BCa bootstrap confidence interval is $$(\hat\theta^*_{\alpha_1},\ \hat\theta^*_{\alpha_2}),$$ where the adjusted levels $\alpha_1$ and $\alpha_2$ incorporate a bias-correction factor and an acceleration factor.

We can obtain the normal, basic, percentile and BCa bootstrap confidence intervals using the $boot$ and $boot.ci$ functions in the $boot$ package. The code to generate the 95% confidence intervals follows.


library(boot) #for boot and boot.ci 
data(aircondit, package = "boot") #get the data set and package
theta.boot <- function(x,ind) { 
  #function to compute the target values 
  x <- x[ind]
  mean(x) #according to the exponential distribution, $\frac1\lambda$ is estimated by the sample mean of x
}
#Run the bootstrap and compute confidence interval estimates for the mean time between failures
x <- aircondit$hours #give x the values in the data set aircondit to denote the time intervals
dat <- x
boot.obj <- boot(dat, statistic = theta.boot, R = 2000)
#The output for the bootstrap and bootstrap confidence intervals is below.
print(boot.obj) 
print(boot.ci(boot.obj, type = c("basic", "norm", "perc","bca")))

Explain the results we get

From the code above, the standard normal, basic, percentile and BCa bootstrap 95% CIs are $( 33.8, 181.2 ),$ $( 25.6, 172.1 ),$ $( 44.1, 190.6 ),$ and $( 55.2, 229.3 )$ respectively. Comparing them, the standard normal, basic and percentile intervals differ from one another only slightly, while the BCa interval differs from the others noticeably.

(1) The standard normal bootstrap CI is the simplest approach, but not necessarily the best. This method makes several assumptions: to apply the normal distribution, we assume that the distribution of $\hat\theta$ is normal, or that $\hat\theta$ is a sample mean and the sample size is large; we have also implicitly assumed that $\hat\theta$ is unbiased for $\theta.$

(2) The basic bootstrap CI transforms the distribution of the replicates by subtracting the observed statistic. It relies on the ecdf more heavily than the other methods and does not use the information as efficiently, but it makes no distributional assumptions, so its result can fit the data better to some extent.

(3)A bootstrap percentile interval uses the empirical distribution of the bootstrap replicates as the reference distribution. The quantiles of the empirical distribution are estimators of the quantiles of the sampling distribution of $\hat\theta$, so that these (random) quantiles may match the true distribution better when the distribution of $\hat\theta$ is not normal.

(4) The bias-corrected and accelerated (BCa) interval is a modified version of the percentile interval with better theoretical properties and better performance in practice, and the results above are consistent with this.


Question 7.8

Refer to $Exercise 7.7$(p210). Obtain the $jackknife$ estimates of bias and standard error of $\hat\theta$.

Answer 7.8

Ideas

To compute the jackknife estimates of $\hat\theta$, we first need the sample covariance matrix and all of its eigenvalues, from which $\hat\theta$ is easily computed. Then we apply the jackknife. The simple procedure is:

(1) Get the data set and generate a covariance matrix $\Sigma.$

(2) Compute all eigenvalues of the covariance matrix, $\lambda_1,\cdots,\lambda_5$, and compute the parameter to be estimated, $\hat\theta=\frac{\hat\lambda_1}{\sum_{i=1}^{5}\hat\lambda_i}.$

(3) Estimate $\hat\theta$ with the jackknife : its $jackknife$ estimate of bias is $$\widehat{bias}_{jack}=(n-1)(\overline{\hat\theta}_{(\cdot)}-\hat\theta),$$ and its $jackknife$ estimate of standard error is $$\widehat{se}_{jack}=\sqrt{\frac{n-1}{n}\sum_{i=1}^{n}(\hat\theta_{(i)}-\overline{\hat\theta}_{(\cdot)})^2}.$$

library(bootstrap) #for the sample data
data(scor) # load the data set scor from the bootstrap package
str(scor) # the structure of the sample data
n <- nrow(scor) # number of observations in the sample
cov <- cov(scor) # compute the covariance matrix
cov
e <- eigen(cov) #compute the eigenvalues of the covariance matrix
e
x <- e$values[1] # the largest eigenvalue of the covariance matrix
print(sprintf('x=%f',x))
y <- sum(e$values[1:5]) # the sum of all eigenvalues of the covariance matrix
print(sprintf('y=%f',y))
theta.hat <- x/y #The contribution rate of the covariance matrix
print(sprintf('theta.hat=%f',x/y))
#compute the jackknife replicates, leave-one-out estimates 
theta.jack <- numeric(n)
for (i in 1:n) {
  cov.jack <- cov(scor[-i, ])       # covariance matrix of the leave-one-out sample
  e <- eigen(cov.jack)              # its eigenvalues
  theta.jack[i] <- e$values[1] / sum(e$values)  # proportion of variance explained by the first component
}
bias <- (n - 1) * (mean(theta.jack) - theta.hat)
print(sprintf('bias=%f',bias)) #jackknife estimate of bias 
se <- sqrt((n-1) * mean((theta.jack - mean(theta.jack))^2))
print(sprintf('se=%f',se)) #jackknife estimate of standard error

Explain the results we get

From the results, we can see that the eigenvalues of the covariance matrix are $\lambda_1 = 686.98981, \lambda_2 = 202.11107,\lambda_3 = 103.74731, \lambda_4 = 84.63044,\lambda_5 = 32.15329$. So we compute $\hat\theta=\frac{\hat\lambda_1}{\sum_{i=1}^{5}\hat\lambda_i}=0.619115.$ The jackknife estimates of the bias and the standard error of $\hat\theta$ are the values of bias and se printed above; the bias is very small relative to $\hat\theta$ and the standard error is of the order of a few hundredths.


Question 7.11

In $Example 7.18$(p227), leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.

Answer 7.11

Ideas

The proposed models for predicting magnetic measurement (Y) from chemical measurement (X) are:

1.$Linear: Y=\beta_0+\beta_1X+\epsilon.$

2.$Quadratic: Y=\beta_0+\beta_1X+\beta_2X^2+\epsilon.$

3.$Exponential: log(Y) = log(\beta_0)+\beta_1X + \epsilon.$

4.$Log-Log: log(Y)=\beta_0 + \beta_1log(X)+\epsilon.$

And the leave-two-out cross validation procedure is:

1.For $k =1,\cdots,n$, let observations $(x_k,y_k)$ and $(x_{n-k+1},y_{n-k+1})$ be the test points and use the remaining observations to fit the model.

(a)Fit the model(s) using only the n-2 observations in the training set $(x_i,y_i),i \neq k,i \neq n-k+1.$

(b) Compute the predicted response $\hat{y}_k=\hat\beta_0+\hat\beta_1x_k$ for the test point.

(c) Compute the prediction errors $e_{k1}=y_k-\hat{y}_k$ and $e_{k2}=y_{n-k+1}-\hat{y}_{n-k+1}.$

(d) Combine $e_{k1}$ and $e_{k2}$ and denote the result by $e_k.$

2.Estimate the mean of the squared prediction errors $\hat\sigma_\epsilon^2=\frac1n\sum_{k=1}^{n}e_k^2.$

The code to estimate the parameters of the four models follows.

library(DAAG)
data(ironslag, package = "DAAG")
attach(ironslag)
n <- length(magnetic) #in DAAG ironslag 
e1<- e11<- e12<- e2<- e21<- e22<- e3<- e31<- e32<- e4<- e41<- e42 <- numeric(n)
# for n-fold cross validation 
# fit models on leave-two-out samples 
for (k in 1:n) { 
  y <- magnetic[-c(k, n-k+1)] ## leave-two-out samples
  x <- chemical[-c(k, n-k+1)] ## leave-two-out samples

  J1 <- lm(y ~ x) #the Linear model
  yhat11 <- J1$coef[1] + J1$coef[2] * chemical[k] 
  e11[k] <- magnetic[k] - yhat11
  yhat12 <- J1$coef[1] + J1$coef[2] * chemical[n-k+1] 
  e12[k] <- magnetic[n-k+1] - yhat12
  e1[k] <- e11[k]+e12[k]

  J2 <- lm(y ~ x + I(x^2)) #the Quadratic model
  yhat21 <- J2$coef[1] + J2$coef[2] * chemical[k] + J2$coef[3] * chemical[k]^2 
  e21[k] <- magnetic[k] - yhat21
  yhat22 <- J2$coef[1] + J2$coef[2] * chemical[n-k+1] + J2$coef[3] * chemical[n-k+1]^2 
  e22[k] <- magnetic[n-k+1] - yhat22
  e2[k] <- e21[k]+e22[k]

  J3 <- lm(log(y) ~ x) #the Exponential model
  logyhat31 <- J3$coef[1] + J3$coef[2] * chemical[k] 
  yhat31 <- exp(logyhat31) 
  e31[k] <- magnetic[k] - yhat31
  logyhat32 <- J3$coef[1] + J3$coef[2] * chemical[n-k+1] 
  yhat32 <- exp(logyhat32) 
  e32[k] <- magnetic[n-k+1] - yhat32
  e3[k] <- e31[k]+e32[k]

  J4 <- lm(log(y) ~ log(x)) #the Log-Log model
  logyhat41 <- J4$coef[1] + J4$coef[2] * log(chemical[k]) 
  yhat41 <- exp(logyhat41) 
  e41[k] <- magnetic[k] - yhat41
  logyhat42 <- J4$coef[1] + J4$coef[2] * log(chemical[n-k+1]) 
  yhat42 <- exp(logyhat42) 
  e42[k] <- magnetic[n-k+1] - yhat42
  e4[k] <- e41[k]+e42[k]
} 
 c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2)) 

Now, according to the results, we display the fitted parameter values and plots for Model 2:

Explain the results we get

According to the prediction error criterion, Model 2, the $Quadratic$ model, is the best fit for the data, so we show some of its parameter values and plots below. The fitted regression equation for Model 2 is $$\hat{Y}= 24.49262 -1.39334X + 0.05452X^2.$$
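
A minimal sketch of refitting Model 2 on the full data and drawing the fitted curve (the title and plotting choices are illustrative):

L2 <- lm(magnetic ~ chemical + I(chemical^2), data = ironslag)  # Model 2 on the full data
summary(L2)   # coefficients close to 24.49, -1.39 and 0.055 quoted above
plot(ironslag$chemical, ironslag$magnetic, xlab = "chemical", ylab = "magnetic",
     main = "Quadratic fit (Model 2)")
a <- seq(min(ironslag$chemical), max(ironslag$chemical), length.out = 100)
lines(a, predict(L2, newdata = data.frame(chemical = a)), lwd = 2)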


Question

Exercise 8.1

Implement the two-sample Cramer-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.

Answer Exercise 8.1

Ideas

The Cramer-von Mises statistic, which estimates the integrated squared distance between the distributions, is defined by $$W_2=\frac{mn}{(m+n)^2}\Big[\sum^n_{i=1}(F_n(x_i)-G_m(x_i))^2+\sum^m_{j=1}(F_n(y_j)-G_m(y_j))^2\Big],$$
where $F_n$ is the ecdf of the sample $x_1,...,x_n$ and $G_m$ is the ecdf of the sample $y_1,...,y_m$.

attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
detach(chickwts)

R <- 999#number of replicates
z <- c(x,y)#pooled sample
K <- 1:length(z)
D <- numeric(R) #storage for replicates

CM <- function(x,y){#a function to compute the statistic of Cramer-von Mises
  ecdfx <- ecdf(x)
  ecdfy <- ecdf(y)
  n1 <- length(x)
  n2 <- length(y)

  sum1 <- sum((ecdfx(x)-ecdfy(x))^2)
  sum2 <- sum((ecdfx(y)-ecdfy(y))^2)
  w <- n1*n2/(n1+n2)^2*(sum1+sum2)
  return(w)
}

D0 <- CM(x,y)

for(i in 1:R) {
  k <- sample(K, size = length(x), replace = F)
  x1 <- z[k]
  y1 <- z[-k] #complements of X1
  D[i] <- CM(x1,y1)
}

p <- mean(c(D0, D) >= D0)
print(p)

hist(D, main = "", freq = FALSE, xlab = "D (p = 0.412)", breaks = "scott")
points(D0, 0, cex = 1, pch = 16)

Explains

The approximate achieved significance level 0.412 does not support the alternative hypothesis that distributions differ. So we may say that soybean and linseed groups may be similar.
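
The exercise also asks for the data in Example 8.2. A minimal sketch, assuming that example compares the sunflower and linseed groups of chickwts, and reusing the CM function and R defined above:

attach(chickwts)
x2 <- sort(as.vector(weight[feed == "sunflower"]))  # assumed pairing for Example 8.2
y2 <- sort(as.vector(weight[feed == "linseed"]))
detach(chickwts)
z2 <- c(x2, y2)          # pooled sample
D2 <- numeric(R)
D0.2 <- CM(x2, y2)       # observed statistic
for (i in 1:R) {
  k <- sample(1:length(z2), size = length(x2), replace = FALSE)
  D2[i] <- CM(z2[k], z2[-k])
}
mean(c(D0.2, D2) >= D0.2)  # approximate p-value; a small value indicates the distributions differ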


Question 8-2

Design experiments for evaluating the performance of the NN, energy, and ball methods in various situations

  • Unequal variances and equal expectations
  • Unequal variances and unequal expectations
  • Non-normal distributions: t distribution with 1 df (heavy-tailed distribution), bimodal distribution (mixture of two normal distributions)
  • Unbalanced samples (say, 1 case versus 10 controls)
  • Note: The parameters should be chosen such that the powers are distinguishable (say, range from 0.3 to 0.9).

Answer 8.2

Unequal variances and equal expectations

In the following code, the first case, samples with unequal variances and equal expectations, is discussed.

library(RANN)## for nn function
library(boot)
library(energy)
library(Ball)
library(ggplot2)

m <- 30#times to loop
k<-3
p<-2#ncol
# mu <- 0.5
n1 <- n2 <- 50#nrow
R<-999#the number of replications in the bd.test function
n <- n1+n2
N = c(n1,n2)

Tn <- function(z, ix, sizes,k) {
  n1 <- sizes[1]
  n2 <- sizes[2]
  n <- n1 + n2
  if(is.vector(z)) z <- data.frame(z,0)
  z <- z[ix, ]
  NN <- nn2(data=z, k=k+1)
  block1 <- NN$nn.idx[1:n1,-1] 
  block2 <- NN$nn.idx[(n1+1):n,-1] 
  i1 <- sum(block1 < n1 + .5); i2 <- sum(block2 > n1+.5) 
  (i1 + i2) / (k * n)
}

eqdist.nn <- function(z,sizes,k){#NN
  boot.obj <- boot(data=z,statistic=Tn,R=R,
  sim = "permutation", sizes = sizes,k=k)
  ts <- c(boot.obj$t0,boot.obj$t)
  p.value <- mean(ts>=ts[1])
  list(statistic=ts[1],p.value=p.value)
}

p.values <- matrix(NA,m,3)#to store p values

for(i in 1:m) {
  x <- matrix(rnorm(n1 * p, sd = 1), ncol = p)#unequal variances
  y <- matrix(rnorm(n2 * p, sd = 1.4), ncol = p)
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value#NN
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value#in the energy package
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i)$p.value#"Ball Divergence" in the ball package
}

alpha <- 0.1#significance level
pow <- apply(p.values<alpha,2,mean)#compute the number of p.values which is less than 0.1 in each column
print(pow)
power <- data.frame(methods = c('NN','energy','Ball'),pow)
ggplot(power,aes(methods,pow))+#plot
  geom_col(fill = 'orange')+
  coord_flip()

Unequal variances and unequal expectations:

for(i in 1:m) {
  x <- matrix(rnorm(n1 * p, mean = 0.4, sd = 1), ncol = p)#unequal variances and unequal expectations
  y <- matrix(rnorm(n2 * p, mean = 0, sd = 1.4), ncol = p)
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x,  y = y,  R = 999,  seed = i)$p.value
}
alpha <- 0.1
pow <- apply(p.values<alpha,2,mean)
print(pow)
power <- data.frame(methods = c('NN','energy','Ball'),pow)
ggplot(power,aes(methods,pow))+
  geom_col(fill = 'lightslategrey')+
  coord_flip()

Non-normal distributions: t distribution with 1 df and bimodal distribution

for(i in 1:m) {
  x <- matrix(rt(n1 * p,df = 1), ncol = p)# t distribution
  y <- matrix(rnorm(n2 * p,sd = sample(c(1,1.3),size = n2*p, prob = c(0.5,0.5),replace = T)), ncol = p)#bimodal distribution (normal scale mixture)
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i)$p.value
}
alpha <- 0.01
pow <- apply(p.values<alpha,2,mean)
print(pow)
power <- data.frame(methods = c('NN','energy','Ball'),pow)
ggplot(power,aes(methods,pow))+
  geom_col(fill = 'mediumpurple')+
  coord_flip()

Unbalanced samples

n1 <- 50
n2 <- 5
n <- n1+n2
N = c(n1,n2)
for(i in 1:m) {
  x <- matrix(rnorm(n1*p,mean = 1), ncol = p)#50 samples
  y <- matrix(rnorm(n2*p,mean = 2), ncol = p)#5 samples
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i)$p.value
}
alpha <- 0.1
pow <- apply(p.values<alpha,2,mean)
print(pow)
power <- data.frame(methods = c('NN','energy','Ball'),pow)
ggplot(power,aes(methods,pow))+
  geom_col(fill = 'sienna2')+
  coord_flip()

Explains

The comments are similar in every chunk, so I only put them in the first chunk.
A function is defined to implement the NN method. In every scenario, a loop is run to compute the p-values of each method. In my code, the significance level $\alpha$ is set to 0.1, and the proportion of p-values below 0.1 is obtained with the function apply. Finally, I use ggplot2 to draw a figure showing the difference in power among the three methods.
I tried many parameter settings to make the gaps in power as large as possible, so that the differences are clear enough.


Exercise 9.3

Use the $Metropolis-Hastings$ sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 of the chain, and compare the deciles of the generated observations with the deciles of the standard $Cauchy$ distribution (see $qcauchy$ or $qt$ with df=1). Recall that a $Cauchy(\theta,\eta)$ distribution has density function $$f(x)=\frac1{\theta\pi(1+[(x-\eta)/\theta]^2)},\quad -\infty<x<\infty,\ \theta>0.$$ The standard $Cauchy$ has the $Cauchy(\theta =1,\eta= 0)$ density. (Note that the standard $Cauchy$ density is equal to the Student $t$ density with one degree of freedom.)


Answer Exercise 9.3

Ideas

Use the Metropolis-Hastings sampler to generate random variables from a $Cauchy$ distribution. Implementation of a Metropolis-Hastings sampler for this example is as follows. Note that the base of the array in $R$ is 1, so we initialize the chain at $X_0$ in $x[1]$.

  • 1. Set the proposal density $g(\cdot|X_t)$; here we use a normal random walk proposal, $g(\cdot|X_t)=N(X_t,1)$.
  • 2. Generate $X_0$ from a standard normal distribution and store it in $x[1]$.
  • 3. Repeat for $i = 2,\cdots,N:$
  • (a) Generate $Y$ from $N(X_t,1)$ with $X_t = x[i-1]$.
  • (b) Generate $U$ from $Uniform(0, 1)$.
  • (c) With $X_t = x[i-1]$, compute $$r(X_t,Y)=\frac{f(Y)g(X_t|Y)}{f(X_t)g(Y|X_t)},$$ where $f$ is the $Cauchy$ density with parameters $\theta=1,\eta=0$, $g(Y|X_t)$ is the $N(X_t,1)$ density evaluated at $Y$, and $g(X_t|Y)$ is the $N(Y,1)$ density evaluated at $X_t.$ If $U \leq r(X_t, Y)$, accept $Y$ and set $X_{t+1} = Y$; otherwise set $X_{t+1} = X_t$. Store $X_{t+1}$ in $x[i]$.
  • (d) Increment $t$.

N <- 5000
my_vector <- numeric(N)
my_vector[1] <- rnorm(1)
counter <- 0
#generates random deviates
u <- runif(N)

for (i in 2:N) 
{
    xt <- my_vector[i-1]
    y <- rnorm(1, mean = xt)
    #Density Cauchy distribution
    numerator <- dcauchy(y)*dnorm(xt, mean = y)
    denominator <- dcauchy(xt)*dnorm(y, mean = xt)

    if(u[i] <= numerator/denominator) 
    {
        my_vector[i] <- y
    }
    else 
    {
        my_vector[i] <- xt
        counter <- counter+1
    }
}

plot(my_vector,type = "l")

After1kDiscard <- my_vector[-(1:1000)]  # discard the first 1000 values of the chain

Compare the deciles of the sampler (after discarding the first 1000 values of the chain) with those of the standard Cauchy distribution:

#Sequence Generation
a <- ppoints(100)
sequence <- seq(0,1,0.01)
standardCauchy <- qcauchy(sequence)
standardCauchy <- standardCauchy[(standardCauchy> -Inf) & (standardCauchy< Inf)]
Q <- quantile(my_vector, a)
qqplot(standardCauchy, Q, main="",xlab="standardCauchy Quantiles", ylab="Sample Quantiles")
hist(After1kDiscard, freq = FALSE,main = "after discard the first 1000 of the chain")
lines(standardCauchy, dcauchy(standardCauchy), lty = 2)
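
As a direct check on the deciles themselves (a small sketch; the sample values vary from run to run):

probs <- seq(0.1, 0.9, 0.1)
round(rbind(sample = quantile(After1kDiscard, probs),
            cauchy = qcauchy(probs)), 3)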

Explain

To see the generated sample as a realization of a stochastic process, we can plot the sample against the time index, as the code above does. The plot shows that at times the candidate point is rejected and the chain does not move; this corresponds to the short horizontal paths in the graph. The QQ plot is an informal approach to assessing the goodness-of-fit of the generated sample with the target distribution. From the plot, it appears that the sample quantiles are in approximate agreement with the theoretical quantiles.


Exercise 9.6

$Rao$ [220, Sec. 5g] presented an example on genetic linkage of 197 animals in four categories (also discussed in [67, 106, 171, 266]). The group sizes are (125,18,20,34). Assume that the probabilities of the corresponding multinomial distribution are $(\frac12+\frac\theta4,\frac{1-\theta}4,\frac{1-\theta}4,\frac\theta4).$ Estimate the posterior distribution of $\theta$ given the observed sample, using one of the methods in this chapter.

Answer Exercise 9.6

We need to estimate the parameters $p_i, i=1,2,3,4$, in the presence of missing data. The following code and the corresponding results show some experiments.

p1=125/197
p2=18/197
p3=20/197
p4=34/197

my_prob <- c(p1,p2,p3,p4)
number_of_experiments <- 10
number_of_samples <- 197

experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments

We write down the likelihood function and use the EM algorithm: the E-step fills in the expected complete data given the current $\theta$, the M-step maximizes the resulting complete-data likelihood over $\theta$, and we iterate until the estimate stabilizes.

# define the expectation (E-step) function

estep <- function(theta,z2){
z2 = 125*(0.25*theta/(0.5+0.25*theta))
# z1 = 125*(0.5/(0.5+0.25*theta))
return(z2)
}

mstep <- function(theta,z2){
theta <- (z2+34)/(z2+34+18+20)
return(theta)
}

# set initial value for iteration

cur_theta <- 0.5
cur_z2 <- 0     # initial value for the missing count (recomputed in the first E-step)
maxit <- 1000
tol <- 1e-6
flag <- 0

# start iteration

for(i in 1:maxit){
  new_z2 <- estep(cur_theta,cur_z2)
  new_theta <- mstep(cur_theta,new_z2)

  # Stop iteration if the difference between the current and new estimates is less than a tolerance level
  if( all(abs(cur_theta - new_theta) < tol) ){ flag <- 1; break}


  # Otherwise continue iteration
  cur_theta <- new_theta
  cur_z2 <- new_z2
}
if(!flag) warning("Didn??t converge\n")

final_theta = cur_theta
paste0("Final Theta = ", format(round(cur_theta, 4), nsmall = 4))
paste0("Final Z2 = ", format(round(cur_z2, 4), nsmall = 4))

From the whole process, we obtain the estimate of the parameter $\theta.$ Next we simulate multinomial samples using the observed category proportions.

p1=125/197
p2=18/197
p3=20/197
p4=34/197

my_prob <- c(p1,p2,p3,p4)


number_of_experiments <- 1
number_of_samples <- 1000

experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments

By contrast, the simulation based on the completed data (using the estimated $\theta$) is as follows.

p1=0.5
p2=0.25*final_theta
p3=0.25-p2
p4=p3
p5=p2

my_prob <- c(p1,p2,p3,p4,p5)
number_of_experiments <- 10
number_of_samples <- 1000

experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments



number_of_experiments <- 1
number_of_samples <- 1000

experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments

Question 9-3

For $Exercise 9.6$, use the $Gelman-Rubin$ method to monitor convergence of the chain, and run the chain until the chain has converged approximately to the target distribution according to $\hat{R}<1.2.$

Answer Question 9-3

Ideas

  • The $Gelman-Rubin$ method of monitoring convergence of a $M-H$ chain is based on comparing the behavior of several generated chains with respect to the variance of one or more scalar summary statistics. The estimates of the variance of the statistic are analogous to estimates based on between-sample and within-sample mean squared errors in a one-way analysis of variance ($ANOVA$).

  • The between-sequence variance is $$B=\frac1{k-1}\sum_{i=1}^{k}\sum_{j=1}^{n}(\hat{\psi_{i.}}-\hat{\psi_{..}})^2=\frac{n}{k-1}\sum_{i=1}^{k}(\hat{\psi_{i.}}-\hat{\psi_{..}})^2$$

where $\hat{\psi_{i.}}=\frac1n\sum_{j=1}^{n}\psi_{ij} , \hat{\psi_{..}}=\frac1{nk}\sum_{i=1}^{k}\sum_{j=1}^{n}\psi_{ij}$

  • Within the $i^{th}$ sequence, the sample variance is $$s^2_i=\frac1{n-1}\sum_{j=1}^{n}(\psi_{ij}-\hat{\psi}_{i.})^2$$ and the pooled estimate of within-sample variance is $$W=\frac1{nk-k}\sum_{i=1}^{k}(n-1)s^2_i=\frac1k\sum_{i=1}^{k}s^2_i$$

  • The between-sequence and within-sequence estimates of variance are combined to estimate an upper bound for $Var(\psi)$: $$\hat{Var(\psi)}=\frac{n-1}{n}W+\frac1nB$$

  • The $Gelman-Rubin$ statistic is the estimated potential scale reduction $$\sqrt{\hat{R}}=\sqrt{\frac{\hat{Var(\psi)}}{W}}$$

which can be interpreted as measuring the factor by which the standard deviation of $\psi$ could be reduced by extending the chain. The factor $\hat{R}$ decreases to 1 as the length of the chain tends to infinity, so $\hat{R}$ should be close to 1 if the chains have approximately converged to the target distribution. $Gelman$ suggests that $\hat{R}$ should be less than 1.1 or 1.2.

set.seed(1)
Gelman.Rubin <- function(psi) {
  # psi[i,j] is the statistic psi(X[i,1:j])
  # for chain in i-th row of X
  psi <- as.matrix(psi)
  n <- ncol(psi)
  k <- nrow(psi)

  psi.means <- rowMeans(psi) #row means
  B <- n * var(psi.means) #between variance est.
  psi.w <- apply(psi, 1, "var") #within variances
  W <- mean(psi.w) #within est.
  v.hat <- W*(n-1)/n + (B/n) #upper variance est.
  r.hat <- v.hat / W #G-R statistic
  return(r.hat)
}

theta <- .2 #parameter of proposal distribution
k <- 4 #number of chains to generate
N <- 5000 #length of chains
b <- 500 #burn-in length
M <- 5000
groupsize <- c(125,18,20,34) #group size

prob <- function(theta,group){
  if(theta<0||theta>=0.9)
    return(0)
  return((1/2+theta/4)^groupsize[1]*((1-theta)/4)^groupsize[2]*((1-theta)/4)^groupsize[3]*(theta/4)^groupsize[4])
}


mn.chain <- function(groupsize,N,X1) {
  #generates a Metropolis chain 
  #with Multinomial proposal distribution
  #and starting value X1
  x <- numeric(N)
  x[1] <- X1#initialize x[1]
  w <- 0.25
  u <- runif(N)
  v <- runif(M, -w, w)

  for (i in 2:N) {
    theta <- x[i - 1] + v[i]
    if (u[i] <= prob(theta, groupsize) / prob(x[i - 1], groupsize))
    x[i] <- theta
    else
    x[i] <- x[i - 1]
  }
  return(x)
}

#choose overdispersed initial values
x0 <- c(0.2, 0.4, 0.6, 0.8)

#generate the chains
X <- matrix(0, nrow=k, ncol=N)
for (i in 1:k)
  X[i, ] <- mn.chain(groupsize,N,x0[i])

#compute diagnostic statistics
psi <- t(apply(X, 1, cumsum))
for (i in 1:nrow(psi))
  psi[i,] <- psi[i,] / (1:ncol(psi))
print(Gelman.Rubin(psi))

Then we can learn from the results:

  • The four chains of the summary statistic (the running mean) $\psi$ run from time 0 to 5000; a sketch of the corresponding plots is given below.

  • The dashed line on the $\hat{R}$ plot is drawn at $\hat{R}=1.2$; the statistic drops below this threshold within roughly the first 1000 iterations, so the chains can be regarded as approximately converged from that point on.
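
A minimal sketch of the plots referred to above, reusing the objects psi, b, N, k and the Gelman.Rubin function defined earlier:

# running means of the four chains
par(mfrow = c(2, 2))
for (i in 1:k)
  plot(psi[i, (b+1):N], type = "l", xlab = i, ylab = bquote(psi))
par(mfrow = c(1, 1))

# sequence of R-hat statistics with the 1.2 threshold
rhat <- rep(0, N)
for (j in (b+1):N)
  rhat[j] <- Gelman.Rubin(psi[, 1:j])
plot(rhat[(b+1):N], type = "l", xlab = "iteration", ylab = "R-hat")
abline(h = 1.2, lty = 2)   # dashed line at R-hat = 1.2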


Exercise 11.4

Find the intersection points $A(k)$ in $(0,\sqrt{k})$ of the curves $$S_{k-1}(a)=P(t(k-1)>\sqrt{\frac{a^2(k-1)}{k-a^2}})$$ and $$S_{k}(a)=P(t(k)>\sqrt{\frac{a^2k}{k+1-a^2}}),$$ for k = 4 : 25, 100, 500, 1000, where $t(k)$ is a Student $t$ random variable with $k$ degrees of freedom. (These intersection points determine the critical values for a t-test for scale-mixture errors proposed by $Székely$ [260].)

Answer exercise 11.4

The Idea is :

  • construct nested functions: the inner ones are functions of a, and the outer one is a function of k.

  • first, we code the function to compute the $S_{k-1}(a)$ and $S_k(a)$ respectively.

  • then construct the function of $S_k(a)-S_{k-1}(a)$

  • and the next step is to find the intersection points through the function $uniroot$

set.seed(1)
findIntersection = function (k) {
  s.k.minus.one = function (a) {
    1-pt(sqrt(a^2 * (k - 1) / (k - a^2)), df = k-1)
  }

  s.k = function (a) {
    1-pt(sqrt(a^2 * k / (k + 1 - a^2)), df = k)
  }

  f = function (a) {
    s.k(a) - s.k.minus.one(a)
  }

  eps = 1e-2
  return(uniroot(f, interval = c(eps, sqrt(k)-eps))$root)
}

rs = sapply(c(4:25, 100, 500, 1000), function (k) {
  findIntersection(k)
  })
cbind(k=c(4:25, 100, 500, 1000),a=rs)

From the results, we obtain a different intersection point for each k, and we find that the intersection point changes only slightly even when k increases substantially.


Question 11.6

Write a function to compute the $cdf$ of the $Cauchy$ distribution, which has density $$\frac1{\theta\pi(1+((x-\eta)/\theta)^2)},\quad -\infty<x<\infty,\ \theta>0.$$ Compare your results to the results from the $R$ function $pcauchy.$ (Also see the source code in $pcauchy.c.$)

Answer Question 11.6

The Idea is :

  • If we know the density function of the specified distribution, then integrating it over the appropriate interval gives the target $CDF$.

  • To get the true $CDF$ values for comparison, we use the functions $sapply$ and $pcauchy$.

my.dcauchy = function (x, eta, theta) {
  stopifnot(theta > 0)
  return(1/(theta*pi*(1 + ((x - eta)/theta)^2)))
}

my.pcauchy = function (x, eta, theta) {
  stopifnot(theta > 0)

  integral = function (x) {
    my.dcauchy(x, eta, theta)
  }

  return(integrate(integral, lower = -Inf, upper = x)$value)
}

eta = 0
theta = 2
xs = seq(-10, 10)
estimate = sapply(xs, function(x) my.pcauchy(x, eta, theta))
truth = sapply(xs, function(x) pcauchy(x, eta, theta))
round(cbind(estimate, truth),4)

Interpretation:

  • From the result, we see that the values we estimate agree closely with the true values returned by pcauchy.


Exercise 11-ABO

  • $A-B-O$ blood type problem

| Genotype | AA | BB | OO | AO | BO | AB | Sum |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Frequency | p^2 | q^2 | r^2 | 2pr | 2qr | 2pq | 1 |
| Count | nAA | nBB | nOO | nAO | nBO | nAB | n |

  • Observed data: $n_{A\cdot} = n_{AA} + n_{AO} = 28$ (A-type), $n_{B\cdot} = n_{BB} + n_{BO} = 24$ (B-type), $n_{OO} = 41$ (O-type), $n_{AB} = 70$ (AB-type).

  • Use EM algorithm to solve $MLE$ of $p$ and $q$ (consider missing data $n_{AA}$ and $n_{BB}$).

  • Record the maximum likelihood values in $M$-steps, are they increasing?

Answer exercise 11-ABO

The Idea is :

  • The $EM$ (Expectation Maximization) algorithm is a general optimization method that is often applied to find maximum likelihood estimates when data are incomplete.

  • First, we compute the likelihood function,then evaluate the log-likelihood of the data for any arbitrary set of allele frequencies.

  • Second, we can estimate the number of individuals carrying each possible genotype based on the observed counts for each phenotype and estimates of the allele frequencies at each iteration.And you can run the E-M function with the debug flag set to TRUE.

  • Then we set the initial value to be p=0.3,q=0.2,r=0.5, and then run the E-M function.

#set the likelihood function 
lnL <- function(p, q, nA =28, nB = 24, nAB = 70, nO = 41) {
  r = 1.0 - p - q
  nA * log(p^2 + 2*p*r) + nB * log(q^2 + 2*q*r) + nAB * log(2 * p * q) + 2 * nO * log(r)
}

EM <- function (p,q,nA =28, nB = 24, nAB = 70, nO = 41, debug = FALSE) {

  # Evaluate the likelihood using initial estimates
  llk <- lnL(p, q, nA, nB, nAB, nO)

  # Count the number of iterations so far
  iter <- 1

  # Loop until convergence ...
  while (TRUE)
  {
    # Estimate the frequency for allele O
    r= 1.0 - p - q

    # First we carry out the E-step

    # The counts for genotypes O/O and A/B are effectively observed
    # Estimate the counts for the other genotypes
    nAA <- nA * p / (p + 2*r)
    nAO <- nA - nAA
    nBB <- nB * q / (q + 2*r)
    nBO <- nB - nBB

    # Print debugging information
    if (debug)
    {
      cat("Round #", iter, "lnLikelihood = ", llk, "\n\n")
      cat("          Allele frequencies:      p = ",   p, ",   q = ",   q, ",   r = ",   r, "\n\n")
      cat("          Genotype  counts:      nAA = ", nAA, ", nAO = ", nAO, ", nBB = ", nBB, ", nBO = ", nBO, "\n\n")
    }

    # Then the M-step
    p <- (2 * nAA + nAO + nAB) / (2 * (nA + nB + nO + nAB))
    q <- (2 * nBB + nBO + nAB) / (2 * (nA + nB + nO + nAB))


    # Then check for convergence ...
    llk1 <- lnL(p, q, nA, nB, nAB, nO)
    if (abs(llk1 - llk) < (abs(llk) + abs(llk1)) * 1e-6) break       

    # Otherwise keep going
    llk <- llk1
    iter <- iter + 1
  }

 cbind(p = p, q = q)
}
EM(0.3,0.2,nA =28, nB = 24, nAB = 70, nO = 41, debug = TRUE)

Interpretation:

  • From the results above, the estimated values are $p=0.32734$ and $q= 0.3104234$.

  • The log-likelihood recorded at each iteration increases monotonically and levels off quickly, which means the likelihood values in the M-steps are indeed increasing, as the theory of the EM algorithm guarantees. A small check is sketched below.
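
A small sketch of this check, reusing lnL with the same E- and M-step formulas and storing the log-likelihood after every update:

p <- 0.3; q <- 0.2
llks <- lnL(p, q)                # log-likelihood at the starting values
for (it in 1:20) {
  r <- 1 - p - q
  nAA <- 28 * p / (p + 2*r); nAO <- 28 - nAA   # E-step
  nBB <- 24 * q / (q + 2*r); nBO <- 24 - nBB
  p <- (2*nAA + nAO + 70) / (2 * 163)          # M-step (163 = 28 + 24 + 70 + 41)
  q <- (2*nBB + nBO + 70) / (2 * 163)
  llks <- c(llks, lnL(p, q))
}
all(diff(llks) >= 0)   # check that the recorded log-likelihoods never decrease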


Exercise 11.1.2(3)

Use both for loops and lapply() to fit linear models to the mtcars using the formulas stored in this list:

  • formulas <- list(
    • mpg ~ disp,
    • mpg ~ I(1 / disp),
    • mpg ~ disp + wt,
    • mpg ~ I(1 / disp) + wt
  • )

Answer Exercise 11.1.2(3)

The Idea is :

  • lapply() is a wrapper for a common for loop pattern: create a container for output, apply f() to each component of a list, and fill the container with the results. All other for loop functionals are variations on this theme: they simply use different types of input or output.

  • unlist() can be used to convert the output from a list to a vector to make it more compact. But data frames are also lists, lapply() is also useful when you want to do something to each column of a data frame.

formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)

 # with lapply()

print("***************************************** with lapply() ***********************************************************")

lapply(formulas, lm, data = mtcars)

 # with a for loop 

print("***************************************** with a for loop *********************************************************")

out_formulas <- vector('list', length(formulas))
   for(i in seq_along(formulas)) {
       out_formulas[[i]] <- lm(formulas[[i]], data = mtcars)
   }
   out_formulas

Interpretation:

  • From the results, we can see that the for loop produces the same fitted models as lapply().

  • The fitted models are:

    • [1] $mpg = 29.59985 - 0.04122disp$

    • [2] $mpg = 10.75 - 1557.67I(1 / disp)$

    • [3] $mpg = 34.96055 - 0.01772disp - 3.35083wt$

    • [4] $mpg = 19.024 - 1142.560I(1 / disp) - 1.798wt$


Exercise 11.1.2(4)

Fit the model $mpg$ ~ $disp$ to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function?

  • bootstraps <- lapply(1:10, function(i) {
    • rows <- sample(1:nrow(mtcars), rep = TRUE)
    • mtcars[rows, ]
  • })

Answer Exercise 11.1.2(4)

The Idea is :

  • The basic method is similar to the exercise 11.1.2(3)
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})

 # with lapply()

print("***************************************** with lapply() ***********************************************************")

lapply(bootstraps, lm, formula = mpg~disp)

 # with a for loop 

print("***************************************** with a for loop *********************************************************")

 out_bootstraps <- vector('list', length(bootstraps))
   for(i in seq_along(bootstraps)) {
       out_bootstraps[[i]] <- lm(mpg~disp, data = bootstraps[[i]])
   }
   out_bootstraps

Interpretation:

  • From the results, we can see that the for loop produces the same fitted models as lapply().

  • The fitted models are:

    • [1] $mpg = 28.96282 - 0.04149 disp$

    • [2] $mpg = 30.13856 - 0.04693 disp$

    • [3] $mpg = 27.09594 - 0.03264 disp$

    • [4] $mpg = 28.64848 - 0.03945 disp$

    • [5] $mpg = 29.94778 - 0.04477 disp$

    • [6] $mpg = 31.36958 - 0.04391 disp$

    • [7] $mpg = 28.92391 - 0.04218 disp$

    • [8] $mpg = 28.74858 - 0.03776 disp$

    • [9] $mpg = 28.70887 - 0.03956 disp$

    • [10] $mpg = 30.8249 - 0.0469 disp$


Exercise 11.1.2(5)

For each model in the previous two exercises, extract $R^2$ using the function below.

  • rsq <- function(mod) summary(mod)$r.squared

Answer Exercise 11.1.2(5)

The Idea is :

  • Use lapply function to complete the loop with different models

  • Then utilize the unlist function to extract the target values of $R^2$

rsq <- function(mod) summary(mod)$r.squared

bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})

models <- lapply(bootstraps, function(x) lm(mpg ~ disp, data = x))
unlist(lapply(models, rsq))

Interpretation:

  • The extracted $R^2$ values for the ten bootstrap models are:

  • $R_1^2 = 0.6606581, R_2^2 = 0.8160798, R_3^2 = 0.5806448, R_4^2 = 0.7407087, R_5^2 = 0.7589143,$

  • $R_6^2 = 0.6561107, R_7^2 = 0.6021043, R_8^2 = 0.6888394, R_9^2 = 0.5989015, R_{10}^2 = 0.7239530.$


Exercise 11.2.5(3)

The following code simulates the performance of a $t-test$ for non-normal data. Use sapply() and an anonymous function to extract the $p$-value from every trial.

  • trials <- replicate(
    • 100,
    • t.test(rpois(10, 10), rpois(7, 10)),
    • simplify = FALSE
  • )

Answer exercise 11.2.5(3)

trials <- replicate(
     100, 
     t.test(rpois(10, 10), rpois(7, 10)),
     simplify = FALSE
     )

print("***************************************** with sapply() *******************************************************")

sapply(trials, function(mod) mod$p.value)

#Extra challenge: get rid of the anonymous function by using `[[` directly.
print("*************************************** get rid of the anonymous function ********************************************")

sapply(trials, `[[`, 'p.value')

Interpretation:

  • From the results shown above, we can see that sapply() together with an anonymous function extracts the desired p-values.

  • When getting rid of the anonymous function, we can also get the same results.


Exercise 11.2.5(6)

Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?

Answer Exercise 11.2.5(6)

The Idea is :

  • lapply() takes a function, applies it to each element in a list, and returns the results in the form of a list.

  • vapply() is very similar to lapply() except that it simplifies its output to produce an atomic vector. vapply() is more verbose, but gives more informative error messages and never fails silently. It is better suited for use inside other functions. vapply() is an implementation of lapply() that assigns results to a vector (or matrix) of appropriate type instead of as a list.

  • With lapply(), only one argument to the function varies; the others are fixed. This makes it poorly suited for some problems.

  • There's a natural equivalence between Map() and lapply() because you can always convert a Map() to an lapply() that iterates over indices.

library(parallel)
mcvMap <- function(f, FUN.VALUE , ...) {
  out <- mcMap(f, ...)
  vapply(out, identity, FUN.VALUE)
   }
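
A quick illustrative call (a sketch; on Windows you may need options(mc.cores = 1), since fork-based parallelism is unavailable there):

# element-wise sums of two vectors, collected into a numeric vector
mcvMap(function(a, b) a + b, numeric(1), 1:3, 4:6)   # 5 7 9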

Exercise 17.5.1(4)

Make a faster version of $chisq.test()$ that only computes the $chi-square$ test statistic when the input is two numeric vectors with no missing values. You can try simplifying $chisq.test()$ or by coding from the mathematical definition ($http://en. wikipedia.org/wiki/Pearson%27s_chi-squared_test$).

Answer Exercise 17.5.1(4)

  • 1, We first use getAnywhere(chisq.test) to trace the source code of chisq.test()
 getAnywhere(chisq.test)
  • 2, Since chisq.test() has a relatively long source code, we try a new implementation from scratch:
chisq.test2 <- function(x, y){

  # Input
  if (!is.numeric(x)) {
    stop("x must be numeric")}
  if (!is.numeric(y)) {
    stop("y must be numeric")}
  if (length(x) != length(y)) {
    stop("x and y must have the same length")}
  if (length(x) <= 1) {
    stop("length of x must be greater one")}
  if (any(c(x, y) < 0)) {
    stop("all entries of x and y must be greater or equal zero")}
  if (sum(complete.cases(x, y)) != length(x)) {
    stop("there must be no missing values in x and y")}
  if (any(is.null(c(x, y)))) {
    stop("entries of x and y must not be NULL")}

  # Help variables
  m <- rbind(x, y)
  margin1 <- rowSums(m)
  margin2 <- colSums(m)
  n <- sum(m)
  me <- tcrossprod(margin1, margin2) / n

  # Output
  x_stat = sum((m - me)^2 / me)
  dof <- (length(margin1) - 1) * (length(margin2) - 1)
  p <- pchisq(x_stat, df = dof, lower.tail = FALSE)

  return(list(x_stat = x_stat, df = dof, `p-value` = p))
}
  • 3, We check if our new implementation returns the same results
a <- 21:25
b <- c(21, 23, 25, 27, 29)
m <- cbind(a, b)
m

chisq.test(m)
chisq.test2(a, b)
  • 4, Finally we benchmark this implementation against a compiled version of itself and the original stats::chisq.test():
chisq.test2c <- compiler::cmpfun(chisq.test2)

microbenchmark::microbenchmark(
  chisq.test(m),
  chisq.test2(a, b),
  chisq.test2c(a, b)
)

Exercise 17.5.1(5)

Can you make a faster version of $table()$ for the case of an input of two integer vectors with no missing values? Can you use it to speed up your $chi-square$ test?

Answer Exercise 17.5.1(5)

  • We first use getAnywhere(table) to trace the source code of table()
getAnywhere(table)

Then we write a faster version of table(), which can be used to speed up the $chi-square$ test.

table2 <- function(x, y) {
  x_val <- unique(x)
  y_val <- unique(y)
  mat <- matrix(0L, length(x_val), length(y_val))
  for (i in seq_along(x)) {
    mat[which(x_val == x[[i]]), which(y_val == y[[i]])] <-
      mat[which(x_val == x[[i]]), which(y_val == y[[i]])] + 1L
  }
  dimnames <- list(x_val, y_val)
  names(dimnames) <- as.character(as.list(match.call())[-1])  # R has names for dimnames... :/
  tab <- array(mat, dim = dim(mat), dimnames = dimnames)
  class(tab) <- "table"
  tab
}

check:

a <- c(1, 2, 3)

identical(table(a, a), table2(a, a))

We benchmark this implementation against the original table():

microbenchmark::microbenchmark(table(a, a), table2(a, a))

check

b <- c(2, 3, 4)
identical(table(a, b), table2(a, b))

c <- c(1, 2, 3, 1, 2, 3)
d <- c(2, 3, 4, 2, 3, 4)
identical(table(c, d), table2(c, d))

e <- c(1, 2, 2)
identical(table(a, e), table2(a, e))

identical(table(b, e), table2(b, e))

identical(table(e, e), table2(e, e))

f <- c(1, 1, 1)
identical(table(f, f), table2(f, f))


identical(table(e, f), table2(e, f))

g <- c(1, 4, 9)
identical(table(g, g), table2(g, g))
identical(table(g, f), table2(g, f))


h <- c(10, 20, 30)
identical(table(h, h), table2(h, h))

identical(table(h, f), table2(h, f))

identical(table(h, g), table2(h, g))

i <- c(0, 0, 0)

identical(table(i, i), table2(i, i))

identical(table(h, i), table2(h, i))

Then, for a larger example:

j <- c(0, 10, 100, 1000, 10000)

identical(table(j, j), table2(j, j))

microbenchmark::microbenchmark(table(j, j), table2(j, j))

Thanks for reading!





