Overview

StatComp18063 is a simple R package containing the homework for the 'Statistical Computing' course in the USTC Department of Statistics and Finance. The package includes three functions and all of the homework for student 18063.

Homework 1

Question

Write a .Rmd file to implement at least three examples of different types from the book "R for Beginners" (texts, numerical results, tables, and figures).

Answer

In this example, we will use the package 'lattice' to plot some graphs. First, the package must be loaded into memory with the command $\color{red}{\rm{'library(lattice)'}}$ so that the functions of the package can be accessed.

library(lattice)

Let's start with density plots, produced with the function $\color{red}{\rm{'densityplot(\sim x)'}}$. In each panel, a fitted normal density curve is superimposed on the empirical density curve; this requires the argument 'panel' to define what is drawn in each panel.

The code is as follows:

n=seq(5,45,5)
x=rnorm(sum(n))
y=factor(rep(n,n),labels = paste('n=',n))
densityplot(~x|y,
            panel = function(x, ...) {
              panel.densityplot(x, col="DarkOliveGreen", ...)
              panel.mathdensity(dmath=dnorm,
                                args=list(mean=mean(x), sd=sd(x)),
                                col="darkblue")
              })

Then we use the dataset $\color{red}{\rm{'quakes'}}$ to show the geographical location of the earthquakes at different depths.

The code is as follows:

data(quakes)
mini = min(quakes$depth)
maxi = max(quakes$depth)
int = ceiling((maxi - mini)/9)
inf = seq(mini, maxi, int)
quakes$depth.cat = factor(floor(((quakes$depth - mini) / int)),
labels=paste(inf, inf + int, sep="-"))
xyplot(lat ~ long | depth.cat, data = quakes)

Next, a simple example of analysis of variance. In the package $\color{red}{\rm{'stats'}}$, the function $\color{red}{\rm{'aov'}}$ performs the analysis of variance. To demonstrate this function, we use the dataset $\color{red}{\rm{'InsectSprays'}}$.

We use the function $\color{red}{\rm{'data'}}$ to load the dataset. The code is as follows:

data(InsectSprays)
aov.spray = aov(sqrt(count) ~ spray, data = InsectSprays)
aov.spray
termplot(aov.spray, se=TRUE, partial.resid=TRUE, rug=TRUE)

In this example, we show how to work with packages in R. First, the function $\color{red}{\rm{'search'}}$ lists the packages that are currently loaded.

search()

Other packages need to be loaded before they can be used:

library(grid)

Functions that can be used in a package can be displayed in the following way:

library(help = grid)

Homework 2

Question

A discrete random variable $X$ has probability mass function

\begin{equation} \begin{array}{c|ccccc} x & 0 & 1 & 2 & 3 & 4 \\ \hline p(x) & 0.1 & 0.2 & 0.2 & 0.2 & 0.3 \end{array} \end{equation}

Use the inverse transform method to generate a random sample of size 1000 from the distribution of $X$. Construct a relative frequency table and compare the empirical with the theoretical probabilities. Repeat using the R sample function.

Answer

Use the inverse transform method to generate a random sample of size 1000 from the distribution of $X$.

The code is as follows:

set.seed(977)   ##  set a random seed
x=c(0:4)        ## generate the x
p=c(0.1,0.2,0.2,0.2,0.3)  ## generate the probability
cp=cumsum(p)   ## cumulative probabilities
n=1000      ## set the sample of size 1000 
r=numeric(n)
r=x[findInterval(runif(n),cp)+1] ## the inverse transform method
table(r) ## construct a relative frequency table 
ct=as.vector(table(r))
ct/sum(ct)  ## compare the empirical probability
ct/sum(ct)/p  ## the ratio of empirical probability to theoretical probability

According to the results above, the inverse transform method performs very well: the empirical probabilities are close to the theoretical ones.

Now we compare the empirical and theoretical probabilities in a bar chart. (The question also asks to repeat the generation with the R sample function; a short sketch is given after the plot.)

Y=p*n  ## theoretical (expected) counts
counts = rbind(table(r),Y) ## combine observed and expected counts into a matrix
rownames(counts)=c('r','Y')  ## row names of the matrix
counts
barplot(counts/1000, main="multinomial distribution generation",
        xlab="x",ylab='prob', col=c("darkblue","red"),
        beside=TRUE,legend.text = c('empirical','theoretical'),
        args.legend = list(x = "topleft")) 
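
The question also asks to repeat the generation using the R sample function; a minimal sketch (reusing x, n and p defined above; the seed is an arbitrary choice) is:

set.seed(978)
r2=sample(x,size=n,replace=TRUE,prob=p)  ## draw directly from the pmf with sample()
table(r2)/n        ## empirical relative frequencies
(table(r2)/n)/p    ## ratio of empirical to theoretical probabilities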

Question

Write a function to generate a random sample of size n from the Beta(a, b) distribution by the acceptance-rejection method. Generate a random sample of size 1000 from the Beta(3,2) distribution. Graph the histogram of the sample with the theoretical Beta(3,2) density superimposed.

Answer

The pdf of Beta(3,2) is $$f(x)=12x^2(1-x), \quad 0<x<1.$$ We use the acceptance-rejection method with $g(x)=1$ on $(0,1)$ and $c=12$, so a candidate $x$ drawn from $g$ is accepted when $u < f(x)/(cg(x)) = x^2(1-x)$. The code is as follows:

set.seed(53) ## set a random seed
n=1000
k=0  ## counter for accepted
j=0  ## iterations
y=numeric(n)

 while (k<n) {
  u=runif(1)
  j=j+1
  x=runif(1)  ## random variate from g
  if(x^2*(1-x)>u) {
    k=k+1
    y[k]=x
  }
}
j
hist(y,breaks=15,freq=F, main='empirical vs theoretical') ## plot the histogram
x1=seq(0,1,0.02)
y1=12*x1^2*(1-x1)
lines(x1,y1)   ## add the theoretical curve

In this simulation, 12471 iterations were required to generate the 1000 beta variates, close to the expected number $cn = 12\times 1000 = 12000$, since each candidate is accepted with probability $1/c$. According to the histogram above, the acceptance-rejection method performs very well: the empirical density closely matches the theoretical Beta(3,2) density.

Question

Simulate a continuous Exponential-Gamma mixture. Suppose that the rate parameter $\Lambda$ has Gamma(r,$\beta$) distribution and $Y$ has Exp($\Lambda$) distribution. That is, $$(Y|\Lambda=\lambda)\sim f_Y(y|\lambda)=\lambda e^{-\lambda y}.$$ Generate 1000 random observations from this mixture with $r = 4$ and $\beta = 2$.

Answer

The pdf of Gamma(4,2) is: $$p(\lambda)=\frac{2^4}{\Gamma(4)}\lambda^3e^{-2\lambda}, \lambda>0.$$ Given the $\Lambda=\lambda$, Y has $Exp(\lambda)$ distribution. That is, $$p(y|\Lambda=\lambda)=\lambda e^{-\lambda y}.$$

Now, we can simulate a continuous Exponential-Gamma mixture. The code is as follows:

# generate an Exponential-Gamma mixture
set.seed(473)
n=1000
r=4
beta=2
lambda = rgamma(n, r, beta) ## generate the $\lambda$ from the Gamma(4,2) distribution

#now supply the sampled lambdas as the rate parameters of the exponential distribution

y_lambda = rexp(n, lambda) ##  generate the mixture

hist(y_lambda,col='gray',ylim=c(0,800),xlim=c(0,12),main='Exponential-Gamma mixture')

Homework 3

Question

Write a function to compute a Monte Carlo estimate of the Beta(3, 3) cdf, and use the function to estimate F(x) for x = 0.1, 0.2, . . ., 0.9. Compare the estimates with the values returned by the pbeta function in R.

The pdf of Beta(3,3) is : $$p(x)=30x^2(1-x)^2,\quad 0<x<1.$$

So the cdf of Beta(3,3) is $$F(x)=\int_0^x 30t^2(1-t)^2\,dt.$$ Let $y=t/x$; then $dt=x\,dy$ and $$F(x)=\int_0^1 30x^3y^2(1-xy)^2\,dy=30\theta,$$ where $\theta=\int_0^1 x^3y^2(1-xy)^2\,dy=E_Y[x^3Y^2(1-xY)^2]$ and $Y$ has the Uniform(0,1) distribution. Generate iid Uniform(0,1) random numbers $u_1,\dots,u_m$ and compute $$\widehat{\theta}=\frac{1}{m}\sum_{i=1}^m x^3u_i^2(1-xu_i)^2,$$ so that $\widehat{F}(x)=30\widehat{\theta}$. The R code is as follows:

set.seed(432)
m=10000
u=runif(m)
##  the function to generate cdf of Beta(3,3)
CDF=function(x){
  theta_hat=mean(x^3*u^2*(1-x*u)^2)
  CDF_hat=30*theta_hat
  CDF_hat  ## return the Monte Carlo estimate of F(x)
}

q=seq(0.1,0.9,0.1) 
Phi=pbeta(q,3,3)   ## cdf of Beta(3,3) at the points q, by the function 'pbeta'


##  Monte Carlo estimates of the cdf at the points q, by the function CDF
cdf=rep(0,length(q))  
for (i in 1 : length(q)) { 
  cdf[i] = CDF(q[i]) 
  }
## Compare the estimator 
print(round(rbind(q,cdf,Phi),3))

The Rayleigh density [156, (18.76)] is $$f(x)=\frac{x}{\sigma^2}e^{-x^2/(2\sigma^2)},\quad x\geq 0,\sigma>0.$$ Implement a function to generate samples from a Rayleigh($\sigma$) distribution, using antithetic variables. What is the percent reduction in variance of $\frac{X+X^\prime}{2}$ compared with $\frac{X_1+X_2}{2}$ for independent $X_1,X_2$?

The cdf of the Rayleigh distribution is $$F(x)=1-\exp\Big( \frac{-x^2}{2\sigma^2} \Big),$$ so the inverse transform is $$x=\sqrt{-2\sigma^2\log(1-y)}.$$ Because the mean is $\sigma\sqrt{\pi/2}$, we use the sample mean as the estimator and set $\sigma=1,2,3,4,5$ to compare the estimators obtained with the two sampling methods. We then let $\sigma=1$ to compare the percent reduction in variance of the antithetic-variables method relative to simple Monte Carlo with independent uniforms.

## function to estimate the Rayleigh mean by Monte Carlo, with antithetic or independent uniforms

  mc=function(x,R=10000,antithetic = TRUE){
    u=runif(R/2) 
    if (!antithetic)  v=runif(R/2) else
      v=1-u
    u=c(u,v)
    me=numeric(length(x))
    for (i in 1:length(x)){
      g=x[i]*(sqrt(2*log(1/(1-u))))
      me[i]=mean(g)
    }
    me
  }
sigma=c(1,2,3,4,5)  ## given the $\sigma$
mc1=mc(sigma)  ## antithetic variables method
mc2=mc(sigma,anti=F) ## simple Monte Carlo with independent uniforms
print(rbind(sigma,mc1,mc2))  

### compare  the percent reduction in variance
  m=1000  ## repeat 1000 times 
  mc1=mc2=numeric(m)
  sigma=1  ## let $\sigma=1$
  for(i in 1:m){
    mc1[i]=mc(sigma,R=1000)
    mc2[i]=mc(sigma,R=1000,anti=F)
  }
print(sd(mc1))  ## standard deviation of the antithetic-variables estimator
print(sd(mc2))  ## standard deviation of the simple Monte Carlo estimator
print((var(mc2)-var(mc1))/var(mc2)) ## the percent reduction in variance

Find two importance functions $f_1$ and $f_2$ that are supported on $(1,\infty)$ and are 'close' to $$g(x)=\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2},\quad x>1.$$ Which of your two importance functions should produce the smaller variance in estimating

$$\int_1^\infty \frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx$$ by importance sampling? Explain.

Let $$f_1(x)=xe^{(1-x^2)/2}, \quad x>1;$$

then $f_1(x)$ is a pdf supported on $(1,\infty)$.

and

$$ f_2(x)=\frac{1}{x^2},\quad x\in(1,\infty).$$ Both $f_1$ and $f_2$ are importance functions for $g(x)$, and $f_1$ should produce the smaller variance in importance sampling, because $f_1$ is closer in shape to $g(x)$, so the ratio $g(x)/f_1(x)$ is closer to a constant. A small numerical check is sketched below.
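
As an illustrative check (not part of the original answer; the number of draws $m=10000$ and the seed are arbitrary choices), we can sample from each importance function by the inverse transform and compare the variability of the importance-sampling weights $g(X)/f(X)$:

set.seed(238)
m=10000
g=function(x) x^2/sqrt(2*pi)*exp(-x^2/2)
u=runif(m)
x1=sqrt(1-2*log(1-u))         ## inverse transform for f1(x)=x*exp((1-x^2)/2), x>1
w1=g(x1)/(x1*exp((1-x1^2)/2))
x2=1/(1-runif(m))             ## inverse transform for f2(x)=1/x^2, x>1
w2=g(x2)*x2^2
c(theta.f1=mean(w1),theta.f2=mean(w2))  ## both estimate the integral (true value about 0.400)
c(sd.f1=sd(w1),sd.f2=sd(w2))            ## f1 is expected to give the smaller variance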

Obtain a Monte Carlo estimate of $$\int_1^\infty \frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx$$ by importance sampling.

$$\theta=\int_1^\infty \frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx=\int_0^\infty\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx-\int_0^1\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx=\frac{1}{2}-\int_0^1\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx.$$

Let the importance function be $$f_1(x)=\frac{1}{1-e^{-\frac{1}{2}}}xe^{-\frac{x^2}{2}},\quad x\in(0,1).$$

then $$\theta=\frac{1}{2}-\int_0^1\frac{1-e^{-\frac{1}{2}}}{\sqrt{2\pi}}xf_1(x)dx.$$

so $$\hat{\theta}=\frac{1}{2}-\frac{1-e^{-\frac{1}{2}}}{\sqrt{2\pi}}\,\frac{1}{m}\sum_{i=1}^m X_i,$$

where $X_i\sim f_1(x).$

The code is as follows:

set.seed(345)
m=10000
u=runif(m)
x=sqrt(-2*log(1-(1-exp(-0.5))*u)) ## inverse transform method to generate a sample from the importance function f1
theta_hat=1/2-(1-exp(-1/2))/sqrt(2*pi)*mean(x)
theta_hat

Homework 4

Question

Estimate by simulation the mean, median and deciles of $\hat{G}$ (the Gini-ratio estimator) if $X$ is standard lognormal. Repeat the procedure for the uniform distribution and Bernoulli(0.1). Also construct density histograms of the replicates in each case.

Answer

The pdf of the standard lognormal distribution is: $$p(x)=\frac{1}{x\sqrt{2\pi}}e^{-\frac{(\log x)^2}{2}} .$$

set.seed(436)
### generate the random numbers: 1000 replicates, each of sample size 2000

data=matrix(rlnorm(1000*2000,mean=0,sd=1),nrow=2000,ncol=1000)

logn_mean=rep(0,ncol(data)) ## sample mean of X for each of the 1000 replicates
logn_median=rep(0,ncol(data)) ## sample median of X for each of the 1000 replicates
logn_G=rep(0,ncol(data)) ## Gini-ratio estimate G for each of the 1000 replicates

## repeat 1000 times

for (i in 1:ncol(data)){
  logn_mean[i]=mean(data[,i]) ## compute the sample mean
  data[,i]=sort(data[,i],decreasing=F) ## sort the random number from minimum to maximum.
  logn_median[i]=(data[,i][nrow(data)/2]+data[,i][nrow(data)/2+1])/2 ## compute the sample median (average of the two middle order statistics)

  ##compute the Gini ratio via order statistics
  n=nrow(data)
  d=rep(0,n)
  for(j in 1:n) {
    d[j]=(2*j-n-1)*data[,i][j]
  }
logn_G[i]=sum(d)/(n^2*logn_mean[i]) ## Gini ratio: (1/(n^2*xbar)) * sum_j (2j-n-1) x_(j)
}
hist(logn_G)  
hist(logn_mean)
hist(logn_median)
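
The question also asks for the mean, median and deciles of $\hat{G}$ itself; a minimal addition over the simulated replicates (not in the original code) is:

c(mean=mean(logn_G),median=median(logn_G))   ## mean and median of the replicated Gini estimates
quantile(logn_G,probs=seq(0.1,0.9,0.1))      ## deciles of G-hat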

Now we repeat the above code for the Uniform(0,1) distribution.

set.seed(1436)
### generate the random numbers: 1000 replicates, each of sample size 2000

data=matrix(runif(1000*2000),nrow=2000,ncol=1000)

unif_mean=rep(0,ncol(data)) ## sample mean of X for each of the 1000 replicates
unif_median=rep(0,ncol(data)) ## sample median of X for each of the 1000 replicates
unif_G=rep(0,ncol(data)) ## Gini-ratio estimate G for each of the 1000 replicates

## repeat 1000 times

for (i in 1:ncol(data)){
  unif_mean[i]=mean(data[,i]) ## compute the sample mean
  data[,i]=sort(data[,i],decreasing=F) ## sort the random number from minimum to maximum.
  unif_median[i]=(data[,i][nrow(data)/2]+data[,i][nrow(data)/2+1])/2 ## compute the sample median (average of the two middle order statistics)

  ##compute the Gini ratio via order statistics
  n=nrow(data)
  d=rep(0,n)
  for(j in 1:n) {
    d[j]=(2*j-n-1)*data[,i][j]
  }
unif_G[i]=sum(d)/(n^2*unif_mean[i]) ## Gini ratio via order statistics
}
hist(unif_G)  
hist(unif_mean)
hist(unif_median)

For Bernoulli(0.1) the code is analogous; a short sketch is given below.
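
A minimal sketch for the Bernoulli(0.1) case, mirroring the code above (the seed is an arbitrary choice):

set.seed(2436)
data=matrix(rbinom(1000*2000,size=1,prob=0.1),nrow=2000,ncol=1000)
bern_mean=rep(0,ncol(data))    ## sample mean of X for each replicate
bern_median=rep(0,ncol(data))  ## sample median of X for each replicate
bern_G=rep(0,ncol(data))       ## Gini-ratio estimate for each replicate
n=nrow(data)
for (i in 1:ncol(data)){
  bern_mean[i]=mean(data[,i])
  data[,i]=sort(data[,i])  ## order statistics
  bern_median[i]=(data[,i][n/2]+data[,i][n/2+1])/2
  j=1:n
  bern_G[i]=sum((2*j-n-1)*data[,i])/(n^2*bern_mean[i])  ## Gini ratio via order statistics
}
hist(bern_G)
hist(bern_mean)
hist(bern_median)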

Question

Construct an approximate 95% confidence interval for the Gini ratio $\gamma= E[G]$ if $X$ is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.

Answer

We assume that the parameters $\mu,\sigma$ are unknown, so the pdf is $$p(x)=\frac{1}{x\sigma\sqrt{2\pi}}e^{-\frac{(\log x-\mu)^2}{2\sigma^2}},\quad x>0.$$

We use the plug-in method to estimate $G$. The code is as follows:

## write a function: n denotes the sample size, m the number of replicates, mu and sigma the parameters of the lognormal distribution

mc_fun=function(n,m,mu,sigma){
  set.seed(528)
  data=matrix(rlnorm(m*n,mu,sigma),nrow=n,ncol=m)
  logn_G=rep(0,m)
  logn_mean=rep(0,m)
  for(i in 1:m){
    logn_mean[i]=mean(data[,i])
     data[,i]=sort(data[,i],decreasing=F) ##sort the random number from minimum to maximum.(obtain the order statistics)

    ## compute the Gini ratio via order statistics

       d=rep(0,n)
       for(j in 1:n)  d[j]=(2*j-n-1)*data[,i][j]

       logn_G[i]=sum(d)/(n^2*logn_mean[i])  ## Gini ratio via order statistics
  }
  ## print the sample mean, sample sd, and the approximate 95% CI
print(c(mean(logn_G),sd(logn_G),mean(logn_G)-2*sd(logn_G),
        mean(logn_G)+2*sd(logn_G)))
}
Gini=mc_fun(1000,1000,1,1)

According to the result, for $\mu=\sigma=1$ the sample mean of $\hat{G}$ is about 0.5186, the sample sd is about 0.0127, and the approximate 95% CI is [0.4932, 0.5441], close to the theoretical Gini ratio $2\Phi(\sigma/\sqrt{2})-1\approx 0.52$ for $\sigma=1$.

Question

Tests for association based on Pearson product moment correlation $\rho$, Spearman's rank correlation coefficient $\rho_s$, or Kendall's coefficient $\tau$, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution (X, Y) such that X and Y are dependent) such that at least one of the nonparametric tests have better empirical power than the correlation test against this alternative.

Answer

We take the null hypothesis $H_0: \rho=0$ and the alternative hypothesis $H_1: \rho\neq 0.$

We first simulate from a bivariate normal distribution with $\mu_1=\mu_2=0$, $\sigma_1^2=\sigma_2^2=1$, $\rho=0.5$ and apply the three tests. A small power comparison is sketched after the code.

##Simulate from a Multivariate Normal Distribution 
library(MASS) 
set.seed(4356)
Sigma = matrix(c(1,0.5,0.5,1),2,2) 
Sigma 
x=mvrnorm(n=1000, rep(0, 2), Sigma)
cor.test(x[,1],x[,2],method='pearson')  ##  Pearson correlation coefficient
cor.test(x[,1],x[,2],method='spearman') ## spearman rank correlation coefficient
cor.test(x[,1],x[,2],method='kendall')  ## kendall correlation coefficient
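
The power comparison that the question asks for can be sketched as follows (the sample size $n=50$, $\rho=0.3$, the number of replicates, and the heavy-tailed alternative are illustrative assumptions, not part of the original answer). For each alternative we estimate the empirical power of the three tests at level 0.05:

library(MASS)  ## for mvrnorm (already loaded above)
set.seed(4357)
power.sim=function(gen,m=1000,alpha=0.05){
  rej=matrix(0,m,3)
  colnames(rej)=c("pearson","spearman","kendall")
  for(i in 1:m){
    xy=gen()
    rej[i,1]=cor.test(xy[,1],xy[,2],method="pearson")$p.value<alpha
    rej[i,2]=cor.test(xy[,1],xy[,2],method="spearman",exact=FALSE)$p.value<alpha
    rej[i,3]=cor.test(xy[,1],xy[,2],method="kendall",exact=FALSE)$p.value<alpha
  }
  colMeans(rej)  ## empirical power of each test
}
n=50
## bivariate normal alternative with rho=0.3: the Pearson test is expected to be most powerful
gen.normal=function() mvrnorm(n,rep(0,2),matrix(c(1,0.3,0.3,1),2,2))
## monotone dependence with heavy-tailed (Cauchy) noise: the rank tests are expected to do better
gen.heavy=function(){x=rnorm(n); cbind(x,x+rcauchy(n))}
power.sim(gen.normal)
power.sim(gen.heavy)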

Homework 5

Compute a jackknife estimate of the bias and the standard error of the correlation statistic in Example 7.2.

Answer

set.seed(1)
library(bootstrap)
data=law  # data set
head(data)
n=nrow(data)  # sample size
theta.hat=cor(data[,1],data[,2]) # the sample correlation.
#### jackknife
theta.jack=numeric(n) 
for (i in 1:n) {   ##leave one out
  data.jack=data[-i,]
  theta.jack[i]=cor(data.jack[,1],data.jack[,2])
}
jack.bias=(n-1)*mean(theta.jack-theta.hat) # jackknife bias
jack.sd=sqrt((n-1)*mean((theta.jack - mean(theta.jack))^2)) # jackknife sd
round(c(theta.hat=theta.hat,theta.jack=mean(theta.jack),
        jack.bias=jack.bias,jack.sd=jack.sd),5)

Refer to Exercise 7.4. Compute 95% bootstrap confidence intervals for the mean time between failures $1/\lambda$ by the standard normal, basic, percentile, and BCa methods. Compare the intervals and explain why they may differ.

Answer

The pdf of the exponential distribution is

$$f(x)=\lambda e^{-\lambda x}, x>0,$$ and the mean is $1/\lambda.$

This is a bootstrap estimation problem: first we use the boot function to obtain the bootstrap replicates (and hence the bootstrap bias and standard error of the sample mean), then the boot.ci function to compute the confidence intervals.

library(boot)

set.seed(1)
#  define a function to compute the boostrap estimator for sample mean.
air.fun <- function(data,i) {
  m <- mean(data$hours[i]) # sample mean
  n <- length(i)
  v <- (n-1)*var(data$hours[i])/n^2 # estimated variance of the sample mean
  c(m, v)
}

# use  boot function, repeat 1000 times
air.boot <- boot(aircondit, air.fun, R = 1000) 
air.boot # t1* is the bootstrap replicate of the sample mean, t2* of its estimated variance

# use boot.ci function to obtain the CI
ci <- boot.ci(air.boot,type=c("norm","basic","perc","bca")) 
ci

The four intervals differ because only 12 failure times are resampled and the bootstrap distribution of the sample mean is right-skewed. The standard normal and basic intervals rely on approximate normality/symmetry of this distribution, the percentile interval follows its skewness directly, and the BCa interval additionally corrects for bias and acceleration, so the endpoints do not coincide.

Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.

Answer

library(bootstrap) #loading packages
data=as.matrix(scor) # convert the data frame to a matrix
head(data)
n=nrow(data) # sample size
lambda.hat=eigen(cov(data))$values ## eigenvalues of the sample covariance matrix
theta.hat=lambda.hat[1]/sum(lambda.hat)  ## sample estimator 
#### jackknife
theta.jack=numeric(n)
for(i in 1:n){          ##leave one out
  data.jack=data[-i,]
  lambda.jack=eigen(cov(data.jack))$values ## eigenvalues of the leave-one-out sample covariance matrix
  theta.jack[i]=lambda.jack[1]/sum(lambda.jack)
}
jack.bias=(n-1)*mean(theta.jack-theta.hat) #jackknife  bias
jack.sd=sqrt((n-1)*mean((theta.jack - mean(theta.jack))^2)) #jackknife se
round(c(theta.jack=mean(theta.jack),theta.hat=theta.hat,
        jack.bias=jack.bias,jack.sd=jack.sd),7)

In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.

Answer

The sample size of the dataset is 53. Leaving two observations out at a time gives $53\times52/2 = 1378$ possible test pairs. For each pair we fit the models on the remaining $n-2$ observations and use the two held-out observations to compute the prediction errors.

library(DAAG) # loading packages
data=ironslag #data set
n=nrow(data) # sample size
# create a matrix to store the test-set indices for leave-two-out CV
index=matrix(nrow=n*(n-1)/2,ncol=2) 

# we enumerate all pairs of test indices;
# there are 53*52/2 = 1378 such pairs

a=c(1:n) # observation indices 1:n
l=1
for(i in 1: (n-1)){
  for(j in (i+1):n){
index[l,]=(a[c(i,j)]) 
l=l+1
  }    
}
##
e1= e2 = e3 = e4 = matrix(nrow=nrow(index),ncol=2)

# fit models on leave-two-out samples
for(l in 1:nrow(index)){
  k=index[l,] #the index of testing data 
  y=data[-k,2] # the y of training data
  x=data[-k,1] # the x of training data

## linear model
  J1 = lm(y ~ x) # linear model
  yhat1 = J1$coef[1] + J1$coef[2] * data[k,1] # yhat on testing data
  e1[l,] = data[k,2] - yhat1

  ## quadratic model
  J2 <- lm(y ~ x + I(x^2)) # quadratic model
  yhat2 <- J2$coef[1] + J2$coef[2] * data[k,1] + # yhat on testing data
    J2$coef[3] * data[k,1]^2
  e2[l,] = data[k,2] - yhat2

  ## exponential model
  J3 = lm(log(y) ~ x) # exponential model
  logyhat3 = J3$coef[1] + J3$coef[2] * data[k,1] # yhat on testing data
  yhat3 <- exp(logyhat3)
  e3[l,] = data[k,2] - yhat3

  ## log-log model
  J4 <- lm(log(y) ~ log(x)) # log-log model
  logyhat4 <- J4$coef[1] + J4$coef[2] * log(data[k,1]) # yhat on testing data
  yhat4 <- exp(logyhat4)
  e4[l,] = data[k,2] - yhat4

}
print(c(mse1=mean(e1^2), mse2=mean(e2^2), mse3=mean(e3^2), mse4=mean(e4^2)))

Homework 6

Question 8.1

Implement the two-sample Cramer-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.

Answer

We implement the two-sample Cramer-von Mises test for equal distributions as follows:

cvm.d<-function(x,y){ #compute the Cramer-von Mises statistic
  n<-length(x);m<-length(y)
  Fn<-ecdf(x);Gm<-ecdf(y)
  W2<-((m*n)/(m+n)^2)*
      (sum((Fn(x)-Gm(x))^2)+sum((Fn(y)-Gm(y))^2))
  W2
}

Next, we apply the test to the data in Examples 8.1 and 8.2.

library(boot)
attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
detach(chickwts)

set.seed(611)
R <- 999 #number of replicates
z <- c(x, y) #pooled sample
n<-length(x)
m<-length(y)
N <- 1:(m+n)
reps <- numeric(R) #storage for replicates
W2 <- cvm.d(x, y)
W2
for (i in 1:R) {
#generate indices k for the first sample
k <- sample(N, size = n, replace = FALSE)
x1 <- z[k]
y1 <- z[-k] #complement of x1
reps[i] <- cvm.d(x1, y1)
}
p <- mean(c(W2, reps) >= W2)
p

The observed Cramer-von Mises statistic is $W_2 = 0.1515236$. The approximate achieved significance level (ASL) of 0.394 implies that there is not enough evidence to reject the null hypothesis of equal distributions.

Question 9.3

Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 values of the chain, and compare the deciles of the generated observations with the deciles of the standard Cauchy distribution (see qcauchy or qt with df=1). Recall that a Cauchy($\theta,\eta$) distribution has density function

$$f(x)=\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)}, \quad -\infty<x<\infty,\ \theta>0.$$

The standard Cauchy has the Cauchy($\theta=1,\eta=0$) density. (Note that the standard Cauchy density is equal to the Student t density with one degree of freedom.)

Answer

We use a random-walk Metropolis sampler whose proposal adds a Uniform(-1,1) increment to the current value. Because the proposal is symmetric, the proposal densities cancel in the acceptance ratio (the two dnorm factors in the code below are equal, so they cancel as well).

set.seed(1)
f <- function(x, theta,eta) {
  stopifnot(theta > 0)
  return(1/(theta*pi*(1+((x-eta)/theta)^2))) 
} 
m <- 10000
theta <- 1
eta <- 0
x <- numeric(m) 
x[1] <- runif(1,-1,1) 
k <- 0
u <- runif(m)

for (i in 2:m) {
  xt <- x[i-1]
  y <- xt+runif(1,min=-1,max=1)
  num <- f(y, theta,eta)*dnorm(xt,mean=y,sd=1) 
  den <- f(xt, theta,eta)*dnorm(y,mean=xt,sd=1)
  if (u[i] <= num/den) x[i] <- y else {
    x[i] <- xt
    k <- k+1         #y is rejected 
  } 
}
print(k) 

In this exercise, approximately 17% of the candidate points are rejected, so the chain is reasonably efficient.

To see the generated sample as a realization of a stochastic process, we can plot the sample vs the time index. The following code will display a partial plot starting at time index 1500.

index <- 1500:2000
y1 <- x[index] 
plot(index, y1, type="l", main="", ylab="x") 

The following code compares the quantiles of the target Cauchy(1,0) distribution with the quantiles of the generated chain in a quantile-quantile plot (QQ plot).

gene<- quantile(x[-(1:1000)],seq(0.1,0.9,0.1))
empi<-qt(seq(0.1,0.9,0.1),1)
rbind(gene,empi)
b <- 1001     #discard the burnin sample 
y <- x[b:m] 
probs <- seq(0.1,0.9,0.01)
QR <- qt(probs,df=1)
Q <- quantile(y, probs)  # quantiles of the chain after burn-in

qqplot(QR, Q, main="", 
       xlab="Cauchy Quantiles", ylab="Sample Quantiles")

hist(y, breaks="scott", main="", xlab="", freq=FALSE) 
lines(QR, f(QR, 1,0))

The QQ plot is an informal approach to assessing the goodness-of-fit of the generated sample to the target distribution. From the plot, it appears that the sample quantiles are in approximate agreement with the theoretical quantiles.

Question 9.6

Rao [220, Sec. 5g] presented an example on genetic linkage of 197 animals in four categories (also discussed in [67, 106, 171, 266]). The group sizes are (125, 18, 20, 34). Assume that the probabilities of the corresponding multinomial distribution are $$\left(\frac{2+\theta}{4},\ \frac{1-\theta}{4},\ \frac{1-\theta}{4},\ \frac{\theta}{4}\right).$$ Estimate the posterior distribution of $\theta$ given the observed sample, using one of the methods in this chapter.

Answer

set.seed(1)

w <- .25   #width of the uniform support set 
m <- 5000  #length of the chain 
burn <- 1000  #burn-in time 
total <- 197 
x <- numeric(m)  #the chain
group <- c(125,18,20,34)  # observed group sizes

prob <- function(y, group) { 
  # computes (without the constant) the target density 
  if (y < 0 || y >= 1) 
    return (0) 
  return(((2+y)/4)^group[1] * 
           ((1-y)/4)^group[2] * ((1-y)/4)^group[3] *
           ((y/4)^group[4]))
} 

u <- runif(m)          #for accept/reject step 
v <- runif(m, -w, w)   #proposal distribution
x[1] <- 0.5
for (i in 2:m) {
  y <- x[i-1] + v[i]
  if (u[i] <= prob(y, group) / prob(x[i-1], group)) 
    x[i] <- y else 
      x[i] <- x[i-1]
}
print(group) 
print(round(group/total, 3))
set.seed(1)
 xb <- x[(burn+1):m] 
 print(mean(xb))    

The sample mean of the generated chain is 0.6196245.

Homework 7

1. For exercise 9.6, use the Gelman-Rubin method to monitor convergence of the chain, and run the chain until the chain has converged approximately to the target distribution according to $\hat{R}<1.2$.

Answer:

The solution for exercise 9.6:

lden=function(theta)125*log(2+theta)+38*log(1-theta)+34*log(theta) # log posterior (up to a constant); the two groups with probability (1-theta)/4 are merged (18+20=38)
MCMC=function(x,burn_in=0,N=burn_in+5000,print_acc=F){
  y=1:N#for MCMC generation
  ld=lden # work with the log posterior for numerical stability
  acc=0
  for(i in 1:N){
    p=runif(1)  # independence-sampler proposal from Uniform(0,1)
    y[i]=x=if(runif(1)<exp(ld(p)-ld(x))){acc=acc+1;p}else x  # accept the proposal or keep the current value
  }
  if(print_acc)print(acc/N)
  y[(burn_in+1):N]
}
plot(MCMC(0.5,print_acc=T))#accept rate~0.15

Gelman-Rubin method of monitoring convergence

Gelman.Rubin <- function(psi) {
  # psi[i,j] is the statistic psi(X[i,1:j])
  # for chain in i-th row of X
  psi <- as.matrix(psi)
  n <- ncol(psi)
  k <- nrow(psi)

  psi.means <- rowMeans(psi)     #row means
  B <- n * var(psi.means)        #between variance est.
  psi.w <- apply(psi, 1, "var")  #within variances
  W <- mean(psi.w)               #within est.
  v.hat <- W*(n-1)/n + B/n + (B/(n*k))  #upper variance est.
  r.hat <- v.hat / W             #G-R statistic
  return(r.hat)
}

So we monitor four chains started from different initial values.

M1=cumsum((MC1=MCMC(0.2)))/1:5000#mean_cumsum
M2=cumsum((MC2=MCMC(0.4)))/1:5000#mean_cumsum
M3=cumsum((MC3=MCMC(0.6)))/1:5000#mean_cumsum
M4=cumsum((MC4=MCMC(0.8)))/1:5000#mean_cumsum
psi=rbind(M1,M2,M3,M4)
plot((R=sapply(1:100,function(i)Gelman.Rubin(psi[,1:i]))),main="R value of Gelman-Rubin method",type='l',ylab = "")

res=(1:100)[R[2:100]>1.2][sum(R[2:100]>1.2)]+1 # last index at which R-hat exceeds 1.2; the chains are treated as converged after this point
c(mean(c(MC1[res:5000],MC2[res:5000],MC3[res:5000],MC4[res:5000])),
  var(c(MC1[res:5000],MC2[res:5000],MC3[res:5000],MC4[res:5000])))

2. Find the intersection points $A(k)$ in $(0,\sqrt{k})$ of the curves

$$S_{k-1}(a)=P\left(t(k-1)>\sqrt{\frac{a^2(k-1)}{k-a^2}} \right) $$ and $$S_{k}(a)=P\left(t(k)>\sqrt{\frac{a^2k}{k+1-a^2}} \right) $$ for $k=4:25,100, 500, 1000$, where $t(k)$ is a Student $t$ random variable with $k$ degrees of freedom.

Answer:

k=4:25

n=25
A=rep(0,length(c(4:n)))
for(k in 4:n){
f=function(x){
pt(sqrt(x^2*k/(k+1-x^2)),k)-pt(sqrt(x^2*(k-1)/(k-x^2)),k-1)} 
A[k-3]=uniroot(f,c(0.00001,sqrt(k)-0.00001))$root #solve the intersection points
}
plot(A,type='l')

k=4:100

n=100
A=rep(0,length(c(4:n)))
for(k in 4:n){
f=function(x){
pt(sqrt(x^2*k/(k+1-x^2)),k)-pt(sqrt(x^2*(k-1)/(k-x^2)),k-1)}
A[k-3]=uniroot(f,c(0.00001,sqrt(k)-0.00001))$root #solve the intersection points
}
plot(A,type='l')

k=4:500

n=500
A=rep(0,length(c(4:n)))
for(k in 4:n){
f=function(x){
pt(sqrt(x^2*k/(k+1-x^2)),k)-pt(sqrt(x^2*(k-1)/(k-x^2)),k-1)}
A[k-3]=uniroot(f,c(0.00001,sqrt(k)-0.00001))$root #solve the intersection points
}
plot(A,type='l')

k=4:1000

n=1000
A=rep(0,length(c(4:n)))
for(k in 4:n){
f=function(x){
pt(sqrt(x^2*k/(k+1-x^2)),k)-pt(sqrt(x^2*(k-1)/(k-x^2)),k-1)}
A[k-3]=uniroot(f,c(0.00001,sqrt(k)-0.00001))$root #solve the intersection points
}
plot(A,type='l') 

Homework 8

The cdf of the Cauchy($\theta,\eta$) distribution is $$F(x)=\int_{-\infty}^x \frac{1}{\theta\pi(1+[(t-\eta)/\theta]^2)}dt.$$ We use numerical integration of the pdf to evaluate this cdf.

We let $(\theta,\eta)=(1,0)$ and $(\theta,\eta)=(1,1)$, take $x\in(0,10)$, and compute the cdf of the Cauchy distribution, comparing with pcauchy.

## define the pdf of cauchy distribution
pdf_cauchy=function(t,theta,eta){
  1/(theta*pi*(1+((t-eta)/theta)^2))
}
x=seq(0,10,0.1)
num_int=rep(0,length(x))
p=rep(0,length(x))
# let theta=1,eta=0.
for(i in 1:length(x)){
num_int[i]=integrate(pdf_cauchy, -Inf, x[i],theta=1,eta=0)$value
p[i]=pcauchy(x[i],location = 0, scale = 1) # by pcauchy
}
plot(x,num_int) # plot the num_int
lines(x,p,col='green') # plot the pcauchy

# let theta=1,eta=1.
for(i in 1:length(x)){
num_int[i]=integrate(pdf_cauchy, -Inf, x[i],theta=1,eta=1)$value
p[i]=pcauchy(x[i],location = 1, scale = 1) # by pcauchy
}
plot(x,num_int) # plot the num_int
lines(x,p,col='green') # plot the pcauchy

According to the graphs, the numerical integration results and the values returned by pcauchy are essentially identical.

Homework 9

Question 1.

Use both for loops and lapply() to fit linear models to the mtcars data using the formulas stored in this list:

formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)

Answer

## use loops to fit linear model.
formulas = list(
mpg ~ disp,
mpg ~ I(1 / disp), mpg~disp+wt, mpg~I(1/disp)+wt
)
out1=list()
for(i in 1:4){
  out1[[i]]=lm(formulas[[i]],data=mtcars)
}
out1

### use lapply() function to fit linear model
out2=lapply(formulas,lm,data=mtcars)
out2

Question 2.

Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function?

Answer

bootstraps <- lapply(1:10, function(i) {
         rows <- sample(1:nrow(mtcars), rep = TRUE)
         mtcars[rows, ]
})
### loops
out3=list()
for(i in 1:10){
  out3[[i]]=lm(mpg ~ disp,data=bootstraps[[i]])
}
out3

### lapply
newdata=list()
for(i in 1:10){

  newdata[[i]]=bootstraps[[i]][,c(1,3)]
}
out4=lapply(newdata,lm)
out4
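
Another way to avoid the anonymous function, sketched here (out5 is just a new name for illustration), is to pass the formula by name so that lapply supplies each bootstrap data frame as lm's data argument:

out5=lapply(bootstraps,lm,formula=mpg~disp)  ## each list element is matched to the 'data' argument
out5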

Question 3

For each model in the previous two exercises, extract $R^2$ using the function below.

Answer

rsq <- function(mod) summary(mod)$r.squared
## compute R2 for out1 
R2_1=lapply(out1,rsq)
R2_1
## compute R2 for out2
R2_2=lapply(out2,rsq)
R2_2
## compute R2 for out3
R2_3=lapply(out3,rsq)
R2_3
## compute R2 for out4
R2_4=lapply(out4,rsq)
R2_4

Question 4

The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.

Answer

trials <- replicate(
         100,
         t.test(rpois(10, 10), rpois(7, 10)),
         simplify = FALSE
       )
## define a function to extract p.value
p_fun=function(f)(f$p.value)
sapply(trials,p_fun)
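
The question asks specifically for an anonymous function; an equivalent call, sketched here, is:

sapply(trials,function(trial) trial$p.value)  ## anonymous-function version
sapply(trials,'[[','p.value')                 ## alternative: extract the element directly with [[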

Homework 10

Question 1

Make a faster version of chisq.test( ) that only computes the chi-square test statistic when the input is two numeric vectors with no missing values.

Answer

In R, the source of chisq.test() is shown below; let chisq.test1 denote this copy of chisq.test(). We will then simplify it.

chisq.test1=function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), 
    rescale.p = FALSE, simulate.p.value = FALSE, B = 2000) 
{
    DNAME <- deparse(substitute(x))
    if (is.data.frame(x)) 
        x <- as.matrix(x)
    if (is.matrix(x)) {
        if (min(dim(x)) == 1L) 
            x <- as.vector(x)
    }
    if (!is.matrix(x) && !is.null(y)) {
        if (length(x) != length(y)) 
            stop("'x' and 'y' must have the same length")
        DNAME2 <- deparse(substitute(y))
        xname <- if (length(DNAME) > 1L || nchar(DNAME, "w") > 
            30) 
            ""
        else DNAME
        yname <- if (length(DNAME2) > 1L || nchar(DNAME2, "w") > 
            30) 
            ""
        else DNAME2
        OK <- complete.cases(x, y)
        x <- factor(x[OK])
        y <- factor(y[OK])
        if ((nlevels(x) < 2L) || (nlevels(y) < 2L)) 
            stop("'x' and 'y' must have at least 2 levels")
        x <- table(x, y)
        names(dimnames(x)) <- c(xname, yname)
        DNAME <- paste(paste(DNAME, collapse = "\n"), "and", 
            paste(DNAME2, collapse = "\n"))
    }
    if (any(x < 0) || anyNA(x)) 
        stop("all entries of 'x' must be nonnegative and finite")
    if ((n <- sum(x)) == 0) 
        stop("at least one entry of 'x' must be positive")
    if (simulate.p.value) {
        setMETH <- function() METHOD <<- paste(METHOD, "with simulated p-value\n\t (based on", 
            B, "replicates)")
        almost.1 <- 1 - 64 * .Machine$double.eps
    }
    if (is.matrix(x)) {
        METHOD <- "Pearson's Chi-squared test"
        nr <- as.integer(nrow(x))
        nc <- as.integer(ncol(x))
        if (is.na(nr) || is.na(nc) || is.na(nr * nc)) 
            stop("invalid nrow(x) or ncol(x)", domain = NA)
        sr <- rowSums(x)
        sc <- colSums(x)
        E <- outer(sr, sc, "*")/n
        v <- function(r, c, n) c * r * (n - r) * (n - c)/n^3
        V <- outer(sr, sc, v, n)
        dimnames(E) <- dimnames(x)
        if (simulate.p.value && all(sr > 0) && all(sc > 0)) {
            setMETH()
            tmp <- .Call(C_chisq_sim, sr, sc, B, E)
            STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))
            PARAMETER <- NA
            PVAL <- (1 + sum(tmp >= almost.1 * STATISTIC))/(B + 
                1)
        }
        else {
            if (simulate.p.value) 
                warning("cannot compute simulated p-value with zero marginals")
            if (correct && nrow(x) == 2L && ncol(x) == 2L) {
                YATES <- min(0.5, abs(x - E))
                if (YATES > 0) 
                  METHOD <- paste(METHOD, "with Yates' continuity correction")
            }
            else YATES <- 0
            STATISTIC <- sum((abs(x - E) - YATES)^2/E)
            PARAMETER <- (nr - 1L) * (nc - 1L)
            PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
        }
    }
    else {
        if (length(dim(x)) > 2L) 
            stop("invalid 'x'")
        if (length(x) == 1L) 
            stop("'x' must at least have 2 elements")
        if (length(x) != length(p)) 
            stop("'x' and 'p' must have the same number of elements")
        if (any(p < 0)) 
            stop("probabilities must be non-negative.")
        if (abs(sum(p) - 1) > sqrt(.Machine$double.eps)) {
            if (rescale.p) 
                p <- p/sum(p)
            else stop("probabilities must sum to 1.")
        }
        METHOD <- "Chi-squared test for given probabilities"
        E <- n * p
        V <- n * p * (1 - p)
        STATISTIC <- sum((x - E)^2/E)
        names(E) <- names(x)
        if (simulate.p.value) {
            setMETH()
            nx <- length(x)
            sm <- matrix(sample.int(nx, B * n, TRUE, prob = p), 
                nrow = n)
            ss <- apply(sm, 2L, function(x, E, k) {
                sum((table(factor(x, levels = 1L:k)) - E)^2/E)
            }, E = E, k = nx)
            PARAMETER <- NA
            PVAL <- (1 + sum(ss >= almost.1 * STATISTIC))/(B + 
                1)
        }
        else {
            PARAMETER <- length(x) - 1
            PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
        }
    }
    names(STATISTIC) <- "X-squared"
    names(PARAMETER) <- "df"
    if (any(E < 5) && is.finite(PARAMETER)) 
        warning("Chi-squared approximation may be incorrect")
    structure(list(statistic = STATISTIC, parameter = PARAMETER, 
        p.value = PVAL, method = METHOD, data.name = DNAME, observed = x, 
        expected = E, residuals = (x - E)/sqrt(E), stdres = (x - 
            E)/sqrt(V)), class = "htest")
}

Because the input is two numeric vectors with no missing values, we can drop the is.data.frame() and is.matrix() checks, the single-vector and simulation branches, and the p-value computation, and keep only the calculation of the test statistic.

chisq.test2=function (x, y)
{
  if (length(x) != length(y)) 
  stop("'x' and 'y' must have the same length")
  OK <- complete.cases(x, y)
        x <- factor(x[OK])
        y <- factor(y[OK])
        if ((nlevels(x) < 2L) || (nlevels(y) < 2L)) 
            stop("'x' and 'y' must have at least 2 levels")
        x <- table(x, y)
        n <- sum(x)
        p = rep(1/length(x), length(x))
        p <- p/sum(p)


        sr <- rowSums(x)
        sc <- colSums(x)
        E <- outer(sr, sc, "*")/n
        v <- function(r, c, n) c * r * (n - r) * (n - c)/n^3
        V <- outer(sr, sc, v, n)
        dimnames(E) <- dimnames(x)
        STATISTIC <- sum((x - E)^2/E)  # no need to sort the terms
         STATISTIC 

}

options(warn=-1)
x <- c(89,37,30,28,2,14,25)
y <- c(29,47,10,8,14,32,52)

chisq.test1(x,y)$statistic
chisq.test2(x,y)


library(microbenchmark)
microbenchmark(
  chisq.test1(x,y)$statistic,
  chisq.test2(x,y)
)

It can be seen that chisq.test2() is faster than chisq.test1(), roughly by a factor of two.


