Homework 1

Question

Implement at least three examples of different types from the book "R for Beginners".

Answer

Example 1 (P56,5.1)

In this example, we choose the data set "InsectSprays" and carry out a simple analysis of variance (ANOVA) to determine whether the six insecticides have different effects.

First, some summary features of the data can be computed as follows:

count.1<-matrix(InsectSprays$count,nrow=12)
mean.var<-rbind(colMeans(count.1),apply(count.1,2,var))
rownames(mean.var)<-c("mean","var")
colnames(mean.var)<-c("A","B","C","D","E","F")
mean.var

In order to stabilize the variance and make the data better satisfy the normality assumption, we apply a square-root transformation to the counts. The ANOVA results are as follows:

aov.spray<-aov(sqrt(InsectSprays$count)~InsectSprays$spray)
summary(aov.spray)
aov.spray$coefficients

As the p-value is much smaller than 0.05, the differences among the sprays can be considered highly significant.

Example 2 (P52)

This example from the book "R for beginners" mainly uses plotting functions to display features of the iris data. In addition, we show some numerical results for the data.

The three species can be clearly distinguished in the following figure:

library(lattice)
data("iris")
xyplot(Petal.Length~Petal.Width,data=iris,groups=Species,type=c("p","smooth"),span=.75,auto.key=list(x=0.15,y=0.85))

Next, we perform a multiple regression analysis. Since Species is a categorical variable, we define dummy variables for two of the species (setosa and versicolor) to replace it, leaving virginica as the baseline.

iris$isSetosa<-ifelse(iris$Species=="setosa",1,0)
iris$isVersicolor<-ifelse(iris$Species=="versicolor",1,0)
iris_demo=iris[,-5]
m_iris<-lm(Sepal.Width~.,data=iris_demo)
summary(m_iris)

The results above show that every coefficient except the intercept is significant. In particular, the coefficients of isSetosa and isVersicolor are highly significant, which implies clear differences between the species.

Example 3 (P45,4.5)

This example shows how to enrich a scatter plot. We construct several scatter plots of ten pairs of random values, each richer than the one before. Before showing the figures, the ten pairs of random values are generated as follows:

x<-rnorm(10)
y<-rnorm(10)
rbind(x,y)

Figure 1:

plot(x,y)

Figure 2:

plot(x,y,xlab="Ten random values",ylab="Ten other values",xlim=c(-2,2),ylim=c(-2,2),pch=22,col="red",bg="yellow",bty="l",tcl=0.4,main="How to customize a plot with R",las=1,cex=1.5)

Figure 3:

opar <- par()
par(bg="lightyellow", col.axis="blue", mar=c(4, 4, 2.5, 0.25))
plot(x, y, xlab="Ten random values", ylab="Ten other values",
xlim=c(-2, 2), ylim=c(-2, 2), pch=22, col="red", bg="yellow",
bty="l", tcl=-.25, las=1, cex=1.5)
title("How to customize a plot with R (bis)", font.main=3, adj=1)

Figure 4:

opar <- par()
par(bg="lightgray", mar=c(2.5, 1.5, 2.5, 0.25))
plot(x, y, type="n", xlab="", ylab="", xlim=c(-2, 2),
ylim=c(-2, 2), xaxt="n", yaxt="n")
rect(-3, -3, 3, 3, col="cornsilk")
points(x, y, pch=10, col="red", cex=2)
axis(side=1, c(-2, 0, 2), tcl=-0.2, labels=FALSE)
axis(side=2, -1:1, tcl=-0.2, labels=FALSE)
title("How to customize a plot with R (ter)",
font.main=4, adj=1, cex.main=1)
mtext("Ten random values", side=1, line=1, at=1, cex=0.9, font=3)
mtext("Ten other values", line=0.5, at=-1.8, cex=0.9, font=3)
mtext(c(-2, 0, 2), side=1, las=1, at=c(-2, 0, 2), line=0.3,
col="blue", cex=0.9)
mtext(-1:1, side=2, las=1, at=-1:1, line=0.2, col="blue", cex=0.9)

Homework 2

Question

A discrete random variable $X$ has probability mass function

|$x$|0|1|2|3|4|
|-|-|-|-|-|-|
|$p(x)$|0.1|0.2|0.2|0.2|0.3|

Use the inverse transform method to generate a random sample of size 1000 from the distribution of $X$. Construct a relative frequency table and compare the empirical with the theoretical probabilities. Repeat using the R sample function.

Answer
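For a discrete distribution, the inverse transform method draws $U\sim U(0,1)$ and returns the value whose cumulative probability interval contains $U$:
$$
X = x_{(i)} \quad\text{whenever}\quad F(x_{(i-1)}) < U \le F(x_{(i)}),\qquad F(x_{(0)})=0,
$$
which is exactly what findInterval(runif(n),cp)+1 implements in the code below (boundary ties occur with probability zero).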

x<-c(0,1,2,3,4);p<-c(0.1,rep(0.2,3),0.3)
cp<-cumsum(p);n<-1000;r<-numeric(1000)
set.seed(211)
r1<-x[findInterval(runif(n),cp)+1]
ct1<-as.vector(table(r1))
fre1<-rbind(x,ct1/n);rownames(fre1)<-c("x","frequency")
fre1
ct1/sum(ct1)/p
## Inverse transform method

The first output is the relative frequency table. The second is the ratio between the empirical frequencies and the theoretical probabilities.

Next, we obtain the corresponding results using the sample function.

set.seed(212)
r2<-sample(c(0,1,2,3,4),n,replace=T,c(0.1,rep(0.2,3),0.3))
ct2<-as.vector(table(r2))
fre2<-rbind(x,ct2/n);rownames(fre2)<-c("x","frequency")
fre2
ct2/sum(ct2)/p
#R sample function

Question

Write a function to generate a random sample of size n from the $Beta(a, b)$ distribution by the acceptance-rejection method. Generate a random sample of size 1000 from the $Beta(3,2)$ distribution. Graph the histogram of the sample with the theoretical $Beta(3,2)$ density superimposed.

Answer

$Beta(3,2)$ has pdf $f(x)=12x^2(1-x),~0<x<1$. We take the envelope density $g$ to be the $U(0,1)$ density, $g(x)=1$ for $0<x<1$, together with the constant $c=2$.
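As a quick check that $c=2$ is a valid bound for the ratio $f/g$: the maximum of $12x^2(1-x)$ on $(0,1)$ occurs at $x=2/3$, so
$$
\sup_{0<x<1}\frac{f(x)}{g(x)}=12\cdot\frac{4}{9}\cdot\frac{1}{3}=\frac{16}{9}\le 2,
$$
and the acceptance condition becomes $u<f(x)/(cg(x))=6x^2(1-x)$, with theoretical acceptance rate $1/c=0.5$.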

set.seed(221)
n <- 1000;j<-k<-0;y <- numeric(n)
while (k < n) {
u <- runif(1)
j <- j + 1
x <- runif(1) #random variate from g
if (6*x^2*(1-x) > u) {
#we accept x
k <- k + 1
y[k] <- x
}
}
1000/j    #efficiency: n accepted values out of j candidates generated
hist(y,freq=F,main="Histogram of Beta(3,2)",xlab="Beta(3,2)")
curve(12*x^2*(1-x),from=0,to=1,add=TRUE,col="darkblue", lwd=2)

As above figure shows, the generated random sample by the acceptance-rejection method quite matches with the distribution $Beta(3,2)$.

Question

Simulate a continuous Exponential-Gamma mixture. Suppose that the rate parameter $\Lambda$ has a $Gamma(r, \beta)$ distribution and $Y$ has an $Exp(\Lambda)$ distribution. That is, $(Y \mid \Lambda = \lambda) \sim f_Y(y\mid\lambda) = \lambda e^{-\lambda y}$. Generate 1000 random observations from this mixture with $r = 4$ and $\beta = 2$.

Answer

set.seed(231)
n<-1000;r<-4;beta<-2
lambda<-rgamma(n,r,beta)
y<-rexp(n,lambda)        ##1000 random observations
y.cut<-y[y<4]
hist(y.cut,freq=F,breaks = c(0.2*0:10,3,4),main="Histogram of mixture")
curve(2*(1+(1/2)*x)^(-5),0,4,add=T)

The 1000 random observations are stored in the vector $y$. Since there are extremely few points larger than 4, we graph the histogram without them. The histogram of the generated sample is shown above.
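The curve superimposed on the histogram is the marginal density of the mixture, obtained by integrating out the rate:
$$
f(y)=\int_0^\infty \lambda e^{-\lambda y}\,\frac{\beta^r}{\Gamma(r)}\lambda^{r-1}e^{-\beta\lambda}\,d\lambda=\frac{r\beta^r}{(\beta+y)^{r+1}},\qquad y>0,
$$
which for $r=4$ and $\beta=2$ equals $64/(2+y)^5=2(1+y/2)^{-5}$, matching the curve() call above.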

Homework 3

Question

Write a function to compute a Monte Carlo estimate of the $Beta(3, 3)$ cdf, and use the function to estimate $F(x)$ for $x = 0.1, 0.2,\dots, 0.9$. Compare the estimates with the values returned by the pbeta function in R.

Answer

$Beta(3,3)$ has pdf $f(x)=30x^2(1-x)^2,~~0<x<1$. Then the cdf of $Beta(3,3)$ can be represented as $F(x)=\int_0^x 30y^2(1-y)^2dy$. By the principle of Monte Carlo estimate, $F(x)=\mathbb{E}[30xY^2(1-Y)^2]$ where $Y\sim U(0,x)$.

The function that computes the $Beta(3,3)$ cdf is as follows:

Fx<-function(n,x) {y<-runif(n,0,x);mean(30*x*y^2*(1-y)^2)}
pbeta33<-function(x)  ##Beta(3,3) cdf
{n<-1000              ##size 1000
m<-10                 ##10 groups 
Fx.vec<-sapply(1:m,function(o) Fx(n,x))
mean(Fx.vec)
}

Next, we show the estimated values of $F(x)$ for $x=0.1,0.2,\dots,0.9$.

beta33.hat<-numeric(9)
set.seed(311)
for(i in 1:9)  beta33.hat[i]<-pbeta33(i/10)
cbind(1:9/10, round(beta33.hat,5))

Then we compute the theoretical values with the pbeta function.

cbind(1:9/10,pbeta(1:9/10,3,3))

At last, we combine the estimated values and the theoretical values to compare them.

comp<-data.frame(cbind(1:9/10, round(beta33.hat,5),pbeta(1:9/10,3,3)))
colnames(comp)<-c("x","beta33.hat","beta33")
comp

From the table above, the estimated values are very close to the theoretical values for every value of $x$. Hence, the Monte Carlo estimate of the $Beta(3,3)$ cdf is quite accurate.

Question

The Rayleigh density [156, (18.76)] is $f(x)=\frac{x}{\sigma^2}e^{-x^2/(2\sigma^2)},~~x\ge0, \sigma>0$. Implement a function to generate samples from a $Rayleigh(\sigma)$ distribution, using antithetic variables. What is the percent reduction in variance of $\frac{X+X'}{2}$ compared with $\frac{X_1+X_2}{2}$ for independent $X_1, X_2$?

Answer

In order to obtain the Rayleigh cdf, we compute $F(x)=\int_0^x \frac{y}{\sigma^2}e^{-y^2/(2\sigma^2)}dy$ by Monte Carlo, using antithetic variables. First, the integral is equal to $\mathbb{E}[x\frac{Y}{\sigma^2}e^{-\frac{Y^2}{2\sigma^2}}]$ where $Y\sim U(0,x)$. Equivalently, it can be represented as $\mathbb{E}[x^2\frac{U}{\sigma^2}e^{-\frac{x^2U^2}{2\sigma^2}}]$ where $U\sim U(0,1)$. Define $g(U)=x^2\frac{U}{\sigma^2}e^{-\frac{x^2U^2}{2\sigma^2}}$; then $g(U)$ and $g(1-U)$ are a pair of antithetic variables.

The function to compute $F(x)$ is as follows (we take $\sigma=1$ by default):

MC.Ray <- function(x,sigma=1, R = 10000, antithetic = TRUE) {
cdf <- numeric(length(x))
u <- runif(R/2)
if (!antithetic) v <- runif(R/2) else
v <- 1 - u
u <- c(u, v) 
for (i in 1:length(x)) {
g <- x[i]^2*u*exp(-x[i]^2*u^2/(2*sigma^2))/sigma^2
cdf[i] <- mean(g)
}
cdf
}

For $x=1$, i.e. estimating $F(1)$, we compare the variance of the simple Monte Carlo estimator with that of the antithetic-variable estimator.

m <- 1000
MC1 <- MC2 <- numeric(m)
x <- 1
set.seed(321)
for (i in 1:m) {
MC1[i] <- MC.Ray(x, R = 1000, anti = FALSE)
MC2[i] <- MC.Ray(x, R = 1000)
}
print(sd(MC1))
print(sd(MC2))
print((var(MC1)-var(MC2))/var(MC1))

The antithetic variable approach achieved approximately 90% reduction in variance.

Next, we generate samples from $Rayleigh(\sigma=1)$ by the inverse transform method. The cdf of $Rayleigh(\sigma)$ has the analytical form $F(x)=1-e^{-x^2/(2\sigma^2)}$, so the inverse transform is $F^{-1}(u)=\left(2\sigma^2\log\frac{1}{1-u}\right)^{1/2}$. Then we can construct the antithetic variables $X=F^{-1}(U)$ and $X'=F^{-1}(1-U)$, where $U\sim U(0,1)$.

The function to generate the samples is as follows:

IT.Ray<-function(n,sigma=1,antithetic=TRUE){
u <- runif(n/2)
if (!antithetic) v <- runif(n/2) else
v <- 1 - u
u <- c(u, v) 
x<-sqrt(-2*sigma^2*log(1-u))
x
}
set.seed(321)
hist(IT.Ray(10000),ylim=c(0,0.6),breaks =seq(0,5,0.5),freq=F,main="Histogram of Rayleigh(1)",xlab="x")
curve(x*exp(-x^2/2),0,4,add=T)

We plot the histogram of 10000 generated random numbers and add the density curve of $Rayleigh(1)$ to the same figure. As the plot shows, the histogram is very close to the theoretical density.

Finally, suppose that $X, X_1, X_2 \sim F^{-1}(U)$ and $X' = F^{-1}(1-U)$, where $X_1,X_2$ are independent. Clearly, $X$ and $X'$ are a pair of antithetic variables. Then we compute the percent reduction in variance of $\frac{X+X'}2$ compared with $\frac{X_1+X_2}2$.

m <- 1000
anti.F <- anti.T <- numeric(m)
set.seed(321)
for (i in 1:m) {
anti.F[i] <- mean(IT.Ray(10000, antithetic = FALSE))
anti.T[i] <- mean(IT.Ray(10000))
}
print(sd(anti.F))
print(sd(anti.T))
print((var(anti.F)-var(anti.T))/var(anti.F))

The antithetic variable approach achieved approximately 95% reduction in variance.

Question

Find two importance functions $f_1$ and $f_2$ that are supported on $(1,\infty)$ and are "close" to $$ g(x)=\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2},~~x>1. $$ Which of your two importance functions should produce the smaller variance in estimating $$ \int_1^{\infty}\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx $$ by importance sampling? Explain.

Answer

We consider the following two importance functions: $$ f_1(x)=e^{1-x},~~x>1;~~f_2(x)=\frac12xe^{\frac{1-x^2}{4}},~~x>1. $$ The curves of $g(x)/f_1(x)$ and $g(x)/f_2(x)$ are plotted below.
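As a quick check that both are valid densities on $(1,\infty)$ (substituting $t=x^2$ for $f_2$):
$$
\int_1^\infty e^{1-x}\,dx=1,\qquad \int_1^\infty \tfrac12 x e^{(1-x^2)/4}\,dx=\int_1^\infty \tfrac14 e^{(1-t)/4}\,dt=1.
$$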

#figure $g(x)/f(x)$
x <- c(seq(1, 5, .1))
w <- 2
g<-x^2*exp(-x^2/2)/sqrt(2*pi)
f1<-exp(1-x)
f2<-x*exp((1-x^2)/4)/2
plot(x, g/f1, type = "l", main = "", ylab = "",
ylim = c(0,0.8), lwd = w, lty = 2)
lines(x, g/f2, lty = 3, lwd = w)
legend("topright", legend = c(1:2),
lty = 2:6, lwd = w, inset = 0.02)

The cdfs of $f_1$ and $f_2$ are easily calculated: $$ F_1(x)=1-e^{1-x},~~x>1;~~F_2(x)=1-e^{\frac{1-x^2}{4}},~~x>1. $$ Then we can generate random samples by the inverse transform method, using $$ F_1^{-1}(u)=1-\log(1-u);~~F_2^{-1}(u)=(1-4\log(1-u))^{1/2}. $$ Next, we compute the integral with the two importance functions respectively.

m <- 10000     ##size 10000
theta.hat <- se <- numeric(2)
g <- function(x) {
x^2*exp(-x^2/2)/sqrt(2*pi)*(x>1)
}
u <- runif(m) #f1, inverse transform method  
x<-1-log(1-u)
fg <- g(x)/exp(1-x)
theta.hat[1] <- mean(fg)
se[1] <- sd(fg)
u <- runif(m) #f2, inverse transform method  
x<-(1-4*log(1-u))^0.5
fg <- 2*g(x) /(x*exp((1-x^2)/4))
theta.hat[2] <- mean(fg)
se[2] <- sd(fg)
rbind(theta.hat,se)

As shown above, $f_2$ produces the smaller variance. In fact, the two variances are quite close: recalling the figure of $g(x)/f_1$ and $g(x)/f_2$, the two curves are similar, but the curve of $g(x)/f_2$ is slightly flatter, so the variance under $f_2$ is a little smaller than under $f_1$.

Question

Obtain a Monte Carlo estimate of $$ \int_{1}^{\infty}\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx $$ by importance sampling.

Answer

From the answer in exercise 5.13, we use the importance function $f_2$ to calculate the value of this integral.

The estimated value and its standard deviation are as follows:

m <- 10000     ##size 10000
g <- function(x) {
x^2*exp(-x^2/2)/sqrt(2*pi)*(x>1)
}
u <- runif(m) #f2, inverse transform method  
x<-(1-4*log(1-u))^0.5
fg <- 2*g(x) /(x*exp((1-x^2)/4))
theta.hat<- mean(fg)
se<- sd(fg)
c(theta.hat,se)

Homework 4

Question

Let $X$ be a non-negative random variable with $\mu= E[X]<\infty$. For a random sample $x_1,\dots, x_n$ from the distribution of $X$, the Gini ratio is defined by $$ G=\frac{1}{2n^2\mu}\sum_{j=1}^{n}\sum_{i=1}^n|x_i-x_j|. $$ The Gini ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that $G$ can be written in terms of the order statistics $x_{(i)}$ as $$ G=\frac{1}{n^2\mu}\sum_{i=1}^{n}(2i-n-1)x_{(i)}. $$ If the mean is unknown, let $\hat{G}$ be the statistic $G$ with $\mu$ replaced by $\bar{x}$. Estimate by simulation the mean, median and deciles of $\hat{G}$ if $X$ is standard lognormal. Repeat the procedure for the uniform distribution and $Bernoulli(0.1)$. Also construct density histograms of the replicates in each case.

Answer

1. Standard lognormal. If $X$ is standard lognormal, then $\log(X)\sim N(0,1)$, and the pdf of $X$ is $f(x)=\frac{1}{\sqrt{2\pi}x}e^{-\frac{(\log{x})^2}{2}}$. Actually, we can compute the theoretical expectation of $X$: its value is $e^{1/2}$.
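The value $e^{1/2}$ follows from the moment generating function of $Z=\log X\sim N(0,1)$ evaluated at 1:
$$
E[X]=E[e^{Z}]=\int_{-\infty}^{\infty}e^{z}\,\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz=e^{1/2}.
$$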

n<-1000              #generate 1000 random numbers
m<-100               #repeat 100 times 
Glnorm<-numeric(m)
v<-2*(1:n)-n-1
set.seed(411)
for(i in 1:m){
  x<-sort(rlnorm(n))
  Glnorm[i]<-(1/(n^2*exp(0.5)))*as.numeric(v%*%x)
}
lnormmean<-mean(Glnorm)
lnormmed<-quantile(Glnorm,0.5)
lnormdec<-quantile(Glnorm,probs=seq(0,1,0.1))
print(c(lnormmean,lnormmed,lnormdec))

The mean, median and deciles are shown above. The mean is 0.523, which is very close to the median.

hist(Glnorm,freq=F,xlim=c(0.45,0.65),main="Gini ratio of standard lognormal")

2. Uniform distribution. If $X\sim \mathbb{U}(0,1)$, then $E(X)=0.5$.

Gunif<-numeric(m)
set.seed(411)
for(i in 1:m){
  x<-sort(runif(n))
  Gunif[i]<-(1/(n^2*0.5))*as.numeric(v%*%x)
}
unifmean<-mean(Gunif)
unifmed<-quantile(Gunif,0.5)
unifdec<-quantile(Gunif,probs=seq(0,1,0.1))
print(c(unifmean,unifmed,unifdec))

The mean, median and deciles are shown above. The mean is 0.333, which is very close to the median.

hist(Gunif,freq=F,xlim=c(0.320,0.345),main="Gini ratio of uniform distribution")
3. Bernoulli(0.1). Here we in fact draw $X\sim Binomial(100,0.1)$, so that $E(X)=10$.

Gbinom<-numeric(m)
set.seed(411)
for(i in 1:m){
  x<-sort(rbinom(n,100,0.1))
  Gbinom[i]<-(1/(n^2*10))*as.numeric(v%*%x)   #E(X)=10
}
binommean<-mean(Gbinom)
binommed<-quantile(Gbinom,0.5)
binomdec<-quantile(Gbinom,probs=seq(0,1,0.1))
print(c(binommean,binommed,binomdec))

The mean, median and deciles are shown above. The mean is about 0.169, which is again very close to the median.

hist(Gbinom,freq=F,xlim=c(0.155,0.18),main="Gini ratio of Binomial(100,0.1)")

Question

Construct an approximate 95% confidence interval for the Gini ratio $\gamma = E[G]$ if $X$ is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.

Answer

If $X\sim lognormal(\mu,\sigma^2)$, then $\log(X)\sim N(\mu,\sigma^2)$. Assume that we have a random sample $X_1,X_2,\dots,X_n$ from the population of $X$ and write $Y_i=\log(X_i)$ for $1\le i\le n$. Then an approximate 95% confidence interval for $\sigma$ is $$ (\sqrt{\frac{(n-1)S^2}{\chi_{n-1}^2(0.025)}},\sqrt{\frac{(n-1)S^2}{\chi_{n-1}^2(0.975)}}), $$ where $S^2=\frac{1}{n-1}\sum_{i=1}^{n}(Y_i-\bar{Y})^2$.

Note that $E(G)=2\Phi(\frac{\sigma}{\sqrt{2}})-1$, where $\Phi$ is the cdf of the standard normal distribution (this expression for the lognormal Gini ratio can be found on Wikipedia). Therefore we can construct an approximate 95% confidence interval for $\gamma=E(G)$ as follows: $$ (2\Phi(\frac{\sqrt{\frac{(n-1)S^2}{\chi_{n-1}^2(0.025)}}}{\sqrt{2}})-1,2\Phi(\frac{\sqrt{\frac{(n-1)S^2}{\chi_{n-1}^2(0.975)}}}{\sqrt{2}})-1). $$

We assume that $X$ follows the standard lognormal distribution.

n<-1000              #generate 1000 random numbers
m<-100               #generate 100 gini ratio every times
l<-100               #repeat 100 times
mu<-0
sigma<-1
ci.sigma<-function(x,n){#function to construct a confidence interval for sigma
  s2<-(1/(n-1))*sum((x-mean(x))^2)
  c((n-1)*s2/qchisq(0.975,n-1),(n-1)*s2/qchisq(0.025,n-1))
}   
ciG<-cbind(numeric(m),numeric((m)))
judge<-numeric(m)          #judge whether true value is in ci or not
power.G<-numeric(l)
G.true<-2*pnorm(sigma/sqrt(2))-1 #the true value of G
set.seed(412)
for(j in 1:l){
for(i in 1:m){
  x<-rlnorm(n)
  cisigma<-ci.sigma(log(x),n)
  ciG[i,]<-2*pnorm(cisigma/sqrt(2))-1
  judge[i]<-I(G.true>ciG[i,1] & G.true<ciG[i,2])
}
power.G[j]<-mean(judge)
}
mean(power.G)

The empirical coverage rate is 0.9524, slightly above the nominal level of 0.95.

Question

Tests for association based on the Pearson product moment correlation $\rho$, Spearman's rank correlation coefficient $\rho_s$, or Kendall's coefficient $\tau$, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution $(X, Y)$ such that $X$ and $Y$ are dependent) such that at least one of the nonparametric tests have better empirical power than the correlation test against this alternative.

Answer

We assume $(X,Y)\sim N(\mu,\Sigma)$, where $\mu=(0,0)^T$ and $$ \Sigma=\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}. $$

Spearman's rank correlation:

library(MASS)
n <- 20
m <- 1000
rho.s <- c(seq(-0.8,-0.1,0.1),seq(0.1,0.8,0.1)) #alternatives
M <- length(rho.s)
power.s <- numeric(M)
set.seed(431)
for (i in 1:M) {
rho <-rho.s[i]
pvalues.s <- replicate(m, expr = {
#simulate under alternative mu1
x <- mvrnorm(n, c(0,0), matrix(c(1,rho,rho,1),2,2))
cortest.s <- cor.test(x[,1],x[,2],
alternative = "two.sided",method="spearman")
cortest.s$p.value } )
power.s[i] <- mean(pvalues.s<= .05)
}
power.s

Kendall's coefficient:

n <- 20
m <- 1000
rho.k <- c(seq(-0.8,-0.1,0.1),seq(0.1,0.8,0.1)) #alternatives
M <- length(rho.k)
power.k <- numeric(M)
set.seed(431)
for (i in 1:M) {
rho <-rho.k[i]
pvalues.k <- replicate(m, expr = {
#simulate under alternative mu1
x <- mvrnorm(n, c(0,0), matrix(c(1,rho,rho,1),2,2))
cortest.k <- cor.test(x[,1],x[,2],
alternative = "two.sided",method="kendall")
cortest.k$p.value } )
power.k[i] <- mean(pvalues.k<= .05)
}
power.k

Pearson's correlation:

n <- 20
m <- 1000
rho.p <- c(seq(-0.8,-0.1,0.1),seq(0.1,0.8,0.1)) #alternatives
M <- length(rho.p)
power.p <- numeric(M)
set.seed(431)
for (i in 1:M) {
rho <-rho.p[i]
pvalues.p <- replicate(m, expr = {
#simulate under alternative mu1
x <- mvrnorm(n, c(0,0), matrix(c(1,rho,rho,1),2,2))
cortest.p <- cor.test(x[,1],x[,2],
alternative = "two.sided",method="pearson")
cortest.p$p.value } )
power.p[i] <- mean(pvalues.p<= .05)
}
power.p
power.rho<-rbind(power.s,power.k,power.p)
colnames(power.rho)<-c(seq(-0.8,-0.1,0.1),seq(0.1,0.8,0.1))
power.rho

As shown above, the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal.

Pearson's correlation measures the linear relationship between two variables and relies on a normality assumption. When the bivariate distribution of $(X,Y)$ is not bivariate normal, the Pearson correlation test can be much less powerful.

Now, we assume that $X\sim t_{2}$, $Y\sim U(-1,0)$ when $X<0$ and $Y\sim U(0,1)$ when $X\ge0$. Clearly, $X,Y$ are dependent.

n <- 20
m <- 1000
set.seed(431)
pvalues.s1 <- replicate(m, expr = {
#simulate under alternative mu1
x <- rt(n,2)
y<-numeric(n)
for(i in 1:n){
  if(x[i]<0) y[i]<-runif(1,-1,0)
  else       y[i]<-runif(1,0,1)
}
cortest.s1 <- cor.test(x,y,
alternative = "two.sided",method="spearman")
cortest.s1$p.value } )
power.s1 <- mean(pvalues.s1<= .05)
power.s1
n <- 20
m <- 1000
set.seed(431)
pvalues.p1 <- replicate(m, expr = {
#simulate under alternative mu1
x <- rt(n,2)
y<-numeric(n)
for(i in 1:n){
  if(x[i]<0) y[i]<-runif(1,-1,0)
  else       y[i]<-runif(1,0,1)
}
cortest.p1 <- cor.test(x,y,
alternative = "two.sided",method="pearson")
cortest.p1$p.value } )
power.p1 <- mean(pvalues.p1<= .05)
power.p1

In this case, the empirical power of the nonparametric test based on Spearman's rank correlation coefficient $\rho_s$ is 0.987, which is larger than the empirical power 0.849 of the Pearson correlation test.

Homework 5

Question

Compute a jackknife estimate of the bias and the standard error of the correlation statistic in Example 7.2.

Answer

#jackknife
library(bootstrap) #for the law data
print(cor(law$LSAT,law$GPA)) 
print(cor(law82$LSAT,law82$GPA))

The sample correlation is R = 0.7763745, and the correlation for the universe of 82 law schools is R = 0.7599979. We use the jackknife to estimate the bias and standard error of the correlation statistic computed from the sample of scores in law.
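The jackknife estimates used in the code below are
$$
\widehat{bias}_{jack}=(n-1)\big(\bar{R}_{(\cdot)}-\hat{R}\big),\qquad
\widehat{se}_{jack}=\sqrt{\frac{n-1}{n}\sum_{i=1}^{n}\big(R_{(i)}-\bar{R}_{(\cdot)}\big)^2},
$$
where $R_{(i)}$ is the correlation computed with the $i$-th school deleted and $\bar{R}_{(\cdot)}$ is the mean of the leave-one-out replicates.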

#estimate of bias by jackknife
n<-nrow(law)
R<-numeric(nrow(law))
R.hat<-cor(law$LSAT,law$GPA)
for(i in 1:n)  R[i]<-cor(law$LSAT[-i],law$GPA[-i])

bias.j<-(n-1)*(mean(R)-R.hat)
print(bias.j)

The estimate of bias is -0.006473623.

#estimate of standard error by jackknife
se.j<-sqrt((n-1)*mean((R-mean(R))^2))
se.j

The estimate of standard error is 0.1425186.

Question

Refer to Exercise 7.4. Compute 95% bootstrap confidence intervals for the mean time between failures $1/\lambda$ by the standard normal, basic, percentile, and BCa methods. Compare the intervals and explain why they may differ.

Answer

library(boot)
set.seed(521)
boot.mean <- function(x,i) mean(x[i])  #use MLE estimate 
boot.obj <- boot(aircondit[,1], statistic=boot.mean, R = 2000)
ci<- boot.ci(boot.obj,type=c("norm","basic","perc","bca"))
ci
mean(aircondit[,1])

The four confidence intervals for the mean time between failures are shown above. Although all of them cover 108.0833, the sample mean of the original data, they are quite different: the BCa interval is the largest, while the basic interval is the smallest. Comparing the formulas of the BCa and percentile intervals and the relevant quantiles, it is straightforward to see that both the upper and lower bounds of the BCa interval exceed those of the percentile interval. The normal interval also differs somewhat from the percentile interval: the histogram of the bootstrap replicates of the mean (shown below) departs a little from a normal distribution, which explains the discrepancy. When the bootstrap distribution is close to normal, the normal and percentile intervals are almost the same.

set.seed(521)
mu<-mean(boot.obj$t)
sigma<-sd(boot.obj$t)
figure1<-boot.obj$t
f<-function(x) (1/(sqrt(2*pi)*sigma))*exp(-(x-mu)^2/(2*sigma^2))
hist(figure1,freq=F,main="Histogram of mean time",ylim=c(0,0.012))
#histogram of mean time computed with generated samples
curve(f,add=T)

Question

Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.

Answer
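Here $\hat{\theta}$ is, as in Exercise 7.7, the proportion of variance explained by the first principal component of the scor data,
$$
\hat{\theta}=\frac{\hat{\lambda}_1}{\sum_{j=1}^{5}\hat{\lambda}_j},
$$
where $\hat{\lambda}_1>\dots>\hat{\lambda}_5$ are the eigenvalues of the sample covariance matrix.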

library(bootstrap)
Rmtr.hat<-cov(scor)
egvals.hat<-eigen(Rmtr.hat)$values  #eigenvalues of the original data
theta.hat<-egvals.hat[1]/sum(egvals.hat)

J<-nrow(scor)
theta.star<-numeric(J)

for(i in 1:J){
  Rmtr<-cov(scor[-i,])
  egvals<-eigen(Rmtr)$values
  theta.star[i]<-egvals[1]/sum(egvals)
}
bias.j<-(J-1)*(mean(theta.star)-theta.hat)
se.j<-sqrt((J-1)*mean((theta.star-mean(theta.star))^2))
print(c(bias.j,se.j))

The bias is 0.001069139, and the standard error is 0.049552307.

Question

In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.

Answer

library(DAAG)
magnetic<-ironslag$magnetic
chemical<-ironslag$chemical
l<-nrow(ironslag)
n<-floor(l/2)
e1<-e2<-e3<-e4<-numeric(2*n)
set.seed(541)
a<-sample(1:l,replace = F)      #randomly permute the data
magnetic<-magnetic[a]
chemical<-chemical[a]

# leave-two-out cross validation
# fit models with two observations held out at a time
for (i in 1:n) {
y <- magnetic[-c(2*i-1,2*i)]
x <- chemical[-c(2*i-1,2*i)]
J1 <- lm(y ~ x)
yhat1.1 <- J1$coef[1] + J1$coef[2] * chemical[2*i-1]
yhat1.2 <- J1$coef[1] + J1$coef[2] * chemical[2*i]
e1[2*i-1] <- magnetic[2*i-1] - yhat1.1
e1[2*i] <- magnetic[2*i] - yhat1.2

J2 <- lm(y ~ x + I(x^2))
yhat2.1 <- J2$coef[1] + J2$coef[2] * chemical[2*i-1] +
J2$coef[3] * chemical[2*i-1]^2
yhat2.2 <- J2$coef[1] + J2$coef[2] * chemical[2*i] +
J2$coef[3] * chemical[2*i]^2
e2[2*i-1] <- magnetic[2*i-1] - yhat2.1
e2[2*i] <- magnetic[2*i] - yhat2.2

J3 <- lm(log(y) ~ x)
logyhat3.1 <- J3$coef[1] + J3$coef[2] * chemical[2*i-1]
logyhat3.2 <- J3$coef[1] + J3$coef[2] * chemical[2*i]
yhat3.1 <- exp(logyhat3.1)
yhat3.2 <- exp(logyhat3.2)
e3[2*i-1] <- magnetic[2*i-1] - yhat3.1
e3[2*i] <- magnetic[2*i] - yhat3.2

J4 <- lm(log(y) ~ log(x))
logyhat4.1 <- J4$coef[1] + J4$coef[2] * log(chemical[2*i-1])
logyhat4.2 <- J4$coef[1] + J4$coef[2] * log(chemical[2*i])
yhat4.1 <- exp(logyhat4.1)
yhat4.2 <- exp(logyhat4.2)
e4[2*i-1] <- magnetic[2*i-1] - yhat4.1
e4[2*i] <- magnetic[2*i] - yhat4.2
}

c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2))

According to the prediction error criterion, Model 2, the quadratic model, would be the best fit for the data.

J2

The fitted regression equation for Model 2 is $\hat{Y} = 23.93750 - 1.34026X + 0.05345X^2.$

Some of the residual plots for Model 2 are shown below.

plot(J2)

Homework 6

Question

Implement the two-sample Cramer-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.

Answer

We implement the two-sample Cramer-von Mises test for equal distributions as follows:
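The statistic computed below is the two-sample Cramer-von Mises statistic
$$
W_2=\frac{mn}{(m+n)^2}\Big[\sum_{i=1}^{n}\big(F_n(x_i)-G_m(x_i)\big)^2+\sum_{j=1}^{m}\big(F_n(y_j)-G_m(y_j)\big)^2\Big],
$$
where $F_n$ and $G_m$ are the empirical cdfs of the two samples; large values indicate that the two distributions differ.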

cvm.d<-function(x,y){ #compute the Cramer-von Mises statistic
  n<-length(x);m<-length(y)
  Fn<-ecdf(x);Gm<-ecdf(y)
  W2<-((m*n)/(m+n)^2)*
      (sum((Fn(x)-Gm(x))^2)+sum((Fn(y)-Gm(y))^2))
  W2
}

Next, we apply the test to the data in Examples 8.1 and 8.2.

library(boot)
attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
detach(chickwts)

set.seed(611)
R <- 999 #number of replicates
z <- c(x, y) #pooled sample
n<-length(x)
m<-length(y)
N <- 1:(m+n)
reps <- numeric(R) #storage for replicates
W2 <- cvm.d(x, y)
W2
for (i in 1:R) {
#generate indices k for the first sample
k <- sample(N, size = n, replace = FALSE)
x1 <- z[k]
y1 <- z[-k] #complement of x1
reps[i] <- cvm.d(x1, y1)
}
p <- mean(c(W2, reps) >= W2)
p

The observed Cramer-von Mises statistic is $W_2 = 0.1515236$. The approximate achieved significance level (ASL) of 0.394 implies that there is not enough evidence to reject the null hypothesis of equal distributions.

Question

Design experiments for evaluating the performance of the NN, energy, and ball methods in various situations.

(1) Unequal variances and equal expectations

(2) Unequal variances and unequal expectations

(3) Non-normal distributions: t distribution with 1 df (heavy-tailed distribution), bimodal distribution (mixture of two normal distributions)

(4) Unbalanced samples (say, 1 case versus 10 controls)

Note: The parameters should be chosen such that the powers are distinguishable (say, ranging from 0.3 to 0.9).

Answer

library(energy)
library(RANN)
library(Ball)
m <- 100
k<-3
p<-2              #dimension of the multivariate samples
n1 <- n2 <- 50    #size of samples 
R<-999            ##number of replicates
n <- n1+n2
N = c(n1,n2)

Tn <- function(z, ix, sizes,k) {
n1 <- sizes[1]; n2 <- sizes[2]; n <- n1 + n2
if(is.vector(z)) z <- data.frame(z,0);
z <- z[ix, ];
NN <- nn2(data=z, k=k+1) 
block1 <- NN$nn.idx[1:n1,-1]
block2 <- NN$nn.idx[(n1+1):n,-1]
i1 <- sum(block1 < n1 + .5); i2 <- sum(block2 > n1+.5)
(i1 + i2) / (k * n)
}

eqdist.nn <- function(z,sizes,k){         #NN test
boot.obj <- boot(data=z,statistic=Tn,R=R,
sim = "permutation", sizes = sizes,k=k)
ts <- c(boot.obj$t0,boot.obj$t)
p.value <- mean(ts>=ts[1])
list(statistic=ts[1],p.value=p.value)
}
p.values <- matrix(NA,m,3)                

We now estimate the powers of the three methods in the following situations.

(1)Unequal variances and equal expectations

Suppose that $X\sim N((0,0)^T,I_2)$ and $Y\sim N((0,0)^T,\mathrm{diag}(1,0.25))$.

set.seed(621)
for(i in 1:m){
x <- matrix(rnorm(n1*p),ncol=p);
y <- cbind(rnorm(n2),rnorm(n2,sd=0.5));
z <- rbind(x,y)
p.values[i,1] <- eqdist.nn(z,N,k)$p.value
p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value
p.values[i,3] <- bd.test(x=x,y=y,R,seed=i*12345)$p.value
}
alpha <- 0.1                #significance level 
pow <- colMeans(p.values<alpha)
pow

NN, energy and Ball methods have powers 0.62, 0.45 and 0.7 respectively.

(2)Equal variances and unequal expectations

Suppose that $X\sim N((0,0)^T,I_2)$ and $Y\sim N((0,0.6)^T,I_2)$.

set.seed(623)
for(i in 1:m){
x <- matrix(rnorm(n1*p),ncol=p);
y <- cbind(rnorm(n2),rnorm(n2,mean=0.6));
z <- rbind(x,y)
p.values[i,1] <- eqdist.nn(z,N,k)$p.value
p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value
p.values[i,3] <- bd.test(x=x,y=y,R,seed=i*12345)$p.value
}
alpha <- 0.1                #significance level 
pow <- colMeans(p.values<alpha)
pow

NN, energy and Ball methods have powers 0.38, 0.85 and 0.76 respectively.

(3)Non-normal distributions: t distribution with 1 df (heavy-tailed distribution), bimodal distribution (mixture of two normal distributions)

Suppose that $X=(X_1,X_2), Y=(Y_1,Y_2)$ where $X_1\sim t_1, X_2\sim 0.5N(0,1)+0.5N(1,1)$. $Y_1\sim t_1, Y_2\sim 0.5N(1,1)+0.5N(2,1)$.

set.seed(623)
for(i in 1:m){
x <- cbind(rt(n1,1),rnorm(n2, mean = sample(c(0, 1), size = n2, prob = c(0.5, 0.5), replace = T), sd =1 ))

y <- cbind(rt(n2,1),rnorm(n2, mean = sample(c(1, 2), size = n2, prob = c(0.5, 0.5), replace = T), sd =1 ))

z <- rbind(x,y)
p.values[i,1] <- eqdist.nn(z,N,k)$p.value
p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value
p.values[i,3] <- bd.test(x=x,y=y,R,seed=i*12345)$p.value
}
alpha <- 0.1                #significance level 
pow <- colMeans(p.values<alpha)
pow

NN, energy and Ball methods have powers 0.68, 0.64 and 0.71 respectively.

Question

Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 of the chain, and compare the deciles of the generated observations with the deciles of the standard Cauchy distribution (see qcauchy or qt with df=1). Recall that a $Cauchy(\theta, \eta)$ distribution has density function $$ f(x) =\frac{1} {\theta\pi(1 + [(x- \eta)/\theta]^2)}, - \infty < x < \infty, \theta>0. $$ The standard Cauchy has the $Cauchy(\theta= 1, \eta = 0)$ density. (Note that the standard Cauchy density is equal to the Student t density with one degree of freedom.)

Answer

We choose a symmetric random-walk proposal: denote by $g(\cdot\mid X)$ the density of $\mathbb{U}(X-1,X+1)$. The initial value $X_0$ is generated from $\mathbb{U}(-1,1)$.
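Because the random-walk proposal is symmetric, $g(y\mid x_t)=g(x_t\mid y)$, the Metropolis-Hastings acceptance probability reduces to
$$
\alpha(x_t,y)=\min\Big\{1,\ \frac{f(y)\,g(x_t\mid y)}{f(x_t)\,g(y\mid x_t)}\Big\}=\min\Big\{1,\ \frac{f(y)}{f(x_t)}\Big\},
$$
which is why the code below only compares $u$ with the ratio $f(y)/f(x_t)$.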

f<-function(x) 1/(pi*(1+x^2))  #density function of standard  Cauchy distribution 
set.seed(631)
m <- 20000                     #length of the chain
x <- numeric(m)
x[1] <- runif(1,min=-1,max=1)
k <- 0
u <- runif(m)
for (i in 2:m) {
xt <- x[i-1]
y <- xt+runif(1,min=-1,max=1)
num <- f(y) #* dnorm(xt, mean = y,sd=0.5)
den <- f(xt) #* dnorm(y, mean = xt,sd=0.5)
if (u[i] <= num/den) x[i] <- y else {
x[i] <- xt
k <- k+1 #y is rejected
}
}
print(k)           #the number of rejected candidate points

With the proposal distribution we selected, approximately 15% of the candidate points are rejected, which suggests the chain is acceptable, though not especially efficient.

Next, we compare the deciles of the generated observations with the deciles of the standard Cauchy distribution. We also show the QQ plot and histogram of the generated observations.

gob<- quantile(x[-(1:1000)],seq(0.1,0.9,0.1))
stc<-qt(seq(0.1,0.9,0.1),1)
rbind(gob,stc)

b <- 1001 #discard the burnin sample
y <- x[b:m]
a <- seq(0.1,0.9,0.01)
QR <- qcauchy(a) #quantiles of the standard Cauchy distribution
Q <- quantile(y, a)
qqplot(QR, Q, main="",
xlab="Cauchy Quantiles", ylab="Sample Quantiles")
hist(y, breaks="scott", main="", xlab="", freq=FALSE)
lines(QR, f(QR))

The deciles of the generated observations are quite close to those of the standard Cauchy distribution. Moreover, as the QQ plot and histogram above show, the generated sample closely matches the standard Cauchy distribution.

Question

Rao $[220, Sec. 5g]$ presented an example on genetic linkage of 197 animals in four categories (also discussed in $[67, 106, 171, 266])$. The group sizes are $(125, 18, 20, 34)$. Assume that the probabilities of the corresponding multinomial distribution are $$ (\frac12+\frac{\theta}{4},\frac{1-\theta}4,\frac{1-\theta}4,\frac{\theta}4). $$ Estimate the posterior distribution of $\theta$ given the observed sample, using one of the methods in this chapter.

Answer

We use the random walk Metropolis sampler with a uniform proposal distribution to generate the posterior distribution of $\theta$.
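Assuming a flat prior on $(0,1)$ (which is what the prob() function below implies), the target density, up to a normalizing constant, is the multinomial likelihood
$$
\pi(\theta\mid\text{data})\propto\Big(\tfrac12+\tfrac{\theta}{4}\Big)^{125}\Big(\tfrac{1-\theta}{4}\Big)^{18}\Big(\tfrac{1-\theta}{4}\Big)^{20}\Big(\tfrac{\theta}{4}\Big)^{34},\qquad 0<\theta<1.
$$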

w <- 0.25     #width of the uniform support set
m <- 10000    #length of the chain
burn <- 1000  #burn-in time
N <- 197
x <- numeric(m) #the chain

set.seed(641)
grpsize<-c(125,20,18,34)   #group sizes 

prob <- function(y, grpsize) {
# computes (without the constant) the target density
if (y < 0 || y >= 1)
return (0)
return((1/2+y/4)^grpsize[1] *
((1-y)/4)^grpsize[2] * ((1-y)/4)^grpsize[3] *
(y/4)^grpsize[4])
}

u <- runif(m)        #for accept/reject step
v <- runif(m, -w, w) #proposal distribution
x[1] <-0.5
for (i in 2:m) {
y <- x[i-1] + v[i]
if (u[i] <= prob(y, grpsize) / prob(x[i-1], grpsize))
x[i] <- y else
x[i] <- x[i-1]
}

xtheta<-x[-(1:burn)]
theta.hat<-mean(xtheta)
round(theta.hat,3)

print(round(c(0.5+theta.hat/4,(1-theta.hat)/4,(1-theta.hat)/4,theta.hat/4),3))                 

print(round(c(125,20,18,34)/N,3))   

hist(xtheta,freq=F,main="Histogram of theta",xlab="theta")

The estimated value of $\theta$ is 0.623.

Homework 7

Question

For exercise 9.6, use the Gelman-Rubin method to monitor convergence of the chain, and run the chain until the chain has converged approximately to the target distribution according to $\hat{R}$< 1.2.

Answer

We construct five chains with initial values 0.4, 0.45, 0.5, 0.55 and 0.6, and then compute the Gelman-Rubin statistic.
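For $k$ chains of length $n$, the diagnostic computed by the Gelman.Rubin() function below is
$$
\hat{R}=\frac{\widehat{\operatorname{Var}}(\psi)}{W},\qquad \widehat{\operatorname{Var}}(\psi)=\frac{n-1}{n}W+\frac{B}{n},
$$
where $B=n\,\mathrm{var}(\bar\psi_1,\dots,\bar\psi_k)$ is the between-chain variance of the chain means and $W$ is the average within-chain variance; values of $\hat{R}$ close to 1 indicate approximate convergence.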

Gelman.Rubin <- function(psi) {
# psi[i,j] is the statistic psi(X[i,1:j])
# for chain in i-th row of X
psi <- as.matrix(psi)
n <- ncol(psi)
k <- nrow(psi)
psi.means <- rowMeans(psi) #row means
B <- n * var(psi.means) #between variance est.
psi.w <- apply(psi, 1, "var") #within variances
W <- mean(psi.w) #within est.
v.hat <- W*(n-1)/n + (B/n) #upper variance est.
r.hat <- v.hat / W #G-R statistic
return(r.hat)
}
prob <- function(y, grpsize) {
# computes (without the constant) the target density
if (y < 0 || y >= 1)
return (0)
return((1/2+y/4)^grpsize[1] *
((1-y)/4)^grpsize[2] * ((1-y)/4)^grpsize[3] *
(y/4)^grpsize[4])
}

theta.chain<-function(x1,m,burn,w=0.25){ 
#generate the chain (x1 is the initial value, m is the chain length, w is the half-width of the uniform proposal)
x<-numeric(m)
u <- runif(m)        #for accept/reject step
v <- runif(m, -w, w) #proposal increments
x[1] <-x1            #start the chain at the supplied initial value
for (i in 2:m) {
y <- x[i-1] + v[i]
if (u[i] <= prob(y, grpsize) / prob(x[i-1], grpsize))
x[i] <- y else
x[i] <- x[i-1]
}
return(x[(burn+1):m])
}
set.seed(711)
m <- 5000 #length of chains
burn <- 1000 #burn-in length
#choose overdispersed initial values
grpsize<-c(125,20,18,34)
x0 <- c(0.4, 0.45, 0.5, 0.55,0.6)
k<-length(x0)  #number of chains to generate
#generate the chains
X <- matrix(0, nrow=k, ncol=m-burn)
for (i in 1:k)
X[i, ] <- theta.chain(x0[i],m,burn)
Gelman.Rubin(X)

The Gelman-Rubin statistic $\hat{R}$ is approximately 1.002, well below the threshold of 1.2, so the chains have approximately converged to the target distribution.

Question

Find the intersection points $A(k)$ in $(0,\sqrt{k})$ of the curves $$ S_{k-1}(a)=P(t(k-1)>\sqrt{\frac{a^2(k-1)}{k-a^2}}) $$ and $$ S_{k}(a)=P(t(k)>\sqrt{\frac{a^2k}{k+1-a^2}}) $$

for $k = 4 : 25, 100, 500, 1000$, where $t(k)$ is a Student t random variable with $k$ degrees of freedom. (These intersection points determine the critical values for a t-test for scale-mixture errors proposed by Székely [260].)

Answer

K<-c(4:25,100,500,1000)
res<-numeric(length(K))
for(i in 1:length(K)){
  k<-K[i]
  f<-function(a)
  {pt(sqrt(a^2*(k-1)/(k-a^2)),k-1)-pt(sqrt(a^2*k/(k+1-a^2)),k)}

  res[i]<-uniroot(f,c(1,2))$root
}
cbind(K,res)

The intersection points are established above.

Homework 8

Question

Write a function to compute the cdf of the Cauchy distribution, which has density $$ \frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)},~~-\infty<x<\infty,~\theta>0. $$ Compare your results to the results from the R function pcauchy. (Also see the source code in pcauchy.c.)

Answer

We use function "integrate" in R to compute the cdf of Cauchy distribution.

f<-function(x,eta,theta){      #density function
  1/(theta*pi*(1+((x-eta)/theta)^2))
}
df<-function(x,eta=0,theta=1){ #cdf of cauchy(eta,theta)
    n<-length(x)
    res<-numeric(n)
    for(i in 1:n){
     res[i]<-integrate(f, lower=-Inf, upper=x[i],
     rel.tol=.Machine$double.eps^0.25,
     theta=theta,eta=eta)$value
    }
     return(res)
}

Next, we use "df" function builded myself and "pcauchy" function in R to construct the cdf figures with different parameters $\theta, \eta$ respectively.

#(1)       #standard Cauchy distribution
eta<-0
theta<-1
x<-seq(-10,10,0.1)
plot(x,df(x,eta,theta), lwd = 0.5,ylab="cdf",main="cdf of Cauchy(eta=0, theta=1)")
curve(pcauchy(x,location = 0, scale = 1),add=TRUE,lwd = 2,col="red")

#(2)
eta<-0
theta<-5
x<-seq(-10,10,0.1)
plot(x,df(x,eta,theta), lwd = 0.5,ylab="cdf",main="cdf of Cauchy(eta=0, theta=5)")
curve(pcauchy(x,location = 0, scale = 5),add=TRUE,lwd = 2,col="red")

#(3)
eta<-5
theta<-1
x<-seq(-5,15,0.1)
plot(x,df(x,eta,theta), lwd = 0.5,ylab="cdf",main="cdf of Cauchy(eta=5, theta=1)")
curve(pcauchy(x,location = 5, scale = 1),add=TRUE,lwd = 2,col="red")

#(4)
eta<-5
theta<-5
x<-seq(-5,15,0.1)
plot(x,df(x,eta,theta), lwd = 0.5,ylab="cdf",main="cdf of Cauchy(eta=5, theta=5)")
curve(pcauchy(x,location = 5, scale = 5),add=TRUE,lwd = 2,col="red")

As the figures show, the cdf computed by our function closely matches the results from pcauchy.

Finally, we print the R definition of pcauchy (the underlying C implementation is in pcauchy.c).

pcauchy

Question

A-B-O blood type problem

$\blacktriangleright$ Let the three alleles be A, B, and O.

|Genotype|AA|BB|OO|AO|BO|AB|Sum|
|-|-|-|-|-|-|-|-|
|Frequency|$p^2$|$q^2$|$r^2$|$2pr$|$2qr$|$2pq$|1|
|Count|$n_{AA}$|$n_{BB}$|$n_{OO}$|$n_{AO}$|$n_{BO}$|$n_{AB}$|$n$|

$\blacktriangleright$ Observed data: $n_{A\cdot} = n_{AA} + n_{AO} = 28$ (A-type), $n_{B\cdot} = n_{BB} + n_{BO} = 24$ (B-type), $n_{OO} = 41$ (O-type), $n_{AB} = 70$ (AB-type).

$\blacktriangleright$ Use the EM algorithm to solve for the MLEs of p and q (treating $n_{AA}$ and $n_{BB}$ as missing data).

$\blacktriangleright$ Record the maximum likelihood values in M-steps, are they increasing?

Answer

First, we write down the complete-data likelihood, and given the current estimates $\hat{p}_0,~\hat{q}_0$ we derive the one-step updated MLEs.

Complete-data likelihood: $$ L(p,q\mid n_{AA},n_{BB},n_{OO},n_{AO},n_{BO},n_{AB})=(p^2)^{n_{AA}}(q^2)^{n_{BB}}(r^2)^{n_{OO}}(2pr)^{n_{AO}}(2qr)^{n_{BO}}(2pq)^{n_{AB}}, $$ so that, up to an additive constant, $$ l(p,q)=n_{A\cdot}\log(pr)+n_{AA}\log(p/r)+n_{B\cdot}\log(qr)+n_{BB}\log(q/r)+2n_{OO}\log(r)+n_{AB}\log(pq), $$ and the E-step gives $$ E_{(\hat{p}_0,\hat{q}_0)}[l(p,q)]=n_{A\cdot}\log(pr)+n_{A\cdot}\frac{\hat{p}_0}{2-\hat{p}_0-2\hat{q}_0}\log(p/r)+n_{B\cdot}\log(qr)+n_{B\cdot}\frac{\hat{q}_0}{2-2\hat{p}_0-\hat{q}_0}\log(q/r)+2n_{OO}\log(r)+n_{AB}\log(pq), $$ where $\hat{p}_0$ and $\hat{q}_0$ are the current estimates and $r=1-p-q$.

Defining $a_0=\frac{\hat{p}_0}{2-\hat{p}_0-2\hat{q}_0}$ and $b_0=\frac{\hat{q}_0}{2-2\hat{p}_0-\hat{q}_0}$, the expected complete-data log-likelihood simplifies to $$ E_{(\hat{p}_0,\hat{q}_0)}[l(p,q)]=[n_{A\cdot}(1+a_0)+n_{AB}]\log(p)+[n_{A\cdot}(1-a_0)+n_{B\cdot}(1-b_0)+2n_{OO}]\log(r)+[n_{B\cdot}(1+b_0)+n_{AB}]\log(q). $$ Maximizing this expression (the M-step) yields the updated estimates $$ \hat{p}_1=\frac{H_p}{H_p+H_q+H_{pq}},~~\hat{q}_1=\frac{H_q}{H_p+H_q+H_{pq}}, $$ where $H_p=n_{A\cdot}(1+a_0)+n_{AB}$, $H_q=n_{B\cdot}(1+b_0)+n_{AB}$ and $H_{pq}=n_{A\cdot}(1-a_0)+n_{B\cdot}(1-b_0)+2n_{OO}$.

Next, we use the EM algorithm to compute the MLEs of $p$ and $q$.

p0<-0.2; q0<-0.2   #initial estimates
nA<-28; nB<-24; nOO<-41; nAB<-70  #observed data
tol <- .Machine$double.eps^0.5
N <- 10000 #max. number of iterations
p<-p0; q<-q0
El<-numeric(N) #Record the maximum likelihood values in M-steps
k<-1  #Record the circle times
for (i in 1:N) {
a<-p/(2-p-2*q)
b<-q/(2-2*p-q)
Hp=nA*(1+a)+nAB
Hq=nB*(1+b)+nAB
Hpq=nA*(1-a)+nB*(1-b)+2*nOO

El[i]<-Hp*log(p)+Hpq*log(1-p-q)+Hq*log(q) #record the expected complete-data log-likelihood at the current estimates

x<-p; y<-q                #storage previous estimates
p<-Hp/(Hp+Hq+Hpq); q<-Hq/(Hp+Hq+Hpq)
k<-k+1
if (abs(p-x)/x<tol && abs(q-y)/y<tol) break
p.hat<-p; q.hat<-q
El<-El[1:k]
}
p.hat
q.hat
El        
diff(El)  #First order difference of maximum likelihood values in M-steps

By the EM algorithm, the MLEs of $p$ and $q$ are 0.3273442 and 0.3104267, respectively. Judging from the first-order differences of the recorded log-likelihood values, they are not increasing.

Homework 9

Question

Use both for loops and lapply() to fit linear models to the mtcars using the formulas stored in this list:

formulas <- list(

mpg ~ disp,

mpg ~ I(1 / disp),

mpg ~ disp + wt,

mpg ~ I(1 / disp) + wt

)

Answer

#Use for loops
attach(mtcars)
formulas <- list(
mpg ~ disp,
mpg ~ I(1 / disp),
mpg ~ disp + wt,
mpg ~ I(1 / disp) + wt
)
f<-list()
for (i in seq_along(formulas)){
  f[[i]]<-lm(formulas[[i]],mtcars)
}
f
#Use lapply()
lapply(formulas,function(x) lm(x,mtcars))

Question

Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function?

bootstraps <- lapply(1:10, function(i) {

rows <- sample(1:nrow(mtcars), rep = TRUE)

mtcars[rows, ]

})

Answer

#Use for loops
set.seed(921)
attach(mtcars)
bootstraps <- lapply(1:10, function(i) {
rows <- sample(1:nrow(mtcars), rep = TRUE)
mtcars[rows, ]
})
f<-list()
for (i in seq_along(bootstraps)){
  f[[i]]<-lm(mpg~disp,bootstraps[[i]])
}
f
#Use lapply()
lapply(bootstraps,function(x) lm(mpg~disp,x))
#Without an anonymous function
lapply(bootstraps,lm,formula=mpg~disp)

Question

For each model in the previous two exercises, extract $R^2$ using the function below.

rsq <- function(mod) summary(mod)$r.squared

Answer

In this exercise, we only use lapply() to extract $R^2$.

set.seed(921)
mod1<-lapply(formulas,function(x) lm(x,mtcars))
mod2<-lapply(bootstraps,function(x) lm(mpg~disp,x))

rsq <- function(mod) summary(mod)$r.squared
c(unlist(lapply(mod1,rsq)),unlist(lapply(mod2,rsq)))

Question

The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.

trials <- replicate(

100,

t.test(rpois(10, 10), rpois(7, 10)),

simplify = FALSE

)

Extra challenge: get rid of the anonymous function by using [[ directly.

Answer

#Use sapply() and an anonymous function
set.seed(941)
trials <- replicate(
100,
t.test(rpois(10, 10), rpois(7, 10)),
simplify = FALSE
)
sapply(trials,function(x) x$p.value)
#Without an anonymous function
p.value<-function(x) x$p.value  #define p.value function
sapply(trials,p.value)

Question

Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?

Answer

We combine Map() and vapply() to create an lapply() variant.

lapply1<-function (f,n,type, ...) {  #n is the length of output of function f.
  f <- match.fun(f)
  tt=Map(f, ...)
  if(type=="numeric")  y=vapply(tt,cbind,numeric(n))
  else if (type=="character") y=vapply(tt,cbind,character(n))
  else if (type=="complex") y=vapply(tt,cbind,complex(n))
  else if (type=="logical") y=vapply(tt,cbind,logical(n))
  return(y)
}
#example
set.seed(951)
kk<-function(x,w){
  y=weighted.mean(x, w, na.rm = TRUE)
  return(c(1,y))
}
xs <- replicate(5, runif(10), simplify = FALSE)
ws <- replicate(5, rpois(10, 5) + 1, simplify = FALSE)
lapply1(kk,2,"numeric",xs, ws)
is.matrix(lapply1(kk,2,"numeric",xs, ws))

Then, we use function "foreach" to create an lapply() variant.

library(foreach)
lapply2<-function(...,combine){
foreach(...,.combine=combine)
}

We use the new function to carry out some simple parallel computations.

#output is vector
lapply2(a=c(1,2,3),b=rep(1,3),combine="c")%do%{x1<-a+b;x2<-a*b;c(x1,x2)}

#output is matrix
lapply2(a=c(1,2,3),b=rep(1,3),combine="cbind")%do%{x1<-a+b;x2<-a*b;c(x1,x2)}

#Obtain same results in the former question with the new function
lapply2(i=1:length(trials),combine=c)%do%trials[[i]]$p.value

bias<-lapply2(i=1:length(trials),combine=c)%do%trials[[i]]$p.value-sapply(trials,function(x) x$p.value)
bias

Homework 10

Question

Make a faster version of chisq.test() that only computes the chi-square test statistic when the input is two numeric vectors with no missing values. You can try simplifying chisq.test() or by coding from the mathematical definition (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test).

Answer

set.seed(1011)
chisq.test2<-function(x,y){
  input<-as.table(rbind(x,y))
  out<-chisq.test(input)$statistic
  out
}

#Example
x<-rpois(10000,lambda=20)

y<-rbinom(10000,30,0.6)

chisq.test2(x,y)

system.time(chisq.test2(x,y))
system.time(chisq.test((as.table(rbind(x,y)))))
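For comparison, here is a minimal sketch that codes the Pearson statistic directly from its mathematical definition, treating the two vectors as the rows of a 2 x k table of counts exactly as above (chisq.stat2 is a name introduced here, not part of the original code); it should agree with chisq.test2 up to floating-point error while skipping the extra checks done by chisq.test.

chisq.stat2 <- function(x, y) {
  O <- rbind(x, y)                             #observed counts
  E <- outer(rowSums(O), colSums(O)) / sum(O)  #expected counts under independence
  sum((O - E)^2 / E)                           #Pearson chi-square statistic
}
chisq.stat2(x, y)
system.time(chisq.stat2(x, y))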

Question

Can you make a faster version of table() for the case of an input of two integer vectors with no missing values? Can you use it to speed up your chi-square test?

Answer

set.seed(1011)
x<-rpois(10000,lambda=20)
y<-rbinom(10000,30,0.6)
as.table2<-function(x,y){
  out<-rbind(x,y)
  class(out)<-"table"
  out
}
class(as.table2(x,y))

system.time(as.table2(x,y))
system.time(as.table(rbind(x,y)))

system.time(chisq.test(as.table2(x,y)))
system.time(chisq.test(as.table(rbind(x,y))))

