Overview

StatComp18096 is a simple R package developed to present the homework answers for the 'Statistical Computing' course.

1st 2018/09/14

Question

Implement at least three examples of different types from the books "R for Beginners" and "Statistical Computing with R".

Show the answers with texts, numerical results, tables, and figures.

Answers

Example 1 - How to plot (R for Beginners P45)

This example shows how to produce a polished figure using plotting parameters and the par() function.

Generating ten pairs of random values from a normal distribution.

Before showing the figures, ten pairs of random values are generated as follows:

x<-rnorm(10)
y<-rnorm(10)
rbind(x,y)

FIGURE 1.1

The first plot is displayed in black and white.

plot(x,y)

FIGURE 1.2 The second plot looks much better after replacing the default values of several plotting options.

# replace some plotting parameters
plot(x,y,xlab="Ten random values",ylab="Ten other values",xlim=c(-2,2),ylim=c(-2,2),pch=22,col="red",bg="yellow",bty="l",tcl=0.4,main="How to customize a plot with R",las=1,cex=1.5)

FIGURE 1.3

par() can be used to set or query graphical parameters. The default graphical parameters are copied to the list opar, and we then modify the previous plot.

opar <- par()
par(bg="lightyellow", col.axis="blue", mar=c(4, 4, 2.5, 0.25))
plot(x, y, xlab="Ten random values", ylab="Ten other values",
xlim=c(-2, 2), ylim=c(-2, 2), pch=22, col="red", bg="yellow",
bty="l", tcl=-.25, las=1, cex=1.5)
title("How to customize a plot with R (bis)", font.main=3, adj=1)

FIGURE 1.4

Here we use the plot function to draw a "blank" figure, and then use low-level functions to add points, coordinate axes, labels, etc. In this way we have full control over the drawing.

opar <- par()
# change the background and margins
par(bg="lightgray", mar=c(2.5, 1.5, 2.5, 0.25))
# plot a blank figure
plot(x, y, type="n", xlab="", ylab="", xlim=c(-2, 2), ylim=c(-2, 2), xaxt="n", yaxt="n")
# change the color of the drawing area
rect(-3, -3, 3, 3, col="cornsilk")
points(x, y, pch=10, col="red", cex=2)
axis(side=1, c(-2, 0, 2), tcl=-0.2, labels=FALSE)
axis(side=2, -1:1, tcl=-0.2, labels=FALSE)
title("How to customize a plot with R (ter)", font.main=4, adj=1, cex.main=1)
mtext("Ten random values", side=1, line=1, at=1, cex=0.9, font=3)
mtext("Ten other values", line=0.5, at=-1.8, cex=0.9, font=3)
mtext(c(-2, 0, 2), side=1, las=1, at=c(-2, 0, 2), line=0.3,
col="blue", cex=0.9)
mtext(-1:1, side=2, las=1, at=-1:1, line=0.2, col="blue", cex=0.9)

Example 2 - densityplot (R for Beginners P50)

Generate independent normal samples, divided into nine subsamples of increasing size.

Use the densityplot function to produce a plot for each subsample.

panel.densityplot plots the empirical density function, and panel.mathdensity superimposes the fitted normal density function.

library("lattice")
n <- seq(5, 45, 5)
x <- rnorm(sum(n))
y <- factor(rep(n, n), labels=paste("n =", n))
densityplot(~ x | y, 
            panel = function(x, ...) {
              panel.densityplot(x, col="DarkOliveGreen", ...)
              panel.mathdensity(dmath=dnorm,
                                args=list(mean=mean(x), sd=sd(x)),
                                col="darkblue")
              })

Example 3 - xyplot & splom (R for Beginners P51)

Earthquake data. This example shows the geographic locations of earthquakes at different depths.

data(quakes)
mini <- min(quakes$depth)
maxi <- max(quakes$depth)
int <- ceiling((maxi - mini)/9)
inf <- seq(mini, maxi, int)
quakes$depth.cat <- factor(floor(((quakes$depth - mini) / int)),
labels=paste(inf, inf + int, sep="-"))  # add the labels with paste
xyplot(lat ~ long | depth.cat, data = quakes) # split according to depth.cat

Iris data. In contrast to the previous example, this xyplot uses panel.superpose to draw the overlapping groups in the same panel.

data(iris)
xyplot(
Petal.Length ~ Petal.Width, data = iris, groups=Species,
panel = panel.superpose,
type = c("p", "smooth"), span=.75,
auto.key = list(x = 0.15, y = 0.85) # legend,relevant location (0,1)
)

splom(~ x) is used to plot a scatterplot matrix of pairwise two-dimensional plots.

splom(
~iris[1:4], groups = Species, data = iris, xlab = "",
panel = panel.superpose,
auto.key = list(columns = 3)
)

Example 4 - stratified sampling ("Statistical Computing with R" P146) Stratified sampling is implemented in a more general way, for the Monte Carlo estimate of $\int_0^1e^{-x}(1+x^2)^{-1}dx$. The standard Monte Carlo estimate is also obtained for comparison.

    M <- 10000  #number of replicates
    k <- 10     #number of strata
    r <- M / k  #replicates per stratum
    N <- 50     #number of times to repeat the estimation
    T2 <- numeric(k)
    estimates <- matrix(0, N, 2)

    g <- function(x) {
        exp(-x - log(1+x^2)) * (x > 0) * (x < 1)
        }

    for (i in 1:N) {
        estimates[i, 1] <- mean(g(runif(M)))
        for (j in 1:k)
            T2[j] <- mean(g(runif(M/k, (j-1)/k, j/k)))
        estimates[i, 2] <- mean(T2)
    }
#The result of this simulation produces the following estimates.
    apply(estimates, 2, mean)
    apply(estimates, 2, var)

This represents a more than 98% reduction in variance.
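A quick way to verify this figure from the simulation output above is to compute the percent reduction directly from the two columns of estimates; a minimal sketch, assuming the simulation above has just been run:

    v <- apply(estimates, 2, var)
    100 * (v[1] - v[2]) / v[1]   #percent reduction in variance achieved by stratification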

### Plot importance functions in Figures 5.1(a) and 5.1(b)

    #par(ask = TRUE) #uncomment to pause between graphs

    x <- seq(0, 1, .01)
    w <- 2
    f1 <- exp(-x)
    f2 <- (1 / pi) / (1 + x^2)
    f3 <- exp(-x) / (1 - exp(-1))
    f4 <- 4 / ((1 + x^2) * pi)
    g <- exp(-x) / (1 + x^2)

    #for color change lty to col

    #figure (a)
    plot(x, g, type = "l", main = "", ylab = "",
         ylim = c(0,2), lwd = w)
    lines(x, g/g, lty = 2, lwd = w)
    lines(x, f1, lty = 3, lwd = w)
    lines(x, f2, lty = 4, lwd = w)
    lines(x, f3, lty = 5, lwd = w)
    lines(x, f4, lty = 6, lwd = w)
    legend("topright", legend = c("g", 0:4),
           lty = 1:6, lwd = w, inset = 0.02)

    #figure (b)
    plot(x, g, type = "l", main = "", ylab = "",
        ylim = c(0,3.2), lwd = w, lty = 2)
    lines(x, g/f1, lty = 3, lwd = w)
    lines(x, g/f2, lty = 4, lwd = w)
    lines(x, g/f3, lty = 5, lwd = w)
    lines(x, g/f4, lty = 6, lwd = w)
    legend("topright", legend = c(0:4),
           lty = 2:6, lwd = w, inset = 0.02)

Example 5 -Estimating a confidence level ("Statistical Computing with R" P159)

Example 6.4 (Confidence interval for variance)

If $X_1,...,X_n$ is a random sample from a $Normal(\mu, \sigma^2)$ distribution, $n \geq 2$, and $S^2$ is the sample variance, then $V = \frac{(n - 1)S^2}{\sigma^2} \sim \chi^2(n - 1)$.

A one-sided $100(1-\alpha)\%$ confidence interval is given by $(0,\frac{(n-1)S^2}{\chi^2_\alpha})$, where $\chi^2_\alpha$ is the $\alpha$-quantile of the $\chi^2(n-1)$ distribution. If the sampled population is normal with variance $\sigma^2$, then the probability that the confidence interval contains $\sigma^2$ is $1-\alpha$.

The calculation of the 95% upper confidence limit (UCL) for a random sample size n = 20 from a $Normal(0, \sigma^2 = 4)$ distribution is shown below.

 n <- 20
    alpha <- .05
    x <- rnorm(n, mean=0, sd=2)
    UCL <- (n-1) * var(x) / qchisq(alpha, df=n-1)
    print(UCL)

The confidence level is exactly $P(\frac{19S^2}{\chi_{0.05}^2(19)}>4)=P(\frac{(n-1)S^2}{\sigma^2}>\chi_{0.05}^2(n-1))=0.95$
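As a sanity check (not part of the original example), the stated 95% level can be verified empirically by repeating the interval construction many times and counting how often the UCL exceeds the true variance $\sigma^2 = 4$; a minimal sketch:

    m <- 1000
    hits <- replicate(m, {
      x <- rnorm(20, mean = 0, sd = 2)
      ucl <- (20 - 1) * var(x) / qchisq(0.05, df = 20 - 1)
      ucl > 4      #does the one-sided interval (0, UCL) cover sigma^2 = 4?
    })
    mean(hits)     #empirical confidence level, should be close to 0.95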

2nd 2018/09/20

Exercises 3.5.(pages 94)

A discrete random variable X has probability mass function

x<- 0:4
prob <- c(0.1,0.2,0.2,0.2,0.3)
x.prob <- rbind(x,prob)
x.prob

Use the inverse transform method to generate a random sample of size 1000 from the distribution of X.

Construct a relative frequency table and compare the empirical with the theoretical probabilities. Repeat using the R sample function.

x <- 0:4
prob <- c(0.1,0.2,0.2,0.2,0.3)
cp <- cumsum(prob); m <- 1000; r <- numeric(m)
# inverse transform method: locate each uniform draw in the cumulative probabilities
r <- x[findInterval(runif(m),cp)+1]
ct <- as.vector(table(r)); ctfre <- ct/sum(ct); ctdif <- ctfre/prob   #empirical frequencies and their ratio to prob
# generate a random sample of size 1000
n <- 1000
a <- sample(x,size = n,prob = prob ,replace = TRUE)
# compute the frequencies
a <- as.vector(table(a))
a.fre <- a/n          #empirical relative frequencies from sample()
diffe1 <- a.fre/prob  #ratio of empirical to theoretical probabilities
diffe2 <- ctdif
data <- data.frame(x,ct,a,ctfre,a.fre,prob,diffe1,diffe2)
colnames(data) <- c("x","ct-num","sample-num","ct-freq","samp-freq","prob","dif1","dif2")
data

States

The data frame shows that both samples are good: the empirical probabilities generated by the sample function and by the inverse transform method are both close to the theoretical probabilities.

To save space, the sample data are not shown.



Exercises 3.7.(pages 94)

Write a function to generate a random sample of size n from the Beta(a, b) distribution by the acceptance-rejection method.

Generate a random sample of size 1000 from the Beta(3,2) distribution.

Graph the histogram of the sample with the theoretical Beta(3,2) density superimposed.

# write the function
rBeta <- function(n,a,b){
  k <- 0; j <- 0    #k counts accepted values, j counts candidates
  y <- numeric(n)
  #bound constant: f(x) <= c*g(x) with g the Uniform(0,1) density;
  #the acceptance ratio f(x)/(c*g(x)) simplifies to x^(a-1)*(1-x)^(b-1)
  c <- prod(2:(a+b-1))/prod(2:(a-1))/prod(2:(b-1))
  while (k < n) {
    u <- runif(1)
    j <- j + 1
    x <- runif(1) #random variate from g
    r <- x^(a-1)*(1-x)^(b-1)
    if (r > u) {
      #we accept x
      k <- k + 1
      y[k] <- x
    }
  }
  y
}

# Generate a random sample of size 1000 from the Beta(3,2) distribution. 
a <- 3; b <- 2
c <- 12
t <- rBeta(1000,3,2)       #sample data
t
hist(t, prob = TRUE, main = bquote(f(x)==12*x^2*(1-x))) #density histogram of sample)

y <- seq(0, 1, .01)
lines(y,c*y^(a-1)*(1-y)^(b-1)) # density curve

States

As required, I first define the function rBeta(n,a,b), where n is the sample size and a and b are the parameters of the Beta(a,b) distribution.

The pdf of beta(a,b) is $f(x)=\frac{(a+b-1)!}{(a-1)!(b - 1)!}x^{(a - 1)}(1-x)^{(b - 1)}$.

There exists a constant $c = \frac{(a+b-1)!}{(a-1)!(b-1)!}$ such that $f(x) \le cg(x)$, where $g(x)=\mathbf{1}_{(0,1)}(x)$ is the Uniform(0,1) density.

P(acceptance) = $\frac{1}{c}$
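For Beta(3,2), $c = \frac{4!}{2!\,1!} = 12$, so only about 1 in 12 candidates is accepted on average. A small empirical check of this acceptance rate (a sketch separate from rBeta(), using the same acceptance rule):

a <- 3; b <- 2
n.accept <- 0; n.try <- 0
while (n.accept < 1000) {
  n.try <- n.try + 1
  x <- runif(1); u <- runif(1)
  if (u < x^(a-1) * (1-x)^(b-1)) n.accept <- n.accept + 1   #same acceptance rule as rBeta()
}
n.try / n.accept   #average number of candidates per accepted value, should be close to c = 12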

After defining the function, I generate a random sample of size 1000 from Beta(3,2) and plot its histogram with the theoretical density superimposed.



Exercises 3.12. (pages 94)

Simulate a continuous Exponential-Gamma mixture. Suppose that the rate parameter $\Lambda$ has a $Gamma(r, \beta)$ distribution and Y has an $Exp(\Lambda)$ distribution.

That is, $(Y |\Lambda = \lambda) \sim f_{Y} (y|\lambda) = \lambda e^{-\lambda y}$. Generate 1000 random observations from this mixture with r = 4 and $\beta = 2$.

    #generate a Exponential-Gamma mixture
    n <- 1000
    r <- 4
    beta <- 2
    lambda <- rgamma(n, r, beta) #lambda is random

    #now supply the sample of lambda's as the exponential parameter
    x <- rexp(n, lambda)        #the mixture
    x

States

First, we generate 1000 random observations (the lambda vector) from the $Gamma(r, \beta)$ distribution, and then we supply the sampled lambda's as the exponential rate parameter.

This is how we generate a random sample of size 1000 from the mixture with r = 4 and $\beta = 2$.

3rd 2018/09/28

Question

Write a function to compute a Monte Carlo estimate of the Beta(3, 3) cdf, and use the function to estimate F(x) for x = 0.1, 0.2,..., 0.9. Compare the estimates with the values returned by the pbeta function in R.

Answer

step1 Write a function called 'MCbeta' that computes a Monte Carlo estimate of the Beta(a, b) cdf, where the Beta(a,b) cdf is $$ F(x) = \int_0^x\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}t^{a-1}(1-t)^{b-1}dt. $$

In this function there are four parameters: a, b, n and x.

Consider the problem of estimating $\theta = \int_0^x 30t^2(1-t)^2dt$. Generate 9 random uniform samples on (0,x), each of size 1000, where x equals 0.1, 0.2, ..., 0.9.

step2 Use Monte Carlo integration to estimate the cdf: with $U_1,\ldots,U_m \sim Uniform(0,x)$ and $m = 1000$, the estimate is $\hat{F}(x) = \frac{x}{m}\sum_{i=1}^m 30U_i^2(1-U_i)^2$.

step3

Compare the estimates with the values F(x) computed (numerically) by the pbeta function, using their ratio.

The code and the result is shown as below:

MCbeta <- function(a,b,n,x){
  cdf <- numeric(length(x))
    for (i in 1 :length(x) ){
      u <- runif(n,0,x[i])
      g <- x[i] * factorial(a+b-1)/factorial(a-1)/factorial(b-1)* u ^(a-1) *(1-u) ^(b-1)
      cdf[i] <- mean(g)   #estimated value
    }
    Phi <- pbeta(x,a,b)   #beta(a,b)
    ratio <- cdf/Phi
    print(round(rbind(x, cdf, Phi, ratio), 3))
}


x <- seq(.1,0.9, length=9)
MCbeta(3,3,1000,x)

As shown above, the ratio is very close to 1, i.e., the Monte Carlo estimate cdf is close to the value Phi returned by pbeta.

Question

The Rayleigh density [156, (18.76)] is $f(x) = \frac{x}{\sigma^2} e^{-\frac{x^2}{2\sigma^2}}, x \geq 0, \sigma > 0$. Implement a function to generate samples from a Rayleigh($\sigma$) distribution, using antithetic variables. What is the percent reduction in variance of $\frac{X+X'}{2}$ compared with $\frac{X_1+X_2}{2}$ for independent $X_1$, $X_2$?

Answer

step1

If X is a continuous random variable with cdf $F_X (x)$, here $$F_X (x)=1-e^{-\frac{x^2}{2\sigma^2}},$$ then $U = F_X(X) \sim Uniform(0, 1)$.

step2 The variable U follows a uniform distribution on [0, 1]. The first sample is $\{u_1,\ldots,u_n\}$, where each $u_i$ is drawn from $U(0, 1)$. The second sample is built from $u_1',\ldots,u_n'$, where, for each i, $u_i'=1-u_i$.

Then $X = F^{-1}_X(U)$ and $X' = F^{-1}_X(U')$ both have distribution $F_X$ and are antithetic to each other, since $F_X^{-1}$ is a monotone function.

Here $X=\sqrt{-2\sigma^2\ln(1-U)}$.

Furthermore, the covariance between X and X' is negative, which yields the variance reduction.

step3

Estimate the variances of $\frac{X+X'}{2}$ and $\frac{X_1+X_2}{2}$ from the generated samples and compute the percent reduction in variance.

The code and the result is shown as below:

Antith.sampling <- function(sigma, n, antithetic = TRUE){
  u <- runif(n/2) 
  if (antithetic){ 
    v <- 1 - u
    a <- sqrt(-2*sigma^2*log(1-u))
    b <- sqrt(-2*sigma^2*log(1-v))
    variance <- (var(a)+var(b)+2*cov(a,b))/4 # Var((X+X')/2) for an antithetic pair
  }
  else{
    v <- runif(n/2)
    u <- c(u, v)
    a <- sqrt(-2*sigma^2*log(1-u)) # a is the whole independent sample of size n
    variance <- var(a)/2           # Var((X1+X2)/2) = Var(X)/2 for independent X1, X2
  }
variance
}

var1 <- Antith.sampling(2,1000,antithetic = FALSE) # variance of (X1+X2)/2 with independent X1, X2
var2 <- Antith.sampling(2,1000,antithetic = TRUE)  # variance of (X+X')/2 with antithetic variables
ratio <- 100 * (var1-var2)/var1   #the percent of reduction
print(c(var1,var2,ratio))

We can conclude that the antithetic variables method reduces the variance substantially.

Question

Find two importance functions $f_1$ and $f_2$ that are supported on $(1, \infty)$ and are close to

$$g(x) = \frac{x^2}{\sqrt{2\pi}} e^{-x^2/2}, \quad x> 1.$$ Which of your two importance functions should produce the smaller variance in estimating $$\int_1^\infty\frac{x^2}{\sqrt{2\pi}} e^{-x^2/2}dx$$ by importance sampling? Explain.

Answer

step Consider the density $g(x)$ and plot it; we then choose candidate importance functions $f(\cdot)$ that are close to $g(x)$.

So, I choose $$f_1(x) = e^{-x}, x \in (1,\infty)$$ and $$f_2(x) = \frac{1}{x^2}, x \in (1,\infty)$$

The code and the result is shown as below:

x <- seq(1,10,0.02)
y <- x^2/sqrt(2*pi)* exp((-x^2/2))
y1 <- exp(-x)
y2 <- 1 / (x^2)

gs <- c(expression(g(x)==x^2*e^{-x^2/2}/sqrt(2*pi)),expression(f[1](x)==e^{-x}),expression(f[2](x)==1/x^2))
par(mfrow=c(1,2))
    #figure (a)
    plot(x, y, type = "l", ylab = "", ylim = c(0,0.5),main='density function')
    lines(x, y1, lty = 2,col="red")
    lines(x, y2, lty = 3,col="blue")
    legend("topright", legend = 0:2,
           lty = 1:3,inset = 0.02,col=c("black","red","blue"))

    #figure (b)
   plot(x, y/y1,  type = "l",ylim = c(0,2),ylab = "",lty = 2, col="red",main = 'ratios')
    lines(x, y/y2, lty = 3,col="blue")
    legend("topright", legend = 1:2,
           lty = 2:3, inset = 0.02,col=c("red","blue"))

The function that corresponds to the most nearly constant ratio g(x)/f(x) appears to be $f_2$, as is clear in the figure.

g <- function(x) {x^2*exp(-x^2/2)/sqrt(2*pi)*(x>1)}
m <- 10000   
theta.hat <- se <- numeric(2)

x <- rexp(m, 1)   #using f1
fg <- g(x) / exp(-x)
theta.hat[1] <- mean(fg)
se[1] <- sd(fg)

u <- runif(m) 
x <- 1/(1-u) #using f2
fg <- g(x)*x^2
theta.hat[2] <- mean(fg)
se[2] <- sd(fg)           
rbind(theta.hat,se)

Consistent with the conclusion above, $f_2$ is better because it gives the smaller standard error.

Question

Obtain a Monte Carlo estimate of $\int_1^\infty\frac{x^2}{\sqrt{2\pi}} e^{-x^2/2}dx$ by importance sampling.

Answer

step1 (5.13 cont.) We choose $f(x)=\frac{1}{x^2},\ x \in (1,+\infty)$, as the importance function for $g(x)=\frac{x^2}{\sqrt{2\pi}} e^{-x^2/2},\ x\in(1,+\infty)$.

step2 $$\int_1^{\infty}\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx=E_f\left[\frac{g(X)}{f(X)}\right]$$

The code and the result is shown as below:

g <- function(x) {x^2*exp(-x^2/2)/sqrt(2*pi)*(x>1)}
m <- 1e4                 
u <- runif(m) 
x <- 1/(1-u) #using f2
fg <- g(x)*x^2
theta.hat <- mean(fg)
print(theta.hat)
theta <- integrate(g,1,Inf)
theta

Using the integrate function to compute the theoretical value, the Monte Carlo estimate is quite close to it.

4th 2018/10/13

Question

Let X be a non-negative random variable with $\mu = E[X] < \infty$. For a random sample $x_1,...,x_n$ from the distribution of X, the Gini ratio is defined by$$G = \frac{1}{2n^2\mu}\sum_i\sum_j|x_i-x_j|.$$ The Gini ratio is applied in economics to measure inequality in income distribution. Note that G can be written in terms of the order statistics $x_{(i)}$ as $$G = \frac{1}{n^2\mu}\sum_{i=1}^n(2i - n - 1)x_{(i)}.$$ If the mean is unknown, let $\hat{G}$ be the statistic $G$ with $\mu$ replaced by $\overline{x}$. Estimate by simulation the mean, median and deciles of $\hat{G}$ if $X$ is standard lognormal. Repeat the procedure for the uniform distribution and Bernoulli(0.1). Also construct density histograms of the replicates in each case.

Answer

step1

(a) Generate $x_1^{(j)} ,...,x_n^{(j)} , iid$ from the distribution of X.

(b) Sort $x_1^{(j)} ,...,x_n^{(j)}$in increasing order, to obtain $x_1^{(j)} \leq...\leq x_n^{(j)}$

(c) Compute $$\hat{G}^{(j)} = \frac{1}{n^2\,\overline{x^{(j)}}}\sum_{i=1}^n(2i - n - 1)x_{(i)}^{(j)}$$

step2 Compute the median, $\mathrm{Median}(\hat{G})=\hat{G}_{(m/2)}$, and the deciles, $\hat{G}_{(im/10)},\ i=1,\ldots,9$, from the ordered replicates (here via the sample quantiles).

step3 Repeat the procedure with x drawn from the uniform distribution and from Bernoulli(0.1).

step4 Plot the histograms.

The code and the result is shown as below:

m <- 1000
n <- 1000
Gini <- function(m,n){
  G_hat1<-G_hat2<-G_hat3 <- numeric(m)

  for (j in 1:m) {
    x1 <- rlnorm(n)
    x2 <- runif(n)
    x3 <- rbinom(n,size=1,prob=0.1)
    u1 <- mean(x1);u2 <- mean(x2);u3 <- mean(x3)
    x1sort <- sort(x1);x2sort <- sort(x2);x3sort <- sort(x3)

    g1<-g2<-g3 <- numeric(n)
    for (i in 1:n){
      g1[i] <- (2*i-n-1)*x1sort[i]
      g2[i] <- (2*i-n-1)*x2sort[i]
      g3[i] <- (2*i-n-1)*x3sort[i]
    }
    G_hat1[j] <- sum(g1)/n^2/u1
    G_hat2[j] <- sum(g2)/n^2/u2
    G_hat3[j] <- sum(g3)/n^2/u3
  }

  print(c(mean(G_hat1),median(G_hat1)))
  print(quantile(G_hat1,seq(0,1,0.1)))
  hist(G_hat1,prob=TRUE,main = "X from lognorm distribution")

  print(c(mean(G_hat2),median(G_hat2)))
  print(quantile(G_hat2,seq(0,1,0.1)))
  hist(G_hat2,prob=TRUE,main = "X from uniform distribution")

  print(c(mean(G_hat3),median(G_hat3)))
  print(quantile(G_hat3,seq(0,1,0.1)))
  hist(G_hat3,prob=TRUE,main = "X from Bernoulli(0.1) distribution")
}

Gini(m,n)

As shown above, the results differ substantially because X comes from different populations. The Gini ratio measures the inequality among values of a frequency distribution (for example, levels of income): the larger the Gini ratio, the higher the inequality. Accordingly, G is lowest when X comes from the uniform distribution and highest when X comes from the Bernoulli distribution.

Question

Construct an approximate 95% confidence interval for the Gini ratio $Y = E[G]$ if X is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.

Answer

step1 Generate the replicates $G(j), j = 1 ...,m$ by repeating:

(a) Generate $x_1^{(j)} ,...,x_n^{(j)} , iid$ from the distribution of X.

(b) Sort $x_1^{(j)} ,...,x_n^{(j)}$in increasing order, to obtain $x_1^{(j)} \leq...\leq x_n^{(j)}$

(c) Compute $$\hat{G}^{(j)} = \frac{1}{n^2\,\overline{x^{(j)}}}\sum_{i=1}^n(2i - n - 1)x_{(i)}^{(j)}$$

step2

Based on large-sample properties, I compute the approximate 95% CI with the function GCI(m,n,a,b,alpha), where m is the number of replicates, n is the sample size of X (drawn from a lognormal distribution), and a and b are the (unknown) parameters of the lognormal distribution. The output is the $(1-\alpha)100\%$ confidence interval together with the coverage rate.

step3 Assess the coverage rate with MC method.

The code and the result is shown as below:

GCI <- function(m,n,a,b,alpha){
  G <- numeric(m)
  for (i in 1:m) {
    # function to calculate the Gini ratio of a lognormal(a,b) sample
    Gi <- function(n,a,b){
      y <- numeric(n)
      y <- sort(rnorm(n, a, b)) # x = exp(y) ~ lognormal(a,b)
      mu <- exp(a+b^2/2) # mean of X
      Gini <- 1/n^2/mu * sum((2*c(1:n)-n-1)*exp(y))
      return(Gini)
      }
    G[i]<-Gi(n,a,b)
    }
  LCL <- quantile(G,alpha/2)
  UCL <- quantile(G,1-alpha/2)
  CI <- c(LCL,UCL)
  count <-sum(G <= UCL)-sum(G < LCL)
  coverage <- count/m #coverage rate over the m replicates
  return(c(CI,coverage))
  }
GCI(1000,1000,0,1,0.05)

The output shows the CI and coverage rate.

Question

Tests for association based on Pearson product moment correlation $\rho$, Spearman's rank correlation coefficient $\rho_s$, or Kendall's coefficient $\tau$, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution $(X, Y)$ such that $X$ and $Y$ are dependent) such that at least one of the nonparametric tests have better empirical power than the correlation test against this alternative.

Answer

step1 First, we construct a covariance matrix for independent variables; then we compute the powers of the three tests and compare them.

The code and the result is shown as below:

library(MASS)
powers <- function(N,alpha,mu,sigma){
  p1<-p2<-p3<-numeric(N)
  for(i in 1:N){
  bvn<- mvrnorm(N, mu = mu, Sigma = sigma ) # mvrnorm function, independent
  x<-bvn[,1];y<-bvn[,2]
  p1[i]<-cor.test(x,y,method="spearman")$p.value
  p2[i]<-cor.test(x,y,method="kendall")$p.value
  p3[i]<-cor.test(x,y,method="pearson")$p.value
  }
power<-c(mean(p1<=alpha),mean(p2<=alpha),mean(p3<=alpha))
return(power)
}
set.seed(123)
N <- 500 # Number of random samples
alpha<-0.05
# Target parameters for univariate normal distributions
rho <- 0
mu1 <- 0; s1 <- 1
mu2 <- 1; s2 <- 4
# Parameters for bivariate normal distribution
mu <- c(mu1,mu2) # Mean 
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2,2) # Covariance matrix
powers(N,alpha,mu,sigma)
library(MASS)
powers <- function(N,alpha,mu,sigma){
  p1<-p2<-p3<-numeric(N)
  for(i in 1:N){
  bvn<- mvrnorm(N, mu = mu, Sigma = sigma ) # bivariate normal sample from mvrnorm
  x<-bvn[,1];y<-bvn[,2]
  p1[i]<-cor.test(x,y,method="spearman")$p.value
  p2[i]<-cor.test(x,y,method="kendall")$p.value
  p3[i]<-cor.test(x,y,method="pearson")$p.value
  }
power<-c(mean(p1<=alpha),mean(p2<=alpha),mean(p3<=alpha))
print(round(power,3))
}
N <- 500 # Number of random samples
alpha<-0.05
# Target parameters for univariate normal distributions
rho <- .6
mu1 <- 0; s1 <- 1
mu2 <- 1; s2 <- 4
# Parameters for bivariate normal distribution
mu <- c(mu1,mu2) # Mean 
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2,2) # Covariance matrix
powers(N,alpha,mu,sigma)

We find that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal, and when $\rho$ is changed to 0.6 the powers change accordingly. The questions are:

  1. Let $X$ be a non-negative random variable with $\mu = E[X] < \infty$. For a random sample $x_1$ ,...,$x_n$ from the distribution of X, the Gini ratio is defined by $${G}=\frac{1}{2n^2\mu}\sum_{j=1}^n \sum_{i=1}^n |x_i-x_j|$$ The Gini ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that G can be written in terms of the order statistics $x_{(i)}$ as $${G}=\frac{1}{n^2\mu}\sum_{i=1}^n (2i-n-1)x_{(i)}$$ If the mean is unknown, let $\hat{G}$ be the statistic $G$ with $\mu$ replaced by $\bar{x}$. Estimate by simulation the mean, median and deciles of $\hat{G}$ if $X$ is standard lognormal. Repeat the procedure for the uniform distribution and Bernoulli(0.1). Also construct density histograms of the replicates in each case.

  2. Construct an approximate 95% confidence interval for the Gini ratio y = E[G] if X is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.

  3. Tests for association based on Pearson product moment correlation $\rho$, Spearman's rank correlation coefficient $\rho_s$ , or Kendall's coefficient $\tau$, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution $(X,Y)$ such that $X$ and $Y$ are dependent) such that at least one of the nonparametric tests have better empirical power than the correlation test against this alternative.

The answer code is as follows:

#The answer of first question:
m <- 1000
n <- 20
g <- numeric(m)
medians  <- means <- numeric(3)
y <- gini1 <- gini2 <- gini3 <- numeric(m)

for (i in 1:m) {
  x <- sort(rlnorm(n))
  xmean <- mean(x)
  for (j in 1:n) {
  y[j] <- (2*j-n-1)*x[j]
}
  gini1[i] <- 1/n^2/xmean*sum(y[1:n])
}

for (i in 1:m) {
  x <- sort(runif(n))
  xmean <- mean(x)
  for (j in 1:n) {
  y[j] <- (2*j-n-1)*x[j]
}
  gini2[i] <- 1/n^2/xmean*sum(y[1:n])
}

for (i in 1:m) {
  # Bernoulli(0.1) sample; redraw the rare all-zero samples, for which G is undefined (0/0)
  x <- sort(rbinom(n, size = 1, prob = 0.1))
  while (sum(x) == 0) x <- sort(rbinom(n, size = 1, prob = 0.1))
  xmean <- mean(x)
  for (j in 1:n) {
  y[j] <- (2*j-n-1)*x[j]
}
  gini3[i] <- 1/n^2/xmean*sum(y[1:n])
}

par(mfrow=c(3,1))
par(pin=c(2,1))
hist(gini1,prob = TRUE)
lines(density(gini1),col = "red",lwd = 2)
hist(gini2,prob = TRUE)
lines(density(gini2),col = "blue",lwd = 2)
hist(gini3,prob = TRUE)
lines(density(gini3),col = "green",lwd = 2)

medians[1] <- median(gini1)
medians[2] <- median(gini2)
medians[3] <- median(gini3)
medians

quantiles1 <- as.vector(quantile(gini1,seq(0.1,0.9,0.1)))
quantiles1

quantiles2 <- as.vector(quantile(gini2,seq(0.1,0.9,0.1)))
quantiles2

quantiles3 <- as.vector(quantile(gini3,seq(0.1,0.9,0.1)))
quantiles3

means[1] <- mean(gini1)
means[2] <- mean(gini2)
means[3] <- mean(gini3)
means


#The answer of second question:
m <- 1000
n <- 20
gini <- numeric(m)

#Get A series of gini ratios genarating from a lognormal distribution
for (i in 1:m) {
  x <- sort(rlnorm(n))
  xmean <- mean(x)
  for (j in 1:n) {
  y[j] <- (2*j-n-1)*x[j]
}
  gini[i] <- 1/n^2/xmean*sum(y[1:n])
}

#get the lower confidence interval 
LCI<- mean(gini)-sd(gini)*qt(0.975,m-1)
#get the upper confidence interval 
UCI <- mean(gini)+sd(gini)*qt(0.975,m-1)
#get the confidence interval
CI <- c(LCI,UCI)
print(CI)
#calculate the coverage rate
covrate<-sum(I(gini>CI[1]&gini<CI[2]))/m
print(covrate)


#The answer of third question:
#We need load the MASS package
library(MASS)
mean <- c(0, 0)                                           
sigma <- matrix( c(25,5,                            
                    5, 25),nrow=2, ncol=2)
m <- 1000

#Calculate the power using pearson correlation test by setting the parameter method as pearson
pearvalues <- replicate(m, expr = {
    mydata1 <- mvrnorm(50, mean, sigma)
    x <- mydata1[,1]
    y <- mydata1[,2]
    peartest <- cor.test(x,y,alternative = "two.sided", method = "pearson")
    peartest$p.value
} )
power1 <- mean(pearvalues <= 0.05)
power1

#Calculate the power using spearman correlation test by setting the parameter method as spearman
spearvalues <- replicate(m, expr = {
    mydata2 <- mvrnorm(50, mean, sigma)
    x <- mydata2[,1]
    y <- mydata2[,2]
    speartest <- cor.test(x,y,alternative = "two.sided", method = "spearman")
    speartest$p.value
} )
power2 <- mean(spearvalues <= 0.05)
power2

#Calculate the power using kendall correlation test by setting the parameter method as kendall
kenvalues <- replicate(m, expr = {
    mydata3 <- mvrnorm(50, mean, sigma)
    x <- mydata3[,1]
    y <- mydata3[,2]
    kentest <- cor.test(x,y,alternative = "two.sided", method = "kendall")
    kentest$p.value
} )
power3 <- mean(kenvalues <= 0.05)
power3

R <- 1000
m <- 1000

#Calculate the power using pearson correlation test by setting the parameter method as pearson
pearvalues <- replicate(m, expr = {
    u <- runif(R/2,0,10)
    v <- sin(u)
    peartest <- cor.test(u,v,alternative = "two.sided", method = "pearson")
    peartest$p.value
} )
power1 <- mean(pearvalues <= 0.05)
power1

#Calculate the power using spearman correlation test by setting the parameter method as spearman
spearvalues <- replicate(m, expr = {
    u <- runif(R/2,0,10)
    v <- sin(u)
    speartest <- cor.test(u,v,alternative = "two.sided", method = "spearman")
    speartest$p.value
} )
power2 <- mean(spearvalues <= 0.05)
power2

#Calculate the power using kendall correlation test by setting the parameter method as kendall
kenvalues <- replicate(m, expr = {
    u <- runif(R/2,0,10)
    v <- sin(u)
    kentest <- cor.test(u,v,alternative = "two.sided", method = "kendall")
    kentest$p.value
} )
power3 <- mean(kenvalues <= 0.05)
power3

5th 2018/11/02

Question

The law school data set law contains LSAT and GPA for 15 law schools. This data set is a random sample from the universe of 82 law schools in law82.

set.seed(1)
library(bootstrap)    #for the law data
a<-matrix(c(round(law$LSAT,digits = 0),law$GPA),nrow=2,byrow = TRUE )
dimnames(a)<-list(c("LSAT","GPA"),1:15)
knitr::kable(a)

Compute a jackknife estimate of the bias and the standard error of the correlation statistic.

Answer

step1

Define the ith jackknife sample$x_{(i)}$ to be the subset of $x$ that leaves out the ith observation $x_i$.

That is, $x_{(i)} = (x_1,\ldots,x_{i-1}, x_{i+1},\ldots,x_n)$.

step2

$\hat{\theta} = T_n(x)$; define the ith jackknife replicate $\hat{\theta}_{(i)} = T_{n-1}(x_{(i)}),\ i = 1,\ldots,n$.

step3

$bias=E(\hat{\theta}-\theta)$, and the jackknife estimate of the bias is $\hat{bias}_{jack}=(n-1)(\overline{\hat{\theta}_{(\cdot)}}-\hat{\theta})$

$\hat{se}_{jack}=\sqrt{\frac{n-1}{n}\sum_{i=1}^n(\hat{\theta}_{(i)}-\overline{\hat{\theta}_{(\cdot)}})^2}$

The jackknife estimate of $\theta$ is $\hat{\theta}_{jack}=\hat{\theta}-\hat{bias}_{jack}$

The code and the result is shown as below:

library(bootstrap)
x <- law$LSAT; y<-law$GPA
cor <- cor(x,y)
n <- length(x)
cor_jack <- numeric(n)  #storage of the resamples

for (i in 1:n)
  cor_jack[i] <- cor(x[-i],y[-i]) 

bias.jack <- (n-1)*(mean(cor_jack)-cor)

se.jack <- sqrt((n-1)/n*sum((cor_jack-mean(cor_jack))^2))
print(list(cor= cor ,est=cor-bias.jack, bias = bias.jack,se = se.jack, cv = bias.jack/se.jack))

We can see that over the 15 jackknife replicates, $\hat{bias}_{jack}$ is very small and the estimated coefficient of variation is far below 0.25.

Question

Refer to the air-conditioning data set aircondit provided in the boot package. The 12 observations are the times in hours between failures of air conditioning equipment: 3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487. Assume that the times between failures follow an exponential model Exp($\lambda$).

Compute 95% bootstrap confidence intervals for the mean time between failures $\frac{1}{\lambda}$ by the standard normal, basic, percentile, and BCa methods. Compare the intervals and explain why they may differ.

Answer

step1 $\frac{1}{\lambda}$ is the mean time between failures, so we resample the data and compute the mean of each bootstrap sample.

step2 Use bootstrap to estimate the bias and standard error of the estimate $\frac{1}{\lambda}$.

step3 Use R function boot.ci{boot}: basic, normal, percentile and BCa.

The code and the result is shown as below:

#Bootstrap
library(boot)
data(aircondit,package = "boot")
air <- aircondit$hours
theta.hat <- mean(air)
#set up the bootstrap
B <- 2000            #number of replicates
n <- length(air)      #sample size
theta.b <- numeric(B)     #storage for replicates

#bootstrap estimate of standard error of R
for (b in 1:B) {
  #randomly select the indices
  i <- sample(1:n, size = n, replace = TRUE)
  dat <- air[i]       #i is a vector of indices
  theta.b[b] <- mean(dat)
}

bias.theta <- mean(theta.b - theta.hat)
se <- sd(theta.b)

print(list(bias.b = bias.theta,se.b = se))

theta.boot <- function(dat,ind) {
  #function to compute the statistic
  mean(dat[ind])
}
boot.obj <- boot(air, statistic = theta.boot, R = 2000)
print(boot.obj)

Here I use both a bootstrap loop written by hand and the boot function from the package "boot", and obtain the estimate together with its bootstrap bias and standard error.

print(boot.ci(boot.obj, type=c("basic","norm","perc","bca")))

As shown, the intervals differ noticeably because the methods differ in whether they are transformation respecting and in their order of accuracy.

The BCa confidence intervals are transformation respecting and BCa intervals have second order accuracy. Transformation respecting means that if $(\hat{\theta}^*_{\alpha_1}, \hat{\theta}^*_{\alpha_2})$ is a confidence interval for $\theta$, and $t(\theta)$ is a transformation of the parameter $\theta$, here $\theta = \frac{1}{\lambda}$, then the corresponding interval for $t(\theta)$ is $(t(\hat{\theta}^*_{\alpha_1}),t(\hat{\theta}^*_{\alpha_2}))$. A confidence interval is first order accurate if the error tends to zero at rate $1/\sqrt{n}$ for sample size $n$, and second order accurate if the error tends to zero at rate $1/n$. The bootstrap percentile interval is transformation respecting but only first order accurate. The standard normal confidence interval is neither transformation respecting nor second order accurate.

Question

Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.

Answer

We use the eigen function to obtain the eigenvalues and then compute the jackknife estimates of the bias and standard error of $\hat{\theta}$: the jackknife estimate of the bias $E(\hat\theta)-\theta_0$ is $$(n-1)(\overline{\hat\theta_{(\cdot)}}-\hat\theta),$$ and the jackknife estimate of $se(\hat\theta)$ is $$ \sqrt{\frac{n-1}n\sum_{i=1}^n(\hat\theta_{(i)}-\overline{\hat\theta_{(\cdot)}})^2}.$$

The code and the result is shown as below:

#Jackknife
#compute the jackknife replicates, leave-one-out estimates
library(bootstrap)
data(scor,package = "bootstrap")
n <- nrow(scor)   #number of observations (rows); length(scor) would give the number of columns
theta.jack <- numeric(n)
dat <- cbind(scor$mec, scor$vec, scor$alg, scor$ana, scor$sta)
sigma.hat <- cov(dat)
theta.hat <- eigen(sigma.hat)$values[1]/sum(eigen(sigma.hat)$values)
for (i in 1:n){
  sigma.jack <- cov(dat[-i,])
  theta.jack[i] <- eigen(sigma.jack)$values[1]/sum(eigen(sigma.jack)$values)
}

#jackknife estimate of bias
bias.jack <- (n - 1) * (mean(theta.jack) - theta.hat) 

#Jackknife estimate of standard error
se.j <- sqrt((n-1) * mean((theta.jack - mean(theta.jack))^2))

print(list(bias.jack = bias.jack,se.jack = se.j))

Numerical results show that the jackknife estimate of bias is small, and the estimate is precise because the standard error is also quite small.

Question

In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.

Answer

step1

We need to delete two data from datasets on which interpolation is to be done and to search the optimal shape parameter.

For $k,j = 1,\ldots,n$, let observation $(x_k, y_k),(x_j,y_j),k\neq j$ be the test points and use the remaining observations to fit the model.

For each $x_i, i= 1,\ldots,n$, it will be used as test points for n times.

Compute the predicted response $\hat{y}_{i,k} = \hat{\beta}_0 +\hat{\beta}_1x_{k}$ at each test point.

Compute the prediction error $e_{i,k} = y_{k} - \hat{y}_{i,k}$.

step2

Estimate the mean of the squared prediction errors $\hat{\sigma}_{\epsilon}^2=\frac{1}{n(n-1)}\sum_{i \neq k}{e^2_{i,k}}$

step3

Compare the calculation accuracy of the two methods.

The code and the result is shown as below:

    ptm <- proc.time()

    library(DAAG); attach(ironslag)
    n <- length(magnetic)   #in DAAG ironslag
    e1 <- e2 <- e3 <- e4 <- matrix(0,n,n)
#    yhat1 <- yhat2 <- yhat3 <- yhat4 <- matrix(0,n,n)
 #   logyhat3 <- logyhat4 <- matrix(0,n,n)


    # for n-fold cross validation
    # fit models on leave-two-out samples
    for (i in 1:n) {
      for (j in 1:n){
        if (j != i){
          y <- magnetic[c(-i,-j)]
          x <- chemical[c(-i,-j)]

          J1 <- lm(y ~ x)
          yhat1 <- J1$coef[1] + J1$coef[2] * chemical[j]
          e1[i,j] <- magnetic[j] - yhat1

          J2 <- lm(y ~ x + I(x^2))
          yhat2 <- J2$coef[1] + J2$coef[2] * chemical[j] +
            J2$coef[3] * chemical[j]^2
          e2[i,j] <- magnetic[j] - yhat2

          J3 <- lm(log(y) ~ x)
          logyhat3 <- J3$coef[1] + J3$coef[2] * chemical[j]
          yhat3 <- exp(logyhat3)
          e3[i,j] <- magnetic[j] - yhat3

          J4 <- lm(log(y) ~ log(x))
          logyhat4 <- J4$coef[1] + J4$coef[2] * log(chemical[j])
          yhat4 <- exp(logyhat4)
          e4[i,j] <- magnetic[j]- yhat4
        }
      }
    }


ltocv <- c(sum(e1^2), sum(e2^2), sum(e3^2),sum(e4^2))/(n*(n-1))
ltocv.ptm <- proc.time() - ptm # same function with system.time(exp)
print(list("timeconsuming_of_ltocv"=ltocv.ptm[1:3]))
ltocv
### Example 7.18 (Model selection: Cross validation)

    # Example 7.17, cont.
    ptm <- proc.time()
    n <- length(magnetic)   #in DAAG ironslag
    e1 <- e2 <- e3 <- e4 <- numeric(n)

    # for n-fold cross validation
    # fit models on leave-one-out samples
    for (k in 1:n) {
        y <- magnetic[-k]
        x <- chemical[-k]

        J1 <- lm(y ~ x)
        yhat1 <- J1$coef[1] + J1$coef[2] * chemical[k]
        e1[k] <- magnetic[k] - yhat1

        J2 <- lm(y ~ x + I(x^2))
        yhat2 <- J2$coef[1] + J2$coef[2] * chemical[k] +
                J2$coef[3] * chemical[k]^2
        e2[k] <- magnetic[k] - yhat2

        J3 <- lm(log(y) ~ x)
        logyhat3 <- J3$coef[1] + J3$coef[2] * chemical[k]
        yhat3 <- exp(logyhat3)
        e3[k] <- magnetic[k] - yhat3

        J4 <- lm(log(y) ~ log(x))
        logyhat4 <- J4$coef[1] + J4$coef[2] * log(chemical[k])
        yhat4 <- exp(logyhat4)
        e4[k] <- magnetic[k] - yhat4
    }


    loocv <- c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2))
    loocv.ptm <- proc.time() - ptm
    print(list("timeconsuming_of_loocv"=loocv.ptm[1:3]))
a <- data.frame(rbind(loocv,ltocv))
row.names(a) <- c("loocv","ltocv")
names(a) <- c("L1","L2","L3","L4")
knitr::kable(a)

This table shows the estimates of the mean of the squared prediction errors $\hat{\sigma}_\epsilon^2$ for the two methods; note that in the leave-two-out code, predictions are computed for both left-out points of each pair.

According to the numerical results, the computation time of LTOCV increases from about 0.26 s to more than 10 s because of the pairwise selection of test points, but the mean of the squared prediction errors does not decrease significantly.

The prediction error criterion of both methods shows that the quadratic model is best fit for the data. The fitted model is

$$\hat{Y} = 24.49262 - 1.39334X + 0.05452X^2$$
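The coefficients quoted above correspond to fitting the quadratic model on the full data set; a minimal sketch (assuming the ironslag data from DAAG are still attached, as in the code above):

    L2 <- lm(magnetic ~ chemical + I(chemical^2))   #quadratic model on the full data
    summary(L2)$coefficients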

6th 2018/11/16

Question

Implement the two-sample Cramér-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.

Answer

step1 Consider the null hypothesis $$H_0: F=G.$$ The Cramér-von Mises statistic is $$W_2=\frac{mn}{(m+n)^2}\left[\sum_{i=1}^n(F_n(x_i)-G_m(x_i))^2+\sum_{j=1}^m(F_n(y_j)-G_m(y_j))^2\right],$$ where $F_n$ is the ecdf of the sample $x_1,\ldots,x_n$ and $G_m$ is the ecdf of the sample $y_1,\ldots,y_m$. Large values of $W_2$ support the alternative $F \neq G$.

step2 Let $r_1,r_2,\ldots,r_n$ be the ranks of the x's in the combined sample, and let $s_1,s_2,\ldots,s_m$ be the ranks of the y's in the combined sample. The statistic can then be computed as $$T = \frac{nm}{n+m} \omega^2 = \frac{U}{nm(n+m)}-\frac{4mn - 1}{6(n+m)},$$ where $$U = n \sum_{i=1}^n(r_i-i)^2 + m\sum_{j=1}^m (s_j-j)^2.$$ Note: this rank form of the statistic can be found on Wikipedia.

step3 Compare the observed statistic with its permutation distribution (equivalently, use the permutation p-value) to decide whether the null hypothesis is rejected.

The code and the result is shown as below:

set.seed(1)
attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
detach(chickwts)

n <- length(x)        #sample 1 size
m <- length(y)        #sample 2 size
z <- c(x, y)          #pooled sample
N <- n + m
K <- 1:N
R <- 999              #number of permutation replicates
reps <- numeric(R)    #storage for replicates

#two-sample Cramer-von Mises statistic computed from the ranks in the pooled sample
cvm.stat <- function(x, y) {
  n <- length(x); m <- length(y); N <- n + m
  r <- rank(c(x, y))
  r.x <- sort(r[1:n])          #ranks of the x's
  r.y <- sort(r[(n + 1):N])    #ranks of the y's
  U <- n * sum((r.x - 1:n)^2) + m * sum((r.y - 1:m)^2)
  U / (n * m * N) - (4 * m * n - 1) / (6 * N)
}

reps0 <- cvm.stat(x, y)        #observed statistic
for (i in 1:R) {
  #generate indices k for the first sample
  k <- sample(K, size = n, replace = FALSE)
  reps[i] <- cvm.stat(z[k], z[-k])
}
p <- mean( c(reps0, reps) >= reps0 )
p

The p-value is not small, so we cannot reject $H_0$: there is no significant evidence that the two distributions differ.

Question

  • Design experiments for evaluating the performance of the NN, energy, and ball methods in various situations.

    • Unequal variances and equal expectations
    • Unequal variances and unequal expectations
    • Non-normal distributions: t distribution with 1 df (heavy-tailed distribution), bimodal distribution (mixture of two normal distributions)
    • Unbalanced samples (say, 1 case versus 10 controls)
    • Note: The parameters should be chosen such that the powers are distinguishable (say, range from 0.3 to 0.9).

Answer

step1 In this simulation experiment, I will compare the empirical power of the third nearest neighbor test, the energy test and the ball test.

step2

First, unequal variances and equal expectations: $$F_1=N_2(\mu = (0, 0), \Sigma = I_2), \quad F_2 = N_2(\mu = (0, 0), \Sigma = 2I_2).$$ Second, unequal variances and unequal expectations: $$F_1=N_2(\mu = (0, 0), \Sigma = I_2), \quad F_2 = N_2(\mu = (0, 1), \Sigma = 2I_2).$$ Third, non-normal distributions: a t distribution with 1 df (heavy-tailed) versus a bimodal distribution (mixture of two normal distributions),

$$F_1=t_1, \quad F_2 = N_1(0,1)\times N_1(1,2).$$ And finally, unbalanced samples (say, 1 case versus 10 controls): $$F_1=t_1,\quad F_2 = N_2((0,0),I_2).$$

The code and the result is shown as below:

library(RANN)
library(energy)
library(Ball)
library(boot)
m <- 50; k<-3; p<-2
n1 <- n2 <- 15;N = c(n1,n2); R<-999

Tn <- function(z, ix, sizes,k) {
  n1 <- sizes[1]
  n2 <- sizes[2]
  n <- n1 + n2
  z <- z[ix, ]
  o <- rep(0, NROW(z))
  z <- as.data.frame(cbind(z, o))
  NN <- nn2(z, k=k+1)  #uses package RANN
  block1 <- NN$nn.idx[1:n1,-1]
  block2 <- NN$nn.idx[-(1:n1),-1]
  i1 <- sum(block1 < n1 + 0.5)
  i2 <- sum(block2 > n1 + 0.5)
  return((i1 + i2) / (k * n))
}

eqdist.nn <- function(z,sizes,k){
  boot.obj <- boot(data=z,statistic=Tn,R=R,
                   sim = "permutation", sizes = sizes,k=k)
  ts <- c(boot.obj$t0,boot.obj$t)
  p.value <- mean(ts>=ts[1])
  list(statistic=ts[1],p.value=p.value)
}
p.values <- matrix(NA,m,3)

#Unequal variances and equal expectations
for(i in 1:m){
  x <- matrix(rnorm(n1*2),ncol=2);
  y <- matrix(rnorm(n1*2,0,2),ncol=2);
  z <- rbind(x,y)
  p.values[i,1] <- eqdist.nn(z,N,k)$p.value  #NN
  p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value  #energy method
  p.values[i,3] <- bd.test(x=x,y=y,R=R,seed=i*12345)$p.value  #ball method
}
alpha <- 0.05                        #confidence level 
pow <- colMeans(p.values<alpha)
pow

These alternatives differ in variance only, and the empirical evidence summarized before suggests that ball method is most powerful.

#Unequal variances and unequal expectations
for(i in 1:m){
  x <- matrix(rnorm(n1*2),ncol=2);
  y <- cbind(rnorm(n2,0,2),rnorm(n2,1,2));
  z <- rbind(x,y)
  p.values[i,1] <- eqdist.nn(z,N,k)$p.value  #NN
  p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value  #energy method
  p.values[i,3] <- bd.test(x=x,y=y,R=R,seed=i*12345)$p.value  #ball method
}
alpha <- 0.05                        #confidence level 
pow <- colMeans(p.values<alpha)
pow

The empirical power ranges from 0.38 to 0.96, and the ball method is again the best.

#Non-normal distributions: t distribution with 1 df (heavy-tailed distribution), bimodel distribution (mixture of two normal distributions)
for(i in 1:m){
  x <- matrix(rt(n1*2,1), ncol = 2)
  y <- cbind(rnorm(n2),rnorm(n2,1,2))
  z <- rbind(x,y)
  p.values[i,1] <- eqdist.nn(z,N,k)$p.value  #NN
  p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value  #energy method
  p.values[i,3] <- bd.test(x=x,y=y,R=R,seed=i*12345)$p.value  #ball method
}
alpha <- 0.05                        #confidence level 
pow <- colMeans(p.values<alpha)
pow

In this case the energy method performs best, and the ball method is also competitive.

#Unbalanced samples (say, 1 case versus 10 controls)
n1 <- 10; n2 <- 100; N = c(n1,n2)
for(i in 1:m){
  x <- matrix(rt(n1*2,1),ncol=2);
  y <- matrix(rnorm(n2*2),ncol=2);
  z <- rbind(x,y)
  p.values[i,1] <- eqdist.nn(z,N,3)$p.value  #NN
  p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value  #energy method
  p.values[i,3] <- bd.test(x=x,y=y,R=R,seed=i*12345)$p.value  #ball method
}
alpha <- 0.05                        #confidence level 
pow <- colMeans(p.values<alpha)
pow

For unbalanced samples, the energy test performs best, followed by the NN test.

Question

Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 values of the chain, and compare the deciles of the generated observations with the deciles of the standard Cauchy distribution (see qcauchy or qt with df=1). Recall that a $Cauchy(\theta,\eta)$ distribution has density function$$f(x)=\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)},\quad -\infty < x < \infty,\ \theta > 0.$$

The standard Cauchy has the $Cauchy(\theta = 1, \eta = 0)$ density. (Note that the standard Cauchy density is equal to the Student t density with one degree of freedom.)

Answer

step1

Choose a proposal distribution $g(\cdot|X_t)$. Here I use a normal distribution centered at the current value.

step2 Generate $X_0$ from a normal distribution, then repeat the Metropolis-Hastings acceptance step.

step3

Plot the chain and compare the sample quantiles with the theoretical quantiles.

The code and the result is shown as below:

# Cauchy distribution density function
f <- function(x, theta,eta) {
  stopifnot(theta > 0)
  return(1/(theta*pi*(1+((x-eta)/theta)^2)))
  }

m <- 10000
theta <- 1; eta <- 0
x <- numeric(m)
x[1] <- rnorm(1) #initial value drawn from the proposal density
k <- 0
u <- runif(m)

for (i in 2:m) {
  xt <- x[i-1]
  y <- rnorm(1, mean =  xt)
  num <- f(y, theta, eta) * dnorm(xt, mean = y)
  den <- f(xt, theta, eta) * dnorm(y, mean = xt)
  if (u[i] <= num/den) x[i] <- y else {
    x[i] <- xt
    k <- k+1     #y is rejected
  }
}
print(k/m)       #proportion of candidate points rejected

In this exercise, approximately 20% of the candidate points are rejected, so the chain mixes reasonably well, though it could still be improved.

plot(1:m, x, type="l", main="", ylab="x")

#discard the burnin sample
b <- 1001     
y <- x[b:m]
a <- ppoints(100)
Qc <- qcauchy(a)  #quantiles of Standard Cauchy
Q <- quantile(y, a)   #sample quantiles after discarding the burn-in

qqplot(Qc, Q, main="", xlim=c(-2,2),ylim=c(-2,2), xlab="Standard Cauchy Quantiles", ylab="Sample Quantiles")

hist(y, breaks="scott", main="", xlab="", freq=FALSE)
lines(Qc, f(Qc, 1, 0))

From the plot, it appears that the sample quantiles are in approximate agreement with the theoretical quantiles.

Question

Rao [220, Sec. 5g] presented an example on genetic linkage of 197 animals in four categories (also discussed in [67, 106, 171, 266]). The observed group sizes are (125, 18, 20, 34). Assume that the probabilities of the corresponding multinomial distribution are$$(\frac{1}{2}+\frac{\theta}{4},\frac{1-\theta}{4},\frac{1-\theta}{4},\frac{\theta}{4}).$$ Estimate the posterior distribution of $\theta$ given the observed sample, using one of the methods in this chapter.

Answer

$$f(GS|\theta) = (\frac{2 + \theta}{4})^{gs_1}(\frac{1-\theta}{4})^{gs_2}(\frac{1-\theta}{4})^{gs_3}(\frac{\theta}{4})^{gs_4}=(\frac{2 + \theta}{4})^{gs_1}(\frac{1-\theta}{4})^{gs_2 + gs_3}(\frac{\theta}{4})^{gs_4},$$ where $gs = (gs_1,\ldots,gs_4)$ denotes the observed group sizes (note that $\frac{2+\theta}{4} = \frac{1}{2} + \frac{\theta}{4}$).

The code and the result is shown as below:

gs <- c(125,18,20,34)  #group size
m <- 5000
w <- .25 
b <- 1001
#the target density
prob <- function(y, gs) {
  if (y < 0 | y >1)
    return (0)
  else
    return((1/2+y/4)^gs[1] *((1-y)/4)^gs[2]*((1-y)/4)^gs[3]*(y/4)^gs[4])
}

u <- runif(m)  #for accept/reject step
v <- runif(m, -w, w)  #proposal distribution (random-walk increments)
x <- numeric(m)
x[1] <- .25
for (i in 2:m) {
  y <- x[i-1] + v[i]
  if (u[i] <= prob(y, gs) / prob(x[i-1], gs))
    x[i] <- y 
  else
    x[i] <- x[i-1]
}  
theta.hat <- mean(x[b:m])
theta.hat
# compare
gs.hat <- sum(gs) * c((2+theta.hat)/4, (1-theta.hat)/4, (1-theta.hat)/4, theta.hat/4)
round(gs.hat)

So the fitted group sizes based on theta.hat are close to the observed counts.

7th 2018/11/23

Question

Exercise 9.6 presented an example on genetic linkage of 197 animals in four categories. The observed group sizes are (125, 18, 20, 34). Assume that the probabilities of the corresponding multinomial distribution are$$(\frac{1}{2}+\frac{\theta}{4},\frac{1-\theta}{4},\frac{1-\theta}{4},\frac{\theta}{4}).$$ Use the Gelman-Rubin method to monitor convergence of the chain, and run the chain until the chain has converged approximately to the target distribution according to $\hat{R}$ < 1.2.

Answer

step1 The Gelman-Rubin method of monitoring convergence of a M-H chain is based on comparing the behavior of several generated chains with respect to the variance of one or more scalar summary statistics.

step2 The Gelman-Rubin statistic is the estimated potential scale reduction $\sqrt{\hat{R}}=\sqrt{\frac{\widehat{\mathrm{Var}}(\psi)}{W}}.$

step3 The factor $\hat{R}$ decreases to 1 as the length of the chain tends to infinity, so $\hat{R}$ should be close to 1 if the chains have approximately converged to the target distribution. Here we choose 1.1 and 1.2.

### Gelman-Rubin method of monitoring convergence

Gelman.Rubin <- function(psi) {
  # psi[i,j] is the statistic psi(X[i,1:j])
  # for chain in i-th row of X
  psi <- as.matrix(psi)
  n <- ncol(psi)
  k <- nrow(psi)

  psi.means <- rowMeans(psi)     #row means
  B <- n * var(psi.means)        #between variance est.
  psi.w <- apply(psi, 1, "var")  #within variances
  W <- mean(psi.w)               #within est.
  v.hat <- W*(n-1)/n + (B/n)     #upper variance est.
  r.hat <- v.hat / W             #G-R statistic
  return(r.hat)
}

gs <- c(125,18,20,34)  #group size
#m <- 5000
#w <- .25 
b <- 1001
#the target density
prob <- function(y, gs) {
  if (y < 0 | y >1)
    return (0)
  else
    return((1/2+y/4)^gs[1] *((1-y)/4)^gs[2]*((1-y)/4)^gs[3]*(y/4)^gs[4])
}

chain <- function(w, N) {
  #generates a Metropolis random-walk chain for the posterior of theta
  #with Uniform(-w, w) increments
  #and starting value x[1] = w
  x <- rep(0, N)
  u <- runif(N)  #for accept/reject step
  v <- runif(N, -w, w)  #proposal distribution
  x[1] <- w
  for (i in 2:N) {
    y <- x[i-1] + v[i]
    if (u[i] <= prob(y, gs) / prob(x[i-1], gs))
      x[i] <- y 
    else
      x[i] <- x[i-1]
  }
  return(x)
}

k <- 4          #number of chains to generate
n <- 15000      #length of chains
w <- c(0.05,0.25,0.5,0.7,0.9)

#generate the chains
X <- matrix(0, nrow=k, ncol=n)
for (i in 1:k)
  X[i, ] <- chain(w[i], n)

#compute diagnostic statistics
psi <- t(apply(X, 1, cumsum))
for (i in 1:nrow(psi))
  psi[i,] <- psi[i,] / (1:ncol(psi))
print(Gelman.Rubin(psi))

#plot psi for the four chains
par(mfrow=c(2,2))
for (i in 1:k)
  plot(psi[i, (b+1):n], type="l",
       xlab=i, ylab=bquote(psi))
par(mfrow=c(1,1)) #restore default

#plot the sequence of R-hat statistics
rhat <- rep(0, n)
for (j in (b+1):n)
  rhat[j] <- Gelman.Rubin(psi[,1:j])
plot(rhat[(b+1):n], type="l", xlab="",ylim = range(1,1.4), ylab="R")
abline(h=1.2, lty=2)
abline(h=1.1,lty = 2)

The plot of $\hat{R}$ is shown in the figure for iterations 1001 to 15000. From this plot it is evident that the chains converge quickly, faster than chains whose proposal distribution has a very small variance. The value of $\hat{R}$ falls below 1.2 within about 500 iterations and below 1.1 within about 2000 iterations.

Question

Find the intersection points A(k) in $(0,\sqrt{k})$ of the curves$$S_{k - 1}(a) = P(t(k - 1) > \sqrt{\frac{a^2(k - 1)}{k - a^2}})$$ and $$S_{k}(a) = P(t(k) > \sqrt{\frac{a^2k}{k + 1 - a^2}})$$ for k = 4 : 25, 100, 500, 1000, where t(k) is a Student t random variable with k degrees of freedom.

Answer

k <- c(4 : 25, 100, 500, 1000) #Declare the variables 
n <- length(k)
a <- rep(0,n) #store the intersection points
eps <- .Machine$double.eps^0.25 
for (i in 1:n) {
  out <- uniroot(function(a){
    pt(sqrt(a^2*(k[i] -1)/(k[i] - a ^ 2)),df = ( k[i] - 1)) - pt(sqrt(a ^ 2 * k[i] /(k[i] + 1 - a ^ 2)),df = k[i])},
    lower = eps, upper = sqrt(k[i] - eps) #the endpoints
  )
  a[i] <- out$root
}
intsecp <- rbind(k,a)
#knitr::kable(intsecp)
print(round(intsecp,3))
plot(k[1:19] ,a[1:19], xlab = "k value(df)", ylab = "A(k)")

From the plot, we can see that the intersection point A(k) increases as k increases.

8th 2018/11/30

Question

Write a function to compute the cdf of the Cauchy distribution, which has density $$\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)},-\infty < x < \infty,$$ where $\theta > 0$. Compare your results to the results from the R function $pcauchy$.

Answer

step

  • First, we can generate the density function of cauchy distribution named pdf.
  • Then, I use the integrate function to compute the cdf.
  • At last, consider the special case $\eta = 2$ and $\theta = 3$, with $-15+\eta<x<15+\eta$; plot the cdf curve and compare it with the pcauchy function.
set.seed(1)
# generate the distribution function of cauchy distribution
cdf <- function(x,eta,theta){
  n <- length(x)
  cdf <- numeric(n)
  # compute the density function of cauchy distribution
  pdf <- function(y) 1/(theta* pi *(1+((y-eta)/theta)^2))
  for (i in 1 : n){
    cdf[i] <- integrate(pdf, -Inf, x[i])$value
  }
  return(cdf)
}
# consider the special case
eta <- 2 
theta <- 3
x <- seq(-15 + eta, 15 + eta, .01)
plot(x, cdf(x,eta, theta),type = "l",lwd = 3, xlab = "x            eta = 2, theta = 3", ylab = "cdf(x)")
lines(x,pcauchy(x,eta,theta),col = 2)
legend("topleft", c("cdf","pcauchy"), lwd = c(3,1),col = c(1,2))
#compute the difference
mean(cdf(x,eta, theta) - pcauchy(x, eta, theta))
  • From the figure, we find that our method of computing the cdf works well, and the mean difference from pcauchy is very close to 0.

Question

  • A-B-O blood type problem
  • Let the three alleles be A, B, and O.
  dat <- rbind(Genotype=c('AA','BB','OO','AO','BO','AB','Sum'),
  Frequency=c('p^2','q^2','r^2','2pr','2qr','2pq',1),
  Count=c('nAA','nBB','nOO','nAO','nBO','nAB','n'))
  knitr::kable(dat,format="html",table.attr = "class=\"table table-bordered\"", align="c")
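No answer is recorded for this question in the vignette. For completeness, the sketch below shows one standard way to attack it with the EM algorithm (gene counting); the observed phenotype counts used here are hypothetical placeholders, since the actual data are not given above.

# EM (gene counting) sketch for the A-B-O allele frequencies.
# NOTE: the observed phenotype counts below are hypothetical, for illustration only.
nA. <- 28; nB. <- 24; nOO <- 41; nAB <- 70     # hypothetical observed phenotype counts
n <- nA. + nB. + nOO + nAB
p <- q <- 0.3; r <- 0.4                        # initial allele frequencies
for (iter in 1:50) {
  # E-step: split the A and B phenotype counts into expected genotype counts
  nAA <- nA. * p^2 / (p^2 + 2*p*r);  nAO <- nA. - nAA
  nBB <- nB. * q^2 / (q^2 + 2*q*r);  nBO <- nB. - nBB
  # M-step: update allele frequencies by gene counting
  p <- (2*nAA + nAO + nAB) / (2*n)
  q <- (2*nBB + nBO + nAB) / (2*n)
  r <- 1 - p - q
}
c(p = p, q = q, r = r)                         # estimated allele frequencies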

9th 2018/12/07

Question

Use both for loops and $lapply()$ to fit linear models to the $mtcars$ using the formulas stored in this list:

formulas <- list(
mpg ~ disp,
mpg ~ I(1 / disp),
mpg ~ disp + wt,
mpg ~ I(1 / disp) + wt
)

Answer

Here, formulas is a list and it varies, while the data set mtcars does not vary. So the argument of the function is the list formulas.

set.seed(1)
## loop
out1 <- vector("list", length(formulas))
for (i in seq_along(formulas)) {
  out1[[i]] <- lm(formulas[[i]],data = mtcars)
}

## lapply
(out1<-lapply(formulas,function(x) lm(x,mtcars)))

Question

Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function?

bootstraps <- lapply(1:10, function(i) {
rows <- sample(1:nrow(mtcars), rep = TRUE)
mtcars[rows, ]
})

Answer

By contrast, the linear model formula does not vary but the input data do: bootstraps is a list of resampled data frames. So the argument of the function is the list bootstraps.

## loop
out2 <- vector("list", length(bootstraps))
for (i in seq_along(bootstraps)) {
  out2[[i]] <- lm(mpg ~ disp,bootstraps[[i]])
}
## lapply
(out2 <- lapply(bootstraps, function(x) lm(mpg ~ disp,x)))

Question

For each model in the previous two exercises, extract $R^2$ using the function below.

rsq <- function(mod) summary(mod)$r.squared

Answer

set.seed(2)
# out is the linear model generated in ex3 and ex4.
out <- c(out1,out2) 
(R_sqr <- sapply(out, rsq))
plot(1:14,R_sqr,ylim = c(0,1))

Question

The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.

trials <- replicate(
100,
t.test(rpois(10, 10), rpois(7, 10)),
simplify = FALSE
)

Extra challenge: get rid of the anonymous function by using [[ directly.

Answer

##sapply
sapply(trials, function(x) x$p.value)

##get rid of the anonymous function
sapply(trials, `[[`, 'p.value')

Conclusion: We cannot reject $H_0$ because most p-values are larger than 0.05.

Indeed, the two samples come from the same distribution $Poisson(10)$, so there is no true difference between them.
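A quick numerical check of this conclusion (a sketch reusing the trials object defined above):

# proportion of the 100 trials with p-value below 0.05; since both samples come from
# Poisson(10), this is an empirical type-I error rate and should be near 0.05
mean(sapply(trials, `[[`, "p.value") < 0.05)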

Question

Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?

Answer

library(parallel)
mcvMap <- function(f, FUN.VALUE , ...) {
    out <- mcMap(f, ...)
    vapply(out, identity, FUN.VALUE)
}

We can use the parallel or foreach packages to perform the iterations in parallel; a small usage sketch follows.
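A hypothetical usage sketch of mcvMap() (the inputs xs and ws below are made up for illustration; on Windows, mcMap() requires options(mc.cores = 1)):

xs <- replicate(5, runif(10), simplify = FALSE)        # list of numeric vectors
ws <- replicate(5, rpois(10, 5) + 1, simplify = FALSE) # list of positive weights
# iterate over xs and ws in parallel and collect the scalar results into a numeric vector
mcvMap(weighted.mean, numeric(1), xs, ws)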

10th 2018/12/14

Question

Make a faster version of chisq.test() that only computes the chi-square test statistic when the input is two numeric vectors with no missing values. You can try simplifying chisq.test() or by coding from the mathematical definition (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test).

Answer

Analysis

  • chisq.test() performs chi-squared contingency table tests and goodness-of-fit tests.

  • Calculate the chi-squared test statistic, $\chi ^{2}$, which resembles a normalized sum of squared deviations between observed and theoretical frequencies.

  • The value of the test-statistic is
    $$\chi ^{2}=\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{i,j}-E_{i,j})^{2}}{E_{i,j}}$$ $$E_{i,j} = N \cdot p_{i,j} = N \cdot p_{i\cdot} \cdot p_{\cdot j} = \frac{E_{i\cdot} \cdot E_{\cdot j}}{N}$$ here, c = 2 because the input is two numeric vectors with no missing values.

  • Determine the degrees of freedom, df, of that statistic: $$df = (r - 1)\times(c - 1) = r - 1, \qquad c=2.$$

  • Select a desired level of confidence for the result of the test.

  • Compare $\chi ^{2}$ to the critical value from the chi-squared distribution with df degrees of freedom and the selected confidence level , which in many cases gives a good approximation of the distribution of $\chi ^{2}$.

  • Use $microbenchmark::microbenchmark$ to compare with the version of $chisq.test()$

Code

# Testing for statistical independence
## 
chisq_test <- function(x){
  m <- nrow(x);  n <- ncol(x); N <- sum(x)
  E <- matrix(0,m,n)
  rowsums <- unlist(lapply(1:m, function(i) sum(x[i,]))) # row sums, used to compute the marginal proportions
  colsums <- unlist(lapply(1:n, function(j) sum(x[,j]))) # column sums
  for (i in 1:m){
    for (j in 1:n) {
      E[i,j] <- rowsums[i]*colsums[j]/N   # expected counts under independence
    }
  }
  df <- (m-1) * (n-1)
  chi_sqr <- sum((x-E)^2/E)
  p_value <- pchisq(chi_sqr, df = df, lower.tail = FALSE)  # upper-tail probability
  (test <- list(chi_sqr = chi_sqr, df = df, p_value = p_value))
}

  # chisq.test() 
  ## From Agresti(2007) p.39
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
                    party = c("Democrat","Independent", "Republican"))
(M)
(Xsq <- chisq.test(M))  # Prints test summary

chisq_test(M)

print(microbenchmark::microbenchmark(
  chisq_test(M),
  chisq.test(M)
))

Conclusion

We can say that calling chisq_test() is quite a bit faster than calling chisq.test() for two numeric vectors with no missing values.

Question

Can you make a faster version of table() for the case of an input of two integer vectors with no missing values? Can you use it to speed up your chi-square test?

Answer

Analysis

  • table() uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
  • I use unique() and class() to build the table2() function and then use identical() to check that it agrees with table().
  • I use table on the pair of factors (e.g. Group and race) and then use that as input to chisq_test().

Code

table2 <- function(x, y) {
  x_val <- sort(unique(x)); y_val <- sort(unique(y)) # remove the duplicate elements
  m <- length(x_val); n <- length(y_val) # dimensions of the table 
  mat <- matrix(0L, m, n)
  for (i in seq_along(x)) {
    mat[which(x_val == x[[i]]), which(y_val == y[[i]])] <-
      mat[which(x_val == x[[i]]),  which(y_val == y[[i]])] + 1L
  }
  dimnames <- list(x_val, y_val)
  names(dimnames) <- as.character(as.list(match.call())[-1])  # R has names for dimnames... :/
  table <- array(mat, dim = dim(mat), dimnames = dimnames)
  class(table) <- "table"
  table
}
# Example:generate random samples.
data <- data.frame(group=as.factor(sample(1:4,100,TRUE)),
                     race=as.factor(sample(1:3,100,TRUE)))
# look at the table
tab1 <- with(data,table(group,race))
tab2 <- with(data,table2(group,race))
# Check whether the table2 is same with table function
identical(tab1,tab2)
# Use it to speed up the chi-square test: benchmark the full pipeline (table construction + test)
chisq_test(tab1)
microbenchmark::microbenchmark(chisq_test(with(data, table(group, race))), chisq_test(with(data, table2(group, race))))

Conclusion

Using table2() can speed up the chi-square test somewhat, but the gain is not very large.


