Overview

StatComp18038 is a simple R package that provides functions to compute the Cauchy distribution function, generate samples from a Rayleigh distribution, and compute a Monte Carlo estimate of the Beta(3,3) cdf.

The function Cau

The source R code for Cau is as follows:

Cau <- function(theta, eta, y)
{
  .dcauchy <- function(theta, eta, x)
  {
    #compute the density function of cauchy distribution
    if(theta <= 0) return(NA)

    r <- theta * pi * (1 + ((x-eta)/theta)^2)
    return(1 / r)
  }

  n <- length(y)
  res <- numeric(n)
  for(i in 1:n)
  {
    res[i] <- integrate(.dcauchy, lower = -Inf, upper = y[i],
                        rel.tol = .Machine$double.eps^0.25,
                        theta = theta, eta = eta)$value
  }
  return(res)
}

To verify the results, we compare them with the output of the R function pcauchy.

library(StatComp18038)
y <- seq(-10, 10, 1)
theta <- 2
eta <- 3
res1 <- Cau(theta, eta, y)
res2 <- pcauchy(y, eta, theta)
round(rbind(res1, res2), 5)

The above results show that the function 'Cau' computes the Cauchy distribution function correctly.

The function MCe

The source R code for MCe is as follows:

MCe <- function(x, n)
{
  y <- runif(n)
  cdf <- numeric(length(x))

  for (i in 1:length(x))
  {
      g <- (x[i])^3*(y*(1-y*x[i]))^2
      cdf[i] <- mean(g)/beta(3,3)
  }
   return(cdf)
}

Compare the estimates with the values returned by the pbeta function in R.

library(StatComp18038)
x <- seq(0.1, 0.9, 0.1)
n <- 10000
res1 <- MCe(x, n)
res2 <- pbeta(x,3,3)
round(rbind(res1, res2), 3)

The results show that the Monte Carlo estimates are very close to the pbeta values.

The function MC.Phi

The source R code for MC.Phi is as follows:

MC.Phi <- function(x, sigma, r, antithetic = TRUE)
{
  y <- runif(r)
  if(!antithetic) v <- runif(r)
  else
    v <- 1-y
  y <- c(y, v)
  cdf <- numeric(length(x))
  for(i in 1:length(x))
  {
    g <- (y*x[i]^2)*exp(-((y*x[i])^2)/(2*(sigma^2)))
    cdf[i] <-  1/(sigma^2)*mean(g)
  }
  cdf
}

A comparison of estimates obtained from a single Monte Carlo experiment:

library(StatComp18038)
x <- seq(0.1, 0.9, 0.1)
sigma <- 2
r <- 10000
res1 <- MC.Phi(x, sigma, r, antithetic = TRUE)
res2 <- 1-exp(-((x)^2)/(2*(sigma^2))) 
round(rbind(res1, res2), 3)

The values generated by the simple Monte Carlo integration approach and the antithetic variable approach are both very close to the true Rayleigh cdf values.

2018-09-14

Three examples from R for Beginners

ID <- c(1, 2, 3, 4)
age <- c(24, 25, 50, 30)
diabetes <- c("type1", "type2", "type1", "type1")
status <- c("poor", "improved", "excellent", "poor")
diabetes <- factor(diabetes)
status <- factor(status, order = TRUE)
pdata <- data.frame(ID, age, diabetes, status)
pdata
table(pdata$diabetes, pdata$status)
summary(pdata)
x <- c(1:10)
y <- x
z <- 10/x
opar <- par(no.readonly = TRUE)
par(mar = c(5, 4, 4, 8)+0.1)   
plot(x, y, type = "b", pch = 22, col = "red", 
     yaxt = "n", lty = 3, ann = FALSE)
lines(x, z, type = "b", pch = 25, col = "blue", lty = 2) 
axis(side = 2, at = x, labels = x, col.axis = "red", las = 2) 
axis(side = 4, at = z, labels = round(z, digits = 2),
     col.axis = "blue", las = 2, cex.axis = 0.7, tck = -0.1)
mtext("y=1/x", side = 4, line = 3, cex.lab = 1, las = 2, col= "blue") 
title(main = "An example of creative axes", col.main = "lightgreen", xlab = "X values", ylab = "Y=X")

2018-09-21

  1. A discrete random variable X has probability mass function

x    | 0   | 1   | 2   | 3   | 4
-----|-----|-----|-----|-----|-----
p(x) | 0.1 | 0.2 | 0.2 | 0.2 | 0.3

Use the inverse transform method to generate a random sample of size 1000 from the distribution of X. Construct a relative frequency table and compare the empirical with the theoretical probabilities.

Answer

x <- c(0, 1, 2, 3, 4)    # the number of xi for i=1,2,3,4,5
p <- c(0.1, 0.2, 0.2, 0.2, 0.3)   # the probability of each xi 
cp <- cumsum(p)  # calculate the cumulative distribution function 
m <- 1000     
r <- numeric(m)
set.seed(1234)  
r <- x[findInterval(runif(m), cp)+1]
table(r)
ct <- as.vector(table(r))
q <- ct/sum(ct)/p  # ratio of empirical frequency to theoretical probability
cbind(x = x, p = p, ratio = q)

Construct a relative frequency table and compare the empirical with the theoretical probabilities; if the ratios are close to 1, the sampling procedure is correct.

set.seed(1234)
r1 <- sample(c(0,1,2,3,4), m, replace = T, c(0.1,rep(0.2,3),0.3))
ct1 <- as.vector(table(r1))
q1 <- ct1/sum(ct1)/p
cbind(x = x, p = p, ratio = q1)
  2. Write a function to generate a random sample of size n from the $Beta(a,b)$ distribution by the acceptance-rejection method. Generate a random sample of size 1000 from the $Beta(3,2)$ distribution. Graph the histogram of the sample with the theoretical $Beta(3,2)$ density superimposed.

Answer

The $Beta(3,2)$ density is $f(x)=12x^2(1-x), 0<x<1$. Let $g(x)$ be the Uniform(0,1) density. Then $f(x)/g(x)\le 3$ for all $0<x<1$, so we take $c=3$.

n <- 1000
j <- 0  # counter for iterations
k <- 0  # counter for accepted samples
y <- numeric(n)
while(k < n)
{
  u <- runif(1)
  j <- j+1
  x <- runif(1)   # random variable from g(x)
  if(4*x^2*(1-x) > u)
  {  # we accept x
    k <- k+1
    y[k] <- x
  }
}
j 
hist(y, prob = TRUE, main = expression(f(x)==12*x^2*(1-x)))
p <- seq(0, 1, 0.01)
lines(p, 12*p^2*(1-p))

The histogram of the sample with the theoretical $Beta(3,2)$ density superimposed is shown above.

  3. Simulate a continuous Exponential-Gamma mixture. Suppose that the rate parameter $\Lambda$ has a $Gamma(r,\beta)$ distribution and $Y$ has an $Exp(\Lambda)$ distribution. That is, $(Y|\Lambda=\lambda)\sim f_Y(y|\lambda)=\lambda e^{-\lambda y}$. Generate 1000 random observations from this mixture with $r=4$ and $\beta=2$.

Answer

Generate 1000 random observations and store them in the vector $y$. The following figure shows the generated observations.

n <- 1000
r <- 4
beta <- 2
lambda <- rgamma(n, r, beta)        # lambda is generated randomly from Gamma(r, beta)
y <- rexp(n, lambda)                # generate 1000 random observations from the mixture
plot(y, ylim = c(0,6), main = "The random observations")

2018-09-30

  1. Write a function to compute a Monte Carlo estimate of the $Beta(3,3)$ cdf, and use the function to estimate $F(x)$ for $x = 0.1, 0.2, \dots, 0.9$. Compare the estimates with the values returned by the pbeta function in R.
x <- seq(0.1, 0.9, length = 9)
n <- 10000
y <- runif(n)
cdf <- numeric(length(x))
for (i in 1:length(x)) {
  g <- (x[i])^3*(y*(1-y*x[i]))^2
  cdf[i] <- mean(g)/beta(3,3)
}
phi <- pbeta(x,3,3) # generate the values by the pbeta function
print(round(rbind(x, cdf, phi), 3)) # Compare the estimates with the values returned by the pbeta function

The results are shown compared with the values of the beta cdf function pbeta. The Monte Carlo estimates appear to be very close to the pbeta values.

  2. The Rayleigh density [156, (18.76)] is $$f(x)=\frac{x}{\sigma^{2}} e^{-\frac{x^{2}}{2 \sigma^{2}}}, \quad x \ge 0, \ \sigma > 0.$$ Implement a function to generate samples from a Rayleigh($\sigma$) distribution, using antithetic variables. What is the percent reduction in variance of $\frac{X + X^{'}}{2}$ compared with $\frac{X_1 + X_2}{2}$ for independent $X_1, X_2$?

Answer

$F(x)=\int_0^{x} \frac{t}{\sigma^{2}} e^{-\frac{t^{2}}{2 \sigma^{2}}} \,dt$ is the cdf of the Rayleigh distribution. Making the substitution $y=t/x$, we have $dt=x\,dy$ and $F(x)=\int_0^{1} \frac{y x^{2}}{\sigma^{2}} e^{-\frac{(yx)^{2}}{2 \sigma^{2}}} \,dy$, where the random variable $Y$ has the Uniform(0,1) distribution.

MC.Phi <- function(x, r = 10000, antithetic = TRUE){
  y <- runif(r/2)
  if(!antithetic) v <- runif(r/2)
  else
    v <- 1-y
  y <- c(y, v)
  cdf <- numeric(length(x))
  for(i in 1:length(x)){
    g <- (y*x[i]^2)*exp(-((y*x[i])^2)/(2*(2^2)))
    cdf[i] <-  1/(2^2)*mean(g) 
  }
  cdf
} # use antithetic variables to estimate the cdf of the Rayleigh(2) distribution

x <- seq(0.1, 2.5, length = 10)
set.seed(12345)
Phi <- 1-exp(-((x)^2)/(2*(2^2)))  # true cdf values of Rayleigh(2)
MC1 <- MC.Phi(x, antithetic = FALSE)
MC2 <- MC.Phi(x)
print(round(rbind(x,Phi, MC1, MC2), 5)) # A comparison of estimates obtained from a single Monte Carlo experiment

The results are compared with the true values of the Rayleigh(2) cdf. The values generated by the simple Monte Carlo integration approach and the antithetic variable approach are both very close to the Rayleigh(2) values.

m <- 1000
MC1 <- MC2 <- numeric(m)
x <- 1.65
for (i in 1:m) {
  MC1[i] <- MC.Phi(x, r = 1000, antithetic = FALSE)
  MC2[i] <- MC.Phi(x, r = 1000)
}
print(sd(MC1))
print(sd(MC2))
print((var(MC1)-var(MC2))/var(MC1))

The antithetic variable approach achieved approximately a 98.1% reduction in variance at x = 1.65.

  3. Find two importance functions f1 and f2 that are supported on $(1,\infty)$ and are 'close' to $g(x)= \frac{x^{2}}{\sqrt{2\pi}} e^{-x^{2}/2}, \ x>1$. Which of your two importance functions should produce the smaller variance in estimating $\int_1^{\infty} \frac{x^{2}}{\sqrt{2\pi}} e^{-x^{2}/2}\,dx$ by importance sampling? Explain.

Answer

The candidates for the importance functions are $$ f1= \frac{1}{1.1\sqrt{2 \pi}} e^{-\frac{(x-2)^2}{2 \times 1.1^2}} $$ $$ f2=4x^2e^{-2x}$$

x <- seq(1, 10, 0.1)
g <- (1/(sqrt(2*pi)))*x^2 * exp(-x^2/2)
f1 <- dnorm(x, 2, 1.1) 
f2 <- dgamma(x, 3, 2)
# importance functions: f1,f2 with g(x) and ratios g(x)/f(x) 
# figure 1
plot(x, g, type = "l", main = "", ylab = "", ylim = c(0, 0.4), lwd = 2)
lines(x, f1, lty = 2, lwd = 2, col = "red")
lines(x, f2, lty = 3, lwd = 2, col = "purple")
legend("topright", legend = c("g", "f1", "f2"), lty = 1:3, col = c("black", "red", "purple"), lwd = 2, inset = 0.02)
# figure 2
plot(x, g, type = "l", main = "", ylab = "", ylim = c(0, 3), lwd = 2)
lines(x, g/f1, lty = 2, lwd = 2, col = "red")
lines(x, g/f2, lty = 3, lwd = 2, col = "purple")
legend("topright", legend = c("g", "f1", "f2"), lty = 1:3, col = c("black", "red", "purple"), lwd = 2, inset = 0.02)

m <-10000 
se <- numeric(2)
g <- function(x){
  (1/(sqrt(2*pi)))*x^2 * exp(-x^2/2) * (x > 1)
}
set.seed(123)
x <- rnorm(m, 2, 1.1)
fg <- g(x) / dnorm(x, 2, 1.1)
se[1] <- sd(fg)

x <- rgamma(m, 3, 2)
fg <- g(x) /  dgamma(x, 3, 2)
se[2] <- sd(fg)
print(round(se, 5)) #the corresponding standard errors

Analysis: As you can see in figure 2, the function that corresponds to the most nearly constant ratio g(x)/f(x) appears to be f2. From the graphs, we might prefer f2 for the smaller variance, and the corresponding standard errors confirm that f2 indeed produces the smaller variance.

  4. Obtain a Monte Carlo estimate of $\int_1^{\infty} \frac{x^{2}}{\sqrt{2\pi}} e^{-x^{2}/2}\,dx$ by importance sampling.
m <-10000 
theta.hat  <- numeric(2)
g <- function(x){
  (1/(sqrt(2*pi)))*x^2 * exp(-x^2/2) * (x > 1)
}
set.seed(123)
x <- rnorm(m, 2, 1.1)
fg <- g(x) / dnorm(x, 2, 1.1)
theta.hat[1] <- mean(fg)

x <- rgamma(m, 3, 2)
fg <- g(x) /  dgamma(x, 3, 2)
theta.hat[2] <- mean(fg)
phi <- (1/sqrt(2*pi))*exp(-1/2)+1-pnorm(1) # the true value of the integral
print(round(c(theta.hat, phi), 5))

Analysis: I use the functions f1 and f2 to estimate the integral. As you can see, both Monte Carlo estimates are very close to the true value (0.40063).

2018-10-12

  1. Let $X$ be a non-negative random variable with $\mu = E[X] < \infty$. For a random sample $x_1,\dots,x_n$ from the distribution of $X$, the Gini ratio is defined by $$G=\frac{1}{2n^{2}\mu}\sum_{j=1}^{n}\sum_{i=1}^{n}\left|x_{i}-x_{j}\right|$$ The Gini ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that $G$ can be written in terms of the order statistics $x_{(i)}$ as $$G=\frac{1}{n^{2}\mu}\sum_{i=1}^n(2i-n-1)x_{(i)}$$ If the mean is unknown, let $\hat{G}$ be the statistic $G$ with $\mu$ replaced by $\overline{x}$. Estimate by simulation the mean, median and deciles of $\hat{G}$ if X is standard lognormal. Repeat the procedure for the uniform distribution and Bernoulli(0.1). Also construct density histograms of the replicates in each case.

Answer

Generate random samples $X_1,X_2\dots X_n$ from standard lognormal, uniform distribution and Bernoulli(0.1), respectively. Replicate 1000 times and compute the mean, median and deciles in each case.

f <- function( n, x, mu){
  y <- numeric(n)
  for(i in 1:n){
    y[i] <- (2*i-n-1) * x[i]
  }
  return(1/(n^2*mu)*sum(y))
}
set.seed(123)
#the mean,median and deciles of G.hat with X is standard lognormal
ucl1 <- replicate(1000, expr = f( n <- 20, x <- sort(rlnorm(n)), mu <- mean(rlnorm(n))))
print(mean(ucl1))
print(median(ucl1))
quantile(ucl1, seq(0,1,0.1))
hist(ucl1,probability = TRUE, main = "hist of standard lognormal") # histogram of the replicates with x is standard lognormal

Analysis: the results show that the mean and median of $\hat{G}$ when X is standard lognormal are 0.54 and 0.47, respectively, close to the theoretical Gini ratio of the standard lognormal, $2\Phi(1/\sqrt{2})-1 \approx 0.52$. The histogram of the replicates is shown above.

#the mean,median and deciles of G.hat with X is uniform distribution
ucl2 <- replicate(1000, expr = f( n <- 20, x <- sort(runif(n)),mu <- mean(runif(n))))
print(mean(ucl2))
print(median(ucl2))
quantile(ucl2, seq(0,1,0.1))
hist(ucl2, probability = TRUE,main = "hist of uniform distribution") # histogram of the replicates with x is uniform distribution

Analysis: the mean and median of $\hat{G}$ when X is uniform are 0.32 and 0.32, respectively, close to the theoretical value $1/3$. The histogram of the replicates is shown above.

#the mean,median and deciles of G.hat with X is Bernoulli(0.1)
ucl3 <- replicate(1000, expr = f( n <- 20, x <- sort(rbinom(n,20,prob = 0.1)),mu <- mean(rbinom(n,20,prob = 0.1))))
print(mean(ucl3))
print(median(ucl3))
quantile(ucl3, seq(0,1,0.1))
hist(ucl3, probability = TRUE,main = "hist of Bernoulli") # histogram of the replicates with x is Bernoulli(0.1)

Analysis: the mean and median of $\hat{G}$ in the Bernoulli(0.1) case are 0.36 and 0.35, respectively. The histogram of the replicates is shown above.

  2. Construct an approximate 95% confidence interval for the Gini ratio $\gamma=E[G]$ if $X$ is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.

Answer

If $X\sim lognormal(\mu,\sigma^2)$, then $\log(X)\sim N(\mu, \sigma^2)$. Assume we have a random sample $X_1,X_2,\dots,X_n$ from the population $X$ with $X\sim lognormal(\mu,\sigma^2)$, and let $Y_i=\log(X_i), 1\leq i \leq n$; then $Y_i\sim N(\mu, \sigma^2)$. We can construct an approximate 95% confidence interval for $\sigma$: $$\left(\sqrt{\frac{(n-1) S^2}{\chi^2_{n-1}(0.975)}} , \sqrt{\frac{(n-1) S^2}{\chi^2_{n-1}(0.025)}}\right)$$ where $S^2=\frac{1}{n-1}\sum_{i=1}^{n}(Y_i-\bar{Y})^2$ and $\bar{Y}=\frac{1}{n}\sum_{i=1}^{n}Y_i$. Note that $\gamma=E(G)=2 \Phi(\frac{\sigma}{\sqrt 2})-1$, where $\Phi$ is the cdf of the standard normal distribution. Therefore we can construct an approximate 95% confidence interval for $\gamma=E(G)$: $$\left(2 \Phi\left(\sqrt{\frac{(n-1) S^2}{2 \chi^2_{n-1}(0.975)}}\right)-1,\ 2 \Phi\left(\sqrt{\frac{(n-1) S^2}{2 \chi^2_{n-1}(0.025)}}\right)-1\right)$$ Given $\sigma=1$, the code for assessing the coverage rate of the interval for $\gamma=E(G)$ with a Monte Carlo experiment is as follows:

sigma <- 1
f <- function(n, y){# function to construct confident interval for sigma 
  s2 <- (1/(n-1))*sum((y-mean(y))^2)
  c((n-1)*s2/qchisq(0.975, n-1), (n-1)*s2/qchisq(0.025, n-1))
}
set.seed(123)
result <- replicate(1000, f(n <- 20, y <- log(rlnorm(n))))
r.true <- 2*pnorm(sigma/sqrt(2))-1 # the true value of r
r.hat <- 2*pnorm(sqrt(result/2))-1
c.r <- cbind(r.hat[1,], r.hat[2,])
judge <- numeric(1000)
for(i in 1:1000)
{
  judge[i] <- I(c.r[i,1] <= r.true & c.r[i,2] >= r.true)
}
sum(judge)
mean(judge)

Analysis: 948 of the 1000 intervals contained the true value, so the empirical confidence level is 94.8% in this experiment, very close to the theoretical level of 95%.

  3. Tests for association based on Pearson product moment correlation $\rho$, Spearman's rank correlation coefficient $\rho_s$, or Kendall's coefficient $\tau$, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution $(X,Y)$ such that $X$ and $Y$ are dependent) such that at least one of the nonparametric tests has better empirical power than the correlation test against this alternative.

Answer

Tests for association based on Pearson product moment correlation $\rho$

library(MASS)
n <- 20
m <- 1000
c <- c(seq(-0.9, -0.1, 0.1), seq(0.1, 0.9, 0.1))
l <- length(c)
power.p <- numeric(l)
library(MASS)
set.seed(123)
for(i in 1:l)
{
  p.p <- replicate(m, expr = {
    x <- mvrnorm(n, c(0,1), matrix(c(1, c[i], c[i], 1), 2, 2))
     cortest <- cor.test(x[,1], x[,2], alternative = "two.sided", method = "pearson")
         cortest$p.value } )
  power.p[i] <- mean(  p.p <= 0.05 )
}
power.p 

Tests for association based on Spearman’s rank correlation coefficient $\rho_s$

n <- 20
m <- 1000
c <- c(seq(-0.9, -0.1, 0.1), seq(0.1, 0.9, 0.1))
l <- length(c)
power.s <- numeric(l)
library(MASS)
set.seed(123)
for(i in 1:l)
{
  p.s <- replicate(m, expr = {
    x <- mvrnorm(n, c(0,1), matrix(c(1, c[i], c[i], 1), 2, 2))
    cortest <- cor.test(x[,1], x[,2], alternative = "two.sided", method = "spearman")
    cortest$p.value } )
  power.s[i] <- mean(  p.s <= 0.05 )
}
power.s

Tests for association based on Kendall’s coefficient $\tau$

n <- 20
m <- 1000
c <- c(seq(-0.9, -0.1, 0.1), seq(0.1, 0.9, 0.1))
l <- length(c)
power.k <- numeric(l)
library(MASS)
set.seed(123)
for(i in 1:l)
{
  p.k <- replicate(m, expr = {
    x <- mvrnorm(n, c(0,1), matrix(c(1, c[i], c[i], 1), 2, 2))
    cortest <- cor.test(x[,1], x[,2], alternative = "two.sided", method = "kendall")
    cortest$p.value } )
  power.k[i] <- mean(  p.k <= 0.05 )
}
power.k
power <- cbind(power.p, power.s, power.k)
rownames(power) <- c(seq(-0.9, -0.1, 0.1), seq(0.1, 0.9, 0.1))
power

Analysis: as the results show, the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal.

If $X\sim exp(4)$ and $Y\sim U(0,1)$ , clearly $X,Y$ are dependent.

n <- 20
m <- 1000
set.seed(123)
p.p1 <- replicate(m, expr = {
  x <- rexp(n, 4)
  y <- runif(n, 0, 1)
  cortest <- cor.test(x, y, alternative = "two.sided", method = "pearson")
  cortest$p.value } )
power.p1 <- mean(  p.p1 <= 0.05 )
power.p1
n <- 20
m <- 1000
set.seed(123)
p.s1 <- replicate(m, expr = {
  x <- rexp(n, 4)
  y <- runif(n, 0, 1)
  cortest <- cor.test(x, y, alternative = "two.sided", method = "spearman")
  cortest$p.value } )
power.s1 <- mean(  p.s1 <= 0.05 )
power.s1

Analysis: In this case, the empirical power of the nonparametric test based on Spearman's rank correlation coefficient is $0.056$, which is larger than the empirical power of the Pearson correlation test, $0.048$.

2018-11-02

  1. Compute a jackknife estimate of the bias and the standard error of the correlation statistic in Example 7.2.

Answer

library(bootstrap)
n <- nrow(law)
LSAT <- law$LSAT
GPA <- law$GPA
theta.hat <- cor(LSAT, GPA)

# compute the jackknife replicates (leave-one-out estimates)
theta.jack <- numeric(n)
for(i in 1:n)
  theta.jack[i] <- cor(LSAT[-i], GPA[-i]) 
bias <- (n-1) * (mean(theta.jack) - theta.hat)   # jackknife estimate of bias
se <- sqrt((n-1) * mean((theta.jack - mean(theta.jack))^2))   # jackknife estimate of standard error
print(round(c(bias, se), 5)) 

Analysis: the jackknife estimates of the bias and the standard error are -0.00647 and 0.14252, respectively.

  2. Refer to Exercise 7.4. Compute 95% bootstrap confidence intervals for the mean time between failures $\frac{1}{\lambda}$ by the standard normal, basic, percentile, and BCa methods. Compare the intervals and explain why they may differ.

Answer

The code for the standard normal, basic, percentile and BCa confidence intervals:

library(boot)
data("aircondit", package = "boot")

# turn the aircondit data set into a vector
aircondit <- unlist(aircondit)
aircondit <- as.vector(aircondit)
lambda <- mean(aircondit)

lambda.boot <- function(x, i)
{
  # function to compute the statistic (the sample mean)
  mean(x[i])
}
x <- aircondit
boot.obj <- boot(x, statistic = lambda.boot, R = 2000)
print(boot.ci(boot.obj, type = c("norm", "basic", "perc","bca"))) 

Analysis: From the results, we can see the BCa interval is the most credible, because it is transformation respecting and second order accurate; the percentile interval is transformation respecting but only first order accurate. The standard normal confidence interval is neither transformation respecting nor second order accurate.

  3. Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.

Answer

If $X$ is the data matrix scor (from the bootstrap package), then the MLE of the covariance matrix $\Sigma$ is $$\hat\Sigma = \frac{(X-\bar{X})^T (X-\bar{X})}{n},$$ where $n$ is the number of rows of the matrix $X$. Here $\hat{\theta}=\hat\lambda_1/\sum_{j=1}^{5}\hat\lambda_j$ is the proportion of variance explained by the first principal component, where $\hat\lambda_1 > \dots > \hat\lambda_5$ are the eigenvalues of $\hat\Sigma$.

data(scor, package = "bootstrap")

# turn scor into a matrix x
names(scor) <- NULL
x <- as.matrix(scor)

n <- nrow(x)
m <- ncol(x)
x.bar <- numeric(m)

#compute the theta
theta <- function(cov)
{
  eigen.cov <- eigen(cov)   
  eigen.v <- sort(eigen.cov$values, decreasing = TRUE)
  eigen.v[1]/sum(eigen.v)
}

# compute the MLE of covariance matrix
cov.hat <- function(j)
{
  x.bar <- colMeans(j)
  I <- matrix(rep(1, nrow(j)), nrow = nrow(j), ncol = 1)
  x.bar <- I %*% x.bar
  crossprod(j - x.bar, j - x.bar)/nrow(j)
}
theta.hat <- theta(cov.hat(x))

#  the jackknife estimates of bias and standard error 
cov.jack <- matrix(0, nrow = m, ncol = m)
theta.jack <- numeric(n)
for(i in 1:n)
{
  cov.jack <- cov.hat(x[-i, ])
  theta.jack[i] <- theta(cov.jack)
}

bias <- (n-1) * (mean(theta.jack) - theta.hat)   # jackknife estimate of bias
se <- sqrt((n-1) * mean((theta.jack - mean(theta.jack))^2))   # jackknife estimate of standard error
print(round(c(bias, se), 5)) 

Analysis: the bias and standard error are both very small, so we can be reasonably confident in the estimate.

  4. In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.

Answer

library(lattice)
library(DAAG)
attach(ironslag)
n <- length(magnetic) 

# the function of compute the change of R-square
shrinkage <- function(fit, k= n/2)
{
  require(bootstrap)

  theta.fit <- function(x,y)
  {
    lsfit(x,y)
  }

  theta.predict <- function(fit,x)
  {
    cbind(1,x) %*% fit$coef
  }

  x <- fit$model[,2:ncol(fit$model)]
  y <- fit$model[,1]

  results <- crossval(x, y, theta.fit, theta.predict, ngroup = k)

  r2 <- cor(y, fit$fitted.values)^2
  r2cv <- cor(y, results$cv.fit)^2
  return(r2-r2cv)   # the change of R-square
}

fit1 <- lm(magnetic ~ chemical)

fit2 <- lm(magnetic ~ chemical + I(chemical ^ 2))

fit3 <- lm(log(magnetic) ~ chemical)

fit4 <- lm(log(magnetic) ~ log(chemical))

print(c(shrinkage(fit1), shrinkage(fit2), shrinkage(fit3), shrinkage(fit4)))

Analysis: from the results, Model 2, the quadratic model, appears to be the best fit for the data, in agreement with the leave-one-out result in Example 7.18.

2018-11-16

  1. Exercise 8.1: Implement the two-sample Cramer-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.

Answer

First, I compute the two-sample Cramer-von Mises statistic, then apply the test to the data in Example 8.1.

w2 <- function(x, y)
{
  n <- length(x)
  m <- length(y)

  # compute the ecdf of x and y 
  Fn <- ecdf(x)
  Gm <- ecdf(y)

  d <- (m * n) / (m + n)^2
  m1 <- sum((Fn(x) - Gm(x))^2)
  m2 <- sum((Fn(y) - Gm(y))^2)
  w2 <- d * (m1 + m2)
  return(w2)
}

attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
detach(chickwts)

R = 999
n <- length(x)
m <- length(y)
K <- 1:(m + n)
z <- c(x, y)
w0 <- w2(x, y)
w <- numeric(R)
set.seed(12345)

for(i in 1:R)
{
  k <- sample(K, size = n, replace = FALSE)
  x1 <- z[k]
  y1 <- z[-k]
  w[i] <- w2(x1, y1)
}
p <- mean(c(w0, w) >= w0)
p

hist(w, main = "", breaks = "scott", freq = FALSE,
     xlab = "W2(p = 0.407)")
points(w0, 0, cex = 1, pch = 16)

Analysis: the permutation p-value 0.407 is close to the result of the Kolmogorov-Smirnov permutation test, 0.46; the approximate p-value does not support the alternative hypothesis that the distributions differ.
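As a quick cross-check (this call is not part of the original answer), the asymptotic two-sample Kolmogorov-Smirnov test in base R can be applied to the same vectors x and y; ties in the weights may produce a warning about the exact p-value, and the asymptotic p-value will not match the permutation value exactly.

# cross-check with the built-in two-sample Kolmogorov-Smirnov test
ks.test(x, y)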

  2. Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 of the chain, and compare the deciles of the generated observations with the deciles of the standard Cauchy distribution (see qcauchy or qt with df=1). Recall that a Cauchy($\theta,\eta$) distribution has density function $$\ f(x)=\frac{1}{\theta\pi (1 + [(x - \eta)/\theta]^2)}, \quad -\infty < x < \infty, \ \theta > 0.$$ The standard Cauchy has the Cauchy($\theta = 1,\eta = 0$) density. (Note that the standard Cauchy density is equal to the Student t density with one degree of freedom.)

Answer

From the exercise, we know the density function of a standard Cauchy distribution is $$\ f(x)= \frac{1}{\pi(1+x^{2})}.$$ For the proposal distribution, try the normal distribution with mean $X_t$ and variance $\sigma^{2}$, so that $g(\cdot|x)\sim N(x,\sigma^{2})$, and generate $X_0$ from the distribution $N(0, \sigma^{2})$.

f <- function(x)
{
  1 / (pi * (1 + x^2))
}

m <- 1e4
sigma <- 1 # Set standard deviation of normal proposal density
x <- numeric(m)
x[1] <- rnorm(1, mean = 0, sd = sigma)
k <- 0
u <- runif(m)
for(i in 2:m)
{
  xt <- x[i-1]
  y <- rnorm(1, mean = xt, sd = sigma)
  num <- f(y) * dnorm(xt, mean = y, sd = sigma)
  den <- f(xt) * dnorm(y, mean = xt, sd = sigma)

  if(u[i] <= num / den)
    x[i] <- y
  else
  {
    x[i] <- xt
    k <- k + 1
  }
}
print(k)
index <- 5001:5500
y1 <- x[index]
plot(index, y1, type = "l", main = "", ylab = "X")

Analysis: From the result, we can see that approximately 20% of the candidate points are rejected, so the chain is reasonably efficient.

b <- 1001 # discard the burnin sample
y <- x[b:m]
#a <- ppoints(100)
a <- seq(0.1, 0.9, 0.1)
QR <- qcauchy(a)
Q <- quantile(y, a)
print(round(cbind(QR, Q), 3))

Analysis: After discarding the first 1000 values of the chain, the deciles of the generated observations and the deciles of the standard Cauchy distribution are very close.

2018-11-23

  1. Find the intersection points $A(k)$ in $(0,\sqrt {k})$ of the curves $$ S_{k-1}(a) = P\left(t(k-1)> \sqrt {\frac {a^2 (k-1)}{k-a^2}}\right)$$ $$S_{k}(a) = P\left(t(k)> \sqrt {\frac {a^2 k}{k+1-a^2}}\right)$$ for $k = 4 : 25,100,500,1000$, where $t(k)$ is a Student t random variable with $k$ degrees of freedom. (These intersection points determine the critical values for a t-test for scale-mixture errors proposed by Szekely [260].)

Answer

Suppose $X\sim t(k-1)$ and $Y\sim t(k)$. Then $S_{k-1}(a) = 1-P(X \le \sqrt {\frac {a^2 (k-1)}{k-a^2}})= 1-T_{k-1}(\sqrt {\frac {a^2 (k-1)}{k-a^2}})$ and $S_{k}(a) = 1-P(Y \le \sqrt {\frac {a^2 k}{k+1-a^2}})=1-T_{k}(\sqrt {\frac {a^2 k}{k+1-a^2}})$, so the intersection points of $S_{k-1}(a)$ and $S_{k}(a)$ are those of $T_{k-1}(\sqrt {\frac {a^2 (k-1)}{k-a^2}})$ and $T_{k}(\sqrt {\frac {a^2 k}{k+1-a^2}})$.

k <- 100
a <- seq(0, sqrt(k), 0.1)
y <- numeric(length(a))
f <- function(a)
{
  pt(sqrt(k * a^2 / (k+1-a^2)),k ) -
      pt(sqrt((k-1) * a^2 / (k-a^2)), k-1)
}

for(i in 1:length(a))
{
  y[i] <- f(a[i])
}
print(y)
plot(a, y, type = "l", xlab = "a", ylab = "y", main = "f(a)")
abline(h = 0, lty = 2, col = "red")

Analysis: from the results, we can see that the sign of f(a) changes in the interval (1, 2). If you change the value of $k$, this interval remains unchanged, so I search for the root over the interval (1, 2) in the following code.

k <- c(4:25, 100, 500, 1000)
n <-length(k)

out <- numeric(n)

for(i in 1:n)
{

  f <- function(a)
  {

    pt(sqrt(k[i] * a^2 / (k[i]+1-a^2)), k[i]) -
      pt(sqrt((k[i]-1) * a^2 / (k[i]-a^2)), k[i]-1)
  }

  out[i] <- uniroot(f, c(1, 2))$root
}
print(out)

Analysis: when $k$ is very large, the values of $T_{k-1}(\sqrt {\frac {a^2 (k-1)}{k-a^2}})$ and $T_{k}(\sqrt {\frac {a^2 k}{k+1-a^2}})$ both become essentially equal to 1 for larger $a$, so searching over the whole range is not useful; I therefore compute the intersection points only on the interval (1, 2).

2018-11-30

  1. Write a function to compute the cdf of the Cauchy distribution, which has density $$\frac{1}{\theta\pi (1+[\frac{x-\eta}{\theta}]^2)}, \quad -\infty < x < \infty,$$ where $\theta >0$. Compare your results to the results from the R function pcauchy. (Also see the source code in pcauchy.c.)

Answer

.dcauchy <- function(theta, eta, x)
{
  # compute the density function of the Cauchy distribution
  if(theta <= 0) return(NA)

  r <- theta * pi * (1 + ((x-eta)/theta)^2)
  return(1 / r)
}

.pcauchy <- function(theta, eta, y)
{
  integrate(.dcauchy, lower = -Inf, upper = y,
            rel.tol = .Machine$double.eps^0.25,
            theta = theta, eta = eta)$value
}

y <- matrix(seq(-10, 10, 1))

res11 <- apply(y, 1, .pcauchy, eta = 0, theta = 0.5)
res12 <-  pcauchy(y,  0, 0.5)



plot(y, res11, type = "l", col = "red", lwd=2, ylab = "cdf", ylim = c(0, 1))
lines(y, res12, type = "p")

In the plot, the red line represents the estimates, and the points represent the true values. As you can see, the points all lie on the line, so the estimates are very close to the true values.

  2. A-B-O blood type problem. Let the three alleles be A, B, and O.

Genotype  | AA | BB | OO | AO | BO | AB | Sum
----------|:--------:|----------|----------|----------|----------|----------|----
Frequency | $p^2$ | $q^2$ | $r^2$ | $2pr$ | $2qr$ | $2pq$ | 1
Count | $n_{AA}$ | $n_{BB}$ | $n_{OO}$ | $n_{AO}$ | $n_{BO}$ | $n_{AB}$ | n

Observed data: $n_{A·} = n_{AA} + n_{AO} = 28$ (A-type), $n_{B·} = n_{BB} + n_{BO} = 24$(B-type), $n_{OO} = 41$ (O-type), $n_{AB} = 70$ (AB-type).

Use EM algorithm to solve MLE of p and q (consider missing data $n_{AA}$ and $n_{BB}$ ).

Record the log-maximum likelihood values in M-steps, are they increasing?

Answer

The complete-data likelihood function is $$L(p, q, n) = (p^2)^{n_{AA}} (q^2)^{n_{BB}} (r^2)^{n_{OO}} (2pr)^{n_{AO}} (2qr)^{n_{BO}} (2pq)^{n_{AB}}$$ and the complete-data log-likelihood (up to an additive constant) is $$l(p,q,n)=(n_{AA}+n_{A.}+n_{AB})\log(p)+(n_{BB}+n_{B.}+n_{AB})\log(q)+(2n_{OO}+n_{A.}+n_{B.}-n_{AA}-n_{BB})\log(r).$$ Because $n_{AA}|n_{A.} \sim B(n_{A.}, \frac{p}{p+2r})$ and $n_{BB}|n_{B.} \sim B(n_{B.}, \frac{q}{q+2r})$, the expectation of $l(p,q,n)$ given the current estimates $(p_0, q_0)$ is $$E_{p_0, q_0}[l(p,q,n)] = \left(\frac{2(p_0 + r_0)}{p_0 + 2r_0} n_{A.}+n_{AB}\right)\log(p)+\left(\frac{2(q_0 + r_0)}{q_0 + 2r_0} n_{B.}+n_{AB}\right)\log(q)+\left(\frac{2r_0}{p_0 + 2r_0} n_{A.}+\frac{2 r_0}{q_0 + 2r_0} n_{B.}+2n_{OO}\right)\log(r)$$

nA. = 28
nB. = 24
nOO = 41
nAB = 70

m <- 10 # max number of iterations
theta0 <- c(0.3, 0.2)  # initial estimates for p and q
theta <- matrix(0, m, 3)
theta[1,1] <- theta0[1]
theta[1,2] <- theta0[2]
L <- numeric(m)
#tol <- .Machine$double.eps^0.5
for(i in 1:(m-1))
{
  ll <- function(theta1)
  {
    p.hat <- theta[i,1]
    q.hat <- theta[i,2]
    p <- theta1[1]
    q <- theta1[2]
    r <- 1-p-q
    r.hat <- 1-p.hat-q.hat
    f1 <- (nA.* 2*(p.hat + r.hat)/(p.hat + 2*r.hat) + nAB)*log(p)
    f2 <- (nB.* 2*(q.hat + r.hat)/(q.hat + 2*r.hat) + nAB)*log(q)
    f3 <- 2*(nOO + nA.* r.hat/(p.hat + 2*r.hat)+ nB.* r.hat/(q.hat + 2*r.hat))*log(r)
    return(- (f1+f2+f3))
  }
  opt <- optim(c(0.4, 0.2), ll)   # M-step: maximize the expected log-likelihood
  theta[i+1, 1] <- opt$par[1]
  theta[i+1, 2] <- opt$par[2]
  theta[i+1, 3] <- 1-theta[i+1, 2]-theta[i+1, 1]
  L[i] <- -opt$value   # record the maximized expected log-likelihood of this M-step
}
print(theta[m, ])   # final estimates of (p, q, r)
x <- seq(1, m-1 , 1)
print(L[-m])
plot(x, L[-m])

2018-12-07

  1. Use both for loops and lapply() to fit linear models to the mtcars data set using the formulas stored in this list:

formulas <- list( mpg ~ disp, mpg ~ I(1 / disp), mpg ~ disp + wt, mpg ~ I(1 / disp) + wt )

Answer

formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)

# with loop
lm.mtcars <- vector("list", length(formulas))

for(i in  seq_along(formulas))
{
  lm.mtcars[[i]] <- lm(formulas[[i]], data = mtcars)$coefficients
}
lm.mtcars

# with lapply()
lapply(formulas, function(x) lm(x, data = mtcars)$coefficients)

Analysis: lapply() makes it easier to code, and the results are equal.

  2. Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function?

bootstraps <- lapply(1:10, function(i) { rows <- sample(1:nrow(mtcars), rep = TRUE); mtcars[rows, ] })

# generate the bootstrap replicates
set.seed(123)
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})

 # with loop
cor.mtcars <- vector("list", length(bootstraps))

for(i in seq_along(bootstraps))
{
  cor.mtcars[[i]] <- lm(mpg ~ disp, 
                        data = bootstraps[[i]])$coefficients  # coefficient
}

# with lapply()
y <- lapply(bootstraps, function(x) lm(mpg ~ disp, data = x)$coefficients)

cor.mtcars <- as.matrix(unlist(cor.mtcars))
y <- as.matrix(unlist(y))
z <- cbind(cor.mtcars, unname(y))  # turn to matrix
z

Analysis: The first column of z contains the results from the for loop, and the second column the results from lapply(). The results are equal.

  3. For each model in the previous two exercises, extract R2 using the function below.

rsq <- function(mod) summary(mod)$r.squared

Answer

# R2 for the models in exercise 1 (the formulas list)
rsq <- function(mod) summary(mod)$r.squared
r1 <- numeric(length(formulas))
for(i in  seq_along(formulas))
{
  lm.mtcars[[i]] <- lm(formulas[[i]], data = mtcars)
  r1[i] <- rsq(lm.mtcars[[i]])
}
r1

# R2 for the models in exercise 2 (the bootstrap replicates)
r2 <- numeric(length(bootstraps))


# with loop
cor.mtcars <- vector("list", length(bootstraps))

for(i in seq_along(bootstraps))
{
  cor.mtcars[[i]] <- lm(mpg ~ disp, 
                        data = bootstraps[[i]])   # fitted model
  r2[i] <- rsq(cor.mtcars[[i]])
}
r2
  4. The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.

trials <- replicate(100, t.test(rpois(10, 10), rpois(7, 10)), simplify = FALSE)

Extra challenge: get rid of the anonymous function by using [[ directly.

Answer

set.seed(123)
trials <- replicate(
100,
t.test(rpois(10, 10), rpois(7, 10)),
simplify = FALSE
)
p.value <- function(test) test$p.value

round(sapply(trials, p.value), 5)

# extra challenge
x <- vector("list", length(trials))
for(i in seq_along(trials))
{
  x[i] <- trials[[i]]$p.value
}
round(unlist(x), 5)

Analysis

I was not sure what the "Extra challenge" meant, so I wrote the loop code above; one possible reading using [[ is sketched below.
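A minimal sketch of one reading of the challenge (not part of the original answer): sapply() can be given `[[` as the function and the element name as an extra argument, which removes the need for an anonymous function or a named helper.

# extract the p-value from every trial by passing `[[` directly to sapply()
round(sapply(trials, `[[`, "p.value"), 5)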

  5. Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?

Answer

set.seed(123)
df <- data.frame(replicate(5, runif(5)))
summary <- function(x)
{
  funs <- c(mean, median, sd, mad, IQR)
  sapply(funs, function(f) f(x))
}
x <- vapply(df, summary, c(mean = 0, median = 0, sd = 0, mad = 0, IQR = 0))
x

Analysis: The function calculates the mean, median, sd, mad and IQR of each column of the data frame. Strictly speaking it uses sapply() and vapply() rather than Map(); a sketch that combines Map() and vapply() directly is given below.
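A minimal sketch of an lapply() variant built from Map() and vapply() (the name lapply_parallel and its argument list are illustrative assumptions, not from the original answer): the function needs the worker function, the parallel inputs, and a template describing the type and length of each single result.

lapply_parallel <- function(f, ..., FUN.VALUE) {
  # Map() applies f to the i-th element of every input in parallel
  results <- Map(f, ...)
  # vapply() checks each result against the template FUN.VALUE and
  # simplifies the list into a vector (or a matrix for longer templates)
  vapply(results, identity, FUN.VALUE)
}

# usage example: element-wise weighted sum of two vectors
lapply_parallel(function(a, b) a + 2 * b, 1:3, 4:6, FUN.VALUE = numeric(1))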

2018-12-14

  1. Make a faster version of chisq.test() that only computes the chi-square test statistic when the input is two numeric vectors with no missing values. You can try simplifying chisq.test() or by coding from the mathematical definition (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test).

Answer

chisq.test1 <- function (x)
{
  if (is.data.frame(x)) 
    x <- as.matrix(x)

    if (length(x[1,]) != length(x[2,])) 
      stop("'x'  must have the same length")
  if ((n <- sum(x)) == 0) 
    stop("at least one entry of 'x' must be positive")

  if (is.matrix(x)) {
    METHOD <- "Pearson's Chi-squared test"
    nr <- as.integer(nrow(x))
    nc <- as.integer(ncol(x))

    sr <- rowSums(x)
    sc <- colSums(x)
    E <- outer(sr, sc, "*")
    statistic <- (sum((x*x)/E)-1) * n
  }
  return(statistic)
}
M <- rbind(c(762, 327, 468, 1000, 956, 859), c(484, 239, 477, 560, 458, 589))

library(microbenchmark)
microbenchmark(t <- chisq.test1(M), Xsq <- chisq.test(M))
t
Xsq

Analysis: you can see that the test statistics are equal, and the simplified function takes less time.
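The exercise specifies two numeric vectors as input, while chisq.test1() above takes a matrix; a minimal sketch of a wrapper (the name chisq.test2 is mine, not part of the original answer) simply binds the two count vectors into a 2 x k table first, exactly as M was built above.

# hypothetical wrapper: accept two count vectors and form the 2 x k table
chisq.test2 <- function(x, y) chisq.test1(rbind(x, y))
# usage example, reusing the two rows of M
chisq.test2(c(762, 327, 468, 1000, 956, 859), c(484, 239, 477, 560, 458, 589))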

  2. Can you make a faster version of table() for the case of an input of two integer vectors with no missing values? Can you use it to speed up your chi-square test?

Answer

I am only going to do it with one vector; I do not know how to handle the case of two integer vectors, but a possible two-vector version is sketched after the code below.

table1 <- function(x)
{
  # fast one-way table: tabulate() counts occurrences of the positive integers in x
  z <- array(tabulate(x))
  z
}
x <- 1:10
microbenchmark(t <- table1(x), xsq <- table(x))
t
xsq
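For the two-vector case, a minimal sketch of one possible approach (the name table2 and the assumption that both vectors contain positive integers starting at 1 are mine, not from the original answer): encode each (x, y) pair as a single cell index and count all cells with one tabulate() call.

table2 <- function(x, y) {
  # two-way contingency table for two positive-integer vectors with no NAs
  nx <- max(x)
  ny <- max(y)
  # map each pair (x, y) to a cell index in 1..(nx*ny), count with tabulate(),
  # then reshape the counts into an nx-by-ny matrix
  counts <- tabulate((y - 1L) * nx + x, nbins = nx * ny)
  matrix(counts, nrow = nx, ncol = ny)
}

# usage example
table2(c(1L, 2L, 2L, 3L), c(1L, 1L, 2L, 2L))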

