Assignment 09-14

Question 1

This example uses the inverse transform method to simulate a random sample from the distribution with density $f_X(x) = 3x^2$, $0 < x < 1$. Since $F_X(x) = x^3$ on $(0,1)$, the inverse transformation is $F_X^{-1}(u) = u^{1/3}$.

Answer 1

n <- 1000
u <- runif(n)
x <- u^(1/3)
hist(x, prob = TRUE) #density histogram of sample
y <- seq(0, 1, .01)
lines(y, 3*y^2) #density curve f(x)

Question 2

Generate random samples from a Logarithmic(0.5) distribution.

Answer 2

rlogarithmic <- function(n, theta) {
    #returns a random logarithmic(theta) sample of size n
    u <- runif(n)
    #set the initial length of cdf vector
    N <- ceiling(-16 / log10(theta))
    k <- 1:N
    a <- -1/log(1-theta)
    fk <- exp(log(a) + k * log(theta) - log(k))
    Fk <- cumsum(fk)
    x <- integer(n)
    for (i in 1:n) {
        x[i] <- as.integer(sum(u[i] > Fk)) #F^{-1}(u)-1
        while (x[i] == N) {
            #if x==N we need to extend the cdf
            #very unlikely because N is large
            logf <- log(a) + (N+1)*log(theta) - log(N+1)
            fk <- c(fk, exp(logf))
            Fk <- c(Fk, Fk[N] + fk[N+1])
            N <- N + 1
            x[i] <- as.integer(sum(u[i] > Fk))
        }
    }
    x + 1
}
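
A quick check of the generator: compare the relative frequencies of a Logarithmic(0.5) sample with the theoretical pmf $f(k) = -\theta^k/(k\log(1-\theta))$, $k=1,2,\dots$

set.seed(1)
theta <- 0.5
xx <- rlogarithmic(1000, theta)
k <- 1:6
p_hat <- as.numeric(table(factor(xx, levels = k))) / length(xx) #empirical relative frequencies
p_true <- -theta^k / (k * log(1 - theta)) #theoretical Logarithmic(theta) pmf
round(rbind(p_hat, p_true), 4)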

Question 3

Let $X_1 \sim \mathrm{Gamma}(2, 2)$ and $X_2 \sim \mathrm{Gamma}(2, 4)$ be independent. Compare the histograms of the samples generated by the convolution $S = X_1 + X_2$ and the mixture $F_X = 0.5F_{X_1} + 0.5F_{X_2}$.

Answer 3

n <- 1000
x1 <- rgamma(n, 2, 2)
x2 <- rgamma(n, 2, 4)
s <- x1 + x2 #the convolution
u <- runif(n)
k <- as.integer(u > 0.5) #vector of 0’s and 1’s
x <- k * x1 + (1-k) * x2 #the mixture
par(mfcol=c(1,2)) #two graphs per page
hist(s, prob=TRUE)
hist(x, prob=TRUE)
par(mfcol=c(1,1)) #restore display
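
As an optional check, the theoretical mixture density $0.5f_{X_1}(x)+0.5f_{X_2}(x)$ can be superimposed on the mixture histogram (the convolution has no equally simple closed form here):

par(mfcol=c(1,2))
hist(s, prob=TRUE, main="convolution")
hist(x, prob=TRUE, main="mixture")
v <- seq(0, max(x), .01)
lines(v, 0.5*dgamma(v, 2, 2) + 0.5*dgamma(v, 2, 4)) #theoretical mixture density
par(mfcol=c(1,1))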

Assignment 09-21

Question 1

exercise 3.5

A discrete random variable X has probability mass function

x <- c(0,1,2,3,4); p <- c(0.1,0.2,0.2,0.2,0.3)
names(p)=x
p

Use the inverse transform method to generate a random sample of size 1000 from the distribution of X. Construct a relative frequency table and compare the empirical with the theoretical probabilities. Repeat using the R sample function.

Answer 1

  1. Use the inverse transform method to generate a random sample of size 1000 from the distribution of X and compute the empirical frequencies.
set.seed(1)
x <- c(0,1,2,3,4); p <- c(0.1,0.2,0.2,0.2,0.3)
cp <- cumsum(p); m <- 1e3; r <- numeric(m)
r <- x[findInterval(runif(m),cp)+1]
ct <- as.vector(table(r))
emperical_frequency=ct/sum(ct) #compute the empirical frequencies
  2. Use the R sample function to generate a random sample of size 1000 from the distribution of X and compute its empirical frequencies.
auto_generate = sample(c(0,1,2,3,4), size = 1000, replace = TRUE, prob = c(0.1,0.2,0.2,0.2,0.3))
ct <- as.vector(table(auto_generate))
emperical_frequency_sample=ct/sum(ct) #empirical frequencies of the sample() draw
  3. Compare the empirical with the theoretical probabilities.
theoretical_probablities = p
names(emperical_frequency)=c(0,1,2,3,4)
names(theoretical_probablities)=c(0,1,2,3,4)

rbind(theoretical_probablities,emperical_frequency,emperical_frequency_sample)

The empirical frequencies are close to the theoretical probabilities for both methods, so the random sample generation works as intended. This completes Question 1.

Question 2

exercise 3.7

Write a function to generate a random sample of size n from the Beta(a,b) distribution by the acceptance-rejection method. Generate a random sample of size 1000 from the Beta(3,2) distribution. Graph the histogram of the sample with the theoretical Beta(3,2) density superimposed.

Answer 2

  1. First, we define a function to generate a random sample of size n from the Beta(a,b) distribution by the acceptance-rejection method.

Setup: Beta(a,b) has pdf $f(x) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1-x)^{b-1}$, $0<x<1$; take the proposal $g(x) = 1$, $0<x<1$, and $c = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}$.
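
With these choices the acceptance condition simplifies: a candidate $x$ drawn from $g$ is accepted when
$$u \le \frac{f(x)}{c\,g(x)} = x^{a-1}(1-x)^{b-1},$$
which is exactly the condition tested inside the function below.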

generate_beta_random_variable <- function(a, b, n = 1e3) {
    j<-k<-0;y <- numeric(n)
    while (k < n) {
      u <- runif(1)
      j <- j + 1
      x <- runif(1) #random variate from g
      if (x^(a-1) * (1-x)^(b-1) > u) {
        #we accept x
        k <- k + 1
        y[k] <- x
        }
    }
    hist( y, prob=T, ylab='', main='', breaks = "Scott")
    z=seq(0,1,.001)
    l=factorial(a+b-1)/(factorial(a-1)*factorial(b-1))*z^(a-1)*(1-z)^(b-1)
    lines(z,l)

}
  2. Generate a random sample of size 1000 from the Beta(3,2) distribution and graph the histogram of the sample with the theoretical Beta(3,2) density superimposed.
generate_beta_random_variable(3,2)

According to the graph, the histogram of the generated sample agrees closely with the theoretical Beta(3,2) density. This completes Question 2.

Question 3

exercise 3.12

Simulate a continuous Exponential-Gamma mixture. Suppose that the rate parameter $\Lambda$ has a $\mathrm{Gamma}(r,\beta)$ distribution and Y has an $\mathrm{Exp}(\Lambda)$ distribution. That is, $(Y \mid \Lambda = \lambda) \sim f_Y(y\mid\lambda) = \lambda e^{-\lambda y}$. Generate 1000 random observations from this mixture with $r = 4$ and $\beta = 2$.


Answer 3

  1. First, generate $\Lambda$ from the $\mathrm{Gamma}(r,\beta)$ distribution.

  2. Then generate 1000 random observations from this mixture with $r = 4$ and $\beta = 2$, and graph the histogram of the sample.

n = 1000
r = 4
beta = 2
Lambda = rgamma( n,r,beta )
y = rexp( n,Lambda )
# graph the histogram
hist( y, prob=T, ylab='', main='', breaks = "Scott") 
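
As an optional check, the marginal of $Y$ in this Exponential-Gamma mixture is a Pareto (Lomax) distribution with density $f_Y(y)=\frac{r\beta^r}{(\beta+y)^{r+1}}$, $y>0$, which can be superimposed on the histogram:

v <- seq(0, max(y), .01)
lines(v, r * beta^r / (beta + v)^(r + 1)) #theoretical Lomax(r, beta) density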

Assignment 09-30

Question 1

exercise 5.4

Write a function to compute a Monte Carlo estimate of the Beta(3, 3) cdf, and use the function to estimate F(x) for $x = 0.1,0.2,\dots,0.9$ . Compare the estimates with the values returned by the pbeta function in R.

Answer 1

  1. A function to compute a Monte Carlo estimate of the Beta(a, b) cdf.
set.seed(1)
beta_cdf <- function(x,a,b) {
    u=runif(1000,min = 0,max = x)
    #Beta(a,b) density evaluated at u (a and b assumed to be integers)
    g = factorial(a+b-1)/(factorial(a-1)*factorial(b-1))*u^(a-1)*(1-u)^(b-1)
    theta = mean(x*g)
    return(theta)
}
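
The estimator above writes the cdf as an expectation over $U\sim\mathrm{Uniform}(0,x)$:
$$F(x)=\int_0^x f(t)\,dt = x\,E[f(U)] \approx \frac{x}{m}\sum_{i=1}^m f(u_i),$$
which is the quantity returned by mean(x*g).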
  2. Let $a=3$ and $b=3$; we compare the Beta(3,3) cdf for $x = 0.1,0.2,\dots,0.9$ computed by our function and by the pbeta function in R.
x=seq(0.1,0.9,0.1)
for (i in x) {
  a = beta_cdf(i,3,3)

  b = pbeta(i,3,3)

  print(c(i,a,b))
}

Question 2

exercise 5.9

The Rayleigh density is $f(x) = \frac{x}{\sigma^2}e^{-x^2/(2\sigma^2)}$, $x \geq 0$, $\sigma > 0$. Implement a function to generate samples from a Rayleigh($\sigma$) distribution, using antithetic variables. What is the percent reduction in variance of $\frac{X+X'}{2}$ compared with $\frac{X_1+X_2}{2}$ for independent $X_1, X_2$?

Answer 2

According to the Rayleigh density, the cdf is $F(x)=1-e^{-x^2/(2\sigma^2)}$, $x \ge 0$.
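
Inverting the cdf gives
$$F^{-1}(u) = \sigma\sqrt{-2\log(1-u)},$$
which both functions below implement; the antithetic version applies it to the pair $(u, 1-u)$ to obtain $X$ and $X'$.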

  1. We write functions to generate Rayleigh($\sigma$) samples by the inverse transform method, with and without antithetic variables.
generate_Rayleigh_random_variable = function(sigma){
    n = 1000
    x = runif(n)
    g = (-2*sigma^2*log(1-x))^(1/2)
    return(g)
}


generate_Rayleigh_random_variable_antithetic_variable = function(sigma){
    n = 1000
    x = runif(n)
    g = (-2*sigma^2*log(1-x))^(1/2)
    f = (-2*sigma^2*log(x))^(1/2)
    return(c(g,f))
}
  2. Let $\sigma=0.5$. We use antithetic variables to generate $z_1=\frac{X+X'}{2}$, and we also generate independent samples $X_1$, $X_2$ by the plain inverse transform method.
set.seed(100)
a= generate_Rayleigh_random_variable_antithetic_variable(0.5)
z=numeric(1000)
for(i in seq(1,1000,1)){
  z[i]=a[i]+a[1000+i]
}
z_1=z/2
print(sprintf('variance of random sample with antithetic variable: %f',var(z_1)))


x_1 = generate_Rayleigh_random_variable(0.5)
x_2 = generate_Rayleigh_random_variable(0.5)
z_2 = (x_1 + x_2)/2
print(sprintf('variance of random sample with independent variable: %f',var(z_2)))


percent_reduction = (var(z_2) - var(z_1))/var(z_2)
print(sprintf('percent_reduction is: %s',paste(round(percent_reduction,6)*100,'%',sep = '')))  

Assignment 10-12

Question 1

exercise 6.9

Let X be a non-negative random variable with $\mu = E[X] < \infty$. For a random sample $x_1,\dots,x_n$ from the distribution of X, the Gini ratio is defined by $G=\frac{1}{2n^2\mu}\sum_{j=1}^n \sum_{i=1}^{n} |x_i - x_j|$. The Gini ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that G can be written in terms of the order statistics $x_{(i)}$ as $G=\frac{1}{n^2\mu} \sum_{i=1}^{n} (2i-n-1)x_{(i)}$. If the mean is unknown, let $\hat{G}$ be the statistic G with $\mu$ replaced by $\bar{x}$. Estimate by simulation the mean, median and deciles of $\hat{G}$ if X is standard lognormal. Repeat the procedure for the uniform distribution and Bernoulli(0.1). Also construct density histograms of the replicates in each case.

Answer 1

  1. Write a function to compute the Gini ratio.
set.seed(1)
n=1000
gini_ratio_computing = function(x,mu){
  n = length(x)
  x = sort(x)
  if (identical(mu, FALSE)) {
    mu = mean(x) #if mu is unknown, replace it by the sample mean (this gives G-hat)
  }
  a = 0
  for (i in 1:n) {
    a = a + (2*i - n - 1) * x[i]
  }
  gini_ratio = a/(n^2*mu)
  return(gini_ratio)
}
  2. Simulate the mean, median and deciles of $\hat{G}$ when X is standard lognormal.
k=1
gini_ratio_statistic_for_lognormal=numeric(n)
while (k<1001) {
  y=rnorm(n)
  x=exp(y)
  gini_ratio_statistic_for_lognormal[k]=gini_ratio_computing(x,mu=FALSE) #G-hat: mu replaced by the sample mean
  k=k+1
}

decile=seq(0.1,1,by=0.1)
gini_ratio_hat_for_lognormal_mean=mean(gini_ratio_statistic_for_lognormal)
gini_ratio_hat_for_lognormal_median=median(gini_ratio_statistic_for_lognormal)
gini_ratio_hat_for_lognormal_deciles=quantile(gini_ratio_statistic_for_lognormal,decile)
print(sprintf('gini ratio for lognormal mean: %f, median:%f ', gini_ratio_hat_for_lognormal_mean,gini_ratio_hat_for_lognormal_median ))
print( gini_ratio_hat_for_lognormal_deciles)
hist(gini_ratio_statistic_for_lognormal, prob=T, ylab='', main='', breaks = "Scott")
  3. Simulate the mean, median and deciles of $\hat{G}$ when X is Uniform(0,1).
k=1
gini_ratio_statistic_for_uniform=numeric(n)
while (k<1001) {
  x=runif(n)
  gini_ratio_statistic_for_uniform[k]=gini_ratio_computing(x,mu=FALSE) #G-hat: mu replaced by the sample mean
  k=k+1
}

decile=seq(0.1,1,by=0.1)
gini_ratio_hat_for_uniform_mean=mean(gini_ratio_statistic_for_uniform)
gini_ratio_hat_for_uniform_median=median(gini_ratio_statistic_for_uniform)
gini_ratio_hat_for_uniform_deciles=quantile(gini_ratio_statistic_for_uniform,decile)
print(sprintf('gini ratio for uniform mean: %f, median: %f ', gini_ratio_hat_for_uniform_mean,gini_ratio_hat_for_uniform_median ))
print(gini_ratio_hat_for_uniform_deciles)
hist(gini_ratio_statistic_for_uniform, prob=T, ylab='', main='', breaks = "Scott")
  4. Simulate the mean, median and deciles of $\hat{G}$ when X is Bernoulli(0.1).
gini_ratio_statistic_for_bernoulli=numeric(n)
l=1
while (l<1001) {
  x=rbinom(n,1,0.1)
  gini_ratio_statistic_for_bernoulli[l]=gini_ratio_computing(x,mu=FALSE) #G-hat: mu replaced by the sample mean
  l=l+1
}
decile=seq(0.1,1,by=0.1)
gini_ratio_hat_for_bernoulli_mean=mean(gini_ratio_statistic_for_bernoulli)
gini_ratio_hat_for_bernoulli_median=median(gini_ratio_statistic_for_bernoulli)
gini_ratio_hat_for_bernoulli_deciles=quantile(gini_ratio_statistic_for_bernoulli,decile)
print(sprintf('gini ratio for bernoulli mean: %f, median: %f ', gini_ratio_hat_for_bernoulli_mean,gini_ratio_hat_for_bernoulli_median ))
print(gini_ratio_hat_for_bernoulli_deciles)
hist(gini_ratio_statistic_for_bernoulli, prob=T, ylab='', main='', breaks = "Scott")

Question 2

exercise 6.10

Construct an approximate 95% confidence interval for the Gini ratio $\gamma = E[G]$ if X is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.

Answer 2

We use the central limit theorem to construct the confidence interval: each interval is built from m independent replicates of $\hat{G}$, each computed from a fresh lognormal sample, and the whole procedure is repeated to average the interval endpoints.
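
Concretely, with replicates $\hat G_1,\dots,\hat G_m$ the approximate 95% interval used below is
$$\bar{G} \pm z_{0.975}\,\frac{s_{\hat G}}{\sqrt m},$$
where $\bar G$ and $s_{\hat G}$ are the sample mean and standard deviation of the replicates.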

n=200
m=80
alpha=0.05
upper_confidence=numeric(n)
lower_confidence=numeric(n)
for (j in 1:n) {
  gini_ratio_hat=numeric(m)
  for (i in 1:m) {
    y = rnorm(n,mean = 0,sd=1)
    x = exp(y)
    gini_ratio_hat[i]=gini_ratio_computing(x,mu=FALSE)
  }
  gini_ratio_hat_mean=mean(gini_ratio_hat)
  gini_ratio_hat_variance=var(gini_ratio_hat)
  lower_confidence[j]=gini_ratio_hat_mean-qnorm(1-alpha/2)*sqrt(gini_ratio_hat_variance/m)
  upper_confidence[j]=gini_ratio_hat_mean+qnorm(1-alpha/2)*sqrt(gini_ratio_hat_variance/m)

}
lower_confidence=mean(lower_confidence)
upper_confidence=mean(upper_confidence)
print(sprintf('confidence interval of gini ratio hat is: (%f , %f)',lower_confidence,upper_confidence ))

Question 3

exercise 6.B

Tests for association based on Pearson product moment correlation $\rho$, Spearman’s rank correlation coefficient $\rho_s$ , or Kendall’s coefficient $\tau$, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution (X,Y ) such that X and Y are dependent) such that at least one of the nonparametric tests have better empirical power than the correlation test against this alternative.

Answer 3

1. Nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal.

set.seed(1)
library(MASS)
m <- 1000
pvalues <- replicate(m, expr = {
  Sigma = matrix(c(12,2,2,1),2,2)
  bivarite_normal=mvrnorm(n=50, rep(0, 2), Sigma)
  x=bivarite_normal[1:50,1]
  y=bivarite_normal[1:50,2]
  pearson_test <- cor.test(x,y,method = "pearson")
  spearman_test = cor.test(x,y,method = 'spearman')
  kendall_test = cor.test(x,y,method = 'kendall')
  c(pearson_test$p.value,spearman_test$p.value,kendall_test$p.value) } )

power_pearson = mean(pvalues[1,1:1000] < 0.05)
power_spearman= mean(pvalues[2,1:1000] < 0.05)
power_kendall = mean(pvalues[3,1:1000]<0.05)
power_pearson
power_spearman
power_kendall

As expected, the correlation test is more powerful than the tests based on $\rho_s$ or $\tau$ when the sampled distribution is bivariate normal.

  2. An example of an alternative (a bivariate distribution (X,Y) such that X and Y are dependent) for which at least one of the nonparametric tests has better empirical power than the correlation test. Here we take $Y = X + \varepsilon$ with heavy-tailed (Cauchy) errors, so the pair is dependent but far from bivariate normal and the rank-based tests are more robust.
set.seed(1)
m <- 1000
pvalues <- replicate(m, expr = {
  #dependent, non-normal alternative: Cauchy errors around a linear signal
  x <- rnorm(50)
  y <- x + rcauchy(50)
  pearson_test <- cor.test(x,y,method = "pearson")
  spearman_test = cor.test(x,y,method = 'spearman')
  kendall_test = cor.test(x,y,method = 'kendall')
  c(pearson_test$p.value,spearman_test$p.value,kendall_test$p.value) } 
)

power_pearson = mean(pvalues[1,1:1000] < 0.05)
power_spearman= mean(pvalues[2,1:1000] < 0.05)
power_kendall = mean(pvalues[3,1:1000]<0.05)
power_pearson
power_spearman
power_kendall

Assignment 11-02

Question 1

Compute a jackknife estimate of the bias and the standard error of the correlation statistic in Example 7.2.

Answer 1
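
The jackknife estimates computed below are
$$\widehat{\mathrm{bias}}_{jack}=(n-1)\big(\overline{\hat\theta_{(\cdot)}}-\hat\theta\big),\qquad \widehat{\mathrm{se}}_{jack}=\sqrt{\frac{n-1}{n}\sum_{i=1}^n\big(\hat\theta_{(i)}-\overline{\hat\theta_{(\cdot)}}\big)^2},$$
where $\hat\theta_{(i)}$ is the correlation computed with the $i$-th observation removed.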

library(bootstrap)
LSAT=law$LSAT
GPA=law$GPA
n = length( LSAT ) #number of observations
cor_jack = numeric( n )
cor_hat = cor( LSAT,GPA ) #correlation of LSAT and GPA
for (i in 1:n) { cor_jack[i] = cor( LSAT[-i],GPA[-i] ) }
bias_jack = (n-1)*( mean(cor_jack) - cor_hat )
cor_bar = mean(cor_jack) # mean of statistic generated from jackknife
se_jack = sqrt( (n-1)*mean( (cor_jack-cor_bar)^2 ) ) 

print(sprintf('estimate of the bias  of the correlation statistic is: %f',bias_jack ))

print(sprintf('estimate of the standard error of the correlation statistic is: %f',se_jack ))

Question 2

Refer to Exercise 7.4. Compute 95% bootstrap confidence intervals for the mean time between failures $\frac{1}{\lambda}$ by the standard normal, basic, percentile, and BCa methods. Compare the intervals and explain why they may differ.

Answer 2

library(boot)
attach(aircondit)
x = hours
B = 5000 
set.seed(1)
gaptime_hat = mean(x) #MLE of 1/lambda
#bootstrap estimate of 1/lambda
frac_1_lambda_boot = boot(data=aircondit,statistic=function(x,i){mean(x[i,])}, R=B )

frac_1_lambda_boot

We see the MLE is 1/lambda = 108.0833, with estimated std. error = 37.77646, bias = 0.2438.

Next we try to calculate the bootstrap intervals by the standard normal, basic, percentile, and BCa methods.

einf_jack = empinf(frac_1_lambda_boot, type='jack')
boot.ci( frac_1_lambda_boot, type=c('norm','basic','perc','bca'), L=einf_jack ) 

We see the intervals are quite different. A primary reason is that the bootstrap distribution is still skewed, which affects the simpler methods and their appeal to the Central Limit Theorem. For example, the histogram of the bootstrap replicates shows the skewness:

hist(frac_1_lambda_boot$t, main='', xlab=expression(1/lambda), prob=T)
points(frac_1_lambda_boot$t0, 0, pch = 19)

The BCa interval incorporates an acceleration adjustment for skew, and may be preferred here.

Question 3

Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.

Answer 3

library(bootstrap)
attach( scor )
n = length( scor[,1] )
x = as.matrix(scor)
theta_jack = numeric( n )
lambda_hat = eigen(cov(scor))$values
theta_hat = lambda_hat[1]/sum(lambda_hat) #estimate of theta from the full sample
#according to the formula, we write a function to compute theta
theta = function(x){
 eigen(cov(x))$values[1]/sum(eigen(cov(x))$values)
}

for (i in 1:n) { theta_jack[i] = theta( x[-i,] ) }

theta_bar = mean(theta_jack)
bias_jack = (n-1)*( mean(theta_jack) - theta_hat )
se_jack = sqrt( (n-1)*mean( (theta_jack-theta_bar)^2 ) )

print(sprintf('the jackknife estimates of bias of hat of theta is : %f', bias_jack))
print(sprintf('the jackknife estimates of standard error of hat of theta is : %f', se_jack))
detach(scor)

Question 4

In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.

Answer 4

The code to estimate the parameters of the four models follows. Plots of the predicted response with the data are also constructed for each model and shown below.

library(DAAG)
attach(ironslag)
a <- seq(10, 40, .1) #sequence for plotting fits

L1 <- lm(magnetic ~ chemical)
plot(chemical, magnetic, main="Linear", pch=16)
yhat1 <- L1$coef[1] + L1$coef[2] * a
lines(a, yhat1, lwd=2)


L2 <- lm(magnetic ~ chemical + I(chemical^2))
plot(chemical, magnetic, main="Quadratic", pch=16)
yhat2 <- L2$coef[1] + L2$coef[2] * a + L2$coef[3] * a^2
lines(a, yhat2, lwd=2)


L3 <- lm(log(magnetic) ~ chemical)
plot(chemical, magnetic, main="Exponential", pch=16)
logyhat3 <- L3$coef[1] + L3$coef[2] * a
yhat3 <- exp(logyhat3)
lines(a, yhat3, lwd=2)


L4 <- lm(log(magnetic) ~ log(chemical))
plot(log(chemical), log(magnetic), main="Log-Log", pch=16)
logyhat4 <- L4$coef[1] + L4$coef[2] * log(a)
lines(log(a), logyhat4, lwd=2)

Once the model is estimated, we want to assess the fit. Cross validation can be used to estimate the prediction errors.

n <- length(magnetic) 
e1 <- e2 <- e3 <- e4 <- numeric(n)
k=1
# fit models on leave-two-out samples: leave out observations k and k+1,
# then record the prediction error for both left-out points

while (k <= n) {
    out <- c(k, min(k+1, n)) #for odd n the last observation is left out alone
    y <- magnetic[-out]
    x <- chemical[-out]

    J1 <- lm(y ~ x)
    yhat1 <- J1$coef[1] + J1$coef[2] * chemical[out]
    e1[out] <- magnetic[out] - yhat1

    J2 <- lm(y ~ x + I(x^2))
    yhat2 <- J2$coef[1] + J2$coef[2] * chemical[out] +
    J2$coef[3] * chemical[out]^2
    e2[out] <- magnetic[out] - yhat2

    J3 <- lm(log(y) ~ x)
    logyhat3 <- J3$coef[1] + J3$coef[2] * chemical[out]
    yhat3 <- exp(logyhat3)
    e3[out] <- magnetic[out] - yhat3

    J4 <- lm(log(y) ~ log(x))
    logyhat4 <- J4$coef[1] + J4$coef[2] * log(chemical[out])
    yhat4 <- exp(logyhat4)
    e4[out] <- magnetic[out] - yhat4

    k=k+2
}

The following estimates of prediction error are obtained from leave-two-out cross validation.

c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2))

According to the prediction error criterion, Model 2, the quadratic model, would be the best fit for the data.

L2

The fitted regression equation for Model 2 is $$ \hat{Y} = 24.49262 - 1.39334X + 0.05452X^2$$

par(mfrow = c(1, 2)) #layout for graphs
plot(L2$fit, L2$res) #residuals vs fitted values
abline(0, 0) #reference line
qqnorm(L2$res) #normal probability plot
qqline(L2$res) #reference line
par(mfrow = c(1, 1)) #restore display

Assignment 11-16

Question 1

Implement the two-sample Cramér-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.

Answer 1

1. Initialize parameters.

set.seed(1)
library(cramer)
attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
R <- 99 #number of replicates
z <- c(x, y) #pooled sample
K <- 1:26
W_2_hat =numeric(R) #storage for replicates

2. Generate permutation samples and calculate the p-value by the permutation method.

W_2_0 = cramer.test(x,y)$statistic
for (i in 1:R) {
    #generate indices k for the first sample
    k <- sample(K, size = 14, replace = FALSE)
    x1 <- z[k]
    y1 <- z[-k] #complement of x1
    W_2_hat[i] = cramer.test(x1,y1)$statistic
}
p <- mean(abs(c(W_2_0, W_2_hat)) >= W_2_0)
round(c(p,cramer.test(x,y)$p.value),3)

3. Draw the histogram.

hist(W_2_hat, main = "", freq = FALSE, xlab = "W_2_hat ",
breaks = "scott")
points(W_2_0, 0, cex = 1, pch = 16) 
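
The permutation test above relies on the statistic from the cramer package. A minimal sketch of the two-sample Cramér-von Mises statistic computed directly from the empirical cdfs (following its usual definition) could replace cramer.test(...)$statistic in the loop; the helper name cvm_statistic is introduced here for illustration.

cvm_statistic <- function(x, y) {
    n <- length(x); m <- length(y)
    Fx <- ecdf(x); Gy <- ecdf(y)
    #W2 = nm/(n+m)^2 * [ sum_i (F(x_i)-G(x_i))^2 + sum_j (F(y_j)-G(y_j))^2 ]
    n * m / (n + m)^2 *
        (sum((Fx(x) - Gy(x))^2) + sum((Fx(y) - Gy(y))^2))
}
cvm_statistic(x, y)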

Question 2

Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 values of the chain, and compare the deciles of the generated observations with the deciles of the standard Cauchy distribution (see qcauchy or qt with df=1). Recall that a Cauchy($\theta$,$\eta$) distribution has density function $$ f(x)=\frac{1}{\theta \pi (1+[(x-\eta)/\theta]^2)},\qquad -\infty<x<\infty,\ \theta>0. $$ The standard Cauchy has the Cauchy($\theta = 1$, $\eta = 0$) density. (Note that the standard Cauchy density is equal to the Student t density with one degree of freedom.)

Answer 2

1. We use a random-walk normal proposal $N(x_t, \sigma)$; it is symmetric, so the proposal densities cancel in the acceptance ratio, although they are kept explicitly in the code below.

generate_cauthy_Metropolis_Hastings = function(n, sigma, x0, N) {
  # n: degree of freedom of t distribution
  # sigma: standard deviation of the proposal distribution N(xt, sigma)
  # x0: initial value
  # N: size of random numbers required.
  x <- numeric(N)
  x[1] <- x0
  u <- runif(N)
  k <- 0
  for (i in 2:N) {
    y <- rnorm(1, x[i-1], sigma)
    if (u[i] <= ( (dt(y, n)*dnorm(x[i-1],y,sigma))/(dt(x[i-1], n)*dnorm(y,x[i-1],sigma)))){
        x[i] <- y
    }else {
      x[i] <- x[i-1]
      k <- k + 1
    }
  }
  return(list(x=x, k=k))
}

n <- 1  #degrees of freedom for target Student t dist.
N <- 5000


x0 <- 10 #initial value
cauthy_distribution_random_varibl= generate_cauthy_Metropolis_Hastings(n,2,x0,N)
x=cauthy_distribution_random_varibl$x
k=cauthy_distribution_random_varibl$k
#number of candidate points rejected
print(sprintf('rate of candidate points rejected: %f',k/N))

2. Graph the results.

refline <- qt(c(.025, .975), df=n)
plot(x, type="l",xlab=bquote(sigma == 2),ylab="X", ylim=range(x))
abline(h=refline)

par(mfrow=c(1,2))
qqnorm(x) #normal Q-Q plot of the chain (sigma = 2)
hist(x,xlab="X",main="sigma=2") #histogram of the chain (sigma = 2)
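
The exercise also asks to discard the first 1000 values of the chain and compare deciles with those of the standard Cauchy; a short check:

burned <- x[1001:N] #discard the burn-in sample
p <- seq(0.1, 0.9, 0.1)
round(rbind(observed = quantile(burned, p), theoretical = qcauchy(p)), 3)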

Assignment 11-23

Question 1

For exercise 9.6, use the Gelman-Rubin method to monitor convergence of the chain, and run the chain until the chain has converged approximately to the target distribution according to $\hat{R} < 1.2$

Answer 1

1. Compute the joint probability of the observed multinomial counts as a function of theta.

set.seed(1)
joint_probablity <- function( theta,x ) {
    probablity=(2+theta)^x[1] * (1-theta)^(x[2]+x[3]) * theta^x[4]
    return(probablity)
}

2.Write a function to generate a Markov chain of theta.

theta_chain <- function(n) {
    xdata = c( 125, 18, 20, 34 ) #observed multinom. data
    theta = numeric(n)
    theta[1] = runif(1) #initialize: sample from prior on theta
    k = 0
    u = runif(n)

    for (t in 2:n) {
        xt = theta[t-1]
        alpha = xt/(1-xt)
        y <- rbeta(1, shape1=alpha, shape2=1 )
        numer = joint_probablity( y,xdata ) * dbeta( xt, y/(1-y), 1)
        denom = joint_probablity( xt,xdata ) * dbeta( y, alpha, 1)
        if ( u[t] <= numer/denom )
            theta[t] = y else {
            theta[t] = theta[t-1]
            k = k + 1
        } 
    } 
    return(theta)
}
  3. Write a function to compute the Gelman-Rubin statistic.
Gelman.Rubin <- function(psi) {
    # psi[i,j] is the statistic psi(X[i,1:j])
    # for chain in i-th row of X
    psi <- as.matrix(psi)
    n <- ncol(psi)
    k <- nrow(psi)
    psi.means <- rowMeans(psi) #row means
    B <- n * var(psi.means) #between variance est.
    psi.w <- apply(psi, 1, "var") #within variances
    W <- mean(psi.w) #within est.
    v.hat <- W*(n-1)/n + (B/n) #upper variance est.
    r.hat <- v.hat / W #G-R statistic
    return(r.hat)
}
k <- 4 #number of chains to generate
n <- 8000 #length of chains
b <- 1000 #burn-in length


#generate the chains
X <- matrix(0, nrow=k, ncol=n)
for (i in 1:k)
    X[i, ] <- theta_chain(n)

#compute diagnostic statistics
psi <- t(apply(X, 1, cumsum))
for (i in 1:nrow(psi))
    psi[i,] <- psi[i,] / (1:ncol(psi))
print(Gelman.Rubin(psi))

Since the Gelman-Rubin statistic is less than 1.2, the chains of theta have converged approximately to the target distribution according to the criterion $\hat{R}<1.2$.

Plot the running means (psi) of the four chains and the sequence of $\hat{R}$ statistics.

#plot psi for the four chains
par(mfrow=c(2,2))
for (i in 1:k)
    plot(psi[i, (b+1):n], type="l",
        xlab=i, ylab=bquote(psi))
par(mfrow=c(1,1)) #restore default

#plot the sequence of R-hat statistics
rhat <- rep(0, n)
for (j in (b+1):n)
    rhat[j] <- Gelman.Rubin(psi[,1:j])
plot(rhat[(b+1):n], type="l", xlab="", ylab="R")
abline(h=1.2, lty=2)

Question 2

Find the intersection points $A(k)$ in $(0,\sqrt{k})$ of the curves $$S_{k-1}(a)=P\left(t(k-1)>\sqrt{\frac{a^2(k-1)}{k-a^2}}\right)$$ and $$S_{k}(a)=P\left(t(k)>\sqrt{\frac{a^2k}{k+1-a^2}}\right)$$ for $k = 4:25,100,500,1000$, where $t(k)$ is a Student t random variable with k degrees of freedom. (These intersection points determine the critical values for a t-test for scale-mixture errors proposed by Székely [260].)

Answer 2

  1. Define a function that computes the difference $S_k(a) - S_{k-1}(a)$.
k = c( 4:25, 100, 500, 1000 )
object = function( a, df ){
    num_1 = sqrt( (a^2)*(df-1)/(df - a^2) )
    S_kminus1 = pt( q=num_1, df=df-1, lower.tail = FALSE)
    num_2 = sqrt( (a^2)*df/(df + 1 - a^2) )
    S_k = pt( q=num_2, df=df, lower.tail = FALSE)

    return( S_k - S_kminus1 )
} 
  2. Draw graphs to locate the interval containing the root approximately.
par(mfrow=c(2,5))
for (i in k) {
    a=seq(1,2,0.01)
    plot(a,object(a,i))
}
  3. According to the graphs above, the root lies in (1,2) for each k, so we can find it with uniroot.
for ( i in 1:length(k) ) {
    root=uniroot(object, lower=1, upper=2, df=k[i])$root
    print( c( k[i], root ) )
}

Assignment 11-30

Question 1

Write a function to compute the cdf of the Cauchy distribution, which has density $$\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)}, \quad -\infty < x < \infty, $$ where $\theta>0$. Compare your results to the results from the R function pcauchy. (Also see the source code in pcauchy.c.)

Answer 1

  1. Write a function that computes the Cauchy cdf by numerical integration.
cauchy_distribution = function(q, eta=0, theta=1) {
    cauthy_density = function(x,eta,theta){ 
        1/(theta*pi*(1 + ((x-eta)/theta)^2))
    }
    result = integrate( f=cauthy_density,lower=-Inf, upper=q,
                        rel.tol=.Machine$double.eps^0.25,
                        eta=eta, theta=theta )

    return( result$value )
}  
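
For reference, the Cauchy cdf also has a closed form,
$$F(x) = \frac{1}{2} + \frac{1}{\pi}\arctan\!\left(\frac{x-\eta}{\theta}\right),$$
so an alternative that avoids numerical integration is:

cauchy_cdf_closed = function(q, eta=0, theta=1) {
    0.5 + atan((q - eta)/theta)/pi #closed-form Cauchy cdf
}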

2. Compare with the standard Cauchy distribution:

x = matrix( seq(-5,5), ncol=1 )
cbind( x, apply(x, MARGIN=1, FUN=cauchy_distribution), pcauchy(x) ) 
a=apply(x, MARGIN=1, FUN=cauchy_distribution)

Compare with the Cauchy distribution with a different eta (location) parameter.

cbind( x, apply( x, MARGIN=1, FUN=cauchy_distribution, eta=2 ),pcauchy(x,location=2) ) 

Compare with the Cauchy distribution with a different theta (scale) parameter.

cbind( x, apply( x, MARGIN=1, FUN=cauchy_distribution, theta=2 ),pcauchy(x,scale=2) ) 

According to the results above, the cdf computed by our function agrees with pcauchy.

Question 2

A-B-O blood type problem Let the three alleles be A, B, and O.

dat <- rbind(Genotype=c('AA','BB','OO','AO','BO','AB','Sum'),
             Frequency=c('p2','q2','r2','2pr','2qr','2pq','1'),
             Count=c('nAA','nBB','nOO','nAO','nBO','nAB','n'))
knitr::kable(dat)

Observed data: $n_{A\cdot}=n_{AA}+n_{AO}=28$ (A-type), $n_{B\cdot}=n_{BB}+n_{BO}=24$ (B-type), $n_{OO}=41$ (O-type), $n_{AB}=70$ (AB-type). Use EM algorithm to solve MLE of $p$ and $q$ (consider missing data $n_{AA}$ and $n_{BB}$). Record the maximum likelihood values in M-steps, are they increasing?

Answer 2

Complete data log-likelihood function.

$$\log L(x,p)= n_{AA}\log(p_A^2)+n_{AO}\log(2p_Ap_O) + n_{AB}\log(2p_Ap_B) + n_{BB}\log(p_B^2) + n_{BO}\log(2p_Bp_O) + n_{OO}\log(p_O^2)$$

  1. Initialize the parameters.

x = c(28, 24, 41,70)
n = rep(0,6)
p = rep(1/3,3)
m = 20
  2. Compute the expected counts (E-step).
expectation = function(x,p){
    n.aa = (x[1]*(p[1]^2))/((p[1]^2)+2*p[1]*p[3])
    n.ao = (2*x[1]*p[1]*p[3])/((p[1]^2)+2*p[1]*p[3])
    n.bb = (x[2]*(p[2]^2))/((p[2]^2)+2*p[2]*p[3])
    n.bo = (2*x[2]*p[2]*p[3])/((p[2]^2)+2*p[2]*p[3])

    n = c(n.aa,n.ao,n.bb,n.bo,x[3],x[4])
    return(n)
}
  3. Maximize the expected complete-data log-likelihood (M-step).
maximization = function(x,n){
    # n = (nAA, nAO, nBB, nBO, nOO, nAB): count each allele over all genotypes
    p.a = (2*n[1]+n[2]+n[6])/(2*sum(x))
    p.b = (2*n[3]+n[4]+n[6])/(2*sum(x))
    p.o = (2*n[5]+n[2]+n[4])/(2*sum(x))
    p = c(p.a,p.b,p.o)
    return(p)
}
  4. Run the EM iterations and record the likelihood value at each M-step.
record_mle= numeric(m)
for(i in 1:m){
    n = expectation(x,p)
    p = maximization(x,n)
    # complete-data likelihood at the expected counts n and the updated p
    record_mle[i]=(p[1]^(2*n[1]))*((2*p[1]*p[3])^(n[2]))*(p[2]^(2*n[3]))*((2*p[2]*p[3])^(n[4]))*(p[3]^(2*n[5]))*((2*p[1]*p[2])^n[6])
}


p

record_mle
plot(record_mle)
lines(record_mle)
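
As an optional check, we can also track the observed-data log-likelihood, which EM theory guarantees to be non-decreasing; the helper names loglik_obs, p0 and n0 are introduced here for illustration, and the E- and M-step functions above are reused from a fresh start.

loglik_obs <- function(p, x) {
    # observed-data log-likelihood for counts (nA., nB., nOO, nAB)
    x[1]*log(p[1]^2 + 2*p[1]*p[3]) + x[2]*log(p[2]^2 + 2*p[2]*p[3]) +
        x[3]*log(p[3]^2) + x[4]*log(2*p[1]*p[2])
}
p0 <- rep(1/3, 3)
loglik_trace <- numeric(m)
for (i in 1:m) {
    n0 <- expectation(x, p0)
    p0 <- maximization(x, n0)
    loglik_trace[i] <- loglik_obs(p0, x)
}
plot(loglik_trace, type = "b")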

Assignment 12-07

Question 1

  1. Use both for loops and lapply() to fit linear models to the mtcars using the formulas stored in this list:
formulas <- list(
    mpg ~ disp,
    mpg ~ I(1 / disp),
    mpg ~ disp + wt,
    mpg ~ I(1 / disp) + wt
)

Answer 1

1. Use for loops to fit linear models.

lm_forloop <- vector("list", length(formulas))
for (i in seq_along(formulas)){
    lm_forloop[[i]] <- lm(formulas[[i]], data = mtcars)
}

lm_forloop

2. Use lapply() to fit linear models.

lm_lapply <- lapply(formulas, function(x) lm(formula = x, data = mtcars))

lm_lapply

Question 2

  1. Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply() . Can you do it without an anonymous function?
bootstraps <- lapply(1:10, function(i) {
rows <- sample(1:nrow(mtcars), rep = TRUE)
mtcars[rows, ]
})

Answer 2

1. Fit the model mpg ~ disp by using lapply().

lapply_fit <- lapply(bootstraps, function(x) lm(mpg~disp, data=x))

lapply_fit

2. Fit the model mpg ~ disp by using a for loop.

for_loop_fit <- vector("list", length(bootstraps))
for (i in seq_along(bootstraps)){
  for_loop_fit[[i]] <- lm(mpg ~ disp, data = bootstraps[[i]])
}

for_loop_fit
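
To answer the extra part of the question (without an anonymous function), we can rely on R's argument matching: each bootstrap data frame is passed positionally to lm's data argument because formula is supplied by name.

lapply_fit_noanon <- lapply(bootstraps, lm, formula = mpg ~ disp)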

Question 3

For each model in the previous two exercises, extract $R^2$ using the function below.

rsq <- function(mod) summary(mod)$r.squared

Answer 3

lapply(formulas, function(x) rsq(lm(formula=x, data=mtcars)))

lapply(bootstraps, function(x) rsq(lm(mpg~disp, data=x)))
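
Equivalently, rsq can be applied directly to the model objects already fitted in the previous two answers:

sapply(lm_lapply, rsq)
sapply(lapply_fit, rsq)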

Question 4

  1. The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.
trials <- replicate(
    100,
    t.test(rpois(10, 10), rpois(7, 10)),
    simplify = FALSE
)

Extra challenge: get rid of the anonymous function by using [[ directly.

Answer 4

sapply(trials, function(x) x[["p.value"]])

sapply(trials, "[[", "p.value")

Question 5

  1. Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?

Answer 5

According to the question, we want to combine Map() and vapply() into an lapply() variant that iterates in parallel over its inputs and returns a vector (or a matrix). Map() handles the parallel iteration over several inputs, and vapply() simplifies the results into an output of known type. The code is below.

map_vapply <- function(X, FUN, FUN.VALUE, simplify = FALSE){
    out <- Map(function(x) vapply(x, FUN, FUN.VALUE), X)
    return(out)
}

The function's arguments are essentially those of Map() (the function and the inputs it iterates over) together with vapply()'s FUN.VALUE template.
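
A minimal sketch of such a variant (one reasonable reading of the exercise, with the illustrative name mv_apply): Map() iterates in parallel over all inputs, and vapply() then simplifies the list of results into a vector or matrix matching FUN.VALUE.

mv_apply <- function(f, ..., FUN.VALUE) {
    out <- Map(f, ...)               # parallel iteration over all inputs
    vapply(out, identity, FUN.VALUE) # simplify to a vector or matrix
}
# example: element-wise weighted means of two lists
xs <- list(1:5, 6:10)
ws <- list(rep(1, 5), 1:5)
mv_apply(weighted.mean, xs, ws, FUN.VALUE = numeric(1))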

Assignment 12-14

Question 1

  1. Make a faster version of chisq.test() that only computes the chi-square test statistic when the input is two numeric vectors with no missing values.

Answer 1

1. Compute the expected counts.

expected <- function(colsum, rowsum, total) {
    (colsum/total)*(rowsum/total)*total
}

2. Compute the chi-square statistic contributions.

chi_stat <- function(observed, expected) {
    ((observed-expected)^2)/expected
}
  3. Define our own chisq_test function.
chisq_test <- function(x, y) {
    total <- sum(x) + sum(y)
    rowsum_1 <- sum(x)
    rowsum_2 <- sum(y)
    colsum = x+y
    expected_1=mapply(expected, colsum,rowsum_1, total)
    expected_2=mapply(expected, colsum,rowsum_2, total)
    chi_1=mapply(chi_stat, x,expected_1)
    chi_2=mapply(chi_stat, y,expected_2)
    chistat=sum(chi_1+chi_2)

    return(chistat)
}

print(chisq_test(c(13,27,19,10), c(15,17,21,27)))

print(chisq.test(c(13,27,19,10), c(15,17,21,27)))

print(microbenchmark::microbenchmark(
    chisq_test(c(13,27,19,10), c(15,17,21,27)),
    chisq.test(c(13,27,19,10), c(15,17,21,27))
))

The warning message comes from chisq.test, whose chi-squared approximation is questionable for these small counts. As the benchmark shows, our chisq_test is clearly faster than chisq.test.

Question 2

Can you make a faster version of table() for the case of an input of two integer vectors with no missing values? Can you use it to speed up your chi-square test?

Answer 2

table_fast <- function(x, y) {
    x_num <- unique(x)
    y_num <- unique(y)
    mat <- matrix(0, length(x_num), length(y_num))
    for (i in seq_along(x)) {
        mat[which(x_num == x[[i]]), which(y_num == y[[i]])] <-
        mat[which(x_num == x[[i]]),  which(y_num == y[[i]])] + 1
    }
    dimnames <- list(x_num, y_num)
    tab <- array(mat, dim = dim(mat), dimnames = dimnames)
    class(tab) <- "table"
    return(tab)
}

a=c(3,7,15)
microbenchmark::microbenchmark(table(a,a), table_fast(a, a))

Apparently, table_fast() is considerably faster than table() for this case.
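
To use it to speed up a chi-square test, a minimal sketch (assuming the two integer vectors are paired observations to be cross-tabulated; the name chisq_from_table is introduced here for illustration) builds the contingency table with table_fast and computes the statistic from the expected counts:

chisq_from_table <- function(x, y) {
    tab <- table_fast(x, y)
    E <- outer(rowSums(tab), colSums(tab)) / sum(tab) # expected counts
    sum((tab - E)^2 / E)                              # chi-square statistic
}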


