In zhangmanustc/StatComp18024: Three functions for StatComp

Overview

StatComp18024 is a simple R package developed to present the homework of statistic computing course by author 18024.

Homework 1-2018/09/14

Question 1

create examples of time series by ts.

Answer 1

ts(1:10,start=1959)
ts(1:47,frequency=12,start=c(1959,2))
ts(1:10,frequency=4,start=c(1959,2))
ts(matrix(rpois(36,5),12,3),start=c(1961,1),frequency=12)

Question 2

Graphic segmentation.

Answer 2

layout(matrix(1:4,2,2))
layout.show(4)
layout(matrix(1:6,3,2))
layout.show(6)
layout(matrix(1:6,2,3))
layout.show(6)
m<-matrix(c(1:3,3),2,2)
layout(m)
layout.show(3)
m<-matrix(1:4,2,2)
layout(m,widths=c(1,3),heights=c(3,1))
layout.show(4)
m<-matrix(c(1,1,2,1),2,2)
layout(m,widths=c(2,1),heights=c(1,2))
layout.show(2)
m<-matrix(0:3,2,2)
layout(m,c(1,3),c(1,3))
layout.show(3)

Question 3

Simple analysis of variance.

Answer 3

data(InsectSprays)
InsectSprays
aov.spray<-aov(sqrt(count)~spray,data=InsectSprays)
aov.spray<-aov(sqrt(InsectSprays$count)~InsectSprays$spray)
aov.spray<-aov(sqrt(InsectSprays[,1])~InsectSprays[,2])
aov.spray
summary(aov.spray)
opar<-par()
par(mfcol=c(2,2))
plot(aov.spray)
termplot(aov.spray,se=TRUE,partial.resid=TRUE,rug=TRUE)

Question 4

Drawing of 10 pairs of simple random values for flat graphics.

Answer 4

x<-rnorm(10)
y<-rnorm(10)
plot(x,y)
plot(x,y,xlab="Ten random values",ylab="Ten other values",xlim=c(-2,2),ylim=c(-2,2),pch=22,col="red",bg="yellow",bty="l",tcl=0.4,main="How to customize a plot with R",las=1,cex=1.5)
opar<-par()
par(bg="lightyellow",col.axis="blue",mar=c(4,4,2.5,0.25))
plot(x,y,xlab="Ten random values",ylab="Ten other values",xlim=c(-2,2),ylim=c(-2,2),pch=22,col="red",bg="yellow",bty="l",tcl=-.25,las=1,cex=1.5)
title("How to customize a plot with R(bis)",font.main=3,adj=1)

opar<-par()
par(bg="lightgray",mar=c(2.5,1.5,2.5,0.25))
plot(x,y,type="n",xlab="",ylab="",xlim=c(-2,2),ylim=c(-2,2),xaxt="n",yaxt="n")
rect(-3,-3,3,3,col="cornsilk")
points(x,y,pch=10,col="red",cex=2)
axis(side=1,c(-2,0,2),tcl=-0.2,labels=FALSE)
axis(side=2,-1:1,tcl=-0.2,labels=FALSE)
title("How to customize a plot with R(ter)",font.main=4,adj=1,cex.main=1)
mtext("Ten random values",side=1,line=1,at=1,cex=0.9,font=3)
mtext("Ten other values",line=0.5,at=-1.8,cex=0.9,font=3)
mtext(c(-2,0,2),side=1,las=1,at=c(-2,0,2),line=0.3,col="blue",cex=0.9)
mtext(-1:1,side=2,las=1,at=-1:1,line=0.2,col="blue",cex=0.9)

Question 5

Generating rule sequence.

Answer 5

x<-1:30
1:10-1
1:(10-1)
seq(1,5,0.5)
seq(length=9,from=1,to=5)
c(1,1.5,2,2.5,3,3.5,4,4.5,5)
rep(1,30)
sequence(4:5)
sequence(c(10,5))
gl(3,5)
gl(3,5,length=30)
gl(2,6,label=c("Male","Female"))
gl(2,10)
gl(2,1,length=20)
gl(2,2,length=20)
expand.grid(h=c(60,80),w=c(100,300),sex=c("Male","Female"))

Homework 2-2018/09/21

Question 1

A discrete random variable X has probability mass function: $p(x=0)=0.1;p(x=1)=0.2;p(x=2)=0.2;p(x=3)=0.2;p(x=4)=0.3$\qquad.Use the inverse transform method to generate a random sample of size 1000 from the distribution of X.Construct a relative frequency table and compare the empirical with the theoretical probabilities.Repeat using the R sample function.

Answer 1

Analysis:

The process of inverse transform method for discrete distribution is as following:Random variable X takes values$\cdots<x_{i-1}<x_{i}<x_{i+1}<\cdots$\qquad; Inverse transformation:$F_X^{-1}(u)=x_{i}$\qquad for $F_{X}(x_{i-1})<u\leq F_{X}(x_{i})$\qquad

x<-c(0,1,2,3,4)#generate vector x=(0,1,2,3,4),let x take the value of 0,1,2,3,4
p<-c(0.1,0.2,0.2,0.2,0.3)#generate vector p=(0.1,0.2,0.2,0.2,0.3),Let x get 0,1,2,3,4 with probability of 0.1,0.2,0.2,0.2,0.3 respectively
cp<-cumsum(p)#returns to the cumulative sums of vector p
m<-1e3#m is set to 1000
r<-numeric(m)
r<-x[findInterval(runif(m),cp)+1]#find the interval containing samples
r#list the observations of generated sample
ct<-as.vector(table(r))#calculate the frequency column of different values of the sample
ct/sum(ct)/p#get the ratio of frequency to probability

Analysis of the results:

The above results represent the ratio of the different frequency and probability of the generated random samples and we can see the results are close to 1, which means the empirical value is approximately equal to the theoretical value,therefore the results are reliable.

Question 2

Write a function to generate a random sample of size n from the Beta(a, b) distribution by the acceptance-rejection method. Generate a random sample of size 1000 from the Beta(3,2) distribution. Graph the histogram of the sample with the theoretical Beta(3,2) density superimposed.

Answer 2

#beta(3,2) with pdf $f(x)=12x^{2}(1-x)$\qquad,0<x<1;g(x)=1,0<x<1
n<-1e3
j<-k<-0#both initial values of j and k are assigned to 0
y<-numeric(n)
while(k<n){
  u<-runif(1)#generate random numbers U~U(0,1)
  j<-j+1
  x<-runif(1)#random variate from g

#when x equals 2/3,the function $f(x)=12x^{2}(1-x)$\qquad attain the maximum 16/9.Due to c can't be too large, or the effect will be very low,thus the value of c is set to be the maximum of f(x).In this case, the number of iterations will be reduced to 1804.

  if(27/4*x*x*(1-x)>u){   #we accept x when U<=f(Y)/(c*g(Y)),c=16/9
    k<-k+1
    y[k]<-x
  }
}
y#list the generated random sample of size 1000 from the Beta(3,2) distribution
j#(experiments)for n random numbers
hist(y,prob=TRUE,main=expression(f(x)==12*x^2*(1-x)))#Graph the histogram of the sample
y<-seq(0,1,0.01)
lines(y,12*y^2*(1-y),col="red",lwd=1.5)# add the theoretical Beta(3,2) density line

Question 3

Simulate a continuous Exponential-Gamma mixture. Suppose that the rate parameter $\Lambda$\qquad has $Gamma(r,\beta)$\qquad distribution and $Y$\qquad has $Exp(\Lambda)$\qquad distribution.That is, $(Y|\Lambda=\lambda)\sim f_Y(y|\lambda)=\lambda e^{-\lambda y}$\qquad.Generate 1000 random observations from this mixture with $r=4$\qquad and $\beta=2$\qquad.

Answer 3

n<-1e3#assign n to 1000 to generate 1000 random observations
r<-4#set the shape parameter r to 4
beta<-2#set scale parameter beta to 4
lambda<-rgamma(n,r,beta)#generate variable lambda subject to gamma distribution whose shape parameter is r,and scale parameter is beta
#generate 1000 random observations from this mixture
x<-rexp(n,rate=lambda)#generation for n observations subject to the exponential distribution with rate lambda
x#list the generated 1000 random observations

Homework 3-2018/09/28

Question 1 -exercise 5.4

Write a function to compute a Monte Carlo estimate of the Beta(3, 3) cdf,and use the function to estimate F(x) for x=0.1,0.2,...0.9.Compare the estimates with the values returned by the pbeta function in R.

Answer 1

Analysis: First,we know that if $X\sim B(a,b)$\qquad,then the density function p(x) of X is:$p(x)=1/beta(a,b)\times x^{a-1}\times (1-x)^{b-1}$\qquad,when $0<x<1$\qquad;else,$p(x)=0$\qquad. Second the cdf of X is:$F(x)=\int_{0}^{x}p(t)dt$\qquad. Third,the Monte Carlo estimating method of integration $\hat\theta=F(x)=\int_{a}^{b}g(x)dx$\qquad is as following:generate $i.i.d$\qquad samples $X_{1},\cdots,X_{m}$\qquad from $U(a,b)$\qquad;compute $\overline{g_{m}(X)}=\frac{1}{m}g(X_{i})$\qquad; $\hat\theta=(b-a)(\overline{g_{m}(X)})$\qquad.

Betacdf <- function( x, alpha=3, beta=3) {#construct cdf of Beta(3,3) by function()
m<-1e3
if ( any(x < 0) ) return (0)
stopifnot( x < 1 )#the range of x is between 0 and 1
t <- runif( m, min=0, max=x )#generate samples from U(0,x)
p<-(x-0)*(1/beta(alpha,beta)) * t^(alpha-1) * (1-t)^(beta-1)
cdf<-mean(p)#compute Monte Carlo estimate of cdf
return( min(1,cdf) )#ensure that cdf<=1
} #end function

set.seed(123)
for (i in 1:9) {
estimate<-Betacdf(i/10,3,3)#use function constructed to estimate F(x) when x=0.1,0.2,...,0.9
returnedvalues<-pbeta(i/10,3,3)#obtain values returned by the pbeta function
print( c(estimate,returnedvalues) )#compare the eatimates with returned values
} #end for loop

Analysis of results:

From the results above,the estimates(left) are very close to the actual value(right),they differ from the fourth decimal place,thus the estimation accuracy is very high.

Question 2-exercise 5.9

The Rayleigh density [156, (18.76)] is $f(x)=\frac{x}{\sigma^{2}}e^{-x^{2}/2\sigma^{2}} x\ge0,\sigma>0$\qquad. Implement a function to generate samples from a Rayleigh($\sigma$\qquad) distribution,using antithetic variables. What is the percent reduction in variance of $\frac{X+X'}{2}$\qquad,compared with $\frac{X+X'}{2}$\qquad for independent $X_{1},X_{2}$\qquad?

Answer 2

Analysis:

To implenment a function to generate samples,we can use the inverse transform method.For each n,generate $U_{j}\sim U(0,1)$\qquad,and compute $X_{i}=F_{X}^{-1}(U_{i}),i=1,\cdots,n$\qquad. The $U,V=1-U\sim U(0,1)$\qquad, but U and 1-U are negatively correlated,the $U$\qquad and $1-U$\qquad are negatively correlated. And $F(x)=\int_{0}^{x}f(t)dt=\int_{0}^{x}t/\sigma^{2}e^{-t^{2}/2\sigma^{2}}dt=1-e^{-x^{2}/(2\sigma^{2})}$\qquad, then $F(x)^{-1}=\sigma\sqrt{-2ln(1-F(x))}$\qquad, let $\sigma=1$\qquad, $X=F_{X}^{-1}(U)=\sqrt{-2ln(1-U)}=\sqrt{-2ln(V)}$\qquad, $X=F_{X}^{-1}(V)=\sqrt{-2ln(1-V)}=\sqrt{-2ln(U)}$\qquad, And $Var(\frac{X+X'}{2})=(Var(X)+Var(X')+2Cov(X,X'))/4$\qquad, if $X_{1}$\qquad and $X_{2}$\qquad are independent,then $Var(\frac{X_{1}+X_{2}}{2})=Var(X)$\qquad.

Rayleigh_distribution<- function(sigma, n) {#construct funtions to generate samples from  Rayleigh distribution
  X <- X_ <- numeric(n)
  for (i in 1:n) {
    U<- runif(n)#generate n observations from U~U(0,1)
    V<- 1-U#then V~U(0,1)
    X<- sigma*sqrt(-2*log(V))
    X_<-sigma*sqrt(-2*log(U))
    #according inverse transform from annalysis above,we obtain that $X=F_{X}^{-1}(U)=\sqrt{-2ln(1-U)}=\sqrt{-2ln(V)}$\qquad
    var1<-var(X)#if X1 and X2 are independent,then var((X1+X2)/2)=var(X)
    var2<-(var(X)+var(X_)+2*cov(X,X_))/4#compute the variance of (X+X')/2 generated by antithetic variables
    reduction <-((var1-var2)/var1)#compute the reduction in variance by using antithetic variables
  }
  percent_reduction<-paste0(format(100*reduction, format = "fg", digits = 4), "%")#set format type by function format
  return(percent_reduction)
}
set.seed(123)
sigma = 1#set the value of sigma-the scale parameter of Rayleigh distribution to 1
n <- 1000
Rayleigh_distribution(sigma, n)

Analysis of results:

We conclude that the percent reduction in variance of $\frac{X+X'}{2}$\qquad,compared with $\frac{X+X'}{2}$\qquad for independent $X_{1},X_{2}$\qquad is 97.15%,so the antithetic variables method to variance reducing is of significant effect.

Question 3-exercise 5.13

Find two importance functions $f1$\qquad and $f2$\qquad that are supported on $(1,\infty)$\qquad and are close to $\int_{1}^{\infty}\frac{x^{2}}{\sqrt{2\pi}}e^{-x^{2}/2}dx$\qquad by importance sampling?Explain.

Answer 3

Analysis:

$f1$\qquad and $f2$\qquad was selected to be $f_1(x)=e^{-x}; x>1$\qquad and $f_2(x)=4/(1+x^{2})\pi); x>1$\qquad.Due to that the variance of $\hat\theta$\qquad is $var(g(X_{i})/f({X_{i}}))/m$\qquad,when When $f$\qquad is close to $g$\qquad,even $g=cf$\qquad,the variance will obtain the minimal value 0.So we draw the figure of function $g/f1$\qquad and $g/f2$\qquad, the flatter the image is,the smaller variance in estimating the integration.

 x <- seq(1,4,0.01)#generate sequence x
    g <- (x^2 * exp(-x^2 / 2)) / sqrt(2 *pi)
    f1 <- exp(-x) #set the first importance function
    f2 <- 4 / ((1 + x^2) * pi)#set the second importance function
    gs <- c(expression(g(x)==e^{-x}/(1+x^2)),expression(f[1](x)==e^{-x}),expression(f[2](x)==4/((1+x^2)*pi)))
    #figure "ratio of g and f" including the funtion of g/f1 and g/f2
    plot(x, g/f1, type = "l", ylab = "",ylim = c(0,3.2), lwd = 2, lty = 2,col=2,main='(ratio of g and f)')
    lines(x, g/f2, lty = 3, lwd = 2,col=3)
    legend("topright", legend = gs[-1],lty = 2:3, lwd = 2, inset = 0.02,col=2:3)

Analysis of results:

According to the figure called"ratio of g and f",the green line which represents g/f2 is flatter,so the importance function $f_2(x)=4/(1+x^{2})\pi)$\qquad should produce the smaller variance in estimating.

Question 4-exercise 5.14

Obtain a Monte Carlo estimate of $\int_{1}^{\infty}\frac{x^{2}}{\sqrt{2\pi}}e^{-x^{2}/2}dx$\qquad by importance sampling.

Answer 4

Analysis:

According to exercise 5.13,f1 is constructed to be the expression of exponential distribution-$f_1(x)=e^{-x}; x>1$\qquad ,so sample x1 should be generated from exponential distribution.And f2 is constructed to be $f_2(x)=4/(1+x^{2})\pi); x>1$\qquad,so to generate samples,we can use the inverse transform method,generate $U_{j}\sim U(0,1)$\qquad,and compute $X_{i}=F_{X}^{-1}(U_{i}),i=1,\cdots,n$\qquad, according to $F(X)=\int_{1}^{\infty}(x^{2}e^{(-x^{2})/2})/\sqrt{2\pi}dt$\qquad. And according to $\hat\theta=\frac{1}{m}\sum_{1}^{m}\frac{g(X_{i})}{f(X_{i})}$\qquad,the mean and the variance of $\hat\theta$\qquad is obtained.

g <- function(x) { (x^2/ sqrt(2*pi) * exp(-x^2 / 2))* (x > 1)}#set the function g
f1 <- function(x) { exp(-x)}#set the importance funtion f1 and f2
f2 <- function(x) {  4 / ((1 + x^2) * pi)* (x >= 1)}
m <- 1e4
x1 <- rexp(m)#f1 is constructed to be the expression of exponential distribution,so sample x1 should be generated from exponential distribution
u <- runif(m) #generate samples from exponential distribution by inverse transform method
x2 <- tan(pi *(u+pi^2/16)/ 4)#obtain the function of x2 via inverse transform method
x2[which(x2 < 1)] <- 1# set the range of x2,to catch overflow errors in g(x)
ratio1<-g(x1) / f1(x1)
ratio2<-g(x2) / f2(x2)
theta.hat <- se <- numeric(2)
theta.hat<- c(mean(ratio1), mean(ratio2))#obtain a MC estimate of the integration
se <- c(sd(ratio1), sd(ratio2))#obtain  variance of the estimator 
rbind(theta.hat, se)

Analysis of results:

the MC estimate of the mean and the variance of $\theta$\qquad via importance function f1 and f2 is presented above,two means are similar,while the variance of $\theta$\qquad via f2 is smaller,which has the same explanation with exercise 5.13,importance function f2 is more efficient in variance reduction.

Homework 4-2018/10/12

Question 1 -exercise 6.9

Let X be a non-negative random variable with $\mu=E(X)<\infty$\qquad. For a random sample $x_{1},...,x_{n}$\qquad from the distribution of X, the Gini ratio is defined by $G=\frac{1}{2n^{2}\mu}\sum_{j=1}^{n}\sum_{i=1}^{n}|x_{i}-x_{j}|$\qquad The Gini ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that G can be written in terms of the order statistics $x_{i}$\qquad as $G=\frac{1}{n^{2}\mu}\sum_{i=1}^{n}(2i-n-1)_x{i}$\qquad If the mean is unknown, let $\hat{G}$qquad be the statistic G with $\mu$\qquad replaced by $\bar{x}$\qquad. Estimate by simulation the mean, median and deciles of $\hat{G}$\qquad if X is standard lognormal.Repeat the procedure for the uniform distribution and Bernoulli(0.1). Also construct density histograms of the replicates in each case.

Answer 1

Analysis:

First,generate 20 samples from standard lognormal,uniform distribution,and Bernoulli(0.1),then sort $x_{1},...,x_{n}$\qquad in increaing order,to obtain order statistics,then compute $G=\frac{1}{n^{2}\mu}\sum_{i=1}^{n}(2i-n-1)x_{i}$\qquad via samples,where n represents the number of samples,$\mu$\qquad can be replaced by $\bar{x}$\qquad.Later by 1000 MC simulations to obtain the value of $\hat{G}$\qquad,and calculate the mean, median and deciles of $\hat{G}$\qquad estimated by MC simulations.

n<-20#number of samples
m<-1e3#number of MC simulations
G1<-G2<-G3<-numeric(m)
set.seed(123)
for (j in 1:m){# Monte Carlo simulations
  x1<-sort(rlnorm(n))#sort x generated from standard lognormal
  x2<-sort(runif(n))#sort x generated from uniform distribution
  x3<-sort(rbinom(n,size=1,prob=0.1))#sort x generated from Bernoulli(0.1)
  mu1<-mean(x1);mu2<-mean(x2);mu3<-mean(x3)#mu is replaced by mean(x)
  z1<-z2<-z3<-numeric(m)
    for(i in 1:n){
      z1<-(2*i-n-1)*x1[i]
      z2<-(2*i-n-1)*x2[i]
      z3<-(2*i-n-1)*x3[i]
    }
    G1[j]<- sum(z1)/(n^2*mu1)
    G2[j]<- sum(z2)/(n^2*mu2)
    G3[j]<- sum(z3)/(n^2*mu3)#set the function of Gini ratio
}

#Estimate by simulation the mean of Ghat
meanG1<-mean(G1);meanG2<-mean(G2);meanG3<-mean(G3,na.rm = TRUE)
#Estimate by simulation the median of Ghat
medianG1<-median(G1);medianG2<-median(G2);medianG3<-median(G3,na.rm = TRUE)
#Estimate by simulation the deciles of Ghat
decileG1<-quantile(G1,  seq(0, 1, 0.1));decileG2<-quantile(G2,  seq(0, 1, 0.1))
decileG3<-quantile(G3,seq(0, 1, 0.1),na.rm = TRUE)
#print results
print(c(meanG1,medianG1));print(decileG1)#results of the mean, median and deciles of Ghat if X is standard lognormal
print(c(meanG2,medianG2));print(decileG2)#results of the mean, median and deciles of Ghat if X~U(0,1)
print(c(meanG3,medianG3));print(decileG3)#results of the mean, median and deciles of Ghat if X~Bernoulli(0.1)
hist(G1,prob=TRUE,main="x is from standard lognormal")#construct density histograms if X is standard lognormal
hist(G2,prob=TRUE,main="x is from U(0,1) distribution")#construct density histograms if X~U(0,1)
hist(G3,prob=TRUE,main="x is from Bernoulli(0.1)")#construct density histograms if X~Bernoulli(0.1)

Analysis of result:

according to the value above,we obtain that Gini ratio estimation G(X from bernoulli(0,1)>G(X from standard lognormal)>G(X from uniform),Due to tha Gini coefficient is between 0-1, and the larger the Gini coefficient, the higher the inequality,so the inequality degree is the same order.

Question 2-exercise 6.10

Construct an approximate 95% confidence interval for the Gini ratio $\gamma=E(G)$\qquad if X is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.

Answer 2

Analysis:Assuming that G population subject to normal distribution,due to that $t=\frac{\sqrt{n}(\bar{x}-\mu)}{s}~t(n-1)$\qquad,so t can be used as pivot quantity,then $u=\frac{\bar{G}-E(G)}{se(G)/\sqrt{n}}$\qquad to be pivot quantity,then the CI is obtained as[$\bar{x}-t_{1-\alpha}/2(n-1)s/\sqrt{n},\bar{x}+t_{1-\alpha}/2(n-1)s/\sqrt{n}$\qquad].

n<-20#number of samples
m<-1000#number of MC simulations
a=0;b=1#set the parameter of lognormal distribution
G<-numeric(m)
alpha <-0.05
set.seed(123)
for (j in 1:m){# Monte Carlo simulations
  x<-sort(rlnorm(n,a,b))#sort x generated from standard lognormal
  mu<-mean(x)#?? is replaced by mean(x)
  z<-numeric(m)
    for(i in 1:n){
      z<-(2*i-n-1)*x[i]
    }
    G[j]<- sum(z)/(n^2*mu)#set the function of Gini ratio

}
UCL <- mean(G)+qt(1-(alpha/2), df=n-1)*sd(G)/sqrt(n)#obtain the confidence interval upper limit
LCL <- mean(G)-qt(1-(alpha/2), df=n-1)*sd(G)/sqrt(n)#obtain the lower confidence interval
CI<-(c(LCL,UCL))#results of confidence interval
print(CI)
# Start another same simulation,and in this circle,we compare each estimation of G to judge if G is between confidence interval,and count it,then
for (j in 1:m){# Monte Carlo simulations
  x<-sort(rlnorm(n,a,b))#sort x generated from standard lognormal
  mu<-mean(x)#?? is replaced by mean(x)
  z<-numeric(m)
    for(i in 1:n){
      z<-(2*i-n-1)*x[i]
    }
    G[j]<- sum(z)/(n^2*mu)#set the function of Gini ratio
    c=0#Count parameter starts from 0
      if(G[j]<LCL)c=c
        else{
        if(G[j]>UCL)c=c
          else{
          c=c+1
    }#set conditions if G in each simulation is between confidence interval,then c=c+1
  }
  }
coverage_rate<-c/m#calculate the coverage rate=times of G falling in the interval/numbers of simulation
percent_rate<-paste0(format(100*coverage_rate, format = "fg", digits = 3), "%")#set format type by function format

Question 3-exercise 6.B

Tests for association based on Pearson product moment correlation $\rho$\qquad, Spearman's rank correlation coefficient $\rho_{s}$\qquad, or Kendall???s coefficient $\tau$\qquad, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_{s}$\qquad or $\tau$\qquad are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution (X,Y) such that X and Y are dependent) such that at least one of the nonparametric tests have better empirical power than the correlation test against this alternative.

Answer 3

library(MASS)#for mvrnorm
N <- 1e3 # Number of random samples
alpha<-0.05#set the significant level to 0.05 
p1<-p2<-p3<-numeric(N)
set.seed(123)
rho <- 0#set the correlation coefficient to be 0 due to the  null hypothesis  of the test is that the correlation coefficient to be 0
mu1 <- 0; s1 <- 1#set the mean and standard deviation of variable x
mu2 <- 0; s2 <- 1#set the mean and standard deviation of variable x
# Parameters for bivariate normal distribution
mu <- c(mu1,mu2) # Mean 
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2,2) # Covariance matrix
for(i in 1:N){
bvn<- mvrnorm(N, mu = mu, Sigma = sigma ) # generate bivariate normal variables from MASS package
x<-bvn[,1]#variable x from bivariate normal distribution
y<-bvn[,2]#variable y from bivariate normal distribution
p1[i]<-cor.test(x,y,method="spearman")$p.value#cauculate the p_value under the nonparametric tests based on Spearman???s rank correlation coefficient ??s
p2[i]<-cor.test(x,y,method="kendall")$p.value#cauculate the p_value under the nonparametric tests based on Kendall???s coefficient ??
p3[i]<-cor.test(x,y,method="pearson")$p.value#cauculate the p_value under the correlation test
}
power<-c(mean(p1<=alpha),mean(p2<=alpha),mean(p3<=alpha))#obtain the power(average rate of test that is a significant ) under three tests
print(power)

Analysis of results:

according to the value above,we know that power means the average rate of test that is a significant,and power1<power3,power2<power3(p1 represents the nonparametric tests based on ??s;p2 represents the nonparametric tests based on $\rho_{s}$\qquad;p3 represents the nonparametric tests based on $\tau$\qquad).So the nonparametric tests based on $\rho_{s}$\qquad or $\tau$\qquad are less powerful than the correlation test.

To obtain at least one of the nonparametric tests have better empirical power than the correlation test against this alternative,we can set the correlation coefficient not to 0,that is x and y are related.

library(MASS)#for mvrnorm
N <- 1e3 # Number of random samples
alpha<-0.05#set the significant level to 0.05 
p1<-p2<-p3<-numeric(N)
set.seed(123)
rho <- 0.15#set different correlation coefficient,and after many trials,0.15 is the especial example 
mu1 <- 0; s1 <-1#set the mean and standard deviation of variable x
mu2 <- 0; s2 <-1#set the mean and standard deviation of variable x
# Parameters for bivariate normal distribution
mu <- c(mu1,mu2) # Mean 
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2,2) # Covariance matrix
for(i in 1:N){
bvn<- mvrnorm(N, mu = mu, Sigma = sigma ) # generate bivariate normal variables from MASS package
x<-bvn[,1]#variable x from bivariate normal distribution
y<-bvn[,2]#variable y from bivariate normal distribution
p1[i]<-cor.test(x,y,method="spearman")$p.value#cauculate the p_value under the nonparametric tests based on Spearman???s rank correlation coefficient ??s
p2[i]<-cor.test(x,y,method="kendall")$p.value#cauculate the p_value under the nonparametric tests based on Kendall???s coefficient ??
p3[i]<-cor.test(x,y,method="pearson")$p.value#cauculate the p_value under the correlation test
}
power<-c(mean(p1<=alpha),mean(p2<=alpha),mean(p3<=alpha))#obtain the power(average rate of test that is a significant ) under three tests
print(power)

Analysis of results:

according to the value above,we know that power1>power3,power2>power3(p1 represents the nonparametric tests based on ??s;p2 represents the nonparametric tests based on $\rho_{s}$\qquad;p3 represents the nonparametric tests based on $\tau$\qquad).So when the correlation coefficient is 0.15,the nonparametric tests have better empirical power than the correlation test against this alternative.

Homework 5-2018/11/02

Question 1 -exercise 7.1

Compute a jackknife estimate of the bias and the standard error of the correlation statistic in Example 7.2.

Answer 1

Analysis:

The jacknife estimate of bias is:$\hat{bias}{jack}=(n-1)(\bar{\hat{\theta}}{(.)}-\theta)$\qquad,and the jackknife estimate of standard error is:$\hat{se}{jack}=\sqrt{\frac{n-1}{n}}\sum{i=1}^{n}(\hat{\theta}{(i)}-\bar{\hat{\theta}{(.)})^{2}}$\qquad,and to obtain the estimate of the bias and the standard error of the correlation statistic,so $\theta$\qquad=cov(LSAT, GPA).

set.seed(1)
library(bootstrap)#for law dataset
data(law,package = "bootstrap")
LSAT <- law$LSAT #the variable LAST of dataset law
GPA <- law$GPA#the variable GPA of dataset law
n<-length(LSAT)#n means the number of observed sample
theta.hat <- cor(LSAT, GPA)#the correlation statistic in Example 7.2
theta.jack <- numeric(n)
for(i in 1:n){
  theta.jack[i] <-cor(LSAT[-i], GPA[-i])#compute the jackknife replicates, leave out the ith observation xi
}
bias.jack <- (n-1)*(mean(theta.jack)-theta.hat)#jackknife estimate of bias
se.jack <- sqrt((n-1)*mean((theta.jack-mean(theta.hat))^2))#Jackknife estimate of standard error
round(c(original=theta.hat,bias=bias.jack,se=se.jack),5)#display the results of theta.hat,and the estimator of the bias and the standard error

Analysis of results:

The jacknife estimate of bias of the correlation statistic is -0.00647,and the standard errror estimate is 0.14253.

Question 2-exercise 7.5

Refer to Exercise 7.4. Compute 95% bootstrap confidence intervals for the mean time between failures $1/\lambda$\qquad by the standard normal, basic, percentile,and BCa methods. Compare the intervals and explain why they may differ.

Answer 2

Analysis:

According to exexercise 7.4,the times between failures follow an exponential model $Exp(\lambda)$\qquad with p.d.f. $f(x)=\lambda e^{-\lambda} I_{[0,\infty]}(x)$\qquad.The MLE of the hazard rate $\lambda$\qquad is the sample mean $\hat{\lambda}=1/\bar{X}$\qquad,so the statistic in 7.5 is $1/\lambda=\bar{X}$\qquad ,we can obtain the normal,basic,and percentile bootstrap confidence intervals using the boot and boot.ci functions in the boot package.

set.seed(1)
m<-1e3#number of cycles
B<-1e3#The number of bootstrap replicates
library(boot)#for boot and boot.ci function
data(aircondit,package = "boot")#for aircondi dataset
for(i in 1:m){
  boot.obj <- boot(data=aircondit,statistic=function(x,i){mean(x[i,])}, R=B )#generate R bootstrap replicates of a statistic applied to data,statistic is the mean time
  ci <- boot.ci(boot.obj,type=c("norm","basic","perc","bca"))#generates 4 different types of equi-tailed two-sided nonparametric confidence intervals
}
print(ci)

Analysis of results:

From the reults above,we compare the intervals and obtain that the CI of four types is quite different,the specific results are :normal:(37.1,181.4);basic:(31.5,172.8);Percentile:(43.3,184.7);Bca:( 56.8, 216.4 ) and the reason may be the bootstrap distribution is not unbiased, affecting the simpler methods or the sampling distribution of statisticis not close to normal.

Question 3-exercise 7.8

Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$\qquad.

Answer 3

Analysis:

According to exercise 7.7,$\theta=\frac{\hat{\lambda_{1}}}{\sum_{j=1}^{5}\hat{\lambda_{j}}}$\qquad,so the statistic is set to be $\theta$\qquad.Then the jacknife estimate of bias is $\hat{bias}{jack}=(n-1)(\bar{\hat{\theta}}{(.)}-\theta)$\qquad,and the jackknife estimate of standard error is $\hat{se}{jack}=\sqrt{\frac{n-1}{n}}\sum{i=1}^{n}(\hat{\theta}{(i)}-\bar{\hat{\theta}{(.)})^{2}}$\qquad.

library(bootstrap)#for scor dataset
n <-length( scor[,1] )#number of calculations
x<-as.matrix(scor)
theta.jack <- numeric(n)
data(scor,package = "bootstrap")#attach the data scor
theta<-function(x){
  eigen(cov(x))$values[1]/sum(eigen(cov(x))$values)
}#construct funtion theta=lambda1/sum(lambdai)
theta.hat <- theta(scor)#compute the statistic to be estimated
for(i in 1:n){
  theta.jack[i] <-theta(x[-i,])#compute the jackknife replicates, leave-one-out estimates
}
bias.jack <- (n-1)*(mean(theta.jack)-theta.hat)#jackknife estimate of bias
se.jack <- sqrt((n-1)*mean((theta.jack-mean(theta.jack))^2))#Jackknife estimate of standard error
round(c(original=theta.hat,bias=bias.jack,se=se.jack),6)#display the results of theta.hat,and the estimator of the bias and the standard error

Analysis of results:

The jacknife estimate of bias of the statistic $\hat{\theta}$\qquadd is 0.001069,and the standard errror estimate is 0.049552.

Question 4-exercise 7.11

In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.

Answer 4

Analysis:

There are several steps to model selection,The proposed models for predicting magnetic measurement (Y) from chemical measurement (X) are: 1. Linear: $Y=\beta_{0}+\beta_{1}X+\epsilon$\qquad. 2. Quadratic:$Y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\epsilon$\qquad. 3. Exponential:$log(Y)=log(\beta_{0})+\beta_{1}X+\epsilon$\qquad. 4. Log-Log:$log(Y)=\beta_{0}+\beta_{1}log(X)+\epsilon$\qquad. And Leave-two-out cross-validation involves using two observations as the validation set and the remaining observations as the training set.This is repeated on all ways to cut the original sample on a validation set of two observations and a training set.

data(ironslag, package = "DAAG")#for data ironslag
magnetic<-ironslag$magnetic#extract variable magnetic
chemical<-ironslag$chemical#extract variable chemical
n <- length(magnetic)
p<-n#set number of outer circulation 
q<-n-1#set number of inner circulation 
a<-array(dim=c(p,q))
e1 <- e2 <- e3 <- e4<-a
# for n-fold cross validation
#fit models on leave-two-out samples
for (k in 1:p) {#outer circulations from  1 to n
  y1 <- magnetic[-k]#y1 is magnetic leave yk out
  x1<- chemical[-k]#x1 is chemical leave xk out
for(l in 1:q){#inner circulations from 1 to n-1
  y <- y1[-l]#y is y1 leave yl out
  x <- x1[-l]#x is x1 leave xl out
#model selection:cross validation
#1.linear model and estiamtion of prediction error
  J1 <- lm(y ~ x)
  yhat1 <- J1$coef[1] + J1$coef[2] * x1[l]#mutatis mutandis u for chemical
  e1[k,l] <- y1[l] - yhat1#mutatis mutandis z for magnetic
#2.Quadratic model and estiamtion of prediction error
  J2 <- lm(y ~ x + I(x^2))
  yhat2 <- J2$coef[1] + J2$coef[2] * x1[l] +J2$coef[3] *x1[l]^2
  e2[k,l] <- y1[l] - yhat2
#3.Exponential model and estiamtion of prediction error
  J3 <- lm(log(y) ~ x)
  logyhat3 <- J3$coef[1] + J3$coef[2] * x1[l]
  yhat3 <- exp(logyhat3)
  e3[k,l] <- y1[l] - yhat3
#4.log-log model and estiamtion of prediction error
  J4 <- lm(log(y) ~ log(x))
  logyhat4 <- J4$coef[1] + J4$coef[2] * log(x1[l])
  yhat4 <- exp(logyhat4)
  e4[k,l] <- y1[l] - yhat4
}
}
c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2))#estimates for prediction error

Analysis of results:

according to results above,we know that the prediction error of model J2 is the minimum to be 17.87018,according to the prediction error criterion ,by using leave-two-out cross validation,the model J2-Quadratic model would be the best fit for the data.

Homework 6-2018/11/16

Question 1 -exercise 8.1

Implement the two-sample Cramer-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.

Answer 1

Analysis:

In Example 8.1 the means of the soybean and linseed groups were compared.Suppose now that we are interested in testing for any type of difference in the two groups.

The hypotheses of interest are $H0 : F = G$\qquad vs $H1 : F \neq G$\qquad, where F is the distribution of weight of chicks fed soybean supplements and G is the distribution of weight of chicks fed linseed supplements.

The Cramer-von Mises statistic, which estimates the integrated squared distance between the distributions, is defined by $W_{2}=\frac{mn}{(m+n)^{2}}[\sum_{i=1}^{n}(F_{n}(x_{i})-G_{m}(x_{i}))^{2}+\sum_{j=1}^{m}(F_{n}(y_{j})-G_{m}(y_{j}))^{2}]$\qquad where $F_{n}$\qquad is the ecdf of the sample $x1,\dots, xn$\qquad and $G_{m}$\qquad is the ecdf of the sample $y1,\dots,ym$\qquad.

set.seed(1)
attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
detach(chickwts)


R <- 999 #number of replicates

z <- c(x, y) #pooled sample

W <-k<- numeric(R) #storage for replicates
m<-length(x)
n<-length(y)
N<-length(z)
u<-numeric(N)

cvm<-function(x1,y1){#set the statistic function
  Fn<-ecdf(x1)#Empirical cumulative distribution function of x
  Gm<-ecdf(y1)#Empirical cumulative distribution function of y
  z1<-c(x1,y1)
for(i in 1:N){
  u[i]<-(Fn(z1[i])-Gm(z1[i]))^2
}
m*n/(m+n)^2*sum(u)
}
W0 <- cvm(x, y)#statistic
for (i in 1:R) {
#generate indices k for the first sample
k <- sample(1:N, size = m, replace = FALSE)
x1 <- z[k]
y1 <- z[-k] #complement of x1
W[i] <- cvm(x1, y1)
}
p <- mean(c(W0, W) >= W0)#calculate p value
p

Analysis of results:

According to the results above,the p_value is 0.421>0.05 which does not support the alternative hypothesis that distributions differ,so the we accept the null hypothesis.

Question 2

Design experiments for evaluating the performance of the NN,energy, and ball methods in various situations.

1.Unequal variances and equal expectations

2.Unequal variances and unequal expectations

3.Non-normal distributions: t distribution with 1 df (heavy-taileddistribution), bimodel distribution (mixture of two normal distributions)

4.Unbalanced samples (say, 1 case versus 10 controls) Note: The parameters should be choosen such that the powers are distinguishable (say, range from 0.3 to 0.9).

Answer 2

Analysis:

1.For situation unequal variances and equal expectations,we set $X\sim N(0,1),Y\sim N(0,2)$\qquad.

2.For situation unequal variances and unequal expectations,we set $X\sim N(0,1),Y\sim N(1,2)$\qquad.

3.For situation Non-normal distributions: t distribution with 1 df (heavy-taileddistribution), bimodel distribution (mixture of two normal distributions),we set $X\sim t(1),Y\sim N(1,2)$\qquad.

4.For situation Unbalanced samples (say, 1 case versus 10 controls),we set n1=10,n2=100.

set.seed(1)
library(RANN) # implementing a fast algorithm
              # for locating nearest neighbors 
              # (alternative R package: "yaImpute")
library(boot)
library(Ball)
library(energy)#for enenrgy test
library(devtools)#for Ball statistic
attach(chickwts) # chicken weights for various feed methods
x <- as.vector(weight[feed == "sunflower"]) 
y <- as.vector(weight[feed == "linseed"]) 
detach(chickwts)
z <- c(x, y)
m <- 1e2; k<-3; p<-2;n1 <- n2 <- 20#size for sample 1 and 2
R<-999; N = c(n1,n2)
Tn <- function(z, ix, sizes,k) {
  n1 <- sizes[1]
  n2 <- sizes[2]
  n <- n1 + n2
  if(is.vector(z)) z <- data.frame(z,0);
  z <- z[ix, ];
  NN <- nn2(data=z, k=k+1) # what's the first column?
  block1 <- NN$nn.idx[1:n1,-1] 
  block2 <- NN$nn.idx[(n1+1):n,-1] 
  i1 <- sum(block1 < n1 + .5); i2 <- sum(block2 > n1+.5) 
  (i1 + i2) / (k * n)
}

eqdist.nn <- function(z,sizes,k){
  boot.obj <- boot(data=z,statistic=Tn,R=R,
  sim = "permutation", sizes = sizes,k=k)
  ts <- c(boot.obj$t0,boot.obj$t)
  p.value <- mean(ts>=ts[1])
  list(statistic=ts[1],p.value=p.value)
}
p.values <- matrix(NA,m,3)

#Unequal variances and equal expectations

for(i in 1:m){
  x <- matrix(rnorm(n1*p,0,1),ncol=p);
  y <- matrix(rnorm(n2*p,0,2),ncol=p);
  z <- rbind(x,y)
  p.values[i,1] <- eqdist.nn(z,N,k)$p.value#performance of NN method
  p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value#performance of energy method
  p.values[i,3] <- bd.test(x=x,y=y,R=999,seed=i*12345)$p.value#performance of ball method
}
alpha <- 0.05                        #confidence level 
pow <- colMeans(p.values<alpha)
print(pow)



#Unequal variances and unequal expectations
for(i in 1:m){
  x <- matrix(rnorm(n1*p),ncol=p);
  y <-cbind(rnorm(n2,0,2),rnorm(n2,1,1))
  z <- rbind(x,y)
  p.values[i,1] <- eqdist.nn(z,N,k)$p.value#performance of NN method
  p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value#performance of energy method
  p.values[i,3] <- bd.test(x=x,y=y,R=999,seed=i*12345)$p.value#performance of ball method
}
alpha <- 0.05                        #confidence level 
pow <- colMeans(p.values<alpha)
print(pow)

#Non-normal distributions: t distribution with 1 df (heavy-taileddistribution), bimodel distribution (mixture of two normal distributions)
m <- 1e2
n1 <- n2 <- 20
p<-2
for(i in 1:m){
  x <- matrix(rt(n1*p,1),ncol=p);
  y <-cbind(rnorm(n2,0,1),rnorm(n2,0,1))
  z <- rbind(x,y)
  p.values[i,1] <- eqdist.nn(z,N,k)$p.value#performance of NN method
  p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value#performance of energy method
  p.values[i,3] <- bd.test(x=x,y=y,R=999,seed=i*12345)$p.value#performance of ball method
}
alpha <- 0.05                        #confidence level 
pow <- colMeans(p.values<alpha)
print(pow)



#Unbalanced samples (say, 1 case versus 10 controls)
n1<-10;n2<-100;N=c(n1,n2)
for(i in 1:m){
  x <- matrix(rt(n1*p,1),ncol=p);
  y <- matrix(rnorm(n2*p,1,2),ncol=p);
  z <- rbind(x,y)
  p.values[i,1] <- eqdist.nn(z,N,k)$p.value#performance of NN method
  p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value#performance of energy method
  p.values[i,3] <- bd.test(x=x,y=y,R=999,seed=i*12345)$p.value#performance of ball method
}
alpha <- 0.05                        #confidence level 
pow <- colMeans(p.values<alpha)
print(pow)

Analysis:

From the results above,we find that the powers between methods NN,energy and ball is distingushable,range from 0.53 to 0.95,and the method ball perform best.

From the results above,we find that the powers between methods NN,energy and ball is distingushable,range from 0.67 to 0.89,and the method ball perform best.

From the results above,we find that the powers between methods NN,energy and ball are distingushable,range from 0.19 to 0.74,and the method ball perform best.

From the results above,we find that the powers between methods NN,energy and ball is distingushable,range from 0.41 to 0.78,and the method energy perform best.

Question 3-exercise 9.3

Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 of the chain, and compare the deciles of the generated observations with the deciles of the standard Cauchy distribution (see qcauchy or qt with df=1). Recall that a Cauchy $(\theta,\eta)$\qquad distribution has density function

$f(x)=\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^{2})}$\qquad $-\infty0$\qquad

The standard Cauchy has the Cauchy( $(\theta=1,\eta=0)$\qquad) density. (Note that the standard Cauchy density is equal to the Student t density with one degree of freedom.)

Answer 3

Analysis: An M-H algorithm: 1.Set $g(\cdot|X)$\qquad (proposal pdf);2.Generate $X_0$\qquad from a distribution g;3.Repeat for $t=2,\ldots,N$\qquad: (a) Generate $Y$\qquad from $g(\centerdot|X_{t})$\qquad; (b) Generate $U$\qquad from Uniform(0,1); (c)if $U\leq\frac{f(Y)g(X_{t}|Y)}{f(X_{t})g(Y|X_{t})}$\qquad accept $Y$\qquad and set $X_{t+1}=Y$, otherwise set $X_{t+1}=X_t$\qquad; (d) Increment $t$\qquad.

m <- 10000
x <- numeric(m)
x[1] <- rnorm(1)
k <- 0
b<-1000
#generates random deviates
u <- runif(m)
#Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution.
for (i in 2:m) 
{
    xt <- x[i-1]
    y <- rnorm(1, mean = xt)
    #Density Cauchy distribution
    num <- dcauchy(y)*dnorm(xt, mean = y)
    den <- dcauchy(xt)*dnorm(y, mean = xt)

    if(u[i] <= num/den) 
    {
        x[i] <- y
    }
    else 
    {
        x[i] <- xt
       k <- k+1#y is rejected
    }
}

plot(x,type = "l",main="",ylab="x")
y <- x[b:m]
b <- x[1000:m]
sequence <- seq(0,1,0.01)
standardCauchy <- qcauchy(sequence)
standardCauchy <- standardCauchy[(standardCauchy> -Inf) & (standardCauchy< Inf)]
hist(b, freq = FALSE,main = "")
lines(standardCauchy, dcauchy(standardCauchy), lty = 2)
qqplot(standardCauchy, quantile(x,sequence), main="",xlab="Cauchy Quantiles", ylab="Sample Quantiles")#compare the deciles of the generated observations with the deciles of the standard Cauchy distribution

Analysis of results:

Part of a chain generated by a Metropolis-Hastings sampler of a cauchy distribution,and the QQ plot is shown in Figure.From the plot above, it appears that the sample quantiles are in approximate agreement with the theoretical quantiles(the QQ plot is an informal approach to assessing the goodness-of-fit of the generated sample with the target distribution)

Question 4-exercise 9.6

Rao [220, Sec. 5g] presented an example on genetic linkage of 197 animals in four categories (also discussed in [67, 106, 171, 266]). The group sizes are (125, 18, 20, 34). Assume that the probabilities of the corresponding multinomial distribution are $(\frac{1}{2}+\frac{\theta}{4},\frac{1-\theta}{4},\frac{1-\theta}{4},\frac{\theta}{4})$\qquad.Estimate the posterior distribution of $\theta$\qquad given the observed sample, using one of the methods in this chapter.

Answer 4

Analysis:

In this example, we cannot directly simulate random variates from the posterior distribution. One approach to estimating $\theta$\qquad is to generate a chain that converges to the posterior distribution and estimate $\theta$\qquad from the generated chain. Use the random walk Metropolis sampler with a uniform proposal distribution to generate the posterior distribution of $\theta$\qquad.

set.seed(1)
t <- c(125,18,20,34) #actual value of theta
w <- 0.25 #width of the uniform support set
m <- 5000 #length of the chain
burn <- 1000 #burn-in time
animals <- 197
x <- numeric(m) #the chain



prob <- function(y, t) {
# computes (without the constant) the target density
if (y < 0 || y >1)
return (0)
return((1/2+y/4)^t[1] *
((1-y)/4)^t[2] * ((1-y)/4)^t[3] *
(y/4)^t[4])
}
u <- runif(m) #for accept/reject step
v <- runif(m, -w, w) #proposal distribution
x[1] <- .25
for (i in 2:m) {
y <- x[i-1] + v[i]
if (u[i] <= prob(y, t) / prob(x[i-1], t))
x[i] <- y else
x[i] <- x[i-1]
}
xt <- x[(burn+1):m]
theta.hat<-mean(xt)
print(theta.hat)
that <- sum(t) * c((2+theta.hat)/4, (1-theta.hat)/4, (1-theta.hat)/4, theta.hat/4)
print(round(that))

Analysis of results:

The sample mean of the generated chain is 0.6192334,and the evaluated group size is(129 19 19 30),which is similiar to the real size,so the $\hat\theta$\qquad is similiar to the real $\theta$\qquad.

Homework 7-2018/11/23

Question 1 -exercise 9.6

Rao [220, Sec. 5g] presented an example on genetic linkage of 197 animals in four categories (also discussed in [67, 106, 171, 266]). The group sizes are (125, 18, 20, 34). Assume that the probabilities of the corresponding multinomial distribution are$(\frac{1}{2}+\frac{\theta}{4},\frac{1-\theta}{4},\frac{1-\theta}{4},\frac{\theta}{4})$\qquad.

Estimate the posterior distribution of $\theta$\qquad given the observed sample, using one of the methods in this chapter. For exercise 9.6, use the Gelman-Rubin method to monitor convergence of the chain, and run the chain until the chain has converged approximately to the target distribution according to $\hat R < 1.2$\qquad.

Answer 1

Analysis:

Basic idea: If a chain converges, $\phi_{in}$\qquad also converges uniformly for i.
- Between-sequence variance: $B_{n}=\frac{n}{k-1}\sum_{i=1}^{k}(\bar{\phi_{i\cdot}}-\bar{\phi_{\cdot\cdot}})^2$\qquad
- With-sequence variance: $W_n=\frac{1}{k}\sum_{i=1}^{k}\frac{1}{n}\sum_{j=1}^{n}(\phi_{ij}-\bar{\phi_{i\cdot}})^2$\qquad
- A variance estimate of $\phi$\qquad: $\widehat{var}(\phi)=\frac{n-1}{n}W_{n}+\frac{1}{n}B_{n}$\qquad
- Gelman-Rubin statistic: $\sqrt{\hat{R}}=\sqrt{\frac{\widehat{Var}(\phi)}{W_{n}}}$\qquad
- By using the Gelman-Rubin method to monitor convergence of the chain,we run the chain until the chain has converged approximately to the target distribution according to $\hat{R}$\qquad<1.1 and $\hat{R}$\qquad<1.2.

set.seed(1)
Gelman.Rubin <- function(psi) {
  # psi[i,j] is the statistic psi(X[i,1:j])
  # for chain in i-th row of X
  psi <- as.matrix(psi)
  n <- ncol(psi)
  k <- nrow(psi)

  psi.means <- rowMeans(psi)     #row means
  B <- n * var(psi.means)        #between variance est.
  psi.w <- apply(psi, 1, "var")  #within variances
  W <- mean(psi.w)               #within est.
  v.hat <- W*(n-1)/n + (B/n)     #upper variance est.
  r.hat <- v.hat / W             #G-R statistic
  return(r.hat)
}

t <- c(125,18,20,34) #actual value
b<- 1000 #burn-in time
prob<- function(y, t) {
  # computes (without the constant) the target density
  if (y < 0 || y >1)
    return (0)
  else
    return((1/2+y/4)^t[1] *((1-y)/4)^t[2] * ((1-y)/4)^t[3] *(y/4)^t[4])
}

#generate chains
chain<-function(w,N){
  x<-rep(0,N)
  u <- runif(N) #for accept/reject step
  v <- runif(N, -w, w) #proposal distribution
  x[1] <-w
  for (i in 2:N) {
    y <- x[i-1] + v[i]
    if (u[i] <= prob(y, t) / prob(x[i-1], t))
      x[i] <- y 
    else
        x[i] <- x[i-1]
  } 
return(x)
}
w <- c(0.05,0.2,0.4,0.8)#set the paremeter of proposal distribution
n<- 15000 #length of the chain
k<-4 #number of chains to generate

#generate the chains
X <- matrix(0, nrow=k, ncol=n)
for (i in 1:k)
  X[i, ] <- chain(w[i],n)

#compute diagnostic statistics
psi <- t(apply(X, 1, cumsum))
for (i in 1:nrow(psi))
  psi[i,] <- psi[i,] / (1:ncol(psi))
  print(Gelman.Rubin(psi))
#plot psi for the four chains
par(mfrow=c(2,2))
for (i in 1:k)
  plot(psi[i, (b+1):n], type="l",
       xlab=i, ylab=bquote(psi))
par(mfrow=c(1,1)) #restore default


#plot the sequence of R-hat statistics
rhat <- rep(0, n)
for (j in (b+1):n)
  rhat[j] <- Gelman.Rubin(psi[,1:j])
plot(rhat[(b+1):n], type="l", xlab="", ylab="R",ylim=c(1,1.3))
abline(h=1.1, lty=2)
abline(h=1.2, lty=2)

Analysis of results:

From this plot it is evident that the chain is converging fast.The value of $\hat{R}$\qquad is below 1.2 within 2000 iterations and below 1.1 within 4000 iterations.

Question 2

Find the intersection points $A(k)$\qquad in $(0,\sqrt{k})$\qquad of the curves:

$S_{k-1}(a)=P(t(k-1)>\sqrt{\frac{a^{2}(k-1)}{k-a^{2}}})$\qquad $S_{k}(a)=P(t(k)>\sqrt{\frac{a^{2}(k)}{k+1-a^{2}}})$\qquad

for k = 4:25,100,500,1000,where $t(k)$\qquad is a Student t random variable with k degrees of freedom. (These intersection points determine the critical values for a t-test for scale-mixture errors proposed by Szekely [260].)

Answer 2

Analysis:

We can construct equation $S_{k-1}(a)=S_{k}(a)$\qquad,and find the root of the equation.

findIntersection = function (k) {
  s.k.minus.one = function (a) {#set function S_{k-1}(a)
    1-pt(sqrt(a^2 * (k - 1) / (k - a^2)), df = k-1)
  }
  s.k = function (a) {#set function S_{k}(a)
    1-pt(sqrt(a^2 * k / (k + 1 - a^2)), df = k)
  }

  f = function (a) {#set function S_{k}(a)-S_{k-1}(a)
    s.k(a) - s.k.minus.one(a)
  }#Find the intersection points A(k) in (0,sqrt(k)) of the curves
  eps = .Machine$double.eps^0.5
  return(uniroot(f, interval = c(eps, sqrt(k)-eps))$root)
}
k<-c(4:25, 100, 500, 1000)
rs = sapply(k, function (k) {findIntersection(k)})
s<-cbind(rs,k)
s

Homework 8-2018/11/30

Question 1 -exercise 11.6

Write a function to compute the cdf of the Cauchy distribution, which has density

$\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)},-\infty<x<\infty$\qquad

where$\theta>0$\qquad. Compare your results to the results from the R function pcauchy.(Also see the source code in pcauchy.c.)

Answer 1

Analysis:

We use "integrate" function
- It is an adaptive quadrature method for a one variable funtion. One can specify
- the limits of integration (can be infinite)
- the maximum number of subintervals
- the relative error
F(x)=$\int_{-\infty}^{x}\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)},-\infty0$\qquad

dcauchy <- function(t, eta, theta) {#compute the desity function of cauchy distribution
  if (theta<=0) return (0)#let theta > 0
  else 
    return (1/(theta*pi*(1+((t-eta)/theta)^2)))
  }
cdf.cauchy <- function(x, eta, theta){#compute the cdf of cauchy distribution
res <- integrate(dcauchy, lower=-Inf, upper=x,
                   rel.tol=.Machine$double.eps^0.25,
                   eta=eta, theta=theta)$value

return(res)
}
pcauchy(0,0,1)#Compare your results to the results from the R function pcauchy.
cdf.cauchy (0,0,1)
pcauchy(1,0,1)
cdf.cauchy (1,0,1)
pcauchy(-0.1,0,1)
cdf.cauchy (-0.1,0,1)

Analysis of results:

From the results above,we find computing reults of funtion is the same as the results from the R function pcauchy.

Question 2

A-B-O blood type problem
- Let the three alleles be A, B, and O.

Genotype | AA|BB |OO |AO |BO |AB |AA ----------|---|---|---|---|---|---|--- Frequency |p2 |q2 |r2 |2pr|2qr|2pq|1 Count |nAA|nBB|nOO|nAO|nBO|nAB|n

Observed data: $n_{A\cdot}=n_{AA}+n_{AO}=28$\qquad (A-type), $n_{B\cdot}=n_{BB}+n_{BO}=24$\qquad (B-type), $n_{OO}=41$\qquad (O-type), $n_{AB}=70$\qquad (AB-type).
Use EM algorithm to solve MLE of $p$\qquad and $q$\qquad (consider missing data $n_{AA}$\qquad and $n_{BB}$\qquad).
Record the maximum likelihood values in M-steps, are they increasing?

Answer 2

Analysis:

The EM (Expectation Maximization) algorithm is a general optimization method that is often applied to find maximum likelihood estimates when data are incomplete.

First:

compute the likelihood function,then evaluate the log-likelihood of the data for any arbitrary set of allele frequencies.

Second:

estimate the number of individuals carrying each possible genotype based on the observed counts for each phenotype and estimates of the allele frequencies at each iteration.

you can run the E-M function with the debug flag set to TRUE

Third:

set the initial value to be p=0.3,q=0.2,r=0.5,then run the E-M function.

#set the likelihood function 
lnL <- function(p, q, nA =28, nB = 24, nAB = 70, nO = 41) {
  r = 1.0 - p - q
  nA * log(p^2 + 2*p*r) + nB * log(p^2 + 2 * q * r) + 
    nAB * log(2 * p * q) + 2 * nO * log(r)
}

EM <- function (p,q,nA =28, nB = 24, nAB = 70, nO = 41, debug = FALSE) {

  # Evaluate the likelihood using initial estimates
  llk <- lnL(p, q, nA, nB, nAB, nO)

  # Count the number of iterations so far
  iter <- 1

  # Loop until convergence ...
  while (TRUE)
  {
    # Estimate the frequency for allele O
    r= 1.0 - p - q

    # First we carry out the E-step

    # The counts for genotypes O/O and A/B are effectively observed
    # Estimate the counts for the other genotypes
    nAA <- nA * p / (p + 2*r)
    nAO <- nA - nAA
    nBB <- nB * q / (q + 2*r)
    nBO <- nB - nBB

    # Print debugging information
    if (debug)
    {
      cat("Round #", iter, "lnLikelihood = ", llk, "\n")
      cat("    Allele frequencies: p = ", p, ", q = ", q, ",r = ", r, "\n")
      cat("    Genotype counts:    nAA = ", nAA, ", nAO = ", nAO, ", nBB = ", nBB, 
          ", nBO = ", nBO, "\n")
    }

    # Then the M-step
    p <- (2 * nAA + nAO + nAB) / (2 * (nA + nB + nO + nAB))
    q <- (2 * nBB + nBO + nAB) / (2 * (nA + nB + nO + nAB))


    # Then check for convergence ...
    llk1 <- lnL(p, q, nA, nB, nAB, nO)
    if (abs(llk1 - llk) < (abs(llk) + abs(llk1)) * 1e-6) break       

    # Otherwise keep going
    llk <- llk1
    iter <- iter + 1
  }

 list(p = p, q = q)
}
EM(0.3,0.2,nA =28, nB = 24, nAB = 70, nO = 41, debug = TRUE)

Analysis of results:

From the results above,we know that the eatimated value is p=0.32734,q= 0.3104234,and the log likelihood estimates islnLikelihood = -261.3305 , -251.1735, -251.1293 ,-251.1236 ,-251.122,which means the maximum likelihood estimates is monotonous increasing.

Homework 9-2018/12/09

Question 1

Answer 1

Analysis:

Lapply() takes a function, applies it to each element in a list, and returns the results in the form of a list.

Code:

formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)
#With lapply()
model_mtcars1<-lapply(formulas,lm,data=mtcars)
model_mtcars1
#With a for loop
model_mtcars2<- vector("list", length(formulas))
        for (i in seq_along(formulas)) {
          model_mtcars2[[i]] <- lm( formulas[[i]],mtcars)
        }
model_mtcars2

Analysis of results:

From the results,we know that "lapply(x,f,...)" is equivalent to "for loops".

Question 2

Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply().Can you do it without an anonymous function?

bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})

Answer 2

Analysis:

Similiar to question 1,we apply funtion laaply and for loop to each of the bootstrap replicates of mtcars.

Code:

set.seed(1)
bootstraps <- lapply(1:10, function(i) {
rows <- sample(1:nrow(mtcars), rep = TRUE)
mtcars[rows, ]
})
#With lapply() 
model_bootstrap1<-lapply(bootstraps,lm,formula=mpg ~ disp)
model_bootstrap1
#With a for loop
model_bootstrap2 <- vector("list", length(bootstraps))
        for (i in seq_along(bootstraps)) {
          model_bootstrap2[[i]] <- lm( mpg ~ disp,bootstraps[[i]])
        }
model_bootstrap2

Analysis of results:

From this code, we can see that lapply() is a wrapper for a common for loop pattern,and lapply() makes it easier to work with lists by eliminating much of the boilerplate associated with looping.

Question 3

For each model in the previous two exercises, extract R2 using the function below.

rsq <- function(mod) summary(mod)$r.squared

Answer 3

Analysis:

Apply the function rsq separately to model in question 1 and question 2.

Code:

#model in question 1
formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)
#With lapply()
model_mtcars1<-lapply(formulas,lm,data=mtcars)

#With a for loop
model_mtcars2<- vector("list", length(formulas))
        for (i in seq_along(formulas)) {
          model_mtcars2[[i]] <- lm( formulas[[i]],mtcars)
        }


#model in question 2
bootstraps <- lapply(1:10, function(i) {
rows <- sample(1:nrow(mtcars), rep = TRUE)
mtcars[rows, ]
})
#With lapply() 
model_bootstrap1<-lapply(bootstraps,lm,formula=mpg ~ disp)

#With a for loop
model_bootstrap2 <- vector("list", length(bootstraps))
        for (i in seq_along(bootstraps)) {
          model_bootstrap2[[i]] <- lm( mpg ~ disp,bootstraps[[i]])
        }



#extract R2
rsq <- function(mod) summary(mod)$r.squared
#For model in question 1 
model1_rsq1<-sapply(model_mtcars1, rsq)
model1_rsq2<-sapply(model_mtcars2, rsq)
#For model in question 2
model2_rsq1<-sapply(model_bootstrap1, rsq)
model2_rsq2<-sapply(model_bootstrap1, rsq)
rbind(model1_rsq1,model1_rsq2)
rbind(model2_rsq1,model2_rsq2)

Analysis of results:

From the results,we can see that the rsq in question 1 is range from 0.72 to 0.88,and the the rsq in question 2 is range from 0.57 to 0.77.

Question 4

The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.

trials <- replicate(
  100,
  t.test(rpois(10, 10), rpois(7, 10)),
  simplify = FALSE
)

Extra challenge: get rid of the anonymous function by using [[ directly.

Answer 4

Analysis:

Apply sapply() and an anonymous function to the trial and extract the p-value.

Code:

set.seed(1)
# Create some random data
trials <- replicate(
100,
t.test(rpois(10, 10), rpois(7, 10)),
simplify = FALSE
)
#Use sapply() and an anonymous function to extract the p-value from every trial
p1<-sapply(trials, function(test) test$p.value)
#Get rid of the anonymous function by using [[ directly.
p2<-sapply(trials, '[[', i = "p.value") 
cbind(p1,p2)

Analysis of results:

According to the results,we can see p-value is large,it suggest that most of the trials is not significant.

Question 5

Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?

Answer 5

Analysis:

Use parallel: parallel::mcMap()) does.

library(parallel)
vapply_Map <- function(f, FUN.VALUE , ...) {
    out <- mcMap(f, ...)
    vapply(out, identity, FUN.VALUE)
}

Homework 10-2018/12/14

title: "Homework-2018.12.14" author: "By 18024" date: "2018/12/14" output: html_document

Question 1 -exercise 4

Make a faster version of chisq.test() that only computes the chi-square test statistic when the input is two numeric vectors with no missing values. You can try simplifying chisq.test() or by coding from the mathematical definition. (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test).

Answer 1

Analysis:

First:

According to WIKIPEDIA,we can compute the chi-square test statistic to test the independence of observations in two vectors.

Second:

$H_{0}:E_{i,j}=Np_{i.}p_{.j}$\qquad

$p_{i,.}=\frac{O_{i}}{N}$\qquad

$p_{.,j}=\frac{O_{.,j}}{N}$\qquad

$\chi^{2}=\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{i,j}-E_{i,j})^{2}}{E_{i,j}}$\qquad

Third:

Degree of freedom is equal to $(r-1)(c-1)$\qquad

#construct function to compute  the expected (theoretical) count
expected <- function(colsum, rowsum, total) {
  (colsum / total) * (rowsum / total) * total
}
#construct function to compute the value of the test-statistic
chi_stat <- function(observed, expected) {
  ((observed - expected) ^ 2) / expected
}


chisq_test2 <- function(x, y) {
  total <- sum(x) + sum(y)
  rowsum_x <- sum(x)
  rowsum_y <- sum(y)
  chistat <- 0
# computes the chi-square test statistic which is apparently different from chisq.test function
  for (i in seq_along(x)) {
    colsum <- x[i] + y[i]
    expected_x <- expected(colsum, rowsum_x, total)
    expected_y <- expected(colsum, rowsum_y, total)
    chistat <- chistat + chi_stat(x[i], expected_x)
    chistat <- chistat + chi_stat(y[i], expected_y)
  }
  chistat
}
o <- as.table(rbind(c(56, 67, 57,34), c(135, 125, 45,65)))
#compare the results of chi-square test
print(chisq_test2(c(56, 67, 57,34), c(135, 125, 45,65)))
print(chisq.test(as.table(rbind(c(56, 67, 57,34), c(135, 125, 45,65)))))
#compare the computing time
print(microbenchmark::microbenchmark(
  chisq_test2(c(56, 67, 57,34), c(135, 125, 45,65)),
  chisq.test(as.table(rbind(c(56, 67, 57,34), c(135, 125, 45,65))))
))

Analysis of results:

First:

According to the reults above,we can find that chisq-square statistics of the function we constructed is 0.04762132,which is less than $\chi_{0.95}(4)$\qquad,and the p-value of chisq.test() is 0.2243>0.05,so two functions obtain the same result to accept the null hypothesis.

Second:

According to the time comparison,we can find the function we constructed is much quicker than chisq.test().

Question 2-Exercise 5

Can you make a faster version of table() for the case of an input of two integer vectors with no missing values? Can you use it to speed up your chi-square test?

Answer 2

Table is used to do contingency table analysis

#With table
subject1<-c(60,70,60,80)
subject2<-c(60,60,60,90)
data<-data.frame(subject1,subject2)
result_table<-table(data)
result_table

#with loop
x<-c(60,70,80,90)
f<-function(x){
i<-numeric(4)
j<-numeric(4)
q<-numeric(4)
for (i in 1:4){
  for(j in 1:4){
    for(q in 1:4){
result_2<-ifelse(subject1[i]==x[j]&subject2[i]==x[q],"1","0")
    }}}}

print(microbenchmark::microbenchmark(
  table(data),
  f(x)))

Analysis of results:

From the result of table,we can find that there are two person who scores 60,60,and 1 person 70,60,one person 80,90.

If there is a faster version of table() for the case, chi-square test maybe can be speeded up for the course refers to the the expected (theoretical) count.

Thanks for your reading!

zhangmanustc/StatComp18024 documentation built on May 9, 2019, 10:45 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

zhangmanustc/StatComp18024 Three functions for StatComp

In zhangmanustc/StatComp18024: Three functions for StatComp

Overview

Homework 1-2018/09/14

Question 1

Answer 1

Question 2

Answer 2

Question 3

Answer 3

Question 4

Answer 4

Question 5

Answer 5

Homework 2-2018/09/21

Question 1

Answer 1

Question 2

Answer 2

Question 3

Answer 3

Homework 3-2018/09/28

Question 1 -exercise 5.4

Answer 1

Question 2-exercise 5.9

Answer 2

Question 3-exercise 5.13

Answer 3

Question 4-exercise 5.14

Answer 4

Homework 4-2018/10/12

Question 1 -exercise 6.9

Answer 1

Question 2-exercise 6.10

Answer 2

Question 3-exercise 6.B

Answer 3

Homework 5-2018/11/02

Question 1 -exercise 7.1

Answer 1

Question 2-exercise 7.5

Answer 2

Question 3-exercise 7.8

Answer 3

Question 4-exercise 7.11

Answer 4

Homework 6-2018/11/16

Question 1 -exercise 8.1

Answer 1

Question 2

Answer 2

Question 3-exercise 9.3

Answer 3

Question 4-exercise 9.6

Answer 4

Homework 7-2018/11/23

Question 1 -exercise 9.6

Answer 1

Question 2

Answer 2

Homework 8-2018/11/30

Question 1 -exercise 11.6

Answer 1

Question 2

Answer 2

Homework 9-2018/12/09

Question 1

Answer 1

Question 2

Answer 2

Question 3

Answer 3

Question 4

Answer 4

Question 5

Answer 5

Homework 10-2018/12/14

Question 1 -exercise 4

Answer 1

Question 2-Exercise 5

zhangmanustc/StatComp18024
Three functions for StatComp