StatComp18096 is a simple R package developed to present the homework answers for the 'Statistical Computing' course.
The package implements at least three examples of different types from the books "R for Beginners" and "Statistical Computing with R".
The answers are presented with text, numerical results, tables, and figures.
Example 1 - How to plot (R for Beginners P45)
The example shows how to produce a polished figure using plotting parameters and the par() function.
We generate ten pairs of random values from the normal distribution.
Before drawing the figures, the ten pairs of random values are generated as follows:
x <- rnorm(10)
y <- rnorm(10)
rbind(x, y)
FIGURE 1.1
The first plot is displayed in black and white.
plot(x,y)
FIGURE 1.2 The second plot looks much better because several graphical options replace the default values.
# replace some plotting parameters
plot(x, y, xlab = "Ten random values", ylab = "Ten other values",
     xlim = c(-2, 2), ylim = c(-2, 2), pch = 22, col = "red", bg = "yellow",
     bty = "l", tcl = 0.4, main = "How to customize a plot with R",
     las = 1, cex = 1.5)
FIGURE 1.3
par() can be used to set or query graphical parameters. The default drawing parameters are copied to the list opar, and we continue to modify the previous plot.
opar <- par()
par(bg = "lightyellow", col.axis = "blue", mar = c(4, 4, 2.5, 0.25))
plot(x, y, xlab = "Ten random values", ylab = "Ten other values",
     xlim = c(-2, 2), ylim = c(-2, 2), pch = 22, col = "red", bg = "yellow",
     bty = "l", tcl = -.25, las = 1, cex = 1.5)
title("How to customize a plot with R (bis)", font.main = 3, adj = 1)
FIGURE 1.4
Here we use the plot function to draw a "blank" figure and then use low-level functions to add points, axes, labels, and so on. In this way we have complete control over the drawing.
opar <- par()
# change the background and the margins
par(bg = "lightgray", mar = c(2.5, 1.5, 2.5, 0.25))
# plot a blank figure
plot(x, y, type = "n", xlab = "", ylab = "", xlim = c(-2, 2), ylim = c(-2, 2),
     xaxt = "n", yaxt = "n")
# change the color of the drawing area
rect(-3, -3, 3, 3, col = "cornsilk")
points(x, y, pch = 10, col = "red", cex = 2)
axis(side = 1, c(-2, 0, 2), tcl = -0.2, labels = FALSE)
axis(side = 2, -1:1, tcl = -0.2, labels = FALSE)
title("How to customize a plot with R (ter)", font.main = 4, adj = 1, cex.main = 1)
mtext("Ten random values", side = 1, line = 1, at = 1, cex = 0.9, font = 3)
mtext("Ten other values", line = 0.5, at = -1.8, cex = 0.9, font = 3)
mtext(c(-2, 0, 2), side = 1, las = 1, at = c(-2, 0, 2), line = 0.3, col = "blue", cex = 0.9)
mtext(-1:1, side = 2, las = 1, at = -1:1, line = 0.2, col = "blue", cex = 0.9)
Example 2 - densityplot (R for Beginners P50)
Generate independent normal random samples, divided into nine subsamples.
Use the densityplot function to draw a plot for each subsample.
panel.densityplot plots the empirical density function, and panel.mathdensity plots the fitted normal density function.
library("lattice")
n <- seq(5, 45, 5)
x <- rnorm(sum(n))
y <- factor(rep(n, n), labels = paste("n =", n))
densityplot(~ x | y, panel = function(x, ...) {
  panel.densityplot(x, col = "DarkOliveGreen", ...)
  panel.mathdensity(dmath = dnorm, args = list(mean = mean(x), sd = sd(x)),
                    col = "darkblue")
})
Example 3 - xyplot & splom (R for Beginners P51)
Earthquake data. This example shows the geographic locations of earthquakes at different depths.
data(quakes)
mini <- min(quakes$depth)
maxi <- max(quakes$depth)
int <- ceiling((maxi - mini) / 9)
inf <- seq(mini, maxi, int)
quakes$depth.cat <- factor(floor((quakes$depth - mini) / int),
                           labels = paste(inf, inf + int, sep = "-"))  # add the labels with paste
xyplot(lat ~ long | depth.cat, data = quakes)  # split according to depth.cat
Iris data. Compared with the previous example, this xyplot uses panel.superpose to draw the overlapping groups in the same panel.
data(iris)
xyplot(Petal.Length ~ Petal.Width, data = iris, groups = Species,
       panel = panel.superpose, type = c("p", "smooth"), span = .75,
       auto.key = list(x = 0.15, y = 0.85))  # legend at relative location in (0,1) x (0,1)
splom(~ x) is used to plot a scatterplot matrix of the pairwise relations.
splom(~ iris[1:4], groups = Species, data = iris, xlab = "",
      panel = panel.superpose, auto.key = list(columns = 3))
Example 4 - stratified sampling ("Statistical Computing with R" P146) Stratified sampling is implemented in a more general way, for the Monte Carlo estimate of $\int_0^1e^{-x}(1+x^2)^{-1}dx$. The standard Monte Carlo estimate is also obtained for comparison.
M <- 10000        # number of replicates
k <- 10           # number of strata
r <- M / k        # replicates per stratum
N <- 50           # number of times to repeat the estimation
T2 <- numeric(k)
estimates <- matrix(0, N, 2)
g <- function(x) {
  exp(-x - log(1 + x^2)) * (x > 0) * (x < 1)
}
for (i in 1:N) {
  estimates[i, 1] <- mean(g(runif(M)))
  for (j in 1:k)
    T2[j] <- mean(g(runif(M / k, (j - 1) / k, j / k)))
  estimates[i, 2] <- mean(T2)
}
# The result of this simulation produces the following estimates.
apply(estimates, 2, mean)
apply(estimates, 2, var)
This represents a more than 98% reduction in variance.
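The reported percent reduction in variance can be computed directly from the two columns of the estimates matrix produced above; a small sketch:
# percent reduction in variance of the stratified estimator relative to standard MC
v <- apply(estimates, 2, var)
100 * (v[1] - v[2]) / v[1]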
### Plot importance functions in Figures 5.1(a) and 5.1.(b) #par(ask = TRUE) #uncomment to pause between graphs x <- seq(0, 1, .01) w <- 2 f1 <- exp(-x) f2 <- (1 / pi) / (1 + x^2) f3 <- exp(-x) / (1 - exp(-1)) f4 <- 4 / ((1 + x^2) * pi) g <- exp(-x) / (1 + x^2) #for color change lty to col #figure (a) plot(x, g, type = "l", main = "", ylab = "", ylim = c(0,2), lwd = w) lines(x, g/g, lty = 2, lwd = w) lines(x, f1, lty = 3, lwd = w) lines(x, f2, lty = 4, lwd = w) lines(x, f3, lty = 5, lwd = w) lines(x, f4, lty = 6, lwd = w) legend("topright", legend = c("g", 0:4), lty = 1:6, lwd = w, inset = 0.02) #figure (b) plot(x, g, type = "l", main = "", ylab = "", ylim = c(0,3.2), lwd = w, lty = 2) lines(x, g/f1, lty = 3, lwd = w) lines(x, g/f2, lty = 4, lwd = w) lines(x, g/f3, lty = 5, lwd = w) lines(x, g/f4, lty = 6, lwd = w) legend("topright", legend = c(0:4), lty = 2:6, lwd = w, inset = 0.02)
Example 5 - Estimating a confidence level ("Statistical Computing with R" P159)
Example 6.4 (Confidence interval for variance)
If $X_1,...,X_n$ is a random sample from a $Normal(\mu, \sigma^2)$ distribution, $n \ge 2$, and $S^2$ is the sample variance, then $V = \frac{(n - 1)S^2}{\sigma^2} \sim \chi^2(n - 1)$.
A one-sided $100(1-\alpha)\%$ confidence interval is given by $(0,\frac{(n-1)S^2}{\chi^2_\alpha})$, where $\chi^2_\alpha$ is the $\alpha$-quantile of the $\chi^2(n-1)$ distribution. If the sampled population is normal with variance $\sigma^2$, then the probability that the confidence interval contains $\sigma^2$ is $1-\alpha$.
The calculation of the 95% upper confidence limit (UCL) for a random sample of size n = 20 from a $Normal(0, \sigma^2 = 4)$ distribution is shown below.
n <- 20
alpha <- .05
x <- rnorm(n, mean = 0, sd = 2)
UCL <- (n - 1) * var(x) / qchisq(alpha, df = n - 1)
print(UCL)
The confidence level is exactly $P(\frac{19S^2}{\chi_{0.05}^2(19)}>4)=P(\frac{(n-1)S^2}{\sigma^2}>\chi^2_{0.05}(n-1))=0.95$.
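As a quick check (not part of the original answer), the empirical confidence level can be estimated by repeating the interval construction and counting how often the UCL exceeds the true variance $\sigma^2 = 4$:
# Monte Carlo estimate of the empirical confidence level, using n and alpha from above
m <- 1000
UCLs <- replicate(m, {
  x <- rnorm(n, mean = 0, sd = 2)
  (n - 1) * var(x) / qchisq(alpha, df = n - 1)
})
mean(UCLs > 4)   # should be close to 0.95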
A discrete random variable X has probability mass function
x <- 0:4
prob <- c(0.1, 0.2, 0.2, 0.2, 0.3)
x.prob <- rbind(x, prob)
x.prob
Use the inverse transform method to generate a random sample of size 1000 from the distribution of X.
Construct a relative frequency table and compare the empirical with the theoretical probabilities. Repeat using the R sample function.
x <- 0:4
prob <- c(0.1, 0.2, 0.2, 0.2, 0.3)
cp <- cumsum(prob)
m <- 1000
r <- numeric(m)
r <- x[findInterval(runif(m), cp) + 1]
ct <- as.vector(table(r))
ctfre <- ct / sum(ct)
ctdif <- ct / sum(ct) / prob
# generate a random sample of size 1000 with the sample function
n <- 1000
a <- sample(x, size = n, prob = prob, replace = TRUE)
# compute the frequencies
a <- as.vector(table(a))
a.fre <- a / n / prob
diffe1 <- a.fre
diffe2 <- ctdif
data <- data.frame(x, ct, a, ctfre, a.fre, prob, diffe1, diffe2)
colnames(data) <- c("x", "ct-num", "sample-num", "ct-freq", "samp-freq", "prob", "dif1", "dif2")
data
The data frame shows that the sample is good: the empirical probabilities obtained from the sample function and from the inverse transform method are both close to the theoretical probabilities.
To save space, the raw sample data are not displayed.
Write a function to generate a random sample of size n from the Beta(a, b) distribution by the acceptance-rejection method.
Generate a random sample of size 1000 from the Beta(3,2) distribution.
Graph the histogram of the sample with the theoretical Beta(3,2) density superimposed.
# write the function
rBeta <- function(n, a, b) {
  k <- 0
  j <- 0
  y <- numeric(n)
  while (k < n) {
    u <- runif(1)
    j <- j + 1
    x <- runif(1)  # random variate from g
    c <- prod(2:(a + b - 1)) / prod(2:(a - 1)) / prod(2:(b - 1))  # constant c (not needed for the acceptance step)
    r <- x^(a - 1) * (1 - x)^(b - 1)
    if (r > u) {
      # we accept x
      k <- k + 1
      y[k] <- x
    }
  }
  y
}
# Generate a random sample of size 1000 from the Beta(3,2) distribution.
a <- 3; b <- 2
c <- 12
t <- rBeta(1000, 3, 2)  # sample data
t
hist(t, prob = TRUE, main = bquote(f(x) == 12 * x^2 * (1 - x)))  # density histogram of the sample
y <- seq(0, 1, .01)
lines(y, c * y^(a - 1) * (1 - y)^(b - 1))  # density curve
As required, I first define the function rBeta(n, a, b), where n is the sample size and a and b are the parameters of the Beta(a, b) distribution.
The pdf of beta(a,b) is $f(x)=\frac{(a+b-1)!}{(a-1)!(b - 1)!}x^{(a - 1)}(1-x)^{(b - 1)}$.
There exists a constant $c = \frac{(a+b-1)!}{(a-1)!(b-1)!}$ such that $f(x) \le c\,g(x)$, where $g(x)=\mathbf{1}_{(0,1)}(x)$ is the Uniform(0,1) density.
P(acceptance) = $\frac{1}{c}$
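For Beta(3,2) this gives $c = \frac{4!}{2!\,1!} = 12$, matching the constant c <- 12 used in the code above, so on average about 12 candidates are generated for each accepted value.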
After setting up the function, I generate a random sample of size 1000 from Beta(3,2) and plot the histogram.
Simulate a continuous Exponential-Gamma mixture. Suppose that the rate parameter $\Lambda$ has a $Gamma(r, \beta)$ distribution and Y has an $Exp(\Lambda)$ distribution.
That is, $(Y |\Lambda = \lambda) \sim f_{Y} (y|\lambda) = \lambda e^{-\lambda y}$. Generate 1000 random observations from this mixture with $r = 4$ and $\beta = 2$.
# generate an Exponential-Gamma mixture
n <- 1000
r <- 4
beta <- 2
lambda <- rgamma(n, r, beta)  # lambda is random
# now supply the sample of lambda's as the exponential rate parameter
x <- rexp(n, lambda)  # the mixture
x
First, we generate 1000 random observations (the lambda vector) from the $Gamma(r, \beta)$ distribution, and then we supply this sample of lambda's as the exponential rate parameter.
This produces a random sample of size 1000 from the mixture with $r = 4$ and $\beta = 2$.
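As a quick sanity check (not part of the original answer), the marginal distribution of Y in this mixture is the Pareto (Lomax) distribution with density $f(y) = \frac{r\beta^r}{(\beta+y)^{r+1}}$, so the simulated sample can be compared with this curve:
# compare the mixture sample with the theoretical Pareto (Lomax) marginal density
hist(x, prob = TRUE, breaks = "scott", main = "", xlab = "y")
yy <- seq(0, max(x), .01)
lines(yy, r * beta^r / (beta + yy)^(r + 1))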
Write a function to compute a Monte Carlo estimate of the Beta(3, 3) cdf, and use the function to estimate F(x) for x = 0.1, 0.2,..., 0.9. Compare the estimates with the values returned by the pbeta function in R.
step1 Write a function called 'MCbeta' that computes a Monte Carlo estimate of the Beta(a, b) cdf, $$ F(x) = \int_0^x\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}t^{a-1}(1-t)^{b-1}\,dt $$
In this function there are four parameters: a, b, n and x.
Consider the problem of estimating $\theta = F(x) = \int_0^x 30t^2(1-t)^2\,dt$ for the Beta(3,3) case. Generate nine uniform samples on $(0,x)$, each of size 1000, for x = 0.1, 0.2,..., 0.9.
step2 Use Monte Carlo integration to estimate the cdf: with $u_1,\ldots,u_n \sim Uniform(0,x)$, the estimate is $\hat F(x) = \frac{x}{n}\sum_{i=1}^{n} 30\,u_i^2(1-u_i)^2$, here with $n = 1000$.
step3
Compare the estimates with the values F(x) computed (numerically) by the pbeta function, using their ratio.
The code and the result are shown below:
MCbeta <- function(a, b, n, x) {
  cdf <- numeric(length(x))
  for (i in 1:length(x)) {
    u <- runif(n, 0, x[i])
    g <- x[i] * factorial(a + b - 1) / factorial(a - 1) / factorial(b - 1) * u^(a - 1) * (1 - u)^(b - 1)
    cdf[i] <- mean(g)  # estimated value
  }
  Phi <- pbeta(x, a, b)  # exact Beta(a, b) cdf
  ratio <- cdf / Phi
  print(round(rbind(x, cdf, Phi, ratio), 3))
}
x <- seq(.1, 0.9, length = 9)
MCbeta(3, 3, 1000, x)
As shown above, the ratio is very close to 1, i.e., the Monte Carlo estimates of the cdf are close to the pbeta values.
The Rayleigh density [156, (18.76)] is $f(x) = \frac{x}{\sigma^2} e^{-\frac{x^2}{2\sigma^2}}, x \ge 0, \sigma > 0$. Implement a function to generate samples from a Rayleigh($\sigma$) distribution, using antithetic variables. What is the percent reduction in variance of $\frac{X+X'}{2}$ compared with $\frac{X_1+X_2}{2}$ for independent $X_1$, $X_2$?
step1
If X is a continuous random variable with cdf $F_X (x)$, here $$F_X (x)=1-e^{-\frac{x^2}{2\sigma^2}},$$ then $U = F_X(X) \sim Uniform(0, 1)$.
step2 Since U follows a uniform distribution on [0, 1], the first sample is $\{u_1, \ldots, u_n\}$, where each $u_i$ is drawn from $U(0, 1)$. The second sample is built from $u_1', \ldots, u_n'$, where for each i, $u_i' = 1 - u_i$.
Then $x = F^{-1}_X(u)$, and $X = F_X^{-1}(U)$ and $X' = F_X^{-1}(U')$ both have distribution $F_X$ but are antithetic to each other, because $F_X^{-1}$ is a monotone function,
and $X=\sqrt{-2\sigma^2\ln(1-U)}$.
Furthermore, the covariance $Cov(X, X')$ is negative, which allows a variance reduction.
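Concretely, the estimator based on an antithetic pair has variance $$\operatorname{Var}\left(\frac{X+X'}{2}\right)=\frac{1}{4}\left(\operatorname{Var}(X)+\operatorname{Var}(X')+2\operatorname{Cov}(X,X')\right),$$ which is smaller than the variance $\frac{1}{2}\operatorname{Var}(X)$ of the mean of two independent copies whenever $\operatorname{Cov}(X,X')<0$; this is exactly the quantity computed in the code below.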
step3
step3 Compute the variances of the two estimators and the resulting percent reduction in variance.
The code and the result are shown below:
Antith.sampling <- function(sigma, n, antithetic = TRUE) {
  u <- runif(n / 2)
  if (antithetic) {
    v <- 1 - u
    a <- sqrt(-2 * sigma^2 * log(1 - u))
    b <- sqrt(-2 * sigma^2 * log(1 - v))
    # variance of the mean of an antithetic pair, (X + X')/2
    variance <- (var(a) + var(b) + 2 * cov(a, b)) / 4
  } else {
    v <- runif(n / 2)
    u <- c(u, v)
    a <- sqrt(-2 * sigma^2 * log(1 - u))  # a is the whole sample of size n
    # variance of the mean of two independent draws, (X1 + X2)/2
    variance <- var(a) / 2
  }
  variance
}
var1 <- Antith.sampling(2, 1000, antithetic = FALSE)  # variance for independent draws
var2 <- Antith.sampling(2, 1000, antithetic = TRUE)   # variance for antithetic draws
ratio <- 100 * (var1 - var2) / var1                   # percent reduction in variance
print(c(var1, var2, ratio))
We can conclude that the antithetic variables method improves efficiency considerably through variance reduction.
Find two importance functions $f1$ and $f2$ that are supported on $(1, \infty)$ and are close to
$$g(x) = \frac{x^2}{\sqrt{2\pi}} e^{-x^2/2}, x> 1$$ Which of your two importance functions should produce the smaller variance in estimating $$\int_1^\infty\frac{x^2}{\sqrt{2\pi}} e^{-x^2/2}dx$$ by importance sampling? Explain.
step Consider the density $g(x)$ and plot it; we choose importance functions $f(\cdot)$ that are close to $g(x)$.
So, I choose $$f_1(x) = e^{-x}, x \in (1,\infty)$$ and $$f_2(x) = \frac{1}{x^2}, x \in (1,\infty)$$
The code and the result are shown below:
x <- seq(1, 10, 0.02)
y <- x^2 / sqrt(2 * pi) * exp(-x^2 / 2)
y1 <- exp(-x)     # f1
y2 <- 1 / (x^2)   # f2
# legend labels for figure (a)
gs <- c(expression(g(x) == x^2 * e^{-x^2/2} / sqrt(2 * pi)),
        expression(f[1](x) == e^{-x}),
        expression(f[2](x) == 1 / x^2))
par(mfrow = c(1, 2))
# figure (a): the candidate densities
plot(x, y, type = "l", ylab = "", ylim = c(0, 0.5), main = "density function")
lines(x, y1, lty = 2, col = "red")
lines(x, y2, lty = 3, col = "blue")
legend("topright", legend = gs, lty = 1:3, inset = 0.02, col = c("black", "red", "blue"))
# figure (b): the ratios g/f1 and g/f2
plot(x, y / y1, type = "l", ylim = c(0, 2), ylab = "", lty = 2, col = "red", main = "ratios")
lines(x, y / y2, lty = 3, col = "blue")
legend("topright", legend = 1:2, lty = 2:3, inset = 0.02, col = c("red", "blue"))
The importance function with the most nearly constant ratio g(x)/f(x) appears to be $f_2$, as the figure shows.
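This matters because of the importance sampling variance identity $$\operatorname{Var}(\hat\theta)=\frac{1}{m}\operatorname{Var}_f\left(\frac{g(X)}{f(X)}\right),$$ which is small when $g/f$ is nearly constant on the support of $f$.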
g <- function(x) {
  x^2 * exp(-x^2 / 2) / sqrt(2 * pi) * (x > 1)
}
m <- 10000
theta.hat <- se <- numeric(2)
x <- rexp(m, 1)        # using f1
fg <- g(x) / exp(-x)
theta.hat[1] <- mean(fg)
se[1] <- sd(fg)
u <- runif(m)
x <- 1 / (1 - u)       # using f2 (inverse transform for f2(x) = 1/x^2)
fg <- g(x) * x^2
theta.hat[2] <- mean(fg)
se[2] <- sd(fg)
rbind(theta.hat, se)
Consistent with the conclusion above, $f_2$ is better because it gives the smaller variance.
Obtain a Monte Carlo estimate of $\int_1^\infty\frac{x^2}{\sqrt{2\pi}} e^{-x^2/2}dx$ by importance sampling.
step1
(5.13 cont.) We choose $f(x)=\frac{1}{x^2},x \in (1,+\infty)$ as the importance function for $g(x)=\frac{x^2}{\sqrt{2\pi}} e^{-x^2/2},x\in(1,+\infty)$.
step2
$$\int_1^{\infty}\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx=E_f\left[\frac{g(X)}{f(X)}\right]$$
The code and the result are shown below:
g <- function(x) {
  x^2 * exp(-x^2 / 2) / sqrt(2 * pi) * (x > 1)
}
m <- 1e4
u <- runif(m)
x <- 1 / (1 - u)   # using f2
fg <- g(x) * x^2
theta.hat <- mean(fg)
print(theta.hat)
theta <- integrate(g, 1, Inf)
theta
Using the integrate function to compute the theoretical value, we see that the importance sampling estimate is quite close to it.
Let X be a non-negative random variable with $\mu = E[X] < \infty$. For a random sample $x_1,...,x_n$ from the distribution of X, the Gini ratio is defined by $$G = \frac{1}{2n^2\mu}\sum_i\sum_j|x_i-x_j|.$$ The Gini ratio is applied in economics to measure inequality in income distribution. Note that G can be written in terms of the order statistics $x_{(i)}$ as $$G = \frac{1}{n^2\mu}\sum_{i=1}^n(2i - n - 1)x_{(i)}.$$ If the mean is unknown, let $\hat{G}$ be the statistic $G$ with $\mu$ replaced by $\overline{x}$. Estimate by simulation the mean, median and deciles of $\hat{G}$ if $X$ is standard lognormal. Repeat the procedure for the uniform distribution and Bernoulli(0.1). Also construct density histograms of the replicates in each case.
step1 Generate m replicates of $\hat{G}$ by repeatedly drawing a sample of size n from the target distribution and computing the Gini statistic from the order statistics.
step2 Compute the mean, the median $\hat{G}_{(m/2)}$, and the deciles $\hat{G}_{(im/10)}, i = 1,\ldots,9$, from the ordered replicates.
step3 Repeat the procedure with x drawn from the uniform distribution and from Bernoulli(0.1).
step4 Plot the histograms.
The code and the result are shown below:
m <- 1000 n <- 1000 Gini <- function(m,n){ G_hat1<-G_hat2<-G_hat3 <- numeric(m) for (j in 1:m) { x1 <- rlnorm(n) x2 <- runif(n) x3 <- rbinom(n,size=1,prob=0.1) u1 <- mean(x1);u2 <- mean(x2);u3 <- mean(x3) x1sort <- sort(x1);x2sort <- sort(x2);x3sort <- sort(x3) g1<-g2<-g3 <- numeric(m) for (i in 1:m){ g1[i] <- (2*i-m-1)*x1sort[i] g2[i] <- (2*i-m-1)*x2sort[i] g3[i] <- (2*i-m-1)*x3sort[i] } G_hat1[j] <- sum(g1)/m^2/u1 G_hat2[j] <- sum(g2)/m^2/u2 G_hat3[j] <- sum(g3)/m^2/u3 } print(c(mean(G_hat1),median(G_hat1))) print(quantile(G_hat1,seq(0,1,0.1))) hist(G_hat1,prob=TRUE,main = "X from lognorm distribution") print(c(mean(G_hat2),median(G_hat2))) print(quantile(G_hat2,seq(0,1,0.1))) hist(G_hat2,prob=TRUE,main = "X from uniform distribution") print(c(mean(G_hat3),median(G_hat3))) print(quantile(G_hat3,seq(0,1,0.1))) hist(G_hat3,prob=TRUE,main = "X from Bionomial(0.1) distribution") } Gini(m,n)
As shown above, the results differ greatly because X comes from different populations. The Gini ratio measures the inequality among values of a frequency distribution (for example, levels of income): the larger the Gini ratio, the higher the inequality. When X comes from the uniform distribution G is the lowest, and when X comes from the Bernoulli(0.1) distribution G is the highest.
Construct an approximate 95% confidence interval for the Gini ratio $Y = E[G]$ if X is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.
step1 Generate the replicates $\hat{G}^{(j)}, j = 1,\ldots,m$, by repeatedly sampling from the lognormal distribution and computing the Gini statistic.
step2
According to large-sample properties, I compute the approximate 95% CI with the function GCI(m, n, a, b, alpha): m is the number of replicates, n is the sample size of X (drawn from the lognormal distribution), and a and b are the (unknown) parameters of the lognormal distribution. The output is the $(1-\alpha)100\%$ confidence interval.
step3 Assess the coverage rate with a Monte Carlo experiment.
The code and the result are shown below:
GCI <- function(m, n, a, b, alpha) {
  G <- numeric(m)
  for (i in 1:m) {
    # function to calculate the Gini ratio for a lognormal sample
    Gi <- function(n, a, b) {
      y <- sort(rnorm(n, a, b))    # x = exp(y) is lognormal(a, b)
      mu <- exp(a + b^2 / 2)       # mean of X
      Gini <- 1 / n^2 / mu * sum((2 * (1:n) - n - 1) * exp(y))
      return(Gini)
    }
    G[i] <- Gi(n, a, b)
  }
  LCL <- quantile(G, alpha / 2)
  UCL <- quantile(G, 1 - alpha / 2)
  CI <- c(LCL, UCL)
  count <- sum(G <= UCL) - sum(G < LCL)
  coverage <- count / m            # coverage rate over the m replicates
  return(c(CI, coverage))
}
GCI(1000, 1000, 0, 1, 0.05)
The output shows the CI and coverage rate.
Tests for association based on Pearson product moment correlation $\rho$, Spearman's rank correlation coefficient $\rho_s$, or Kendall's coefficient $\tau$, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution $(X, Y)$ such that $X$ and $Y$ are dependent) such that at least one of the nonparametric tests has better empirical power than the correlation test against this alternative.
step1 First, we construct the covariance matrix of the bivariate normal distribution; then we compute the empirical rejection rate of each test and compare them. With $\rho = 0$ this rate is the empirical type I error, and with $\rho = 0.6$ it is the empirical power.
The code and the result are shown below:
library(MASS) powers <- function(N,alpha,mu,sigma){ p1<-p2<-p3<-numeric(N) for(i in 1:N){ bvn<- mvrnorm(N, mu = mu, Sigma = sigma ) # mvrnorm function, independent x<-bvn[,1];y<-bvn[,2] p1[i]<-cor.test(x,y,method="spearman")$p.value p2[i]<-cor.test(x,y,method="kendall")$p.value p3[i]<-cor.test(x,y,method="pearson")$p.value } power<-c(mean(p1<=alpha),mean(p2<=alpha),mean(p3<=alpha)) return(power) } set.seed(123) N <- 500 # Number of random samples alpha<-0.05 # Target parameters for univariate normal distributions rho <- 0 mu1 <- 0; s1 <- 1 mu2 <- 1; s2 <- 4 # Parameters for bivariate normal distribution mu <- c(mu1,mu2) # Mean sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2,2) # Covariance matrix powers(N,alpha,mu,sigma)
library(MASS)
powers <- function(N, alpha, mu, sigma) {
  p1 <- p2 <- p3 <- numeric(N)
  for (i in 1:N) {
    bvn <- mvrnorm(N, mu = mu, Sigma = sigma)   # bivariate normal sample
    x <- bvn[, 1]; y <- bvn[, 2]
    p1[i] <- cor.test(x, y, method = "spearman")$p.value
    p2[i] <- cor.test(x, y, method = "kendall")$p.value
    p3[i] <- cor.test(x, y, method = "pearson")$p.value
  }
  power <- c(mean(p1 <= alpha), mean(p2 <= alpha), mean(p3 <= alpha))
  print(round(power, 3))
}
N <- 500        # number of random samples
alpha <- 0.05
# target parameters for the univariate normal distributions
rho <- .6
mu1 <- 0; s1 <- 1
mu2 <- 1; s2 <- 4
# parameters for the bivariate normal distribution
mu <- c(mu1, mu2)                                            # mean
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2), 2, 2)   # covariance matrix
powers(N, alpha, mu, sigma)
We find that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal, and the powers change accordingly when $\rho$ is set to 0.6. The questions answered next are:
Let $X$ be a non-negative random variable with $\mu = E[X] < \infty$. For a random sample $x_1$ ,...,$x_n$ from the distribution of X, the Gini ratio is defined by $${G}=\frac{1}{2n^2\mu}\sum_{j=1}^n \sum_{i=1}^n |x_i-x_j|$$ The Gini ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that G can be written in terms of the order statistics $x_{(i)}$ as $${G}=\frac{1}{n^2\mu}\sum_{i=1}^n (2i-n-1)x_{(i)}$$ If the mean is unknown, let $\hat{G}$ be the statistic $G$ with $\mu$ replaced by $\bar{x}$. Estimate by simulation the mean, median and deciles of $\hat{G}$ if $X$ is standard lognormal. Repeat the procedure for the uniform distribution and Bernoulli(0.1). Also construct density histograms of the replicates in each case.
Construct an approximate 95% confidence interval for the Gini ratio y = E[G] if X is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.
Tests for association based on Pearson product moment correlation $\rho$, Spearman's rank correlation coefficient $\rho_s$, or Kendall's coefficient $\tau$, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution $(X,Y)$ such that $X$ and $Y$ are dependent) such that at least one of the nonparametric tests has better empirical power than the correlation test against this alternative.
The answer code is as follows:
#The answer of first question: m <- 1000 n <- 20 g <- numeric(m) medians <- means <- numeric(3) y <- gini1 <- gini2 <- gini3 <- numeric(m) for (i in 1:m) { x <- sort(rlnorm(n)) xmean <- mean(x) for (j in 1:n) { y[j] <- (2*j-n-1)*x[j] } gini1[i] <- 1/n^2/xmean*sum(y[1:n]) } for (i in 1:m) { x <- sort(runif(n)) xmean <- mean(x) for (j in 1:n) { y[j] <- (2*j-n-1)*x[j] } gini2[i] <- 1/n^2/xmean*sum(y[1:n]) } for (i in 1:m) { x <- sort(rbinom(n,c(0,1),c(0.1,0.9))) xmean <- mean(x) for (j in 1:n) { y[j] <- (2*j-n-1)*x[j] } gini3[i] <- 1/n^2/xmean*sum(y[1:n]) } par(mfrow=c(3,1)) par(pin=c(2,1)) hist(gini1,prob = TRUE) lines(density(gini1),col = "red",lwd = 2) hist(gini2,prob = TRUE) lines(density(gini2),col = "blue",lwd = 2) hist(gini3,prob = TRUE) lines(density(gini3),col = "green",lwd = 2) medians[1] <- median(gini1) medians[2] <- median(gini2) medians[3] <- median(gini3) medians quantiles1 <- as.vector(quantile(gini1,seq(0.1,0.9,0.1))) quantiles1 quantiles2 <- as.vector(quantile(gini2,seq(0.1,0.9,0.1))) quantiles2 quantiles3 <- as.vector(quantile(gini3,seq(0.1,0.9,0.1))) quantiles3 means[1] <- mean(gini1) means[2] <- mean(gini2) means[3] <- mean(gini3) means #The answer of second question: m <- 1000 n <- 20 gini <- numeric(m) #Get A series of gini ratios genarating from a lognormal distribution for (i in 1:m) { x <- sort(rlnorm(n)) xmean <- mean(x) for (j in 1:n) { y[j] <- (2*j-n-1)*x[j] } gini[i] <- 1/n^2/xmean*sum(y[1:n]) } #get the lower confidence interval LCI<- mean(gini)-sd(gini)*qt(0.975,m-1) #get the upper confidence interval UCI <- mean(gini)+sd(gini)*qt(0.975,m-1) #get the confidence interval CI <- c(LCI,UCI) print(CI) #calculate the converage rte covrate<-sum(I(gini>CI[1]&gini<CI[2]))/m print(covrate) #The answer of third question: #We need load the MASS package library(MASS) mean <- c(0, 0) sigma <- matrix( c(25,5, 5, 25),nrow=2, ncol=2) m <- 1000 #Calculate the power using pearson correlation test by setting the parameter method as pearson pearvalues <- replicate(m, expr = { mydata1 <- mvrnorm(50, mean, sigma) x <- mydata1[,1] y <- mydata1[,2] peartest <- cor.test(x,y,alternative = "two.sided", method = "pearson") peartest$p.value } ) power1 <- mean(pearvalues <= 0.05) power1 #Calculate the power using spearman correlation test by setting the parameter method as spearman spearvalues <- replicate(m, expr = { mydata2 <- mvrnorm(50, mean, sigma) x <- mydata2[,1] y <- mydata2[,2] speartest <- cor.test(x,y,alternative = "two.sided", method = "spearman") speartest$p.value } ) power2 <- mean(spearvalues <= 0.05) power2 #Calculate the power using kendall correlation test by setting the parameter method as kendall kenvalues <- replicate(m, expr = { mydata3 <- mvrnorm(50, mean, sigma) x <- mydata3[,1] y <- mydata3[,2] kentest <- cor.test(x,y,alternative = "two.sided", method = "kendall") kentest$p.value } ) power3 <- mean(kenvalues <= 0.05) power3 R <- 1000 m <- 1000 #Calculate the power using pearson correlation test by setting the parameter method as pearson pearvalues <- replicate(m, expr = { u <- runif(R/2,0,10) v <- sin(u) peartest <- cor.test(u,v,alternative = "two.sided", method = "pearson") peartest$p.value } ) power1 <- mean(pearvalues <= 0.05) power1 #Calculate the power using spearman correlation test by setting the parameter method as spearman spearvalues <- replicate(m, expr = { u <- runif(R/2,0,10) v <- sin(u) speartest <- cor.test(u,v,alternative = "two.sided", method = "spearman") speartest$p.value } ) power2 <- mean(spearvalues <= 0.05) power2 #Calculate the power using 
# the kendall correlation test by setting the parameter method as kendall
kenvalues <- replicate(m, expr = {
  u <- runif(R/2, 0, 10)
  v <- sin(u)
  kentest <- cor.test(u, v, alternative = "two.sided", method = "kendall")
  kentest$p.value
})
power3 <- mean(kenvalues <= 0.05)
power3
The law school data set law contains LSAT and GPA for 15 law schools. This data set is a random sample from the universe of 82 law schools in law82.
set.seed(1)
library(bootstrap)  # for the law data
a <- matrix(c(round(law$LSAT, digits = 0), law$GPA), nrow = 2, byrow = TRUE)
dimnames(a) <- list(c("LSAT", "GPA"), 1:15)
knitr::kable(a)
Compute a jackknife estimate of the bias and the standard error of the correlation statistic.
step1
Define the ith jackknife sample $x_{(i)}$ to be the subset of $x$ that leaves out the ith observation $x_i$.
That is, $x_{(i)} = (x_1,\ldots,x_{i-1}, x_{i+1},\ldots,x_n)$.
step2
Given $\hat{\theta} = T_n(x)$, define the ith jackknife replicate $\hat{\theta}_{(i)} = T_{n-1}(x_{(i)}), i = 1,\ldots,n$.
step3
$bias = E(\hat{\theta}) - \theta$, and the jackknife estimate of the bias is $\widehat{bias}_{jack}=(n-1)(\overline{\hat{\theta}_{(\cdot)}}-\hat{\theta})$
$\widehat{se}_{jack}=\sqrt{\frac{n-1}{n} \sum_{i=1}^n(\hat{\theta}_{(i)}-\overline{\hat{\theta}_{(\cdot)}})^2}$
The jackknife estimate of $\theta$ is $\hat{\theta}_{jack}=\hat{\theta}-\widehat{bias}_{jack}$
The code and the result are shown below:
library(bootstrap)
x <- law$LSAT
y <- law$GPA
cor <- cor(x, y)
n <- length(x)
cor_jack <- numeric(n)   # storage for the leave-one-out replicates
for (i in 1:n)
  cor_jack[i] <- cor(x[-i], y[-i])
bias.jack <- (n - 1) * (mean(cor_jack) - cor)
se.jack <- sqrt((n - 1) / n * sum((cor_jack - mean(cor_jack))^2))
print(list(cor = cor, est = cor - bias.jack, bias = bias.jack,
           se = se.jack, cv = bias.jack / se.jack))
We find that over the 15 jackknife replicates, $\widehat{bias}_{jack}$ is very small and the estimated coefficient of variation bias/se is far below 0.25 in absolute value.
Refer to the air-conditioning data set aircondit provided in the boot package. The 12 observations are the times in hours between failures of air conditioning equipment: 3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487. Assume that the times between failures follow an exponential model Exp($\lambda$).
Compute 95% bootstrap confidence intervals for the mean time between failures $\frac{1}{\lambda}$ by the standard normal, basic, percentile, and BCa methods. Compare the intervals and explain why they may differ.
step1 $\frac{1}{\lambda}$ is the mean time between failures, so we resample the observed times and compute the mean of each bootstrap sample.
step2 Use the bootstrap replicates to estimate the bias and standard error of the estimate of $\frac{1}{\lambda}$.
step3 Use the R function boot.ci (package boot) with the normal, basic, percentile, and BCa interval types.
The code and the result are shown below:
#Bootstrap library(boot) data(aircondit,package = "boot") air <- aircondit$hours theta.hat <- mean(air) #set up the bootstrap B <- 2000 #number of replicates n <- length(air) #sample size theta.b <- numeric(B) #storage for replicates #bootstrap estimate of standard error of R for (b in 1:B) { #randomly select the indices i <- sample(1:n, size = n, replace = TRUE) dat <- air[i] #i is a vector of indices theta.b[b] <- mean(dat) } bias.theta <- mean(theta.b - theta.hat) se <- sd(theta.b) print(list(bias.b = bias.theta,se.b = se)) theta.boot <- function(dat,ind) { #function to compute the statistic mean(dat[ind]) } boot.obj <- boot(air, statistic = theta.boot, R = 2000) print(boot.obj)
Here I use both the loop I wrote and the boot function from the package "boot", and obtain the estimate together with its bootstrap bias and standard error.
print(boot.ci(boot.obj, type=c("basic","norm","perc","bca")))
As shown, the intervals differ noticeably because the methods differ in whether they are transformation respecting and in their order of accuracy.
The BCa confidence intervals are transformation respecting and have second order accuracy. Transformation respecting means that if $(\hat{\theta}^*_{\alpha_1}, \hat{\theta}^*_{\alpha_2})$ is a confidence interval for $\theta$, and $t(\theta)$ is a transformation of the parameter $\theta$, here $\theta = \frac{1}{\lambda}$, then the corresponding interval for $t(\theta)$ is $(t(\hat{\theta}^*_{\alpha_1}),t(\hat{\theta}^*_{\alpha_2}))$. A confidence interval is first order accurate if the error tends to zero at rate $1/\sqrt{n}$ for sample size $n$, and second order accurate if the error tends to zero at rate $1/n$. The bootstrap percentile interval is transformation respecting but only first order accurate. The standard normal confidence interval is neither transformation respecting nor second order accurate.
Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.
We use the eigen function to obtain the eigenvalues and then apply the jackknife to obtain the estimates of the bias and standard error of $\hat{\theta}$: the jackknife estimate of the bias $E(\hat\theta)-\theta_0$ is $$(n-1)(\overline{\hat\theta_{(\cdot)}}-\hat\theta),$$ and the jackknife estimate of $se(\hat\theta)$ is $$\sqrt{\frac{n-1}{n}\sum_{i=1}^n(\hat\theta_{(i)}-\overline{\hat\theta_{(\cdot)}})^2}.$$
The code and the result are shown below:
# Jackknife: compute the leave-one-out replicates
library(bootstrap)
data(scor, package = "bootstrap")
n <- nrow(scor)   # number of observations (88), not the number of columns
theta.jack <- numeric(n)
dat <- cbind(scor$mec, scor$vec, scor$alg, scor$ana, scor$sta)
sigma.hat <- cov(dat)
theta.hat <- eigen(sigma.hat)$values[1] / sum(eigen(sigma.hat)$values)
for (i in 1:n) {
  sigma.jack <- cov(dat[-i, ])
  theta.jack[i] <- eigen(sigma.jack)$values[1] / sum(eigen(sigma.jack)$values)
}
# jackknife estimate of bias
bias.jack <- (n - 1) * (mean(theta.jack) - theta.hat)
# jackknife estimate of standard error
se.j <- sqrt((n - 1) * mean((theta.jack - mean(theta.jack))^2))
print(list(bias.jack = bias.jack, se.jack = se.j))
The numerical results show that the bias is small and the standard error is also quite small, so the estimate is accurate.
In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.
step1
We delete two observations at a time from the data set, fit each candidate model on the remaining observations, and predict the held-out response.
For $k,j = 1,\ldots,n$, let observation $(x_k, y_k),(x_j,y_j),k\neq j$ be the test points and use the remaining observations to fit the model.
Each observation is used as a test point $n-1$ times.
Compute the predicted response $\hat{y}_{i,k} = \hat{\beta}_0 + \hat{\beta}_1 x_{k}$ for the test point, where the model is fitted without observations $i$ and $k$.
Compute the prediction error $e_{i,k} = y_{k} - \hat{y}_{i,k}$.
step2
Estimate the mean of the squared prediction errors $\hat{\sigma}_{\epsilon}^2=\frac{1}{n(n-1)}\sum_{i \neq k} e^2_{i,k}$
step3
Compare the calculation accuracy of the two methods.
The code and the result are shown below:
ptm <- proc.time() library(DAAG); attach(ironslag) n <- length(magnetic) #in DAAG ironslag e1 <- e2 <- e3 <- e4 <- matrix(0,n,n) # yhat1 <- yhat2 <- yhat3 <- yhat4 <- matrix(0,n,n) # logyhat3 <- logyhat4 <- matrix(0,n,n) # for n-fold cross validation # fit models on leave-two-out samples for (i in 1:n) { for (j in 1:n){ if (j != i){ y <- magnetic[c(-i,-j)] x <- chemical[c(-i,-j)] J1 <- lm(y ~ x) yhat1 <- J1$coef[1] + J1$coef[2] * chemical[j] e1[i,j] <- magnetic[j] - yhat1 J2 <- lm(y ~ x + I(x^2)) yhat2 <- J2$coef[1] + J2$coef[2] * chemical[j] + J2$coef[3] * chemical[j]^2 e2[i,j] <- magnetic[j] - yhat2 J3 <- lm(log(y) ~ x) logyhat3 <- J3$coef[1] + J3$coef[2] * chemical[j] yhat3 <- exp(logyhat3) e3[i,j] <- magnetic[j] - yhat3 J4 <- lm(log(y) ~ log(x)) logyhat4 <- J4$coef[1] + J4$coef[2] * log(chemical[j]) yhat4 <- exp(logyhat4) e4[i,j] <- magnetic[j]- yhat4 } } } ltocv <- c(sum(e1^2), sum(e2^2), sum(e3^2),sum(e4^2))/(n*(n-1)) ltocv.ptm <- proc.time() - ptm # same function with system.time(exp) print(list("timeconsuming_of_ltocv"=ltocv.ptm[1:3])) ltocv
### Example 7.18 (Model selection: Cross validation) # Example 7.17, cont. ptm <- proc.time() n <- length(magnetic) #in DAAG ironslag e1 <- e2 <- e3 <- e4 <- numeric(n) # for n-fold cross validation # fit models on leave-one-out samples for (k in 1:n) { y <- magnetic[-k] x <- chemical[-k] J1 <- lm(y ~ x) yhat1 <- J1$coef[1] + J1$coef[2] * chemical[k] e1[k] <- magnetic[k] - yhat1 J2 <- lm(y ~ x + I(x^2)) yhat2 <- J2$coef[1] + J2$coef[2] * chemical[k] + J2$coef[3] * chemical[k]^2 e2[k] <- magnetic[k] - yhat2 J3 <- lm(log(y) ~ x) logyhat3 <- J3$coef[1] + J3$coef[2] * chemical[k] yhat3 <- exp(logyhat3) e3[k] <- magnetic[k] - yhat3 J4 <- lm(log(y) ~ log(x)) logyhat4 <- J4$coef[1] + J4$coef[2] * log(chemical[k]) yhat4 <- exp(logyhat4) e4[k] <- magnetic[k] - yhat4 } loocv <- c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2)) loocv.ptm <- proc.time() - ptm print(list("timeconsuming_of_loocv"=loocv.ptm[1:3]))
a <- data.frame(rbind(loocv, ltocv))
row.names(a) <- c("loocv", "ltocv")
names(a) <- c("L1", "L2", "L3", "L4")
knitr::kable(a)
This table shows the estimates of the mean squared prediction error $\hat{\sigma}_\epsilon^2$ for the two methods; in the leave-two-out code, the predictions are computed for every ordered pair of held-out observations.
According to the numerical results, the computation time of LTOCV increases from about 0.26 s to more than 10 s because of the pairwise selection of test points, but the mean squared prediction error does not decrease significantly.
The prediction error criterion of both methods shows that the quadratic model is best fit for the data. The fitted model is
$$\hat{Y} = 24.49262 - 1.39334X + 0.05452X^2$$
Implement the two-sample Cramér-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.
step1
Consider the null hypothesis:$$H_0: F=G$$
The Cramér-von Mises statistic is $$W_2=\frac{mn}{(m+n)^2}\left[\sum_{i=1}^n(F_n(x_i)-G_m(x_i))^2+\sum_{j=1}^m(F_n(y_j)-G_m(y_j))^2\right]$$
where $F_n$ is the ecdf of the sample $x_1,\ldots,x_n$ and $G_m$ is the ecdf of the
sample $y_1,\ldots,y_m$. Large values of $W_2$ support alternative $F \neq G$.
step2 Let $r_1,r_2,\ldots,r_n$ be the ranks of the x's in the combined sample, and let $s_1,s_2,\ldots,s_m$ be the ranks of the y's in the combined sample. Then $$T = \frac{nm}{n+m}\,\omega^2 = \frac{U}{nm(n+m)}-\frac{4mn - 1}{6(n+m)},$$ where U is defined as $$U = n \sum_{i=1}^n(r_i-i)^2 + m\sum_{j=1}^m (s_j-j)^2,$$ with the ranks taken in increasing order. This rank formulation of the two-sample Cramér-von Mises statistic can be found on Wikipedia.
step3 Use the permutation distribution of the statistic to compute a p-value and decide whether the null hypothesis is rejected.
The code and the result are shown below:
set.seed(1)
attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
detach(chickwts)
n <- length(x)   # sample 1 size
m <- length(y)   # sample 2 size
z <- c(x, y)     # pooled sample
N <- n + m
K <- 1:N
R <- 999         # number of permutation replicates

# two-sample Cramer-von Mises statistic based on ranks in the pooled sample
cvm.stat <- function(x, y) {
  n <- length(x); m <- length(y); N <- n + m
  r <- sort(rank(c(x, y))[1:n])         # ordered ranks of the x's
  s <- sort(rank(c(x, y))[(n + 1):N])   # ordered ranks of the y's
  U <- n * sum((r - 1:n)^2) + m * sum((s - 1:m)^2)
  U / (n * m * N) - (4 * m * n - 1) / (6 * N)
}

reps0 <- cvm.stat(x, y)    # observed statistic
reps <- numeric(R)         # storage for permutation replicates
for (b in 1:R) {
  k <- sample(K, size = n, replace = FALSE)   # indices for the first sample
  reps[b] <- cvm.stat(z[k], z[-k])
}
p <- mean(c(reps0, reps) >= reps0)
p
Comparing the permutation p-value with the 0.05 significance level, the p-value is large, so we cannot reject $H_0$: there is no significant difference between the soybean and linseed weight distributions.
Design experiments for evaluating the performance of the NN, energy, and ball methods in various situations.
step1 In this simulation experiment, I compare the empirical power of the third-nearest-neighbor (NN) test, the energy test, and the ball divergence test.
step2
First, unequal variances and equal expectations: $$F_1=N_2(\mu = (0, 0), \Sigma = I_2), \quad F_2 = N_2(\mu = (0, 0), \Sigma = 2I_2)$$ Second, unequal variances and unequal expectations: $$F_1=N_2(\mu = (0, 0), \Sigma = I_2), \quad F_2 = N_2(\mu = (0, 1), \Sigma = 2I_2)$$ Third, non-normal distributions: a t distribution with 1 df (heavy-tailed distribution) versus a bimodal distribution (mixture of two normal distributions)
$$F_1=t_1, F_2 = N_1(0,1)*N_1(1,2) $$
And finally, unbalanced samples (say, 1 case versus 10 controls)
$$F_1=t_1,F_2 = N_2((0,0),I_2) $$
The code and the result are shown below:
library(RANN) library(energy) library(Ball) library(boot) m <- 50; k<-3; p<-2 n1 <- n2 <- 15;N = c(n1,n2); R<-999 Tn <- function(z, ix, sizes,k) { n1 <- sizes[1] n2 <- sizes[2] n <- n1 + n2 z <- z[ix, ] o <- rep(0, NROW(z)) z <- as.data.frame(cbind(z, o)) NN <- nn2(z, k=k+1) #uses package RANN block1 <- NN$nn.idx[1:n1,-1] block2 <- NN$nn.idx[-(1:n1),-1] i1 <- sum(block1 < n1 + 0.5) i2 <- sum(block2 > n1 + 0.5) return((i1 + i2) / (k * n)) } eqdist.nn <- function(z,sizes,k){ boot.obj <- boot(data=z,statistic=Tn,R=R, sim = "permutation", sizes = sizes,k=k) ts <- c(boot.obj$t0,boot.obj$t) p.value <- mean(ts>=ts[1]) list(statistic=ts[1],p.value=p.value) } p.values <- matrix(NA,m,3) #Unequal variances and equal expectations for(i in 1:m){ x <- matrix(rnorm(n1*2),ncol=2); y <- matrix(rnorm(n1*2,0,2),ncol=2); z <- rbind(x,y) p.values[i,1] <- eqdist.nn(z,N,k)$p.value #NN p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value #energy method p.values[i,3] <- bd.test(x=x,y=y,R=R,seed=i*12345)$p.value #ball method } alpha <- 0.05 #confidence level pow <- colMeans(p.values<alpha) pow
These alternatives differ in variance only, and the empirical evidence above suggests that the ball method is the most powerful.
#Unequal variances and unequal expectations for(i in 1:m){ x <- matrix(rnorm(n1*2),ncol=2); y <- cbind(rnorm(n2,0,2),rnorm(n2,1,2)); z <- rbind(x,y) p.values[i,1] <- eqdist.nn(z,N,k)$p.value #NN p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value #energy method p.values[i,3] <- bd.test(x=x,y=y,R=R,seed=i*12345)$p.value #ball method } alpha <- 0.05 #confidence level pow <- colMeans(p.values<alpha) pow
The empirical power ranges from about 0.38 to 0.96, and the ball method is again the best.
#Non-normal distributions: t distribution with 1 df (heavy-tailed distribution), bimodel distribution (mixture of two normal distributions) for(i in 1:m){ x <- matrix(rt(n1*2,1), ncol = 2) y <- cbind(rnorm(n2),rnorm(n2,1,2)) z <- rbind(x,y) p.values[i,1] <- eqdist.nn(z,N,k)$p.value #NN p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value #energy method p.values[i,3] <- bd.test(x=x,y=y,R=R,seed=i*12345)$p.value #ball method } alpha <- 0.05 #confidence level pow <- colMeans(p.values<alpha) pow
In this case the energy method performs best, and the ball method is also competitive.
#Unbalanced samples (say, 1 case versus 10 controls) n1 <- 10; n2 <- 100; N = c(n1,n2) for(i in 1:m){ x <- matrix(rt(n1*2,1),ncol=2); y <- matrix(rnorm(n2*2),ncol=2); z <- rbind(x,y) p.values[i,1] <- eqdist.nn(z,N,3)$p.value #NN p.values[i,2] <- eqdist.etest(z,sizes=N,R=R)$p.value #energy method p.values[i,3] <- bd.test(x=x,y=y,R=R,seed=i*12345)$p.value #ball method } alpha <- 0.05 #confidence level pow <- colMeans(p.values<alpha) pow
With unbalanced samples, the energy test is best and the NN test follows.
Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 values of the chain, and compare the deciles of the generated observations with the deciles of the standard Cauchy distribution (see qcauchy or qt with df=1). Recall that a $Cauchy(\theta,\eta)$ distribution has density function $$f(x)=\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)}, \quad -\infty < x < \infty, \ \theta > 0.$$
The standard Cauchy has the $Cauchy(\theta = 1, \eta = 0)$ density. (Note that the standard Cauchy density is equal to the Student t density with one degree of freedom.)
step1
Choose a proposal distribution $g(\cdot|X_t)$. Here I use a normal distribution centered at the current state.
step2 Generate $X_0$ from a normal distribution, then repeat the Metropolis-Hastings step to generate the chain.
step3
Plot the chain and compare the sample quantiles with the theoretical quantiles.
The code and the result are shown below:
# Cauchy distribution density function
f <- function(x, theta, eta) {
  stopifnot(theta > 0)
  return(1 / (theta * pi * (1 + ((x - eta) / theta)^2)))
}
m <- 10000
theta <- 1; eta <- 0
x <- numeric(m)
x[1] <- rnorm(1)   # initial value from the normal proposal
k <- 0             # counts the rejected candidates
u <- runif(m)
for (i in 2:m) {
  xt <- x[i - 1]
  y <- rnorm(1, mean = xt)
  num <- f(y, theta, eta) * dnorm(xt, mean = y)
  den <- f(xt, theta, eta) * dnorm(y, mean = xt)
  if (u[i] <= num / den) {
    x[i] <- y
  } else {
    x[i] <- xt
    k <- k + 1   # y is rejected
  }
}
print(k)
In this run, roughly 20% of the candidate points are rejected, so the chain mixes reasonably well, although it could be improved.
plot(1:m, x, type = "l", main = "", ylab = "x")
# discard the burn-in sample
b <- 1001
y <- x[b:m]
a <- ppoints(100)
Qc <- qcauchy(a)      # quantiles of the standard Cauchy
Q <- quantile(y, a)   # sample quantiles after discarding the burn-in
qqplot(Qc, Q, main = "", xlim = c(-2, 2), ylim = c(-2, 2),
       xlab = "Standard Cauchy Quantiles", ylab = "Sample Quantiles")
hist(y, breaks = "scott", main = "", xlab = "", freq = FALSE)
lines(Qc, f(Qc, 1, 0))
From the plot, it appears that the sample quantiles are in approximate agreement with the theoretical quantiles.
Rao [220, Sec. 5g] presented an example on genetic linkage of 197 animals in four categories (also discussed in [67, 106, 171, 266]). The group sizes are (125, 18, 20, 34). Assume that the probabilities of the corresponding multinomial distribution are $$(\frac{1}{2}+\frac{\theta}{4},\frac{1-\theta}{4},\frac{1-\theta}{4},\frac{\theta}{4}).$$ Estimate the posterior distribution of $\theta$ given the observed sample, using one of the methods in this chapter.
$$f(GS|\theta) = (\frac{2 + \theta}{4})^{gs_1}(\frac{1-\theta}{4})^{gs_2}(\frac{1-\theta}{4})^{gs_3}(\frac{\theta}{4})^{gs_4}=(\frac{2 + \theta}{4})^{gs_1}(\frac{1-\theta}{4})^{gs_2 + gs_3}(\frac{\theta}{4})^{gs_4}$$
where gs denotes the vector of observed group sizes (the multinomial counts).
The code and the result are shown below:
gs <- c(125, 18, 20, 34)   # group sizes
m <- 5000
w <- .25
b <- 1001                  # burn-in
x <- numeric(m)
# the target density (posterior kernel)
prob <- function(y, gs) {
  if (y < 0 | y > 1)
    return(0)
  else
    return((1/2 + y/4)^gs[1] * ((1 - y)/4)^gs[2] * ((1 - y)/4)^gs[3] * (y/4)^gs[4])
}
u <- runif(m)          # for the accept/reject step
v <- runif(m, -w, w)   # proposal increments
x[1] <- .25
for (i in 2:m) {
  y <- x[i - 1] + v[i]
  if (u[i] <= prob(y, gs) / prob(x[i - 1], gs))
    x[i] <- y
  else
    x[i] <- x[i - 1]
}
theta.hat <- mean(x[b:m])
theta.hat
# compare the fitted group sizes with the observed counts
gs.hat <- sum(gs) * c((2 + theta.hat)/4, (1 - theta.hat)/4, (1 - theta.hat)/4, theta.hat/4)
round(gs.hat)
The fitted group sizes based on theta.hat are close to the observed counts, so the posterior mean is a reasonable estimate of $\theta$.
Exercise 9.6 presented an example on genetic linkage of 197 animals in four categories. The group sizes are (125, 18, 20, 34). Assume that the probabilities of the corresponding multinomial distribution are $$(\frac{1}{2}+\frac{\theta}{4},\frac{1-\theta}{4},\frac{1-\theta}{4},\frac{\theta}{4}).$$ Use the Gelman-Rubin method to monitor convergence of the chain, and run the chain until the chain has converged approximately to the target distribution according to $\hat{R}$ < 1.2.
step1 The Gelman-Rubin method of monitoring convergence of a M-H chain is based on comparing the behavior of several generated chains with respect to the variance of one or more scalar summary statistics.
step2 The Gelman-Rubin statistic is the estimated potential scale reduction $\sqrt{\hat{R}}=\sqrt{\frac{\widehat{\operatorname{Var}}(\psi)}{W}}$.
step3 The factor $\hat{R}$ decreases to 1 as the length of the chain tends to infinity, so $\hat{R}$ should be close to 1 if the chains have approximately converged to the target distribution. Here we choose 1.1 and 1.2.
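For reference, with k chains of length n and scalar summaries $\psi_{ij}$, the quantities computed in the code below are $$B = \frac{n}{k-1}\sum_{i=1}^{k}(\bar\psi_{i\cdot}-\bar\psi_{\cdot\cdot})^2, \qquad W = \frac{1}{k}\sum_{i=1}^{k} s_i^2, \qquad \widehat{\operatorname{Var}}(\psi) = \frac{n-1}{n}W + \frac{1}{n}B,$$ and $\hat{R} = \widehat{\operatorname{Var}}(\psi)/W$, where $s_i^2$ is the within-chain sample variance of $\psi_{i\cdot}$.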
### Gelman-Rubin method of monitoring convergence Gelman.Rubin <- function(psi) { # psi[i,j] is the statistic psi(X[i,1:j]) # for chain in i-th row of X psi <- as.matrix(psi) n <- ncol(psi) k <- nrow(psi) psi.means <- rowMeans(psi) #row means B <- n * var(psi.means) #between variance est. psi.w <- apply(psi, 1, "var") #within variances W <- mean(psi.w) #within est. v.hat <- W*(n-1)/n + (B/n) #upper variance est. r.hat <- v.hat / W #G-R statistic return(r.hat) } gs <- c(125,18,20,34) #group size #m <- 5000 #w <- .25 b <- 1001 #the target density prob <- function(y, gs) { if (y < 0 | y >1) return (0) else return((1/2+y/4)^gs[1] *((1-y)/4)^gs[2]*((1-y)/4)^gs[3]*(y/4)^gs[4]) } chain <- function(w, N) { #generates a Metropolis chain for Normal(0,1) #with Normal(X[t], sigma) proposal distribution #and starting value X1 x <- rep(0, N) u <- runif(N) #for accept/reject step v <- runif(N, -w, w) #proposal distribution x[1] <- w for (i in 2:N) { y <- x[i-1] + v[i] if (u[i] <= prob(y, gs) / prob(x[i-1], gs)) x[i] <- y else x[i] <- x[i-1] } return(x) } k <- 4 #number of chains to generate n <- 15000 #length of chains w <- c(0.05,0.25,0.5,0.7,0.9) #generate the chains X <- matrix(0, nrow=k, ncol=n) for (i in 1:k) X[i, ] <- chain(w[i], n) #compute diagnostic statistics psi <- t(apply(X, 1, cumsum)) for (i in 1:nrow(psi)) psi[i,] <- psi[i,] / (1:ncol(psi)) print(Gelman.Rubin(psi)) #plot psi for the four chains par(mfrow=c(2,2)) for (i in 1:k) plot(psi[i, (b+1):n], type="l", xlab=i, ylab=bquote(psi)) par(mfrow=c(1,1)) #restore default #plot the sequence of R-hat statistics rhat <- rep(0, n) for (j in (b+1):n) rhat[j] <- Gelman.Rubin(psi[,1:j]) plot(rhat[(b+1):n], type="l", xlab="",ylim = range(1,1.4), ylab="R") abline(h=1.2, lty=2) abline(h=1.1,lty = 2)
The plot of $\hat{R}$ is shown for iterations 1001 to 15000. From this plot it is evident that the chains converge: the value of $\hat{R}$ falls below 1.2 within about 500 iterations and below 1.1 within about 2000 iterations.
Find the intersection points A(k) in $(0,\sqrt{k})$ of the curves$$S_{k - 1}(a) = P(t(k - 1) > \sqrt{\frac{a^2(k - 1)}{k - a^2}})$$ and $$S_{k}(a) = P(t(k) > \sqrt{\frac{a^2k}{k + 1 - a^2}})$$ for k = 4 : 25, 100, 500, 1000, where t(k) is a Student t random variable with k degrees of freedom.
k <- c(4 : 25, 100, 500, 1000) #Declare the variables n <- length(k) a <- rep(0,n) #store the intersection points eps <- .Machine$double.eps^0.25 for (i in 1:n) { out <- uniroot(function(a){ pt(sqrt(a^2*(k[i] -1)/(k[i] - a ^ 2)),df = ( k[i] - 1)) - pt(sqrt(a ^ 2 * k[i] /(k[i] + 1 - a ^ 2)),df = k[i])}, lower = eps, upper = sqrt(k[i] - eps) #the endpoints ) a[i] <- out$root } intsecp <- rbind(k,a) #knitr::kable(intsecp) print(round(intsecp,3)) plot(k[1:19] ,a[1:19], xlab = "k value(df)", ylab = "A(k)")
The trend shows that as k (the degrees of freedom) increases, the intersection point A(k) increases as well.
Write a function to compute the cdf of the Cauchy distribution, which has density $$\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)},-\infty < x < \infty,$$ where $\theta > 0$. Compare your results to the results from the R function $pcauchy$.
step Integrate the Cauchy density numerically from $-\infty$ to $x$ with the integrate function and compare the result with pcauchy.
set.seed(1)
# cdf of the Cauchy distribution obtained by numerical integration
cdf <- function(x, eta, theta) {
  n <- length(x)
  cdf <- numeric(n)
  # density function of the Cauchy distribution
  pdf <- function(y) 1 / (theta * pi * (1 + ((y - eta) / theta)^2))
  for (i in 1:n) {
    cdf[i] <- integrate(pdf, -Inf, x[i])$value
  }
  return(cdf)
}
# consider a particular case
eta <- 2
theta <- 3
x <- seq(-15 + eta, 15 + eta, .01)
plot(x, cdf(x, eta, theta), type = "l", lwd = 3,
     xlab = "x (eta = 2, theta = 3)", ylab = "cdf(x)")
lines(x, pcauchy(x, eta, theta), col = 2)
legend("topleft", c("cdf", "pcauchy"), lwd = c(3, 1), col = c(1, 2))
# compute the mean difference between the two evaluations
mean(cdf(x, eta, theta) - pcauchy(x, eta, theta))
dat <- rbind(Genotype = c('AA', 'BB', 'OO', 'AO', 'BO', 'AB', 'Sum'),
             Frequency = c('p2', 'q2', 'r2', '2pr', '2qr', '2pq', 1),
             Count = c('nAA', 'nBB', 'nOO', 'nAO', 'nBO', 'nAB', 'n'))
knitr::kable(dat, format = "html", table.attr = "class=\"table table-bordered\"", align = "c")
Use both for loops and $lapply()$ to fit linear models to the $mtcars$ using the formulas stored in this list:
formulas <- list( mpg ~ disp, mpg ~ I(1 / disp), mpg ~ disp + wt, mpg ~ I(1 / disp) + wt )
Here, formulas is a list and it is what varies; the data set mtcars does not vary. So the argument passed to the applied function is the list formulas.
set.seed(1)
## loop
out1 <- vector("list", length(formulas))
for (i in seq_along(formulas)) {
  out1[[i]] <- lm(formulas[[i]], data = mtcars)
}
## lapply
(out1 <- lapply(formulas, function(x) lm(x, mtcars)))
Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function?
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})
Conversely, the linear model does not vary but the input does, i.e., bootstraps is a list of resampled data sets. So the argument passed to the applied function is the list bootstraps.
## loop
out2 <- vector("list", length(bootstraps))
for (i in seq_along(bootstraps)) {
  out2[[i]] <- lm(mpg ~ disp, bootstraps[[i]])
}
## lapply
(out2 <- lapply(bootstraps, function(x) lm(mpg ~ disp, x)))
For each model in the previous two exercises, extract $R^2$ using the function below.
rsq <- function(mod) summary(mod)$r.squared
set.seed(2)
# out contains the linear models fitted in the previous two exercises
out <- c(out1, out2)
(R_sqr <- sapply(out, rsq))
plot(1:14, R_sqr, ylim = c(0, 1))
The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.
trials <- replicate( 100, t.test(rpois(10, 10), rpois(7, 10)), simplify = FALSE )
Extra challenge: get rid of the anonymous function by using [[ directly.
## sapply with an anonymous function
sapply(trials, function(x) x$p.value)
## get rid of the anonymous function with [[
sapply(trials, `[[`, 'p.value')
Conclusion: in most trials we cannot reject H0, because most of the p-values are greater than 0.05.
Indeed, both samples come from the same distribution, $Poisson(10)$.
Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?
library(parallel)
mcvMap <- function(f, FUN.VALUE, ...) {
  out <- mcMap(f, ...)              # iterate over all inputs in parallel
  vapply(out, identity, FUN.VALUE)  # simplify the result to a vector/matrix
}
We can use the parallel or foreach packages to perform the iteration in parallel.
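A minimal usage sketch of the mcvMap variant defined above (the inputs and weights here are made-up illustration values; mcMap relies on forking and is intended for Unix-alike platforms):
# element-wise weighted product of two vectors, simplified to a numeric vector
mcvMap(function(x, w) x * w, numeric(1), 1:4, c(0.1, 0.2, 0.3, 0.4))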
Make a faster version of chisq.test() that only computes the chi-square test statistic when the input is two numeric vectors with no missing values. You can try simplifying chisq.test() or by coding from the mathematical definition (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test).
Analysis
chisq.test() performs chi-squared contingency table tests and goodness-of-fit tests.
Calculate the chi-squared test statistic, $\chi ^{2}$, which resembles a normalized sum of squared deviations between observed and theoretical frequencies.
The value of the test-statistic is
$$\chi^{2}=\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{i,j}-E_{i,j})^{2}}{E_{i,j}}$$
$$E_{i,j} = N \cdot p_{i,j} = N \cdot p_{i\cdot} \cdot p_{\cdot j} = \frac{E_{i\cdot} \cdot E_{\cdot j}}{N}$$
here, c = 2 because the input is two numeric vectors with no missing values.
Determine the degrees of freedom, df, of that statistic: $$df = (r - 1)\times(c - 1) = r - 1, \qquad c=2.$$
Select a desired level of confidence for the result of the test.
Compare $\chi ^{2}$ to the critical value from the chi-squared distribution with df degrees of freedom and the selected confidence level , which in many cases gives a good approximation of the distribution of $\chi ^{2}$.
Use $microbenchmark::microbenchmark$ to compare with the version of $chisq.test()$
Code
# Testing for statistical independence
chisq_test <- function(x) {
  m <- nrow(x); n <- ncol(x); N <- sum(x)
  E <- matrix(0, m, n)
  rowsums <- unlist(lapply(1:m, function(i) sum(x[i, ])))  # row margins
  colsums <- unlist(lapply(1:n, function(j) sum(x[, j])))  # column margins
  for (i in 1:m) {
    for (j in 1:n) {
      E[i, j] <- rowsums[i] * colsums[j] / N
    }
  }
  df <- (m - 1) * (n - 1)
  chi_sqr <- sum((x - E)^2 / E)
  p_value <- pchisq(chi_sqr, df = df, lower.tail = FALSE)  # upper-tail probability
  (test <- list(chi_sqr = chi_sqr, df = df, p_value = p_value))
}
# chisq.test()
## From Agresti (2007) p. 39
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
                    party = c("Democrat", "Independent", "Republican"))
(M)
(Xsq <- chisq.test(M))  # prints the test summary
chisq_test(M)
print(microbenchmark::microbenchmark(
  chisq_test(M),
  chisq.test(M)
))
Conclusion
We can say that calling chisq_test() is quite a bit faster than calling chisq.test() on a small contingency table built from two numeric vectors with no missing values.
Can you make a faster version of table() for the case of an input of two integer vectors with no missing values? Can you use it to speed up your chi-square test?
Analysis
Code
table2 <- function(x, y) { x_val <- sort(unique(x)); y_val <- sort(unique(y)) # remove the duplicate elements m <- length(x_val); n <- length(y_val) # dimensions of the table mat <- matrix(0L, m, n) for (i in seq_along(x)) { mat[which(x_val == x[[i]]), which(y_val == y[[i]])] <- mat[which(x_val == x[[i]]), which(y_val == y[[i]])] + 1L } dimnames <- list(x_val, y_val) names(dimnames) <- as.character(as.list(match.call())[-1]) # R has names for dimnames... :/ table <- array(mat, dim = dim(mat), dimnames = dimnames) class(table) <- "table" table } # Example:generate random samples. data <- data.frame(group=as.factor(sample(1:4,100,TRUE)), race=as.factor(sample(1:3,100,TRUE))) # look at the table tab1 <- with(data,table(group,race)) tab2 <- with(data,table2(group,race)) # Check whether the table2 is same with table function identical(tab1,tab2) # Use it to speed up your chi-square test for testing chisq_test(tab1) microbenchmark::microbenchmark(chisq_test(tab1), chisq_test(tab2))
Conclusion
We can use it to speed up the chi-square test, but the gain is not very large.