Implement at least three examples of different types from the book "R for Beginners".
In this example, we choose the data set "InsectSprays" and run a simple analysis of variance (aov) to see whether the six insecticides have different effects.
First, some summary features of the data are computed as follows:
```r
count.1 <- matrix(InsectSprays$count, nrow = 12)
mean.var <- rbind(colMeans(count.1), apply(count.1, 2, var))
rownames(mean.var) <- c("mean", "var")
colnames(mean.var) <- c("A", "B", "C", "D", "E", "F")
mean.var
```
To stabilize the variance and bring the data closer to the normality assumption, we apply a square-root transformation. The results of the aov are as follows:
```r
aov.spray <- aov(sqrt(InsectSprays$count) ~ InsectSprays$spray)
summary(aov.spray)
aov.spray$coefficients
```
Since the p-value is far below 0.05, the effect of the spray factor is highly significant.
This example in the book "R for Beginners" mainly uses plotting functions to display features of the iris data. In addition, some numerical results for the data are shown.
The three species can be clearly distinguished in the following figure:
```r
library(lattice)
data("iris")
xyplot(Petal.Length ~ Petal.Width, data = iris, groups = Species,
       type = c("p", "smooth"), span = .75,
       auto.key = list(x = 0.15, y = 0.85))
```
Next, we perform a multiple regression analysis. It is necessary to define dummy variables to replace the categorical variable Species.
```r
iris$isSetosa <- ifelse(iris$Species == "setosa", 1, 0)
iris$isVersicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris_demo <- iris[, -5]
m_iris <- lm(Sepal.Width ~ ., data = iris_demo)
summary(m_iris)
```
The results above show that only the intercept is not significant. The coefficients of isSetosa and isVersicolor are highly significant, which implies that there are clear differences between the species.
This example shows how to enrich a scatter plot. We construct several scatter plots of ten pairs of random values, each richer than the previous one. Before showing the figures, the ten pairs of random values are generated as follows:
```r
x <- rnorm(10)
y <- rnorm(10)
rbind(x, y)
```
Figure 1:
```r
plot(x, y)
```
Figure 2:
```r
plot(x, y, xlab = "Ten random values", ylab = "Ten other values",
     xlim = c(-2, 2), ylim = c(-2, 2), pch = 22, col = "red", bg = "yellow",
     bty = "l", tcl = 0.4, main = "How to customize a plot with R",
     las = 1, cex = 1.5)
```
Figure 3:
opar <- par() par(bg="lightyellow", col.axis="blue", mar=c(4, 4, 2.5, 0.25)) plot(x, y, xlab="Ten random values", ylab="Ten other values", xlim=c(-2, 2), ylim=c(-2, 2), pch=22, col="red", bg="yellow", bty="l", tcl=-.25, las=1, cex=1.5) title("How to customize a plot with R (bis)", font.main=3, adj=1)
Figure 4:
opar <- par() par(bg="lightgray", mar=c(2.5, 1.5, 2.5, 0.25)) plot(x, y, type="n", xlab="", ylab="", xlim=c(-2, 2), ylim=c(-2, 2), xaxt="n", yaxt="n") rect(-3, -3, 3, 3, col="cornsilk") points(x, y, pch=10, col="red", cex=2) axis(side=1, c(-2, 0, 2), tcl=-0.2, labels=FALSE) axis(side=2, -1:1, tcl=-0.2, labels=FALSE) title("How to customize a plot with R (ter)", font.main=4, adj=1, cex.main=1) mtext("Ten random values", side=1, line=1, at=1, cex=0.9, font=3) mtext("Ten other values", line=0.5, at=-1.8, cex=0.9, font=3) mtext(c(-2, 0, 2), side=1, las=1, at=c(-2, 0, 2), line=0.3, col="blue", cex=0.9) mtext(-1:1, side=2, las=1, at=-1:1, line=0.2, col="blue", cex=0.9)
A discrete random variable $X$ has probability mass function
| $x$    | 0   | 1   | 2   | 3   | 4   |
|--------|-----|-----|-----|-----|-----|
| $p(x)$ | 0.1 | 0.2 | 0.2 | 0.2 | 0.3 |
Use the inverse transform method to generate a random sample of size 1000 from the distribution of $X$. Construct a relative frequency table and compare the empirical with the theoretical probabilities. Repeat using the R sample function.
```r
x <- c(0, 1, 2, 3, 4)
p <- c(0.1, rep(0.2, 3), 0.3)
cp <- cumsum(p)
n <- 1000
r <- numeric(1000)
set.seed(211)
r1 <- x[findInterval(runif(n), cp) + 1]
ct1 <- as.vector(table(r1))
fre1 <- rbind(x, ct1 / n)
rownames(fre1) <- c("x", "frequency")
fre1
ct1 / sum(ct1) / p   ## inverse transform method
```
The first output is the relative frequency table. The second is the ratio of the empirical frequencies to the theoretical probabilities.
Next, we obtain the corresponding results using the sample function.
```r
set.seed(212)
r2 <- sample(c(0, 1, 2, 3, 4), n, replace = TRUE, c(0.1, rep(0.2, 3), 0.3))
ct2 <- as.vector(table(r2))
fre2 <- rbind(x, ct2 / n)
rownames(fre2) <- c("x", "frequency")
fre2
ct2 / sum(ct2) / p   # R sample function
```
Write a function to generate a random sample of size n from the $Beta(a, b)$ distribution by the acceptance-rejection method. Generate a random sample of size 1000 from the $Beta(3,2)$ distribution. Graph the histogram of the sample with the theoretical $Beta(3,2)$ density superimposed.
$Beta(3,2)$ has pdf $f(x)=12x^2(1-x),~0<x<1$. We take the proposal density $g(x)=1,~0<x<1$ (that is, a $U(0,1)$ proposal) and the constant $c=2$, so that $f(x)\le c\,g(x)$ on $(0,1)$.
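As a quick check on this choice (a remark added here, not part of the original solution): the maximum of $f$ on $(0,1)$ is attained at $x=2/3$, so
$$
\max_{0<x<1}\frac{f(x)}{g(x)}=12\left(\tfrac{2}{3}\right)^2\left(1-\tfrac{2}{3}\right)=\frac{16}{9}\le 2=c.
$$
A candidate $x\sim U(0,1)$ is therefore accepted with probability $f(x)/(c\,g(x))=6x^2(1-x)$, which is exactly the condition used in the code below, and the theoretical acceptance rate is $1/c=0.5$.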
```r
set.seed(221)
n <- 1000
j <- k <- 0
y <- numeric(n)
while (k < n) {
  u <- runif(1)
  j <- j + 1
  x <- runif(1)                  # random variate from g
  if (6 * x^2 * (1 - x) > u) {   # we accept x
    k <- k + 1
    y[k] <- x
  }
}
1000 / j   # efficiency: n accepted draws out of j experiments
```
```r
hist(y, freq = FALSE, main = "Histogram of Beta(3,2)", xlab = "Beta(3,2)")
curve(12 * x^2 * (1 - x), from = 0, to = 1, add = TRUE, col = "darkblue", lwd = 2)
```
As the figure above shows, the sample generated by the acceptance-rejection method matches the $Beta(3,2)$ distribution well.
Simulate a continuous Exponential-Gamma mixture. Suppose that the rate parameter $\Lambda$ has a $Gamma(r, \beta)$ distribution and $Y$ has an $Exp(\Lambda)$ distribution. That is, $(Y \mid \Lambda = \lambda) \sim f_Y(y\mid\lambda) = \lambda e^{-\lambda y}$. Generate 1000 random observations from this mixture with $r = 4$ and $\beta = 2$.
```r
set.seed(231)
n <- 1000
r <- 4
beta <- 2
lambda <- rgamma(n, r, beta)
y <- rexp(n, lambda)      # 1000 random observations
y.cut <- y[y < 4]
hist(y.cut, freq = FALSE, breaks = c(0.2 * 0:10, 3, 4), main = "Histogram of mixture")
curve(2 * (1 + (1/2) * x)^(-5), 0, 4, add = TRUE)
```
The 1000 random observations are stored in the vector $y$. Since extremely few points exceed 4, we plot the histogram without them. The histogram of the generated sample is shown above.
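For reference, a short derivation (added here) of the marginal density of $Y$ explains the curve overlaid on the histogram. Integrating out $\lambda$ with the rate parametrization of the Gamma density gives
$$
f_Y(y)=\int_0^\infty \lambda e^{-\lambda y}\,\frac{\beta^r}{\Gamma(r)}\lambda^{r-1}e^{-\beta\lambda}\,d\lambda
=\frac{r\beta^r}{(\beta+y)^{r+1}}
=2\left(1+\tfrac{y}{2}\right)^{-5},\qquad y>0,
$$
for $r=4$ and $\beta=2$, which is exactly the function passed to curve above.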
Write a function to compute a Monte Carlo estimate of the $Beta(3, 3)$ cdf, and use the function to estimate $F(x)$ for $x = 0.1, 0.2,\dots, 0.9$. Compare the estimates with the values returned by the pbeta function in R.
$Beta(3,3)$ has pdf $f(x)=30x^2(1-x)^2,~~0<x<1$. Then the cdf of $Beta(3,3)$ can be represented as $F(x)=\int_0^x 30y^2(1-y)^2dy$. By the principle of Monte Carlo estimate, $F(x)=\mathbb{E}[30xY^2(1-Y)^2]$ where $Y\sim U(0,x)$.
The function that computes the $Beta(3,3)$ cdf is shown below:
```r
Fx <- function(n, x) {
  y <- runif(n, 0, x)
  mean(30 * x * y^2 * (1 - y)^2)
}
pbeta33 <- function(x) {   ## Beta(3,3) cdf
  n <- 1000                ## size 1000
  m <- 10                  ## 10 groups
  Fx.vec <- sapply(1:m, function(o) Fx(n, x))
  mean(Fx.vec)
}
```
Next, we show the estimated values of $F(x)$ for $x=0.1,0.2,\dots,0.9$.
```r
beta33.hat <- numeric(9)
set.seed(311)
for (i in 1:9) beta33.hat[i] <- pbeta33(i / 10)
cbind(1:9 / 10, round(beta33.hat, 5))
```
Then we compute the theoretical values with the pbeta function.
```r
cbind(1:9 / 10, pbeta(1:9 / 10, 3, 3))
```
Finally, we combine the estimated and theoretical values for comparison.
```r
comp <- data.frame(cbind(1:9 / 10, round(beta33.hat, 5), pbeta(1:9 / 10, 3, 3)))
colnames(comp) <- c("x", "beta33.hat", "beta33")
comp
```
From the table above, the estimated values are very close to the theoretical values for every $x$. Hence the Monte Carlo estimate of the $Beta(3,3)$ cdf is quite effective.
The Rayleigh density [156, (18.76)] is $f(x)=\frac{x}{\sigma^2}e^{-x^2/(2\sigma^2)},~~x\ge0, \sigma>0$. Implement a function to generate samples from a $Rayleigh(\sigma)$ distribution, using antithetic variables. What is the percent reduction in variance of $\frac{X+X'}{2}$ compared with $\frac{X_1+X_2}{2}$ for independent $X_1, X_2$?
To obtain the Rayleigh cdf, we compute $F(x)=\int_0^x \frac{y}{\sigma^2}e^{-y^2/(2\sigma^2)}dy$ by Monte Carlo, using antithetic variables. First, the integral equals $\mathbb{E}[x\frac{Y}{\sigma^2}e^{-\frac{Y^2}{2\sigma^2}}]$ where $Y\sim U(0,x)$. It can also be represented as $\mathbb{E}[x^2\frac{U}{\sigma^2}e^{-\frac{x^2U^2}{2\sigma^2}}]$ where $U\sim U(0,1)$. Define $g(U)=x^2\frac{U}{\sigma^2}e^{-\frac{x^2U^2}{2\sigma^2}}$. Then $g(U)$ and $g(1-U)$ are a pair of antithetic variables.
The function to compute $F(x)$ is as follows (we take $\sigma=1$ by default):
```r
MC.Ray <- function(x, sigma = 1, R = 10000, antithetic = TRUE) {
  cdf <- numeric(length(x))
  u <- runif(R/2)
  if (!antithetic) v <- runif(R/2) else v <- 1 - u
  u <- c(u, v)
  for (i in 1:length(x)) {
    g <- x[i]^2 * u * exp(-x[i]^2 * u^2 / (2 * sigma^2)) / sigma^2
    cdf[i] <- mean(g)
  }
  cdf
}
```
For $x=1$, i.e. when estimating $F(1)$, we compare the variance of the simple Monte Carlo estimator with that of the antithetic-variable estimator.
```r
m <- 1000
MC1 <- MC2 <- numeric(m)
x <- 1
set.seed(321)
for (i in 1:m) {
  MC1[i] <- MC.Ray(x, R = 1000, anti = FALSE)
  MC2[i] <- MC.Ray(x, R = 1000)
}
print(sd(MC1))
print(sd(MC2))
print((var(MC1) - var(MC2)) / var(MC1))
```
The antithetic variable approach achieved approximately 90% reduction in variance.
Next, we generate samples from $Rayleigh(\sigma=1)$ by the inverse transform method. The cdf of $Rayleigh(\sigma)$ has the analytical form $F(x)=1-e^{-x^2/(2\sigma^2)}$, so the inverse transform is $F^{-1}(u)=\left(2\sigma^2\log\frac{1}{1-u}\right)^{1/2}$. We can then construct the antithetic variables $X=F^{-1}(U)$ and $X'=F^{-1}(1-U)$ with $U\sim U(0,1)$.
The function to generate samples is as follows:
```r
IT.Ray <- function(n, sigma = 1, antithetic = TRUE) {
  u <- runif(n/2)
  if (!antithetic) v <- runif(n/2) else v <- 1 - u
  u <- c(u, v)
  x <- sqrt(-2 * sigma^2 * log(1 - u))
  x
}
set.seed(321)
hist(IT.Ray(10000), ylim = c(0, 0.6), breaks = seq(0, 5, 0.5), freq = FALSE,
     main = "Histogram of Rayleigh(1)", xlab = "x")
curve(x * exp(-x^2/2), 0, 4, add = TRUE)
```
We plot the histogram of the 10000 generated random numbers and add the density curve of $Rayleigh(1)$ to the same figure. As can be seen, the histogram is very close to the theoretical density.
Finally, suppose that $X, X_1, X_2 = F^{-1}(U)$ and $X'= F^{-1}(1-U)$, where $X_1, X_2$ are independent. Clearly, $X$ and $X'$ are a pair of antithetic variables. We then compute the percent reduction in variance of $\frac{X+X'}{2}$ compared with $\frac{X_1+X_2}{2}$.
```r
m <- 1000
anti.F <- anti.T <- numeric(m)
set.seed(321)
for (i in 1:m) {
  anti.F[i] <- mean(IT.Ray(10000, antithetic = FALSE))
  anti.T[i] <- mean(IT.Ray(10000))
}
print(sd(anti.F))
print(sd(anti.T))
print((var(anti.F) - var(anti.T)) / var(anti.F))
```
The antithetic variable approach achieved approximately 95% reduction in variance.
Find two importance functions $f_1$ and $f_2$ that are supported on $(1,\infty)$ and are "close" to $$ g(x)=\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2},~~x>1. $$ Which of your two importance functions should produce the smaller variance in estimating $$ \int_1^{\infty}\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx $$ by importance sampling? Explain.
We propose the following two importance functions: $$ f_1(x)=e^{1-x},~~x>1;~~f_2(x)=\frac12xe^{\frac{1-x^2}{4}},~~x>1. $$ The figure below shows $g(x)/f_1(x)$ and $g(x)/f_2(x)$.
```r
# plot g(x)/f(x) for the two importance functions
x <- c(seq(1, 5, .1))
w <- 2
g <- x^2 * exp(-x^2/2) / sqrt(2*pi)
f1 <- exp(1 - x)
f2 <- x * exp((1 - x^2)/4) / 2
plot(x, g/f1, type = "l", main = "", ylab = "", ylim = c(0, 0.8), lwd = w, lty = 2)
lines(x, g/f2, lty = 3, lwd = w)
legend("topright", legend = c(1:2), lty = 2:6, lwd = w, inset = 0.02)
```
The cdfs of $f_1$ and $f_2$ are easily calculated: $$ F_1(x)=1-e^{1-x},~~x>1;~~F_2(x)=1-e^{\frac{1-x^2}{4}},~~x>1. $$ We can therefore generate random samples by the inverse transform method, using $$ F_1^{-1}(u)=1-\log(1-u);~~F_2^{-1}(u)=(1-4\log(1-u))^{1/2}. $$ Next, we compute the integral with the two importance functions in turn.
```r
m <- 10000   ## size 10000
theta.hat <- se <- numeric(2)
g <- function(x) {
  x^2 * exp(-x^2/2) / sqrt(2*pi) * (x > 1)
}
u <- runif(m)                    # f1, inverse transform method
x <- 1 - log(1 - u)
fg <- g(x) / exp(1 - x)
theta.hat[1] <- mean(fg)
se[1] <- sd(fg)
u <- runif(m)                    # f2, inverse transform method
x <- (1 - 4 * log(1 - u))^0.5
fg <- 2 * g(x) / (x * exp((1 - x^2)/4))
theta.hat[2] <- mean(fg)
se[2] <- sd(fg)
rbind(theta.hat, se)
```
From the output above, $f_2$ produces the smaller variance. In fact the two variances are quite close; recalling the figure of $g(x)/f_1$ and $g(x)/f_2$, the two curves are similar, but the curve for $g(x)/f_2$ is a little flatter. Therefore the variance under $f_2$ is slightly smaller than the other one.
Obtain a Monte Carlo estimate of $$ \int_{1}^{\infty}\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx $$ by importance sampling.
From the answer in exercise 5.13, we use the importance function $f_2$ to calculate the value of this integral.
The estimated value and its standard deviation are as follows:
```r
m <- 10000   ## size 10000
g <- function(x) {
  x^2 * exp(-x^2/2) / sqrt(2*pi) * (x > 1)
}
u <- runif(m)                    # f2, inverse transform method
x <- (1 - 4 * log(1 - u))^0.5
fg <- 2 * g(x) / (x * exp((1 - x^2)/4))
theta.hat <- mean(fg)
se <- sd(fg)
c(theta.hat, se)
```
Let $X$ be a non-negative random variable with $\mu= E[X]<\infty$. For a random sample $x_1,\dots, x_n$ from the distribution of $X$, the Gini ratio is defined by $$ G=\frac{1}{2n^2\mu}\sum_{j=1}^{n}\sum_{i=1}^n|x_i-x_j|. $$ The Gini ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that $G$ can be written in terms of the order statistics $x_{(i)}$ as $$ G=\frac{1}{n^2\mu}\sum_{i=1}^{n}(2i-n-1)x_{(i)}. $$ If the mean is unknown, let $\hat{G}$ be the statistic $G$ with $\mu$ replaced by $\bar{x}$. Estimate by simulation the mean, median and deciles of $\hat{G}$ if $X$ is standard lognormal. Repeat the procedure for the uniform distribution and $Bernoulli(0.1)$. Also construct density histograms of the replicates in each case.
1. Standard lognormal. If $X$ is standard lognormal, then $\log(X)\sim N(0,1)$, so the pdf of $X$ is $f(x)=\frac{1}{\sqrt{2\pi}x}e^{-\frac{(\log{x})^2}{2}}$. The theoretical expectation of $X$ is $e^{1/2}$.
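As a quick check of this value (a one-line derivation added here): with $Z=\log X\sim N(0,1)$,
$$
E[X]=E[e^{Z}]=\int_{-\infty}^{\infty}e^{z}\,\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz
=e^{1/2}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-(z-1)^2/2}\,dz=e^{1/2}.
$$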
```r
n <- 1000      # generate 1000 random numbers
m <- 100       # repeat 100 times
Glnorm <- numeric(m)
v <- 2 * (1:n) - n - 1
set.seed(411)
for (i in 1:m) {
  x <- sort(rlnorm(n))
  Glnorm[i] <- (1 / (n^2 * exp(0.5))) * as.numeric(v %*% x)
}
lnormmean <- mean(Glnorm)
lnormmed <- quantile(Glnorm, 0.5)
lnormdec <- quantile(Glnorm, probs = seq(0, 1, 0.1))
print(c(lnormmean, lnormmed, lnormdec))
```
The mean, median and deciles are shown above. The mean is 0.523, which is very close to the median.
```r
hist(Glnorm, freq = FALSE, xlim = c(0.45, 0.65), main = "Gini ratio of standard lognormal")
```
2. Uniform distribution. If $X\sim \mathbb{U}(0,1)$, then $E(X)=0.5$.
```r
Gunif <- numeric(m)
set.seed(411)
for (i in 1:m) {
  x <- sort(runif(n))
  Gunif[i] <- (1 / (n^2 * 0.5)) * as.numeric(v %*% x)
}
unifmean <- mean(Gunif)
unifmed <- quantile(Gunif, 0.5)
unifdec <- quantile(Gunif, probs = seq(0, 1, 0.1))
print(c(unifmean, unifmed, unifdec))
```
The mean, median and deciles are shown above. The mean is 0.333, which is very close to the median.
```r
hist(Gunif, freq = FALSE, xlim = c(0.320, 0.345), main = "Gini ratio of uniform distribution")
```
```r
Gbinom <- numeric(m)
set.seed(411)
for (i in 1:m) {
  x <- sort(rbinom(n, 100, 0.1))
  Gbinom[i] <- (1 / (n^2 * 0.5)) * as.numeric(v %*% x)
}
binommean <- mean(Gbinom)
binommed <- quantile(Gbinom, 0.5)
binomdec <- quantile(Gbinom, probs = seq(0, 1, 0.1))
print(c(binommean, binommed, binomdec))
```
The mean, median and deciles are shown above. The mean is 3.370, which is very close to the median.
```r
hist(Gbinom, freq = FALSE, xlim = c(3.1, 3.6), main = "Gini ratio of Bernoulli(100,0.1)")
```
Construct an approximate 95% confidence interval for the Gini ratio $\gamma = E[G]$ if $X$ is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.
If $X\sim lognormal(\mu,\sigma^2)$, then $\log(X)\sim N(\mu,\sigma^2)$. Assume that we have a random sample $X_1,X_2,\dots,X_n$ from the population of $X$, and let $Y_i=\log(X_i)$ for $1\le i\le n$. Then an approximate 95% confidence interval for $\sigma$ is $$ \left(\sqrt{\frac{(n-1)S^2}{\chi_{n-1}^2(0.025)}},\sqrt{\frac{(n-1)S^2}{\chi_{n-1}^2(0.975)}}\right), $$ where $S^2=\frac{1}{n-1}\sum_{i=1}^{n}(Y_i-\bar{Y})^2$.
Note that $E(G)=2\Phi(\frac{\sigma}{\sqrt{2}})-1$, where $\Phi$ is the standard normal distribution function (this result can be found in the Wikipedia entry on the Gini coefficient). Therefore a natural approximate 95% confidence interval for $\gamma=E(G)$ is $$ \left(2\Phi\!\left(\frac{1}{\sqrt2}\sqrt{\frac{(n-1)S^2}{\chi_{n-1}^2(0.025)}}\right)-1,\ 2\Phi\!\left(\frac{1}{\sqrt2}\sqrt{\frac{(n-1)S^2}{\chi_{n-1}^2(0.975)}}\right)-1\right). $$
We assume that $X$ follows the standard lognormal distribution.
```r
n <- 1000     # generate 1000 random numbers
m <- 100      # generate 100 Gini ratios each time
l <- 100      # repeat 100 times
mu <- 0
sigma <- 1
ci.sigma <- function(x, n) {   # function to construct the confidence interval for sigma
  s2 <- (1 / (n - 1)) * sum((x - mean(x))^2)
  c((n - 1) * s2 / qchisq(0.975, n - 1), (n - 1) * s2 / qchisq(0.025, n - 1))
}
ciG <- cbind(numeric(m), numeric(m))
judge <- numeric(m)            # judge whether the true value is in the CI or not
power.G <- numeric(l)
G.ture <- 2 * pnorm(sigma / sqrt(2)) - 1   # the true value of G
set.seed(412)
for (j in 1:l) {
  for (i in 1:m) {
    x <- rlnorm(n)
    cisigma <- ci.sigma(log(x), n)
    ciG[i, ] <- 2 * pnorm(cisigma / sqrt(2)) - 1
    judge[i] <- I(G.ture > ciG[i, 1] & G.ture < ciG[i, 2])
  }
  power.G[j] <- mean(judge)
}
mean(power.G)
```
The coverage rate of the estimation procedure is 0.9524, slightly above the nominal 0.95.
Tests for association based on Pearson product moment correlation $\rho$, Spearman's rank correlation coefficient $\rho_s$, or Kendall's coefficient $\tau$, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution $(X, Y)$ such that $X$ and $Y$ are dependent) such that at least one of the nonparametric tests has better empirical power than the correlation test against this alternative.
We assume $(X,Y)\sim N(\mu,\Sigma)$, where $\mu=(0,0)^T$ and $$ \Sigma=\begin{pmatrix} 1 & \rho\\ \rho & 1 \end{pmatrix}. $$
Spearman's rank correlation:
```r
library(MASS)
n <- 20
m <- 1000
rho.s <- c(seq(-0.8, -0.1, 0.1), seq(0.1, 0.8, 0.1))   # alternatives
M <- length(rho.s)
power.s <- numeric(M)
set.seed(431)
for (i in 1:M) {
  rho <- rho.s[i]
  pvalues.s <- replicate(m, expr = {
    # simulate under the alternative
    x <- mvrnorm(n, c(0, 0), matrix(c(1, rho, rho, 1), 2, 2))
    cortest.s <- cor.test(x[, 1], x[, 2], alternative = "two.sided", method = "spearman")
    cortest.s$p.value
  })
  power.s[i] <- mean(pvalues.s <= .05)
}
power.s
```
Kendall's coefficient:
```r
n <- 20
m <- 1000
rho.k <- c(seq(-0.8, -0.1, 0.1), seq(0.1, 0.8, 0.1))   # alternatives
M <- length(rho.k)
power.k <- numeric(M)
set.seed(431)
for (i in 1:M) {
  rho <- rho.k[i]
  pvalues.k <- replicate(m, expr = {
    # simulate under the alternative
    x <- mvrnorm(n, c(0, 0), matrix(c(1, rho, rho, 1), 2, 2))
    cortest.k <- cor.test(x[, 1], x[, 2], alternative = "two.sided", method = "kendall")
    cortest.k$p.value
  })
  power.k[i] <- mean(pvalues.k <= .05)
}
power.k
```
Pearson's correlation:
```r
n <- 20
m <- 1000
rho.p <- c(seq(-0.8, -0.1, 0.1), seq(0.1, 0.8, 0.1))   # alternatives
M <- length(rho.p)
power.p <- numeric(M)
set.seed(431)
for (i in 1:M) {
  rho <- rho.p[i]
  pvalues.p <- replicate(m, expr = {
    # simulate under the alternative
    x <- mvrnorm(n, c(0, 0), matrix(c(1, rho, rho, 1), 2, 2))
    cortest.p <- cor.test(x[, 1], x[, 2], alternative = "two.sided", method = "pearson")
    cortest.p$p.value
  })
  power.p[i] <- mean(pvalues.p <= .05)
}
power.p
```
```r
power.rho <- rbind(power.s, power.k, power.p)
colnames(power.rho) <- c(seq(-0.8, -0.1, 0.1), seq(0.1, 0.8, 0.1))
power.rho
```
As the table shows, the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal.
Pearson's correlation measures the linear association between two variables and relies on a normality assumption. When the bivariate distribution of $(X,Y)$ is not bivariate normal, the Pearson correlation test can lose power.
Now, we assume that $X\sim t_{2}$, $Y\sim U(-1,0)$ when $X<0$ and $Y\sim U(0,1)$ when $X\ge0$. Clearly, $X,Y$ are dependent.
```r
n <- 20
m <- 1000
set.seed(431)
pvalues.s1 <- replicate(m, expr = {
  # simulate under the alternative
  x <- rt(n, 2)
  y <- numeric(n)
  for (i in 1:n) {
    if (x[i] < 0) y[i] <- runif(1, -1, 0) else y[i] <- runif(1, 0, 1)
  }
  cortest.s1 <- cor.test(x, y, alternative = "two.sided", method = "spearman")
  cortest.s1$p.value
})
power.s1 <- mean(pvalues.s1 <= .05)
power.s1
```
```r
n <- 20
m <- 1000
set.seed(431)
pvalues.p1 <- replicate(m, expr = {
  # simulate under the alternative
  x <- rt(n, 2)
  y <- numeric(n)
  for (i in 1:n) {
    if (x[i] < 0) y[i] <- runif(1, -1, 0) else y[i] <- runif(1, 0, 1)
  }
  cortest.p1 <- cor.test(x, y, alternative = "two.sided", method = "pearson")
  cortest.p1$p.value
})
power.p1 <- mean(pvalues.p1 <= .05)
power.p1
```
In this case, the empirical power of the nonparametric test based on Spearman's rank correlation coefficient $\rho_s$ is 0.987, which is larger than the empirical power 0.849 of the Pearson correlation test.
Compute a jackknife estimate of the bias and the standard error of the correlation statistic in Example 7.2.
```r
# jackknife
library(bootstrap)   # for the law data
print(cor(law$LSAT, law$GPA))
print(cor(law82$LSAT, law82$GPA))
```
The sample correlation is R = 0.7763745. The correlation for the universe of 82 law schools is R = 0.7599979. We use the jackknife to estimate the bias and standard error of the correlation statistic computed from the sample of scores in law.
```r
# jackknife estimate of the bias
n <- nrow(law)
R <- numeric(nrow(law))
R.hat <- cor(law$LSAT, law$GPA)
for (i in 1:n) R[i] <- cor(law$LSAT[-i], law$GPA[-i])
bias.j <- (n - 1) * (mean(R) - R.hat)
print(bias.j)
```
The estimate of bias is -0.006473623.
```r
# jackknife estimate of the standard error
se.j <- sqrt((n - 1) * mean((R - mean(R))^2))
se.j
```
The estimate of standard error is 0.1425186.
Refer to Exercise 7.4. Compute 95% bootstrap confidence intervals for the mean time between failures $1/\lambda$ by the standard normal, basic, percentile, and BCa methods. Compare the intervals and explain why they may differ.
```r
library(boot)
set.seed(521)
boot.mean <- function(x, i) mean(x[i])   # use the MLE estimate (sample mean)
boot.obj <- boot(aircondit[, 1], statistic = boot.mean, R = 2000)
ci <- boot.ci(boot.obj, type = c("norm", "basic", "perc", "bca"))
ci
mean(aircondit[, 1])
```
The four confidence intervals for the mean time between failures are shown above. Although all of them cover 108.0833, the sample mean of the observed data, they differ noticeably: the BCa interval has the largest endpoints, while the basic interval has the smallest. Working from the formulas for the BCa and percentile intervals and the relevant quantiles, it is straightforward to see that both endpoints of the BCa interval exceed those of the percentile interval. The normal interval also differs somewhat from the percentile interval. To see why, we plot the histogram of the bootstrap replicates of the mean (figure1 below); their distribution departs slightly from normality, which explains the discrepancy. When the bootstrap distribution is close to normal, the normal and percentile intervals nearly coincide.
```r
set.seed(521)
mu <- mean(boot.obj$t)
sigma <- sd(boot.obj$t)
figure1 <- boot.obj$t
f <- function(x) (1 / (sqrt(2*pi) * sigma)) * exp(-(x - mu)^2 / (2 * sigma^2))
# histogram of the bootstrap replicates of the mean
hist(figure1, freq = FALSE, main = "Histogram of mean time", ylim = c(0, 0.012))
curve(f, add = TRUE)
```
Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.
```r
library(bootstrap)
Rmtr.hat <- cov(scor)
egvals.hat <- eigen(Rmtr.hat)$values   # eigenvalues of the observed data
theta.hat <- egvals.hat[1] / sum(egvals.hat)
J <- nrow(scor)
theta.star <- numeric(J)
for (i in 1:J) {
  Rmtr <- cov(scor[-i, ])
  egvals <- eigen(Rmtr)$values
  theta.star[i] <- egvals[1] / sum(egvals)
}
bias.j <- (J - 1) * (mean(theta.star) - theta.hat)
se.j <- sqrt((J - 1) * mean((theta.star - mean(theta.star))^2))
print(c(bias.j, se.j))
```
The bias is 0.001069139, and the standard error is 0.049552307.
In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.
```r
library(DAAG)
magnetic <- ironslag$magnetic
chemical <- ironslag$chemical
l <- nrow(ironslag)
n <- floor(l / 2)
e1 <- e2 <- e3 <- e4 <- numeric(2 * n)
set.seed(541)
a <- sample(1:l, replace = FALSE)   # shuffle the data
magnetic <- magnetic[a]
chemical <- chemical[a]
# fit the four models on leave-two-out samples
for (i in 1:n) {
  y <- magnetic[-c(2*i - 1, 2*i)]
  x <- chemical[-c(2*i - 1, 2*i)]

  J1 <- lm(y ~ x)
  yhat1.1 <- J1$coef[1] + J1$coef[2] * chemical[2*i - 1]
  yhat1.2 <- J1$coef[1] + J1$coef[2] * chemical[2*i]
  e1[2*i - 1] <- magnetic[2*i - 1] - yhat1.1
  e1[2*i] <- magnetic[2*i] - yhat1.2

  J2 <- lm(y ~ x + I(x^2))
  yhat2.1 <- J2$coef[1] + J2$coef[2] * chemical[2*i - 1] + J2$coef[3] * chemical[2*i - 1]^2
  yhat2.2 <- J2$coef[1] + J2$coef[2] * chemical[2*i] + J2$coef[3] * chemical[2*i]^2
  e2[2*i - 1] <- magnetic[2*i - 1] - yhat2.1
  e2[2*i] <- magnetic[2*i] - yhat2.2

  J3 <- lm(log(y) ~ x)
  logyhat3.1 <- J3$coef[1] + J3$coef[2] * chemical[2*i - 1]
  logyhat3.2 <- J3$coef[1] + J3$coef[2] * chemical[2*i]
  yhat3.1 <- exp(logyhat3.1)
  yhat3.2 <- exp(logyhat3.2)
  e3[2*i - 1] <- magnetic[2*i - 1] - yhat3.1
  e3[2*i] <- magnetic[2*i] - yhat3.2

  J4 <- lm(log(y) ~ log(x))
  logyhat4.1 <- J4$coef[1] + J4$coef[2] * log(chemical[2*i - 1])
  logyhat4.2 <- J4$coef[1] + J4$coef[2] * log(chemical[2*i])
  yhat4.1 <- exp(logyhat4.1)
  yhat4.2 <- exp(logyhat4.2)
  e4[2*i - 1] <- magnetic[2*i - 1] - yhat4.1
  e4[2*i] <- magnetic[2*i] - yhat4.2
}
c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2))
```
According to the prediction error criterion, Model 2, the quadratic model, would be the best fit for the data.
```r
J2
```
The fitted regression equation for Model 2 is $\hat{Y} = 23.93750 - 1.34026X + 0.05345X^2.$
Some of the residual plots for Model 2 are shown below.
```r
plot(J2)
```
Implement the two-sample Cramer-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.
We implement the two-sample Cramer-von Mises test for equal distributions; the statistic and its implementation are given below.
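For reference (added here for clarity), the two-sample Cramer-von Mises statistic computed by the function below is
$$
W_2=\frac{mn}{(m+n)^2}\left[\sum_{i=1}^{n}\big(F_n(x_i)-G_m(x_i)\big)^2+\sum_{j=1}^{m}\big(F_n(y_j)-G_m(y_j)\big)^2\right],
$$
where $F_n$ and $G_m$ are the empirical cdfs of the two samples of sizes $n$ and $m$.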
```r
cvm.d <- function(x, y) {   # compute the Cramer-von Mises statistic
  n <- length(x)
  m <- length(y)
  Fn <- ecdf(x)
  Gm <- ecdf(y)
  W2 <- ((m * n) / (m + n)^2) *
    (sum((Fn(x) - Gm(x))^2) + sum((Fn(y) - Gm(y))^2))
  W2
}
```
Next, we apply the test to the data in Examples 8.1 and 8.2.
```r
library(boot)
attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
detach(chickwts)
set.seed(611)
R <- 999                 # number of replicates
z <- c(x, y)             # pooled sample
n <- length(x)
m <- length(y)
N <- 1:(m + n)
reps <- numeric(R)       # storage for replicates
W2 <- cvm.d(x, y)
W2
for (i in 1:R) {
  # generate indices k for the first sample
  k <- sample(N, size = n, replace = FALSE)
  x1 <- z[k]
  y1 <- z[-k]            # complement of x1
  reps[i] <- cvm.d(x1, y1)
}
p <- mean(c(W2, reps) >= W2)
p
```
The observed Cramer-von Mises statistic is $W_2 = 0.1515236$. The approximate achieved significance level of 0.394 implies that there is not enough evidence to reject the null hypothesis of equal distributions.
Design experiments for evaluating the performance of the NN, energy, and ball methods in various situations.
(1) Unequal variances and equal expectations
(2) Unequal variances and unequal expectations
(3) Non-normal distributions: t distribution with 1 df (heavy-tailed distribution), bimodal distribution (mixture of two normal distributions)
(4) Unbalanced samples (say, 1 case versus 10 controls)
Note: The parameters should be chosen such that the powers are distinguishable (say, range from 0.3 to 0.9).
```r
library(energy)
library(RANN)
library(Ball)
library(boot)             # for boot() used in the permutation test below
m <- 100
k <- 3
p <- 2                    # dimension of the multivariate samples
n1 <- n2 <- 50            # sample sizes
R <- 999                  ## number of replicates
n <- n1 + n2
N <- c(n1, n2)
Tn <- function(z, ix, sizes, k) {
  n1 <- sizes[1]; n2 <- sizes[2]; n <- n1 + n2
  if (is.vector(z)) z <- data.frame(z, 0)
  z <- z[ix, ]
  NN <- nn2(data = z, k = k + 1)
  block1 <- NN$nn.idx[1:n1, -1]
  block2 <- NN$nn.idx[(n1 + 1):n, -1]
  i1 <- sum(block1 < n1 + .5)
  i2 <- sum(block2 > n1 + .5)
  (i1 + i2) / (k * n)
}
eqdist.nn <- function(z, sizes, k) {   # NN test
  boot.obj <- boot(data = z, statistic = Tn, R = R,
                   sim = "permutation", sizes = sizes, k = k)
  ts <- c(boot.obj$t0, boot.obj$t)
  p.value <- mean(ts >= ts[1])
  list(statistic = ts[1], p.value = p.value)
}
p.values <- matrix(NA, m, 3)
```
We now estimate the powers of the three methods in the following situations.
(1)Unequal variances and equal expectations
Suppose that $X\sim N((0,0)^T, I_2)$ and $Y\sim N((0,0)^T, \mathrm{diag}(1, 0.25))$.
```r
set.seed(621)
for (i in 1:m) {
  x <- matrix(rnorm(n1 * p), ncol = p)
  y <- cbind(rnorm(n2), rnorm(n2, sd = 0.5))
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R, seed = i * 12345)$p.value
}
alpha <- 0.1   # significance level
pow <- colMeans(p.values < alpha)
pow
```
NN, energy and Ball methods have powers 0.62, 0.45 and 0.7 respectively.
(2)Equal variances and unequal expectations
Suppose that $X\sim N((0,0),I)$ and $Y\sim N((0,0.6),I)$.
```r
set.seed(623)
for (i in 1:m) {
  x <- matrix(rnorm(n1 * p), ncol = p)
  y <- cbind(rnorm(n2), rnorm(n2, mean = 0.6))
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R, seed = i * 12345)$p.value
}
alpha <- 0.1   # significance level
pow <- colMeans(p.values < alpha)
pow
```
NN, energy and Ball methods have powers 0.38, 0.85 and 0.76 respectively.
(3) Non-normal distributions: t distribution with 1 df (heavy-tailed distribution), bimodal distribution (mixture of two normal distributions)
Suppose that $X=(X_1,X_2), Y=(Y_1,Y_2)$ where $X_1\sim t_1, X_2\sim 0.5N(0,1)+0.5N(1,1)$. $Y_1\sim t_1, Y_2\sim 0.5N(1,1)+0.5N(2,1)$.
```r
set.seed(623)
for (i in 1:m) {
  x <- cbind(rt(n1, 1),
             rnorm(n2, mean = sample(c(0, 1), size = n2, prob = c(0.5, 0.5), replace = TRUE), sd = 1))
  y <- cbind(rt(n2, 1),
             rnorm(n2, mean = sample(c(1, 2), size = n2, prob = c(0.5, 0.5), replace = TRUE), sd = 1))
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R, seed = i * 12345)$p.value
}
alpha <- 0.1   # significance level
pow <- colMeans(p.values < alpha)
pow
```
NN, energy and Ball methods have powers 0.68, 0.64 and 0.71 respectively.
Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 of the chain, and compare the deciles of the generated observations with the deciles of the standard Cauchy distribution (see qcauchy or qt with df=1). Recall that a $Cauchy(\theta, \eta)$ distribution has density function $$ f(x) =\frac{1} {\theta\pi(1 + [(x- \eta)/\theta]^2)}, - \infty < x < \infty, \theta>0. $$ The standard Cauchy has the $Cauchy(\theta= 1, \eta = 0)$ density. (Note that the standard Cauchy density is equal to the Student t density with one degree of freedom.)
We use a symmetric random walk proposal: given the current state $X$, the candidate is drawn from $\mathbb{U}(X-1,X+1)$, whose density we denote by $g(\cdot\mid X)$. The initial value $X_0$ is generated from $\mathbb{U}(-1,1)$.
```r
f <- function(x) 1 / (pi * (1 + x^2))   # density of the standard Cauchy distribution
set.seed(631)
m <- 20000                              # length of the chain
x <- numeric(m)
x[1] <- runif(1, min = -1, max = 1)
k <- 0
u <- runif(m)
for (i in 2:m) {
  xt <- x[i - 1]
  y <- xt + runif(1, min = -1, max = 1)  # symmetric uniform proposal
  num <- f(y)
  den <- f(xt)
  if (u[i] <= num / den) x[i] <- y
  else {
    x[i] <- xt
    k <- k + 1                           # y is rejected
  }
}
print(k)   # the number of rejected candidate points
```
With the proposal distribution we selected, approximately 15% of the candidate points are rejected. The acceptance rate is therefore high; since the random-walk proposal is fairly narrow, the chain moves in small steps and mixes somewhat slowly, but it is adequate here.
Next, we compare the deciles of the generated observations with the deciles of the standard Cauchy distribution, and also show a QQ plot and a histogram of the generated observations.
```r
gob <- quantile(x[-(1:1000)], seq(0.1, 0.9, 0.1))
stc <- qt(seq(0.1, 0.9, 0.1), 1)
rbind(gob, stc)
b <- 1001                     # discard the burn-in sample
y <- x[b:m]
a <- seq(0.1, 0.9, 0.01)
QR <- qcauchy(a)              # quantiles of the standard Cauchy
Q <- quantile(y, a)
qqplot(QR, Q, main = "", xlab = "Cauchy Quantiles", ylab = "Sample Quantiles")
hist(y, breaks = "scott", main = "", xlab = "", freq = FALSE)
lines(QR, f(QR))
```
The deciles of the generated observations are very close to those of the standard Cauchy distribution. Moreover, as the QQ plot and histogram above show, the generated sample approximates the standard Cauchy distribution well.
Rao $[220, Sec. 5g]$ presented an example on genetic linkage of 197 animals in four categories (also discussed in $[67, 106, 171, 266])$. The group sizes are $(125, 18, 20, 34)$. Assume that the probabilities of the corresponding multinomial distribution are $$ (\frac12+\frac{\theta}{4},\frac{1-\theta}4,\frac{1-\theta}4,\frac{\theta}4). $$ Estimate the posterior distribution of $\theta$ given the observed sample, using one of the methods in this chapter.
We use the random walk Metropolis sampler with a uniform proposal distribution to generate the posterior distribution of $\theta$.
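For clarity (an added note): with a flat prior for $\theta$ on $(0,1)$, the unnormalized posterior targeted by the chain is the multinomial likelihood of the observed group counts,
$$
\pi(\theta\mid \text{data}) \propto \left(\tfrac12+\tfrac{\theta}{4}\right)^{125}\left(\tfrac{1-\theta}{4}\right)^{18}\left(\tfrac{1-\theta}{4}\right)^{20}\left(\tfrac{\theta}{4}\right)^{34},\qquad 0<\theta<1,
$$
which is exactly what the prob function in the code below evaluates (up to a constant).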
```r
w <- 0.25           # width of the uniform support set
m <- 10000          # length of the chain
burn <- 1000        # burn-in time
N <- 197
x <- numeric(m)     # the chain
set.seed(641)
grpsize <- c(125, 20, 18, 34)   # group sizes
prob <- function(y, grpsize) {
  # computes (without the constant) the target density
  if (y < 0 || y >= 1) return(0)
  return((1/2 + y/4)^grpsize[1] * ((1 - y)/4)^grpsize[2] *
           ((1 - y)/4)^grpsize[3] * (y/4)^grpsize[4])
}
u <- runif(m)            # for the accept/reject step
v <- runif(m, -w, w)     # proposal distribution
x[1] <- 0.5
for (i in 2:m) {
  y <- x[i - 1] + v[i]
  if (u[i] <= prob(y, grpsize) / prob(x[i - 1], grpsize)) x[i] <- y
  else x[i] <- x[i - 1]
}
xtheta <- x[-(1:burn)]
theta.hat <- mean(xtheta)
round(theta.hat, 3)
print(round(c(0.5 + theta.hat/4, (1 - theta.hat)/4, (1 - theta.hat)/4, theta.hat/4), 3))
print(round(c(125, 20, 18, 34)/N, 3))
hist(xtheta, freq = FALSE, main = "Histogram of theta", xlab = "theta")
```
The estimated value of $\theta$ is 0.623.
For exercise 9.6, use the Gelman-Rubin method to monitor convergence of the chain, and run the chain until the chain has converged approximately to the target distribution according to $\hat{R}$< 1.2.
We construct five chains with initial values 0.4, 0.45, 0.5, 0.55 and 0.6, and then compute the Gelman-Rubin statistic.
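For reference (added here), the quantities computed by the Gelman.Rubin function below are the standard ones: with $k$ chains of length $n$, chain means $\bar\psi_{i\cdot}$, grand mean $\bar\psi_{\cdot\cdot}$, and within-chain variances $s_i^2$,
$$
B=\frac{n}{k-1}\sum_{i=1}^{k}(\bar\psi_{i\cdot}-\bar\psi_{\cdot\cdot})^2,\qquad
W=\frac{1}{k}\sum_{i=1}^{k}s_i^2,\qquad
\hat V=\frac{n-1}{n}W+\frac{B}{n},\qquad
\hat R=\frac{\hat V}{W}.
$$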
```r
Gelman.Rubin <- function(psi) {
  # psi[i,j] is the statistic psi(X[i,1:j]) for the chain in the i-th row of X
  psi <- as.matrix(psi)
  n <- ncol(psi)
  k <- nrow(psi)
  psi.means <- rowMeans(psi)          # row means
  B <- n * var(psi.means)             # between-chain variance estimate
  psi.w <- apply(psi, 1, "var")       # within-chain variances
  W <- mean(psi.w)                    # within estimate
  v.hat <- W * (n - 1) / n + (B / n)  # upper variance estimate
  r.hat <- v.hat / W                  # G-R statistic
  return(r.hat)
}
```
```r
prob <- function(y, grpsize) {
  # computes (without the constant) the target density
  if (y < 0 || y >= 1) return(0)
  return((1/2 + y/4)^grpsize[1] * ((1 - y)/4)^grpsize[2] *
           ((1 - y)/4)^grpsize[3] * (y/4)^grpsize[4])
}
theta.chain <- function(w, m, burn) {
  # generate the chain (w is the initial value; m is the length of the chain)
  x <- numeric(m)
  u <- runif(m)            # for the accept/reject step
  v <- runif(m, -w, w)     # proposal distribution
  x[1] <- 0.5
  for (i in 2:m) {
    y <- x[i - 1] + v[i]
    if (u[i] <= prob(y, grpsize) / prob(x[i - 1], grpsize)) x[i] <- y
    else x[i] <- x[i - 1]
  }
  return(x[(burn + 1):m])
}
```
```r
set.seed(711)
m <- 5000        # length of chains
burn <- 1000     # burn-in length
# choose overdispersed initial values
grpsize <- c(125, 20, 18, 34)
x0 <- c(0.4, 0.45, 0.5, 0.55, 0.6)
k <- length(x0)  # number of chains to generate
# generate the chains
X <- matrix(0, nrow = k, ncol = m - burn)
for (i in 1:k) X[i, ] <- theta.chain(x0[i], m, burn)
Gelman.Rubin(X)
```
The Gelman-Rubin statistic is $\hat{R}=1.001811<1.2$, so the chains have approximately converged to the target distribution.
Find the intersection points $A(k)$ in $(0,\sqrt{k})$ of the curves $$ S_{k-1}(a)=P(t(k-1)>\sqrt{\frac{a^2(k-1)}{k-a^2}}) $$ and $$ S_{k}(a)=P(t(k)>\sqrt{\frac{a^2k}{k+1-a^2}}) $$
for $k = 4 : 25, 100, 500, 1000$, where $t(k)$ is a Student t random variable with $k$ degrees of freedom. (These intersection points determine the critical values for a t-test for scale-mixture errors proposed by Székely [260].)
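A brief note on the computation (added for clarity): since $P(t(k)>c)=1-F_{t(k)}(c)$, the intersection point $a$ satisfies
$$
F_{t(k-1)}\!\left(\sqrt{\frac{a^2(k-1)}{k-a^2}}\right)=F_{t(k)}\!\left(\sqrt{\frac{a^2k}{k+1-a^2}}\right),
$$
and this difference of t cdfs is exactly the function passed to uniroot below.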
```r
K <- c(4:25, 100, 500, 1000)
res <- numeric(length(K))
for (i in 1:length(K)) {
  k <- K[i]
  f <- function(a) {
    pt(sqrt(a^2 * (k - 1) / (k - a^2)), k - 1) - pt(sqrt(a^2 * k / (k + 1 - a^2)), k)
  }
  res[i] <- uniroot(f, c(1, 2))$root
}
cbind(K, res)
```
The intersection points $A(k)$ are shown above.
Write a function to compute the cdf of the Cauchy distribution, which has density
$$
\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)},~~-\infty<x<\infty,~~\theta>0.
$$
Compare your results to the results from the R function pcauchy. (Also see the source code in pcauchy.c.)
We use function "integrate" in R to compute the cdf of Cauchy distribution.
```r
f <- function(x, eta, theta) {   # density function
  1 / (theta * pi * (1 + ((x - eta) / theta)^2))
}
df <- function(x, eta = 0, theta = 1) {   # cdf of Cauchy(eta, theta)
  n <- length(x)
  res <- numeric(n)
  for (i in 1:n) {
    res[i] <- integrate(f, lower = -Inf, upper = x[i],
                        rel.tol = .Machine$double.eps^0.25,
                        theta = theta, eta = eta)$value
  }
  return(res)
}
```
Next, we use "df" function builded myself and "pcauchy" function in R to construct the cdf figures with different parameters $\theta, \eta$ respectively.
```r
# (1) standard Cauchy distribution
eta <- 0; theta <- 1
x <- seq(-10, 10, 0.1)
plot(x, df(x, eta, theta), lwd = 0.5, ylab = "cdf", main = "cdf of Cauchy(eta=0, theta=1)")
curve(pcauchy(x, location = 0, scale = 1), add = TRUE, lwd = 2, col = "red")
# (2)
eta <- 0; theta <- 5
x <- seq(-10, 10, 0.1)
plot(x, df(x, eta, theta), lwd = 0.5, ylab = "cdf", main = "cdf of Cauchy(eta=0, theta=5)")
curve(pcauchy(x, location = 0, scale = 5), add = TRUE, lwd = 2, col = "red")
# (3)
eta <- 5; theta <- 1
x <- seq(-5, 15, 0.1)
plot(x, df(x, eta, theta), lwd = 0.5, ylab = "cdf", main = "cdf of Cauchy(eta=5, theta=1)")
curve(pcauchy(x, location = 5, scale = 1), add = TRUE, lwd = 2, col = "red")
# (4)
eta <- 5; theta <- 5
x <- seq(-5, 15, 0.1)
plot(x, df(x, eta, theta), lwd = 0.5, ylab = "cdf", main = "cdf of Cauchy(eta=5, theta=5)")
curve(pcauchy(x, location = 5, scale = 5), add = TRUE, lwd = 2, col = "red")
```
As the figures show, our computed cdf closely matches the results from pcauchy.
Finally, we display the R-level definition of pcauchy, whose numerical work is done in the C source file pcauchy.c.
```r
pcauchy
```
A-B-O blood type problem
$\blacktriangleright$ Let the three alleles be A, B, and O.
| Genotype  | AA       | BB       | OO       | AO       | BO       | AB       | Total |
|-----------|----------|----------|----------|----------|----------|----------|-------|
| Frequency | $p^2$    | $q^2$    | $r^2$    | $2pr$    | $2qr$    | $2pq$    | 1     |
| Count     | $n_{AA}$ | $n_{BB}$ | $n_{OO}$ | $n_{AO}$ | $n_{BO}$ | $n_{AB}$ | $n$   |
$\blacktriangleright$ Observed data: $n_{A\cdot} = n_{AA} + n_{AO} = 28$ (A-type), $n_{B\cdot} = n_{BB} + n_{BO} = 24$ (B-type), $n_{OO} = 41$ (O-type), $n_{AB} = 70$ (AB-type).
$\blacktriangleright$ Use the EM algorithm to solve for the MLE of $p$ and $q$ (treat $n_{AA}$ and $n_{BB}$ as missing data).
$\blacktriangleright$ Record the maximum likelihood values in the M-steps; are they increasing?
First, we write down the complete-data likelihood and, with initial parameters $\hat{p}_0,~\hat{q}_0$, derive the one-step MLE updates.
Complete-data likelihood:
$$
L(p,q\mid n_{AA},n_{BB},n_{OO},n_{AO},n_{BO},n_{AB})=(p^2)^{n_{AA}}(q^2)^{n_{BB}}(r^2)^{n_{OO}}(2pr)^{n_{AO}}(2qr)^{n_{BO}}(2pq)^{n_{AB}},
$$
$$
l(p,q\mid n_{AA},n_{BB},n_{OO},n_{AO},n_{BO},n_{AB})=n_{A\cdot}\log(pr)+n_{AA}\log(p/r)+n_{B\cdot}\log(qr)+n_{BB}\log(q/r)+2n_{OO}\log(r)+n_{AB}\log(pq),
$$
$$
E_{(\hat{p}_0,\hat{q}_0)}[l(p,q\mid n_{AA},n_{BB},n_{OO},n_{AO},n_{BO},n_{AB})]=n_{A\cdot}\log(pr)+n_{A\cdot}\frac{\hat{p}_0}{2-\hat{p}_0-2\hat{q}_0}\log(p/r)+n_{B\cdot}\log(qr)+n_{B\cdot}\frac{\hat{q}_0}{2-2\hat{p}_0-\hat{q}_0}\log(q/r)+2n_{OO}\log(r)+n_{AB}\log(pq),
$$
where $\hat{p}_0$ and $\hat{q}_0$ are the two initial parameters.
Define $a_0=\frac{\hat{p}_0}{2-\hat{p}_0-2\hat{q}_0}$ and $b_0=\frac{\hat{q}_0}{2-2\hat{p}_0-\hat{q}_0}$. The expected log-likelihood then simplifies to
$$
E_{(\hat{p}_0,\hat{q}_0)}[l(p,q\mid n_{AA},n_{BB},n_{OO},n_{AO},n_{BO},n_{AB})]=[n_{A\cdot}(1+a_0)+n_{AB}]\log(p)+[n_{A\cdot}(1-a_0)+n_{B\cdot}(1-b_0)+2n_{OO}]\log(r)+[n_{B\cdot}(1+b_0)+n_{AB}]\log(q).
$$
Maximizing this expectation gives the updated MLEs $\hat{p}_1,~\hat{q}_1$:
$$
\hat{p}_1=\frac{H_p}{H_p+H_q+H_{pq}},\qquad \hat{q}_1=\frac{H_q}{H_p+H_q+H_{pq}},
$$
where $H_p=n_{A\cdot}(1+a_0)+n_{AB}$, $H_q=n_{B\cdot}(1+b_0)+n_{AB}$ and $H_{pq}=n_{A\cdot}(1-a_0)+n_{B\cdot}(1-b_0)+2n_{OO}$.
Next, we use the EM algorithm to compute the MLEs of $p$ and $q$.
```r
p0 <- 0.2; q0 <- 0.2                       # initial estimates
nA <- 28; nB <- 24; nOO <- 41; nAB <- 70   # observed data
tol <- .Machine$double.eps^0.5
N <- 10000                                 # max. number of iterations
p <- p0; q <- q0
El <- numeric(N)   # record the maximum likelihood values in the M-steps
k <- 1             # record the number of iterations
for (i in 1:N) {
  a <- p / (2 - p - 2*q)
  b <- q / (2 - 2*p - q)
  Hp <- nA * (1 + a) + nAB
  Hq <- nB * (1 + b) + nAB
  Hpq <- nA * (1 - a) + nB * (1 - b) + 2 * nOO
  El[i] <- Hp * log(p) + Hpq * log(1 - p - q) + Hq * log(q)   # record the maximized value
  x <- p; y <- q                           # store the previous estimates
  p <- Hp / (Hp + Hq + Hpq); q <- Hq / (Hp + Hq + Hpq)
  k <- k + 1
  if (abs(p - x) / x < tol && abs(q - y) / y < tol) break
  p.hat <- p; q.hat <- q
  El <- El[1:k]
}
p.hat
q.hat
El
diff(El)   # first-order differences of the maximum likelihood values in the M-steps
```
Using the EM algorithm, we obtain MLEs of $p$ and $q$ equal to 0.3273442 and 0.3104267. From the first-order differences of the maximum log-likelihood values recorded in the M-steps, they are not increasing.
Use both for loops and lapply() to fit linear models to the mtcars using the formulas stored in this list:
```r
formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)
```
```r
# using a for loop
attach(mtcars)
formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)
f <- list()
for (i in seq_along(formulas)) {
  f[[i]] <- lm(formulas[[i]], mtcars)
}
f
```
```r
# using lapply()
lapply(formulas, function(x) lm(x, mtcars))
```
Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function?
```r
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})
```
```r
# using a for loop
set.seed(921)
attach(mtcars)
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})
f <- list()
for (i in seq_along(bootstraps)) {
  f[[i]] <- lm(mpg ~ disp, bootstraps[[i]])
}
f
```
```r
# using lapply()
lapply(bootstraps, function(x) lm(mpg ~ disp, x))
```
```r
# without an anonymous function
lapply(bootstraps, lm, formula = mpg ~ disp)
```
For each model in the previous two exercises, extract $R^2$ using the function below.
```r
rsq <- function(mod) summary(mod)$r.squared
```
In this exercise, we only use lapply() to extract $R^2$.
```r
set.seed(921)
mod1 <- lapply(formulas, function(x) lm(x, mtcars))
mod2 <- lapply(bootstraps, function(x) lm(mpg ~ disp, x))
rsq <- function(mod) summary(mod)$r.squared
c(unlist(lapply(mod1, rsq)), unlist(lapply(mod2, rsq)))
```
The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.
```r
trials <- replicate(
  100,
  t.test(rpois(10, 10), rpois(7, 10)),
  simplify = FALSE
)
```
Extra challenge: get rid of the anonymous function by using [[ directly.
```r
# using sapply() and an anonymous function
set.seed(941)
trials <- replicate(
  100,
  t.test(rpois(10, 10), rpois(7, 10)),
  simplify = FALSE
)
sapply(trials, function(x) x$p.value)
```
```r
# without an anonymous function
p.value <- function(x) x$p.value   # define a p-value extractor
sapply(trials, p.value)
```
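For the extra challenge, `[[` can also be passed to sapply directly (a short sketch added here; it should return the same vector of p-values as above):

```r
# extract the p-values with `[[` directly, no helper function needed
sapply(trials, "[[", "p.value")
```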
Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?
We combine Map() and vapply() to create an lapply() variant.
```r
lapply1 <- function(f, n, type, ...) {   # n is the length of the output of f
  f <- match.fun(f)
  tt <- Map(f, ...)
  if (type == "numeric") y <- vapply(tt, cbind, numeric(n))
  else if (type == "character") y <- vapply(tt, cbind, character(n))
  else if (type == "complex") y <- vapply(tt, cbind, complex(n))
  else if (type == "logical") y <- vapply(tt, cbind, logical(n))
  return(y)
}
# example
set.seed(951)
kk <- function(x, w) {
  y <- weighted.mean(x, w, na.rm = TRUE)
  return(c(1, y))
}
xs <- replicate(5, runif(10), simplify = FALSE)
ws <- replicate(5, rpois(10, 5) + 1, simplify = FALSE)
lapply1(kk, 2, "numeric", xs, ws)
is.matrix(lapply1(kk, 2, "numeric", xs, ws))
```
Then, we use function "foreach" to create an lapply() variant.
```r
library(foreach)
lapply2 <- function(..., combine) {
  foreach(..., .combine = combine)
}
```
We use the new function for some simple parallel-style computations.
```r
# output is a vector
lapply2(a = c(1, 2, 3), b = rep(1, 3), combine = "c") %do% {x1 <- a + b; x2 <- a * b; c(x1, x2)}
# output is a matrix
lapply2(a = c(1, 2, 3), b = rep(1, 3), combine = "cbind") %do% {x1 <- a + b; x2 <- a * b; c(x1, x2)}
# reproduce the results of the previous question with the new function
lapply2(i = 1:length(trials), combine = c) %do% trials[[i]]$p.value
bias <- lapply2(i = 1:length(trials), combine = c) %do% trials[[i]]$p.value - sapply(trials, function(x) x$p.value)
bias
```
Make a faster version of chisq.test() that only computes the chi-square test statistic when the input is two numeric vectors with no missing values. You can try simplifying chisq.test() or coding from the mathematical definition (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test).
```r
set.seed(1011)
chisq.test2 <- function(x, y) {
  input <- as.table(rbind(x, y))
  out <- chisq.test(input)$statistic
  out
}
# example
x <- rpois(10000, lambda = 20)
y <- rbinom(10000, 30, 0.6)
chisq.test2(x, y)
system.time(chisq.test2(x, y))
system.time(chisq.test(as.table(rbind(x, y))))
```
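The wrapper above still calls chisq.test() internally. A further speed-up, sketched below under the assumption that only the Pearson statistic of the 2-row table of non-negative counts is needed (the name chisq.stat2 is ours, not from the exercise), is to compute the statistic directly from its definition:

```r
# minimal sketch: Pearson chi-square statistic only, no checks, no p-value
chisq.stat2 <- function(x, y) {
  O <- rbind(x, y)                               # observed counts
  E <- outer(rowSums(O), colSums(O)) / sum(O)    # expected counts under independence
  sum((O - E)^2 / E)                             # Pearson statistic
}
```

This skips the input validation and p-value work done by chisq.test(), so it should reproduce the statistic above while running noticeably faster.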
Can you make a faster version of table() for the case of an input of two integer vectors with no missing values? Can you use it to speed up your chi-square test?
```r
set.seed(1011)
x <- rpois(10000, lambda = 20)
y <- rbinom(10000, 30, 0.6)
as.table2 <- function(x, y) {
  out <- rbind(x, y)
  class(out) <- "table"
  out
}
class(as.table2(x, y))
system.time(as.table2(x, y))
system.time(as.table(rbind(x, y)))
system.time(chisq.test(as.table2(x, y)))
system.time(chisq.test(as.table(rbind(x, y))))
```
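The version above simply relabels the 2-row matrix as a table. If a genuine cross-tabulation of the two integer vectors is wanted instead, a faster route (a sketch assuming no missing values; fast_table2 is our own name) is to map each pair to a single cell index and count with tabulate():

```r
# minimal sketch: two-way contingency table for integer vectors via tabulate()
fast_table2 <- function(x, y) {
  ux <- sort(unique(x)); uy <- sort(unique(y))
  cell <- (match(x, ux) - 1L) * length(uy) + match(y, uy)   # cell index for each (x, y) pair
  counts <- tabulate(cell, nbins = length(ux) * length(uy))
  matrix(counts, nrow = length(ux), byrow = TRUE,
         dimnames = list(as.character(ux), as.character(uy)))
}
```

Such a table of counts could then be fed to the direct statistic computation sketched in the previous exercise.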