Use knitr to produce at least 3 examples(texts,figures,tables).
There is a data set called "tli" from the package "datasets".It contains grades,sex,disadvg,ethnicty and tlimth of 100 randomly selected students participating in the Assessment of Academic Skills.I list the data with a table and plot the data.Then Icalculate the convariance between grade and tlimth.
library(datasets)#Require package "datasets" library(xtable)#Require package "xtable" data(tli)#Pick a data set "tli" print(xtable(tli[1:20,]), type= 'html')#List "tli" with a table plot(tli)#Visualize the data a<-cov(tli[,1],tli[,5])#calculate the convariance between grade and tlimth
The covariance between grade and tlimth isr a
In example 2,I defined variables x,y and calculated the linear regression between x and y. I listed the results and visualized it.
library(knitr)#Require package "knitr" x <- 1:10#Value x from 1 to 10 y <- x^2#Define y lmr <- lm(y ~ x)#Calculate the linear regression between x and y co <- summary(lmr)$coefficients#Put the results into a matrix knitr::kable(co)#List co by table plot(x,y)#Visualize (x,y)
The $R^2$ is r summary(lmr)$r.squared
I picked a data set from package "datasets" called "PlantGrowth".It is a data frame of 30 cases on 2 variables:weight and group.The levels of group are ‘ctrl’, ‘trt1’, and ‘trt2’.The codes list the data and plot the number of each groups with different colors.
data("PlantGrowth")#Pick a data set "tli" print(xtable::xtable(PlantGrowth[1:30,]), type= 'html') grp<-table(PlantGrowth[,'group']) barplot(grp,main = "number of group",col=c("brown2", "aquamarine1","aliceblue"))
The number of ctrl,trt1 and trt2 is r grp
.
The Rayleigh density is$$f(x)=\frac{x}{\sigma^{2}} e^{-x^{2} /\left(2 \sigma^{2}\right)}, \quad x \geq 0, \sigma>0$$ Develop an algotithm to generate random samples from a Rayleigh($\sigma$) distribution. Generate Rayleigh($\sigma$) samples for several choices of $\sigma>0$ and check that the mode of the generated samples is close to the theoretical mode $\sigma$ (check the histgram).
Idea: We can use the Inverse Transform Method to solve the problem. Through the Rayleigh density we can calculate the distribution function is $$F(x)=\ 1-e^{\frac{-x^{2}}{2 \sigma^{2}}}$$.I write a function to generate random samples.Then I choose $\sigma=2,3,4,5$ to check that whether the mode of the generated samples is close to the theoretical mode or not.
set.seed(12345) Rayleigh.hist<-function(n,sigma) #write a function to gennerate random samples { u<-runif(n) x<-sqrt(-2*sigma^2*log(1-u)) hist(x,breaks=20,prob=TRUE,main = paste("Histogram of sigma=" , sigma)) #density histgram of sample y<-seq(0,20,.01) lines(y,(y/sigma^2)*exp(-y^2/(2*sigma^2))) #density curve f(x) } #par(mfrow=c(2,2)) #2 graphs per page for (sigma in 2:5) #use the loop to choose sigma =2,3,4,5 { Rayleigh.hist(10000,sigma) }
Result: Through the histgrams, the mode of the generated samples is close to the theoretical mode $\sigma$.
Generate a random sample of size 1000 from a normal location mixture. The components of the mixture have N(0,1) and N(3,1) distributions with mixing probabilities$\ p_{1}$ and $\ p_{2}=1-p_{1}$ . Graph the histogram of the sample with density superimposed, for$\ p_{1}=0.75$ . Repeat with different values for$\ p_{1}$ and observe whether the empirical distribution of the mixture appears to be bimodal. Make a conjecture about the values of $\ p_{1}$ that produce bimodal mixtures.
Idea: (1) Generate x1 ~N(0,1), x2 ~N(3,1). (2) Generate an integer p=0,1,where P(0)=p1,P(1)=p2. (3) x=px1+(1-p)x2 and graph the histgram.
#write a function to generate sample from normal location mixture loc.mix<-function(p1){ x1<-rnorm(1000,0,1) #generate sample from normal distribution N(0,1) x2<-rnorm(1000,3,1) u<-runif(1000) p<-as.integer(u>p1) #vectors of 0's and 1's x<-p*x1+(1-p)*x2 #the mixture hist(x,breaks=50,prob=TRUE,main = paste("Histogram of p1=" , p1)) #histgram of the sample } loc.mix(0.75) #p1=0.75 #par(mfrow=c(1,4)) #4 graphs per page for (p1 in seq(0,1,0.02)) #use the loop to repeat with p1=0, 0.02,0.04,0.06,...,1 { loc.mix(p1) }
Result: From the histgrams of different values of $\ p_{1}$, I deduce that when $0.3<\mathrm p_{1}<0.7$ ,it can produce bimodal mixtures.
Write a function to generate a random sample from a $\ W_{d}(\Sigma, n)$ (Wishart) distribution for $\ n>d+1 \geq 1$ , based on Bartlett's decomposition.
Idea: Follow the steps in the textbook(P80)
#write a function to generate a random sample from a wishart distribution wishart.Bartlett <-function(d,n,Sigma){ T<-matrix(0,d,d) #a d*d 0-matrix T for (i in 1:d) #use a loop to generate T { for (j in 1:d) { if(i==j) T[i,j]<-rnorm(1,0,1) #T(i,j)~N(0,1),i>j if(i<j) T[i,j]<-sqrt(rchisq(1,n-i+1,ncp = 0)) #T(i,i)~sqrt(X^2(n-i+1)) } } A<-T%*%t(T) #A has a Wd(Id,n) distribution L<-chol(Sigma) #Choleski factorization L%*%A%*%t(L) #the sample from a Wd(sigma,n) distribution } #an example to test the function wishart.Bartlett(2,4,matrix(c(5,1,1,3),2,2))
Result:Through the example,the function works.
Compute a Monte Carlo estimate of $\int_{0}^{\pi / 3} \sin t d t$ and compare your estimate with the exact value of the integral.
m<-10000 x<-runif(m,min=0,max=pi/3) esti<-mean(sin(x))*(pi/3) print(esti) #estimate print(cos(0)-cos(pi/3)) #extra value
The estimate isr esti
and the extra value isr (cos(0)-cos(pi/3))
Use Monte Carlo integration with antithetic variables to estimate $\int_{0}^{1} \frac{e^{-x}}{1+x^{2}} d x$ and find the approximate reduction in variance as a percentage of the variance without variance reduction.
#The anti.MC function will compute the estimate with or without antithetic sampling N<-10000 u<-v<-numeric(N/2) anti.MC <- function(antithetic = TRUE) { u <- runif(N/2) if (!antithetic) v <- runif(N/2) else v <- 1 - u g1<-exp(u)/(1+u^2) g2<-exp(v)/(1+v^2) esti <- mean(c(g1,g2)) esti } #estimate with or without antithetic sampling m <- 1000 autith<- anti.MC(anti =TRUE)#with autithentic noauti<- anti.MC(anti = FALSE) #without print(autith) print(noauti) #The approximate reduction in variance can be estimated by a simulation under both methods MC1 <- MC2 <- numeric(m) for (i in 1:m) { MC1[i] <- anti.MC(anti = FALSE) MC2[i] <- anti.MC(anti =TRUE) } print((var(MC1) - var(MC2))/var(MC1))#the approximate reduction in variance
The estimate with antithetic sampling is r autith
.The antithetic variable approach achieved approximately 66% reduction.
Obtain the stratified importance sampling estimate in Example 5.13 and compare it with the result of Example 5.10.
m <-10000 #number of replicates k <-5 #number of strata M<-m/k #importance sample size g <- function(x) { exp(-x - log(1+x^2)) * (x > 0) * (x < 1) } fj <-function(x) { 5*exp(-x) / (1 - exp(-1))} fg <-function(x) { g(x)/fj(x)} F <-function(u) { - log(1 - u * (1 - exp(-1)))} theta.SI<-se.SI<-matrix(0,k,1) for (j in 1:k) #use the loop to estimate in k intervals { if(j==1) { x <- runif(m,0,F(1)*j/k) theta.SI[j] <- mean(g(x)) se.SI[j]<-var(fg(x)) } if(j>1 && j<5) { x <- runif(m,F(1)*(j-1)/k,F(1)*j/k) theta.SI[j] <- mean(g(x)) se.SI[j]<-var(fg(x)) } if(j==5) { x <- runif(m,F(1)*(j-1)/k,1) theta.SI[j] <- mean(g(x)) se.SI[j]<-var(fg(x))} } apply(theta.SI,2,mean) #estimate apply(se.SI,2,sum)/M #variance
The stratified importance sampling estimate is r apply(theta.SI,2,mean)
.Compare to example 5.10,it represents a more than 98% reduction in variance.
Suppose a 95% symmetric t-interval is applied to estimate a mean, but the sample data are non-normal. Then the probability that the confidence interval covers the mean is not necessarily equal to 0.95. Use a Monte Carlo experiment to estimate the coverage probability of the t-interval for random samples of $\chi^{2}(2)$ data with sample size n = 20. Compare your t-interval results with the simulation results in Example 6.4. (The t-interval should be more robust to departures from normality than the interval for variance.)
Generate $\chi^{2}(2)$ data with sample size n = 20. Chi-square distribution $\chi^{2}(2)$ has mean=2 and variance=4. $T=\frac{\bar{X}-\mu_{0}}{S / \sqrt{n}} \sim t(n-1)$. Then the coverage probability by a 95% symmetric t-interval is $P\left(-t_{n-1}\left(\frac{\alpha}{2}\right)<T<t_{n-1}\left(\frac{\alpha}{2}\right)\right)$ .
#estimate the coverage probability set.seed(123) n<-20 #sample size alpha<-0.05 Stat<- replicate(1000, expr = { x <- rchisq(n, df = 2) sqrt(20)*(mean(x)-2) / 2 } ) cover.hat<-mean(Stat < -qt(alpha/2,n-1) & Stat > qt(alpha/2,n-1)) #coverage probability cover.hat
The estimate of the coverage probability is r cover.hat
. In example 6.4,the result is 0.95. Compare to example 6.4, the t-interval result is close but a little larger.
Estimate the 0.025, 0.05, 0.95, and 0.975 quantiles of the skewness $\sqrt{b_{1}}$ under normality by a Monte Carlo experiment. Compute the standard error of the estimates from (2.14) using the normal approximation for the density (with exact variance formula). Compare the estimated quantiles with the quantiles of the large sample approximation $\sqrt{b_{1}} \approx N(0,6 / n)$.
I write a function to estimate the quantiles with different sample size n. In the function, firstly, I generate random skewness under normality and use function "quantile" to estimate the quantiles of skewness. Then I use (2.14) to compute the standard error. Finally, I use a table to compare the estimated quantiles with the quantiles of $\sqrt{b_{1}} \approx N(0,6 / n)$. I pick n=10,100,1000.
set.seed(123) #write a function to choose different n quan.esti<-function(n){ #estimate the quantiles of skewness k<-10000 #number of replicates b<-numeric(k) for (i in 1:k) { x<-rnorm(n,0,1) xbar<-mean(x) m3 <- mean((x - xbar)^3) m2 <- mean((x - xbar)^2) b[i]<- m3 / m2^1.5 } quanti<-quantile(b,c(0.025,0.05,0.95,0.975)) #estimate of quantiles #compute the standard error f<-function(x) exp(-x^2/(2*(6/n)^2))/sqrt(2*pi*sqrt(6/n)) var.hat<-numeric(length(quanti)) q<-c(0.025,0.05,0.95,0.975) for (i in 1:length(quanti)) { var.hat[i]<-q[i]*(1-q[i])/(n*f(quanti[i])^2) } se.hat<-sqrt(var.hat) #SE #compare with N(0,6/n) norm.hat<-qnorm(c(0.025,0.05,0.95,0.975),0,sqrt(6/n)) #output the results data.frame(quanti,se.hat,norm.hat) } quan.esti(10) #n=10 quan.esti(100) #n=100 quan.esti(1000) #n=1000
Compare the estimated quantiles with the quantiles of the large sample approximation $\sqrt{b_{1}} \approx N(0,6 / n)$ in the table, when sample size n gets larger, quantiles of the estimates and the approximation N(0,6/n) are getting closer.
Estimate the power of the skewness test of normality against symmetric $\operatorname{Beta}(\alpha, \alpha)$ distributions and comment on the results. Are the results different for heavy-tailed symmetric alternatives such as $t(\nu)$ ?
#write a function to compute the sample skewness statistic sk <- function(x) { xbar <- mean(x) m3 <- mean((x - xbar)^3) m2 <- mean((x - xbar)^2) return( m3 / m2^1.5 ) } beta<-0.1 #significance level n<-30 #sample size m<-1000 #number of replicates alpha<-c(seq(1,10,1)) #parameter of Beta distribution N<-length(alpha) pwr<-numeric(N) #critical value for the skewness test cv <- qnorm(1-beta/2, 0, sqrt(6*(n-2) / ((n+1)*(n+3)))) # the power of the skewness test for (j in 1:N) { #for each alpha a<-alpha[j] sktests <- numeric(m) pvalues<-replicate(m,expr = { x<-rbeta(n,a,a) sktests<-as.integer(abs(sk(x))>= cv) }) pwr[j]<- mean(pvalues) } pwr # plot power vs alpha plot(alpha, pwr, type = "b",xlab = bquote(alpha), ylim = c(0,0.2)) abline(h =0.1, lty = 3) se <- sqrt(pwr * (1-pwr) / m) #add standard errors lines(alpha, pwr+se, lty = 3) lines(alpha, pwr-se, lty = 3)
From the plot, we can see that powers of the test for each alpha are less than 0.10. With the alpha gets larger, the power is getting closer to 0.10.
for (j in 1:10) { #for each v=alpha v<-alpha[j] sktests <- numeric(m) pvalues<-replicate(m,expr = { x<-rt(n,df=v) sktests<-as.integer(abs(sk(x))>= cv) }) pwr[j]<- mean(pvalues) } # plot power vs v plot(alpha, pwr, type = "b",xlab = bquote(v), ylim = c(0,1)) abline(h =0.1, lty = 3) se <- sqrt(pwr * (1-pwr) / m) #add standard errors lines(alpha, pwr+se, lty = 3) lines(alpha, pwr-se, lty = 3)
From the plot, the power of the test for each $\nu$ is larger than 0.10. The power is getting closer to 0.10 when alpha gets larger. The results are different from symmetric Beta distribution.
Use Monte Carlo simulation to investigate whether the empirical Type I error rate of the t-test is approximately equal to the nominal significance level $\alpha$, when the sampled population is non-normal. The t-test is robust to mild departures from normality. Discuss the simulation results for the cases where the sampled population is (i)$\chi^{2}(1)$, (ii) Uniform(0,2), and (iii) Exponential(rate=1). In each case, test $H_{0}: \mu=\mu_{0}$ vs $H_{1}: \mu \neq \mu_{0}$, where $\mu_{0}$ is the mean of $\chi^{2}(1)$, Uniform(0,2), and Exponential(1), respectively.
For this experiment, the significance level is alpha=0.05 and the sample size is n = 30. The means of $\chi^{2}(1)$, Uniform(0,2), and Exponential(1) are all equal to 1. The method I use is similar to Example 6.7 in the textbook.
n<-30 alpha<-0.05 #the nominal significance level mu0<-1 m<-10000 #number of replicates p_x<-p_y<-p_z<-numeric(m) for (j in 1:m) { x<-rchisq(n,1) y<-runif(n,min = 0,max = 2) z<-rexp(n,rate = 1) ttest_x <- t.test(x, alternative = "two.sided", mu = mu0) #t test ttest_y <- t.test(y, alternative = "two.sided", mu = mu0) ttest_z <- t.test(z, alternative = "two.sided", mu = mu0) p_x[j] <- ttest_x$p.value p_y[j] <- ttest_y$p.value p_z[j] <- ttest_z$p.value } px.hat<-mean(p_x<alpha) py.hat<-mean(p_y<alpha) pz.hat<-mean(p_z<alpha) se.x <- sqrt(px.hat * (1 - px.hat) / m) #standard error se.y <- sqrt(py.hat * (1 - py.hat) / m) se.z <- sqrt(pz.hat * (1 - pz.hat) / m) print(c(alpha,px.hat,py.hat,pz.hat,se.x,se.y,se.z))
From the simulation, the empirical Type I error rate of the t-test is r px.hat
, r py.hat
, r pz.hat
for $\chi^{2}(1)$, Uniform(0,2), Exponential(1) respectively. They are all larger than alpha=0.05. For $\chi^{2}(1)$, the empirical Type I error rate differ from alpha=0.05 by more than ten standard error. For Uniform(0,2), the result differs from alpha=0.05 by about one standard error. For Exponential(1), the result differs from alpha=0.05 by more than seven standard error.
If we obtain the powers for two methods under a particular simulation setting with 10000 experiments: say,0.651 for one method and 0.676 for another method. Can we say the powers are different at 0.05 level?
Efron and Tibshirani discuss the scor (bootstrap) test score data on 88 students,who took examinations in five subjects [84, Table 7.1], [188, Table 1.2.1].The first two tests (mechanics, vectors) were closed book and the last three tests (algebra, analysis, statistics) were open book. Each row of the data frame is a set of scores $\left(x_{i 1}, \ldots, x_{i 5}\right)$ for the $i^{t h}$ student. Use a panel display to display the scatter plots for each pair of test scores. Compare the plot with the sample correlation matrix. Obtain bootstrap estimates of the standard errors for each of the following estimates: $\hat{\rho}{12}=\hat{\rho}(\mathrm{mec}, \mathrm{vec})$, $\hat{\rho}{34}=\hat{\rho}(\mathrm{alg}, \mathrm{ana})$, $\hat{\rho}{35}=\hat{\rho}(\mathrm{alg}, \mathrm{sta})$, $\hat{\rho}{45}=\hat{\rho}(\mathrm{ana}, \mathrm{sta})$.
I use function "pairs" to display the scatter plots for each pair of test scores. The method to estimate the standard errors for each correlation estimate is in the book.
set.seed(123) library(bootstrap) #the scatter plots for each pair of test scores pairs(scor) #sample correlation matrix cor(scor) #bootstrap estimate of SE of estimates of correlations r<-500 #number of replicates n<-nrow(scor) R_12<-R_34<-R_35<-R_45<-numeric(n) for(j in 1:r){ i<-sample(1:n,size = n,replace=TRUE) mec<-scor$mec[i] vec<-scor$vec[i] alg<-scor$alg[i] ana<-scor$ana[i] sta<-scor$sta[i] R_12[j]<-cor(mec,vec) R_34[j]<-cor(alg,ana) R_35[j]<-cor(alg,sta) R_45[j]<-cor(ana,sta) } print(c(sd(R_12),sd(R_34),sd(R_35),sd(R_45)))
r sd(R_12)
, r sd(R_34)
, r sd(R_35)
, r sd(R_45)
.Repeat Project 7.A for the sample skewness statistic. Compare the coverage rates for normal populations (skewness 0) and $\chi^{2}(5)$ distributions (positive skewness).
I use the Monte Carlo method in the slides. To compute confidence interval estimates for the skewness, I use the function "boot" and "boot.ci" in the boot package. The code generates 95% confidence intervals for the skewness statistic. The skewness of $\chi^{2}(5)$ distribution is sqrt(8/5).
#the sample skewness statistic skew.boot <- function(x,i) { x<-x[i] xbar <- mean(x) m3 <- mean((x - xbar)^3) m2 <- mean((x - xbar)^2) m3/m2^1.5 }
(i) The coverage rates for normal populations
library(boot) n<-20 #the sample size m<-1000 #number of replicates norm.left<-norm.right<-basic.left<- basic.right<-perce.left<-perce.right<-numeric(m) for(i in 1:m){ x<-rnorm(n) x<-as.matrix(x) boot.obj<-boot(x,statistic=skew.boot,R=1000) a<-boot.ci(boot.obj, type = c("basic", "norm", "perc")) norm.left[i]<-as.integer(a$norm[2]>0) norm.right[i]<-as.integer(a$norm[3]<0) basic.left[i]<-as.integer(a$basic[4]>0) basic.right[i]<-as.integer(a$basic[5]<0) perce.left[i]<-as.integer(a$percent[4]>0) perce.right[i]<-as.integer(a$percent[5]<0) } #proportion of times that confidence intervals miss on the left and right print(c(mean(norm.left),mean(norm.right), mean(basic.left),mean(basic.right), mean(perce.left),mean(perce.right))) #the coverage probabilities cat('norm=', 1-mean(norm.left)-mean(norm.right), 'basic=', 1-mean(basic.left)-mean(basic.right), 'perc=', 1-mean(perce.left)-mean(perce.right))
(ii) The coverage rates for chi-square populations(df=5)
sk<-sqrt(8/5) for(i in 1:m){ y<-rchisq(n,df=5) y<-as.matrix(y) boot.obj<-boot(y,statistic=skew.boot,R=1000) a<-boot.ci(boot.obj, type = c("basic", "norm", "perc")) norm.left[i]<-as.integer(a$norm[2]>sk) norm.right[i]<-as.integer(a$norm[3]<sk) basic.left[i]<-as.integer(a$basic[4]>sk) basic.right[i]<-as.integer(a$basic[5]<sk) perce.left[i]<-as.integer(a$percent[4]>sk) perce.right[i]<-as.integer(a$percent[5]<sk) } #proportion of times that confidence intervals miss on the left and right print(c(mean(norm.left),mean(norm.right), mean(basic.left),mean(basic.right), mean(perce.left),mean(perce.right))) #the coverage probabilities cat('norm=', 1-mean(norm.left)-mean(norm.right), 'basic=', 1-mean(basic.left)-mean(basic.right), 'perc=', 1-mean(perce.left)-mean(perce.right))
Compare the coverage rates for the two distributions:
Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.
Use the method in the textbook. The MLE of $\Sigma$ is (n-1)*cov(scor)/n.
data(scor, package = "bootstrap") n <- nrow(scor) theta.jack <- numeric(n) theta.hat <- max(eigen((n-1)*cov(scor)/n)$values)/sum(eigen((n-1)*cov(scor)/n)$values) for (i in 1:n) theta.jack[i] <- max(eigen((n-2)*cov(scor[-i,])/(n-1))$values) / sum(eigen((n-2)*cov(scor[-i,])/(n-1))$values) bias.j <- (n - 1) * (mean(theta.jack) - theta.hat) #bias se.j <- sqrt((n - 1) * mean((theta.jack - mean(theta.jack))^2)) #se cat("bias =",bias.j,"se =",se.j)
The jackknife estimates of bias and standard error of $\hat{\theta}$ are r bias.j
and r se.j
.
In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Repeat the analysis replacing the Log-Log model with a cubic polynomial model. Which of the four models is selected by the cross validation procedure? Which model is selected according to maximum adjusted $R^2$?
Use the method in Example 7.18.
library(DAAG) attach(ironslag) n <- length(magnetic) e1<-e2<-e3<-e4<-numeric(n) r1<-r2<-r3<-r4<-numeric(n) for (k in 1:n) { y <- magnetic[-k] x <- chemical[-k] S1 <- lm(y ~ x) #Linear model yhat1 <- S1$coef[1] + S1$coef[2] * chemical[k] e1[k] <- magnetic[k] - yhat1 r1[k] <- summary(S1)$adj.r.squared S2 <- lm(y ~ x + I(x^2)) #Quadratic model yhat2 <- S2$coef[1] + S2$coef[2] * chemical[k] +S2$coef[3] * chemical[k]^2 e2[k] <- magnetic[k] - yhat2 r2[k] <- summary(S2)$adj.r.squared S3 <- lm(log(y) ~ x) #Exponential model logyhat3 <- S3$coef[1] + S3$coef[2] * chemical[k] yhat3 <- exp(logyhat3) e3[k] <- magnetic[k] - yhat3 r3[k] <- summary(S3)$adj.r.squared S4 <- lm(y ~ x +I(x^2)+ I(x^3)) #Cubic polynomial model yhat4 <- S4$coef[1] + S4$coef[2] * chemical[k]+S4$coef[3] * chemical[k]^2+S4$coef[4] * chemical[k]^3 e4[k] <- magnetic[k] - yhat4 r4[k] <- summary(S4)$adj.r.squared } #the mean of the squared prediction errors c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2)) #the mean of adjusted R^2 in cross validation procedure c(mean(r1), mean(r2), mean(r3), mean(r4)) #the adjusted R^2 without cross validation procedure R1<-summary(lm(magnetic ~ chemical))$adj.r.squared R2<-summary(lm(magnetic ~ chemical + I(chemical^2)))$adj.r.squared R3<-summary(lm(log(magnetic) ~ chemical))$adj.r.squared R4<-summary(lm(magnetic ~ chemical + I(chemical^2) + I(chemical^3)))$adj.r.squared c(R1,R2,R3,R4)
According to the prediction error criterion, Model 2, the quadratic model is selected by the cross validation procedure. And the quadratic model is also selected according to maximum adjusted $R^2$ whether or not in cross validation procedure.
The Count 5 test for equal variances in Section 6.4 is based on the maximum number of extreme points. Example 6.15 shows that the Count 5 criterion is not applicable for unequal sample sizes. Implement a permutation test for equal variance based on the maximum number of extreme points that applies when sample sizes are not necessarily equal.
Generate X ~ N(1,2) with sample size 30, Y ~ N(0,4) with sample size 50. Generate a function "maxout" to count the maximum number of extreme points. $H_{0}:\sigma_x=\sigma_y$ vs $H_{1}:\sigma_x\not=\sigma_y$. Significance level alpha = 0.05. Use the method in the textbook. Plot the permutation distribution of the maximun number of extreme points and compute the p-value.
set.seed(123) #The function maxout below counts the maximum number of extreme points maxout <- function(x, y) { X <- x - mean(x) Y <- y - mean(y) outx <- sum(X > max(Y)) + sum(X < min(Y)) outy <- sum(Y > max(X)) + sum(Y < min(X)) return(max(c(outx, outy))) } R <- 9999 #number of replicates x <- rnorm(30, 1, 2) y <- rnorm(50, 0, 4) z <- c(x,y) K <- 1:80 t0 <- maxout(x,y) rep <- replicate(R,expr={ #permutations of the sample k <- sample(K, size = 30, replace = FALSE) x1 <- z[k] y1 <- z[-k] maxout(x1,y1) }) #Permutation distribution of the maximun number of extreme points hist(rep,main = "distribution of maxout", freq=FALSE, xlab="Replicates of z",xlim = c(0,15)) abline(v = t0,col ="red") #p-value p<-mean(c(t0, rep) >= t0) p
The p-value for testing equal variance is r p
. So the null hypothesis is rejected under alpha=0.05.
Power comparison (distance correlation test versus ball covariance test)
Use the function "dCov" and "ndCov2" to compute the test statistic "dcovtest" for distance correlation test. Choose significance level alpha=0.05, compute the powers of two test methods when sample size n=30,45,...,150 for the two models respectively. Then plot the powers vs sample size n.
n<-seq(30,150,15) #the sample size k<-length(n) alpha<-0.05 m<-200 #number of replicates set.seed(12345) library(boot) dCov <- function(x, y) { x <- as.matrix(x) y <- as.matrix(y) n <- nrow(x) m <- nrow(y) if (n != m || n < 2) stop("Sample sizes must agree") if (! (all(is.finite(c(x, y))))) stop("Data contains missing or infinite values") Akl <- function(x) { d <- as.matrix(dist(x)) m <- rowMeans(d) M <- mean(d) a <- sweep(d, 1, m) b <- sweep(a, 2, m) return(b + M) } A <- Akl(x) B <- Akl(y) dCov <- sqrt(mean(A * B)) dCov } ndCov2 <- function(z, ix, dims) { #dims contains dimensions of x and y p <- dims[1] q1 <- dims[2] + 1 d <- p + dims[2] x <- z[ , 1:p] y <- z[ix, q1:d] #permute rows of y return(nrow(z) * dCov(x, y)^2) } dcovtest<-function(z,dims){ boot.obj <- boot(data = z, statistic = ndCov2, R = 199, sim = "permutation",dims = c(2, 2)) tb<-c(boot.obj$t0,boot.obj$t) p.value<-mean(tb >= tb[1]) list(statistic=tb[1],p.value=p.value) } library(Ball) p1<-p2<-numeric(m)
pcov1<-pball1<-numeric(k) for (j in 1:k) { for (i in 1:m) { x <- matrix(rnorm(n[j] * 2), nrow = n[j], ncol = 2) e <- matrix(rnorm(n[j] * 2), nrow = n[j], ncol = 2) y<-x/4+e z <- as.data.frame(cbind(x, y)) p1[i]<-dcovtest(z,dims = c(2,2))$p.value p2[i]<-bcov.test(x,y,R=199,seed=i*12345)$p.value } pcov1[j]<-mean(p1<alpha) pball1[j]<-mean(p2<alpha) } #plot the powers vs sample number n plot(n,pcov1,type="o",xlab="n",ylab="powers",ylim=c(0,1),main="Model1",col=1) lines(n,pball1, type="o", col=2) legend('bottomright',legend=c('dcov','ball'),col=1:2,lty=1)
pcov2<-pball2<-numeric(k) for (j in 1:k) { for (i in 1:m) { x <- matrix(rnorm(n[j] * 2), nrow = n[j], ncol = 2) e <- matrix(rnorm(n[j] * 2), nrow = n[j], ncol = 2) y<-x*e/4 z <- as.data.frame(cbind(x, y)) p1[i]<-dcovtest(z,dims = c(2,2))$p.value p2[i]<-bcov.test(x,y,R=199,seed=i*12345)$p.value } pcov2[j]<-mean(p1<alpha) pball2[j]<-mean(p2<alpha) } #plot the powers vs sample number n plot(n,pcov2,type="o",xlab="n",ylab="powers",ylim=c(0,1),main="Model2",col=1) lines(n,pball2, type="o", col=2) legend('bottomright',legend=c('dcov','ball'),col=1:2,lty=1)
From a power comparison with 200 test decisions for each of the sample sizes we have obtained the results shown in the Figures.
Implement a random walk Metropolis sampler for generating the standard Laplace distribution (see Exercise 3.2). For the increment, simulate from a normal distribution. Compare the chains generated when different variances are used for the proposal distribution. Also, compute the acceptance rates of each chain.
Use the method in the textbook. The standard Laplace distribution has density f(x) = exp{−|x|}/2, x ∈ R. Choose variance sigma = 0.05,0.5,3,10.
set.seed(1234) f.lap <- function(x) exp(-abs(x))/2 #density of laplace distribution #Use a function to gennerate the laplace distribution rw.MH <- function(sigma, x0, N){ # sigma: variance of normal distribution # x0: initial value # N: size of random numbers required x <- numeric(N) x[1] <- x0 u <- runif(N) k<-0 for (i in 2:N) { y <- rnorm(1,x[i-1],sigma) if(u[i] <= (f.lap(y)) / f.lap(x[i-1])) #accept y { x[i] <-y k <- k+1 #compute acceptance rate } else {x[i]<-x[i-1]} #reject y } return(list(x=x, k=k)) } N <- 2000 sigma <- c(0.05,0.5,3,10) x0 <- 20 rw1<-rw.MH(sigma[1],x0,N) rw2<-rw.MH(sigma[2],x0,N) rw3<-rw.MH(sigma[3],x0,N) rw4<-rw.MH(sigma[4],x0,N) #acceptance rates print(c(rw1$k, rw2$k, rw3$k, rw4$k)/N) #graphs of chains #par(mfrow=c(2,2)) rw<-cbind(rw1$x, rw2$x, rw3$x, rw4$x) for (j in 1:4) plot(rw[,j],type = "l", xlab = bquote(sigma== .(round(sigma[j],3))), ylab = "x",ylim = range(rw[,j]))
The natural logarithm and exponential functions are inverses of each other, so that mathematically log(exp(x)) = exp(log(x)) = x. Show by example that this property does not hold exactly in computer arithmetic. Does the identity hold with near equality? (See all.equal.)
x<-2:8 #in computer arithmetic log(exp(x))==exp(log(x)) exp(log(x))==x log(exp(x))==x #nearly equality all.equal(log(exp(x)),exp(log(x))) all.equal(log(exp(x)),x) all.equal(x,exp(log(x)))
Through the results, the identity hold with near equality for chosen x.
Write a function to solve the equation $$\begin{aligned} \frac{2 \Gamma\left(\frac{k}{2}\right)}{\sqrt{\pi(k-1)} \Gamma\left(\frac{k-1}{2}\right)} \int_{0}^{c_{k-1}}\left(1+\frac{u^{2}}{k-1}\right)^{-k / 2} d u & \= \frac{2 \Gamma\left(\frac{k+1}{2}\right)}{\sqrt{\pi k} \Gamma\left(\frac{k}{2}\right)} \int_{0}^{c_{k}}\left(1+\frac{u^{2}}{k}\right)^{-(k+1) / 2} d u \end{aligned}$$ for a, where $$c_{k}=\sqrt{\frac{a^{2} k}{k+1-a^{2}}}$$. Compare the solutions with the points A(k) in Exercise 11.4.
Solve the equation by bisection.
#Exercise 11.4 k<-c(4:25,100) Ak<-numeric(length(k)) #write a function to compute the roots bisec<-function(k){ f<-function(a) {sqrt((k-1)*a^2/(k-a^2))} g<-function(a) {sqrt(k*a^2/(k+1-a^2))} it <- 0 eps <- .Machine$double.eps^0.25 r <- seq(0,sqrt(k)-1e-5, length=3) y <- c(pt(f(r[1]),k-1)-pt(g(r[1]),k), pt(f(r[2]),k-1)-pt(g(r[2]),k), pt(f(r[3]),k-1)-pt(g(r[3]),k)) if (y[1] * y[3] > 0) stop("f does not have opposite sign at endpoints") while(it < 1000&&abs(y[2]>eps)) { it <- it + 1 if (y[1]*y[2] < 0) { r[3] <- r[2] y[3] <- y[2] } else { r[1] <- r[2] y[1] <- y[2] } r[2] <- (r[1] + r[3]) / 2 y[2] <- pt(f(r[2]),k-1)-pt(g(r[2]),k) } return(r[2]) } #compute the roots for(i in 1:length(k)) Ak[i]<-bisec(k[i])
#Exercise 11.5 k<-c(4:25,100) ck<-function(a,k){sqrt((a^2)*k/(k+1-a^2))} #write a function to compute the roots bisec<-function(k){ it <- 0 r <- seq(1e-5,sqrt(k+1)/2, length=3) y<-numeric(3) while(it < 100) { it <- it + 1 ck11<-ck(r[1],k-1) f<-function(u){ck11*(1+(ck11*u)^2/(k-1))^(-k/2)} int1<-integrate(f,lower=0,upper =1)$value h11<-exp(lgamma(k/2))*int1/(exp(lgamma((k-1))/2)*sqrt(k-1)) ck12<-ck(r[1],k) g<-function(u){ck12*(1+(ck12*u)^2/k)^(-(k+1)/2)} int2<-integrate(g,lower=0,upper =1)$value h12<-exp(lgamma((k+1)/2))*int2/(exp(lgamma(k/2))*sqrt(k)) y[1]<-h11-h12 ck21<-ck(r[2],k-1) f<-function(u){ck21*(1+(ck21*u)^2/(k-1))^(-k/2)} int1<-integrate(f,lower=0,upper =1)$value h21<-exp(lgamma(k/2))*int1/(exp(lgamma((k-1))/2)*sqrt(k-1)) ck22<-ck(r[2],k) g<-function(u){ck22*(1+(ck22*u)^2/k)^(-(k+1)/2)} int2<-integrate(g,lower=0,upper =1)$value h22<-exp(lgamma((k+1)/2))*int2/(exp(lgamma(k/2))*sqrt(k)) y[2]<-h21-h22 ck31<-ck(r[3],k-1) f<-function(u){ck31*(1+(ck31*u)^2/(k-1))^(-k/2)} int1<-integrate(f,lower=0,upper =1)$value h31<-exp(lgamma(k/2))*int1/(exp(lgamma((k-1))/2)*sqrt(k-1)) ck32<-ck(r[3],k) g<-function(u){ck32*(1+(ck32*u)^2/k)^(-(k+1)/2)} int2<-integrate(g,lower=0,upper =1)$value h32<-exp(lgamma((k+1)/2))*int2/(exp(lgamma(k/2))*sqrt(k)) y[3]<-h31-h32 if (y[1]*y[2] < 0) { r[3] <- r[2] y[3] <- y[2] } else { r[1] <- r[2] y[1] <- y[2] } r[2] <- (r[1] + r[3]) / 2 } return(r[2]) }
#output the roots a<-numeric(length(k)) for(i in 1:length(k)) {a[i]<-bisec(k[i])} #output ex11.4 solution1 = matrix(0,2,length(k)) rownames(solution1)==c("k","A(k)") solution1[1,]=k solution1[2,]=Ak solution1 #output ex11.5 solution2 = matrix(0,2,length(k)) rownames(solution2)==c("k","a") solution2[1,]<-k solution2[2,]<-a solution2
Compare the results of 11.4 and 11.5, the roots are different but very close.
Use the method in the slides.
#the log-maximum likelihood function LML <- function(nA,nAB,nB,nO,pA,pB,pO) { nAB*log(2*pA*pB)+nA*log(2*pA*pO+pA^2)+nB*log(2*pB*pO+pB^2)+nO*log(pO^2) } #write a funtion to solve pro<-function(p,q,n.obs) { n<-sum(n.obs) nA<-n.obs[1] nB<-n.obs[2] nO<-n.obs[3] nAB<-n.obs[4] nOO<-nO lml<-pA<-pB<-pO<-rep(0,10) pA[1]<-p pB[1]<-q pO[1]<-1-p-q for(i in 2:11){ pA.old<-pA[i-1] pB.old<-pB[i-1] pO.old<-pO[i-1] #compute nAA,nAO,nBB,nBO nAA<-nA*pA.old^2/(pA.old^2+2*pA.old*pO.old) nAO<-nA-nAA nBB<-nB*pB.old^2/(pB.old^2+2*pB.old*pO.old) nBO<-nB-nBB #M-step pA[i]<-(2*nAA+nAO+nAB)/(2*n) pB[i]<-(2*nBB+nBO+nAB)/(2*n) pO[i]<-(2*nOO+nAO+nBO)/(2*n) lml[i-1]<-LML(nA,nAB,nB,nO,pA[i],pB[i],pO[i]) } return(list(pA=pA,pB=pB,pO=pO,lml=lml)) } n.obs<-c(28,24,41,70) # observed data,n_A,n_B,n_O,n_AB P<-0.02;Q<-0.01 #initial q a<-pro(P,Q,n.obs) #compute p,q p<-a$pA q<-a$pB lml<-a$lml #the outputs of p and q print(cbind(p,q)) #the log-maximum likelihood values in M-steps plot(lml,type = "o")
We can see that p=0.3273442, q=0.3104267.
Use both for loops and lapply() to fit linear models to the mtcars using the formulas stored in this list:
formulas <- list( mpg ~ disp, mpg ~ I(1 / disp), mpg ~ disp + wt, mpg ~ I(1 / disp) + wt )
formulas <- list( mpg ~ disp, mpg ~ I(1 / disp), mpg ~ disp + wt, mpg ~ I(1 / disp) + wt ) fit1<-fit2<-vector("list",4) #for loops for (i in 1:4) { fit1[[i]]<-lm(formulas[[i]],data = mtcars) } fit1 #lapply() fit2<-lapply(formulas, function(x) lm(x,data = mtcars)) fit2
Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function?
bootstraps <- lapply(1:10, function(i) { rows <- sample(1:nrow(mtcars), rep = TRUE) mtcars[rows, ] })
bootstraps <- lapply(1:10, function(i) { rows <- sample(1:nrow(mtcars), rep = TRUE) mtcars[rows, ] }) boot.fit1<-boot.fit2<-boot.fit3<-vector("list",10) #for loop for (i in 1:10) { boot.fit1[[i]]<-lm(mpg~disp,data = bootstraps[[i]]) } boot.fit1 #lapply() boot.fit2<-lapply(1:10,function(i) lm(mpg~disp,data = bootstraps[[i]])) boot.fit2 #without an anonymous function boot.fit3<-lapply(lapply(bootstraps,"[",c(1,3)),lm) boot.fit3
For each model in the previous two exercises, extract R2 using the function below.
rsq <- function(mod) summary(mod)$r.squared
rsq <- function(mod) summary(mod)$r.squared sapply(fit1,rsq) sapply(fit2,rsq) sapply(boot.fit1,rsq) sapply(boot.fit2,rsq) sapply(boot.fit3,rsq)
The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.
trials <- replicate( 100, t.test(rpois(10, 10), rpois(7, 10)), simplify = FALSE )
Extra challenge: get rid of the anonymous function by using [[ directly.
trials <- replicate( 100, t.test(rpois(10, 10), rpois(7, 10)), simplify = FALSE ) #use an anonymous function sapply(1:100,function(i) trials[[i]]$p.value) #extra challenge sapply(trials,"[[",3)
Implement mcsapply(), a multicore version of sapply(). Can you implement mcvapply(), a parallel version of vapply()? Why or why not?
# mcsapply() is based on parLapply(),the multicore version of lapply() # arguments cl,X,FUN etc. have the same use as parLapply() # arguments simplify and USE.NAMES have the same use as sapply() mcsapply<-function (cl = NULL, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE, chunk.size = NULL) { FUN <- match.fun(FUN) answer <- parLapply(cl = cl, X = as.list(X), fun = FUN, ..., chunk.size = chunk.size) if (USE.NAMES && is.character(X) && is.null(names(answer))) names(answer) <- X if (!isFALSE(simplify) && length(answer)) simplify2array(answer, higher = (simplify == "array")) else answer } # an example to use mcsapply() library(parallel) no_cores<-detectCores()-1 cl<-makeCluster(no_cores) mcsapply(cl,1:8,function(x) 2*x) system.time(sapply(1:500000,function(x) 2*x)) system.time(mcsapply(cl,1:500000,function(x) 2*x)) # shut down the workers stopCluster(cl)
We can implement mcvapply(). It is already done by the function "future_vapply()" in the R package "future.apply".
You have already written an R function for Exercise 9.4 (page 277, Statistical Computing with R). Rewrite an Rcpp function for the same task.
Compare the generated random numbers by the two functions using qqplot.
Campare the computation time of the two functions with microbenchmark.
Comments your results.
# the function I wrote for Ex.9.4 rwmhR <- function(sigma, x0, N){ # sigma: variance of normal distribution # x0: initial value # N: size of random numbers required x <- numeric(N) x[1] <- x0 u <- runif(N) k<-0 for (i in 2:N) { y <- rnorm(1,x[i-1],sigma) if(u[i] <= (exp(-abs(y))/2 /exp(-abs(x[i-1]))/2)) #accept y { x[i] <-y k <- k+1 #compute acceptance rate } else {x[i]<-x[i-1]} #reject y } return(list(x=x, k=k)) } # the Rcpp function library(Rcpp) sourceCpp(code=' #include<Rcpp.h> using namespace Rcpp; //[[Rcpp::export]] List rwmhC(double sigma, double x0, int N) { NumericVector x(N); x(0) = x0; NumericVector u = as<NumericVector>(runif(N)); int k=0; double y=0; for (int i=1;i<N; i++) { y = as<double>(rnorm(1,x(i-1),sigma)); if(u(i) < exp(-abs(y))/2 /exp(-abs(x(i-1))/2)) /** accept y */ { x(i)= y; k = k+1; /** compute acceptance rate*/ } else {x(i)=x(i-1);} /** reject y*/ } return(List::create(Named("x")=x, Named("k")=k)); } ') N<-5000 sigma<-c(0.3,3) x0<-3 rwC1<-rwmhC(sigma[1],x0,N) rwC2<-rwmhC(sigma[2],x0,N) #acceptance rates print(c(rwC1$k,rwC2$k)/N) #graphs of chains par(mfrow=c(1,2)) rw<-cbind(rwC1$x, rwC2$x) for (j in 1:2) plot(rw[,j],type = "l", xlab = bquote(sigma== .(round(sigma[j],3))), ylab = "x",ylim = range(rw[,j])) # compare the generated random numbers by qqplot rwR1<-rwmhR(sigma[1],x0,N)$x rwR2<-rwmhR(sigma[2],x0,N)$x qqplot(rwC1$x,rwR1) qqplot(rwC2$x,rwR2) # Campare the computation time library(microbenchmark) ts1<-microbenchmark(mhR=rwmhR(sigma[1],x0,N),mhC=rwmhC(sigma[1],x0,N)) ts2<-microbenchmark(mhR=rwmhR(sigma[2],x0,N),mhC=rwmhC(sigma[2],x0,N)) summary(ts1) summary(ts2)
Through the figures, we can see the generated random numbers are nearly positive correlated. Compare the computation times, Rcpp function is much faster than R function.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.