SC19020 is a simple R package developed to calculate two important concepts in the "Non-parametric statistic" course and "Statistic computing" course. Bootstrap is a very important resampling method in Statistic and will be of great help for us to study the properties of the given data especially when sample size is not large. The kernel density estimator(KDE) is a very important non-parametric method to estimate the true density function of a given set of data when we know little about the true distribution of the data, and with KDE, we can study many other relevant concepts of the given data. In this package, two functions are considered, namely, bootstrapR (generate bootstraping samples of the given data) and KDER (compute the kernel density estimator of the given data).
The source R code for bootstrapR is as follows:
bootstrapR <- function(X, n) { m <- length(X) out <- matrix(nrow = n, ncol = m) for(i in 1:n){ out[i,] <- sample(X, m, replace = TRUE) } return(out) }
In order to show how to use bootstapR, we give an example.
library(SC19020) X <- c(96,87,88,90,85) outer <- bootstrapR(X, 5) outer
The source R code for KDER is as follows:
KDER <- function(X, h, x) { n <- length(X) out <- (1/h)*mean(dnorm((x-X)/h)) return(out) }
In order to show how to use KDER, we give an example.
X <- rnorm(100,1,2) x <- seq(-1, 1, by = 0.01) h <- 0.1 y <- numeric(length(x)) for(i in 1:length(x)){ y[i] <- KDER(X, h, x[i]) } plot(x, y, type = "l")
Use knitr to porduce at least 3 examples(texts, figures, tables)
1.figure
x<-c(20,30,40,45,60) y<-c(16,20,27,40,60) plot(x,y,type = "l",col="blue") title(main="An example")
2.table
math<-c(95,80,74,66,84) engl<-c(80,77,68,92,85) phys<-c(90,82,65,74,90) chem<-c(88,96,70,84,80) mydata<-data.frame(math,engl,phys,chem) mydata
3.text
We derive the nonparametric maximum likelihood estimate, $\hat{F}$ say, of a lifetime distribution $\boldsymbol{F}$ on the basis of two independent samples, one a sample of size $m$ from $\boldsymbol{F}$ and the other a sample of size $n$ from the length-biased distribution of $\boldsymbol{F}$, i.e.from $G_{F}(y) \equiv G(y)=\frac{1}{\mu} \int_{0}^{y} x d F(x),\quad y\geq 0$. We further show that $(m+n)^{1 / 2}(\hat{F}-F)$ converges weakly to a pinned Gaussian process with a simple covariance function, when $m+n \rightarrow \infty$ and $m / n \rightarrow$ constant.
3.4 The Rayleigh density [156, Ch. 18] is $f(x)=\frac{x}{\sigma^{2}} e^{-x^{2} /\left(2 \sigma^{2}\right)}, \quad x \geq 0, \sigma>0$. Develop an algorithm to generate random samples from a Rayleigh($\sigma$) distribution. Generate Rayleigh($\sigma$) samples for several choices of $\sigma>0$ and check that the mode of the generated samples is close to the theoretical mode $\sigma$ (check the histogram).
1.basic ideas
采用The inverse transform method,可以求出Rayleigh distribution 的cdf,即$F(x)=1-e^{-\frac{x^{2}}{2 \sigma^{2}}}$,先从均匀U(0,1)分布中生成随机样本,再反解出x=$F_{x}^{-1}(U)$,从而样本x服从于Rayleigh distribution
2.code
(1)take sigma as 1
sigma <- 1;n <- 100000 u <- runif(n) x <- sqrt(-2*(sigma^2)*log(1-u)) # F(x) = 1-exp(-x^2/(2*sigma^2)), x>=0,sigma>0 hist(x,prob =TRUE, breaks=100,xlim=c(0,6),main=expression(f(x)==x*exp(-x^2/2))) y <- seq(0, 6, .01) lines(y, y/(sigma^2)*exp(-y^2/(2*(sigma^2))))
(2)take sigma as 2
sigma <- 2; n <- 100000 u <- runif(n) x <- sqrt(-2*(sigma^2)*log(1-u)) # F(x) = 1-exp(-x^2/(2*sigma^2)), x>=0,sigma>0 hist(x,prob =TRUE,breaks=100,xlim=c(0,8), main=expression(f(x)==x/4*exp(-x^2/8))) y <- seq(0, 8, 0.01) lines(y, y/(sigma^2)*exp(-y^2/(2*(sigma^2))))
3.my thoughts
The inverse transform method performs well in generating samples from complicated distributions but with inverse function of its cdf.
3.11 Generate a random sample of size 1000 from a normal location mixture. The components of the mixture have $N(0,1)$and $N(3,1)$distributions with mixing probabilities $p_{1}$ and $p_{2}=1-p_{1}$. Graph the histogram of the sample with density superimposed, for $p_{1}=0.75$. Repeat with different values for $p_{1}$ and observe whether the empirical distribution of the mixture appears to be bimodal. Make a conjecture about the values of $p_{1}$ that produce bimodal mixtures.
1.basic ideas
利用课件上的方法,生成随机样本,再比较样本的频率分布直方图和总体密度曲线的拟合程度
2.code
n <- 1e4 X1 <- rnorm(n,0,1) X2 <- rnorm(n,3,1) r <- sample(c(0,1),n,replace=TRUE,prob=c(0.25,0.75)) Z <- r*X1+(1-r)*X2 hist(Z,prob = TRUE,breaks = 100,xlim = c(-4,6)) x<-seq(-4,6,0.01) y<-0.75*dnorm(x,0,1)+0.25*dnorm(x,3,1) lines(x,y)
n <- 1e4 p <- c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9) for (i in 1:9){ X1 <- rnorm(n,0,1) X2 <- rnorm(n,3,1) r <- sample(c(0,1),size=n,replace=TRUE,prob=c(p[i],1-p[i])) Z<- r*X1+(1-r)*X2 hist(Z,breaks = 100,xlim = c(-4,6)) } par(mfrow=c(1,1))
3.conjecture
当$p_{1}$在0.5附近的时候,总体呈现双峰分布
4.my thoughts
I'm not familiar with showing different pictures in a single screen and making pictures more beautiful.
3.18 Write a function to generate a random sample from a $W_{d}(\Sigma, n)$ (Wishart) distribution for $n>d+1 \geq 1$, based on Bartlett’s decomposition.
1.basic ideas
首先生成一个下三角阵,满足$T_{i j} \stackrel{i i d}{\sim} N(0,1), i>j$,且$T_{i i} \sim \sqrt{\chi^{2}(n-i+1)}, i=1, \ldots, d$,再从Sigma矩阵中得到Choleski factorization,从而利用Bartlett’s decomposition的方法,我们可以生成一组随机样本服从于$W_{d}(\Sigma, n)$
2.code
n<-10; d<-5 X<-numeric(d) for(i in 1:d){ X[i]<-sqrt(rchisq(1,df=n-i+1)) } A <- diag(X)#生成符合题意的对角阵 B <- matrix(0,nrow=d,ncol=d) for(i in 1:d)#生成下三角元服从正态分布的矩阵 for(j in 1:i-1){ B[i,j]<-rnorm(1) } T <- A+B#符合题意的下三角阵 Sigma <- matrix(0,nrow=d,ncol=d) for(i in 1:d)#输入多元正态分布的协方差矩阵 for(j in 1:d){ if (i==j) Sigma[i,j]<-4 else Sigma[i,j]<-0.8} L<-chol(Sigma)#求出Sigma矩阵的Choleski factorization W<-t(L)%*%T%*%t(T)%*%L#利用Bartlett’s decomposition生成符合题意的样本 W
3.my thoughts
不太会写函数,以及如何简便地输入矩阵,此外,复杂的循环语句不太会书写
5.1 Compute a Monte Carlo estimate of $\theta=\int_{0}^{\frac{\pi}{3}} \sin t d t$ and compare your estimate with the exact value of the integral.
1.basic ideas
We can apply the Mento Carlo method and use a frequency to approximate the integration, so that we can get an estimation of the integration. We can examine the result with the true value. $\theta=\int_{0}^{\frac{\pi}{3}} \sin t d t=E_{Y}\left(\frac{\pi}{3} \sin Y\right), Y \sim U\left(0, \frac{\pi}{3}\right)$
2.code
m <- 1e4 x <- runif(m, min=0, max=pi/3) #generate random numbers theta.hat <- mean(sin(x)) * pi/3 #use sample mean to approximate the integration print(c(theta.hat,cos(0) - cos(pi/3))) #compare the approximation and the true value
3.my thoughts
The Monte Carlo method provides a simple and accurate method to appoximate some complicated integrations in a general way.
5.10 Use Monte Carlo integration with antithetic variables to estimate $\int_{0}^{1} \frac{e^{-x}}{1+x^{2}} d x$, and find the approximate reduction in varianceas a percentage of the variance without variance reduction.
1.basic ideas
We use the standard Mento Carlo method and Monte Carlo method with antithetic variables to approximate the integration, and we compare the two results by their variances.
2.code
m <- 1e4 x1 <- runif(m) x2 <- runif(m/2) #generate m/2 samples to ensure the total are the same with the 1st method MC_standard<-numeric(m); MC_antithetic<-numeric(m/2) MC_standard<-exp(-x1)/(1+x1^2) MC_antithetic<-1/2*(exp(-x2)/(1+x2^2)+exp(-1+x2)/(1+(1-x2)^2)) theta.stan <- mean(MC_standard) #the standard Mento Carlo method se.stan<- sd(MC_standard) theta.anti <- mean(MC_antithetic) #the Monte Carlo method with antithetic variables se.anti <- sd(MC_antithetic) reduc_sd <- 1-sd(MC_antithetic)/sd(MC_standard) #the reduction of sd cbind(theta.stan,theta.anti,se.stan,se.anti,reduc_sd)
3.my thoughts
This printing result shows that the Mento Carlo method with antithetic variables greatly reduced the sd by more than 86% than the standard Mento Carlo method.
5.15 Obtain the stratified importance sampling estimate in Example 5.13 and compare it with the result of Example 5.10.
1.basic ideas
We first choose several importance functions to estimate $\int_{0}^{1} \frac{e^{-x}}{1+x^{2}} d x$, and we can easily find that the 4th choice is the best. Then we use stratified importance sampling to estimate the integration again and compare the result with the former best one.
Because$\int_{0}^{1} \frac{e^{-x}}{1+x^{2}} d x=\int_{\alpha_0}^{\alpha_{1}} \frac{1-e^{-1}}{5\left(1+x^{2}\right)} \frac{5 e^{-x}}{1-e^{-1}} d x+\cdots+\int_{\alpha_{4}}^{\alpha_{5}} \frac{1-e^{-1}}{5\left(1+x^{2}\right)} \frac{5 e^{-x}}{1-e^{-1}} d x$, we first generate random numbers from the distribution with the density function: $f(x)=\frac{5e^{-x}}{1-e^{-1}}$,$\alpha_{ j}$<x<$\alpha_{ j+1}$, where $\alpha_{ j}$ is the $\frac{j}{5} \times 100\%$ of the distribution with a density function: $f(x)=\frac{e^{-x}}{1-e^{-1}}, 0<x<1$. And then we add the means of each interval to get an estimation of the integration and other results.
2.code
(1)several possible choices of importance functions to estimate$\int_{0}^{1} \frac{e^{-x}}{1+x^{2}} d x$ by importance sampling method are compared as below.
m <- 10000 theta.hat <- se <- numeric(5) g <- function(x) { exp(-x - log(1+x^2)) * (x > 0) * (x < 1) } x <- runif(m) #using f0 fg <- g(x) theta.hat[1] <- mean(fg) se[1] <- sd(fg) x <- rexp(m, 1) #using f1 fg <- g(x) / exp(-x) theta.hat[2] <- mean(fg) se[2] <- sd(fg) x <- rcauchy(m) #using f2 i <- c(which(x > 1), which(x < 0)) x[i] <- 2 #to catch overflow errors in g(x) fg <- g(x) / dcauchy(x) theta.hat[3] <- mean(fg) se[3] <- sd(fg) u <- runif(m) #f3, inverse transform method x <- - log(1 - u * (1 - exp(-1))) fg <- g(x) / (exp(-x) / (1 - exp(-1))) theta.hat[4] <- mean(fg) se[4] <- sd(fg) u <- runif(m) #f4, inverse transform method x <- tan(pi * u / 4) fg <- g(x) / (4 / ((1 + x^2) * pi)) theta.hat[5] <- mean(fg) se[5] <- sd(fg) rbind(theta.hat, se)
(2)Stratified Importance Sampling
m <- 10000 #number of replicates k <- 5 #number of strata r <- m/k #replicates per stratum #compute the 0%, 20%, 40%, 60%, 80% and 100%percentiles alpha <- numeric(6) for(j in 1:6){ alpha[j] <- -log(1-(1/5)*(j-1)*(1-exp(-1))) } x <- matrix(nrow=2000,ncol=5) g <- function(x)(1-exp(-1))/(5*(1+x^2))*(x>0)*(x<1) est <- numeric(5) #the inverse transform method to generate samples for (j in 1:k){ u <- runif(r) x[,j] <- - log(1 - (1/5)*(u+j-1) * (1 - exp(-1))) est[j] <- mean(g(x[,j])) } theta.hat<-sum(est); se<-sd(g(x)) cbind(theta.hat,se)
3.my thoughts
We can find that the result of Stratified Importance Sampling is quite better than the former best one of the method of importance functions.
6.5 Suppose a 95% symmetric t-interval is applied to estimate a mean, but the sample data are non-normal. Then the probability that the confidence interval covers the mean is not necessarily equal to 0.95. Use a Monte Carlo experiment to estimate the coverage probability of the t-interval for random samples of $\chi^{2}(2)$ data with sample size n = 20. Compare your t-interval results with the simulation results in Example 6.4. (The t-interval should be more robust to departures from normality than the interval for variance.)
6.6 Estimate the 0.025, 0.05, 0.95, and 0.975 quantiles of the skewness $\sqrt{b_{1}}$ under normality by a Monte Carlo experiment. Compute the standard error of the estimates from (2.14) using the normal approximation for the density (with exact variance formula). Compare the estimated quantiles with the quantiles of the large sample approximation $\sqrt{b_{1}} \approx N(0,6 / n)$.
1.basic ideas
Use a Monte Carlo experiment to estimate the coverage probability of three different confidence intervals and compare them.
2.code
set.seed(123) covp <- function(m){ n <- 20; alpha <- 0.05 LCL_mean <- numeric(m) UCL_mean <- numeric(m) UCL_var_chisq <- UCL_var_normal <- numeric(m) for(i in 1:m){ x <- rchisq(n,2) #samples from chisq with mean=2, var=4 #the lower and upper confidence limit of mean LCL_mean[i] <- mean(x)-sd(x)*qt(1-alpha/2,n-1)/sqrt(n) UCL_mean[i] <- mean(x)+sd(x)*qt(1-alpha/2,n-1)/sqrt(n) #the one-side upper confidence limit of variance UCL_var_chisq[i] <- (n-1)*var(x)/qchisq(alpha, df=n-1) y <- rnorm(n) #samples from norm with mean=0, var=1 #the one-side upper confidence limit of variance UCL_var_normal[i] <- (n-1)*var(y)/qchisq(alpha, df=n-1) } #coverage probability of mean from chisq distribution covp_mean_chisq <- mean(LCL_mean<2&2<UCL_mean) #coverage probability of variance from chisq distribution covp_var_chisq <- mean(4<UCL_var_chisq) #coverage probability of variance from normal distribution covp_var_normal <- mean(1<UCL_var_normal) cbind(covp_mean_chisq,covp_var_chisq,covp_var_normal) } m<-1e3; covp(m) m<-1e4; covp(m) m<-1e5; covp(m)
3.my thoughts
We can find that the 1st and 2nd coverage probabilties are always smaller than the true confidence levels because the samples are not from a normal distribution. However, we choose the t-statistic to construct the confidence interval, hence the method is wrong theorically.
At the same time, we can find the 3rd coverage probability is quite close to the true probability and it will get closer as the number of replicates becomes bigger because the confidence interval is correct theorically.
In addition, the confidence interval of mean performs better than that of variance, and I'm not quite sure of the reasons, maybe the former interval is more robust.
1.basic ideas
We first generate samples from normal distribution and replicate the process and use the the mean of sample quantiles to estimate the true quantiles of the skewness. And then we compare the estimated quantiles with the quantiles of the large sample approximation.
Also, we use $\operatorname{Var}\left(\hat{x}{q}\right)=\frac{q(1-q)}{n f\left(x{q}\right)^{2}}$ to compute the standard error of the estimates.
2.code
set.seed(123) n <- 50 #sample sizes m <- 1000 #replicates to compute a sample quantile once k <- 100 #replicates of Mento Carlo experiment alpha <- c(0.025,0.05,0.95,0.975) #different probabilities of quantiles skew <- numeric(m) se_quantile.hat <- quantile.appro <- quantile.hat <- numeric(4) quantile <- matrix(nrow=k,ncol=4) #we compute k sample quantiles of every probability for(i in 1:k){ for(j in 1:m){ x <- rnorm(n) skew[j] <- mean((x-mean(x))^3)/(mean((x-mean(x))^2))^(3/2) #sample skewness } quantile[i,] <- quantile(skew, probs=alpha) #a sample quantiles of skewness } #use the mean of sample quantiles to estimate the sample quantiles quantile.hat <- colMeans(quantile) for(i in 1:4){ #quantiles of the large sample approximation quantile.appro[i] <- qnorm(alpha[i],0,sqrt(6/n)) #the standard error of the estimates se_quantile.hat[i] <- sqrt((alpha[i]*(1-alpha[i])/(n*(dnorm(quantile.appro[i],0,sqrt(6/n))^2)))) } cbind(alpha,quantile.hat,quantile.appro,se_quantile.hat)
3.my thoughts
We can find the results of Mento Carlo method is close to the results of large sample approximation and the stand error is not big so that the Mento Carlo Method provides a good method to do statistic inference.
6.7 Estimate the power of the skewness test of normality against symmetric $\operatorname{Beta}(\alpha, \alpha)$ distributions and comment on the results. Are the results different for heavy-tailed symmetric alternatives such as $t(\nu)$?
6.A Use Monte Carlo simulation to investigate whether the empirical Type I error rate of the t-test is approximately equal to the nominal significance level $\alpha$, when the sampled population is non-normal. The t-test is robust to mild departures from normality. Discuss the simulation results for the cases where the sampled population is (i) $\chi^{2}(1)$, (ii) Uniform(0,2), and (iii) Exponential(rate=1). In each case, test: $H_{0}: \mu=\mu_{0}$ vs $H_{1}: \mu \neq \mu_{0}$, where$\mu_{0}$ is the mean of $\chi^{2}(1)$, Uniform(0,2), and Exponential(1), respectively.
Discussion: If we obtain the powers for two methods under a particular simulation setting with 10,000 experiments: say, 0.651 for one method and 0.676 for another method. Can we say the powers are different at 0.05 level?
(1)What is the corresponding hypothesis test problem?
(2)What test should we use? Z-test, two-sample t-test, paired t-test or McNemar test?
(3)What information is needed to test your hypothesis?
1.basic ideas
Use Monte Carlo methods to estimate the power of skewness test of normality against symmetric $\operatorname{Beta}(\alpha, \alpha)$ distributions and heavy-tailed symmetric alternatives such as $t(\nu)$.
2.code
#skewness test of normality against symmetric Beta distribution alpha<-0.05; n<-100 a<-seq(1,50,1) # critical value for the skewness test cv<-qnorm(1-alpha/2,0,sqrt(6*(n-2)/((n+1)*(n+3)))) # the function to compute skewness of samples sk <- function(x) mean((x-mean(x))^3)/(mean((x-mean(x))^2))^(3/2) pwr <- numeric(length(a)) m <- 10000 # num. repl. each sim. for (i in 1:length(a)) { sktests<-numeric(m) for (j in 1:m) { x <- rbeta(n,a[i],a[i]) # test decision is 1 (reject) or 0 sktests[j] <- as.integer(abs(sk(x)) >= cv) } pwr[i] <- mean(sktests) # proportion rejected, that is the power of test } plot(a,pwr,xlab = "parameter of Beta distribution",ylab = "power of test")
3.comments
the power of test is poor generally although it will get higher from zero to the nominal level as the parameter of Beta distribution becomes bigger.
# skewness test of normality against heavy-tailed symmetric alternatives such as t distribution alpha<-0.05; n<-100 a<-seq(1,30,1) cv<-qnorm(1-alpha/2,0,sqrt(6*(n-2)/((n+1)*(n+3)))) sk <- function(x) mean((x-mean(x))^3)/(mean((x-mean(x))^2))^(3/2) pwr <- numeric(length(a)) m <- 10000 #num. repl. each sim. for (i in 1:length(a)) { sktests<-numeric(m) for (j in 1:m) { x <- rt(n,a[i]) #test decision is 1 (reject) or 0 sktests[j] <- as.integer(abs(sk(x)) >= cv) } pwr[i] <- mean(sktests) #proportion rejected } plot(a,pwr,xlab = "df of t distribution",ylab = "power of test")
4.comparison and my thoughts
The powers of test against t distribution get lower from 100% to the nominal level 5% as the df of t distribution becomes bigger. We can find that the tendency of two tests is contrary but they all get closer to the nominal level as the parameters get bigger.
1.basic ideas
Use Monte Carlo experiment to assess Type I error rate.
2.code
#H0:mu=1 m <- 1e4; n <- 1000 set.seed(123) p.valx <- p.valy <- p.valz <- numeric(m) for(i in 1:m){ # generate samples from $\chi^{2}(1)$ and compute p-value x <- rchisq(n,1) p.valx[i] <- 2*(1-pt(abs(sqrt(n)*(mean(x)-1)/sd(x)),n-1)) # generate samples from uniform(0,2) and compute p-value y<-runif(n,0,2) p.valy[i] <- 2*(1-pt(abs(sqrt(n)*(mean(y)-1)/sd(y)),n-1)) # generate samples from Exp(1) and compute p-value z<-rexp(n,1) p.valz[i] <- 2*(1-pt(abs(sqrt(n)*(mean(z)-1)/sd(z)),n-1)) } #print the t1es print(c(mean(p.valx<=0.05),mean(p.valy<=0.05),mean(p.valz<=0.05)))
3.my thoughts
We can find that the t1es are quite close to the nominal level of the test, which means that even though the sampled population is non-normal, the t-test is quite robust to mild departures from normality.
(1)Let's take the statistic set "sleep" for an example, we may care about whether the new medicine is effective in improving the quality of sleeping, that is whether the means of two sampled population are the same or not in another word. So the null hypothesis is "$\mu_{1}=\mu_{2}$", and the alternative hypothsis is "$\mu_{1}\neq\mu_{2}$".
(2)We can use two-sample t-test or paired t-test to test the hypothesises.
(3)We need to know whether the distributions and of the sampled populations are normal or not, the variances of two sampled population are the same or not and the two sampled populations are independent or paired. All these above informations will determine which test we shall use.
7.6 Efron and Tibshirani discuss the scor (bootstrap) test score data on 88 students who took examinations in five subjects [84, Table7.1], [188, Table1.2.1]. The first two tests (mechanics, vectors) were closed book and the last three tests (algebra, analysis, statistics) were open book. Each row of the data frame is a set of scores $\left(x_{i 1}, \ldots, x_{i 5}\right)$ for the ith student. Use a panel display to display the scatter plots for each pair of test scores. Compare the plot with the sample correlation matrix. Obtain bootstrap estimates of the standard errors for each of the following estimates: $\hat{\rho}{12}=\hat{\rho}(\mathrm{mec}, \mathrm{vec})$, $\hat{\rho}{34}=\hat{\rho}(\mathrm{alg}, \mathrm{ana})$, $\hat{\rho}{35}=\hat{\rho}(\mathrm{alg}, \mathrm{sta})$, $\hat{\rho}{45}=\hat{\rho}(\mathrm{ana}, \mathrm{sta})$.
7.B Repeat Project 7.A for the sample skewness statistic. Compare the coverage rates for normal populations (skewness 0) and $\chi^{2}(5)$ distributions (positive skewness).
1.basic ideas
bootstrap samples to estimate the standard errors of different correlations
2.code
set.seed(12345) library(boot) data(scor, package = "bootstrap") scor<-as.matrix(scor) # display the scatter plots for each pair of test scores pairs(scor) # compute the correlations cor(scor) # bootstrap estimates of the correlations and their standard errors b.cor <- function(x,i) cor(x[i,1],x[i,2]) obj <- boot(data=scor,statistic=b.cor,R=10000) round(c(rho12=obj$t0,se=sd(obj$t)),3) b.cor <- function(x,i) cor(x[i,3],x[i,4]) obj <- boot(data=scor,statistic=b.cor,R=10000) round(c(rho34=obj$t0,se=sd(obj$t)),3) b.cor <- function(x,i) cor(x[i,3],x[i,5]) obj <- boot(data=scor,statistic=b.cor,R=10000) round(c(rho35=obj$t0,se=sd(obj$t)),3) b.cor <- function(x,i) cor(x[i,4],x[i,5]) obj <- boot(data=scor,statistic=b.cor,R=10000) round(c(rho45=obj$t0,se=sd(obj$t)),3)
3.my thoughts
We can find that both these relationships are positive. Further, the relationship between the two tests of closed book is lower than those tests of open book.
1.basic ideas
We can use bootstrap and Mento Carlo methods to calculate the coverage rates and missing proportions for the sample skewness statistic from different populations.
2.code
set.seed(12345) # the package to compute sample skewness and bootstrap library(moments); library(boot) n<-1e2; m<-1e3 boot.skew <- function(x,i) skewness(x[i]) ci_norm<-ci_chisq<-matrix(NA,m,2) # mento carlo methods to compute coverage probability for(i in 1:m){ # generate original sample U<-rnorm(n) V<-rchisq(n,5) # bootstrap to compute skewness out1 <- boot(U,statistic=boot.skew, R = 999) out2 <- boot(V,statistic=boot.skew, R = 999) # compute the confidence interval ci_norm[i,] <- boot.ci(out1,type="norm")$norm[2:3] ci_chisq[i,] <- boot.ci(out2,type="norm")$norm[2:3] } # print the results cbind(cp_norm =mean(ci_norm[,1]<=0&ci_norm[,2]>=0),prop_left=mean(0<ci_norm[,1]),prop_right=mean(0>ci_norm[,2])) cbind(cp_chisq =mean(ci_chisq[,1]<=sqrt(8/5)&ci_chisq[,2]>=sqrt(8/5)),prop_left=mean(sqrt(8/5)<ci_chisq[,1]),prop_right=mean(sqrt(8/5)>ci_chisq[,2]))
3.thoughts
We can find that the coverage rate of normal distribution is bigger than that of chisq distribution, which means the CI of skewness for normal distribution is better than that of the chisq distribution.
What's more, the proportion of times missing on the left and right is almost the same for the normal distribution, which means that normal distribution is symmetric.
However, the chisq the proportion of times missing on the right is apparently more than the proportion on the left, which means that chisq distribution is asymmetric, in fact, it inclines towards right.
7.8 Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.
7.10 In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Repeat the analysis replacing the Log-Log model with a cubic polynomial model. Which of the four models is selected by the cross validation procedure? Which model is selected according to maximum adjusted $R^{2}$?
1.basic ideas
Use the formats of jackknife to estimate bias and stanard error
2.code
data(scor, package = "bootstrap") lambda.hat <- eigen(cov(scor))$values theta.hat <- max(lambda.hat)/sum(lambda.hat) n <- nrow(scor) theta.j <- numeric(n) #leave one out for (i in 1:n) { x <- scor [-i,] lambda <- eigen(cov(x))$values theta.j[i] <- max(lambda)/sum(lambda) } bias.jack <- (n-1)*(mean(theta.j)-theta.hat) se.jack <- sqrt((n-1)^2/n*var(theta.j)) #print bias and se round(c(bias.jack=bias.jack, se.jack=se.jack),5)
3.my thoughts
We can find that the bias is quite close to zero and the standard error is very small.
1.basic ideas
Cross validation and adjusted R square are applied to select the best regression model
2.code
library(DAAG) attach(ironslag) n <- length(magnetic) e1 <- e2 <- e3 <- e4 <- numeric(n) #for n-fold cross validation for (i in 1:n) { y <- magnetic[-i] x <- chemical[-i] #Fit the model(s) using only the n−1 observations in the training set #Compute the predicted response for the test point #Compute the prediction error M1 <- lm(y ~ x) #linear model yhat1 <- M1$coef[1]+M1$coef[2]*chemical[i] e1[i] <- magnetic[i] - yhat1 M2 <- lm(y ~ x + I(x^2)) #quadratic polynomial model yhat2 <- M2$coef[1]+M2$coef[2]*chemical[i]+M2$coef[3]*chemical[i]^2 e2[i] <- magnetic[i] - yhat2 M3 <- lm(log(y) ~ x) #exponential model logyhat3 <- M3$coef[1]+M3$coef[2]*chemical[i] yhat3 <- exp(logyhat3) e3[i] <- magnetic[i] - yhat3 M4 <- lm(y ~ x + I(x^2)+I(x^3)) #cubic polynomial model yhat4 <- M4$coef[1]+M4$coef[2]*chemical[i]+M4$coef[3]*chemical[i]^2+M4$coef[4]*chemical[i]^3 e4[i] <- magnetic[i] - yhat4 } #calculate the average squared prediction error c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2)) #print the adjusted R squares c(summary(M1)$adj.r.squared, summary(M2)$adj.r.squared, summary(M3)$adj.r.squared, summary(M4)$adj.r.squared)
3.my thoughts
According to the cross validation procedure, we should choose the quadratic polynomial model. However, according to maximum adjusted R square, we should choose the cubic polynomial model.
8.3 The Count 5 test for equal variances in Section 6.4 is based on the maximum number of extreme points. Example 6.15 shows that the Count 5 criterion is not applicable for unequal sample sizes. Implement a permutation test for equal variance based on the maximum number of extreme points that applies when sample sizes are not necessarily equal.
Ex: Power comparison (distance correlation test versus ball covariance test)
Model 1: $Y=X/4+e$; Model 2: $Y=X/4\times e$
where $X\sim N\left(0_{2},I_{2}\right)$, $e\sim N\left(0_{2},I_{2}\right)$, $X$ and $e$ are independent.
1.basic ideas
In order to overcome the problem we met in eg 6.15, we can use permutation to generate samples with equal sizes and then compute the Count Five statistic.
2.code
set.seed(12345) ## count5 function count5test <- function(x, y) { X <- x - mean(x) Y <- y - mean(y) outx <- sum(X > max(Y)) + sum(X < min(Y)) outy <- sum(Y > max(X)) + sum(Y < min(X)) # return 1 (reject) or 0 (do not reject H0) return(as.integer(max(c(outx, outy)) > 5)) } ## generate samples with unequal sizes and same variance n1 <- 20; n2 <- 30; N <- n1 + n2 mu1 <- mu2 <- 0; sigma1 <- sigma2 <- 1 x <- rnorm(n1,mu1,sigma1); y <- rnorm(n2,mu2,sigma2) #use permutation to estimate t1e with sample size of 20 and 30 z <- c(x,y); R <- 9999; t <- numeric(R) for (i in 1:R) { k <- sample(1:N, size = N/2, replace = FALSE) x1 <- z[k]; y1 <- z[-k] # a new group of X and Y with the same size t[i] <- count5test(x1,y1) } p1 <- mean(t)# t1e #generate another group of samples with sample size of 20 and 50 n1 <- 20; n2 <- 50; N <- n1+n2 mu1 <- mu2 <- 0; sigma1 <- sigma2 <- 1 x <- rnorm(n1,mu1,sigma1); y <- rnorm(n2,mu2,sigma2) #use permutation to estimate t1e z <- c(x,y); R <- 9999; t <- numeric(R) for (i in 1:R) { k <- sample(1:N, size = N/2, replace = FALSE) x1 <- z[k]; y1 <- z[-k] # a new group of X and Y with the same size t[i] <- count5test(x1,y1) } p2 <- mean(t)# t1e # use Mento carlo method to estimate the empirical t1e with sample size of 20 and 50 p <- numeric(100) for(j in 1:100){ x <- rnorm(n1,mu1,sigma1); y <- rnorm(n2,mu2,sigma2) z <- c(x,y); R <- 9999; t <- numeric(R) for (i in 1:R) { k <- sample(1:N, size = N/2, replace = FALSE) x1 <- z[k]; y1 <- z[-k] t[i] <- count5test(x1,y1) } p[j] <- mean(t) } p3 <- mean(p)# empirical t1e # print the result round(c(p1=p1,p2=p2,p3=p3),3)
3.my thoughts
We can find that the result is quite good. When sample sizes are 20 and 30, we improve the original value in eg 6.15 from 0.1064 to 0.051, and when sample sizes are 20 and 50, we improve it from 0.2934 to 0.065.
I use Mento Carlo method to repeat the simulation above with n1 = 20 and n2 = 50, and I get the empirical t1e (0.072), which is quite good compared with the original t1e in eg 6.15.
As a result, the method with permutation is applicable for unequal sample sizes.
1.basic ideas
Use distance correlation test and ball covariance test to do independence hypothesis test of two different models, and use Mento Carlo method to estimate the powers and compare them with each other.
2.code
set.seed(12345) library(mvtnorm); library(Ball); library(boot) seed <- 12345; m <- 1e2 mean <- c(0,0); sigma <- matrix(c(1,0,0,1),nrow =2) ## the statistic of distance correlation test DCOR <- function(z,ix,dims) { p <- dims[1] q <- dims[2] d <- p + q x <- as.matrix(z[ , 1:p]) y <- as.matrix(z[ix, -(1:p)]) n <- nrow(x) m <- nrow(y) if (n != m || n < 2) stop("Sample sizes must agree") if (! (all(is.finite(c(x, y))))) stop("Data contains missing or infinite values") Akl <- function(x) { d <- as.matrix(dist(x)) m <- rowMeans(d); M <- mean(d) a <- sweep(d, 1, m); b <- sweep(a, 2, m) b + M } A <- Akl(x) B <- Akl(y) dCov <- mean(A * B) dVarX <- sqrt(mean(A * A)) dVarY <- sqrt(mean(B * B)) dCor <- dCov / sqrt(dVarX * dVarY) return(dCor) } N <- seq(10, 150, by=10)# different sample sizes power11 <- power12 <- power21 <- power22 <- numeric(length(N)) alpha <- 0.05 ## model 1 ## Mento Carlo Method for(i in 1:length(N)){ X1 <- rmvnorm(N[i],mean,sigma); e1 <- rmvnorm(N[i],mean,sigma) Y1 <- X1/4+e1 #generate samples z <- cbind(X1,Y1); z1 <- as.matrix(z) p.values <- matrix(NA,m,2) for(j in 1:m){ boot.obj <- boot(data = z1, statistic = DCOR, R = 99, sim ="permutation", dims = c(2, 2))# distance correlation test with permutation tb <- c(boot.obj$t0, boot.obj$t) p.values[j,1] <- mean(tb >= tb[1]) p.values[j,2] <- bcov.test(X1,Y1,R=99,seed=j*seed)$p.value# ball covariance test } power11[i] <- colMeans(p.values < alpha)[1] power12[i] <- colMeans(p.values < alpha)[2] } ## draw the graph about powers of two testing methods with model 1 par <- par(no.readonly = TRUE) plot(N,power11,type = "b",xlab="n",ylab="power",main="Model 1",lty=2,col=2) lines(N,power12,type="b",lty=2,col=3) legend("topleft", legend = c("dcor", "ball"), col = 2:3,lty = c(2,2)) ## model 2 (the same process with model 1) for(i in 1:length(N)){ X2 <- rmvnorm(N[i],mean,sigma); e2 <- rmvnorm(N[i],mean,sigma) Y2 <- (X2/4)*e2 z <- cbind(X2,Y2); z2 <- as.matrix(z) p.values <- matrix(NA,m,2) for(j in 1:m){ boot.obj <- boot(data = z2, statistic = DCOR, R = 99, sim = "permutation", dims = c(2, 2)) tb <- c(boot.obj$t0, boot.obj$t) p.values[j,1] <- mean(tb>=tb[1]) p.values[j,2] <- bcov.test(X2,Y2,R=99,seed=j*seed)$p.value } power21[i] <- colMeans(p.values < alpha)[1] power22[i] <- colMeans(p.values < alpha)[2] } plot(N,power21,type = "b",xlab="n",ylab="power",main="Model 2",lty=2,col=2) lines(N,power22,type="b",lty=2,col=3) legend("topleft", legend = c("dcor", "ball"), col = 2:3,lty = c(2,2))
3.my thoughts
We can find that, in model 1, the power of distance correlation test is not less than the ball covariance test, which means distance correlation test is more suitable for model 1.
However, in model 2, the power of ball covariance test is not less than the distance correlation test, which means ball covariance test is more suitable for model 2.
All in a word, for different models, we should choose different testing method in order to get the best result.
9.4 Implement a random walk Metropolis sampler for generating the standard Laplace distribution (see Exercise 3.2). For the increment, simulate from a normal distribution. Compare the chains generated when different variances are used for the proposal distribution. Also, compute the acceptance rates of each chain.
1.basic ideas
First we can calculate that, the CDF of Laplace distribution is $$F(x)=\left{\begin{array}{ll}{\frac{1}{2} e^{x},} & {x<0} \ {1-\frac{1} {2}e^{-x}} & {, x \geq 0}\end{array}\right.$$ We then implement a random walk Metropolis sampler for generating the standard Laplace distribution and compare them.
2.code
set.seed(12345) ## random walk Metropolis sampler rw.Metropolis <- function(sigma, x0, N) { x <- numeric(N) x[1] <- x0 u <- runif(N) k <- 0 for (i in 2:N) { y <- rnorm(1, x[i-1], sigma) if (u[i] <= ((1/2)*exp(-abs(y))) / ((1/2)*exp(-abs(x[i-1])))) x[i] <- y else { x[i] <- x[i-1] k <- k + 1 } } return(list(x=x, k=k)) } ## samples generated by random walk Metropolis sampler with proposal distributions of different variances N <- 2000; sigma <- c(0.05, 0.5, 2, 16); x0 <- 25 rw1 <- rw.Metropolis(sigma[1], x0, N) rw2 <- rw.Metropolis(sigma[2], x0, N) rw3 <- rw.Metropolis(sigma[3], x0, N) rw4 <- rw.Metropolis(sigma[4], x0, N) ## plots of four chains index <- 1:2000 refline <- c(log(2*0.025), -log(2*(1-0.975))) for(i in 1:4){ y <- rw.Metropolis(sigma[i], x0, N)$x plot(index,y,type = "l",ylab="x") abline(h=refline,lwd=1,col=2) title(paste("sigma=",sigma[i])) } ## compare the quantiles alpha <- seq(0.05,0.95,by=0.05) Q <- c(log(2*alpha[1:10]), -log(2*(1-alpha[11:19])))# the true quantiles of Lapalace disribution rw <- cbind(rw1$x, rw2$x, rw3$x, rw4$x) mc <- rw[401:N, ] Qrw <- apply(mc,2,function(x) quantile(x, alpha))# the sample quantiles of the four chains qq <- data.frame(round(cbind(Q, Qrw), 3)) names(qq) <- c('True','sigma=0.05','sigma=0.5','sigma=2','sigma=16') print(qq) ## acceptance rates print(c(acceptRate1=1-rw1$k/N, acceptRate2=1-rw2$k/N, acceptRate3=1-rw3$k/N, acceptRate4=1-rw4$k/N))
3.my thoughts
In the first plot, the increments are small and the chain is almost like a true random walk although its acceptance is very high.
The second and third chain converge to the target distribution after a short burn-in period of less than 400 and their quantiles are close to the true distribution. Besides, they have an acceptance rate higher than 0.5.
The fourth chain converges and its quantiles are close to the true distribution, but it is inefficient because the acceptance rate is just 0.1085.
11.1 The natural logarithm and exponential functions are inverses of each other, so that mathematically $log(e^x) = e^{log(x)}=x$. Show by example that this property does not hold exactly in computer arithmetic. Does the identity hold with near equality? (See all.equal.)
11.5 Write a function to solve the equation $$\frac{2 \Gamma\left(\frac{k}{2}\right)}{\sqrt{\pi(k-1)} \Gamma\left(\frac{k-1}{2}\right)}\int_{0}^{c_{k-1}}\left(1+\frac{u^{2}}{k-1}\right)^{-k / 2} d u $$ $$= \frac{2 \Gamma\left(\frac{k+1}{2}\right)}{\sqrt{\pi k} \Gamma\left(\frac{k}{2}\right)}\int_{0}^{c_{k}}\left(1+\frac{u^{2}}{k}\right)^{-(k+1) / 2} d u $$ for $a$, where $$c_{k}=\sqrt{\frac{a^{2} k}{k+1-a^{2}}}$$ Compare the solutions with the points $A(k)$ in Exercise 11.4.
Ex. A-B-O blood type problem
Let the three alleles be A, B, and O with allele frequencies p, q, and r. The 6 genotype frequencies under HWE and complete counts are as follows. $$\begin{array}{|c|c|c|c|c|c|}\hline \text { Genotype } & {A A} & {B B} & {O O} & {A O} & {B O} & {A B} & {S u m} \ \hline \text { Frequency } & {p^{2}} & {q^{2}} & {r^{2}} & {2 p r} & {2 q r} & {2 p q} & {1} \ \hline \text { Count } & {n {A A}} & {n {B B}} & {n {O O}} & {n {A O}} & {n {B O}} & {n{ A B}} & {n} \ \hline\end{array}$$ Observed data: nA· =nAA +nAO =28 (A-type), nB· =nBB +nBO =24 (B-type), nOO =41 (O-type), nAB =70 (AB-type).
Use EM algorithm to solve MLE of p and q (consider missing data nAA and nBB). Show that the log-maximum likelihood values in M-steps are increasing via line plot.
1.basic ideas
Mathematical calculation is not computational calculation for base 2 representation is used in computer arithmetic. We use "isTRUE" to compare two numbers to see whether they are equal exactly and use "all.equal" to see whether they are equal nearly.
2.code
isTRUE(exp(log(0.1))==log(exp(0.1))); isTRUE(exp(log(0.1))==0.1); isTRUE(log(exp(0.1))==0.1) isTRUE(all.equal(exp(log(0.1)),log(exp(0.1)))); isTRUE(all.equal(exp(log(0.1)),0.1)); isTRUE(all.equal(log(exp(0.1)),0.1))
3.conclusion
The identity $log(e^x) = e^{log(x)}=x$ does not hold exactly in computer arithmetic, however, it holds with near equality.
1.basic ideas
We use function "uniroot" to calculate the roots of two equations in 11.4 and 11.5 with different values of parameter k and compare them.
2.code
## answer of 11.4 k <- c(4:25,100,500,1000)# different values of parameter k (df of t distribution) solution1 <- numeric(length(k)) for(i in 1:length(k)){ # the one-dimensional nonlinear function f <- function(a) {pt(sqrt(a^2*(k[i]-1)/(k[i]-a^2)),k[i]-1)-pt(sqrt(a^2*k[i]/(k[i]+1-a^2)),k[i])} solution1[i] <- uniroot(f,lower=0+1e-6,upper=sqrt(k[i])-1e-6)$root } round(solution1,4) ## answer of 11.5 k <- c(4:25,100,500,1000) solution2 <- numeric(length(k)) for(i in 1:length(k)){ # the upper value of integrate function ck1 <- function(a) sqrt(a^2*(k[i]-1)/(k[i]-a^2)) # the left function of the equation g1 <- function(a) 2*exp(lgamma(k[i]/2)-log(sqrt(pi*(k[i]-1)))-lgamma((k[i]-1)/2))*(integrate(function(u) {(1+u^2/(k[i]-1))^(-k[i]/2)},lower = 1e-6,upper = ck1(a)-1e-6)$value) ck2 <- function(a) sqrt(a^2*(k[i])/(k[i]+1-a^2)) g2 <- function(a) 2*exp(lgamma((k[i]+1)/2)-log(sqrt(pi*k[i]))-lgamma(k[i]/2))*(integrate(function(u) {(1+u^2/(k[i]))^(-(k[i]+1)/2)},lower = 1e-6,upper = ck2(a)-1e-6)$value) # the one-dimensional nonlinear function f <- function(a) {g1(a)-g2(a)} solution2[i] <- uniroot(f,lower=0+1e-2,upper=sqrt(k[i])-1e-2)$root } round(solution2,4)
3.my thoughts
We can find that when k is not large, the roots of the two equantions in 11.4 and 11.5 are the same, however, when k is larger than 22, the roots of 11.4 are wrong, which is because computer can not calculate numbers too large.
As a result, we can use $exp(log(x))=x$ to calculate the ratio of two large numbers and improve the algorithm of 11.5 to get the true answers.
1.basic ideas
We use the EM (Expectation–Maximization) algorithm to find maximum likelihood estimates when data are incomplete.
2.code
library("stats4") ## the log likelihood function ll <- function(p,nAA,nBB) -{nAA*log(p[1]^2)+nBB*log(p[2]^2)+41*log((1-p[1]-p[2])^2)+(28-nAA)*log(2*p[1]*(1-p[1]-p[2]))+(24-nBB)*log(2*p[2]*(1-p[1]-p[2]))+70*log(2*p[1]*p[2])} tol <- 1e-10# convergence condition N <- 100# max numbers of iterations lmlv <- numeric(N)# log-maximum likelihood values p <- c(0.1,0.1)# initial estimate of the target parameter par <- matrix(nrow= N ,ncol = 2); par[1,] <- p for(i in 1:N){ # the E step nAA <- 28*p[1]^2/(p[1]^2+2*p[1]*(1-p[1]-p[2])) nBB <- 24*p[2]^2/(p[2]^2+2*p[2]*(1-p[1]-p[2])) lmlv[i]<-ll(p,nAA,nBB) # the M step par[i+1,] <- optim(par=p,fn=ll,nAA=nAA,nBB=nBB)$par p <- par[i+1,] if (sum(abs(par[i+1,]-par[i,])) < tol) break } print(round(c(p_mle=p[1],q_mle=p[2]),5))# MLE of p and q index <- c(1:i) plot(index,-lmlv[index],type="b",ylab="log-maximum likelihood values",xlab="iteration numbers")
3.my thoughts
The MLE of p is 0.32735 and the MLE of q is 0.31040 by EM algorithm and we can find that the log-maximum likelihood values in M-steps are increasing via the line plot.
Use both for loops and lapply() to fit linear models to the mtcars using the formulas stored in this list:
formulas <- list( mpg ~ disp, mpg ~ I(1 / disp), mpg ~ disp + wt, mpg ~ I(1 / disp) + wt)
attach(mtcars) formulas <- list( mpg ~ disp, mpg ~ I(1 / disp), mpg ~ disp + wt, mpg ~ I(1 / disp) + wt) ## for loops for(i in seq_along(formulas)){ print(lm(formulas[[i]])) } ## lapply lapply(formulas, lm) detach(mtcars)
Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loops and lapply().
Can you do it without an anonymous function?
bootstraps <- lapply(1:10, function(i) {
rows <- sample(1:nrow(mtcars), rep = TRUE)
mtcars[rows, ]
})
bootstraps <- lapply(1:10, function(i) { rows <- sample(1:nrow(mtcars), rep = TRUE) mtcars[rows, ] }) ## with an anonymous function # usefor loops for(i in seq_along(bootstraps)){ print(lm(bootstraps[[i]]$mpg ~ bootstraps[[i]]$disp)) } # use lapply lapply(bootstraps, lm, formula = mpg ~ disp) ## use lapply without an anonymous function fit <- function(x){ lm(mpg ~ disp,data=x) } lapply(bootstraps, fit)
For each model in the previous two exercises, extract $R^2$ using the function below.
rsq <- function(mod) summary(mod)$r.squared
attach(mtcars) rsq <- function(mod) summary(mod)$r.squared ## R2 in ex3 formulas <- list(mpg ~ disp, mpg ~ I(1 / disp), mpg ~ disp + wt, mpg ~ I(1 / disp) + wt) # for loops for(i in seq_along(formulas)){ print(rsq(lm(formulas[[i]]))) } # lapply lapply(lapply(formulas,lm),rsq) ## R2 in ex4 bootstraps <- lapply(1:10, function(i) { rows <- sample(1:nrow(mtcars), rep = TRUE) mtcars[rows, ] }) # for loops for(i in seq_along(bootstraps)){ print(rsq(lm(bootstraps[[i]]$mpg ~ bootstraps[[i]]$disp))) } # lapply lapply(lapply(bootstraps, lm, formula = mpg ~ disp),rsq) detach(mtcars)
The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial. trials <- replicate( 100, t.test(rpois(10, 10), rpois(7, 10)), simplify = FALSE ) Extra challenge: get rid of the anonymous function by using [[ directly.
trials <- replicate( 100, t.test(rpois(10, 10), rpois(7, 10)), simplify = FALSE ) ## use sapply() and an anonymous function pvalue <- function(test) test$p.value sapply(trials, pvalue) ## use [[ directly sapply(trials, '[[', 3)
Implement mcsapply(), a multicore version of sapply(). Can you implement mcvapply(), a parallel version of vapply()? Why or why not?
library(parallel) cl <- makeCluster(getOption("cl.cores", 2)) bootstraps <- lapply(1:100000, function(i) { rows <- sample(1:nrow(mtcars), rep = TRUE) mtcars[rows, c(1,3)] }) ## for sapply() system.time(sapply(bootstraps, lm)) ## for parSapply parSapply(cl, 1:10, sqrt) system.time(parSapply(cl,bootstraps, lm))
Since we can compute each element in any order, it’s easy to dispatch the tasks to different cores, and compute them in parallel, so we can use parSapply.
However, there is no direct way to use a multicore version of vapply(), I guess it's because vapply() need to set a pre-specified type of return value, which can not be computed in parallel.
You have already written an R function for Exercise 9.4 (page 277, Statistical Computing with R). Rewrite an Rcpp function for the same task.
Compare the generated random numbers by the two functions using qqplot.
Campare the computation time of the two functions with microbenchmark.
Comments your results.
library(Rcpp) library(microbenchmark) set.seed(135) ## random walk Metropolis sampler in R rwMetropolisR <- function(sigma, x0, N) { x <- numeric(N) x[0] <- x0 u <- runif(N) k <- 0 for (i in 2:N) { y <- rnorm(1, x[i-1], sigma) if (u[i] <= ((1/2)*exp(-abs(y))) / ((1/2)*exp(-abs(x[i-1])))) x[i] <- y else { x[i] <- x[i-1] k <- k + 1 } } return(list(x=x, k=k)) } ## random walk Metropolis sampler in C++ cppFunction('List rwMetropolisC(double sigma, double x0, int N) { NumericVector x(N); x[0] = x0; int k = 0; for(int i = 1; i < N; i++){ double u = as<double>(runif(1)); double y = as<double>(rnorm(1, x[i-1], sigma)); if(u <= exp(-abs(y)+abs(x[i-1]))) x[i] = y; else { x[i] = x[i-1]; k += 1; } } return List::create(x,k); }') N <- 2000; sigma <- c(0.05, 0.5, 2, 16); x0 <- 25 # plots of four chains in C++ index <- 1:2000; refline <- c(log(2*0.025), -log(2*(1-0.975))) for(i in 1:4){ y <- unlist(rwMetropolisC(sigma[i], x0, N)[1]) plot(index, y, type = "l", ylab="x") abline(h = refline, lwd = 1, col = 2) title(paste("sigma=", sigma[i])) } # acceptance rates in C++ acceptRateC <- numeric(4) for(i in 1:4){ acceptRateC[i] <- 1-(as.numeric(rwMetropolisR(sigma[i], x0, N)[2]))/N } print(c(acceptRateC1 = acceptRateC[1], acceptRateC2 = acceptRateC[2], acceptRateC3 = acceptRateC[3], acceptRateC4 = acceptRateC[4])) # qqplot xR <- rwMetropolisR(2, 25, 10000)$x[1000:10000] xC <- unlist(rwMetropolisC(2, 25, 10000)[1])[1000:10000] qqplot(xR, xC, main="qqplot", xlab = "rwMetropolisR", ylab = "rwMetropolisC") abline(a = 0, b = 1) # computation time ts <- microbenchmark(xR = rwMetropolisR(2, 25, 10000), xC = rwMetropolisC(2, 25, 10000)) summary(ts)[, c(1,3,5,6)]
For me, it's easier to solve this question in R because I'm not quite familiar with the usage of C and C++.
The results in qqplot shows that the generated random numbers are quite near with these two functions. However, the results of computation time show an evident computational speed gain of C++ against R.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.