StatComp18017 is a simple R package including my homework.
The R package 'microbenchmark' can be used to benchmark the above R and C++ functions.
Most otfen, it is necessary to split the picture or know how the device splits the windows that we need.Apparently??the function split.screen()could fits our needs,but with the shortcomings of incompatibility and unability to be performed on several devices.Thus,how to know that the route to split??and how to code?
First,we can generate a matrix with specified construction,then utilize the funcyion of layout() to display the details of the split process.The examples are follows.
mat<- matrix(1:4,2,2) mat layout (mat) layout.show(4)
It is also feasible that fuse the function layout() and matrix() to simplify the code and get to our targets. As the examples show below.
layout (matrix(1:6,3,2)) layout.show(6)
m<- matrix(c(1:3,3),2,2) layout (m) layout.show(3)
In addition, options in function of layout(), such as widths and heights, give the possibilities to alter the split width and weight.So, the screen could be separated into plentiful shapes even be some complicated or weird.
m<- matrix(c(1,1,2,1),2,2) layout (m,widths = c(2,1),heights = c(1,2)) layout.show(2)
These are about the split of screen with the help of function of layout and matrix.
In the analysis process,to make a intuitive impression, the images are indispensable. So knowing how to code to display the relevant images is essential.Thus??here gives some simple examples.
The simplist example of the plot is followed.
x<-rnorm(10) y<-rnorm(10) plot(x,y)
In fact, changing the image characters and making the iamges more artistic and analyzable, through adjusting some parameter values, are practicible.
plot(x,y,xlab = "Ten random values",ylab = "Ten other values", xlim = c(-2,2), ylim = c(-2,2), pch=22, col="red", bg="yellow", bty="7",tcl=0.4, main="How to customize a plot with R", las=1,cex=1.5)
As we all know, the function of plot is one of the simple praphing methods.Thus more advanced functions are needed to modify the images pet by function plot as function par.
opar<-par() par(bg="lightyellow", col.axis="blue", mar=c(4,4,2.5,0.25)) plot(x,y,xlab = "Ten random values",ylab = "Ten other values", xlim = c(-2,2), ylim = c(-2,2), pch=22, col="red", bg="yellow",bty="7", tcl=-0.25, las=1, cex=1.5) title("How to customize a plot with R(bias)",font.main=3, adj=1) par(opar)
In this informative age, processing big data is common, thus we generally need some function with the cycle structure to find the potential rules. And in the code of R, some of the cycle functions are different and unique. Those we have to study this.So how the procedure of the cycle does?
In the R software, the function with the form of multi function apply can avoid the ues of the direct cycle structure.as the example follows.
x<-rnorm(10,-5,0.1) y<-rnorm(10,5,2) X<-cbind(x,y) apply(X, 2,mean) apply(X, 2, sd)
More information, the function of lappy() which has the silimar syntax with the function apply can work on the list objects and eventually return a list object.
forms<-list(y~x,y~poly(x,2)) lapply(forms,lm)
In more flexible situations, to accept the vectors or the matrixes as the main parameters, function sapply can be a good alternate with the final return of the form of table. Here we do not talk these.
Most otfen, it is necessary to split the picture or know how the device splits the windows that we need.Apparently??the function split.screen()could fits our needs,but with the shortcomings of incompatibility and unability to be performed on several devices.Thus,how to know that the route to split??and how to code?
First,we can generate a matrix with specified construction,then utilize the funcyion of layout() to display the details of the split process.The examples are follows.
mat<- matrix(1:4,2,2) mat layout (mat) layout.show(4)
It is also feasible that fuse the function layout() and matrix() to simplify the code and get to our targets. As the examples show below.
layout (matrix(1:6,3,2)) layout.show(6)
m<- matrix(c(1:3,3),2,2) layout (m) layout.show(3)
In addition, options in function of layout(), such as widths and heights, give the possibilities to alter the split width and weight.So, the screen could be separated into plentiful shapes even be some complicated or weird.
m<- matrix(c(1,1,2,1),2,2) layout (m,widths = c(2,1),heights = c(1,2)) layout.show(2)
These are about the split of screen with the help of function of layout and matrix.
In the analysis process,to make a intuitive impression, the images are indispensable. So knowing how to code to display the relevant images is essential.Thus??here gives some simple examples.
The simplist example of the plot is followed.
x<-rnorm(10) y<-rnorm(10) plot(x,y)
In fact, changing the image characters and making the iamges more artistic and analyzable, through adjusting some parameter values, are practicible.
plot(x,y,xlab = "Ten random values",ylab = "Ten other values", xlim = c(-2,2), ylim = c(-2,2), pch=22, col="red", bg="yellow", bty="7",tcl=0.4, main="How to customize a plot with R", las=1,cex=1.5)
As we all know, the function of plot is one of the simple praphing methods.Thus more advanced functions are needed to modify the images pet by function plot as function par.
opar<-par() par(bg="lightyellow", col.axis="blue", mar=c(4,4,2.5,0.25)) plot(x,y,xlab = "Ten random values",ylab = "Ten other values", xlim = c(-2,2), ylim = c(-2,2), pch=22, col="red", bg="yellow",bty="7", tcl=-0.25, las=1, cex=1.5) title("How to customize a plot with R(bias)",font.main=3, adj=1) par(opar)
In this informative age, processing big data is common, thus we generally need some function with the cycle structure to find the potential rules. And in the code of R, some of the cycle functions are different and unique. Those we have to study this. So how the procedure of the cycle does?
In the R software, the function with the form of multi function apply can avoid the ues of the direct cycle structure.as the example follows.
x<-rnorm(10,-5,0.1) y<-rnorm(10,5,2) X<-cbind(x,y) apply(X, 2,mean) apply(X, 2, sd)
More information, the function of lappy() which has the silimar syntax with the function apply can work on the list objects and eventually return a list object.
forms<-list(y~x,y~poly(x,2)) lapply(forms,lm)
In more flexible situations, to accept the vectors or the matrixes as the main parameters, function sapply can be a good alternate with the final return of the form of table. Here we do not talk these.
Write a function to compute a $ Monte Carlo $ estimate of the $ Beta(3, 3) $ cdf, and use the function to estimate # F(x) forx =0 .1,0.2,...,0.9. # Compare the estimates with the values returned by the $ pbeta $ function in R.
x <- seq(.1,0.9, length = 9) m <- 10000 u <- runif(m) cdf <- numeric(length(x)) for (i in 1:length(x)) { g <- x[i] ^3*u^2* (1-x[i]^2*u^2) cdf[i] <- 30*mean(g) }
Now the estimates ?? ?? for ten values of x are stored in the vector cdf. Compare the estimates with the value ??(x) computed (numerically) by the pnorm function.
Phi <- pbeta(x,3,3) print(round(rbind(x, cdf, Phi), 3))
Results for several values x>0 are shown compared with the value of the beta distribution cdf function pbeta. The Monte Carlo estimates appear to not be very close to the pnorm values cause the estimates will be worse in the extreme upper tail of the distribution.)
The $ Rayleigh $ density [156, (18.76)] is $ f(x)= x/sigma^2exp(???x^2/(2sigma^2)),x ??0,??>0.$ Implement a function to generate samples from a $ Rayleigh(??)$ distribution, using antithetic variables. What is the percent reduction in variance of $ (X+X')/ 2 $ compared with $ (X1+X2)/2 $ for independent $ X1 $, $X2 $?
MC.Phi <- function(x, R = 10000, antithetic = TRUE) { u <- runif(R/2) sigma <- 4 if (!antithetic) v <- runif(R/2) else v <- 1 - u u <- c(u, v) cdf <- numeric(length(x)) for (i in 1:length(x)) { g <- x[i]^2*u/sigma^2 * exp(-(u* x[i])^2 / (2*sigma^2) ) cdf[i] <- mean(g) } cdf }
A comparison of estimates obtained from a single Monte Carlo experiment is below.
x <- seq(1,10, length=10) sigma <- 4 Phi <- x/sigma^2*exp(-x^2/(2*sigma^2)) set.seed(123) MC1 <- MC.Phi(x, anti = FALSE) set.seed(123) MC2 <- MC.Phi(x) print(round(rbind(x, MC1, MC2, Phi), 5))
m <- 1000 MC1 <- MC2 <- numeric(m) x <- 2 for (i in 1:m) { MC1[i] <- MC.Phi(x, R = 1000, anti = FALSE) MC2[i] <- MC.Phi(x, R = 1000) } print(sd(MC1)) print(sd(MC2)) print((var(MC1) - var(MC2))/var(MC1))
The antithetic variable approach achieved approximately 99.5% reduction in variance at x =2.0
Find two importance functions f1 and f2 that are supported on (1,??) and are ??close?? to $ g(x)= x^2/sqrt(2pi)exp(???x^2/2), x > 1.$ Which of your two importance functions should produce the smaller variance in estimating $ integrate( x^2/sqrt(2??)* exp(???x^2/2),1,inf) $ by importance sampling? Explain.
m <- 10000 theta.hat <- se <- numeric(2) g <- function(x) { x^2/sqrt(2*pi)*exp(-x^2/2) * (x >1) } x <- rnorm(m) #using f1 fg <- sqrt(2*pi)*g(x)/exp(-x^2/2) theta.hat[1] <- mean(fg) se[1] <- sd(fg) x <- rchisq(m,4) #using f2 fg <- 4*g(x)/(x^2*exp(-x/2)) theta.hat[1] <- mean(fg) se[2] <- sd(fg) rbind(theta.hat, se)
we choose f1 as standard norm distribution,and f2 as chi-squared distribution.The estimates (labeled theta.hat) of the integate g(x) within the interval (1,inf) and the corresponding standard errors se for the simulation using each of the importance functions are perfomed above. From the results below,we know that the f1 produce the smaller variance. the relative codes are followed.
x <- seq(0, 1, .01) w <- 2 f1 <- exp(-x^2/2)/sqrt(2*pi) f2 <- x^2*exp(-x/2)/4 g <- x^2/sqrt(2*pi)*exp(-x^2/2) #figure (a) plot(x, g, type = "l", main = "", ylab = "", ylim = c(0,2), lwd = w) lines(x, f1, lty = 2, lwd = w) lines(x, f2, lty = 3, lwd = w) legend("topright", legend = c("g", 0:2), lty = 1:3, lwd = w, inset = 0.02) #figure (b) plot(x, g, type = "l", main = "", ylab = "", ylim = c(0,3.2), lwd = w, lty = 2) lines(x, g/f1, lty = 3, lwd = w) lines(x, g/f2, lty = 4, lwd = w) legend("topright", legend = c(0:2), lty = 2:4, lwd = w, inset = 0.02)
Obtain a Monte Carlo estimate of $ integrate( x^2/sqrt(2??)* exp(???x^2/2),1,inf) $ by importance sampling.
we use the standard norm distribution to estimate the integrate g(x) and we get the result :0.3868715.
m <- 10000 theta.hat <- se <- numeric(1) g <- function(x) { x^2/sqrt(2*pi)*exp(-x^2/2) * (x >1) } x <- rnorm(m) fg <- sqrt(2*pi)*g(x)/exp(-x^2/2) theta.hat <- mean(fg) theta.hat
Let $ X $ be a non-negative random variable with $ ?? = E[X] < ?? $ . For a random sample $ x_{1},...,x_{n} $ from the distribution of $ X $, the $ Gini $ ratio is de???ned by $$ G = \frac{1}{2n^2u}\sum_{j=1}^{n}\sum_{i=1}^{n}|x_{i}- x_{j}|.$$ The $ Gini $ ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that $ G $ can be written in terms of the order statistics $ x_{i} $ as $$ G = \frac{1}{n^2u}\sum_{i=1}^{n}(2i-n-1)x_{(i)}.$$ If the mean is unknown,let $ \hat{G} $ be the statistic $ G $ with $ ?? $ replaced by $ \overline {x} $. Estimate by simulation the mean, median and deciles of $ \hat{G} $ if $ X $ is standard lognormal. Repeat the procedure for the uniform distribution and $ Bernoulli(0.1) $. Also construct density histograms of the replicates in each case.
We use the ordered sample to esrimte the relevent target values of $ G $ function. The basic idea is : (1) generate the random samples with the number n: $ x_1,x_2,????,x_n $, and iid from the distribution of $ X $. (2) Sort the samples we get in increasing order, to obtain the ordered samples: $ x(1)<x(2)<????<x(n)$ (3) Compute the median function to get the results,mean ,median and deciles of the G.
First, we simulate the mean,median and deciles of $ \hat{G} $ if $ X $ is standard lognormal.
n <- 20 ## numbers of samples m <- 1000 ## experiment times G <- numeric(m) for (i in 1:m) { x <- sort(rlnorm(n)) ## X is standard lognormal mu <- mean(x) ## let the mu be the mean values of the samples for (j in 1:n) { t <- (2*j-n-1)*x[j] } G[i] <- sum(t)/(mu*n^2) } Ghat <- mean(G) ## compute the estimate values(mean) of G MGhat <- median(G) ## compute the estimate values(median) of G DGhat <- quantile(G,probs=0.1) ## compute the estimate values(deciles) of G c(Ghat,MGhat,DGhat) hist(G,prob=TRUE) ## density histogram of sample
Then, we simulate the mean,median and deciles of $ \hat{G} $ if $ X $ is uniform.
n <- 20 ## numbers of samples m <- 1000 ## experiment times G <- numeric(m) for (i in 1:m) { x <- sort(runif(n)) ## X has the uniform distribution mu <- mean(x) ## let the mu be the mean values of the samples for (j in 1:n) { t <- (2*j-n-1)*x[j] } G[i] <- sum(t)/(mu*n^2) } Ghat <- mean(G) ## compute the estimate values(mean) of G MGhat <- median(G) ## compute the estimate values(median) of G DGhat <- quantile(G,probs=0.1) ## compute the estimate values(deciles) of G c(Ghat,MGhat,DGhat) hist(G,prob=TRUE) ## density histogram of sample
Last, we simulate the mean,median and deciles of $ \hat{G} $ if $ X $ is Bernoulli(0.1).
n <- 20 ## numbers of samples m <- 1000 ## experiment times G <- numeric(m) for (i in 1:m) { x <- sort(rbinom(n,size=1,prob=0.1)) ## X has the Bernoulli(0.1) distribution mu <- mean(x) ## let the mu be the mean values of the samples for (j in 1:n) { t <- (2*j-n-1)*x[j] } G[i] <- sum(t)/(mu*n^2) } Ghat <- mean(G) ## compute the estimate values(mean) of G MGhat <- median(G) ## compute the estimate values(median) of G DGhat <- quantile(G,probs = seq(0,1,0.1),na.rm = TRUE) ## compute the estimate values(deciles) of G c(Ghat,MGhat,DGhat) hist(G,prob=TRUE) ## density histogram of sample
Construct an approximate 95% con???dence intervalfor the $ Gini $ ratio $ \gamma = E[G] $ if $ X $ is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a $ Monte Carlo $ experiment.
From the codes below, we can get the estimate of G and a 95% CI. Accoding to the relevent theory, we know that the CI is $$ [RG-sd(G)/sqrt(n)t_{(1-\frac{\alpha}2)}(n-1),RG+sd(G)/sqrt(n)t_{(1-\frac{\alpha}2)}(n-1)]$$,and the RG is the real value of the expectation of $ G $ computed with the specific distribution of $ X $.
n <- 20 ## numbers of samples m <- 1000 ## experiment times G <- numeric(m) for (i in 1:m) { x <- sort(rlnorm(n)) ## X is standard lognormal mu <- mean(x) ## let the mu be the mean values of the samples for (j in 1:n) { t[j] <- (2*j-n-1)*x[j] sum(t) } G[i] <- sum(t)/(mu*n^2) G[i] } mean(G) ## compute the estimate values(mean) of G alpha <- 0.05 sd <- 1 EG <- 2*pnorm(sd/sqrt(2))-1 ## compute the integration of G to get the real mean values UCL <- mean(G)+sd(G)*qt(alpha/2, df = n-1)/sqrt(n) ## compute the upper confidence limit LCL <- mean(G)-sd(G)*qt(alpha/2, df = n-1)/sqrt(n) ## compute the lower confidence limit c(UCL,LCL)
Then, we estimate the coverage rate according to the hypothesis test.
n <- 20 ## numbers of samples m <- 1000 ## experiment times G <- numeric(m) UCL=numeric(1000) LCL=numeric(1000) for(ii in 1:1000) { for (i in 1:m) { x <- sort(rlnorm(n)) ## X is standard lognormal mu <- mean(x) ## let the mu be the mean values of the samples for (j in 1:n) { t[j] <- (2*j-n-1)*x[j] sum(t) } G[i] <- sum(t)/(mu*n^2) G[i] } mean(G) ## compute the estimate values(mean) of G alpha <- 0.05 sd <- 1 EG <- 2*pnorm(sd/sqrt(2))-1 ## compute the integration of G to get the real mean values UCL[ii] <- mean(G)+sd(G)*qt(alpha/2, df = n-1)/sqrt(n) ## compute the upper confidence limit LCL[ii] <- mean(G)-sd(G)*qt(1-alpha/2, df = n-1)/sqrt(n) ## compute the lower confidence limit } sum(EG<UCL & EG>LCL) #count the number of intervals that satisfy the conditions mean(UCL>EG & EG>LCL) ## compute the coverage rate
Tests for association based on $ Pearson $ product moment correlation $ \rho $, $ Spearman??s $ rank correlation coe???cient $ \rho_{s} $, or $ Kendall??s $ coe???cient $ \tau $ ,are implemented in $ cor.test $. Show (empirically) that the nonparametric tests based on $ \rho_{s} $ or $ \tau $ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution $ (X,Y) $ such that $ X $ and $ Y $ are dependent) such that at least one of the nonparametric tests have better empirical power than the correlation test against this alternative.
Compute a $jackknife$ estimate of the bias and the standard error of the correlation statistic in $Example 7.2$ (p202).
If $\hat\theta$ is a smooth (plug-in) statistic, then $\hat\theta_{(i)} = t(F_{n-1}(x_{(i)}))$, and the jackknife estimate of bias is $$\hat{bias}{jack}=(n-1)(\overline{\hat\theta{(.)}}-\hat\theta),$$ when $\overline{\hat\theta_{(.)}}=\frac1n\sum_{i=1}^{n}\hat\theta_{(i)}$ is the mean of the estimates from the leave-one-out samples, and $\hat\theta= \hat\theta_{(x)}$ is the estimate computed from the original observed sample.
A $jackknife$ estimate of standard error is $$\hat{se}{jack}=\sqrt{\frac{n-1}{n}\sum{i=1}^{n}(\hat\theta_{(i)}-\overline{\hat\theta_{(.)}})^2},$$ for smooth statistics $\hat\theta.$
library(bootstrap) #for the law data data(law, package = "bootstrap") n <- nrow(law) # for all data available in the samples popuation theta.hat <- cor(law$LSAT,law$GPA) #compute the correlation between law$LSAT and law$GPA print(sprintf('theta.hat=%f',theta.hat)) #compute the jackknife replicates, leave-one-out estimates theta.jack <- numeric(n) for (i in 1:n) theta.jack[i] <- cor(law$LSAT[-i],law$GPA[-i]) bias <- (n - 1) * (mean(theta.jack) - theta.hat) print(sprintf('bias=%f',bias)) #jackknife estimate of bias se <- sqrt((n-1) * mean((theta.jack - mean(theta.jack))^2)) print(sprintf('se=%f',se)) #jackknife estimate of standard error
The correlation statistic is the correlation betwwen $LSAT$ and $GPA$, we use the $\theta$ to delegate the target correlation and denote the estimstor $\theta$ as $\hat\theta$, we compute and get the value $\hat\theta=0.776374$.Then we use the method of jackknife to estimate its bias and standard error.The final results are $\hat{bias}{jack}=-0.006474$ and $\hat{se}{jack}=0.142519.$
Refer to $Exercise 7.4$(p206). Compute 95% bootstrap confidence intervals for the mean time between failures $\frac1\lambda$ by the standard normal, basic, percentile, and $BCa$ methods. Compare the intervals and explain why they may differ.
1,The standard normal bootstrap confidence interval : If $\theta$ is a sample mean and the sample size is large, then the Central Limit Theorem implies that $$Z=\frac{\hat\theta-E[\hat\theta]}{se(\hat\theta)}$$ is approximately standard normal. Hence, if $\hat\theta$ is unbiased for $\theta$, then an approximate $100(1-\alpha)$% CI for $\theta$ is the $Z$-interval $$[\hat\theta-z_\frac{\alpha}{2}se(\hat\theta),\hat\theta+z_\frac{\alpha}{2}se(\hat\theta)]$$ where $$z_\frac{\alpha}{2} =\Phi^{-1}(1-\frac{\alpha}{2}).$$
2,The basic bootstrap confidence interval : The basic bootstrap confidence interval transforms the distribution of the replicates by subtracting the observed statistic.The $100(1-\alpha)$% confidence limits for the basic bootstrap confidence interval are $$(2\hat\theta-\hat\theta_{1-\frac{\alpha}{2}},2\hat\theta-\hat\theta_\frac{\alpha}{2}).$$
3,The percentile bootstrap confidence interval : Suppose that $\hat\theta^{(1)},\cdots,\hat\theta^{(B)}$ are the bootstrap replicates of the statistic $\hat\theta$. From the ecdf of the replicates, compute the $\frac\alpha2$ quantile $\hat\theta_\frac\alpha2$,and the $1-\frac\alpha2$quantile$\hat\theta_{1-\frac\alpha2}.$
4,The better bootstrap confidence interval(BCa) : For a $100(1-\alpha)$% BCa bootstrap confidence interval is $$(\hat\theta_{\alpha1}^,\hat\theta_{\alpha2}^).$$
But we can obtain the normal, basic, percentile and the better bootstrap confidence intervals using the $boot$ and $boot.ci$ functions in the $boot$ package.The code generates 95% confidence intervals are follow.
library(boot) #for boot and boot.ci data(aircondit, package = "boot") #get the data set and package theta.boot <- function(x,ind) { #function to compute the target values x <- x[ind] mean(x) #acording to the exponetial distribution,the $\frac1\lambda$ is exactly the values of the mean x } #Run the bootstrap and compute con???dence interval estimates for the bioequivalence ratio x <- aircondit$hours #give x the values in the data set aircondit to denote the time intervals dat <- x boot.obj <- boot(dat, statistic = theta.boot, R = 2000) #The output for the bootstrap and bootstrap con???dence intervals is below. print(boot.obj) print(boot.ci(boot.obj, type = c("basic", "norm", "perc","bca")))
From the codes above, we get the standard normal,the basic,the percentile and the better bootstrap 95% CI are $( 33.8, 181.2 ),$ $( 25.6, 172.1 ),$ $( 44.1, 190.6 ),$ $( 55.2, 229.3 )$ respectively. And we compare them then know that the results of the standard normal,the basic,the percentile methods are different but with subtle difference.Meanwhile, the results of the BCa method are different from that of other methods notably.
(1)The standard normal bootstrap CI is the simplest approach, but not necessarily the best.In this method, we have made several assumptions:to apply the normal distribution, we assume that the distribution of $\hat\theta$ is normal or $\hat\theta$is a sample mean and the sample size is large; We have also implicitly assumed that $\hat\theta$ is unbiased for $\theta.$
(2)The basic bootstrap CI transforms the distribution of the replicates by subtracting the observed statistic. We reuse the ecdf and depend on it strongly than other methods and we do not utilize the information sufficently,but we make no assumptions,so the result is more fitted to some extent.
(3)A bootstrap percentile interval uses the empirical distribution of the bootstrap replicates as the reference distribution. The quantiles of the empirical distribution are estimators of the quantiles of the sampling distribution of $\hat\theta$, so that these (random) quantiles may match the true distribution better when the distribution of $\hat\theta$ is not normal.
(4)The bias-corrected and accelerated (BCa)is a modified version of percentile intervals that have better theoretical properties and better performance in practice.And also we can get the same conclusion from the results.
Refer to $Exercise 7.7$(p210). Obtain the $jackknife$ estimates of bias and standard error of $\hat\theta$.
If we want to compute the jackknife estimates of $\hat\theta$, we must to get the covariance matrix of the samples popuation and compute its all eigenvalues. Then it is easier for us to compute the $\hat\theta$. Finally, we just need to estimate with three methods of jackknife.The simple procedures are:
(1) Get the data set and generate a covariance matrix $\Sigma.$
(2) Compute all eigenvalues of the covariance matrix : $\lambda_1,\cdots,\lambda_5$ and compute the will-estimated parameter $\hat\theta=\frac{\hat\lambda_1}{\sum_{i=1}^{5}\hat\lambda_i}.$
(3) Estimate $\hat\theta$ with the jackknife : its $jackknife$ estimate of bias is $$\hat{bias}{jack}=(n-1)(\overline{\hat\theta{(.)}}-\hat\theta),$$ and its $jackknife$ estimate of standard error is $$\hat{se}{jack}=\sqrt{\frac{n-1}{n}\sum{i=1}^{n}(\hat\theta_{(i)}-\overline{\hat\theta_{(.)}})^2},$$
library(bootstrap) #for the sample data data(scor) # duplicate the data scor to "data" str(scor) # the structure of the sample data n <- nrow(scor) # for all data available in the samples popuation cov <- cov(scor) # compute the covariance matrix cov e <- eigen(cov) #compute the eigenvalues of the covariance matrix e x <- e$values[1] # The maximum eigenvaues of the covariance matrix print(sprintf('x=%f',x)) y <- sum(e$values[1:5]) # The sum values of all eigenvaues of the covariance matrix print(sprintf('y=%f',y)) theta.hat <- x/y #The contribution rate of the covariance matrix print(sprintf('theta.hat=%f',x/y)) #compute the jackknife replicates, leave-one-out estimates theta.jack <- numeric(n) for (i in 1:n) cov.jack <- cov(scor[-i]) # compute the covariance matrix in the jackknife e <- eigen(cov.jack) #compute the eigenvalues of the covariance matrix in the jackknife x <- e$values[1] # The maximum eigenvaues of the jackknife covariance matrix y <- sum(e$values[1:5]) # The sum values of all eigenvaues of the jackknife covariance matrix theta.jack[i] <- x/y #The contribution rate of the jackknife covariance matrix bias <- (n - 1) * (mean(theta.jack) - theta.hat) print(sprintf('bias=%f',bias)) #jackknife estimate of bias se <- sqrt((n-1) * mean((theta.jack - mean(theta.jack))^2)) print(sprintf('se=%f',se)) #jackknife estimate of standard error
From the results, we can see that the eigrn values of the covariance matrix are $\lambda_1 = 686.98981, \lambda_2 = 202.11107,\lambda_3 = 103.74731, \lambda_4 = 84.63044,\lambda_5 = 32.15329$. So we compute the $\hat\theta=\frac{\hat\lambda_1}{\sum_{i=1}^{5}\hat\lambda_i}=0.619115.$ Then we estimate the jackknife estimate of standard error and bias of $\hat\theta$ are $\hat{bias_{jack}}=-53.250929, \hat{se_{jack}}=0.612080$ respectively.
In $Example 7.18$(p227), leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.
The proposed models for predicting magnetic measurement (Y) from chemical measurement (X) are:
1.$Linear: Y=\beta_0+\beta_1X+\epsilon.$
2.$Quadratic: Y=\beta_0+\beta_1X+\beta_2X^2+\epsilon.$
3.$Exponential: log(Y) = log(\beta_0)+\beta_1X + \epsilon.$
4.$Log-Log: log(Y)=\beta_0 + \beta_1log(X)+\epsilon.$
And the leave-two-out cross validation procedure is:
1.For $k =1,\cdots,n$,let observation $(x_k,y_k)$ and $(x_{n-k+1},y_{n-k+1})$ be the test point and use the remaining observations to fit the model.
(a)Fit the model(s) using only the n-2 observations in the training set $(x_i,y_i),i \neq k,i \neq n-k+1.$
(b) Compute the predicted response $\hat{y}_k=\hat\beta_0+\hat\beta_1x_k$for the test point.
(c) Compute the prediction error $e_{k1}=y_k-\hat{y}k.$ and $e{k2}=y_{n-k+1}-\hat{y}_{n-k+1}.$
(d) compute the mean values of $e_{k1}=y_k-\hat{y}k.$ and $e{k2}=y_{n-k+1}-\hat{y}_{n-k+1}$ and denote as $e_k.$
2.Estimate the mean of the squared prediction errors $\hat\sigma_\epsilon^2=\frac1n\sum_{k=1}^{n}e_k^2.$
The code to estimate the parameters of the four models follows.
library(DAAG) data(ironslag, package = "DAAG") attach(ironslag) n <- length(magnetic) #in DAAG ironslag e1<- e11<- e12<- e2<- e21<- e22<- e3<- e31<- e32<- e4<- e41<- e42 <- numeric(n) # for n-fold cross validation # fit models on leave-two-out samples for (k in 1:n) { y <- magnetic[{-k}&{-(n-k+1)}] ## leave-two-out samples x <- chemical[{-k}&{-(n-k+1)}] ## leave-two-out samples J1 <- lm(y ~ x) #the Linear model yhat11 <- J1$coef[1] + J1$coef[2] * chemical[k] e11[k] <- magnetic[k] - yhat11 yhat12 <- J1$coef[1] + J1$coef[2] * chemical[n-k+1] e12[k] <- magnetic[n-k+1] - yhat12 e1[k] <- e11[k]+e12[k] J2 <- lm(y ~ x + I(x^2)) #the Quadratic model yhat21 <- J2$coef[1] + J2$coef[2] * chemical[k] + J2$coef[3] * chemical[k]^2 e21[k] <- magnetic[k] - yhat21 yhat22 <- J2$coef[1] + J2$coef[2] * chemical[n-k+1] + J2$coef[3] * chemical[n-k+1]^2 e22[k] <- magnetic[n-k+1] - yhat22 e2[k] <- e21[k]+e22[k] J3 <- lm(log(y) ~ x) #the Exponential model logyhat31 <- J3$coef[1] + J3$coef[2] * chemical[k] yhat31 <- exp(logyhat31) e31[k] <- magnetic[k] - yhat31 logyhat32 <- J3$coef[1] + J3$coef[2] * chemical[n-k+1] yhat32 <- exp(logyhat32) e32[k] <- magnetic[n-k+1] - yhat32 e3[k] <- e31[k]+e32[k] J4 <- lm(log(y) ~ log(x)) #the Log-Log model logyhat41 <- J4$coef[1] + J4$coef[2] * log(chemical[k]) yhat41 <- exp(logyhat41) e41[k] <- magnetic[k] - yhat41 logyhat42 <- J4$coef[1] + J4$coef[2] * log(chemical[n-k+1]) yhat42 <- exp(logyhat42) e42[k] <- magnetic[n-k+1] - yhat42 e4[k] <- e41[k]+e42[k] } c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2))
Now according the results, we display the specific properties and plots for the model 2:
According to the prediction error criterion, Model 2, the $Quadratic$ model, would be the best fit for the data. And now we show some relative parameter values and plots of this model. The best model is L2(the $Quadratic$ model). The fitted regression equation for Model 2 is $$\hat{Y}= 24.49262 -1.39334X + 0.05452X^2.$$
Implement the two-sample Cramer-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.
The Cramer-von Mises statistic, which estimates the integrated squared distance between the distributions, is defined by
$$W_2=\frac{mn}{(m+n)^2}[\sum^n_{i=1}(F_n(x_i)-G_m(x_i))^2+\sum^m_{j=1}(F_n(x_j)-G_m(x_j))^2].$$
where $F_n$ is the ecdf of the sample $x_1,...,x_n$ and $G_m$ is the ecdf of the sample $y_1,...,y_m$.
attach(chickwts) x <- sort(as.vector(weight[feed == "soybean"])) y <- sort(as.vector(weight[feed == "linseed"])) detach(chickwts) R <- 999#number of replicates z <- c(x,y)#pooled sample K <- 1:length(z) D <- numeric(R) #storage for replicates CM <- function(x,y){#a function to compute the statistic of Cramer-von Mises ecdfx <- ecdf(x) ecdfy <- ecdf(y) n1 <- length(x) n2 <- length(y) sum1 <- sum((ecdfx(x)-ecdfy(x))^2) sum2 <- sum((ecdfx(y)-ecdfy(y))^2) w <- n1*n2/(n1+n2)^2*(sum1+sum2) return(w) } D0 <- CM(x,y) for(i in 1:R) { k <- sample(K, size = length(x), replace = F) x1 <- z[k] y1 <- z[-k] #complements of X1 D[i] <- CM(x1,y1) } p <- mean(c(D0, D) >= D0) print(p) hist(D, main = "", freq = FALSE, xlab = "D (p = 0.412)", breaks = "scott") points(D0, 0, cex = 1, pch = 16)
The approximate achieved significance level 0.412 does not support the alternative hypothesis that distributions differ. So we may say that soybean and linseed groups may be similar.
Design experiments for evaluating the performance of the NN, energy, and ball methods in various situations
In the following code, the first case, where unbalanced samples are with unequal variances and equal expectations is discussed.
library(RANN)## for nn function library(boot) library(energy) library(Ball) library(ggplot2) m <- 30#times to loop k<-3 p<-2#ncol # mu <- 0.5 n1 <- n2 <- 50#nrow R<-999#the number of replications in the bd.test function n <- n1+n2 N = c(n1,n2) Tn <- function(z, ix, sizes,k) { n1 <- sizes[1] n2 <- sizes[2] n <- n1 + n2 if(is.vector(z)) z <- data.frame(z,0) z <- z[ix, ] NN <- nn2(data=z, k=k+1) block1 <- NN$nn.idx[1:n1,-1] block2 <- NN$nn.idx[(n1+1):n,-1] i1 <- sum(block1 < n1 + .5); i2 <- sum(block2 > n1+.5) (i1 + i2) / (k * n) } eqdist.nn <- function(z,sizes,k){#NN boot.obj <- boot(data=z,statistic=Tn,R=R, sim = "permutation", sizes = sizes,k=k) ts <- c(boot.obj$t0,boot.obj$t) p.value <- mean(ts>=ts[1]) list(statistic=ts[1],p.value=p.value) } p.values <- matrix(NA,m,3)#to store p values for(i in 1:m) { x <- matrix(rnorm(n1 * p, sd = 1), ncol = p)#unequal variances y <- matrix(rnorm(n2 * p, sd = 1.4), ncol = p) z <- rbind(x, y) p.values[i, 1] <- eqdist.nn(z, N, k)$p.value#NN p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value#in the energy package p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i)$p.value#"Ball Divergence" in the ball package } alpha <- 0.1#confidence level pow <- apply(p.values<alpha,2,mean)#compute the number of p.values which is less than 0.1 in each column print(pow) power <- data.frame(methods = c('NN','energy','Ball'),pow) ggplot(power,aes(methods,pow))+#plot geom_col(fill = 'orange')+ coord_flip()
for(i in 1:m) { x <- matrix(rnorm(n1 * p, mean = 0.4, sd = 1), ncol = p)#unequal variances and unequal expectations y <- matrix(rnorm(n2 * p, mean = 0, sd = 1.4), ncol = p) z <- rbind(x, y) p.values[i, 1] <- eqdist.nn(z, N, k)$p.value p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i)$p.value } alpha <- 0.1 pow <- apply(p.values<alpha,2,mean) print(pow) power <- data.frame(methods = c('NN','energy','Ball'),pow) ggplot(power,aes(methods,pow))+ geom_col(fill = 'lightslategrey')+ coord_flip()
for(i in 1:m) { x <- matrix(rt(n1 * p,df = 1), ncol = p)# t distribution y <- matrix(rnorm(n2 * p,sd = sample(c(1,1.3),size = n2*p, prob = c(0.5,0.5),replace = T)), ncol = p)#bimodel distribution z <- rbind(x, y) p.values[i, 1] <- eqdist.nn(z, N, k)$p.value p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i)$p.value } alpha <- 0.01 pow <- apply(p.values<alpha,2,mean) print(pow) power <- data.frame(methods = c('NN','energy','Ball'),pow) ggplot(power,aes(methods,pow))+ geom_col(fill = 'mediumpurple')+ coord_flip()
n1 <- 50 n2 <- 5 n <- n1+n2 N = c(n1,n2) for(i in 1:m) { x <- matrix(rnorm(n1*p,mean = 1), ncol = p)#100 samples y <- matrix(rnorm(n2*p,mean = 2), ncol = 2)#10 samples z <- rbind(x, y) p.values[i, 1] <- eqdist.nn(z, N, k)$p.value p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i)$p.value } alpha <- 0.1 pow <- apply(p.values<alpha,2,mean) print(pow) power <- data.frame(methods = c('NN','energy','Ball'),pow) ggplot(power,aes(methods,pow))+ geom_col(fill = 'sienna2')+ coord_flip()
The comments are similar in every chunk, so I only put them in the first chunk.
A function is defined to implement NN method. In every question, a loop is run to compute p values of each method. In my code, confidence level $\alpha$ is defined as 0.1, and the number of p values which are less than 0.1 are obtained by the funcion apply. At last, I use ggplot2 to generate a figure to show the difference of power in three methods.
I have tried a lot of parameters to make the gap of power as large as possible, making the difference of power clear enough.
Use the $Metropolis-Hastings$ sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 of the chain, and compare the deciles of the generated observations with the deciles of the standard $Cauchy$ distribution(see $qcauchy$ or $qt$ with df=1). Recall that a $Cauchy(\theta,\eta)$ distribution has density function $$f(x)=\frac1{\theta\pi(1+[(x-\eta)/\theta]^2)},-\infty
Use the Metropolis-Hastings sampler to generate random variables from a $Cauchy$ distribution. Implementation of a Metropolis-Hastings sampler for this example is as follows. Note that the base of the array in $R$ is 1, so we initialize the chain at $X_0$ in $x[1]$.
(a) Generate $Y$ from $qt(df = X_t) =qt(df= x[i-1])$
N <- 5000 my_vector <- numeric(N) my_vector[1] <- rnorm(1) counter <- 0 #generates random deviates u <- runif(N) for (i in 2:N) { xt <- my_vector[i-1] y <- rnorm(1, mean = xt) #Density Cauchy distribution numerator <- dcauchy(y)*dnorm(xt, mean = y) denominator <- dcauchy(xt)*dnorm(y, mean = xt) if(u[i] <= numerator/denominator) { my_vector[i] <- y } else { my_vector[i] <- xt counter <- counter+1 } } plot(my_vector,type = "l") After1kDiscard <- my_vector[1000:N]
#Sequence Generation a <- ppoints(100) sequence <- seq(0,1,0.01) standardCauchy <- qcauchy(sequence) standardCauchy <- standardCauchy[(standardCauchy> -Inf) & (standardCauchy< Inf)] Q <- quantile(my_vector, a) qqplot(standardCauchy, Q, main="",xlab="standardCauchy Quantiles", ylab="Sample Quantiles") hist(After1kDiscard, freq = FALSE,main = "after discard the first 1000 of the chain") lines(standardCauchy, dcauchy(standardCauchy), lty = 2)
To see the generated sample as a realization of a stochastic process, we can plot the sample vs the time index. The codes above have displayed a partial plot starting at time index 0. And the final plot shows that at times the candidate point is rejected and the chain does not move at these time points; this corresponds to the short horizontal paths in the graph. The QQ plot is an informal approach to assessing the goodness-of-fit of the generated sample with the target distribution. From the plot, it appears that the sample quantiles are in approximate agreement with the theoretical quantiles.
$Rao$ [220, Sec. 5g] presented an example on genetic linkage of 197 animals in four categories (also discussed in [67, 106, 171, 266]). The group sizes are (125,18,20,34). Assume that the probabilities of the corresponding multinomial distribution are $(\frac12+\frac\theta4,\frac{1-\theta}4,\frac{1-\theta}4,\frac\theta4).$ Estimate the posterior distribution of $\theta$ given the observed sample, using one of the methods in this chapter.
we need to estimate the parameters $p_1,i=1,2,3,4.$ with the missing datas. And the following codes and the corresponding perform results will show some experiments.
p1=125/197 p2=18/197 p3=20/197 p4=34/197 my_prob <- c(p1,p2,p3,p4) number_of_experiments <- 10 number_of_samples <- 197 experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob) experiments
We hope to find the likelihood function to use the specify method to get the goal.Then get the full data to estimate the $\theta$ to produce the maximum function.And we interal continually to optimize the estimate value.
# defube expectation function estep <- function(theta,z2){ z2 = 125*(0.25*theta/(0.5+0.25*theta)) # z1 = 125*(0.5/(0.5+0.25*theta)) return(z2) } mstep <- function(theta,z2){ theta <- (z2+34)/(z2+34+18+20) return(theta) } # set initial value for iteration cur_theta = 0.5 maxit = 100 maxit=1000 tol=1e-6 # start iteration for(i in 1:maxit){ new_z2 <- estep(cur_theta,cur_z2) new_theta <- mstep(cur_theta,new_z2) # Stop iteration if the difference between the current and new estimates is less than a tolerance level if( all(abs(cur_theta - new_theta) < tol) ){ flag <- 1; break} # Otherwise continue iteration cur_theta <- new_theta cur_z2 <- new_z2 } if(!flag) warning("Didn??t converge\n") final_theta = cur_theta paste0("Final Theta = ", format(round(cur_theta, 4), nsmall = 4)) paste0("Final Z2 = ", format(round(cur_z2, 4), nsmall = 4))
From the whole process ,we has get the disdribution of the parameter $\theta.$ Then we estimate the density function of the observed data.
p1=125/197 p2=18/197 p3=20/197 p4=34/197 my_prob <- c(p1,p2,p3,p4) number_of_experiments <- 1 number_of_samples <- 1000 experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob) experiments
But the estimate of the density distribution based on the completed data is like that we will do.
p1=0.5 p2=0.25*final_theta p3=0.25-p2 p4=p3 p5=p2 my_prob <- c(p1,p2,p3,p4,p5) number_of_experiments <- 10 number_of_samples <- 1000 experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob) experiments number_of_experiments <- 1 number_of_samples <- 1000 experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob) experiments
For $Exercise 9.6$, use the $Gelman-Rubin$ method to monitor convergence of the chain, and run the chain until the chain has converged approximately to the target distribution according to $\hat{R}<1.2.$
The $Gelman-Rubin$ method of monitoring convergence of a $M-H$ chain is based on comparing the behavior of several generated chains with respect to the variance of one or more scalar summary statistics. The estimates of the variance of the statistic are analogous to estimates based on between-sample and within-sample mean squared errors in a one-way analysis of variance ($ANOVA$).
The between-sequence variance is $$B=\frac1{k-1}\sum_{i=1}^{k}\sum_{j=1}^{n}(\hat{\psi_{i.}}-\hat{\psi_{..}})^2=\frac{n}{k-1}\sum_{i=1}^{k}(\hat{\psi_{i.}}-\hat{\psi_{..}})^2$$
where $\hat{\psi_{i.}}=\frac1n\sum_{j=1}^{n}\psi_{ij} , \hat{\psi_{..}}=\frac1{nk}\sum_{i=1}^{k}\sum_{j=1}^{n}\psi_{ij}$
Within the $i^{th}$ sequence, the sample variance is $$s^2_i=\frac1n\sum_{j=1}^{n}(\psi_{ij}-\hat{\psi{i.}})^2$$ and the pooled estimate of within sample variance is $$W=\frac1{nk-k}\sum_{i=1}^{k}(n-1)s^2_i=\frac1k\sum_{i=1}^{k}s^2_i$$
The between-sequence and within-sequence estimates of variance are combined to estimate an upper bound for $V ar(\psi)$ is $$\hat{Var(\psi)}=\frac{n-1}{n}W+\frac1nB$$
The $Gelman-Rubin$ statistic is the estimated potential scale reduction $$\sqrt{\hat{R}}=\sqrt{\frac{\hat{Var(\psi)}}{W}}$$
which can be interpreted as measuring the factor by which the standard deviation of $\psi$ could be reduced by extending the chain. The factor $\hat{R}$ decreasesto 1 as the length of the chain tends to infinity, so $\hat{R}$ should be close to 1.ifthe chains have approximately converged to the target distribution. $Gelman$ suggests that $\hat{R}$ should be less than 1.1 or 1.2.
set.seed(1) Gelman.Rubin <- function(psi) { # psi[i,j] is the statistic psi(X[i,1:j]) # for chain in i-th row of X psi <- as.matrix(psi) n <- ncol(psi) k <- nrow(psi) psi.means <- rowMeans(psi) #row means B <- n * var(psi.means) #between variance est. psi.w <- apply(psi, 1, "var") #within variances W <- mean(psi.w) #within est. v.hat <- W*(n-1)/n + (B/n) #upper variance est. r.hat <- v.hat / W #G-R statistic return(r.hat) } theta <- .2 #parameter of proposal distribution k <- 4 #number of chains to generate N <- 5000 #length of chains b <- 500 #burn-in length M <- 5000 groupsize <- c(125,18,20,34) #group size prob <- function(theta,group){ if(theta<0||theta>=0.9) return(0) return((1/2+theta/4)^groupsize[1]*((1-theta)/4)^groupsize[2]*((1-theta)/4)^groupsize[3]*(theta/4)^groupsize[4]) } mn.chain <- function(groupsize,N,X1) { #generates a Metropolis chain #with Multinomial proposal distribution #and starting value X1 x <- numeric(N) x[1] <- X1#initialize x[1] w <- 0.25 u <- runif(N) v <- runif(M, -w, w) for (i in 2:N) { theta <- x[i - 1] + v[i] if (u[i] <= prob(theta, groupsize) / prob(x[i - 1], groupsize)) x[i] <- theta else x[i] <- x[i - 1] } return(x) } #choose overdispersed initial values x0 <- c(0.2, 0.4, 0.6, 0.8) #generate the chains X <- matrix(0, nrow=k, ncol=N) for (i in 1:k) X[i, ] <- mn.chain(groupsize,N,x0[i]) #compute diagnostic statistics psi <- t(apply(X, 1, cumsum)) for (i in 1:nrow(psi)) psi[i,] <- psi[i,] / (1:ncol(psi)) print(Gelman.Rubin(psi))
Then we can learn from the results:
The plots above are the four sequences of the summary statistic (the mean) $\psi$ showing from time 0 to 5000.
The dashed line on the plot is at $\hat{R}=1.2$,and we know that the according interations appropriately within 1000.
Find the intersection points $A(k)$ in $(0,\sqrt{k})$ of the curves $$S_{k-1}(a)=P(t(k-1)>\sqrt{\frac{a^2(k-1)}{k-a^2}})$$ and $$S_{k}(a)=P(t(k)>\sqrt{\frac{a^2k}{k+1-a^2}}),$$ for k = 4 : 25, 100, 500, 1000, where $t(k)$ is a Student $t$ random variable with $k$ degrees of freedom. (These intersection points determine the critical values for a t-test for scale-mixture errors proposed by $Sz??ekely [260].$)
The Idea is :
generate the inset function that the inside one is the function of a and the ourside one is the function of k.
first, we code the function to compute the $S_{k-1}(a)$ and $S_k(a)$ respectively.
then construct the function of $S_k(a)-S_{k-1}(a)$
and the next step is to find the intersection points through the function $uniroot$
set.seed(1) findIntersection = function (k) { s.k.minus.one = function (a) { 1-pt(sqrt(a^2 * (k - 1) / (k - a^2)), df = k-1) } s.k = function (a) { 1-pt(sqrt(a^2 * k / (k + 1 - a^2)), df = k) } f = function (a) { s.k(a) - s.k.minus.one(a) } eps = 1e-2 return(uniroot(f, interval = c(eps, sqrt(k)-eps))$root) } rs = sapply(c(4:25, 100, 500, 1000), function (k) { findIntersection(k) }) cbind(k=c(4:25, 100, 500, 1000),a=rs)
From the results ,we can get the different intersection points with the different k.And we find the intersection point changes slightly while the k has a significant increase.
Write a function to compute the $cdf$ of the $Cauchy$ distribution, which has density $$\frac1{\theta\pi(1+((x-\eta)/\theta)^2)},-\infty
If we knoow the density function of the specified distribution,then integtate this density function with the suitable interval then we will get the target $CDF$
To get the truth $CDF$, we utilize the function $sapply$ and $pcauchy$
my.dcauchy = function (x, eta, theta) { stopifnot(theta > 0) return(1/(theta*pi*(1 + ((x - eta)/theta)^2))) } my.pcauchy = function (x, eta, theta) { stopifnot(theta > 0) integral = function (x) { my.dcauchy(x, eta, theta) } return(integrate(integral, lower = -Inf, upper = x)$value) } eta = 0 theta = 2 xs = seq(-10, 10) estimate = sapply(xs, function(x) my.pcauchy(x, eta, theta)) truth = sapply(xs, function(x) pcauchy(x, eta, theta)) round(cbind(estimate, truth),4)
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:| |Genotype|AA|BB|OO|AO|BO|AB|AA| |Frequency|p2|q2|r2|2pr|2qr|2pq|1| |Count|nAA|nBB|nOO|nAO|nBO|nAB|n|
Observed data: $n_{A??} = n_{AA} + n_{AO} = 28$ (A-type), $n_{B??} = n_{BB} + n_{BO} = 24$ (B-type), $n_{OO} = 41$ (O-type), $n_{AB} = 70$ (AB-type).
Use EM algorithm to solve $MLE$ of $p$ and $q$ (consider missing data $n_{AA}$ and $n_{BB}$).
Record the maximum likelihood values in $M$-steps, are they increasing?
The $EM$ (Expectation Maximization) algorithm is a general optimization method that is often applied to find maximum likelihood estimates when data are incomplete.
First, we compute the likelihood function,then evaluate the log-likelihood of the data for any arbitrary set of allele frequencies.
Second, we can estimate the number of individuals carrying each possible genotype based on the observed counts for each phenotype and estimates of the allele frequencies at each iteration.And you can run the E-M function with the debug flag set to TRUE.
Then we set the initial value to be p=0.3,q=0.2,r=0.5, and then run the E-M function.
#set the likelihood function lnL <- function(p, q, nA =28, nB = 24, nAB = 70, nO = 41) { r = 1.0 - p - q nA * log(p^2 + 2*p*r) + nB * log(p^2 + 2 * q * r) + nAB * log(2 * p * q) + 2 * nO * log(r) } EM <- function (p,q,nA =28, nB = 24, nAB = 70, nO = 41, debug = FALSE) { # Evaluate the likelihood using initial estimates llk <- lnL(p, q, nA, nB, nAB, nO) # Count the number of iterations so far iter <- 1 # Loop until convergence ... while (TRUE) { # Estimate the frequency for allele O r= 1.0 - p - q # First we carry out the E-step # The counts for genotypes O/O and A/B are effectively observed # Estimate the counts for the other genotypes nAA <- nA * p / (p + 2*r) nAO <- nA - nAA nBB <- nB * q / (q + 2*r) nBO <- nB - nBB # Print debugging information if (debug) { cat("Round #", iter, "lnLikelihood = ", llk, "\n\n") cat(" Allele frequencies: p = ", p, ", q = ", q, ", r = ", r, "\n\n") cat(" Genotype counts: nAA = ", nAA, ", nAO = ", nAO, ", nBB = ", nBB, ", nBO = ", nBO, "\n\n") } # Then the M-step p <- (2 * nAA + nAO + nAB) / (2 * (nA + nB + nO + nAB)) q <- (2 * nBB + nBO + nAB) / (2 * (nA + nB + nO + nAB)) # Then check for convergence ... llk1 <- lnL(p, q, nA, nB, nAB, nO) if (abs(llk1 - llk) < (abs(llk) + abs(llk1)) * 1e-6) break # Otherwise keep going llk <- llk1 iter <- iter + 1 } cbind(p = p, q = q) } EM(0.3,0.2,nA =28, nB = 24, nAB = 70, nO = 41, debug = TRUE)
From the results above, we get the eatimated value is : $p=0.32734$, $q= 0.3104234$,
The log likelihood estimate is : $lnLikelihood = -261.3305 , -251.1735, -251.1293 ,-251.1236 ,-251.122$ which means the maximum likelihood estimates is monotonous increasing.
Use both for loops and lapply() to fit linear models to the mtcars using the formulas stored in this list:
lapply() is a wrapper for a common for loop pattern: create a container for output, apply f() to each component of a list, and fill the container with the results. All other for loop functionals are variations on this theme: they simply use different types of input or output.
unlist() can be used to convert the output from a list to a vector to make it more compact. But data frames are also lists, lapply() is also useful when you want to do something to each column of a data frame.
formulas <- list( mpg ~ disp, mpg ~ I(1 / disp), mpg ~ disp + wt, mpg ~ I(1 / disp) + wt ) # with lapply() print("***************************************** with lapply() ***********************************************************") lapply(formulas, lm, data = mtcars) # with a for loop print("***************************************** with a for loop *********************************************************") out_formulas <- vector('list', length(formulas)) for(i in seq_along(formulas)) { out_formulas[[i]] <- lm(formulas[[i]], data = mtcars) } out_formulas
From the results, we can easily see that the loop with a for is the same as with the lapply
We get the model form are :
[1] $mpg = 29.59985 - 0.04122disp$
[2] $mpg = 10.75 - 1557.67I(1 / disp)$
[3] $mpg = 34.96055 - 0.01772disp - 3.35083wt$
[4] $mpg = 19.024 - 1142.560I(1 / disp) - 1.798wt$
Fit the model $mpg$ ~ $disp$ to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function?
bootstraps <- lapply(1:10, function(i) { rows <- sample(1:nrow(mtcars), rep = TRUE) mtcars[rows, ] }) # with lapply() print("***************************************** with lapply() ***********************************************************") lapply(bootstraps, lm, formula = mpg~disp) # with a for loop print("***************************************** with a for loop *********************************************************") out_bootstraps <- vector('list', length(bootstraps)) for(i in seq_along(bootstraps)) { out_bootstraps[[i]] <- lm(mpg~disp, data = bootstraps[[i]]) } out_bootstraps
From the results, we can easily see that the loop with a for is the same as with the lapply.
We get the modle form are :
[1] $mpg = 28.96282 - 0.04149 disp$
[2] $mpg = 30.13856 - 0.04693 disp$
[3] $mpg = 27.09594 - 0.03264 disp$
[4] $mpg = 28.64848 - 0.03945 disp$
[5] $mpg = 29.94778 - 0.04477 disp$
[6] $mpg = 31.36958 - 0.04391 disp$
[7] $mpg = 28.92391 - 0.04218 disp$
[8] $mpg = 28.74858 - 0.03776 disp$
[9] $mpg = 28.70887 - 0.03956 disp$
[10] $mpg = 30.8249 - 0.0469 disp$
For each model in the previous two exercises, extract $R^2$ using the function below.
Use lapply function to complete the loop with different models
Then utilize the unlist function to extract the target values of $R^2$
rsq <- function(mod) summary(mod)$r.squared bootstraps <- lapply(1:10, function(i) { rows <- sample(1:nrow(mtcars), rep = TRUE) mtcars[rows, ] }) models <- lapply(bootstraps, function(x) lm(mpg ~ disp, data = x)) unlist(lapply(models, rsq))
The extracted values in different models are :
$R_1^2 = 0.6606581, R_2^2 = 0.8160798, R_3^2 = 0.5806448, R_4^2 = 0.7407087, R_5^2 = 0.7589143,$
$R_6^2 = 0.6561107, R_7^2 = 0.6021043, R_8^2 = 0.6888394, R_9^2 = 0.5989015, R_{10}^2 = 0.7239530.$
The following code simulates the performance of a $t-test$ for non-normal data. Use sapply() and an anonymous function to extract the $p$-value from every trial.
trials <- replicate( 100, t.test(rpois(10, 10), rpois(7, 10)), simplify = FALSE ) print("***************************************** with sapply() *******************************************************") sapply(trials, function(mod) mod$p.value) #Extra challenge: get rid of the anonymous function by using `[[` directly. print("*************************************** get rid of the anonymous function ********************************************") sapply(trials, `[[`, 'p.value')
From the results shown above, we can learn that the sapply and an anonymous function can extract the wanted p-values.
When getting rid of the anonymous function, we can also get the same results.
Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?
lapply() takes a function, applies it to each element in a list, and returns the results in the form of a list.
vapply() is very similar to lapply() except it simplify output to produce an atomic vector. vapply() is more verbose, but gives more informative error messages and never fails silently. It is better suited for use inside other functions. vapply() is an implementation of lapply() that assigns results to a vector (or matrix) of appropriate type instead of as a list.
With lapply(), only one argument to the function varies; the others are fixed. This makes it poorly suited for some problems.
There??s a natural equivalence between Map() and lapply() because you can always convert a Map() to an lapply() that iterates over indices.
library(parallel) mcvMap <- function(f, FUN.VALUE , ...) { out <- mcMap(f, ...) vapply(out, identity, FUN.VALUE) }
Make a faster version of $chisq.test()$ that only computes the $chi-square$ test statistic when the input is two numeric vectors with no missing values. You can try simplifying $chisq.test()$ or by coding from the mathematical definition ($http://en. wikipedia.org/wiki/Pearson%27s_chi-squared_test$).
getAnywhere(chisq.test)
chisq.test2 <- function(x, y){ # Input if (!is.numeric(x)) { stop("x must be numeric")} if (!is.numeric(y)) { stop("y must be numeric")} if (length(x) != length(y)) { stop("x and y must have the same length")} if (length(x) <= 1) { stop("length of x must be greater one")} if (any(c(x, y) < 0)) { stop("all entries of x and y must be greater or equal zero")} if (sum(complete.cases(x, y)) != length(x)) { stop("there must be no missing values in x and y")} if (any(is.null(c(x, y)))) { stop("entries of x and y must not be NULL")} # Help variables m <- rbind(x, y) margin1 <- rowSums(m) margin2 <- colSums(m) n <- sum(m) me <- tcrossprod(margin1, margin2) / n # Output x_stat = sum((m - me)^2 / me) dof <- (length(margin1) - 1) * (length(margin2) - 1) p <- pchisq(x_stat, df = dof, lower.tail = FALSE) return(list(x_stat = x_stat, df = dof, `p-value` = p)) }
a <- 21:25 b <- c(21, 23, 25, 27, 29) m <- cbind(a, b) m chisq.test(m) chisq.test2(a, b)
chisq.test2c <- compiler::cmpfun(chisq.test2) microbenchmark::microbenchmark( chisq.test(m), chisq.test2(a, b), chisq.test2c(a, b) )
Can you make a faster version of $table()$ for the case of an input of two integer vectors with no missing values? Can you use it to speed up your $chi-square$ test?
getAnywhere(table)
Then we code to speed up the $chi-square$ test.
table2 <- function(x, y) { x_val <- unique(x) y_val <- unique(y) mat <- matrix(0L, length(x_val), length(y_val)) for (i in seq_along(x)) { mat[which(x_val == x[[i]]), which(y_val == y[[i]])] <- mat[which(x_val == x[[i]]), which(y_val == y[[i]])] + 1L } dimnames <- list(x_val, y_val) names(dimnames) <- as.character(as.list(match.call())[-1]) # R has names for dimnames... :/ tab <- array(mat, dim = dim(mat), dimnames = dimnames) class(tab) <- "table" tab }
check:
a <- c(1, 2, 3) identical(table(a, a), table2(a, a))
we benchmark this implementation against a compiled version of itself and the original stats::table():
microbenchmark::microbenchmark(table(a, a), table2(a, a))
check
b <- c(2, 3, 4) identical(table(a, b), table2(a, b)) c <- c(1, 2, 3, 1, 2, 3) d <- c(2, 3, 4, 2, 3, 4) identical(table(c, d), table2(c, d)) e <- c(1, 2, 2) identical(table(a, e), table2(a, e)) identical(table(b, e), table2(b, e)) identical(table(e, e), table2(e, e)) f <- c(1, 1, 1) identical(table(f, f), table2(f, f)) identical(table(e, f), table2(e, f)) g <- c(1, 4, 9) identical(table(g, g), table2(g, g)) identical(table(g, f), table2(g, f)) h <- c(10, 20, 30) identical(table(h, h), table2(h, h)) identical(table(h, f), table2(h, f)) identical(table(h, g), table2(h, g)) i <- c(0, 0, 0) identical(table(i, i), table2(i, i)) identical(table(h, i), table2(h, i))
then we have:
j <- c(0, 10, 100, 1000, 10000) identical(table(j, j), table2(j, j)) microbenchmark::microbenchmark(table(j, j), table2(j, j))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.