A discrete random variable X has probability mass function

| x    | 0   | 1   | 2   | 3   | 4   |
|------|-----|-----|-----|-----|-----|
| p(x) | 0.1 | 0.2 | 0.2 | 0.2 | 0.3 |
Use the inverse transform method to generate a random sample of size 1000 from the distribution of X. Construct a relative frequency table and compare the empirical with the theoretical probabilities. Repeat using the R `sample` function.
```r
# Inverse transform method: inv_F maps a Uniform(0,1) value to X via the cdf
inv_F <- function(u) {
  if (u <= 0.1) 0
  else if (u <= 0.3) 1
  else if (u <= 0.5) 2
  else if (u <= 0.7) 3
  else 4
}
u <- runif(1000)                      # uniform(0,1) sample
sam <- sapply(u, inv_F)

table(sam)                            # relative frequency table
table(sam) / 1000                     # empirical probabilities
c(0.1, 0.2, 0.2, 0.2, 0.3)            # theoretical probabilities, for comparison
# the empirical probabilities are close to the theoretical ones

# Repeat using the R sample function
sam2 <- sample(0:4, size = 1000, replace = TRUE,
               prob = c(0.1, 0.2, 0.2, 0.2, 0.3))
table(sam2)                           # relative frequency table
table(sam2) / 1000                    # again close to the theoretical probabilities
```
Write a function to generate a random sample of size n from the Beta(a, b) distribution by the acceptance-rejection method. Generate a random sample of size 1000 from the Beta(3,2) distribution. Graph the histogram of the sample with the theoretical Beta(3,2) density superimposed.
```r
# Acceptance-rejection sampler for Beta(a, b) with a Uniform(0,1) proposal g;
# since x^(a-1) * (1-x)^(b-1) <= 1 on (0,1), the acceptance test below is valid
fun1 <- function(n, a, b) {
  k <- 0
  y <- numeric(n)
  while (k < n) {
    u <- runif(1)
    x <- runif(1)                          # candidate from g
    if (x^(a - 1) * (1 - x)^(b - 1) > u) { # accept x
      k <- k + 1
      y[k] <- x
    }
  }
  y
}

y <- fun1(1000, 3, 2)                      # sample of size 1000 from Beta(3,2)
hist(y, prob = TRUE)                       # histogram of the sample
z <- seq(0, 1, 0.01)
lines(z, 12 * z^2 * (1 - z))               # theoretical Beta(3,2) density superimposed
```
Simulate a continuous Exponential-Gamma mixture. Suppose that the rate parameter $\Lambda$ has a Gamma($r,\beta$) distribution and Y has an Exp($\Lambda$) distribution. That is, $(Y\mid\Lambda=\lambda)\sim f_Y(y\mid\lambda)=\lambda e^{-\lambda y}$. Generate 1000 random observations from this mixture with $r=4$ and $\beta=2$.
```r
n <- 1000
r <- 4
beta <- 2
lambda <- rgamma(n, r, beta)   # Lambda ~ Gamma(r, beta)
x <- rexp(n, lambda)           # Y | Lambda = lambda ~ Exp(lambda)
x                              # the 1000 generated mixture observations
```
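As a quick consistency check (my addition, not part of the original exercise), integrating $\lambda$ out of this mixture gives the Pareto (Lomax) marginal density $f(y)=\dfrac{r\beta^r}{(\beta+y)^{r+1}}$, $y>0$. The sketch below superimposes this density on a histogram of the simulated values, assuming the objects `x`, `r`, and `beta` from the chunk above.

```r
# Histogram of the mixture sample with the Pareto(r, beta) marginal density
# f(y) = r * beta^r / (beta + y)^(r + 1) superimposed, obtained by
# integrating out lambda; the two should agree closely.
hist(x, prob = TRUE, breaks = 50, xlim = c(0, 5), main = "Exp-Gamma mixture")
yy <- seq(0, 5, 0.01)
lines(yy, r * beta^r / (beta + yy)^(r + 1))
```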
Write a function to compute a Monte Carlo estimate of the Beta(3,3) cdf, and use the function to estimate F(x) for x = 0.1, 0.2, ..., 0.9. Compare the estimates with the values returned by the pbeta function in R.
First, write down the Beta(3,3) cdf, $$F(x)=\int_{0}^{x}30t^2(1-t)^2\,dt.$$ Writing the integral as $x\,E\big[30T^2(1-T)^2\big]$ with $T\sim U(0,x)$ gives the Monte Carlo estimate:
```r
MCF <- function(a) {
  m <- 1e4
  t <- runif(m, 0, a)                            # uniform sample on (0, a)
  theta.hat <- mean(30 * a * t^2 * (1 - t)^2)    # Monte Carlo estimate of F(a)
  print(theta.hat)
}
pv <- seq(0.1, 0.9, 0.1)
for (x in pv) {
  print(x)
  MCF(x)
  print(pbeta(x, 3, 3))          # compare with the pbeta value
}
```
The Rayleigh density [156, (18.76)] is $$f(x)=\frac{x}{\sigma^2}e^{-x^2/(2\sigma^2)},\quad x\ge 0,\ \sigma>0.$$
Implement a function to generate samples from a Rayleigh($\sigma$) distribution using antithetic variables. What is the percent reduction in variance of $\frac{X+X'}{2}$ compared with $\frac{X_1+X_2}{2}$ for independent $X_1$, $X_2$?
From the Rayleigh density we obtain the cdf $$F(x)=1-e^{-x^2/(2\sigma^2)},\quad x\ge 0,\ \sigma>0,$$ whose inverse is $$F^{-1}(u)=\sigma\sqrt{-2\ln(1-u)},\quad 0<u<1.$$ With $U\sim U(0,1)$ the antithetic pair is $X=F^{-1}(U)$ and $X'=F^{-1}(1-U)$, so $$\frac{X+X'}{2}=\frac{\sigma\sqrt{-2\ln(1-U)}+\sigma\sqrt{-2\ln U}}{2}.$$
```r
m <- 1e4
sigma <- 1                                   # take sigma = 1
u <- runif(m)
x1 <- sigma * sqrt(-2 * log(1 - u))          # X  = F^{-1}(U)
x2 <- sigma * sqrt(-2 * log(u))              # X' = F^{-1}(1 - U), antithetic
anti <- (x1 + x2) / 2                        # antithetic estimator (X + X')/2
v.anti  <- var(anti)
v.indep <- var(x1) / 2                       # var of (X1 + X2)/2 for independent X1, X2
c(v.anti, v.indep,
  percent.reduction = 100 * (v.indep - v.anti) / v.indep)   # compare the results
```
Find two importance functions f1 and f2 that are supported on $(1,\infty)$ and are "close" to $$g(x)=\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2},\quad x>1.$$ Which of your two importance functions should produce the smaller variance in estimating $$\int_{1}^{\infty}\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}\,dx$$ by importance sampling? Explain.
Take the two importance functions $$f_1(x)=\frac{1}{x^2},\quad x>1,\qquad f_2(x)=xe^{(1-x^2)/2},\quad x>1.$$ Their cdfs are $$F_1(x)=1-\frac{1}{x},\qquad F_2(x)=1-e^{(1-x^2)/2},\quad x>1,$$ with inverses $$F_1^{-1}(u)=\frac{1}{1-u},\qquad F_2^{-1}(u)=\sqrt{1-2\ln(1-u)},\qquad 0<u<1,$$ so samples from $f_1$ and $f_2$ can be generated by the inverse transform method with $U\sim U(0,1)$.
```r
m <- 1e4
u <- runif(m)
a <- 1 / (1 - u)                           # sample from f1 by inverse transform
b <- sqrt(1 - 2 * log(1 - u))              # sample from f2 by inverse transform
g1 <- a^4 * exp(-a^2 / 2) / sqrt(2 * pi)   # g(x)/f1(x) evaluated at the f1 sample
g2 <- b / sqrt(2 * pi * exp(1))            # g(x)/f2(x) evaluated at the f2 sample
theta1 <- mean(g1)                         # importance sampling estimates
theta2 <- mean(g2)
print(c(theta1, theta2))
var1 <- var(g1)                            # variances of the two integrands
var2 <- var(g2)
print(c(var1, var2))
```
Both estimators give approximately the same mean, but $f_2$ has a much smaller variance: the ratio $g(x)/f_2(x)=x/\sqrt{2\pi e}$ varies slowly over the region where $f_2$ puts most of its mass, whereas $g(x)/f_1(x)=x^4e^{-x^2/2}/\sqrt{2\pi}$ changes by orders of magnitude over $(1,\infty)$, which inflates the variance. That is, $f_2$ is the better importance function.
Obtain a Monte Carlo estimate of $$\int_{1}^{\infty}\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}\,dx$$ by importance sampling.
Use the importance function $$f(x)=xe^{(1-x^2)/2},\quad x>1,$$ with cdf $$F(x)=1-e^{(1-x^2)/2},\quad x>1,$$ and inverse $$F^{-1}(u)=\sqrt{1-2\ln(1-u)},\quad 0<u<1.$$ Then $$\frac{g(x)}{f(x)}=\frac{x}{\sqrt{2\pi e}},\quad x>1.$$
```r
m <- 1e4
u <- runif(m)
b <- sqrt(1 - 2 * log(1 - u))        # sample from f by inverse transform
ratio <- b / sqrt(2 * pi * exp(1))   # g(x)/f(x)
theta <- mean(ratio)                 # importance sampling estimate
print(theta)
var2 <- var(ratio)                   # variance of the integrand
print(var2)
```
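For reference (a check I am adding, not part of the original solution), this integral has a closed form: integrating $x^2\varphi(x)$ by parts gives $\int_1^{\infty}x^2\varphi(x)\,dx=\varphi(1)+1-\Phi(1)\approx 0.4006$, so the estimate above should be close to this value.

```r
# Closed-form value of the integral, for comparison with the importance
# sampling estimate: integral_1^Inf x^2 * dnorm(x) dx = dnorm(1) + 1 - pnorm(1)
dnorm(1) + pnorm(1, lower.tail = FALSE)   # about 0.4006
```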
Let X be a non-negative random variable with $\mu=E|X|<\infty$. For a random sample $x_1,\dots,x_n$ from the distribution of X, the Gini ratio is defined by $$G=\frac{1}{2n^2\mu}\sum_{j=1}^{n}\sum_{i=1}^{n}|x_i-x_j|.$$ The Gini ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that G can be written in terms of the order statistics $x_{(i)}$ as $$G=\frac{1}{n^2\mu}\sum_{i=1}^{n}(2i-n-1)x_{(i)}.$$ If the mean is unknown, let $\widehat{G}$ be the statistic G with $\mu$ replaced by $\overline{x}$. Estimate by simulation the mean, median and deciles of $\widehat{G}$ if X is standard lognormal. Repeat the procedure for the uniform distribution and Bernoulli(0.1). Also construct density histograms of the replicates in each case.
```r
m <- 1e3
n <- 100
G.hat <- numeric(m)
for (i in 1:m) {
  x <- rlnorm(n)                  # X is standard lognormal
  mu <- mean(x)
  sum0 <- 0
  for (j in 1:n) {
    for (k in 1:n) {
      sum0 <- sum0 + abs(x[j] - x[k])
    }
  }
  G.hat[i] <- sum0 / (2 * n^2 * mu)
}
c(G.mean = mean(G.hat), G.median = median(G.hat))
quantile(G.hat, probs = seq(0.1, 0.9, 0.1))   # deciles
hist(G.hat, freq = FALSE)
lines(density(G.hat))
```
```r
m <- 1e3
n <- 10
G.hat <- numeric(m)
for (i in 1:m) {
  x <- runif(n)                   # X ~ Uniform(0,1)
  mu <- mean(x)
  sum0 <- 0
  for (j in 1:n) {
    for (k in 1:n) {
      sum0 <- sum0 + abs(x[j] - x[k])
    }
  }
  G.hat[i] <- sum0 / (2 * n^2 * mu)
}
c(G.mean = mean(G.hat), G.median = median(G.hat))
quantile(G.hat, probs = seq(0.1, 0.9, 0.1))   # deciles
hist(G.hat, freq = FALSE)
lines(density(G.hat))
```
```r
m <- 1e3
n <- 100                          # larger n so a sample of all zeros (mean 0) is unlikely
G.hat <- numeric(m)
for (i in 1:m) {
  x <- rbinom(n, size = 1, prob = 0.1)   # X ~ Bernoulli(0.1)
  mu <- mean(x)
  sum0 <- 0
  for (j in 1:n) {
    for (k in 1:n) {
      sum0 <- sum0 + abs(x[j] - x[k])
    }
  }
  G.hat[i] <- sum0 / (2 * n^2 * mu)
}
c(G.mean = mean(G.hat), G.median = median(G.hat))
quantile(G.hat, probs = seq(0.1, 0.9, 0.1))   # deciles
hist(G.hat, freq = FALSE)
lines(density(G.hat))
```
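The order-statistic identity quoted in the exercise gives an equivalent and much faster way to compute $\widehat G$ than the double loop used above. A minimal sketch follows (the helper name `Gini` is my choice):

```r
# G.hat via the order-statistic formula G = sum((2i - n - 1) x_(i)) / (n^2 * xbar);
# it agrees with the double-sum version but runs in O(n log n) instead of O(n^2)
Gini <- function(x) {
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * sort(x)) / (n^2 * mean(x))
}
x <- rlnorm(100)
sum0 <- sum(abs(outer(x, x, "-")))
c(order.stat = Gini(x), double.sum = sum0 / (2 * length(x)^2 * mean(x)))
```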
Construct an approximate 95% confidence interval for the Gini ratio $\gamma=E[G]$ if X is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.
```r
n <- 50       # sample size behind each Gini estimate
m <- 100      # number of replicates used to build one confidence interval
k <- 200      # number of Monte Carlo repetitions for the coverage experiment

Grlnorm <- function(n) {
  x <- rlnorm(n)
  mu <- mean(x)
  sum0 <- 0
  for (i in 1:n) {
    for (j in 1:n) {
      sum0 <- sum0 + abs(x[i] - x[j])
    }
  }
  sum0 / (2 * n^2 * mu)
}

# one 95% confidence interval for gamma = E[G], based on m replicates of G
G1 <- replicate(m, Grlnorm(n))
gamma.mean <- mean(G1)
gamma.sd <- sd(G1)
a <- gamma.mean - qt(0.975, df = m - 1) * gamma.sd / sqrt(m)
b <- gamma.mean + qt(0.975, df = m - 1) * gamma.sd / sqrt(m)

# coverage: repeat the interval construction k times and count how often the
# intervals cover gamma (approximated by the mean of all k*m replicates)
lower <- upper <- numeric(k)
Gall <- matrix(replicate(k * m, Grlnorm(n)), nrow = k)
for (i in 1:k) {
  gm <- mean(Gall[i, ])
  gs <- sd(Gall[i, ])
  lower[i] <- gm - qt(0.975, df = m - 1) * gs / sqrt(m)
  upper[i] <- gm + qt(0.975, df = m - 1) * gs / sqrt(m)
}
gamma0 <- mean(Gall)              # Monte Carlo approximation of gamma
cat('The confidence interval for gamma is (', round(a, 4), ',', round(b, 4),
    '), and the estimated coverage rate is ',
    round(mean(lower < gamma0 & gamma0 < upper), 3) * 100, '%', sep = '')
```
Tests for association based on Pearson product moment correlation $\rho$, Spearman's rank correlation coefficient $\rho_s$, or Kendall's coefficient $\tau$, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution (X, Y) such that X and Y are dependent) such that at least one of the nonparametric tests has better empirical power than the correlation test against this alternative.
```r
library(MASS)
n <- 100
m <- 1000
pearson.p <- kendall.p <- spearman.p <- numeric(m)
sigma0 <- matrix(c(1, 0.2, 0.2, 1), ncol = 2)   # bivariate normal, correlation 0.2
mu <- c(0, 1)
for (i in 1:m) {
  xy <- mvrnorm(n, mu, sigma0)
  x <- xy[, 1]
  y <- xy[, 2]
  pearson.p[i]  <- cor.test(x, y, method = 'pearson')$p.value
  kendall.p[i]  <- cor.test(x, y, method = 'kendall')$p.value
  spearman.p[i] <- cor.test(x, y, method = 'spearman')$p.value
}
# empirical power: proportion of replicates that reject H0 at level 0.05
cat('pearson:', sum(pearson.p < 0.05) / m,
    'kendall:', sum(kendall.p < 0.05) / m,
    'spearman:', sum(spearman.p < 0.05) / m)
```
```r
# Alternative where X and Y are dependent but the dependence is contaminated
# by heavy-tailed (Cauchy) noise; the rank-based tests are robust to the
# outliers, while the Pearson test loses power
n <- 100
m <- 1000
pearson.p <- kendall.p <- spearman.p <- numeric(m)
for (i in 1:m) {
  x <- rnorm(n)
  y <- x + rcauchy(n)              # Y depends on X, with heavy-tailed errors
  pearson.p[i]  <- cor.test(x, y, method = 'pearson')$p.value
  kendall.p[i]  <- cor.test(x, y, method = 'kendall')$p.value
  spearman.p[i] <- cor.test(x, y, method = 'spearman')$p.value
}
# empirical power at level 0.05: the rank tests reject more often than Pearson
cat('pearson:', sum(pearson.p < 0.05) / m,
    'kendall:', sum(kendall.p < 0.05) / m,
    'spearman:', sum(spearman.p < 0.05) / m)
```
Compute a jackknife estimate of the bias and the standard error of the correlation statistic in Example 7.2.
```r
library(bootstrap)
n <- nrow(law)                    # sample size
theta.jack <- numeric(n)          # storage for replicates
b.cor <- function(x, i) cor(x[i, 1], x[i, 2])
theta.hat <- b.cor(law, 1:n)
for (i in 1:n) {
  theta.jack[i] <- b.cor(law, (1:n)[-i])   # leave-one-out replicate
}
bias.jack <- (n - 1) * (mean(theta.jack) - theta.hat)
se.jack <- sqrt((n - 1) * mean((theta.jack - theta.hat)^2))
round(c(original = theta.hat, bias = bias.jack, se = se.jack), 3)
```
Refer to Exercise 7.4. Compute 95% bootstrap confidence intervals for the mean time between failures $1/\lambda$ by the standard normal, basic, percentile, and BCa methods. Compare the intervals and explain why they may differ.
```r
library(boot)
set.seed(1)
mean(aircondit[, 1])                        # sample mean time between failures
boot.mean <- function(x, i) mean(x[i, 1])   # statistic: the sample mean
boot.out <- boot(data = aircondit, statistic = boot.mean, R = 1000)
boot.ci(boot.out, conf = 0.95)
```
The four 95% bootstrap confidence intervals differ because the methods rest on different assumptions: the standard normal and basic intervals assume the bootstrap distribution of the sample mean is roughly symmetric, while the percentile interval uses the empirical quantiles of the replicates and BCa additionally corrects for bias and skewness. Since the bootstrap distribution of the mean for these data is right-skewed, the intervals are shifted relative to one another.
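As a quick supporting check (my addition), one can look directly at the bootstrap replicates stored in `boot.out` from the chunk above:

```r
# The bootstrap replicates of the mean are noticeably right-skewed, which is
# why the percentile and BCa intervals differ from the normal-theory interval
b <- as.vector(boot.out$t)
c(mean = mean(b), median = median(b),
  skewness = mean((b - mean(b))^3) / sd(b)^3)
hist(b, breaks = 30, main = "Bootstrap replicates",
     xlab = "mean time between failures")
```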
Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.
```r
library(bootstrap)
n <- nrow(scor)
theta.jack <- numeric(n)
scor.pr <- function(x, i) {
  a <- eigen(cor(x[i, ]))
  a$values[1] / sum(a$values)     # proportion of variance explained by the first PC
}
theta.hat <- scor.pr(scor, 1:n)
for (j in 1:n) {
  theta.jack[j] <- scor.pr(scor, (1:n)[-j])   # leave-one-out replicate
}
bias.jack <- (n - 1) * (mean(theta.jack) - theta.hat)
se.jack <- sqrt((n - 1) * mean((theta.jack - theta.hat)^2))
round(c(original = theta.hat, bias = bias.jack, se = se.jack), 3)
```
In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.
```r
library(DAAG)
attach(ironslag)
n <- length(magnetic)             # in DAAG ironslag
e1 <- e2 <- e3 <- e4 <- numeric(n * (n - 1))   # two prediction errors per left-out pair
k <- 1
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    y <- magnetic[-c(i, j)]
    x <- chemical[-c(i, j)]       # leave-two-out training sample

    J1 <- lm(y ~ x)
    e1[k] <- magnetic[i] - (J1$coef[1] + J1$coef[2] * chemical[i])   # check the model

    J2 <- lm(y ~ x + I(x^2))
    e2[k] <- magnetic[i] - (J2$coef[1] + J2$coef[2] * chemical[i] + J2$coef[3] * chemical[i]^2)

    J3 <- lm(log(y) ~ x)
    e3[k] <- magnetic[i] - exp(J3$coef[1] + J3$coef[2] * chemical[i])

    J4 <- lm(log(y) ~ log(x))
    e4[k] <- magnetic[i] - exp(J4$coef[1] + J4$coef[2] * log(chemical[i]))
    k <- k + 1

    # prediction errors for the second left-out point
    e1[k] <- magnetic[j] - (J1$coef[1] + J1$coef[2] * chemical[j])
    e2[k] <- magnetic[j] - (J2$coef[1] + J2$coef[2] * chemical[j] + J2$coef[3] * chemical[j]^2)
    e3[k] <- magnetic[j] - exp(J3$coef[1] + J3$coef[2] * chemical[j])
    e4[k] <- magnetic[j] - exp(J4$coef[1] + J4$coef[2] * log(chemical[j]))
    k <- k + 1
  }
}
c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2))   # mean squared prediction errors
lm(formula = magnetic ~ chemical + I(chemical^2))   # fit the selected quadratic model
detach(ironslag)
```
According to the prediction error criterion, Model 2, the quadratic model, would be the best fit for the data. The fitted regression equation for Model 2 is $$\hat{Y}=24.49262-1.39334X+0.05452X^2$$
Compared with leave-one-out cross validation, leave-two-out cross validation reaches the same conclusion (the quadratic model is selected), but it is more computationally demanding, since a model must be refitted for every pair of left-out observations.
Implement the two-sample Cramer-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.
```r
set.seed(1)
attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
detach(chickwts)
n <- length(x)
m <- length(y)
R <- 999                          # number of permutation replicates
z <- c(x, y)                      # pooled sample
K <- 1:(n + m)
reps <- numeric(R)
cvm <- function(a, b) {
  # two-sample Cramer-von Mises statistic
  F <- ecdf(a)
  G <- ecdf(b)
  m * n * (sum((F(a) - G(a))^2) + sum((F(b) - G(b))^2)) / ((m + n)^2)
}
w0 <- cvm(x, y)                   # observed statistic
for (i in 1:R) {
  k <- sample(K, size = n, replace = FALSE)
  x1 <- z[k]
  y1 <- z[-k]
  reps[i] <- cvm(x1, y1)          # permutation replicate
}
p <- mean(c(w0, reps) >= w0)
round(p, 3)
```
Since the p-value is about 0.42, well above 0.05, we cannot reject the null hypothesis that the soybean and linseed weight distributions are equal.
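For a visual summary (my addition, using `reps` and `w0` from the chunk above), the observed statistic sits well inside the permutation distribution:

```r
# Permutation distribution of the Cramer-von Mises statistic with the
# observed value w0 marked; w0 is not in the upper tail, consistent with p ~ 0.42
hist(reps, breaks = 30, main = "Permutation distribution", xlab = "replicated statistics")
abline(v = w0, lwd = 2)
```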
Design experiments for evaluating the performance of the NN, energy, and ball methods in various situations.
Unequal variances and equal expectations
For this scenario we keep the two means equal and change only the standard deviation of the second sample.
```r
# Unequal variances and equal expectations
library(Ball)
library(energy)
library(boot)
library(RANN)
m <- 1e3
k <- 3
p <- 2
sd1 <- 1.5                        # second sample: same mean, larger standard deviation
set.seed(12345)
n1 <- n2 <- 20
R <- 999
n <- n1 + n2
N <- c(n1, n2)
Tn <- function(z, ix, sizes, k) {
  # nearest-neighbor statistic: proportion of the k nearest neighbours of each
  # point that come from the same sample
  n1 <- sizes[1]
  n2 <- sizes[2]
  n <- n1 + n2
  if (is.vector(z)) z <- data.frame(z, 0)
  z <- z[ix, ]
  NN <- nn2(data = z, k = k + 1)
  block1 <- NN$nn.idx[1:n1, -1]
  block2 <- NN$nn.idx[(n1 + 1):n, -1]
  i1 <- sum(block1 < n1 + .5)
  i2 <- sum(block2 > n1 + .5)
  (i1 + i2) / (k * n)
}
eqdist.nn <- function(z, sizes, k) {
  boot.obj <- boot(data = z, statistic = Tn, R = R,
                   sim = "permutation", sizes = sizes, k = k)
  ts <- c(boot.obj$t0, boot.obj$t)
  p.value <- mean(ts >= ts[1])
  list(statistic = ts[1], p.value = p.value)
}
p.values <- matrix(NA, m, 3)
for (i in 1:m) {
  x <- matrix(rnorm(n1 * p), ncol = p)
  y <- matrix(rnorm(n2 * p, sd = sd1), ncol = p)   # equal means, unequal variances
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i * 12345)$p.value
}   # p-values from the NN, energy, and ball methods
alpha <- 0.1
pow <- colMeans(p.values < alpha)
round(pow, 3)
```
Unequal variances and unequal expectations
For this scenario we change both the mean and the standard deviation of the second sample; the common set-up (packages, `Tn`, and `eqdist.nn`) is the same as in the previous chunk.
```r
# Unequal variances and unequal expectations
# (reuses the packages, Tn, and eqdist.nn defined in the previous chunk)
m <- 1e3
k <- 3
p <- 2
mu <- 0.5                         # mean shift for the second sample
sd1 <- 2                          # variance change for the second sample
set.seed(12345)
n1 <- n2 <- 20
R <- 999
n <- n1 + n2
N <- c(n1, n2)
p.values <- matrix(NA, m, 3)
for (i in 1:m) {
  x <- matrix(rnorm(n1 * p), ncol = p)
  y <- matrix(rnorm(n2 * p, mean = mu, sd = sd1), ncol = p)   # different mean and sd
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i * 12345)$p.value
}
alpha <- 0.1
pow <- colMeans(p.values < alpha)
round(pow, 3)
```
Non-normal distributions: t distribution with 1 df (heavy-tailed distribution), bimodal distribution (mixture of two normal distributions)
For the heavy-tailed case we change the degrees of freedom of the t distribution; the bimodal (normal mixture) case follows in the chunk after that.
```r
# Heavy-tailed case: t with 1 df vs. t with different df in one coordinate
# (reuses the packages, Tn, and eqdist.nn defined above)
m <- 1e3
df1 <- 1
df2 <- 2
k <- 3
p <- 2
set.seed(12345)
n1 <- n2 <- 20
R <- 999
n <- n1 + n2
N <- c(n1, n2)
p.values <- matrix(NA, m, 3)
for (i in 1:m) {
  x <- matrix(rt(n1 * p, df = df1), ncol = p)
  y <- cbind(rt(n2, df = df1), rt(n2, df = df2))
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i * 12345)$p.value
}
alpha <- 0.1
pow <- colMeans(p.values < alpha)
round(pow, 3)
```
```r
# Bimodal case: normal vs. a 50/50 mixture of N(0,1) and N(3,1)
# (reuses the packages, Tn, and eqdist.nn defined above)
m <- 1e3
k <- 3
p <- 2
set.seed(12345)
n1 <- n2 <- 20
R <- 999
n <- n1 + n2
N <- c(n1, n2)
p.values <- matrix(NA, m, 3)
for (i in 1:m) {
  x <- matrix(rnorm(n1 * p), ncol = p)
  y <- matrix(rnorm(n2 * p, mean = 3 * rbinom(n2 * p, 1, 0.5)), ncol = p)  # bimodal mixture
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i * 12345)$p.value
}
alpha <- 0.1
pow <- colMeans(p.values < alpha)
round(pow, 3)
```
Unbalanced samples
For unbalanced samples we use unequal sample sizes n1 and n2.
```r
# Unbalanced samples: both samples from the same distribution, n1 != n2
# (reuses the packages, Tn, and eqdist.nn defined above)
m <- 1e3
k <- 3
p <- 2
set.seed(12345)
n1 <- 20
n2 <- 30
R <- 999
n <- n1 + n2
N <- c(n1, n2)
p.values <- matrix(NA, m, 3)
for (i in 1:m) {
  x <- matrix(rnorm(n1 * p), ncol = p)
  y <- matrix(rnorm(n2 * p), ncol = p)
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i * 12345)$p.value
}
alpha <- 0.1
# the two samples share the same distribution here, so these rejection rates
# estimate the empirical type I error of each test under unbalanced sample sizes
pow <- colMeans(p.values < alpha)
round(pow, 3)
```
Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 values of the chain, and compare the deciles of the generated observations with the deciles of the standard Cauchy distribution (see qcauchy or qt with df = 1). Recall that a Cauchy($\theta,\eta$) distribution has density function $$f(x)=\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)},\quad -\infty<x<\infty,\ \theta>0.$$ The standard Cauchy has the Cauchy($\theta=1,\eta=0$) density. (Note that the standard Cauchy density is equal to the Student t density with one degree of freedom.)
```r
set.seed(1)
m <- 20000                        # length of the chain
x <- numeric(m)
x[1] <- runif(1)                  # initial value
standard_cauchy <- function(x) 1 / (pi * (1 + x^2))
for (i in 1:(m - 1)) {
  proposal <- x[i] + runif(1, min = -1, max = 1)   # symmetric random-walk proposal
  accept <- runif(1) <= standard_cauchy(proposal) / standard_cauchy(x[i])
  x[i + 1] <- ifelse(accept, proposal, x[i])
}
y <- x[1001:m]                    # discard the first 1000 values
rbind(chain  = quantile(y, probs = seq(0.1, 0.9, 0.1)),
      cauchy = qcauchy(seq(0.1, 0.9, 0.1)))   # compare the deciles
```
Rao [220, Sec. 5g] presented an example on genetic linkage of 197 animals in four categories (also discussed in [67, 106, 171, 266]). The group sizes are (125, 18, 20, 34). Assume that the probabilities of the corresponding multinomial distribution are $$\Big(\frac{1}{2}+\frac{\theta}{4},\frac{1-\theta}{4},\frac{1-\theta}{4},\frac{\theta}{4}\Big).$$ Estimate the posterior distribution of $\theta$ given the observed sample, using one of the methods in this chapter.
```r
set.seed(1)
m <- 10000                        # length of the chain
w <- 0.25                         # half-width of the random-walk proposal
u <- runif(m)
v <- runif(m, -w, w)
group <- c(125, 18, 20, 34)
x <- numeric(m)
prob <- function(theta) {
  # multinomial likelihood; theta must lie in (0, 1)
  if (theta <= 0 || theta >= 1) return(0)
  (1/2 + theta/4)^group[1] * ((1 - theta)/4)^group[2] *
    ((1 - theta)/4)^group[3] * (theta/4)^group[4]
}
x[1] <- 0.3                       # initial value
for (i in 2:m) {
  theta <- x[i - 1] + v[i]        # random-walk Metropolis proposal
  if (u[i] <= prob(theta) / prob(x[i - 1])) x[i] <- theta
  else x[i] <- x[i - 1]
}
theta.hat <- mean(x[1001:m])      # posterior mean after burn-in
print(theta.hat)
```
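As a plausibility check (my addition): with a flat prior on (0, 1), the posterior mode coincides with the maximum likelihood estimate, which solves the score equation $\frac{125}{2+\theta}-\frac{38}{1-\theta}+\frac{34}{\theta}=0$. The posterior mean from the chain above should be close to this root.

```r
# Root of the score equation of the multinomial log-likelihood
# 125*log(2 + theta) + 38*log(1 - theta) + 34*log(theta) (constants dropped);
# compare with theta.hat from the chain above
score <- function(theta) 125 / (2 + theta) - 38 / (1 - theta) + 34 / theta
uniroot(score, c(0.01, 0.99))$root
```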
For exercise 9.6, use the Gelman-Rubin method to monitor convergence of the chain, and run the chain until the chain has converged approximately to the target distribution according to $\hat{R}<1.2$.
```r
b <- 1000                         # burn-in length
m <- 10000                        # chain length after burn-in
N <- b + m

Gelman.Rubin <- function(psi) {
  # psi[i, j] is the statistic psi(X[i, 1:j]) for the chain in the i-th row of X
  psi <- as.matrix(psi)
  n <- ncol(psi)
  k <- nrow(psi)
  psi.means <- rowMeans(psi)                 # row means
  B <- n * var(psi.means)                    # between-chain variance estimate
  psi.w <- apply(psi, 1, "var")              # within-chain variances
  W <- mean(psi.w)                           # within estimate
  v.hat <- W * (n - 1) / n + (B / (n * k))   # upper variance estimate
  v.hat / W                                  # G-R statistic R.hat
}

iden <- function(theta) {
  # log-likelihood of the multinomial model (up to an additive constant)
  125 * log(2 + theta) + 38 * log(1 - theta) + 34 * log(theta)
}

MCMC <- function(x, print_acc = FALSE) {
  # independence Metropolis-Hastings sampler with Uniform(0,1) proposals
  y <- numeric(N)
  acc <- 0
  for (i in 1:N) {
    p <- runif(1)
    y[i] <- x <- if (runif(1) < exp(iden(p) - iden(x))) { acc <- acc + 1; p } else x
  }
  if (print_acc) print(acc / N)
  y[(b + 1):N]                    # discard burn-in
}

plot(MCMC(0.5, print_acc = TRUE))
M1 <- cumsum((MC1 <- MCMC(0.2))) / 1:m
M2 <- cumsum((MC2 <- MCMC(0.4))) / 1:m
M3 <- cumsum((MC3 <- MCMC(0.6))) / 1:m
M4 <- cumsum((MC4 <- MCMC(0.8))) / 1:m
psi <- rbind(M1, M2, M3, M4)
R <- sapply(2:100, function(i) Gelman.Rubin(psi[, 1:i]))
plot(R, type = "l", main = "R value of the Gelman-Rubin method", ylim = c(1, 2))
abline(h = 1.2, lty = 2)          # convergence threshold R.hat < 1.2
c(mean(c(MC1, MC2, MC3, MC4)), var(c(MC1, MC2, MC3, MC4)))
```
Find the intersection points A(k) in $(0,\sqrt{k})$ of the curves $$S_{k-1}(a)=P\Big(t(k-1)>\sqrt{\tfrac{a^2(k-1)}{k-a^2}}\Big)$$ and $$S_{k}(a)=P\Big(t(k)>\sqrt{\tfrac{a^2k}{k+1-a^2}}\Big)$$ for k = 4:25, 100, 500, 1000, where t(k) is a Student t random variable with k degrees of freedom. (These intersection points determine the critical values for a t-test for scale-mixture errors proposed by Székely [260].)
```r
b <- c(4:25, 100, 500, 1000)
alpha <- numeric(length(b))
for (i in seq_along(b)) {
  k <- b[i]
  inter <- function(a) {
    # difference of the two (log) tail probabilities S_{k-1}(a) - S_k(a)
    pt(sqrt(a^2 * (k - 1) / (k - a^2)), df = k - 1, lower.tail = FALSE, log.p = TRUE) -
      pt(sqrt(a^2 * k / (k + 1 - a^2)), df = k, lower.tail = FALSE, log.p = TRUE)
  }
  solution <- uniroot(inter, lower = 0.0001, upper = sqrt(k) - 0.0001)
  alpha[i] <- solution$root
  print(c(k = k, A = alpha[i]))   # intersection point A(k)
}
```
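A quick verification of the roots (my addition, using `alpha` from the chunk above): at $a=A(k)$ the two tail probabilities should agree.

```r
# Check for k = 4: S_{k-1}(A) and S_k(A) coincide at the root found above
k <- 4
a <- alpha[1]
c(S.kminus1 = pt(sqrt(a^2 * (k - 1) / (k - a^2)), df = k - 1, lower.tail = FALSE),
  S.k       = pt(sqrt(a^2 * k / (k + 1 - a^2)), df = k, lower.tail = FALSE))
```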
Write a function to compute the cdf of the Cauchy distribution, which has density $$f(x)=\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)},\quad -\infty<x<\infty,\ \theta>0.$$ Compare your results with the results from the R function pcauchy.
```r
cdfCauchy <- function(q, eta = 0, theta = 1) {
  # Cauchy cdf computed by numerical integration of the density
  pdf <- function(x, eta, theta) 1 / (theta * pi * (1 + ((x - eta) / theta)^2))
  result <- integrate(f = pdf, lower = -Inf, upper = q,
                      rel.tol = .Machine$double.eps^0.25,
                      eta = eta, theta = theta)
  result$value
}
```
Consider the standard Cauchy:
```r
x <- matrix(seq(-4, 4), ncol = 1)
cbind(x, apply(x, MARGIN = 1, FUN = cdfCauchy), pcauchy(x))
```
For $\eta=2$ and $\theta=1$:
```r
cbind(x, apply(x, MARGIN = 1, FUN = cdfCauchy, eta = 2), pcauchy(x, location = 2))
```
For $\eta=0$ and $\theta=2$:
```r
cbind(x, apply(x, MARGIN = 1, FUN = cdfCauchy, theta = 2), pcauchy(x, scale = 2))
```
```r
library(stats4)
# Observed blood-type counts and EM-style iteration for the allele frequencies p, q
nA <- 28; nB <- 24; nOO <- 41; nAB <- 70
n <- nA + nB + nOO + nAB
i <- 0
p1 <- 0.3; q1 <- 0.3              # initial values
p0 <- 0; q0 <- 0
options(warn = -1)
while (!isTRUE(all.equal(p0, p1, tolerance = 1e-7)) ||
       !isTRUE(all.equal(q0, q1, tolerance = 1e-7))) {   # EM algorithm
  p0 <- p1
  q0 <- q1
  mlogL <- function(p, q) {
    # minus log-likelihood of the completed data, with the expected homozygote
    # counts filled in using the current estimates p0, q0 (E-step)
    -(2 * n * p0^2 * log(p) + 2 * n * q0^2 * log(q) + 2 * nOO * log(1 - p - q) +
        (nA - n * p0^2) * log(2 * p * (1 - p - q)) +
        (nB - n * q0^2) * log(2 * q * (1 - p - q)) + nAB * log(2 * p * q))
  }
  fit <- mle(mlogL, start = list(p = 0.2, q = 0.3))   # M-step
  p1 <- fit@coef[1]
  q1 <- fit@coef[2]
  i <- i + 1
}
cat("iterations:", i, " p:", p1, " q:", q1, "\n")
```
Use both for loops and lapply() to fit linear models to the mtcars data using the formulas stored in this list:

```r
formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)
```
```r
set.seed(1)
attach(mtcars)
formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)
n <- length(formulas)
# for loop
for (i in 1:n) {
  print(lm(formulas[[i]]))
}
# lapply
lapply(formulas, lm)
# the fitted coefficients are the same either way
detach(mtcars)
```
Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function?
```r
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})
```
```r
set.seed(1)
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})
# for loop
for (i in 1:10) {
  print(lm(bootstraps[[i]]$mpg ~ bootstraps[[i]]$disp))
}
# lapply without an anonymous function: keep only the mpg and disp columns,
# so lm() can be applied to each data frame directly
for (i in 1:10) {
  bootstraps[[i]] <- subset(bootstraps[[i]], select = c(mpg, disp))
}
lapply(bootstraps, lm)
# the fitted coefficients are the same either way
```
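Another way to avoid the anonymous function (a suggestion of mine, not in the original answer) is to pass the formula by name, so each bootstrap data frame is matched to lm's `data` argument:

```r
# Each element of bootstraps is supplied positionally and matches lm's `data`
# argument, because `formula` is given as a named argument
lapply(bootstraps, lm, formula = mpg ~ disp)
```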
For each model in the previous two exercises, extract $R^2$ using the function below.
```r
rsq <- function(mod) summary(mod)$r.squared
```
```r
# Exercise 3
set.seed(1)
attach(mtcars)
formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)
n <- length(formulas)
# for loop
for (i in 1:n) {
  print(rsq(lm(formulas[[i]])))
}
# lapply
lapply(formulas, function(x) rsq(lm(x)))
# the values are equal
detach(mtcars)

# Exercise 4
set.seed(1)
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})
# for loop
for (i in 1:10) {
  print(rsq(lm(bootstraps[[i]]$mpg ~ bootstraps[[i]]$disp)))
}
# lapply
for (i in 1:10) {
  bootstraps[[i]] <- subset(bootstraps[[i]], select = c(mpg, disp))
}
lapply(bootstraps, function(x) rsq(lm(x)))
# the values are equal
```
The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.
```r
trials <- replicate(
  100,
  t.test(rpois(10, 10), rpois(7, 10)),
  simplify = FALSE
)
```
Extra challenge: get rid of the anonymous function by using [[ directly
```r
set.seed(1)
trials <- replicate(
  100,
  t.test(rpois(10, 10), rpois(7, 10)),
  simplify = FALSE
)
sapply(trials, function(x) x$p.value)   # p-value from every trial
```
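For the extra challenge, the anonymous function can be dropped by passing `[[` to sapply() directly:

```r
# `[[` is itself a function, so sapply can call it with "p.value" as the index
sapply(trials, `[[`, "p.value")
```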
Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?
```r
library(foreach)
# foreach iterates in parallel over a and b and row-binds the results
x <- foreach(a = 1:3, b = rep(10, 3), .combine = "rbind") %do% {
  x1 <- a + b
  x2 <- a * b
  c(x1, x2)
}
x
```
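The question asks specifically for a combination of Map() and vapply(); a minimal sketch is below (the name `lmapply` and the `FUN.VALUE` argument, which tells vapply() the type and length of each result, are my choices). Such a function needs the function to apply, any number of parallel input vectors, and a template describing the output of each call.

```r
# lapply() variant built from Map() and vapply(): Map() iterates over the
# inputs in parallel, vapply() simplifies the results into a vector or matrix
lmapply <- function(f, ..., FUN.VALUE, USE.NAMES = TRUE) {
  out <- Map(f, ...)
  vapply(out, identity, FUN.VALUE = FUN.VALUE, USE.NAMES = USE.NAMES)
}

# same computation as the foreach example above; returns a 2 x 3 matrix
lmapply(function(a, b) c(a + b, a * b), 1:3, rep(10, 3), FUN.VALUE = numeric(2))
```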
Make a faster version of chisq.test() that only computes the chi-square test statistic when the input is two numeric vectors with no missing values. You can try simplifying chisq.test() or coding from the mathematical definition (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test).
```r
library(microbenchmark)
a <- c(12, 24)
b <- c(25, 10)
e <- c(11, 30)
f <- c(40, 34)

# chisq.test
tableR <- cbind(a, b)
chisq1 <- chisq.test(tableR)$statistic

# a faster version of chisq.test: statistic only, with the same Yates correction
chisq.test2 <- function(x1, x2) {
  x <- cbind(x1, x2)
  sr <- rowSums(x)
  sc <- colSums(x)
  n <- sum(sr)
  E <- outer(sr, sc, "*") / n           # expected counts
  YATES <- min(0.5, abs(x - E))         # continuity correction for 2x2 tables
  sum((abs(x - E) - YATES)^2 / E)       # chi-square statistic
}
chisq2 <- chisq.test2(a, b)
round(c(chisq1, chisq2), 5)

chisq1 <- chisq.test(cbind(e, f))$statistic
chisq2 <- chisq.test2(e, f)
round(c(chisq1, chisq2), 5)

# chisq.test2 is faster
ts <- microbenchmark(chisq1 = chisq.test(tableR)$statistic,
                     chisq2 = chisq.test2(a, b))
summary(ts)[, c(1, 3, 5, 6)]
```
Can you make a faster version of table() for the case of an input of two integer vectors with no missing values? Can you use it to speed up your chi-square test?
```r
library(microbenchmark)
a <- c(0, 1, 2, 1, 1, 0, 0, 1)
b <- c(0, 0, 1, 1, 1, 0, 1, 1)

# a faster version of table() for two integer vectors with no missing values
table2 <- function(x, y) {
  x_val <- unique(x)
  y_val <- unique(y)
  mat <- matrix(0L, length(x_val), length(y_val))
  for (i in seq_along(x)) {
    mat[which(x_val == x[[i]]), which(y_val == y[[i]])] <-
      mat[which(x_val == x[[i]]), which(y_val == y[[i]])] + 1L
  }
  dimnames <- list(x_val, y_val)
  names(dimnames) <- as.character(as.list(match.call())[-1])  # R has names for dimnames
  tab <- array(mat, dim = dim(mat), dimnames = dimnames)
  class(tab) <- "table"
  tab
}

table(a, b)
table2(a, b)
identical(table(a, b), table2(a, b))

d <- c(1, 2, 3, 2, 2, 5, 3, 1)
f <- c(1, 2, 3, 5, 2, 2, 2, 3)
table(d, f)
table2(d, f)
identical(table(d, f), table2(d, f))

# table2 is faster
ts <- microbenchmark(table = table(d, f), table2 = table2(d, f))
summary(ts)[, c(1, 3, 5, 6)]
```