title: "Homework1" author: "18081" date: "2018年9月18日" output: html_document
Create a matrix
```{r}
cells <- c(16, 25, 32, 45)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
                   dimnames = list(rnames, cnames))
print(mymatrix)
```
Create a table:
```{r}
m1 <- matrix(1, nrow = 2, ncol = 2)
m2 <- matrix(2, nrow = 2, ncol = 2)
rbind(m1, m2)
cbind(m1, m2)
```
I can also embed plots, for example:
```{r}
library(vcd)
attach(Arthritis)
counts <- table(Treatment, Improved)
spine(counts, main = "Spinogram Example")
detach(Arthritis)
```
title: "A-18081-2018-09-21" author: "18081" date: "2018-9-23" output: html_document
Use the inverse transform method to generate a random sample of size 1000 from the distribution of $X$, where $X$ takes the values 0–4 with probabilities 0.1, 0.2, 0.2, 0.2, 0.3 (see the code below). Construct a relative frequency table and compare the empirical with the theoretical probabilities. Repeat using the R `sample` function.
```{r}
x <- 0:4
p <- c(.1, .2, .2, .2, .3)
cp <- cumsum(p)
m <- 1e3
# inverse transform: locate each uniform draw among the cumulative probabilities
r <- x[findInterval(runif(m), cp) + 1]
ct <- as.vector(table(r))
bili <- ct / sum(ct) / p  # ratio of empirical to theoretical probabilities
cat(bili)
# repeat using the R sample function
s <- sample(0:4, size = 1000, replace = TRUE, prob = c(.1, .2, .2, .2, .3))
print(s[1:100])  # to save space, only the first 100 values are shown
```
Write a function to generate a random sample of size n from the Beta(a,b) distribution by the acceptance-rejection method. Generate a random sample of size 1000 from the Beta(3,2) distribution. Graph the histogram of the sample with the theoretical Beta(3,2) density superimposed.
```{r}
n <- 1e3
j <- 0  # counts the total number of proposals
k <- 0  # counts the number of accepted samples
y <- numeric(n)
while (k < n) {
  u <- runif(1)
  j <- j + 1
  x <- runif(1)  # random variate from the proposal g = Uniform(0,1)
  if (x^2 * (1 - x) > u) {
    # accept x: with f the Beta(3,2) density and c = 12, f(x)/(c*g(x)) = x^2*(1-x)
    k <- k + 1
    y[k] <- x
  }
}
j  # number of proposals needed for n accepted values
hist(y, prob = TRUE)
z <- seq(0, 1, .01)
lines(z, 12 * z^2 * (1 - z))  # theoretical Beta(3,2) density: 12*x^2*(1-x)
```
Simulate a continuous Exponential-Gamma mixture. Suppose that the rate parameter $\Lambda$ has a Gamma($r,\beta$) distribution and $Y$ has an Exp($\Lambda$) distribution. That is, $(Y\mid\Lambda=\lambda)\sim f_Y(y\mid\lambda)=\lambda e^{-\lambda y}$. Generate 1000 random observations from this mixture with $r=4$ and $\beta=2$.
```{r}
n <- 1e3
r <- 4
beta <- 2
lambda <- rgamma(n, r, beta)  # draw the rate parameter from Gamma(r, beta)
y <- rexp(n, lambda)          # then draw Y | lambda from Exp(lambda)
y[1:10]
```
title: "A-18048-2018-09-30" author: 'by 18081' date: "2018.09.30" output: html_document
Write a function to compute a Monte Carlo estimate of the $Beta(3,3)$ cdf, and use the function to estimate $F(x)$ for $x = 0.1, 0.2, \ldots, 0.9$. Compare the estimates with the values returned by the `pbeta` function in R.
We know the probability density function of the Beta distribution is $f(x;a,b)=\frac{1}{B(a,b)}x^{a-1}(1-x)^{b-1}$, in which $B(a,b)=\int_0^1x^{a-1}(1-x)^{b-1}dx$.
```{r}
# Monte Carlo estimate of the Beta(3,3) cdf at x
F <- function(x) {
  a <- 3
  b <- 3
  m <- 1e4
  y <- runif(m, min = 0, max = x)
  # x * E[f(Y)] with Y ~ Uniform(0,x) estimates the integral of the density over (0,x)
  F.hat <- x * mean(1 / beta(a, b) * y^(a - 1) * (1 - y)^(b - 1))
  return(F.hat)
}
# evaluate at x = 0.1, 0.2, ..., 0.9
x_seq <- seq(.1, .9, by = .1)
F_seq <- numeric(length(x_seq))
P_seq <- numeric(length(x_seq))
for (i in 1:length(x_seq)) {
  F_seq[i] <- F(x_seq[i])            # Monte Carlo estimate
  P_seq[i] <- pbeta(x_seq[i], 3, 3)  # exact value from pbeta
}
# table comparing the estimates with pbeta
print(round(rbind(x_seq, F_seq, P_seq), 3))
```
The Rayleigh density is $$f(x)=\frac{x}{\sigma^2}e^{-x^2/(2\sigma^2)},\quad x\geq 0,\ \sigma>0.$$
Implement a function to generate samples from a Rayleigh($\sigma$) distribution, using antithetic variables. What is the percent reduction in variance of $\frac{X+X'}{2}$ compared with $\frac{X_1+X_2}{2}$ for independent $X_1$, $X_2$?
```{r}
set.seed(123)
R <- 1000
# generate R Rayleigh(1) variates by inverse transform, optionally antithetic
Ray <- function(R, antithetic = TRUE) {
  u <- runif(R / 2)
  if (!antithetic) v <- runif(R / 2) else v <- 1 - u
  u <- c(u, v)
  return(mean(sqrt(-2 * log(1 - u))))  # inverse cdf of Rayleigh(sigma = 1)
}
m <- 1000
Ray1 <- Ray2 <- numeric(m)
for (i in 1:m) {
  Ray1[i] <- Ray(R, FALSE)
  Ray2[i] <- Ray(R, TRUE)
}
print(sd(Ray1))  # X1 and X2 independent
print(sd(Ray2))  # X and X' antithetic
print((var(Ray1) - var(Ray2)) / var(Ray1))  # percent reduction in variance
```
Find two importance functions $f_1$ and $f_2$ that are supported on $(1,\infty)$ and are 'close' to $$g(x)=\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2},\quad x>1.$$ Which of your two importance functions should produce the smaller variance in estimating $$\int_1^\infty\frac{x^2}{\sqrt{2\pi}}e^{-x^2/2}dx\ ?$$
After simplifying, I obtain the following two functions:
$$f_1(x)=\frac{x^2}{\sqrt{2\pi}},\quad 1<x<\infty$$
$$f_2(x)=xe^{-x^2/2},\quad 1<x<\infty$$
First, both $f_1$ and $f_2$ satisfy $f(x)>0$ on $\{x:g(x)>0\}$. Second, $g(x)/f_1(x)=e^{-x^2/2}$ and $g(x)/f_2(x)=\frac{x}{\sqrt{2\pi}}$. The ratio $g/f_2$ varies much less over the region where the sample is concentrated, so $f_2$ should produce the smaller variance.
Obtain a Monte Carlo estimate of $$ \int_1^\infty\frac{x^2}{\surd(2\pi)}e^{-x^2/2}dx$$ by importance sampling.
Based on the comparison above, I chose $f_2$ as the importance function: $$f_2(x)=xe^{-x^2/2},\quad 1<x<\infty.$$
```{r}
set.seed(12345)
# integrand, restricted to x > 1
g <- function(x) {
  (x^2 * exp(-x^2 / 2)) / sqrt(2 * pi) * (x > 1)
}
# importance function f2
f <- function(x) {
  x * exp(-x^2 / 2)
}
R <- 10000
# sample from f2 (the Rayleigh(1) density) by inverse transform
Ray_cdf <- function(R, antithetic = TRUE) {
  u <- runif(R / 2)
  if (!antithetic) v <- runif(R / 2) else v <- 1 - u
  u <- c(u, v)
  return(sqrt(-2 * log(1 - u)))
}
x <- Ray_cdf(R, FALSE)
fg <- g(x) / f(x)
theta.hat <- mean(fg)
se <- sd(fg)
print(theta.hat)
print(se)
```
To verify the correctness of the code, I compared the result with R's built-in `integrate` function.
```{r}
ff <- function(x) {
  (x^2 * exp(-x^2 / 2)) / sqrt(2 * pi)
}
theta.right <- integrate(ff, 1, Inf)
print(theta.right)
```
title: "A-18081-2018-10-12" author: 'by 18081' date: "2018.10.13" output: html_document
Let $X$ be a non-negative random variable with $\mu=E[X]<\infty$. For a random sample $x_1,x_2,\ldots,x_n$ from the distribution of $X$, the Gini ratio is defined by $$G = \frac{1}{2n^2\mu}\sum_{j=1}^n\sum_{i=1}^n|x_i-x_j|.$$ The Gini ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that $G$ can be written in terms of the order statistics $x_{(i)}$ as $$G = \frac{1}{n^2\mu}\sum_{i=1}^n(2i-n-1)x_{(i)}.$$ If the mean is unknown, let $\hat{G}$ be the statistic $G$ with $\mu$ replaced by $\bar{x}$. Estimate by simulation the mean, median and deciles of $\hat{G}$ if $X$ is standard lognormal. Repeat the procedure for the uniform distribution and Bernoulli(0.1). Also construct density histograms of the replicates in each case.
```{r}
set.seed(123)
G.est <- function(type = "xtype") {
  n <- 1000
  m <- 1000
  Sum <- numeric(m)
  G.hat <- numeric(m)
  for (j in 1:m) {
    if (type == "rlnorm") {
      x <- rlnorm(n)         # standard lognormal
    } else if (type == "runif") {
      x <- runif(n)          # uniform distribution
    } else if (type == "rbinom") {
      x <- rbinom(n, 1, .1)  # Bernoulli(0.1)
    }
    mu.hat <- mean(x)
    x_order <- sort(x)
    for (i in 1:n) {
      Sum[j] <- Sum[j] + (2 * i - n - 1) * x_order[i]
    }
    G.hat[j] <- Sum[j] / (n^2 * mu.hat)
  }
  print(mean(G.hat))
  print(median(G.hat))
  print(quantile(G.hat, seq(.1, .9, length = 9)))
  hist(G.hat, prob = TRUE, main = paste("G.hat,", type))
}
G.est("rlnorm")  # X is standard lognormal
G.est("runif")   # X is uniform
G.est("rbinom")  # X is Bernoulli(0.1)
```
Construct an approximate 95% confidence interval for the Gini ratio $\gamma=E[G]$ if $X$ is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.
```{r}
set.seed(123)
n <- 1000
m <- 1000
Sum <- numeric(m)
G.hat <- numeric(m)
for (j in 1:m) {
  x <- rlnorm(n)
  mu.hat <- mean(x)
  x_order <- sort(x)
  for (i in 1:n) {
    Sum[j] <- Sum[j] + (2 * i - n - 1) * x_order[i]
  }
  G.hat[j] <- Sum[j] / (n^2 * mu.hat)
}
# normal-approximation confidence interval
G.mu <- mean(G.hat)
G.sigma <- sd(G.hat)
CT1 <- G.mu - 1.96 * G.sigma
CT2 <- G.mu + 1.96 * G.sigma
print(c(CT1, CT2))
# coverage: proportion of replicates falling inside the interval
h <- 0
for (j in 1:m) {
  if (CT1 < G.hat[j] & G.hat[j] < CT2) h <- h + 1
}
print(h / m)
```
Tests for association based on Pearson product moment correlation $\rho$, Spearman's rank correlation coefficient $\rho_s$, or Kendall's coefficient $\tau$ are implemented in `cor.test`. Show (empirically) that the nonparametric tests based on $\rho_s$ and $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution $(X,Y)$ such that $X$ and $Y$ are dependent) such that at least one of the nonparametric tests has better empirical power than the correlation test against this alternative.
```{r}
set.seed(123)
library(MASS)
options(digits = 3)
m <- 1000
nump <- numk <- nums <- numeric(m)
p1 <- p2 <- p3 <- numeric(m)
for (i in 1:m) {
  mean <- c(0, 1)
  sigma <- matrix(c(1, 0.15, 0.15, 1), ncol = 2, nrow = 2)  # correlation 0.15
  data <- mvrnorm(n = 500, mean, sigma)
  x <- data[, 1]
  y <- data[, 2]
  p1[i] <- cor.test(x, y, method = "pearson")$p.value
  p2[i] <- cor.test(x, y, method = "kendall")$p.value
  p3[i] <- cor.test(x, y, method = "spearman")$p.value
}
nump[p1 < 0.05] <- 1
numk[p2 < 0.05] <- 1
nums[p3 < 0.05] <- 1
ratio <- c(sum(nump), sum(numk), sum(nums)) / m
print(ratio)
barplot(ratio, col = c("red", "green", "green"),
        names.arg = c("pearson", "kendall", "spearman"),
        main = "Power of a Test")
```
Explanation: From the simulation results and the chart, the power of the Pearson test is 0.922, the power of the Kendall test is 0.833, and the power of the Spearman test is 0.833, which means the nonparametric tests based on $\rho_s$ and $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal.
```{r}
set.seed(123)
library(MASS)
m <- 1000
n <- 1000
nump <- numk <- nums <- numeric(m)
p1 <- p2 <- p3 <- numeric(m)
for (i in 1:m) {
  mean <- c(0, 1)
  sigma <- matrix(c(1, 0, 0, 3), ncol = 2, nrow = 2)
  data <- mvrnorm(n, mean, sigma)
  x <- data[, 1]
  y <- data[, 2] + runif(n, min = 0, max = 0.2)  # add uniform noise to y
  p1[i] <- cor.test(x, y, method = "pearson")$p.value
  p2[i] <- cor.test(x, y, method = "kendall")$p.value
  p3[i] <- cor.test(x, y, method = "spearman")$p.value
}
nump[p1 < 0.05] <- 1
numk[p2 < 0.05] <- 1
nums[p3 < 0.05] <- 1
ratio <- c(sum(nump), sum(numk), sum(nums)) / m
print(ratio)
barplot(ratio, col = c("red", "green", "green"),
        names.arg = c("pearson", "kendall", "spearman"),
        main = "Power of a Test")
```
Explanation: From the simulation results and the chart, the power of the Pearson test is 0.046, the power of the Kendall test is 0.050, and the power of the Spearman test is 0.051, which means that against this alternative the nonparametric tests based on $\rho_s$ and $\tau$ have slightly better empirical power than the correlation test.
title: "A-18081-2018-11-02" author: 'by 18081' date: "2018.10.24" output: html_document
Compute a jackknife estimate of the bias and the standard error of the correlation statistic in Example 7.2.
```{r}
library(bootstrap)  # for the law data
r <- cor(law$LSAT, law$GPA)
n <- length(law$LSAT)
R <- numeric(n)  # storage for the leave-one-out replicates
for (i in 1:n) {
  R[i] <- cor(law$LSAT[-i], law$GPA[-i])
}
bias <- (n - 1) * (mean(R) - r)                 # jackknife estimate of bias
se.R <- sqrt((n - 1) * mean((R - mean(R))^2))   # jackknife estimate of se
print(bias)
print(se.R)
hist(R, prob = TRUE, col = "pink")
```
Refer to the air-conditioning data set `aircondit` provided in the boot package. The 12 observations are the times in hours between failures of air-conditioning equipment: $$3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487.$$ Assume that the times between failures follow an exponential model Exp($\lambda$). Compute 95% bootstrap confidence intervals for the mean time between failures $1/\lambda$ by the standard normal, basic, percentile, and BCa methods. Compare the intervals and explain why they may differ.
```{r}
set.seed(123)
library(boot)
Data <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
mu <- function(Data, i) {
  return(mean(Data[i]))  # sample mean of the bootstrap resample
}
bootobject <- boot(data = Data, statistic = mu, R = 1000)
print(bootobject)
plot(bootobject)
boot.ci(bootobject, type = c("basic", "norm", "perc", "bca"))
```
"Norm":To apply the normal distribution, we assume that the distribution of $\hat{\theta}$ is normal and the sample size is large, but in this question,the sample size is small.
"Basic":The basic bootstrap confidence interval transforms the distribution of the replicates by subtracting the observed statistic.
"Quantiles":The quantiles of the empiricaldistribution are estimators of the quantiles of the sampling distribution of $\hat{\theta}$,so that these (random) quantiles may match the true distribution better when the distribution of $\hat{\theta}$ is not normal.
"BCa":Better bootstrap confidence intervals are a modified version of percentile intervals that have better theoretical properties and better performance in practice.
Efron and Tibshirani discuss the `scor` (bootstrap) test score data on 88 students who took examinations in five subjects. The first two tests (mechanics, vectors) were closed book and the last three tests (algebra, analysis, statistics) were open book. Each row of the data frame is a set of scores $(x_{i1},\ldots,x_{i5})$ for the $i$th student. The five-dimensional scores data have a $5\times5$ covariance matrix $\Sigma$ with positive eigenvalues $\lambda_1>\cdots>\lambda_5$. Compute the sample estimate $$\hat{\theta}=\frac{\hat{\lambda}_1}{\sum_{j=1}^{5}\hat{\lambda}_j}.$$ Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.
```{r}
library(bootstrap)
n <- nrow(scor)
theta.hat <- numeric(n)
# sample estimate from the full data
sigma <- cov(scor)
results <- eigen(sigma)
theta <- results$values[1] / sum(results$values)
# leave-one-out replicates
for (i in 1:n) {
  sigma <- cov(scor[-i, ])
  result <- eigen(sigma)
  theta.hat[i] <- result$values[1] / sum(result$values)
}
bias <- (n - 1) * (mean(theta.hat) - theta)
se <- sqrt((n - 1) * mean((theta.hat - mean(theta.hat))^2))
print(bias)
print(se)
hist(theta.hat, col = "lightgreen")
```
In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.
```{r}
library(ggplot2)
library(DAAG)
attach(ironslag)
n <- length(magnetic)  # in DAAG ironslag
e1 <- e2 <- e3 <- e4 <- numeric(n - 1)
# fit models on leave-two-out samples (here: consecutive pairs k, k+1)
for (k in 1:(n - 1)) {
  y <- magnetic[-c(k, k + 1)]
  x <- chemical[-c(k, k + 1)]

  J1 <- lm(y ~ x)
  yhat1_1 <- J1$coef[1] + J1$coef[2] * chemical[k]
  yhat1_2 <- J1$coef[1] + J1$coef[2] * chemical[k + 1]
  e1[k] <- (magnetic[k] - yhat1_1)^2 + (magnetic[k + 1] - yhat1_2)^2

  J2 <- lm(y ~ x + I(x^2))
  yhat2_1 <- J2$coef[1] + J2$coef[2] * chemical[k] + J2$coef[3] * chemical[k]^2
  yhat2_2 <- J2$coef[1] + J2$coef[2] * chemical[k + 1] + J2$coef[3] * chemical[k + 1]^2
  e2[k] <- (magnetic[k] - yhat2_1)^2 + (magnetic[k + 1] - yhat2_2)^2

  J3 <- lm(log(y) ~ x)
  yhat3_1 <- exp(J3$coef[1] + J3$coef[2] * chemical[k])
  yhat3_2 <- exp(J3$coef[1] + J3$coef[2] * chemical[k + 1])
  e3[k] <- (magnetic[k] - yhat3_1)^2 + (magnetic[k + 1] - yhat3_2)^2

  J4 <- lm(log(y) ~ log(x))
  yhat4_1 <- exp(J4$coef[1] + J4$coef[2] * log(chemical[k]))
  yhat4_2 <- exp(J4$coef[1] + J4$coef[2] * log(chemical[k + 1]))
  e4[k] <- (magnetic[k] - yhat4_1)^2 + (magnetic[k + 1] - yhat4_2)^2
}
model <- c("1_Linear", "2_Quadratic", "3_Exponential", "4_Log_Log")
pred_error <- c(mean(e1), mean(e2), mean(e3), mean(e4))
data <- data.frame(model, pred_error)
print(pred_error)
ggplot(data, aes(x = model, y = pred_error)) +
  geom_bar(stat = "identity", fill = "lightblue", colour = "black")
```
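The loop above only scores the $n-1$ consecutive pairs $(k, k+1)$. A full leave-two-out cross validation would average the squared prediction errors over all $\binom{n}{2}$ held-out pairs. A minimal sketch for the linear model only (the other three models would be handled the same way):

```{r}
# sketch: exhaustive leave-two-out CV for the linear model,
# averaging over all choose(n, 2) held-out pairs
pairs <- combn(n, 2)
e1.all <- numeric(ncol(pairs))
for (j in 1:ncol(pairs)) {
  idx <- pairs[, j]
  fit <- lm(magnetic[-idx] ~ chemical[-idx])
  pred <- fit$coef[1] + fit$coef[2] * chemical[idx]
  e1.all[j] <- mean((magnetic[idx] - pred)^2)
}
mean(e1.all)  # leave-two-out prediction error for the linear model
```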
title: "A-18081-2018-11-23" author: 'by 18081' date: "2018.11.14" output: html_document
Implement the two-sample Cramér-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.
The Cramér-von Mises statistic is defined by $$W_2=\frac{mn}{(m+n)^2}\Bigg[\sum_{i=1}^{n}(F_n(x_i)-G_m(x_i))^2+\sum_{i=1}^{m}(F_n(y_i)-G_m(y_i))^2\Bigg]$$
```{r}
set.seed(12)
# the two-sample Cramer-von Mises statistic
CVM <- function(x, y) {
  n <- length(x)
  m <- length(y)
  Fn <- ecdf(x)
  Gm <- ecdf(y)
  W <- (m * n) / (m + n)^2 * (sum((Fn(x) - Gm(x))^2) + sum((Fn(y) - Gm(y))^2))
  return(W)
}
attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
detach(chickwts)
R <- 999          # number of permutation replicates
z <- c(x, y)      # pooled sample of size 14 + 12 = 26
K <- 1:26
reps <- numeric(R)  # storage for replicates
t0 <- CVM(x, y)     # observed statistic for the data in Examples 8.1 and 8.2
for (i in 1:R) {
  # generate indices k for the first sample
  k <- sample(K, size = 14, replace = FALSE)
  x1 <- z[k]
  y1 <- z[-k]  # complement forms the second sample
  reps[i] <- CVM(x1, y1)
}
p <- mean(c(t0, reps) >= t0)
print(p)
hist(reps, main = "the two-sample Cramer-von Mises test",
     freq = FALSE, xlab = "T (p = 0.396)", breaks = "scott")
points(t0, 0, cex = 1, pch = 16, col = "red")  # mark the observed statistic
```
Design experiments for evaluating the performance of the NN, energy, and ball methods in various situations.
```{r}
library(RANN)
library(boot)
library(energy)
library(Ball)
m <- 50; k <- 3; p <- 2
n1 <- n2 <- 50; R <- 999
n <- n1 + n2; N <- c(n1, n2)
# nearest-neighbor statistic
Tn <- function(z, ix, sizes, k) {
  n1 <- sizes[1]; n2 <- sizes[2]; n <- n1 + n2
  if (is.vector(z)) z <- data.frame(z, 0)
  z <- z[ix, ]
  NN <- nn2(data = z, k = k + 1)  # the first column is each point itself, so drop it
  block1 <- NN$nn.idx[1:n1, -1]
  block2 <- NN$nn.idx[(n1 + 1):n, -1]
  i1 <- sum(block1 < n1 + .5)  # neighbors of sample-1 points within sample 1
  i2 <- sum(block2 > n1 + .5)  # neighbors of sample-2 points within sample 2
  (i1 + i2) / (k * n)
}
eqdist.nn <- function(z, sizes, k) {
  boot.obj <- boot(data = z, statistic = Tn, R = R,
                   sim = "permutation", sizes = sizes, k = k)
  ts <- c(boot.obj$t0, boot.obj$t)
  p.value <- mean(ts >= ts[1])
  list(statistic = ts[1], p.value = p.value)
}
p.values <- matrix(NA, m, 3)
for (i in 1:m) {
  x <- matrix(rnorm(n1 * p, mean = 1, sd = 0.6), ncol = p)
  y <- matrix(rnorm(n2 * p, mean = 1, sd = 0.85), ncol = p)
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i * 12)$p.value
}
alpha <- 0.1
pow <- colMeans(p.values < alpha)
print(pow)
barplot(pow, col = "lightgreen", main = "Unequal variances and equal expectations",
        names.arg = c("NN", "energy", "ball"))
```
```{r}
library(RANN)
library(boot)
library(energy)
library(Ball)
m <- 50; k <- 3; p <- 2
n1 <- n2 <- 50; R <- 999
n <- n1 + n2; N <- c(n1, n2)
# Tn and eqdist.nn are defined as in the previous chunk
p.values <- matrix(NA, m, 3)
for (i in 1:m) {
  x <- matrix(rnorm(n1 * p, mean = 0.4, sd = 0.6), ncol = p)
  y <- matrix(rnorm(n2 * p, mean = 0.5, sd = 0.85), ncol = p)
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i * 12)$p.value
}
alpha <- 0.1
pow <- colMeans(p.values < alpha)
print(pow)
barplot(pow, col = "lightblue", main = "Unequal variances and unequal expectations",
        names.arg = c("NN", "energy", "ball"))
```
```{r}
library(RANN)
library(boot)
library(energy)
library(Ball)
m <- 50; k <- 3; p <- 2
n1 <- n2 <- 20; R <- 999
n <- n1 + n2; N <- c(n1, n2)
# Tn and eqdist.nn are defined as in the first chunk
p.values <- matrix(NA, m, 3)
for (i in 1:m) {
  # t distribution with 1 df (heavy-tailed)
  x <- matrix(rt(n1 * p, df = 1), ncol = p)
  # two independent normal coordinates with means 0.4 and 0.5
  y <- cbind(rnorm(n2, mean = 0.4), rnorm(n2, mean = 0.5))
  z <- rbind(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i * 12)$p.value
}
alpha <- 0.1
pow <- colMeans(p.values < alpha)
print(pow)
barplot(pow, col = "pink", main = "Non-normal distributions",
        names.arg = c("NN", "energy", "ball"))
```
```{r}
library(RANN)
library(boot)
library(energy)
library(Ball)
m <- 50; k <- 3; p <- 2
n1 <- 10; n2 <- 100; R <- 999
n <- n1 + n2; N <- c(n1, n2)
# Tn and eqdist.nn are defined as in the first chunk
p.values <- matrix(NA, m, 3)
for (i in 1:m) {
  x <- rnorm(n1, mean = 1, sd = 1)  # n1 = 10
  y <- rnorm(n2, mean = 2, sd = 2)  # n2 = 100
  z <- c(x, y)
  p.values[i, 1] <- eqdist.nn(z, N, k)$p.value
  p.values[i, 2] <- eqdist.etest(z, sizes = N, R = R)$p.value
  p.values[i, 3] <- bd.test(x = x, y = y, R = 999, seed = i * 12)$p.value
}
alpha <- 0.1
pow <- colMeans(p.values < alpha)
print(pow)
barplot(pow, col = "lightgreen", main = "Unbalanced samples (n1 = 10, n2 = 100)",
        names.arg = c("NN", "energy", "ball"))
```
Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 values of the chain, and compare the deciles of the generated observations with the deciles of the standard Cauchy distribution (see `qcauchy` or `qt` with `df = 1`). Recall that a Cauchy($\theta$, $\eta$) distribution has density function
$$f(x)=\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)},\quad -\infty<x<\infty,\ \theta>0.$$ The standard Cauchy has the Cauchy($\theta=1$, $\eta=0$) density.
```{r}
set.seed(1)
m <- 10000
# target density: the Cauchy distribution
cauchy <- function(x, theta = 1, eta = 0) {
  out <- 1 / (pi * theta * (1 + ((x - eta) / theta)^2))
  return(out)
}
chain <- c(0)
for (i in 1:m) {
  proposal <- chain[i] + runif(1, min = -20, max = 20)  # random-walk proposal
  accept <- runif(1) < cauchy(proposal) / cauchy(chain[i])
  chain[i + 1] <- ifelse(accept, proposal, chain[i])
}
# trace plot of the chain
plot(chain, type = "l")
# compare quantiles of the chain (after burn-in) with the standard Cauchy (QQ plot)
b <- 1001  # discard the burn-in sample
y <- chain[b:m]
a <- ppoints(100)
QR <- qcauchy(a)     # quantiles of the standard Cauchy
Q <- quantile(y, a)  # sample quantiles after burn-in
qqplot(QR, Q, main = "", xlab = "Cauchy Quantiles", ylab = "Sample Quantiles",
       xlim = c(-20, 20), ylim = c(-20, 20))
hist(y, breaks = "scott", main = "", xlab = "", freq = FALSE)
lines(QR, cauchy(QR))  # superimpose the standard Cauchy density
```
Rao presented an example on genetic linkage of 197 animals in four categories. The group sizes are $(125,18,20,34)$. Assume that the probabilities of the corresponding multinomial distribution are $$\Big(\frac{1}{2}+\frac{\theta}{4},\ \frac{1-\theta}{4},\ \frac{1-\theta}{4},\ \frac{\theta}{4}\Big).$$ Estimate the posterior distribution of $\theta$ given the observed sample, using one of the methods in this chapter.
```{r}
m <- 5000
w <- 0.25  # half-width of the random-walk proposal
u <- runif(m)
v <- runif(m, -w, w)
group <- c(125, 18, 20, 34)
x <- numeric(m)
# (unnormalized) multinomial likelihood of theta; zero outside (0, 0.8)
prob <- function(theta, group) {
  if (theta < 0 || theta >= 0.8) return(0)
  return((1/2 + theta/4)^group[1] * ((1 - theta)/4)^group[2] *
         ((1 - theta)/4)^group[3] * (theta/4)^group[4])
}
x[1] <- 0.4
for (i in 2:m) {
  theta <- x[i - 1] + v[i]  # random-walk proposal
  if (u[i] <= prob(theta, group) / prob(x[i - 1], group))
    x[i] <- theta
  else
    x[i] <- x[i - 1]
}
index <- 1001:m  # discard the burn-in sample
theta_hat <- mean(x[index])
print(theta_hat)
```
title: "A-18081-2018-11-30" author: 'by 18081' date: "2018.11.29" output: html_document
Rao presented an example on genetic linkage of 197 animals in four categories. The group sizes are $(125,18,20,34)$. Assume that the probabilities of the corresponding multinomial distribution are $$\Big(\frac{1}{2}+\frac{\theta}{4},\ \frac{1-\theta}{4},\ \frac{1-\theta}{4},\ \frac{\theta}{4}\Big).$$ For Exercise 9.6, use the Gelman-Rubin method to monitor convergence of the chain, and run the chain until it has converged approximately to the target distribution according to $\hat{R}<1.2$.
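No solution code for this exercise appears in the original document, so the following is a minimal sketch under my own assumptions: it reuses the random-walk Metropolis sampler from Exercise 9.6, computes the Gelman-Rubin statistic on the running means of $k$ parallel chains (the textbook's psi approach), and simply doubles the chain length until $\hat{R}<1.2$. The function names, starting values, and doubling scheme are my choices, not the author's.

```{r}
# Gelman-Rubin statistic computed from a k x n matrix psi of running means
Gelman.Rubin <- function(psi) {
  psi <- as.matrix(psi)
  n <- ncol(psi)
  psi.means <- rowMeans(psi)
  B <- n * var(psi.means)            # between-chain variance
  W <- mean(apply(psi, 1, var))      # within-chain variance
  v.hat <- W * (n - 1) / n + B / n   # estimated upper bound of the variance
  v.hat / W                          # R-hat
}

group <- c(125, 18, 20, 34)
prob <- function(theta, group) {
  if (theta < 0 || theta >= 0.8) return(0)
  (1/2 + theta/4)^group[1] * ((1 - theta)/4)^group[2] *
    ((1 - theta)/4)^group[3] * (theta/4)^group[4]
}
# one random-walk Metropolis chain of length m started at x1 (as in Exercise 9.6)
mh.chain <- function(m, x1, w = 0.25) {
  x <- numeric(m)
  x[1] <- x1
  for (i in 2:m) {
    theta <- x[i - 1] + runif(1, -w, w)
    if (runif(1) <= prob(theta, group) / prob(x[i - 1], group))
      x[i] <- theta
    else
      x[i] <- x[i - 1]
  }
  x
}

set.seed(123)
k <- 4                       # number of parallel chains
x0 <- c(0.2, 0.4, 0.6, 0.7)  # overdispersed starting values
m <- 2000                    # initial chain length; doubled until R-hat < 1.2
repeat {
  X <- t(sapply(x0, function(x1) mh.chain(m, x1)))              # k x m chains
  psi <- t(apply(X, 1, cumsum)) / matrix(1:m, k, m, byrow = TRUE)  # running means
  R.hat <- Gelman.Rubin(psi)
  if (R.hat < 1.2) break
  m <- 2 * m
}
cat("chain length =", m, " R-hat =", R.hat, "\n")
```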
Find the intersection points $A(k)$ in $(0,\sqrt{k})$ of the curves $$ S_{k-1}(a)=P\Big(t(k-1)>\sqrt{\frac{a^2(k-1)}{k-a^2}}\Big)$$ and $$ S_{k}(a)=P\Big(t(k)>\sqrt{\frac{a^2k}{k+1-a^2}}\Big) $$ for $k=4:25,100,500,1000$,where $t(k)$ is a Student t random variable with $k$ degrees of freedom. (These intersection points determine the critical values for a t-test for scale-mixture errors proposed by Szekely [260].)
```{r}
k <- c(4:25, 100, 500, 1000)
n <- length(k)
A <- rep(0, n)
eps <- .Machine$double.eps^0.25
for (i in 1:n) {
  # S_{k-1}(a) - S_k(a); a root in (0, sqrt(k)) is the intersection point
  f <- function(a) {
    num1 <- sqrt(a^2 * (k[i] - 1) / (k[i] - a^2))
    num2 <- sqrt(a^2 * k[i] / (k[i] + 1 - a^2))
    pt(num2, k[i]) - pt(num1, k[i] - 1)
  }
  out <- uniroot(f, lower = eps, upper = sqrt(k[i] - eps))
  A[i] <- out$root
}
print(A)
```
title: "A-18081-2018-12-07" author: 'by 18081' date: "2018.12.06" output: html_document
Write a function to compute the cdf of the Cauchy distribution, which has density $$\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)},\quad -\infty<x<\infty,\ \theta>0.$$ Compare your results to the results from the R function `pcauchy`.
```{r}
options(digits = 4)
x <- 0:10
n <- length(x)
value_function <- rep(0, n)
value_pcauchy <- rep(0, n)
# Cauchy density with scale theta and location eta
f <- function(x, theta, eta) {
  1 / (theta * pi * (1 + ((x - eta) / theta)^2))
}
# I chose theta = 1, eta = 0 (the standard Cauchy)
for (i in 1:n) {
  # compute the cdf by numerical integration of the density
  value_fun <- integrate(f, lower = -Inf, upper = x[i],
                         rel.tol = .Machine$double.eps^0.25,
                         theta = 1, eta = 0)
  value_function[i] <- value_fun$value
  # compute the cdf with the built-in pcauchy
  value_pcauchy[i] <- pcauchy(x[i])
}
value <- data.frame(x, value_function, value_pcauchy)
print(t(value))
```
Use the EM algorithm to estimate the allele frequencies $p$ and $q$ in the A-B-O blood type problem, given the observed genotype counts nA = 28, nB = 24, nOO = 41, nAB = 70.

```{r}
library(stats4)
nA <- 28; nB <- 24; nOO <- 41; nAB <- 70
n <- nA + nB + nOO + nAB
i <- 0
p1 <- 0.3; q1 <- 0.3
p0 <- 0; q0 <- 0
options(warn = -1)
# EM algorithm: iterate until p and q converge
while (!isTRUE(all.equal(p0, p1, tolerance = 1e-7)) ||
       !isTRUE(all.equal(q0, q1, tolerance = 1e-7))) {
  p0 <- p1
  q0 <- q1
  # minus expected complete-data log-likelihood given (p0, q0)
  mlogL <- function(p, q) {
    return(-(2 * n * p0^2 * log(p) + 2 * n * q0^2 * log(q) +
             2 * nOO * log(1 - p - q) +
             (nA - n * p0^2) * log(2 * p * (1 - p - q)) +
             (nB - n * q0^2) * log(2 * q * (1 - p - q)) +
             nAB * log(2 * p * q)))
  }
  fit <- mle(mlogL, start = list(p = 0.2, q = 0.3))
  p1 <- fit@coef[1]
  q1 <- fit@coef[2]
  i <- i + 1
}
print(c(i, p1, q1))  # i: number of iterations to converge; p1, q1: the EM estimates
```
title: "A-18081-2018-12-14" author: 'by 18081' date: "2018.12.12" output: html_document
Use both for loops and `lapply()` to fit linear models to the `mtcars` data using the formulas stored in this list:
```{r}
formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)
# for loop
out3_1 <- vector("list", length(formulas))
for (i in seq_along(formulas)) {
  out3_1[[i]] <- lm(formulas[[i]], data = mtcars)
}
# lapply
out3_2 <- lapply(formulas, lm, data = mtcars)
```
Fit the model `mpg ~ disp` to each of the bootstrap replicates of `mtcars` in the list below by using a for loop and `lapply()`. Can you do it without an anonymous function?
```{r}
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), rep = TRUE)
  mtcars[rows, ]
})
# for loop
out4_1 <- vector("list", length(bootstraps))
for (i in seq_along(bootstraps)) {
  out4_1[[i]] <- lm(mpg ~ disp, data = bootstraps[[i]])
}
# lapply with an anonymous function
out4_2 <- lapply(bootstraps, function(d) lm(mpg ~ disp, data = d))
# without an anonymous function: each list element is passed as the data argument
out4_3 <- lapply(bootstraps, lm, formula = mpg ~ disp)
```
For each model in the previous two exercises, extract $R^2$ using the function below.
```{r}
rsq <- function(mod) {summary(mod)$r.squared}
# for exercise 3
Rreq3_1 <- lapply(out3_1, rsq)
Rreq3_2 <- lapply(out3_2, rsq)
print(data.frame(unlist(Rreq3_1), unlist(Rreq3_2)))
# for exercise 4
Rreq4_1 <- lapply(out4_1, rsq)
Rreq4_2 <- lapply(out4_2, rsq)
print(data.frame(unlist(Rreq4_1), unlist(Rreq4_2)))
```
The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.
```{r}
trials <- replicate(100,
                    t.test(rpois(10, 10), rpois(7, 10)),
                    simplify = FALSE)
# extract the p-value from every trial
sapply(trials, function(mod) {mod[["p.value"]]}, simplify = TRUE)
```
Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?
```{r}
# example: normalize the data by dividing each column by its column mean
data <- matrix(rnorm(20, 0, 10), nrow = 4)
x <- as.data.frame(data)
# using Map and vapply
ans1 <- Map("/", x, vapply(x, mean, numeric(1)))
# using lapply with an anonymous function, for comparison
ans2 <- lapply(x, function(col) col / mean(col))
print(data.frame(unlist(ans1), unlist(ans2)))
```
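The exercise also asks for an actual `lapply()` variant. A minimal sketch (the name `lmapply` and its argument list are my own choices): the function takes the function to apply, the parallel inputs, and a `FUN.VALUE` template telling `vapply()` the type and length of each result.

```{r}
# hypothetical lapply() variant: iterate in parallel over inputs via Map(),
# then coerce the results into a vector (or matrix) via vapply()
lmapply <- function(FUN, ..., FUN.VALUE) {
  out <- Map(FUN, ...)
  vapply(out, identity, FUN.VALUE)
}
# usage: weighted means over two parallel lists
xs <- list(1:10, 11:20)
ws <- list(rep(1, 10), rep(2, 10))
lmapply(weighted.mean, xs, ws, FUN.VALUE = numeric(1))
```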
title: "A-18081-2018-12-21" author: 'by 18081' date: "2018.12.21" output: html_document
Make a faster version of `chisq.test()` that only computes the chi-square test statistic when the input is two numeric vectors with no missing values. You can try simplifying `chisq.test()` or coding it from the mathematical definition.
```{r}
# expected cell count under independence
expected <- function(colsum, rowsum, total) {
  (colsum / total) * (rowsum / total) * total
}
# contribution of one cell to the chi-square statistic
chi_stat <- function(observed, expected) {
  ((observed - expected)^2) / expected
}
# treats x and y as the two rows of a contingency table,
# which is apparently different from what chisq.test(x, y) does
chisq_test2 <- function(x, y) {
  total <- sum(x) + sum(y)
  rowsum_x <- sum(x)
  rowsum_y <- sum(y)
  chistat <- 0
  for (i in seq_along(x)) {
    colsum <- x[i] + y[i]
    expected_x <- expected(colsum, rowsum_x, total)
    expected_y <- expected(colsum, rowsum_y, total)
    chistat <- chistat + chi_stat(x[i], expected_x)
    chistat <- chistat + chi_stat(y[i], expected_y)
  }
  chistat
}
print(chisq_test2(seq(1, 9), seq(2, 10)))
print(chisq.test(seq(1, 9), seq(2, 10)))
print(microbenchmark::microbenchmark(
  chisq_test2(seq(1, 9), seq(2, 10)),
  chisq.test(seq(1, 9), seq(2, 10))
))
```
Can you make a faster version of table() for the case of an input of two integer vectors with no missing values? Can you use it to speed up your chi-square test?
```{r}
table2 <- function(x, y) {
  x_val <- unique(x)
  y_val <- unique(y)
  mat <- matrix(0L, length(x_val), length(y_val))
  # count co-occurrences cell by cell
  for (i in seq_along(x)) {
    row_i <- which(x_val == x[[i]])
    col_i <- which(y_val == y[[i]])
    mat[row_i, col_i] <- mat[row_i, col_i] + 1L
  }
  dimnames <- list(x_val, y_val)
  # mimic table()'s named dimnames (taken from the call)
  names(dimnames) <- as.character(as.list(match.call())[-1])
  tab <- array(mat, dim = dim(mat), dimnames = dimnames)
  class(tab) <- "table"
  tab
}
# examples
a <- c(4, 5, 6)
identical(table(a, a), table2(a, a))
microbenchmark::microbenchmark(table(a, a), table2(a, a))
b <- c(7, 8, 9)
identical(table(a, b), table2(a, b))
c <- c(1, 2, 3, 1, 2, 3)
d <- c(2, 3, 4, 2, 3, 4)
identical(table(c, d), table2(c, d))
```
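To answer the second question, here is a sketch of how `table2()` could be used to speed up a chi-square test of independence for two integer vectors, computing the statistic directly from the contingency table (the helper name `chisq_test3` is my own, not from the original document):

```{r}
# hypothetical: chi-square statistic of independence computed
# from the contingency table produced by table2()
chisq_test3 <- function(x, y) {
  tab <- table2(x, y)
  total <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / total  # expected counts
  sum((tab - expected)^2 / expected)                     # chi-square statistic
}
e <- c(1, 2, 3, 1, 2, 3)
f <- c(2, 3, 4, 2, 3, 4)
chisq_test3(e, f)
# compare with the statistic reported by chisq.test on the same table
chisq.test(table(e, f))$statistic
```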