Assignment 09-14
This example uses the inverse transform method to simulate a random sample from the distribution with density $f_X (x) = 3x^2 , 0 < x < 1$.
```r
n <- 1000
u <- runif(n)
x <- u^(1/3)
hist(x, prob = TRUE)   # density histogram of the sample
y <- seq(0, 1, .01)
lines(y, 3*y^2)        # density curve f(x)
```
Generate random samples from a Logarithmic(0.5) distribution.
```r
rlogarithmic <- function(n, theta) {
  # returns a random Logarithmic(theta) sample of size n
  u <- runif(n)
  # set the initial length of the cdf vector
  N <- ceiling(-16 / log10(theta))
  k <- 1:N
  a <- -1/log(1-theta)
  fk <- exp(log(a) + k * log(theta) - log(k))
  Fk <- cumsum(fk)
  x <- integer(n)
  for (i in 1:n) {
    x[i] <- as.integer(sum(u[i] > Fk))   # F^{-1}(u) - 1
    while (x[i] == N) {
      # if x == N we need to extend the cdf
      # (very unlikely because N is large)
      logf <- log(a) + (N+1)*log(theta) - log(N+1)
      fk <- c(fk, exp(logf))
      Fk <- c(Fk, Fk[N] + fk[N+1])
      N <- N + 1
      x[i] <- as.integer(sum(u[i] > Fk))
    }
  }
  x + 1
}
```
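The chunk above only defines the generator. A short usage sketch (the sample size of 1000 is my choice, not from the original) that compares the empirical relative frequencies with the theoretical Logarithmic(0.5) pmf:

```r
# illustrative usage (not from the original answer)
set.seed(1)
theta <- 0.5
x <- rlogarithmic(1000, theta)
k <- sort(unique(x))
p_hat <- as.vector(table(x)) / length(x)        # empirical relative frequencies
p_k   <- -1 / log(1 - theta) * theta^k / k      # theoretical Logarithmic(theta) pmf
round(rbind(k, p_hat, p_k), 4)
```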
Let $X_1 \sim Gamma(2, 2)$ and $X_2 \sim Gamma(2, 4)$ be independent. Compare the histograms of the samples generated by the convolution $S = X_1 + X_2$ and the mixture $F_X = 0.5F_{X_1} + 0.5F_{X_2}$.
```r
n <- 1000
x1 <- rgamma(n, 2, 2)
x2 <- rgamma(n, 2, 4)
s <- x1 + x2                # the convolution
u <- runif(n)
k <- as.integer(u > 0.5)    # vector of 0's and 1's
x <- k * x1 + (1-k) * x2    # the mixture
par(mfcol = c(1, 2))        # two graphs per page
hist(s, prob = TRUE)
hist(x, prob = TRUE)
par(mfcol = c(1, 1))        # restore display
```
Assignment 09-21
A discrete random variable X has the probability mass function shown below.
```r
x <- c(0, 1, 2, 3, 4)
p <- c(0.1, 0.2, 0.2, 0.2, 0.3)
names(p) <- x
p
```
Use the inverse transform method to generate a random sample of size 1000 from the distribution of X. Construct a relative frequency table and compare the empirical with the theoretical probabilities. Repeat using the R sample function.
```r
set.seed(1)
x <- c(0, 1, 2, 3, 4)
p <- c(0.1, 0.2, 0.2, 0.2, 0.3)
cp <- cumsum(p)
m <- 1e3
r <- x[findInterval(runif(m), cp) + 1]   # inverse transform via the cdf
ct <- as.vector(table(r))
empirical_frequency <- ct / sum(ct)      # empirical relative frequencies
```
```r
auto_generate <- sample(c(0, 1, 2, 3, 4), size = 1000, replace = TRUE,
                        prob = c(0.1, 0.2, 0.2, 0.2, 0.3))
ct <- as.vector(table(auto_generate))
empirical_frequency_sample <- ct / sum(ct)   # empirical frequencies from sample()
```
```r
theoretical_probabilities <- p
names(empirical_frequency) <- c(0, 1, 2, 3, 4)
names(theoretical_probabilities) <- c(0, 1, 2, 3, 4)
rbind(theoretical_probabilities, empirical_frequency, empirical_frequency_sample)
```
The empirical frequencies from both methods are close to the theoretical probabilities, so the random number generation is successful. This completes Question 1.
Write a function to generate a random sample of size n from the Beta(a,b) distribution by the acceptance-rejection method. Generate a random sample of size 1000 from the Beta(3,2) distribution. Graph the histogram of the sample with the theoretical Beta(3,2) density superimposed.
Setup: the Beta(a, b) pdf is $f(x) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1-x)^{b-1}$, $0<x<1$; the proposal density is $g(x) = 1$, $0<x<1$; and the bounding constant is $c = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}$.
```r
generate_beta_random_variable <- function(a, b, n = 1e3) {
  j <- k <- 0
  y <- numeric(n)
  while (k < n) {
    u <- runif(1)
    j <- j + 1
    x <- runif(1)                      # random variate from the proposal g
    if (x^(a-1) * (1-x)^(b-1) > u) {   # accept x with probability f(x)/(c*g(x))
      k <- k + 1
      y[k] <- x
    }
  }
  hist(y, prob = TRUE, ylab = "", main = "", breaks = "Scott")
  z <- seq(0, 1, .001)
  l <- factorial(a+b-1) / (factorial(a-1) * factorial(b-1)) * z^(a-1) * (1-z)^(b-1)
  lines(z, l)   # theoretical Beta(a, b) density
}
```

```r
generate_beta_random_variable(3, 2)
```
According to the graph, the sample generated by our code closely matches the theoretical Beta(3, 2) density. This completes Question 2.
Simulate a continuous Exponential-Gamma mixture. Suppose that the rate parameter $\Lambda$ has a $Gamma(r,\beta)$ distribution and Y has an $Exp(\Lambda)$ distribution. That is, $(Y \mid \Lambda = \lambda) \sim f_Y(y\mid\lambda) = \lambda e^{-\lambda y}$. Generate 1000 random observations from this mixture with $r = 4$ and $\beta = 2$.
First we generate $\Lambda$ from the Gamma($r$, $\beta$) distribution. Then we generate 1000 random observations from the mixture with $r = 4$ and $\beta = 2$, and graph a histogram of the sample.
```r
n <- 1000
r <- 4
beta <- 2
Lambda <- rgamma(n, r, beta)
y <- rexp(n, Lambda)   # Y | Lambda = lambda ~ Exp(lambda)
# graph the histogram
hist(y, prob = TRUE, ylab = "", main = "", breaks = "Scott")
```
Assignment 09-30
Write a function to compute a Monte Carlo estimate of the Beta(3, 3) cdf, and use the function to estimate F(x) for $x = 0.1,0.2,\dots,0.9$ . Compare the estimates with the values returned by the pbeta function in R.
```r
set.seed(1)
beta_cdf <- function(x, a, b) {
  # Monte Carlo estimate of the Beta(a, b) cdf at x
  u <- runif(1000, min = 0, max = x)
  g <- factorial(a+b-1) / (factorial(a-1) * factorial(b-1)) * u^(a-1) * (1-u)^(b-1)
  theta <- mean(x * g)   # x * E[f(U)], U ~ Uniform(0, x)
  return(theta)
}
```
```r
x <- seq(0.1, 0.9, 0.1)
for (i in x) {
  a <- beta_cdf(i, 3, 3)
  b <- pbeta(i, 3, 3)
  print(c(i, a, b))   # x, Monte Carlo estimate, pbeta value
}
```
The Rayleigh density is $f(x) = \frac{x}{\sigma^2}e^{-x^2/(2\sigma^2)}$, $x \geq 0$, $\sigma > 0$. Implement a function to generate samples from a Rayleigh($\sigma$) distribution, using antithetic variables. What is the percent reduction in variance of $\frac{X+X'}{2}$ compared with $\frac{X_1+X_2}{2}$ for independent $X_1, X_2$?
From the Rayleigh density we obtain the cdf $F(x)=1-e^{-x^2/(2\sigma^2)}$, so the inverse transform is $F^{-1}(u)=\sqrt{-2\sigma^2\log(1-u)}$.
```r
generate_Rayleigh_random_variable <- function(sigma) {
  n <- 1000
  x <- runif(n)
  g <- sqrt(-2 * sigma^2 * log(1 - x))   # F^{-1}(u)
  return(g)
}

generate_Rayleigh_random_variable_antithetic_variable <- function(sigma) {
  n <- 1000
  x <- runif(n)
  g <- sqrt(-2 * sigma^2 * log(1 - x))   # uses U
  f <- sqrt(-2 * sigma^2 * log(x))       # uses the antithetic variable 1 - U
  return(c(g, f))
}
```
```r
set.seed(100)
a <- generate_Rayleigh_random_variable_antithetic_variable(0.5)
z <- numeric(1000)
for (i in seq(1, 1000, 1)) {
  z[i] <- a[i] + a[1000 + i]   # pair each X with its antithetic X'
}
z_1 <- z / 2
print(sprintf("variance of (X + X')/2 with antithetic variables: %f", var(z_1)))

x_1 <- generate_Rayleigh_random_variable(0.5)
x_2 <- generate_Rayleigh_random_variable(0.5)
z_2 <- (x_1 + x_2) / 2
print(sprintf("variance of (X1 + X2)/2 with independent variables: %f", var(z_2)))

percent_reduction <- (var(z_2) - var(z_1)) / var(z_2)
print(sprintf("percent reduction: %s", paste0(round(percent_reduction, 6) * 100, "%")))
```
Assignment 10-12
Let X be a non-negative random variable with $\mu = E[X] < \infty$. For a random sample $x_1,\dots,x_n$ from the distribution of X, the Gini ratio is defined by $G=\frac{1}{2n^2\mu}\sum_{j=1}^n \sum_{i=1}^{n} |x_i - x_j|$. The Gini ratio is applied in economics to measure inequality in income distribution (see e.g. [163]). Note that G can be written in terms of the order statistics $x_{(i)}$ as $G=\frac{1}{n^2\mu} \sum_{i=1}^{n} (2i-n-1)x_{(i)}$. If the mean is unknown, let $\hat{G}$ be the statistic G with $\mu$ replaced by $\bar{x}$. Estimate by simulation the mean, median and deciles of $\hat{G}$ if X is standard lognormal. Repeat the procedure for the uniform distribution and Bernoulli(0.1). Also construct density histograms of the replicates in each case.
```r
set.seed(1)
n <- 1000
gini_ratio_computing <- function(x, mu) {
  x <- sort(x)
  n <- length(x)
  if (isFALSE(mu)) mu <- mean(x)   # if mu is unknown, replace it by the sample mean
  a <- 0
  for (i in 1:n) {
    a <- a + (2*i - n - 1) * x[i]
  }
  gini_ratio <- a / (n^2 * mu)
  return(gini_ratio)
}
```
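As a quick sanity check (not part of the original answer), the Gini ratio of a perfectly equal sample should be 0, and the ratio of a sample in which one observation holds nearly everything should be close to 1:

```r
# illustrative sanity checks only
gini_ratio_computing(rep(5, 1000), mu = FALSE)          # equal incomes: should be 0
gini_ratio_computing(c(rep(0, 999), 100), mu = FALSE)   # extreme inequality: close to 1
```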
```r
k <- 1
gini_ratio_statistic_for_lognormal <- numeric(n)
while (k < 1001) {
  y <- rnorm(n)
  x <- exp(y)   # standard lognormal sample
  gini_ratio_statistic_for_lognormal[k] <- gini_ratio_computing(x, exp(1/2))  # E[X] = exp(1/2)
  k <- k + 1
}
decile <- seq(0.1, 1, by = 0.1)
gini_ratio_hat_for_lognormal_mean <- mean(gini_ratio_statistic_for_lognormal)
gini_ratio_hat_for_lognormal_median <- median(gini_ratio_statistic_for_lognormal)
gini_ratio_hat_for_lognormal_deciles <- quantile(gini_ratio_statistic_for_lognormal, decile)
print(sprintf("Gini ratio for lognormal: mean %f, median %f",
              gini_ratio_hat_for_lognormal_mean, gini_ratio_hat_for_lognormal_median))
print(gini_ratio_hat_for_lognormal_deciles)
hist(gini_ratio_statistic_for_lognormal, prob = TRUE, ylab = "", main = "", breaks = "Scott")
```
```r
k <- 1
gini_ratio_statistic_for_uniform <- numeric(n)
while (k < 1001) {
  x <- runif(n)
  gini_ratio_statistic_for_uniform[k] <- gini_ratio_computing(x, 1/2)   # E[X] = 1/2
  k <- k + 1
}
decile <- seq(0.1, 1, by = 0.1)
gini_ratio_hat_for_uniform_mean <- mean(gini_ratio_statistic_for_uniform)
gini_ratio_hat_for_uniform_median <- median(gini_ratio_statistic_for_uniform)
gini_ratio_hat_for_uniform_deciles <- quantile(gini_ratio_statistic_for_uniform, decile)
print(sprintf("Gini ratio for uniform: mean %f, median %f",
              gini_ratio_hat_for_uniform_mean, gini_ratio_hat_for_uniform_median))
print(gini_ratio_hat_for_uniform_deciles)
hist(gini_ratio_statistic_for_uniform, prob = TRUE, ylab = "", main = "", breaks = "Scott")
```
```r
gini_ratio_statistic_for_bernoulli <- numeric(n)
l <- 1
while (l < 1001) {
  x <- rbinom(n, 1, 0.1)
  gini_ratio_statistic_for_bernoulli[l] <- gini_ratio_computing(x, 0.1)   # E[X] = 0.1
  l <- l + 1
}
decile <- seq(0.1, 1, by = 0.1)
gini_ratio_hat_for_bernoulli_mean <- mean(gini_ratio_statistic_for_bernoulli)
gini_ratio_hat_for_bernoulli_median <- median(gini_ratio_statistic_for_bernoulli)
gini_ratio_hat_for_bernoulli_deciles <- quantile(gini_ratio_statistic_for_bernoulli, decile)
print(sprintf("Gini ratio for Bernoulli(0.1): mean %f, median %f",
              gini_ratio_hat_for_bernoulli_mean, gini_ratio_hat_for_bernoulli_median))
print(gini_ratio_hat_for_bernoulli_deciles)
hist(gini_ratio_statistic_for_bernoulli, prob = TRUE, ylab = "", main = "", breaks = "Scott")
```
Construct an approximate 95% confidence interval for the Gini ratio $\gamma = E[G]$ if X is lognormal with unknown parameters. Assess the coverage rate of the estimation procedure with a Monte Carlo experiment.
We use the central limit theorem to construct the confidence interval: many replicates of $\hat{G}$ are simulated as i.i.d. random variables, and a normal-approximation interval is formed from their mean and variance.
```r
n <- 200       # size of each lognormal sample (also the number of intervals)
m <- 80        # number of replicates of G-hat used for each interval
alpha <- 0.05  # 95% confidence level
upper_confidence <- numeric(n)
lower_confidence <- numeric(n)
for (j in 1:n) {
  gini_ratio_hat <- numeric(m)
  for (i in 1:m) {
    y <- rnorm(n, mean = 0, sd = 1)
    x <- exp(y)
    gini_ratio_hat[i] <- gini_ratio_computing(x, mu = FALSE)   # mu unknown: use the sample mean
  }
  gini_ratio_hat_mean <- mean(gini_ratio_hat)
  gini_ratio_hat_variance <- var(gini_ratio_hat)
  half_width <- sqrt(gini_ratio_hat_variance / m) * qnorm(1 - alpha/2)
  lower_confidence[j] <- gini_ratio_hat_mean - half_width
  upper_confidence[j] <- gini_ratio_hat_mean + half_width
}
print(sprintf("confidence interval for the Gini ratio: (%f, %f)",
              mean(lower_confidence), mean(upper_confidence)))
```
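The chunk above only reports an average interval. To assess the coverage rate, one possible sketch (my addition, not in the original code) is to compare each simulated interval with the true Gini ratio of the standard lognormal, $2\Phi(\sigma/\sqrt{2})-1$ with $\sigma = 1$:

```r
# hedged sketch: empirical coverage of the intervals computed above,
# assuming the closed-form Gini ratio of the standard lognormal
gamma_true <- 2 * pnorm(1 / sqrt(2)) - 1
coverage <- mean(lower_confidence <= gamma_true & gamma_true <= upper_confidence)
print(sprintf("true Gini ratio: %f, empirical coverage rate: %f", gamma_true, coverage))
```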
Tests for association based on Pearson product moment correlation $\rho$, Spearman’s rank correlation coefficient $\rho_s$ , or Kendall’s coefficient $\tau$, are implemented in cor.test. Show (empirically) that the nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal. Find an example of an alternative (a bivariate distribution (X,Y ) such that X and Y are dependent) such that at least one of the nonparametric tests have better empirical power than the correlation test against this alternative.
1. Nonparametric tests based on $\rho_s$ or $\tau$ are less powerful than the correlation test when the sampled distribution is bivariate normal.
```r
set.seed(1)
library(MASS)
m <- 1000
pvalues <- replicate(m, expr = {
  Sigma <- matrix(c(12, 2, 2, 1), 2, 2)
  bivariate_normal <- mvrnorm(n = 50, rep(0, 2), Sigma)
  x <- bivariate_normal[, 1]
  y <- bivariate_normal[, 2]
  pearson_test  <- cor.test(x, y, method = "pearson")
  spearman_test <- cor.test(x, y, method = "spearman")
  kendall_test  <- cor.test(x, y, method = "kendall")
  c(pearson_test$p.value, spearman_test$p.value, kendall_test$p.value)
})
power_pearson  <- mean(pvalues[1, ] < 0.05)
power_spearman <- mean(pvalues[2, ] < 0.05)
power_kendall  <- mean(pvalues[3, ] < 0.05)
power_pearson
power_spearman
power_kendall
```
The Pearson correlation test is clearly more powerful than the tests based on $\rho_s$ or $\tau$ when the sampled distribution is bivariate normal.
2. Repeat the simulation with a different covariance matrix:

```r
set.seed(1)
library(MASS)
m <- 1000
pvalues <- replicate(m, expr = {
  Sigma <- matrix(c(9, 2, 2, 1), 2, 2)
  bivariate_normal <- mvrnorm(n = 50, rep(0, 2), Sigma)
  x <- bivariate_normal[, 1]
  y <- bivariate_normal[, 2]
  pearson_test  <- cor.test(x, y, method = "pearson")
  spearman_test <- cor.test(x, y, method = "spearman")
  kendall_test  <- cor.test(x, y, method = "kendall")
  c(pearson_test$p.value, spearman_test$p.value, kendall_test$p.value)
})
power_pearson  <- mean(pvalues[1, ] < 0.05)
power_spearman <- mean(pvalues[2, ] < 0.05)
power_kendall  <- mean(pvalues[3, ] < 0.05)
power_pearson
power_spearman
power_kendall
```
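The chunk above still samples from a bivariate normal, so it does not by itself exhibit an alternative under which the rank tests win. A hedged sketch of one such alternative (my assumption, not from the original answer): a monotone relationship contaminated with heavy-tailed Cauchy noise, under which the rank-based tests are expected to gain power relative to the Pearson test. Running the chunk shows the estimated powers.

```r
# hedged sketch of a dependent, non-normal alternative (illustrative only)
set.seed(1)
m <- 1000
pvalues_alt <- replicate(m, expr = {
  x <- rnorm(30)
  y <- x^3 + rcauchy(30)   # monotone in x, heavy-tailed errors
  c(cor.test(x, y, method = "pearson")$p.value,
    cor.test(x, y, method = "spearman")$p.value,
    cor.test(x, y, method = "kendall")$p.value)
})
rowMeans(pvalues_alt < 0.05)   # empirical power: Pearson, Spearman, Kendall
```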
Assignment 11-02
Compute a jackknife estimate of the bias and the standard error of the correlation statistic in Example 7.2.
```r
library(bootstrap)
LSAT <- law$LSAT
GPA <- law$GPA
n <- length(LSAT)           # number of observations
cor_jack <- numeric(n)
cor_hat <- cor(LSAT, GPA)   # correlation of LSAT and GPA
```
```r
for (i in 1:n) {
  cor_jack[i] <- cor(LSAT[-i], GPA[-i])   # leave-one-out replicates
}
bias_jack <- (n - 1) * (mean(cor_jack) - cor_hat)
cor_bar <- mean(cor_jack)   # mean of the jackknife replicates
se_jack <- sqrt((n - 1) * mean((cor_jack - cor_bar)^2))
print(sprintf("jackknife estimate of the bias of the correlation statistic: %f", bias_jack))
print(sprintf("jackknife estimate of the standard error of the correlation statistic: %f", se_jack))
```
Refer to Exercise 7.4. Compute 95% bootstrap confidence intervals for the mean time between failures $\frac{1}{\lambda}$ by the standard normal, basic, percentile, and BCa methods. Compare the intervals and explain why they may differ.
```r
library(boot)
attach(aircondit)
x <- hours
B <- 5000
set.seed(1)
```
```r
gaptime_hat <- mean(x)   # MLE of 1/lambda
# bootstrap estimates of 1/lambda
frac_1_lambda_boot <- boot(data = aircondit,
                           statistic = function(x, i) mean(x[i, ]),
                           R = B)
frac_1_lambda_boot
```
We see the MLE is 1/lambda = 108.0833, with estimated std. error = 37.77646, bias = 0.2438.
Next we try to calculate the bootstrap intervals by the standard normal, basic, percentile, and BCa methods.
```r
einf_jack <- empinf(frac_1_lambda_boot, type = "jack")
boot.ci(frac_1_lambda_boot, type = c("norm", "basic", "perc", "bca"), L = einf_jack)
```
We see that the intervals are quite different. A primary reason is that the bootstrap distribution is skewed, which affects the simpler methods that appeal to the Central Limit Theorem. For example:
```r
hist(frac_1_lambda_boot$t, main = "", xlab = expression(1/lambda), prob = TRUE)
points(frac_1_lambda_boot$t0, 0, pch = 19)
```
The BCa interval incorporates an acceleration adjustment for skew, and may be preferred here.
Refer to Exercise 7.7. Obtain the jackknife estimates of bias and standard error of $\hat{\theta}$.
```r
library(bootstrap)
attach(scor)
n <- length(scor[, 1])   # number of observations
x <- as.matrix(scor)
theta_jack <- numeric(n)
```
```r
lambda_hat <- eigen(cov(scor))$values
theta_hat <- lambda_hat[1] / sum(lambda_hat)   # plug-in estimate of theta from the sample
# according to the formula, write a function that computes theta
theta <- function(x) {
  ev <- eigen(cov(x))$values
  ev[1] / sum(ev)
}
for (i in 1:n) {
  theta_jack[i] <- theta(x[-i, ])   # leave-one-out replicates
}
theta_bar <- mean(theta_jack)
bias_jack <- (n - 1) * (theta_bar - theta_hat)
se_jack <- sqrt((n - 1) * mean((theta_jack - theta_bar)^2))
print(sprintf("jackknife estimate of the bias of theta-hat: %f", bias_jack))
print(sprintf("jackknife estimate of the standard error of theta-hat: %f", se_jack))
detach(scor)
```
In Example 7.18, leave-one-out (n-fold) cross validation was used to select the best fitting model. Use leave-two-out cross validation to compare the models.
The code to estimate the parameters of the four models follows. Plots of the predicted response together with the data are also constructed for each model and shown below.
```r
library(DAAG)
attach(ironslag)
a <- seq(10, 40, .1)   # sequence for plotting fits

L1 <- lm(magnetic ~ chemical)
plot(chemical, magnetic, main = "Linear", pch = 16)
yhat1 <- L1$coef[1] + L1$coef[2] * a
lines(a, yhat1, lwd = 2)

L2 <- lm(magnetic ~ chemical + I(chemical^2))
plot(chemical, magnetic, main = "Quadratic", pch = 16)
yhat2 <- L2$coef[1] + L2$coef[2] * a + L2$coef[3] * a^2
lines(a, yhat2, lwd = 2)

L3 <- lm(log(magnetic) ~ chemical)
plot(chemical, magnetic, main = "Exponential", pch = 16)
logyhat3 <- L3$coef[1] + L3$coef[2] * a
yhat3 <- exp(logyhat3)
lines(a, yhat3, lwd = 2)

L4 <- lm(log(magnetic) ~ log(chemical))
plot(log(chemical), log(magnetic), main = "Log-Log", pch = 16)
logyhat4 <- L4$coef[1] + L4$coef[2] * log(a)
lines(log(a), logyhat4, lwd = 2)
```
Once the model is estimated, we want to assess the fit. Cross validation can be used to estimate the prediction errors.
```r
n <- length(magnetic)
e1 <- e2 <- e3 <- e4 <- numeric(n)
k <- 1
# fit the models on leave-two-out samples: points k and k+1 are held out together
# (note: if n is odd, the last observation is never held out)
while (k < n) {
  out <- c(k, k + 1)
  y <- magnetic[-out]
  x <- chemical[-out]

  J1 <- lm(y ~ x)
  e1[out] <- magnetic[out] - (J1$coef[1] + J1$coef[2] * chemical[out])

  J2 <- lm(y ~ x + I(x^2))
  e2[out] <- magnetic[out] -
    (J2$coef[1] + J2$coef[2] * chemical[out] + J2$coef[3] * chemical[out]^2)

  J3 <- lm(log(y) ~ x)
  e3[out] <- magnetic[out] - exp(J3$coef[1] + J3$coef[2] * chemical[out])

  J4 <- lm(log(y) ~ log(x))
  e4[out] <- magnetic[out] - exp(J4$coef[1] + J4$coef[2] * log(chemical[out]))

  k <- k + 2
}
```
The following estimates of the prediction error are obtained from the leave-two-out cross-validation.
```r
c(mean(e1^2), mean(e2^2), mean(e3^2), mean(e4^2))
```
According to the prediction error criterion, Model 2, the quadratic model, would be the best fit for the data.
```r
L2
```
The fitted regression equation for Model 2 is $$ \hat{Y} = 24.49262 - 1.39334X + 0.05452X^2$$
```r
par(mfrow = c(1, 2))   # layout for graphs
plot(L2$fit, L2$res)   # residuals vs fitted values
abline(0, 0)           # reference line
qqnorm(L2$res)         # normal probability plot
qqline(L2$res)         # reference line
par(mfrow = c(1, 1))   # restore display
```
Assignment 11-16
Implement the two-sample Cramér-von Mises test for equal distributions as a permutation test. Apply the test to the data in Examples 8.1 and 8.2.
1. Initial parameters
```r
set.seed(1)
library(cramer)
attach(chickwts)
x <- sort(as.vector(weight[feed == "soybean"]))
y <- sort(as.vector(weight[feed == "linseed"]))
R <- 99                 # number of replicates
z <- c(x, y)            # pooled sample
K <- 1:26
W_2_hat <- numeric(R)   # storage for replicates
```
2. Generate permutation samples and compute the p-value.
```r
W_2_0 <- cramer.test(x, y)$statistic
for (i in 1:R) {
  # generate indices k for the first sample
  k <- sample(K, size = 14, replace = FALSE)
  x1 <- z[k]
  y1 <- z[-k]   # complement of x1
  W_2_hat[i] <- cramer.test(x1, y1)$statistic
}
p <- mean(abs(c(W_2_0, W_2_hat)) >= W_2_0)
round(c(p, cramer.test(x, y)$p.value), 3)
```
3. Draw the histogram.
```r
hist(W_2_hat, main = "", freq = FALSE, xlab = "W_2_hat", breaks = "scott")
points(W_2_0, 0, cex = 1, pch = 16)
```
Use the Metropolis-Hastings sampler to generate random variables from a standard Cauchy distribution. Discard the first 1000 values of the chain, and compare the deciles of the generated observations with the deciles of the standard Cauchy distribution (see qcauchy or qt with df=1). Recall that a Cauchy($\theta$, $\eta$) distribution has density function
$$ f(x)=\frac{1}{\theta \pi (1+[(x-\eta)/\theta]^2)}, \qquad -\infty < x < \infty,\ \theta > 0. $$
1. To simplify, I use a normal proposal distribution centered at the current state, which is symmetric.
```r
generate_cauchy_Metropolis_Hastings <- function(n, sigma, x0, N) {
  # n: degrees of freedom of the target t distribution (df = 1 is the standard Cauchy)
  # sigma: standard deviation of the proposal distribution N(xt, sigma)
  # x0: initial value
  # N: length of the chain
  x <- numeric(N)
  x[1] <- x0
  u <- runif(N)
  k <- 0
  for (i in 2:N) {
    y <- rnorm(1, x[i-1], sigma)
    # Metropolis-Hastings ratio; the dnorm terms cancel for this symmetric proposal
    if (u[i] <= (dt(y, n) * dnorm(x[i-1], y, sigma)) /
                (dt(x[i-1], n) * dnorm(y, x[i-1], sigma))) {
      x[i] <- y
    } else {
      x[i] <- x[i-1]
      k <- k + 1   # count rejected candidates
    }
  }
  return(list(x = x, k = k))
}

n <- 1      # degrees of freedom for the target Student t distribution
N <- 5000
x0 <- 10    # initial value
cauchy_chain <- generate_cauchy_Metropolis_Hastings(n, 2, x0, N)
x <- cauchy_chain$x
k <- cauchy_chain$k   # number of candidate points rejected
print(sprintf("rate of candidate points rejected: %f", k / N))
```
2. Graph the result.
```r
refline <- qt(c(.025, .975), df = n)
plot(x, type = "l", xlab = bquote(sigma == 2), ylab = "X", ylim = range(x))
abline(h = refline)
par(mfrow = c(1, 2))
qqnorm(x)                                # Q-Q plot of the chain
hist(x, xlab = "X", main = "sigma = 2")  # histogram of the chain
par(mfrow = c(1, 1))                     # restore display
```
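The exercise also asks for a comparison of deciles after discarding the first 1000 values; a short sketch of that comparison (my addition, not in the original code):

```r
# discard the first 1000 values as burn-in and compare deciles
# with the standard Cauchy quantiles
burned <- x[1001:N]
probs <- seq(0.1, 0.9, 0.1)
round(rbind(chain = quantile(burned, probs), cauchy = qcauchy(probs)), 3)
```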
Assignment 11-23
For exercise 9.6, use the Gelman-Rubin method to monitor convergence of the chain, and run the chain until the chain has converged approximately to the target distribution according to $\hat{R} < 1.2$
1. Compute the joint probability of the observed multinomial sample given $\theta$.
```r
set.seed(1)
joint_probability <- function(theta, x) {
  # multinomial likelihood of the observed counts x given theta
  probability <- (2 + theta)^x[1] * (1 - theta)^(x[2] + x[3]) * theta^x[4]
  return(probability)
}
```
2. Write a function to generate a Markov chain for $\theta$.
```r
theta_chain <- function(n) {
  xdata <- c(125, 18, 20, 34)   # observed multinomial data
  theta <- numeric(n)
  theta[1] <- runif(1)          # initialize: sample from the prior on theta
  k <- 0
  u <- runif(n)
  for (t in 2:n) {
    xt <- theta[t-1]
    alpha <- xt / (1 - xt)
    y <- rbeta(1, shape1 = alpha, shape2 = 1)   # proposal from Beta(alpha, 1)
    numer <- joint_probability(y, xdata) * dbeta(xt, y / (1 - y), 1)
    denom <- joint_probability(xt, xdata) * dbeta(y, alpha, 1)
    if (u[t] <= numer / denom) {
      theta[t] <- y
    } else {
      theta[t] <- theta[t-1]
      k <- k + 1
    }
  }
  return(theta)
}
```
```r
Gelman.Rubin <- function(psi) {
  # psi[i, j] is the statistic psi(X[i, 1:j])
  # for the chain in the i-th row of X
  psi <- as.matrix(psi)
  n <- ncol(psi)
  k <- nrow(psi)
  psi.means <- rowMeans(psi)           # row means
  B <- n * var(psi.means)              # between-chain variance estimate
  psi.w <- apply(psi, 1, "var")        # within-chain variances
  W <- mean(psi.w)                     # within estimate
  v.hat <- W * (n - 1) / n + (B / n)   # upper variance estimate
  r.hat <- v.hat / W                   # G-R statistic
  return(r.hat)
}
```
```r
k <- 4      # number of chains to generate
n <- 8000   # length of the chains
b <- 1000   # burn-in length

# generate the chains
X <- matrix(0, nrow = k, ncol = n)
for (i in 1:k)
  X[i, ] <- theta_chain(n)

# compute diagnostic statistics
psi <- t(apply(X, 1, cumsum))
for (i in 1:nrow(psi))
  psi[i, ] <- psi[i, ] / (1:ncol(psi))
print(Gelman.Rubin(psi))
```
Since the Gelman-Rubin statistic is below 1.2, the chain of $\theta$ has approximately converged to the target distribution according to the criterion $\hat{R} < 1.2$.
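The exercise asks to run the chain until $\hat{R} < 1.2$, while the chunk above uses a fixed length of 8000. A hedged sketch (my addition) that doubles the chain length until the criterion is met:

```r
# hedged sketch: double the chain length until R-hat < 1.2 (or a safety cap is hit)
n_try <- 2000
repeat {
  X_try <- matrix(0, nrow = k, ncol = n_try)
  for (i in 1:k) X_try[i, ] <- theta_chain(n_try)
  psi_try <- t(apply(X_try, 1, cumsum))
  for (i in 1:nrow(psi_try)) psi_try[i, ] <- psi_try[i, ] / (1:ncol(psi_try))
  r_hat <- Gelman.Rubin(psi_try)
  if (r_hat < 1.2 || n_try >= 64000) break
  n_try <- 2 * n_try
}
c(chain_length = n_try, R_hat = r_hat)
```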
Plot the four chains of $\psi$ and the sequence of $\hat{R}$ statistics.
```r
# plot psi for the four chains
par(mfrow = c(2, 2))
for (i in 1:k)
  plot(psi[i, (b+1):n], type = "l", xlab = i, ylab = bquote(psi))
par(mfrow = c(1, 1))   # restore default

# plot the sequence of R-hat statistics
rhat <- rep(0, n)
for (j in (b+1):n)
  rhat[j] <- Gelman.Rubin(psi[, 1:j])
plot(rhat[(b+1):n], type = "l", xlab = "", ylab = "R")
abline(h = 1.2, lty = 2)
```
Find the intersection points $A(k)$ in $(0,\sqrt{k})$ of the curves $$S_{k-1}(a)=P\left(t(k-1)>\sqrt{\frac{a^2(k-1)}{k-a^2}}\right)$$ and $$S_{k}(a)=P\left(t(k)>\sqrt{\frac{a^2k}{k+1-a^2}}\right)$$ for $k = 4:25, 100, 500, 1000$, where $t(k)$ is a Student t random variable with $k$ degrees of freedom. (These intersection points determine the critical values for a t-test for scale-mixture errors proposed by Székely [260].)
```r
k <- c(4:25, 100, 500, 1000)
object <- function(a, df) {
  # difference S_k(a) - S_{k-1}(a); its root is the intersection point A(k)
  num_1 <- sqrt(a^2 * (df - 1) / (df - a^2))
  S_kminus1 <- pt(q = num_1, df = df - 1, lower.tail = FALSE)
  num_2 <- sqrt(a^2 * df / (df + 1 - a^2))
  S_k <- pt(q = num_2, df = df, lower.tail = FALSE)
  return(S_k - S_kminus1)
}
```
```r
par(mfrow = c(2, 5))
for (i in k) {
  a <- seq(1, 2, 0.01)
  plot(a, object(a, i))
}
```
```r
for (i in 1:length(k)) {
  root <- uniroot(object, lower = 1, upper = 2, df = k[i])$root
  print(c(k[i], root))
}
```
Assignment 11-30
Write a function to compute the cdf of the Cauchy distribution, which has density $$\frac{1}{\theta\pi(1+[(x-\eta)/\theta]^2)}, \quad -\infty < x < \infty,$$ where $\theta>0$. Compare your results to the results from the R function pcauchy. (Also see the source code in pcauchy.c.)
```r
cauchy_distribution <- function(q, eta = 0, theta = 1) {
  cauchy_density <- function(x, eta, theta) {
    1 / (theta * pi * (1 + ((x - eta) / theta)^2))
  }
  result <- integrate(f = cauchy_density, lower = -Inf, upper = q,
                      rel.tol = .Machine$double.eps^0.25,
                      eta = eta, theta = theta)
  return(result$value)
}
```
Compare with the standard Cauchy distribution:
```r
x <- matrix(seq(-5, 5), ncol = 1)
cbind(x, apply(x, MARGIN = 1, FUN = cauchy_distribution), pcauchy(x))
```
Compare with the Cauchy distribution with a different location parameter $\eta$:
```r
cbind(x, apply(x, MARGIN = 1, FUN = cauchy_distribution, eta = 2), pcauchy(x, location = 2))
```
Compare with the Cauchy distribution with a different scale parameter $\theta$:
```r
cbind(x, apply(x, MARGIN = 1, FUN = cauchy_distribution, theta = 2), pcauchy(x, scale = 2))
```
According to the results above, the cdf we compute agrees with the values returned by pcauchy.
A-B-O blood type problem. Let the three alleles be A, B, and O.
```r
dat <- rbind(Genotype  = c('AA', 'BB', 'OO', 'AO', 'BO', 'AB', 'Sum'),
             Frequency = c('p2', 'q2', 'r2', '2pr', '2qr', '2pq', '1'),
             Count     = c('nAA', 'nBB', 'nOO', 'nAO', 'nBO', 'nAB', 'n'))
knitr::kable(dat)
```
Observed data: $n_{A\cdot}=n_{AA}+n_{AO}=28$ (A-type), $n_{B\cdot}=n_{BB}+n_{BO}=24$ (B-type), $n_{OO}=41$ (O-type), $n_{AB}=70$ (AB-type). Use the EM algorithm to solve for the MLE of $p$ and $q$ (treat $n_{AA}$ and $n_{BB}$ as missing data). Record the likelihood values in the M-steps; are they increasing?
$$\log L(x,p)= n_{AA}\log(p_A^2)+n_{AO}\log(p_Ap_O) + n_{AB}\log(p_Ap_B) + n_{BB}\log(p_B^2) + n_{BO}\log(p_Bp_O) + n_{OO}\log(p_O^2)$$

1. Initial parameters
```r
x <- c(28, 24, 41, 70)   # observed counts: nA., nB., nOO, nAB
n <- rep(0, 6)           # expected genotype counts (filled in by the E-step)
p <- rep(1/3, 3)         # initial values of (pA, pB, pO)
m <- 20                  # number of EM iterations
```
```r
expectation <- function(x, p) {
  # E-step: expected genotype counts given the current allele frequencies p
  n.aa <- (x[1] * p[1]^2) / (p[1]^2 + 2 * p[1] * p[3])
  n.ao <- (2 * x[1] * p[1] * p[3]) / (p[1]^2 + 2 * p[1] * p[3])
  n.bb <- (x[2] * p[2]^2) / (p[2]^2 + 2 * p[2] * p[3])
  n.bo <- (2 * x[2] * p[2] * p[3]) / (p[2]^2 + 2 * p[2] * p[3])
  n <- c(n.aa, n.ao, n.bb, n.bo, x[3], x[4])   # (nAA, nAO, nBB, nBO, nOO, nAB)
  return(n)
}
```
```r
maximization <- function(x, n) {
  # M-step: gene-counting update of the allele frequencies,
  # with n = (nAA, nAO, nBB, nBO, nOO, nAB) and sum(x) the number of individuals
  p.a <- (2*n[1] + n[2] + n[6]) / (2 * sum(x))
  p.b <- (2*n[3] + n[4] + n[6]) / (2 * sum(x))
  p.o <- (2*n[5] + n[2] + n[4]) / (2 * sum(x))
  p <- c(p.a, p.b, p.o)
  return(p)
}
```
```r
record_mle <- numeric(m)
for (i in 1:m) {
  n <- expectation(x, p)
  p <- maximization(x, n)
  # complete-data likelihood at the current estimates,
  # with n = (nAA, nAO, nBB, nBO, nOO, nAB)
  record_mle[i] <- (p[1]^2)^n[1] * (2*p[1]*p[3])^n[2] *
    (p[2]^2)^n[3] * (2*p[2]*p[3])^n[4] *
    (p[3]^2)^n[5] * (2*p[1]*p[2])^n[6]
}
p
record_mle
plot(record_mle)
lines(record_mle)
```
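To answer whether the recorded likelihood values are increasing, a small check of the successive differences (my addition, not in the original code):

```r
# TRUE entries indicate iterations where the recorded likelihood increased
diff(record_mle) > 0
```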
Assignment 12-07
```r
formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)
```
1. Use for loops to fit the linear models.
```r
lm_forloop <- vector("list", length(formulas))
for (i in seq_along(formulas)) {
  lm_forloop[[i]] <- lm(formulas[[i]], data = mtcars)
}
lm_forloop
```
2. Use lapply() to fit the linear models.
```r
lm_lapply <- lapply(formulas, function(x) lm(formula = x, data = mtcars))
lm_lapply
```
```r
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), replace = TRUE)
  mtcars[rows, ]
})
```
1. Fit the model mpg ~ disp to each bootstrap sample using lapply().
```r
lapply_fit <- lapply(bootstraps, function(x) lm(mpg ~ disp, data = x))
lapply_fit
```
2. Fit the model mpg ~ disp to each bootstrap sample using a for loop.
```r
for_loop_fit <- vector("list", length(bootstraps))
for (i in seq_along(bootstraps)) {
  for_loop_fit[[i]] <- lm(mpg ~ disp, data = bootstraps[[i]])
}
for_loop_fit
```
For each model in the previous two exercises, extract $R^2$ using the function below.
```r
rsq <- function(mod) summary(mod)$r.squared
```
```r
lapply(formulas, function(x) rsq(lm(formula = x, data = mtcars)))
lapply(bootstraps, function(x) rsq(lm(mpg ~ disp, data = x)))
```
```r
trials <- replicate(
  100,
  t.test(rpois(10, 10), rpois(7, 10)),
  simplify = FALSE
)
```
Extra challenge: get rid of the anonymous function by using [[ directly.
```r
sapply(trials, function(x) x[["p.value"]])
sapply(trials, "[[", "p.value")
```
The question asks us to combine Map() and vapply() into a function whose inputs and output are vectors (or a matrix). Map() iterates over its vector (or list) inputs, and vapply() returns an atomic vector of a specified type, so we combine them as below.
```r
map_vapply <- function(X, FUN, FUN.VALUE, simplify = FALSE) {
  out <- Map(function(x) vapply(x, FUN, FUN.VALUE), X)
  return(out)
}
```
The function's arguments combine those that Map() and vapply() themselves take.
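A possible variant (my sketch, not part of the original answer) that returns an atomic vector or matrix rather than a list is to iterate with Map() first and then coerce the list of results with vapply():

```r
# sketch: parallel iteration via Map(), output coerced by vapply()
map_vapply2 <- function(FUN, FUN.VALUE, ...) {
  out <- Map(FUN, ...)
  vapply(out, identity, FUN.VALUE)
}

# illustrative example: weighted means of paired vectors, returned as a numeric vector
xs <- list(1:5, 6:10)
ws <- list(rep(1, 5), 1:5)
map_vapply2(weighted.mean, numeric(1), xs, ws)
```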
Assignment 12-14
We write a faster version of the chi-square test that only computes the test statistic when the input is two numeric vectors with no missing values.

1. Compute the expected counts.
```r
expected <- function(colsum, rowsum, total) {
  (colsum / total) * (rowsum / total) * total
}
```
2. Compute the chi-square statistic contributions.
```r
chi_stat <- function(observed, expected) {
  ((observed - expected)^2) / expected
}
```
```r
chisq_test <- function(x, y) {
  total <- sum(x) + sum(y)
  rowsum_1 <- sum(x)
  rowsum_2 <- sum(y)
  colsum <- x + y
  expected_1 <- mapply(expected, colsum, rowsum_1, total)
  expected_2 <- mapply(expected, colsum, rowsum_2, total)
  chi_1 <- mapply(chi_stat, x, expected_1)
  chi_2 <- mapply(chi_stat, y, expected_2)
  chistat <- sum(chi_1 + chi_2)
  return(chistat)
}

print(chisq_test(c(13, 27, 19, 10), c(15, 17, 21, 27)))
print(chisq.test(c(13, 27, 19, 10), c(15, 17, 21, 27)))
print(microbenchmark::microbenchmark(
  chisq_test(c(13, 27, 19, 10), c(15, 17, 21, 27)),
  chisq.test(c(13, 27, 19, 10), c(15, 17, 21, 27))
))
```
The warning message comes from chisq.test (the chi-squared approximation may be inaccurate because of small expected counts). Our chisq_test is clearly faster than chisq.test.
Can you make a faster version of table() for the case of an input of two integer vectors with no missing values? Can you use it to speed up your chi-square test?
```r
table_fast <- function(x, y) {
  x_num <- unique(x)
  y_num <- unique(y)
  mat <- matrix(0, length(x_num), length(y_num))
  for (i in seq_along(x)) {
    row <- which(x_num == x[[i]])
    col <- which(y_num == y[[i]])
    mat[row, col] <- mat[row, col] + 1
  }
  dimnames <- list(x_num, y_num)
  tab <- array(mat, dim = dim(mat), dimnames = dimnames)
  class(tab) <- "table"
  return(tab)
}

a <- c(3, 7, 15)
microbenchmark::microbenchmark(table(a, a), table_fast(a, a))
```
Apparently, the table_fast() we wrote is faster than table().
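To address the second part of the question ("Can you use it to speed up your chi-square test?"), here is a hedged sketch (my addition) of a chi-square statistic for two integer vectors that builds its contingency table with table_fast():

```r
# sketch: chi-square test statistic for two integer vectors, using table_fast()
chisq_test2 <- function(x, y) {
  tab <- table_fast(x, y)
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# illustrative data (hypothetical)
b <- c(1, 2, 2, 1, 3)
d <- c(1, 1, 2, 2, 3)
chisq_test2(b, d)
chisq.test(b, d)$statistic   # same value; chisq.test warns about small counts
```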