```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup}
# library(dHM2013R)
devtools::load_all()
library(data.table)
library(magrittr)  # for the %>% pipe
```
This vignette documents my attempt to replicate the analysis of @dhaultfoeuille_inference_2013 on British data, specifically the longitudinal cohort studies Next Steps and the 1970 British Cohort Study (BCS70).
Their analysis is based on the extended Roy model, with two sectors, non-pecuniary factors and uncertainty about future earnings:
$$
\begin{aligned}
Y_0 &= X'\beta_0 + \varepsilon_0 \\
Y_1 &= X'\beta_1 + \varepsilon_1 \\
D &= \mathds{1}\{ -\delta_0 + X'(\beta_1 - \beta_0 - \gamma_0) + \eta_{\Delta} > 0 \}
\end{aligned}
$$
where $\varepsilon_k = \eta_k + \nu_k$, $\eta_{\Delta} = \eta_1 - \eta_0$, and $X \perp (\eta_1, \eta_0)$. The young person knows $\eta_k$ at the time they make their decision, but not $\nu_k$.
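To fix ideas, here is a minimal simulation of this data-generating process; all parameter values (and the `*Sim` names) are illustrative assumptions, not estimates from the paper.

```r
# simulate the extended Roy model: two sectors, non-pecuniary factors,
# and earnings uncertainty (the nu_k are unknown when the choice is made)
set.seed(42)
nSim <- 1e4
xSim <- cbind(1, rnorm(nSim))                     # intercept and one continuous x
beta0Sim <- c(1.0, 0.5); beta1Sim <- c(1.2, 0.8)  # illustrative returns
gamma0Sim <- c(0.3, 0.1); delta0Sim <- 0.2        # illustrative non-pecuniary terms
eta0 <- rnorm(nSim, sd = 0.5); eta1 <- rnorm(nSim, sd = 0.5)  # known to the agent
nu0 <- rnorm(nSim, sd = 0.5); nu1 <- rnorm(nSim, sd = 0.5)    # realised later
dSim <- as.numeric(
  -delta0Sim + xSim %*% (beta1Sim - beta0Sim - gamma0Sim) + (eta1 - eta0) > 0
)
ySim <- ifelse(dSim == 1,
               xSim %*% beta1Sim + eta1 + nu1,
               xSim %*% beta0Sim + eta0 + nu0)
mean(dSim)  # share choosing sector 1
```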
@dhaultfoeuille_inference_2013's method is a three-step procedure that relies on $\eta_k$ linking the outcome ($Y_k$) and choice ($D$) equations. The three steps are:

1. Estimate $\zeta_0 = \frac{\beta_1 - \beta_0 - \gamma_0}{\beta_{11} - \beta_{01} - \gamma_{01}}$, normalised so that $\zeta_{01} = 1$, where $x_1$ is a continuous element of $X$, using the approach of @coppejans_estimation_2001.
2. Estimate $\beta_1$ and $\beta_0$ via @newey_two-step_2009's series estimator, which uses the index from step 1 and its power series as a control function.
3. Estimate $\delta_0$ and $\gamma_0$ via @dhaultfoeuille_inference_2013's novel method.
We begin with step 1: estimating $\zeta_0$.
```{r}
benefitVars <- c(
  'betterJob', 'betterPay', 'betterOpp', 'need4Career', 'showSkills',
  'delayWorkGd', 'socialLifeGd', 'leaveHomeGd', 'keepStudy', 'moreQuals',
  'persDevel', 'moreConfid', 'moreRespect', 'betterLife', 'prepareLife'
)
costVars <- c(
  'tooExpensive', 'getDebt', 'parentsMoney', 'notFinancIndep', 'notWorkEarn',
  'costsGen', 'noGuaranteeGdJb', 'notNeed4Career', 'lessExp', 'tooHard',
  'leaveHomeBad', 'tooLong', 'wasteTime', 'feesEtc', 'stress'
)
xControls <- c(
  'w4SOCMajorMP', 'w4ethgrpMP', 'w4hiqualgMP', 'W5AlevNoYP', 'W5EMA1YP',
  'W4AlevNoYP', 'w4sexMP', 'W4SexYP', 'cogScore'
)

logitControlsFit <- c(benefitVars, costVars, xControls)
step1Formula <- as.formula(
  paste0('degree ~ ', paste0(logitControlsFit, collapse = ' + '))
)

step1Probit <- glm(
  step1Formula,
  data = dtLsype4dHM,
  family = binomial(link = 'probit'),
  na.action = na.exclude
)
summary(step1Probit)

step1Logit <- glm(
  step1Formula,
  data = dtLsype4dHM,
  family = binomial(),
  na.action = na.exclude
)
summary(step1Logit)
```
@coppejans_estimation_2001 proposes the mixture of distributions (MOD) estimator, which allows estimation of binary response models when the error distribution is unknown. Rather than assuming a logistic or Gaussian error distribution, the MOD estimator employs a mixture of smooth distributions. The method relies on a single-index structure, which we have in the extended Roy model here.
The estimator minimises
$$
Q_n(\omega(x; \Psi_k, \zeta)) = \frac{1}{n}\sum_{i=1}^n \left\{ y_i - \left[ 1 - \Psi_k(-x_i'\zeta; a, b, \{\mu_j, \pi_j\}, \sigma_k) \right] \right\}^2
$$
with respect to $(\zeta, a, b, \{\mu_j, \pi_j\}, \sigma_k)$, where
$$
\Psi_k(u; a, b, \{\mu_j, \pi_j\}, \sigma_k) = a + (b-a)\sum_{j=1}^k \pi_j H\left( \frac{u - \mu_j}{\sigma_k} \right).
$$
Here $k$ is the number of mixing components, $H(\cdot)$ is some smooth distribution function (e.g. $\Phi(\cdot)$), the $\pi_j$, with $0 \leq \pi_j \leq 1$ and $\sum_{j=1}^k \pi_j = 1$, are mixing weights, $\mu_j$ is a location parameter, $a$ is an intercept and $b-a$ a slope term (we can set $b=1$ and $a=0$ here).
The optimisation is subject to constraints: the mixing weights must lie in the unit simplex (as above), $\sigma_k$ is bounded below, and the derivatives of $\Psi_k$ are bounded. However, for computation we only impose the bound on $\sigma$ and check the bound on the derivative after the optimisation.
@coppejans_estimation_2001 suggests picking $k = 7$ for 250 observations, though @dhaultfoeuille_inference_2013 use $k = 3$; we follow the latter here.
```{r}
# constraint bounds (n = 1593)
sigma_p <- sd(step1Probit$residuals)
sigma_lb <- min(.5 * sigma_p / (1593^.2), 0.5)
c2Psi_ub <- min(
  (exp(-.5) / sqrt(2 * pi)) * max(1/4, (.5 * sigma_p / (1593^.2))^(-1.5)),
  1000
)
```
```{r}
k <- 3
pi_k <- rep(1/k, k - 1)
mu <- rep(0, k)
sigma <- sigma_p
# starting values: probit coefficients normalised on cogScore (x1)
zeta <- step1Probit$coefficients[2:59] / step1Probit$coefficients[[59]]
x <- model.matrix(
  object = step1Probit$formula,
  data = step1Probit$model
)[, 2:59]
x_x1 <- x / x[, "cogScore"]
y <- step1Probit$y
y_x1 <- y / x[, "cogScore"]

# least squares criterion with k = 3 Gaussian mixture components
Q <- function(zeta, pik, mu, sigma, x, y) {
  mean((y - (1 - (
    pik[[1]] * pnorm(-x %*% zeta, mean = mu[[1]], sd = sigma) +
      pik[[2]] * pnorm(-x %*% zeta, mean = mu[[2]], sd = sigma) +
      (1 - pik[[1]] - pik[[2]]) * pnorm(-x %*% zeta, mean = mu[[3]], sd = sigma)
  )))^2)
}
Q2 <- function(theta) {
  Q(
    zeta = c(theta[1:57], 1),  # zeta_{cogScore} fixed at 1
    pik = theta[58:59],
    mu = theta[60:62],
    sigma = theta[63],
    x = x,
    y = y
  )
}
step1Coppejans <- optim(
  par = c(zeta[1:57], pi_k, mu, sigma),
  fn = Q2,
  method = "BFGS",
  control = list(maxit = 500)
)
```
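As noted above, only the bound on $\sigma$ is imposed, so the remaining constraints can be checked after the optimisation. A minimal sketch of such a check, assuming `c2Psi_ub` bounds the second derivative of $\Psi_k$ (its construction from $\max_v|\varphi'(v)| = e^{-1/2}/\sqrt{2\pi}$ suggests this interpretation):

```r
# post-optimisation check of the constraints that were not imposed
parHat <- step1Coppejans$par
pikHat <- c(parHat[58:59], 1 - sum(parHat[58:59]))  # implied mixing weights
muHat <- parHat[60:62]
sigmaHat <- parHat[63]
sigmaHat >= sigma_lb                                # lower bound on sigma

# Psi_k''(u) = sum_j pi_j phi'((u - mu_j)/sigma) / sigma^2, with phi'(v) = -v phi(v)
idxHat <- -x %*% c(parHat[1:57], 1)
uGrid <- seq(min(idxHat), max(idxHat), length.out = 1000)
psi2 <- rowSums(sapply(1:3, function(j) {
  v <- (uGrid - muHat[j]) / sigmaHat
  pikHat[j] * (-v * dnorm(v)) / sigmaHat^2
}))
max(abs(psi2)) <= c2Psi_ub
```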
```{r}
step1Estimates <- data.table(
  varName = names(step1Probit$coefficients),
  probitEstimate = step1Probit$coefficients / step1Probit$coefficients["cogScore"]
) %>%
  merge(
    data.table(
      varName = names(step1Logit$coefficients),
      logitEstimate = step1Logit$coefficients / step1Logit$coefficients["cogScore"]
    ),
    by = "varName"
  ) %>%
  merge(
    # the Coppejans estimates are already normalised: zeta_{cogScore} is fixed at 1
    data.table(
      varName = c(names(step1Coppejans$par[1:57]), "cogScore"),
      coppejansEstimate = c(step1Coppejans$par[1:57], 1)
    ),
    by = "varName"
  )
```
Comparing the parameter estimates across the probit, logit and @coppejans_estimation_2001 approaches, the signs of the estimates all agree, except for the intercept, which is positive for the probit and negative for the other two.
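The number of mixture components is itself a tuning choice. Below is a sketch of the criterion generalised to arbitrary $k$, which would let us compare the minimised objective across choices of $k$; the parameter layout mirrors `Q2` above and is an assumption.

```r
# least squares criterion for a general number of mixture components k
# theta = (free zeta coefficients, k - 1 mixing weights, k means, sigma)
Qk <- function(theta, k, x, y) {
  p <- ncol(x) - 1                   # free zeta coefficients (the last is fixed at 1)
  zeta <- c(theta[1:p], 1)
  pik <- theta[(p + 1):(p + k - 1)]
  pik <- c(pik, 1 - sum(pik))        # mixing weights sum to one
  mu <- theta[(p + k):(p + 2 * k - 1)]
  sigma <- theta[p + 2 * k]
  idx <- as.vector(-x %*% zeta)
  Psi <- rowSums(sapply(seq_len(k), function(j)
    pik[j] * pnorm(idx, mean = mu[j], sd = sigma)
  ))
  mean((y - (1 - Psi))^2)
}

# e.g. k = 7, as suggested by @coppejans_estimation_2001:
# optim(par = c(zeta[1:57], rep(1/7, 6), rep(0, 7), sigma_p), fn = Qk,
#       k = 7, x = x, y = y, method = "BFGS", control = list(maxit = 500))
```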
The idea behind step 2 is to use a control function approach à la @heckman_common_1976, but replacing the inverse Mills ratio with a function of the index from step 1.
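For contrast, here is a minimal sketch of the classical correction being replaced (illustrative only, reusing `step1Formula` from above; `heckmanProbit`, `xbHat` and `imr` are hypothetical names):

```r
# classical Heckman-style correction: first-stage probit, then the
# inverse Mills ratio phi(x'b)/Phi(x'b) as the control function for D = 1
heckmanProbit <- glm(
  step1Formula,
  data = dtLsype4dHM,
  family = binomial(link = "probit"),
  na.action = na.exclude
)
xbHat <- qnorm(fitted(heckmanProbit))  # recover the linear index x'b
imr <- dnorm(xbHat) / pnorm(xbHat)     # inverse Mills ratio
```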
A possibility suggested in @newey_two-step_2009 is to use a power series in the index itself: borrowing the notation from that paper and letting $\hat{v} = x'\hat{\zeta}$, the control function is $h(\hat{v}) = \sum_{k=1}^K \hat{v}^k$, i.e. the powers $\hat{v}, \hat{v}^2, \dots, \hat{v}^K$ enter the outcome regression as additional regressors.
We can then estimate $(\beta_1, \beta_0)$ directly by least squares in regressions of the form
$$
\begin{aligned}
y_1 &= x'\beta_1 + h(\hat{v}) \\
y_0 &= x'\beta_0 + h(1-\hat{v})
\end{aligned}
$$
Not completely sure about the second equation here: is that the correct control function? (Note that, with an intercept included, polynomials in $1-\hat{v}$ span the same space as polynomials in $\hat{v}$, so the fitted values are unaffected by this choice.)
```{r}
vhatVec <- x %*% c(step1Coppejans$par[1:57], 1)
dtLsype4dHM[, vhat1 := vhatVec[, 1]]
dtLsype4dHM[, one_vhat1 := 1 - vhat1]
# powers of the index for the series control function
dtLsype4dHM[, paste0("vhat", 2:6) := .(vhat1^2, vhat1^3, vhat1^4, vhat1^5, vhat1^6)]
dtLsype4dHM[, paste0("one_vhat", 2:6) := .(
  one_vhat1^2, one_vhat1^3, one_vhat1^4, one_vhat1^5, one_vhat1^6
)]

step2FormulaD1 <- as.formula(
  paste0('logHourWage ~ ',
         paste0(c(xControls, "betterPay", paste0("vhat", 1:6)), collapse = ' + '))
)
step2FormulaD0 <- as.formula(
  paste0('logHourWage ~ ',
         paste0(c(xControls, "betterPay", paste0("one_vhat", 1:6)), collapse = ' + '))
)
step2Formula_noVhat <- as.formula(
  paste0('logHourWage ~ ', paste0(c(xControls, "betterPay"), collapse = ' + '))
)

step2d1 <- lm(formula = step2FormulaD1, data = dtLsype4dHM[degree == TRUE])
step2d1_noVhat <- lm(formula = step2Formula_noVhat, data = dtLsype4dHM[degree == TRUE])
step2d0 <- lm(formula = step2FormulaD0, data = dtLsype4dHM[degree == FALSE])
step2d0_noVhat <- lm(formula = step2Formula_noVhat, data = dtLsype4dHM[degree == FALSE])

step2Estimates <- data.table(
  varName = names(step2d1$coefficients),
  d1estimates = step2d1$coefficients
) %>%
  merge(
    data.table(
      varName = names(step2d1_noVhat$coefficients),
      d1estimates_noVhat = step2d1_noVhat$coefficients
    ),
    by = "varName"
  ) %>%
  merge(
    data.table(
      varName = names(step2d0$coefficients),
      d0estimates = step2d0$coefficients
    ),
    by = "varName"
  ) %>%
  merge(
    data.table(
      varName = names(step2d0_noVhat$coefficients),
      d0estimates_noVhat = step2d0_noVhat$coefficients
    ),
    by = "varName"
  )
```
The final step of the estimation is the novel third stage of @dhaultfoeuille_inference_2013, which builds on the logic in their identification proof.
First, define $\alpha_0 \equiv \beta_{01} - \beta_{11} + \gamma_{01}$. Estimating $\alpha_0$ is sufficient to recover $\gamma_0$, since $\zeta_0 = (\beta_1 - \beta_0 - \gamma_0)/(-\alpha_0)$ implies $\gamma_0 = \beta_1 - \beta_0 + \alpha_0 \zeta_0$.
The assumed index model implies that $E[D|X]$ and $E[\varepsilon|X]$ depend on $X$ only through the index $U \equiv X'\zeta_0$. Therefore, define $q_0(u) = E[D|U=u]$ and $g_0(u) = E[\varepsilon|U=u]$, so that
$$
g_0'(U) = q_0'(U)(\delta_0 + \alpha_0 U).
$$
Integrating between $u_0$ and $U$ gives
$$
g_0(U) = \tilde{\lambda}_0 + q_0(U)\delta_0 + \alpha_0\int_{u_0}^U u\,q_0'(u)\,du
       = \lambda_0 + q_0(U)\delta_0 + \left[ q_0(U)U - \int_{u_0}^U q_0(u)\,du \right]\alpha_0,
$$
which can also be written
$$
\varepsilon = \lambda_0 + D\delta_0 + \left[ DU - \int_{u_0}^U q_0(u)\,du \right]\alpha_0 + \xi, \qquad E[\xi|X] = E[\xi|U] = 0.
$$
To estimate this equation, let $\theta_0 = (\lambda_0, \delta_0, \alpha_0)'$, $V = DU - \int_{u_0}^U q_0(u)\,du$ and $W = (1, D, V)'$, so that $\varepsilon = W'\theta_0 + \xi$. $D$ and $V$ are endogenous, as $D$ depends on both $U$ and $\tilde{\eta}_{\Delta}$. Therefore, use functions of $U$ as instruments for $D$ and $V$: $Z = h(U) = (1, h_1(U), h_2(U))'$. Then $\theta_0 = E(ZW')^{-1}E(Z\varepsilon)$, and an estimator is
$$
\hat{\theta} = \left(\frac{1}{n} \sum_{i=1}^n \hat{Z}_i\hat{W}_i'\right)^{-1} \left(\frac{1}{n} \sum_{i=1}^n \hat{Z}_i\hat{\varepsilon}_i\right).
$$
Following the method in @dhaultfoeuille_inference_2013 and its online appendix, we have:
$$
\begin{aligned}
\hat{\varepsilon}_i &= Y_i - X_i'(D_i\hat{\beta}_1 + (1-D_i)\hat{\beta}_0), \qquad \hat{W}_i = (1, D_i, \hat{V}_i)', \\
\hat{V}_i &= D_i\hat{U}_i - \int_{\hat{u}_0}^{\hat{U}_i} \hat{q}(u, \hat{\zeta})\,du, \qquad \hat{Z}_i = h(\hat{U}_i), \qquad \hat{U}_i = X_i'\hat{\zeta}, \\
\hat{q}(u,\zeta) &= \frac{\sum_{i=1}^n D_i K\left(\frac{u - X_i'\zeta}{h_n}\right)}{\sum_{i=1}^n K\left(\frac{u - X_i'\zeta}{h_n}\right)}.
\end{aligned}
$$
@dhaultfoeuille_inference_2013 choose a quartic kernel, $K(v) = (15/16)(1-v^2)^2\mathds{1}_{[-1,1]}(v)$, with bandwidth $h_n = 0.5\,\sigma(\hat{U})\,n^{-1/7}$, where $\sigma(\hat{U})$ is the estimated standard deviation of $\hat{U}$. Finally, the functions instrumenting $D$ and $V$ are
$$
h_1(x) = \Phi(\hat{a}_0 + \hat{a}_1 x), \qquad
h_2(x) = x\,h_1(x) - \int_{\hat{u}_0}^x \hat{q}(u, \hat{\zeta})\,du,
$$
where $(\hat{a}_0, \hat{a}_1)$ is the probit estimator of $D$ on $(1, \hat{U})$.
```{r}
n <- 1593  # estimation sample size
sigmaU <- sd(vhatVec[, 1])
Xzeta <- vhatVec[, 1]
u <- Xzeta
d <- as.numeric(y)

# Nadaraya-Watson estimator q_hat(u) = E[D|U = u] with the quartic kernel
# K(v) = (15/16)(1 - v^2)^2 1{|v| <= 1} and bandwidth h_n = 0.5 sigma(U) n^(-1/7)
hN <- .5 * sigmaU * n^(-1/7)
Kquartic <- function(v) (15 / 16) * (1 - v^2)^2 * (abs(v) <= 1)
q0 <- function(u) {
  # Ku[i, j] = K((u_j - Xzeta_i) / h_n)
  Ku <- Kquartic(outer(-Xzeta, u, `+`) / hN)
  colSums(d * Ku) / colSums(Ku)
}
u0 <- min(Xzeta)

# integral of q_hat from u_0 to each evaluation point
intq0 <- function(u) {
  res <- numeric(length(u))
  for (i in seq_along(u)) {
    res[[i]] <- integrate(q0, lower = u0, upper = u[[i]])$value
  }
  res
}
V <- d * u - intq0(u)
```
```{r}
# probit of D on (1, U) gives (a0_hat, a1_hat) for the instrument functions
aModel <- glm(d ~ u, family = binomial(link = "probit"))
a0 <- aModel$coefficients[[1]]
a1 <- aModel$coefficients[[2]]
h1 <- pnorm(a0 + a1 * u)
h2 <- u * h1 - intq0(u)
Z <- cbind(1, h1, h2)
```
```{r}
b0 <- step2d0$coefficients[1:30]
b1 <- step2d1$coefficients[1:30]
xWage <- model.matrix(object = step2Formula_noVhat, data = dtLsype4dHM)

# eps_i = Y_i - X_i'(D_i beta1 + (1 - D_i) beta0); each row of b is b1 or b0
b <- d * matrix(rep(b1, each = n), nrow = n) +
  (1 - d) * matrix(rep(b0, each = n), nrow = n)
eps <- y - rowSums(xWage * b)

W <- cbind(1, d, V)
# theta_hat = (Z'W)^{-1} Z'eps, the sample analogue of E(ZW')^{-1} E(Z eps)
theta <- solve(crossprod(Z, W), crossprod(Z, eps))

step3Estimates <- step1Estimates[, .(varName, zeta0 = coppejansEstimate)] %>%
  merge(
    step2Estimates[, .(varName, beta1 = d1estimates, beta0 = d0estimates)],
    by = "varName",
    all = TRUE
  )
step3Estimates[is.na(zeta0), zeta0 := 1]
step3Estimates[is.na(step3Estimates)] <- 0
# gamma_0 = beta_1 - beta_0 + alpha_0 zeta_0, with alpha_0 = theta[[3]]
step3Estimates[, gamma0 := beta1 - beta0 + theta[[3]] * zeta0]
```
The full three-step procedure is wrapped up in the package function `progDHM2013()`.

```{r}
progDHM2013()
```
```{r}
benefitVars <- c(
  'betterJob', 'betterPay', 'betterOpp', 'need4Career', 'showSkills',
  'delayWorkGd', 'socialLifeGd', 'leaveHomeGd', 'keepStudy', 'moreQuals',
  'persDevel', 'moreConfid', 'moreRespect', 'betterLife', 'prepareLife'
)
costVars <- c(
  'tooExpensive', 'getDebt', 'parentsMoney', 'notFinancIndep', 'notWorkEarn',
  'costsGen', 'noGuaranteeGdJb', 'notNeed4Career', 'lessExp', 'tooHard',
  'leaveHomeBad', 'tooLong', 'wasteTime', 'feesEtc', 'stress'
)
xControls <- c(
  'cogScore',  # x1
  'w4SOCMajorMP', 'w4ethgrpMP', 'w4hiqualgMP', 'W5AlevNoYP', 'W5EMA1YP',
  'W4AlevNoYP', 'w4sexMP', 'W4SexYP'
)

# testing the function manually: bind its arguments in the global environment
dt <- dtLsype4dHM
varList <- c(xControls, benefitVars, costVars)
x1 <- "cogScore"
d <- "degree"
y <- "logHourWage"
ivY1 <- c(benefitVars, costVars)[-2]
ivY0 <- c(benefitVars, costVars)[-2]
```
```{r}
resLsype <- progDHM2013(
  dt = dtLsype4dHM,
  varList = c(xControls, benefitVars, costVars),
  x1 = "cogScore",
  d = "degree",
  y = "logHourWage",
  ivY1 = c(benefitVars, costVars)[-2],
  ivY0 = c(benefitVars, costVars)[-2],
  bw = 2.5
)
# use_data(resLsype, overwrite = TRUE)
```
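The chunks below rely on the structure of the returned object: an estimates table `resLsype$dtEstimates` (with columns `varName`, `zeta0`, `beta0`, `beta1`, `gamma0`, `delta0` and `alpha0`) and the model matrix `resLsype$x`. A quick way to confirm this:

```r
# inspect the top-level structure of the result object
str(resLsype, max.level = 1)
```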
```{r}
# with resLsype loaded
gamma0 <- resLsype$dtEstimates[, gamma0]
names(gamma0) <- resLsype$dtEstimates[, varName]
gamma0 <- gamma0[colnames(resLsype$x[, 2:length(gamma0)])]
delta0 <- resLsype$dtEstimates[, unique(delta0)]

# non-pecuniary "value" delta_0 + x'gamma_0 for each individual
npValue <- colSums(t(resLsype$x) * c(delta0, gamma0))

# plot its density
plot(density(-npValue))

# plot the (rescaled) density with the observed choices overlaid
d <- as.numeric(dtLsype4dHM[, degree])
npDensity <- density(-npValue)
plot(npDensity$x, npDensity$y * 30, type = "l")
points(-npValue, d)
```
Finally, we turn to the distribution of ex ante returns, $F_{\Delta}(u) = E[F_{\eta_{\Delta}}(u + X'(\beta_0 - \beta_1))]$. Note that
$$
P(D = 0|X) = F_{\eta_{\Delta}}(\delta_0 + \alpha_0 X'\zeta_0),
$$
so we can estimate $F_{\eta_{\Delta}}$ by nonparametrically regressing $1-D$ on the index $\hat{\delta}_0 + \hat{\alpha}_0 X'\hat{\zeta}$.
```{r}
library(np)

alpha0 <- resLsype$dtEstimates[, unique(alpha0)]
delta0 <- resLsype$dtEstimates[, unique(delta0)]
zeta0 <- resLsype$dtEstimates[, zeta0]
names(zeta0) <- resLsype$dtEstimates[, varName]
zeta0 <- zeta0[colnames(resLsype$x[, 2:length(zeta0)])]
d <- as.numeric(dtLsype4dHM[, degree])

# index delta_0 + alpha_0 * X'zeta_0, then nonparametric regression of 1 - D on it
DeltaIndex <- colSums(t(resLsype$x) * c(delta0, alpha0 * zeta0))
deltaModel <- np::npreg(1 - d ~ DeltaIndex)
plot(deltaModel)

beta0 <- resLsype$dtEstimates[, beta0]
names(beta0) <- resLsype$dtEstimates[, varName]
beta0 <- beta0[colnames(resLsype$x[, 2:length(zeta0)])]
beta1 <- resLsype$dtEstimates[, beta1]
names(beta1) <- resLsype$dtEstimates[, varName]
beta1 <- beta1[colnames(resLsype$x[, 2:length(zeta0)])]

# F_Delta(u) = E[F_etaDelta(u + X'(beta_0 - beta_1))]: evaluate the estimated
# F_etaDelta at each grid point plus X'(beta_0 - beta_1), then average over X
Tx <- colSums(t(resLsype$x[, 2:length(zeta0)]) * (beta0 - beta1))
Gx <- DeltaIndex - Tx
Fdelta <- npreg(
  txdat = DeltaIndex,
  tydat = 1 - d,
  exdat = rep(Gx, each = length(Tx)) + Tx
)
Fd <- colSums(matrix(Fdelta$mean, nrow = length(Tx))) / length(Tx)
plot(x = -Gx, y = Fd)
```