```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup}
# library(dHM2013R)
devtools::load_all()
library(data.table)
library(magrittr)  # for the %>% pipe
```
This vignette documents my attempt to replicate the analysis of @dhaultfoeuille_inference_2013 on British data, specifically the longitudinal cohort studies Next Steps and the 1970 British Cohort Study (BCS70).
Their analysis is based on the extended Roy model, with two sectors, non-pecuniary factors and uncertainty about future earnings:
$$
\begin{aligned}
Y_0 &= X'\beta_0 + \varepsilon_0 \\
Y_1 &= X'\beta_1 + \varepsilon_1 \\
D &= \mathds{1}\{ -\delta_0 + X'(\beta_1 - \beta_0 - \gamma_0) + \eta_{\Delta} > 0 \}
\end{aligned}
$$
where $\varepsilon_k = \eta_k + \nu_k$, $\eta_{\Delta} = \eta_1 - \eta_0$, and $X \perp (\eta_1, \eta_0)$. The young person knows $\eta_k$ at the time they make their decision, but not $\nu_k$.
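To fix ideas, here is a minimal simulation of this data-generating process; all parameter values (and the `*Sim` names) are illustrative assumptions, not estimates from the paper.

```r
# simulate the extended Roy model: two sectors, non-pecuniary factors,
# and earnings uncertainty (the nu_k are unknown when the choice is made)
set.seed(42)
nSim <- 1e4
xSim <- cbind(1, rnorm(nSim))                     # intercept and one continuous x
beta0Sim <- c(1.0, 0.5); beta1Sim <- c(1.2, 0.8)  # illustrative returns
gamma0Sim <- c(0.3, 0.1); delta0Sim <- 0.2        # illustrative non-pecuniary terms
eta0 <- rnorm(nSim, sd = 0.5); eta1 <- rnorm(nSim, sd = 0.5)  # known to the agent
nu0 <- rnorm(nSim, sd = 0.5); nu1 <- rnorm(nSim, sd = 0.5)    # realised later
dSim <- as.numeric(
  -delta0Sim + xSim %*% (beta1Sim - beta0Sim - gamma0Sim) + (eta1 - eta0) > 0
)
ySim <- ifelse(dSim == 1,
               xSim %*% beta1Sim + eta1 + nu1,
               xSim %*% beta0Sim + eta0 + nu0)
mean(dSim)  # share choosing sector 1
```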
@dhaultfoeuille_inference_2013's method is a three-step procedure that relies on $\eta_k$ linking the outcome ($Y_k$) and choice ($D$) equations. The three steps are:

1. Estimate $\zeta_0 = \frac{\beta_1 - \beta_0 - \gamma_0}{\beta_{11} - \beta_{01} - \gamma_{01}}$, normalised so that $\zeta_{01} = 1$, where $x_1$ is a continuous element of $X$, using the approach of @coppejans_estimation_2001.
2. Estimate $\beta_1$ and $\beta_0$ via @newey_two-step_2009's series estimator, which uses the index from step 1 and its power series as a control function.
3. Estimate $\delta_0$ and $\gamma_0$ via @dhaultfoeuille_inference_2013's novel method.
We begin with step 1: estimating $\zeta_0$.
```{r}
benefitVars <- c(
  'betterJob', 'betterPay', 'betterOpp', 'need4Career', 'showSkills',
  'delayWorkGd', 'socialLifeGd', 'leaveHomeGd', 'keepStudy', 'moreQuals',
  'persDevel', 'moreConfid', 'moreRespect', 'betterLife', 'prepareLife'
)
costVars <- c(
  'tooExpensive', 'getDebt', 'parentsMoney', 'notFinancIndep', 'notWorkEarn',
  'costsGen', 'noGuaranteeGdJb', 'notNeed4Career', 'lessExp', 'tooHard',
  'leaveHomeBad', 'tooLong', 'wasteTime', 'feesEtc', 'stress'
)
xControls <- c(
  'w4SOCMajorMP', 'w4ethgrpMP', 'w4hiqualgMP', 'W5AlevNoYP', 'W5EMA1YP',
  'W4AlevNoYP', 'w4sexMP', 'W4SexYP', 'cogScore'
)

logitControlsFit <- c(benefitVars, costVars, xControls)
step1Formula <- as.formula(
  paste0('degree ~ ', paste0(logitControlsFit, collapse = ' + '))
)

step1Probit <- glm(
  step1Formula,
  data = dtLsype4dHM,
  family = binomial(link = 'probit'),
  na.action = na.exclude
)
summary(step1Probit)

step1Logit <- glm(
  step1Formula,
  data = dtLsype4dHM,
  family = binomial(),
  na.action = na.exclude
)
summary(step1Logit)
```
@coppejans_estimation_2001 proposes the mixture of distributions (MOD) estimator, which allows estimation of binary response models when the error distribution is unknown. Rather than assuming a logistic or Gaussian error distribution, the MOD estimator employs a mixture of smooth distributions. The method relies on a single-index structure, which we have in the extended Roy model here.
The estimator minimises
$$
Q_n(\omega(x; \Psi_k, \zeta)) = \frac{1}{n}\sum_{i=1}^n \left\{ y_i - \left[ 1 - \Psi_k(-x_i'\zeta; a, b, \{\mu_j, \pi_j\}, \sigma_k) \right] \right\}^2
$$
with respect to $(\zeta, a, b, \{\mu_j, \pi_j\}, \sigma_k)$, where
$$
\Psi_k(u; a, b, \{\mu_j, \pi_j\}, \sigma_k) = a + (b-a)\sum_{j=1}^k \pi_j H\left( \frac{u - \mu_j}{\sigma_k} \right).
$$
Here $k$ is the number of mixing components, $H(\cdot)$ is some smooth distribution function (e.g. $\Phi(\cdot)$), the $\pi_j$, with $0 \leq \pi_j \leq 1$ and $\sum_{j=1}^k \pi_j = 1$, are mixing weights, $\mu_j$ is a location parameter, $a$ is an intercept and $b-a$ a slope term (we can set $b=1$ and $a=0$ here).
The optimisation is subject to constraints: the mixing weights must lie in the unit simplex (as above), $\sigma_k$ is bounded below, and the derivatives of $\Psi_k$ are bounded. However, for computation we only impose the bound on $\sigma$ and check the bound on the derivative after the optimisation.
@coppejans_estimation_2001 suggests picking $k = 7$ for 250 observations, though @dhaultfoeuille_inference_2013 use $k = 3$; we follow the latter here.
```{r}
# constraint bounds (n = 1593)
sigma_p <- sd(step1Probit$residuals)
sigma_lb <- min(.5 * sigma_p / (1593^.2), 0.5)
c2Psi_ub <- min(
  (exp(-.5) / sqrt(2 * pi)) * max(1/4, (.5 * sigma_p / (1593^.2))^(-1.5)),
  1000
)
```
```{r}
k <- 3
pi_k <- rep(1/k, k - 1)
mu <- rep(0, k)
sigma <- sigma_p
# starting values: probit coefficients normalised on cogScore (x1)
zeta <- step1Probit$coefficients[2:59] / step1Probit$coefficients[[59]]
x <- model.matrix(
  object = step1Probit$formula,
  data = step1Probit$model
)[, 2:59]
x_x1 <- x / x[, "cogScore"]
y <- step1Probit$y
y_x1 <- y / x[, "cogScore"]

# least squares criterion with k = 3 Gaussian mixture components
Q <- function(zeta, pik, mu, sigma, x, y) {
  mean((y - (1 - (
    pik[[1]] * pnorm(-x %*% zeta, mean = mu[[1]], sd = sigma) +
      pik[[2]] * pnorm(-x %*% zeta, mean = mu[[2]], sd = sigma) +
      (1 - pik[[1]] - pik[[2]]) * pnorm(-x %*% zeta, mean = mu[[3]], sd = sigma)
  )))^2)
}
Q2 <- function(theta) {
  Q(
    zeta = c(theta[1:57], 1),  # zeta_{cogScore} fixed at 1
    pik = theta[58:59],
    mu = theta[60:62],
    sigma = theta[63],
    x = x,
    y = y
  )
}
step1Coppejans <- optim(
  par = c(zeta[1:57], pi_k, mu, sigma),
  fn = Q2,
  method = "BFGS",
  control = list(maxit = 500)
)
```
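As noted above, only the bound on $\sigma$ is imposed, so the remaining constraints can be checked after the optimisation. A minimal sketch of such a check, assuming `c2Psi_ub` bounds the second derivative of $\Psi_k$ (its construction from $\max_v|\varphi'(v)| = e^{-1/2}/\sqrt{2\pi}$ suggests this interpretation):

```r
# post-optimisation check of the constraints that were not imposed
parHat <- step1Coppejans$par
pikHat <- c(parHat[58:59], 1 - sum(parHat[58:59]))  # implied mixing weights
muHat <- parHat[60:62]
sigmaHat <- parHat[63]
sigmaHat >= sigma_lb                                # lower bound on sigma

# Psi_k''(u) = sum_j pi_j phi'((u - mu_j)/sigma) / sigma^2, with phi'(v) = -v phi(v)
idxHat <- -x %*% c(parHat[1:57], 1)
uGrid <- seq(min(idxHat), max(idxHat), length.out = 1000)
psi2 <- rowSums(sapply(1:3, function(j) {
  v <- (uGrid - muHat[j]) / sigmaHat
  pikHat[j] * (-v * dnorm(v)) / sigmaHat^2
}))
max(abs(psi2)) <= c2Psi_ub
```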
```{r}
step1Estimates <- data.table(
  varName = names(step1Probit$coefficients),
  probitEstimate = step1Probit$coefficients / step1Probit$coefficients["cogScore"]
) %>%
  merge(
    data.table(
      varName = names(step1Logit$coefficients),
      logitEstimate = step1Logit$coefficients / step1Logit$coefficients["cogScore"]
    ),
    by = "varName"
  ) %>%
  merge(
    # the Coppejans estimates are already normalised: zeta_{cogScore} is fixed at 1
    data.table(
      varName = c(names(step1Coppejans$par[1:57]), "cogScore"),
      coppejansEstimate = c(step1Coppejans$par[1:57], 1)
    ),
    by = "varName"
  )
```
Comparing the parameter estimates across the probit, logit and @coppejans_estimation_2001 approaches, the signs of the estimates all agree, except for the intercept, which is positive for the probit and negative for the other two.
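The number of mixture components is itself a tuning choice. Below is a sketch of the criterion generalised to arbitrary $k$, which would let us compare the minimised objective across choices of $k$; the parameter layout mirrors `Q2` above and is an assumption.

```r
# least squares criterion for a general number of mixture components k
# theta = (free zeta coefficients, k - 1 mixing weights, k means, sigma)
Qk <- function(theta, k, x, y) {
  p <- ncol(x) - 1                   # free zeta coefficients (the last is fixed at 1)
  zeta <- c(theta[1:p], 1)
  pik <- theta[(p + 1):(p + k - 1)]
  pik <- c(pik, 1 - sum(pik))        # mixing weights sum to one
  mu <- theta[(p + k):(p + 2 * k - 1)]
  sigma <- theta[p + 2 * k]
  idx <- as.vector(-x %*% zeta)
  Psi <- rowSums(sapply(seq_len(k), function(j)
    pik[j] * pnorm(idx, mean = mu[j], sd = sigma)
  ))
  mean((y - (1 - Psi))^2)
}

# e.g. k = 7, as suggested by @coppejans_estimation_2001:
# optim(par = c(zeta[1:57], rep(1/7, 6), rep(0, 7), sigma_p), fn = Qk,
#       k = 7, x = x, y = y, method = "BFGS", control = list(maxit = 500))
```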
The idea behind step 2 is to use a control function approach à la @heckman_common_1976, but replacing the inverse Mills ratio with a function of the index from step 1.
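For contrast, here is a minimal sketch of the classical correction being replaced (illustrative only, reusing `step1Formula` from above; `heckmanProbit`, `xbHat` and `imr` are hypothetical names):

```r
# classical Heckman-style correction: first-stage probit, then the
# inverse Mills ratio phi(x'b)/Phi(x'b) as the control function for D = 1
heckmanProbit <- glm(
  step1Formula,
  data = dtLsype4dHM,
  family = binomial(link = "probit"),
  na.action = na.exclude
)
xbHat <- qnorm(fitted(heckmanProbit))  # recover the linear index x'b
imr <- dnorm(xbHat) / pnorm(xbHat)     # inverse Mills ratio
```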
A possibility suggested in @newey_two-step_2009 is to use a power series in the index itself: borrowing the notation from that paper and letting $\hat{v} = x'\hat{\zeta}$, the control function is $h(\hat{v}) = \sum_{k=1}^K \hat{v}^k$, i.e. the powers $\hat{v}, \hat{v}^2, \dots, \hat{v}^K$ enter the outcome regression as additional regressors.
We can then estimate $(\beta_1, \beta_0)$ directly by least squares in regressions of the form
$$
\begin{aligned}
y_1 &= x'\beta_1 + h(\hat{v}) \\
y_0 &= x'\beta_0 + h(1-\hat{v})
\end{aligned}
$$
Not completely sure about the second equation here: is that the correct control function? (Note that, with an intercept included, polynomials in $1-\hat{v}$ span the same space as polynomials in $\hat{v}$, so the fitted values are unaffected by this choice.)
```{r}
vhatVec <- x %*% c(step1Coppejans$par[1:57], 1)
dtLsype4dHM[, vhat1 := vhatVec[, 1]]
dtLsype4dHM[, one_vhat1 := 1 - vhat1]
# powers of the index for the series control function
dtLsype4dHM[, paste0("vhat", 2:6) := .(vhat1^2, vhat1^3, vhat1^4, vhat1^5, vhat1^6)]
dtLsype4dHM[, paste0("one_vhat", 2:6) := .(
  one_vhat1^2, one_vhat1^3, one_vhat1^4, one_vhat1^5, one_vhat1^6
)]

step2FormulaD1 <- as.formula(
  paste0('logHourWage ~ ',
         paste0(c(xControls, "betterPay", paste0("vhat", 1:6)), collapse = ' + '))
)
step2FormulaD0 <- as.formula(
  paste0('logHourWage ~ ',
         paste0(c(xControls, "betterPay", paste0("one_vhat", 1:6)), collapse = ' + '))
)
step2Formula_noVhat <- as.formula(
  paste0('logHourWage ~ ', paste0(c(xControls, "betterPay"), collapse = ' + '))
)

step2d1 <- lm(formula = step2FormulaD1, data = dtLsype4dHM[degree == TRUE])
step2d1_noVhat <- lm(formula = step2Formula_noVhat, data = dtLsype4dHM[degree == TRUE])
step2d0 <- lm(formula = step2FormulaD0, data = dtLsype4dHM[degree == FALSE])
step2d0_noVhat <- lm(formula = step2Formula_noVhat, data = dtLsype4dHM[degree == FALSE])

step2Estimates <- data.table(
  varName = names(step2d1$coefficients),
  d1estimates = step2d1$coefficients
) %>%
  merge(
    data.table(
      varName = names(step2d1_noVhat$coefficients),
      d1estimates_noVhat = step2d1_noVhat$coefficients
    ),
    by = "varName"
  ) %>%
  merge(
    data.table(
      varName = names(step2d0$coefficients),
      d0estimates = step2d0$coefficients
    ),
    by = "varName"
  ) %>%
  merge(
    data.table(
      varName = names(step2d0_noVhat$coefficients),
      d0estimates_noVhat = step2d0_noVhat$coefficients
    ),
    by = "varName"
  )
```
The final step of the estimation is the novel third stage of @dhaultfoeuille_inference_2013, which builds on the logic in their identification proof.
First, define $\alpha_0 \equiv \beta_{01} - \beta_{11} + \gamma_{01}$. Estimating $\alpha_0$ is sufficient to recover $\gamma_0$, since $\zeta_0 = (\beta_1 - \beta_0 - \gamma_0)/(-\alpha_0)$ implies $\gamma_0 = \beta_1 - \beta_0 + \alpha_0 \zeta_0$.
The assumed index model implies that $E[D|X]$ and $E[\varepsilon|X]$ depend on $X$ only through the index $U \equiv X'\zeta_0$. Therefore, define $q_0(u) = E[D|U=u]$ and $g_0(u) = E[\varepsilon|U=u]$, so that
$$
g_0'(U) = q_0'(U)(\delta_0 + \alpha_0 U).
$$
Integrating between $u_0$ and $U$ gives
$$
g_0(U) = \tilde{\lambda}_0 + q_0(U)\delta_0 + \alpha_0\int_{u_0}^U u\,q_0'(u)\,du
       = \lambda_0 + q_0(U)\delta_0 + \left[ q_0(U)U - \int_{u_0}^U q_0(u)\,du \right]\alpha_0,
$$
which can also be written
$$
\varepsilon = \lambda_0 + D\delta_0 + \left[ DU - \int_{u_0}^U q_0(u)\,du \right]\alpha_0 + \xi, \qquad E[\xi|X] = E[\xi|U] = 0.
$$
To estimate this equation, let $\theta_0 = (\lambda_0, \delta_0, \alpha_0)'$, $V = DU - \int_{u_0}^U q_0(u)\,du$ and $W = (1, D, V)'$, so that $\varepsilon = W'\theta_0 + \xi$. $D$ and $V$ are endogenous, as $D$ depends on both $U$ and $\tilde{\eta}_{\Delta}$. Therefore, use functions of $U$ as instruments for $D$ and $V$: $Z = h(U) = (1, h_1(U), h_2(U))'$. Then $\theta_0 = E(ZW')^{-1}E(Z\varepsilon)$, and an estimator is
$$
\hat{\theta} = \left(\frac{1}{n} \sum_{i=1}^n \hat{Z}_i\hat{W}_i'\right)^{-1} \left(\frac{1}{n} \sum_{i=1}^n \hat{Z}_i\hat{\varepsilon}_i\right).
$$
Following the method in @dhaultfoeuille_inference_2013 and its online appendix, we have:
$$
\begin{aligned}
\hat{\varepsilon}_i &= Y_i - X_i'(D_i\hat{\beta}_1 + (1-D_i)\hat{\beta}_0), \qquad \hat{W}_i = (1, D_i, \hat{V}_i)', \\
\hat{V}_i &= D_i\hat{U}_i - \int_{\hat{u}_0}^{\hat{U}_i} \hat{q}(u, \hat{\zeta})\,du, \qquad \hat{Z}_i = h(\hat{U}_i), \qquad \hat{U}_i = X_i'\hat{\zeta}, \\
\hat{q}(u,\zeta) &= \frac{\sum_{i=1}^n D_i K\left(\frac{u - X_i'\zeta}{h_n}\right)}{\sum_{i=1}^n K\left(\frac{u - X_i'\zeta}{h_n}\right)}.
\end{aligned}
$$
@dhaultfoeuille_inference_2013 choose a quartic kernel, $K(v) = (15/16)(1-v^2)^2\mathds{1}_{[-1,1]}(v)$, with bandwidth $h_n = 0.5\,\sigma(\hat{U})\,n^{-1/7}$, where $\sigma(\hat{U})$ is the estimated standard deviation of $\hat{U}$. Finally, the functions instrumenting $D$ and $V$ are
$$
h_1(x) = \Phi(\hat{a}_0 + \hat{a}_1 x), \qquad
h_2(x) = x\,h_1(x) - \int_{\hat{u}_0}^x \hat{q}(u, \hat{\zeta})\,du,
$$
where $(\hat{a}_0, \hat{a}_1)$ is the probit estimator of $D$ on $(1, \hat{U})$.
```{r}
n <- 1593  # estimation sample size
sigmaU <- sd(vhatVec[, 1])
Xzeta <- vhatVec[, 1]
u <- Xzeta
d <- as.numeric(y)

# Nadaraya-Watson estimator q_hat(u) = E[D|U = u] with the quartic kernel
# K(v) = (15/16)(1 - v^2)^2 1{|v| <= 1} and bandwidth h_n = 0.5 sigma(U) n^(-1/7)
hN <- .5 * sigmaU * n^(-1/7)
Kquartic <- function(v) (15 / 16) * (1 - v^2)^2 * (abs(v) <= 1)
q0 <- function(u) {
  # Ku[i, j] = K((u_j - Xzeta_i) / h_n)
  Ku <- Kquartic(outer(-Xzeta, u, `+`) / hN)
  colSums(d * Ku) / colSums(Ku)
}
u0 <- min(Xzeta)

# integral of q_hat from u_0 to each evaluation point
intq0 <- function(u) {
  res <- numeric(length(u))
  for (i in seq_along(u)) {
    res[[i]] <- integrate(q0, lower = u0, upper = u[[i]])$value
  }
  res
}
V <- d * u - intq0(u)
```
```{r}
# probit of D on (1, U) gives (a0_hat, a1_hat) for the instrument functions
aModel <- glm(d ~ u, family = binomial(link = "probit"))
a0 <- aModel$coefficients[[1]]
a1 <- aModel$coefficients[[2]]
h1 <- pnorm(a0 + a1 * u)
h2 <- u * h1 - intq0(u)
Z <- cbind(1, h1, h2)
```
```{r}
b0 <- step2d0$coefficients[1:30]
b1 <- step2d1$coefficients[1:30]
xWage <- model.matrix(object = step2Formula_noVhat, data = dtLsype4dHM)

# eps_i = Y_i - X_i'(D_i beta1 + (1 - D_i) beta0); each row of b is b1 or b0
b <- d * matrix(rep(b1, each = n), nrow = n) +
  (1 - d) * matrix(rep(b0, each = n), nrow = n)
eps <- y - rowSums(xWage * b)

W <- cbind(1, d, V)
# theta_hat = (Z'W)^{-1} Z'eps, the sample analogue of E(ZW')^{-1} E(Z eps)
theta <- solve(crossprod(Z, W), crossprod(Z, eps))

step3Estimates <- step1Estimates[, .(varName, zeta0 = coppejansEstimate)] %>%
  merge(
    step2Estimates[, .(varName, beta1 = d1estimates, beta0 = d0estimates)],
    by = "varName",
    all = TRUE
  )
step3Estimates[is.na(zeta0), zeta0 := 1]
step3Estimates[is.na(step3Estimates)] <- 0
# gamma_0 = beta_1 - beta_0 + alpha_0 zeta_0, with alpha_0 = theta[[3]]
step3Estimates[, gamma0 := beta1 - beta0 + theta[[3]] * zeta0]
```
The full three-step procedure is wrapped up in the package function `progDHM2013()`.

```{r}
progDHM2013()
```
```{r}
benefitVars <- c(
  'betterJob', 'betterPay', 'betterOpp', 'need4Career', 'showSkills',
  'delayWorkGd', 'socialLifeGd', 'leaveHomeGd', 'keepStudy', 'moreQuals',
  'persDevel', 'moreConfid', 'moreRespect', 'betterLife', 'prepareLife'
)
costVars <- c(
  'tooExpensive', 'getDebt', 'parentsMoney', 'notFinancIndep', 'notWorkEarn',
  'costsGen', 'noGuaranteeGdJb', 'notNeed4Career', 'lessExp', 'tooHard',
  'leaveHomeBad', 'tooLong', 'wasteTime', 'feesEtc', 'stress'
)
xControls <- c(
  'cogScore',  # x1
  'w4SOCMajorMP', 'w4ethgrpMP', 'w4hiqualgMP', 'W5AlevNoYP', 'W5EMA1YP',
  'W4AlevNoYP', 'w4sexMP', 'W4SexYP'
)

# testing the function manually: bind its arguments in the global environment
dt <- dtLsype4dHM
varList <- c(xControls, benefitVars, costVars)
x1 <- "cogScore"
d <- "degree"
y <- "logHourWage"
ivY1 <- c(benefitVars, costVars)[-2]
ivY0 <- c(benefitVars, costVars)[-2]
```
```{r}
resLsype <- progDHM2013(
  dt = dtLsype4dHM,
  varList = c(xControls, benefitVars, costVars),
  x1 = "cogScore",
  d = "degree",
  y = "logHourWage",
  ivY1 = c(benefitVars, costVars)[-2],
  ivY0 = c(benefitVars, costVars)[-2],
  bw = 2.5
)
# use_data(resLsype, overwrite = TRUE)
```
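The chunks below rely on the structure of the returned object: an estimates table `resLsype$dtEstimates` (with columns `varName`, `zeta0`, `beta0`, `beta1`, `gamma0`, `delta0` and `alpha0`) and the model matrix `resLsype$x`. A quick way to confirm this:

```r
# inspect the top-level structure of the result object
str(resLsype, max.level = 1)
```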
```{r}
# with resLsype loaded
gamma0 <- resLsype$dtEstimates[, gamma0]
names(gamma0) <- resLsype$dtEstimates[, varName]
gamma0 <- gamma0[colnames(resLsype$x[, 2:length(gamma0)])]
delta0 <- resLsype$dtEstimates[, unique(delta0)]

# non-pecuniary "value" delta_0 + x'gamma_0 for each individual
npValue <- colSums(t(resLsype$x) * c(delta0, gamma0))

# plot its density
plot(density(-npValue))

# plot the (rescaled) density with the observed choices overlaid
d <- as.numeric(dtLsype4dHM[, degree])
npDensity <- density(-npValue)
plot(npDensity$x, npDensity$y * 30, type = "l")
points(-npValue, d)
```
Finally, we turn to the distribution of ex ante returns, $F_{\Delta}(u) = E[F_{\eta_{\Delta}}(u + X'(\beta_0 - \beta_1))]$. Note that
$$
P(D = 0|X) = F_{\eta_{\Delta}}(\delta_0 + \alpha_0 X'\zeta_0),
$$
so we can estimate $F_{\eta_{\Delta}}$ by nonparametrically regressing $1-D$ on the index $\hat{\delta}_0 + \hat{\alpha}_0 X'\hat{\zeta}$.
```{r}
library(np)

alpha0 <- resLsype$dtEstimates[, unique(alpha0)]
delta0 <- resLsype$dtEstimates[, unique(delta0)]
zeta0 <- resLsype$dtEstimates[, zeta0]
names(zeta0) <- resLsype$dtEstimates[, varName]
zeta0 <- zeta0[colnames(resLsype$x[, 2:length(zeta0)])]
d <- as.numeric(dtLsype4dHM[, degree])

# index delta_0 + alpha_0 * X'zeta_0, then nonparametric regression of 1 - D on it
DeltaIndex <- colSums(t(resLsype$x) * c(delta0, alpha0 * zeta0))
deltaModel <- np::npreg(1 - d ~ DeltaIndex)
plot(deltaModel)

beta0 <- resLsype$dtEstimates[, beta0]
names(beta0) <- resLsype$dtEstimates[, varName]
beta0 <- beta0[colnames(resLsype$x[, 2:length(zeta0)])]
beta1 <- resLsype$dtEstimates[, beta1]
names(beta1) <- resLsype$dtEstimates[, varName]
beta1 <- beta1[colnames(resLsype$x[, 2:length(zeta0)])]

# F_Delta(u) = E[F_etaDelta(u + X'(beta_0 - beta_1))]: evaluate the estimated
# F_etaDelta at each grid point plus X'(beta_0 - beta_1), then average over X
Tx <- colSums(t(resLsype$x[, 2:length(zeta0)]) * (beta0 - beta1))
Gx <- DeltaIndex - Tx
Fdelta <- npreg(
  txdat = DeltaIndex,
  tydat = 1 - d,
  exdat = rep(Gx, each = length(Tx)) + Tx
)
Fd <- colSums(matrix(Fdelta$mean, nrow = length(Tx))) / length(Tx)
plot(x = -Gx, y = Fd)
```