sfaselectioncross: Sample selection in stochastic frontier estimation using cross-section data

View source: R/sfaselectioncross.R

sfaselectioncross {sfaR}    R Documentation

Sample selection in stochastic frontier estimation using cross-section data

Description

sfaselectioncross is a symbolic formula-based function for the estimation of the stochastic frontier model in the presence of sample selection. The model accommodates cross-sectional and pooled cross-sectional data, and can be estimated using different quadrature approaches or maximum simulated likelihood (MSL). See Greene (2010).

Only the half-normal distribution is possible for the one-sided error term. Eleven optimization algorithms are available.

The function also accounts for heteroscedasticity in both one-sided and two-sided error terms, as in Reifschneider and Stevenson (1991), Caudill and Ford (1993), Caudill et al. (1995) and Hadri (1999).

Usage

sfaselectioncross(
  selectionF,
  frontierF,
  uhet,
  vhet,
  modelType = "greene10",
  logDepVar = TRUE,
  data,
  subset,
  weights,
  wscale = TRUE,
  S = 1L,
  udist = "hnormal",
  start = NULL,
  method = "bfgs",
  hessianType = 2L,
  lType = "ghermite",
  Nsub = 100,
  uBound = Inf,
  simType = "halton",
  Nsim = 100,
  prime = 2L,
  burn = 10,
  antithetics = FALSE,
  seed = 12345,
  itermax = 2000,
  printInfo = FALSE,
  intol = 1e-06,
  tol = 1e-12,
  gradtol = 1e-06,
  stepmax = 0.1,
  qac = "marquardt"
)

## S3 method for class 'sfaselectioncross'
print(x, ...)

## S3 method for class 'sfaselectioncross'
bread(x, ...)

## S3 method for class 'sfaselectioncross'
estfun(x, ...)

Arguments

selectionF

A symbolic (formula) description of the selection equation.

frontierF

A symbolic (formula) description of the outcome (frontier) equation.

uhet

A one-part formula to consider heteroscedasticity in the one-sided error variance (see section ‘Details’).

vhet

A one-part formula to consider heteroscedasticity in the two-sided error variance (see section ‘Details’).

modelType

Character string. Model used to correct for selection bias. Only the model discussed in Greene (2010) is currently available.

logDepVar

Logical. Informs whether the dependent variable is logged (TRUE) or not (FALSE). Default = TRUE.

data

The data frame containing the data.

subset

An optional vector specifying a subset of observations to be used in the optimization process.

weights

An optional vector of weights to be used for the weighted log-likelihood. Should be NULL or a numeric vector with positive values. When NULL, a vector of ones is used.

wscale

Logical. When weights is not NULL, a scaling transformation is used such that the weights sum to the sample size. Default = TRUE. When FALSE, no scaling is used.

S

If S = 1 (default), a production (profit) frontier is estimated: \epsilon_i = v_i-u_i. If S = -1, a cost frontier is estimated: \epsilon_i = v_i+u_i.

udist

Character string. Distribution specification for the one-sided error term. Only the half-normal distribution 'hnormal' is currently implemented.

start

Numeric vector. Optional starting values for the maximum likelihood (ML) estimation.

method

Optimization algorithm used for the estimation. Default = 'bfgs'. Eleven algorithms are available:

  • 'bfgs', for Broyden-Fletcher-Goldfarb-Shanno (see maxBFGS)

  • 'bhhh', for Berndt-Hall-Hall-Hausman (see maxBHHH)

  • 'nr', for Newton-Raphson (see maxNR)

  • 'nm', for Nelder-Mead (see maxNM)

  • 'cg', for Conjugate Gradient (see maxCG)

  • 'sann', for Simulated Annealing (see maxSANN)

  • 'ucminf', for a quasi-Newton type optimization with BFGS updating of the inverse Hessian and soft line search with a trust region type monitoring of the input to the line search algorithm (see ucminf)

  • 'mla', for general-purpose optimization based on Marquardt-Levenberg algorithm (see mla)

  • 'sr1', for Symmetric Rank 1 (see trust.optim)

  • 'sparse', for trust regions and sparse Hessian (see trust.optim)

  • 'nlminb', for optimization using PORT routines (see nlminb)

hessianType

Integer. If 1, the analytic Hessian is returned. If 2, the BHHH Hessian is estimated (g'g). The BHHH Hessian is the default, as the estimation is conducted in two steps.

lType

Specifies the way the likelihood is estimated. Five possibilities are available: 'kronrod' for Gauss-Kronrod quadrature (see integrate), 'hcubature' and 'pcubature' for adaptive integration over hypercubes (see hcubature and pcubature), 'ghermite' for Gauss-Hermite quadrature (see gaussHermiteData), and 'msl' for maximum simulated likelihood. Default = 'ghermite'.

Nsub

Integer. Number of subdivisions/nodes used for quadrature approaches. Default Nsub = 100.

uBound

Numeric. Upper bound for the inefficiency component when solving integrals using quadrature approaches except Gauss-Hermite for which the upper bound is automatically infinite (Inf). Default uBound = Inf.

simType

Character string. If simType = 'halton' (Default), Halton draws are used for maximum simulated likelihood (MSL). If simType = 'ghalton', Generalized-Halton draws are used for MSL. If simType = 'sobol', Sobol draws are used for MSL. If simType = 'uniform', uniform draws are used for MSL. (see section ‘Details’).

Nsim

Number of draws for MSL (default 100).

prime

Prime number considered for Halton and Generalized-Halton draws. Default = 2.

burn

Number of initial draws discarded (burned) in the case of Halton draws. Default = 10.

antithetics

Logical. Default = FALSE. If TRUE, the antithetic counterpart of the uniform draws is computed (see section ‘Details’).

seed

Numeric. Seed for the random draws.

itermax

Maximum number of iterations allowed for optimization. Default = 2000.

printInfo

Logical. Print information during optimization. Default = FALSE.

intol

Numeric. Integration tolerance for quadrature approaches (kronrod, hcubature, pcubature).

tol

Numeric. Convergence tolerance. Default = 1e-12.

gradtol

Numeric. Convergence tolerance for gradient. Default = 1e-06.

stepmax

Numeric. Maximum step length for the ucminf algorithm. Default = 0.1.

qac

Character. Quadratic Approximation Correction for 'bhhh' and 'nr' algorithms. If 'stephalving', the step length is decreased but the direction is kept. If 'marquardt' (default), the step length is decreased while also moving closer to the pure gradient direction. See maxBHHH and maxNR.

x

an object of class sfaselectioncross (returned by the function sfaselectioncross).

...

Additional arguments of frontier are passed to sfaselectioncross; additional arguments of the print, bread, estfun, and nobs methods are currently ignored.

Details

The current model extends Heckman's (1976, 1979) sample selection model to nonlinear models, in particular the stochastic frontier model. The model was first discussed in Greene (2010), and an application can be found in Dakpo et al. (2022). Formally, we have:

y_{1i} = \left\{ \begin{array}{ll} 1 & \mbox{if} \quad y_{1i}^* > 0 \\ 0 & \mbox{if} \quad y_{1i}^* \leq 0 \\ \end{array} \right.

where

y_{1i}^*=\mathbf{Z}_{si}^{\prime} \mathbf{\gamma} + w_i, \quad w_i \sim \mathcal{N}(0, 1)

and

y_{2i} = \left\{ \begin{array}{ll} y_{2i}^* & \mbox{if} \quad y_{1i}^* > 0 \\ NA & \mbox{if} \quad y_{1i}^* \leq 0 \\ \end{array} \right.

where

y_{2i}^*=\mathbf{x_{i}^{\prime}} \mathbf{\beta} + v_i - Su_i, \quad v_i = \sigma_vV_i \quad \wedge \quad V_i \sim \mathcal{N}(0, 1), \quad u_i = \sigma_u|U_i| \quad \wedge \quad U_i \sim \mathcal{N}(0, 1)

y_{1i} describes the selection equation while y_{2i} represents the frontier equation. The selection bias arises from the correlation between the two symmetric random components v_i and w_i:

(v_i, w_i) \sim \mathcal{N}_2\left\lbrack(0,0), (1, \rho \sigma_v, \sigma_v^2) \right\rbrack
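
A minimal sketch of this covariance structure (rho and sigma_v below are illustrative values, not estimates): draws of w_i have unit variance, and v_i is built so that Var(v_i) = sigma_v^2 and Cov(v_i, w_i) = rho sigma_v.

## Illustrative draws of (v_i, w_i) with the covariance structure above
set.seed(1)
rho <- 0.7; sigma_v <- 0.5; n <- 1000
w <- rnorm(n)
v <- sigma_v * (rho * w + sqrt(1 - rho^2) * rnorm(n))
cov(v, w)  # close to rho * sigma_v = 0.35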

Conditionally on |U_i|, the probability associated with each observation is:

Pr \left\lbrack y_{1i}^* \leq 0 \right\rbrack^{1-y_{1i}} \cdot \left\lbrace f(y_{2i}|y_{1i}^* > 0) \times Pr\left\lbrack y_{1i}^* > 0 \right\rbrack \right\rbrace^{y_{1i}}

Using the conditional probability formula:

P\left(A\cap B\right) = P(A) \cdot P(B|A) = P(B) \cdot P(A|B)

Therefore:

f(y_{2i}|y_{1i}^* \geq 0) \cdot Pr\left\lbrack y_{1i}^* \geq 0\right\rbrack = f(y_{2i}) \cdot Pr(y_{1i}^* \geq 0|y_{2i})

Using the properties of a bivariate normal distribution, we have:

y_{1i}^* | y_{2i} \sim \mathcal{N}\left(\mathbf{Z_{si}^{\prime}} \bm{\gamma}+\frac{\rho}{\sigma_v}v_i, 1-\rho^2\right)

Hence conditionally on |U_i|, we have:

f(y_{2i}|y_{1i}^* \geq 0) \cdot Pr\left\lbrack y_{1i}^* \geq 0\right\rbrack = \frac{1}{\sigma_v}\phi\left(\frac{v_i}{\sigma_v}\right)\Phi\left(\frac{ \mathbf{Z_{si}^{\prime}} \bm{\gamma}+\frac{\rho}{\sigma_v}v_i}{ \sqrt{1-\rho^2}}\right)

The conditional likelihood is equal to:

L_i\big||U_i| = \Phi(-\mathbf{Z_{si}^{\prime}} \bm{\gamma})^{1-y_{1i}} \times \left\lbrace \frac{1}{\sigma_v}\phi\left(\frac{y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|}{\sigma_v}\right)\Phi\left(\frac{ \mathbf{Z_{si}^{\prime}} \bm{\gamma}+\frac{\rho}{\sigma_v}\left(y_{2i}- \mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|\right)}{\sqrt{1-\rho^2}} \right) \right\rbrace ^{y_{1i}}

Since the non-selected observations bring no additional information, the conditional likelihood to be considered is:

L_i\big||U_i| = \frac{1}{\sigma_v}\phi\left(\frac{y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|}{\sigma_v}\right) \Phi\left(\frac{\mathbf{Z_{si}^{\prime}} \bm{\gamma}+\frac{\rho}{\sigma_v}\left(y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|\right)}{\sqrt{1-\rho^2}}\right)

The unconditional likelihood is obtained by integrating |U_i| out of the conditional likelihood. Thus

L_i = \int_{|U_i|} \frac{1}{\sigma_v}\phi\left(\frac{y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|}{\sigma_v}\right) \Phi\left(\frac{\mathbf{Z_{si}^{\prime}} \bm{\gamma}+\frac{\rho}{\sigma_v}\left(y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|\right)}{\sqrt{1-\rho^2}}\right)p\left(|U_i|\right)d|U_i|
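
For illustration only, a minimal sketch of this integral for a single selected observation, evaluated with Gauss-Kronrod quadrature via integrate() (as with lType = 'kronrod'); eps stands for y_{2i} - x_i'beta, a_i for Z_{si}'gamma, and all parameter values are hypothetical:

## One observation's unconditional likelihood, integrating |U_i| out numerically
lik_i <- function(eps, a_i, rho, sigma_u, sigma_v, S = 1) {
  integrand <- function(absU) {
    dnorm((eps + S * sigma_u * absU) / sigma_v) / sigma_v *
      pnorm((a_i + rho / sigma_v * (eps + S * sigma_u * absU)) / sqrt(1 - rho^2)) *
      2 * dnorm(absU)  # p(|U_i|): density of |N(0, 1)|
  }
  integrate(integrand, lower = 0, upper = Inf)$value
}
lik_i(eps = 0.3, a_i = 0.8, rho = 0.5, sigma_u = 0.6, sigma_v = 0.4)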

To simplify the estimation, the likelihood can be estimated using a two-step approach. In the first step, the probit model is estimated and an estimate of \gamma is obtained. Then, in the second step, the following model is estimated:

L_i = \int_{|U_i|} \frac{1}{\sigma_v}\phi\left(\frac{y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|}{\sigma_v}\right) \Phi\left(\frac{a_i + \frac{\rho}{\sigma_v}\left(y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|\right)}{\sqrt{1-\rho^2}}\right)p\left(|U_i|\right)d|U_i|

where a_i = \mathbf{Z_{si}^{\prime}} \hat{\bm{\gamma}}. This likelihood can be estimated using five different approaches: Gauss-Kronrod quadrature, adaptive integration over hypercubes (hcubature and pcubature), Gauss-Hermite quadrature, and maximum simulated likelihood. The BHHH estimator is used to obtain the asymptotic standard errors of the parameter estimates.
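
sfaselectioncross carries out this first step internally (see lpmObj and probitObj in the section ‘Value’). For clarity only, a minimal sketch of where a_i comes from, assuming a data frame dat with a selection indicator d and selection covariates z1 and z2 (all names are hypothetical):

## First step: probit on the selection equation; the linear predictor is a_i
probit1 <- glm(d ~ z1 + z2, family = binomial(link = "probit"), data = dat)
a_i <- predict(probit1, type = "link")  # Z_si' gamma_hat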

sfaselectioncross allows the maximization of a weighted log-likelihood. When the weights argument is specified and wscale = TRUE, the weights are scaled as:

new_{weights} = sample_{size} \times \frac{old_{weights}}{\sum(old_{weights})}
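
A minimal sketch of this scaling (the weights below are purely illustrative):

## Weights rescaled so that they sum to the sample size (wscale = TRUE)
old_weights <- c(0.2, 0.5, 1.3, 2.0)
new_weights <- length(old_weights) * old_weights / sum(old_weights)
sum(new_weights)  # equals the sample size, here 4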

For complex problems, non-gradient methods (e.g. nm or sann) can be used to warm start the optimization and zoom in on the neighborhood of the solution; a gradient-based method is then recommended in the second step, as sketched below. In the case of sann, we recommend significantly increasing the iteration limit (e.g. itermax = 20000). The conjugate gradient method (cg) can also be used in the first stage.
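
A hypothetical two-stage strategy along these lines, assuming data and formulas as in the Examples section and assuming that coef() returns the full parameter vector in the order expected by start (a sketch, not a recipe):

## Warm start with Nelder-Mead, then refine with BFGS (illustrative only)
warm <- sfaselectioncross(selectionF = d ~ z1 + z2, frontierF = y ~ x1 + x2,
                          data = dat, method = 'nm')
fit  <- sfaselectioncross(selectionF = d ~ z1 + z2, frontierF = y ~ x1 + x2,
                          data = dat, method = 'bfgs', start = coef(warm))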

A set of extractor functions for fitted model objects is available for objects of class 'sfaselectioncross', including methods for the generic functions print, summary, coef, fitted, logLik, residuals, vcov, efficiencies, ic, marginal, estfun and bread (from the sandwich package), and lmtest::coeftest() (from the lmtest package).
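
For instance, assuming selecRes1 is a fitted 'sfaselectioncross' object as in the Examples section:

coef(selecRes1)          # estimated parameters
vcov(selecRes1)          # variance-covariance matrix of the estimates
efficiencies(selecRes1)  # (in-)efficiency estimates
ic(selecRes1)            # information criteria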

Value

sfaselectioncross returns a list of class 'sfaselectioncross' containing the following elements:

call

The matched call.

selectionF

The selection equation formula.

frontierF

The frontier equation formula.

S

The argument 'S'. See the section ‘Arguments’.

typeSfa

Character string. 'Stochastic Production/Profit Frontier, e = v - u' when S = 1 and 'Stochastic Cost Frontier, e = v + u' when S = -1.

Ninit

Number of initial observations in all samples.

Nobs

Number of observations used for optimization.

nXvar

Number of explanatory variables in the production or cost frontier.

logDepVar

The argument 'logDepVar'. See the section ‘Arguments’.

nuZUvar

Number of variables explaining heteroscedasticity in the one-sided error term.

nvZVvar

Number of variables explaining heteroscedasticity in the two-sided error term.

nParm

Total number of parameters estimated.

udist

The argument 'udist'. See the section ‘Arguments’.

startVal

Numeric vector. Starting values for the M(S)L estimation.

dataTable

A data frame (tibble format) containing information on data used for optimization along with residuals and fitted values of the OLS and M(S)L estimations, and the individual observation log-likelihood. When argument weights is specified, an additional variable is provided in dataTable.

lpmObj

Linear probability model used for initializing the first step probit model.

probitObj

Probit model. Object of class 'maxLik' and 'maxim'.

ols2stepParam

Numeric vector. OLS second step estimates for selection correction. Inverse Mills Ratio is introduced as an additional explanatory variable.

ols2stepStder

Numeric vector. Standard errors of OLS second step estimates.

ols2stepSigmasq

Numeric. Estimated variance of OLS second step random error.

ols2stepLoglik

Numeric. Log-likelihood value of OLS second step estimation.

ols2stepSkew

Numeric. Skewness of the residuals of the OLS second step estimation.

ols2stepM3Okay

Logical. Indicating whether the residuals of the OLS second step estimation have the expected skewness.

CoelliM3Test

Coelli's test for OLS residuals skewness. (See Coelli, 1995).

AgostinoTest

D'Agostino's test for OLS residuals skewness. (See D'Agostino and Pearson, 1973).

isWeights

Logical. If TRUE, the weighted log-likelihood is maximized.

lType

Type of likelihood estimated. See the section ‘Arguments’.

optType

Optimization algorithm used.

nIter

Number of iterations of the ML estimation.

optStatus

Optimization algorithm termination message.

startLoglik

Log-likelihood at the starting values.

mlLoglik

Log-likelihood value of the M(S)L estimation.

mlParam

Parameters obtained from M(S)L estimation.

gradient

Each variable gradient of the M(S)L estimation.

gradL_OBS

Matrix. Each variable individual observation gradient of the M(S)L estimation.

gradientNorm

Gradient norm of the M(S)L estimation.

invHessian

Covariance matrix of the parameters obtained from the M(S)L estimation.

hessianType

The argument 'hessianType'. See the section ‘Arguments’.

mlDate

Date and time of the estimated model.

simDist

The argument 'simType', only if lType = 'msl'. See the section ‘Arguments’.

Nsim

The argument 'Nsim', only if lType = 'msl'. See the section ‘Arguments’.

FiMat

Matrix of random draws used for MSL, only if lType = 'msl'.

gHermiteData

List. Gauss-Hermite quadrature rule as provided by gaussHermiteData. Only if lType = 'ghermite'.

Nsub

Number of subdivisions used for quadrature approaches.

uBound

Upper bound for the inefficiency component when solving integrals using quadrature approaches except Gauss-Hermite for which the upper bound is automatically infinite (Inf).

intol

Integration tolerance for quadrature approaches except Gauss-Hermite.

Note

For the Halton draws, the code is adapted from the mlogit package.

References

Caudill, S. B., and Ford, J. M. 1993. Biases in frontier estimation due to heteroscedasticity. Economics Letters, 41(1), 17–20.

Caudill, S. B., Ford, J. M., and Gropper, D. M. 1995. Frontier estimation and firm-specific inefficiency measures in the presence of heteroscedasticity. Journal of Business & Economic Statistics, 13(1), 105–111.

Coelli, T. 1995. Estimators and hypothesis tests for a stochastic frontier function: A Monte Carlo analysis. Journal of Productivity Analysis, 6, 247–268.

D'Agostino, R., and E. S. Pearson. 1973. Tests for departure from normality. Empirical results for the distributions of b_2 and \sqrt{b_1}. Biometrika, 60, 613–622.

Dakpo, K. H., Latruffe, L., Desjeux, Y., Jeanneaux, P., 2022. Modeling heterogeneous technologies in the presence of sample selection: The case of dairy farms and the adoption of agri-environmental schemes in France. Agricultural Economics, 53(3), 422-438.

Greene, W. 2010. A stochastic frontier model with correction for sample selection. Journal of Productivity Analysis, 34, 15–24.

Hadri, K. 1999. Estimation of a doubly heteroscedastic stochastic frontier cost function. Journal of Business & Economic Statistics, 17(3), 359–363.

Heckman, J. 1976. Discrete, qualitative and limited dependent variables. Annals of Economic and Social Measurement, 4, 475–492.

Heckman, J. 1979. Sample selection bias as a specification error. Econometrica, 47(1), 153–161.

Reifschneider, D., and Stevenson, R. 1991. Systematic departures from the frontier: A framework for the analysis of firm inefficiency. International Economic Review, 32(3), 715–723.

See Also

print for printing sfaselectioncross object.

summary for creating and printing summary results.

coef for extracting coefficients of the estimation.

efficiencies for computing (in-)efficiency estimates.

fitted for extracting the fitted frontier values.

ic for extracting information criteria.

logLik for extracting log-likelihood value(s) of the estimation.

marginal for computing marginal effects of inefficiency drivers.

residuals for extracting residuals of the estimation.

vcov for computing the variance-covariance matrix of the coefficients.

bread for bread for sandwich estimator.

estfun for gradient extraction for each observation.

Examples


## Not run: 

## Simulated example

N <- 2000  # sample size
set.seed(12345)
z1 <- rnorm(N)
z2 <- rnorm(N)
v1 <- rnorm(N)
v2 <- rnorm(N)
e1 <- v1
e2 <- 0.7071 * (v1 + v2)
ds <- z1 + z2 + e1
d <- ifelse(ds > 0, 1, 0)
u <- abs(rnorm(N))
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- x1 + x2 + e2 - u
dat <- data.frame(y = y, x1 = x1, x2 = x2, z1 = z1, z2 = z2, d = d)

## Estimation using quadrature (Gauss-Kronrod)

selecRes1 <- sfaselectioncross(selectionF = d ~ z1 + z2, frontierF = y ~ x1 + x2,
  modelType = 'greene10', method = 'bfgs', logDepVar = TRUE, data = dat,
  S = 1L, udist = 'hnormal', lType = 'kronrod', Nsub = 100, uBound = Inf,
  simType = 'halton', Nsim = 300, prime = 2L, burn = 10, antithetics = FALSE,
  seed = 12345, itermax = 2000, printInfo = FALSE)

summary(selecRes1)

## Estimation using maximum simulated likelihood

selecRes2 <- sfaselectioncross(selectionF = d ~ z1 + z2, frontierF = y ~ x1 + x2,
  modelType = 'greene10', method = 'bfgs', logDepVar = TRUE, data = dat,
  S = 1L, udist = 'hnormal', lType = 'msl', Nsub = 100, uBound = Inf,
  simType = 'halton', Nsim = 300, prime = 2L, burn = 10, antithetics = FALSE,
  seed = 12345, itermax = 2000, printInfo = FALSE)

summary(selecRes2)


## End(Not run)

