View source: R/f_copulacorrection_interface.R
copulaCorrection  R Documentation 
Fits linear models with continuous or discrete endogenous regressors (or a mixture of both) using Gaussian copulas, as presented in Park and Gupta (2012). This is a statistical technique to address the endogeneity problem where no external instrumental variables are needed. The important assumption of the model is that the endogenous variables should NOT be normally distributed, if continuous, preferably with a skewed distribution. The corrections proposed by Qian, Koschmann, and Xie (2024, p.1922) are implemented. These mitigate the bias of the original paper for small and moderate sample sizes.
copulaCorrection(formula, data, num.boots = 1000, verbose = TRUE, ...)
formula 
A symbolic description of the model to be fitted. See the "Details" section for the exact notation. 
data 
A data.frame containing the data of all parts specified in the formula parameter. 
num.boots 
Number of bootstrapping iterations. Defaults to 1000. 
verbose 
Show details about the running of the function. 
... 
Arguments for the loglikelihood optimization function in the case of a single continuous endogenous regressor. Ignored with a warning otherwise.

The underlying idea of the joint estimation method is that using information contained in the observed data, one selects marginal distributions for the endogenous regressor and the structural error term, respectively. Then, the copula model enables the construction of a flexible multivariate joint distribution allowing a wide range of correlations between the two marginals.
Consider the model:
where t=1,..,T
indexes either time or crosssectional units, Y_{t} is a 1x1
response variable,
X_{t} is a kxn
exogenous regressor,
P_{t} is a kx1
continuous endogenous regressor,
ε_{t} is a normally distributed structural error term with mean zero and
E(ε^{2})=σ_{ε}^{2},
\alpha
and \beta
are model parameters.
The marginal distribution of the endogenous regressor P_{t} is obtained using the Epanechnikov kernel density estimator (Epanechnikov, 1969), as below:
where P_{t} is the endogenous regressor,
K(x)=0.75·(1x^{2})·I(x<=1)
and the bandwidth b
is the one proposed by Silverman (1986),
and is equal to b=0.9·T^{1.5}·min(s, IQR/1.34).
IQR
is the interquartile range while s
is the data sample standard deviation
and T
is the number of time periods observed in the data.
After obtaining the joint distribution of the error term and the continuous endogenous regressor, the model parameters are estimated using
maximum likelihood estimation.
The additional parameters used during model fitting and printed in summary
hence are:
rho
The correlation between the endogenous regressor and the error.
sigma
The variance of the model's error.
With more than one continuous endogenous regressor or an endogenous discrete regressor, an alternative approach to the
estimation using Gaussian copula should be applied. This approach is similar to the control function approach (Petrin and Train, 2010).
The core idea is to apply OLS estimation on the original set of explanatory variables in the model equation above, plus an additional regressor
P_{t}*=Φ^{1}(H(P_{t})).
Here, H(P_{t}) is the marginal distribution of the endogenous regressor P
.
Including this regressor solves the correlation between the endogenous regressor and the structural error, \epsilon
,
OLS providing consistent parameter estimates. Due to identification problems, the discrete endogenous regressor cannot have a binomial
distribution.
Hence, only in the case of a single continuous endogenous regressor maximum likelihood estimation is used. In all other cases, augmented OLS based on Gaussian copula is applied. This includes cases of multiple endogenous regressors of both discrete and continuous distributions.
In the case of discrete endogenous regressors, a random seed needs to be assigned because the marginal distribution function of the endogenous regressor is a step function in this case. This means that the value of P* lies between 2 values, Φ^{1}(H(P_{t}1)) and Φ^{1}(H(P_{t})). However, the reported upper and lower bounds of the 95% bootstrapped confidence interval gives indication of the variance of the estimates.
Since the inference procedure in both cases, augmented OLS and maximum likelihood, occurs in two stages (first the empirical distribution of the endogenous regressor is computed and then used in constructing the likelihood function), the standard errors are not correct. Therefore, in both cases, the standard errors and the confidence intervals are obtained based on the sampling distributions resulted from bootstrapping. Since the distribution of the bootstrapped parameters is highly skewed, we report the percentile confidence intervals. Moreover, the variancecovariance matrix is also computed based on the bootstrapped parameters, and not based on the Hessian.
The formula
argument follows a two part notation:
A twosided formula describing the model (e.g. y ~ X1 + X2 + P
) to be estimated and a
second righthand side part in which the endogenous regressors and their distributional
assumptions are indicated (e.g. continuous(P)
). These two parts are separated by a single vertical bar (
).
In the second part, the special functions continuous
, discrete
, or a combination
of both, are used to indicate the endogenous regressors and their respective distribution.
Both functions use the ...
parameter in which the respective endogenous regressors is specified.
Note that no argument to continuous
or discrete
is to be supplied as character
but as symbols without quotation marks.
See the example section for illustrations on how to specify the formula
parameter.
For all cases, an object of classes rendo.copula.correction
, rendo.boots
, and rendo.base
is returned
which is a list and contains the following components:
formula 
The formula given to specify the fitted model. 
terms 
The terms object used for model fitting. 
model 
The model.frame used for model fitting. 
coefficients 
A named vector of all coefficients resulting from model fitting. 
names.main.coefs 
a vector specifying which coefficients are from the model. For internal usage. 
names.vars.continuous 
The names of the continuous endogenous regressors. 
names.vars.discrete 
The names of the discrete endogenous regressors. 
fitted.values 
Fitted values at the found solution. 
residuals 
The residuals at the found solution. 
boots.params 
The bootstrapped coefficients. 
For the case of a single continuous endogenous regressor, the returned object further contains the following components:
start.params 
A named vector with the initial set of parameters used to optimize the loglikelihood function. 
res.optimx 
The result object returned by the function 
For all other cases, the returned object further contains the following component:
res.lm.real.data 
The linear model fitted on the original data together with generated p.star data. 
The function summary
can be used to obtain and print a summary of the results.
Depending on the returned object, the generic accessor functions coefficients
, fitted.values
,
residuals
, vcov
, logLik
, AIC
, BIC
, and nobs
are available.
Park, S. and Gupta, S., (2012), "Handling Endogenous Regressors by Joint Estimation Using Copulas", Marketing Science, 31(4), 56786.
Qian, Y., Koschmann, A., and Xie, H. (2024). "A Practical Guide to Endogeneity Correction Using Copulas". National Bureau of Economic Research, w32231.
Epanechnikov V (1969). "Nonparametric Estimation of a Multidimensional Probability Density." Teoriya veroyatnostei i ee primeneniya, 14(1), 156–161.
Silverman B (1986). "Density Estimation for Statistics and Data Analysis". CRC Monographs on Statistics and Applied Probability. London: Chapman & Hall.
Petrin A, Train K (2010). "A Control Function Approach to Endogeneity in Consumer Choice Models." Journal of Marketing Research, 47(1), 3–13.
summary
for how fitted models are summarized
vcov
for how the variancecovariance matrix is derived
confint
for how confidence intervals are derived
optimx
for possible elements of parameter optimx.arg
data("dataCopCont")
data("dataCopCont2")
data("dataCopDis")
data("dataCopDis2")
data("dataCopDisCont")
## Not run:
# Single continuous: loglikelihood optimization
c1 < copulaCorrection(y~X1+X2+Pcontinuous(P), num.boots=10, data=dataCopCont)
# same as above, with start.parameters and number of bootstrappings
c1 < copulaCorrection(y~X1+X2+Pcontinuous(P), num.boots=10, data=dataCopCont,
start.params = c("(Intercept)"=1, X1=1, X2=2, P=1))
# All following examples fit linear model with Gaussian copulas
# 2 continuous endogenous regressors
c2 < copulaCorrection(y~X1+X2+P1+P2continuous(P1, P2),
num.boots=10, data=dataCopCont2)
# same as above
c2 < copulaCorrection(y~X1+X2+P1+P2continuous(P1)+continuous(P2),
num.boots=10, data=dataCopCont2)
# single discrete endogenous regressor
d1 < copulaCorrection(y~X1+X2+Pdiscrete(P), num.boots=10, data=dataCopDis)
# two discrete endogenous regressor
d2 < copulaCorrection(y~X1+X2+P1+P2discrete(P1)+discrete(P2),
num.boots=10, data=dataCopDis2)
# same as above but less bootstrap runs
d2 < copulaCorrection(y~X1+X2+P1+P2discrete(P1, P2), num.boots = 10,
data=dataCopDis2)
# single discrete, single continuous
cd < copulaCorrection(y~X1+X2+P1+P2discrete(P1)+continuous(P2),
num.boots=10, data=dataCopDisCont)
# For single continuous only: use own optimization settings (see optimx())
# set maximum number of iterations to 50'000
res.c1 < copulaCorrection(y~X1+X2+Pcontinuous(P),
optimx.args = list(itnmax = 50000),
num.boots=10, data=dataCopCont)
# print detailed tracing information on progress
res.c1 < copulaCorrection(y~X1+X2+Pcontinuous(P),
optimx.args = list(control = list(trace = 6)),
num.boots=10, data=dataCopCont)
# use method LBFGSB instead of NelderMead and print report every 50 iterations
res.c1 < copulaCorrection(y~X1+X2+Pcontinuous(P),
optimx.args = list(method = "LBFGSB",
control=list(trace = 2, REPORT=50)),
num.boots=10, data=dataCopCont)
# For coef(), the parameter "complete" determines if only the
# main model parameters or also the auxiliary coefficients are returned
c1.all.coefs < coef(res.c1) # also returns rho and sigma
# same as above
c1.all.coefs < coef(res.c1, complete = TRUE)
# only main model coefs
c1.main.coefs < coef(res.c1, complete = FALSE)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.