# copulaCorrection: Fitting Linear Models Endogenous Regressors using Gaussian... In REndo: Fitting Linear Models with Endogenous Regressors using Latent Instrumental Variables

## Description

Fits linear models with continuous or discrete endogenous regressors (or a mixture of both) using Gaussian copulas, as presented in Park and Gupta (2012). This is a statistical technique to address the endogeneity problem where no external instrumental variables are needed. The important assumption of the model is that the endogenous variables should NOT be normally distributed, if continuous, preferably with a skewed distribution.

## Usage

 `1` ```copulaCorrection(formula, data, num.boots = 1000, verbose = TRUE, ...) ```

## Arguments

 `formula` A symbolic description of the model to be fitted. See the "Details" section for the exact notation. `data` A data.frame containing the data of all parts specified in the formula parameter. `num.boots` Number of bootstrapping iterations. Defaults to 1000. `verbose` Show details about the running of the function. `...` Arguments for the log-likelihood optimization function in the case of a single continuous endogenous regressor. Ignored with a warning otherwise. start.paramsA named vector containing a set of parameters to use in the first optimization iteration. The names have to correspond exactly to the names of the components specified in the `formula` parameter. If not provided, a linear model is fitted to derive them. optimx.argsA named list of arguments which are passed to `optimx`. This allows users to tweak optimization settings to their liking.

## Details

#### Method

The underlying idea of the joint estimation method is that using information contained in the observed data, one selects marginal distributions for the endogenous regressor and the structural error term, respectively. Then, the copula model enables the construction of a flexible multivariate joint distribution allowing a wide range of correlations between the two marginals.

Consider the model:

Yt01Pt2Xtt

where t=1,..,T indexes either time or cross-sectional units, Yt is a 1x1 response variable, Xt is a kxn exogenous regressor, Pt is a kx1 continuous endogenous regressor, εt is a normally distributed structural error term with mean zero and E(ε2)=σε2, α and β are model parameters.

The marginal distribution of the endogenous regressor Pt is obtained using the Epanechnikov kernel density estimator (Epanechnikov, 1969), as below:

ĥ(p)=1/(T·b) ∑(K·((p-Pt)/b))

where Pt is the endogenous regressor, K(x)=0.75·(1-x2)·I(|x|<=1) and the bandwidth b is the one proposed by Silverman (1986), and is equal to b=0.9·T-1.5·min(s, IQR/1.34). IQR is the interquartile range while s is the data sample standard deviation and T is the number of time periods observed in the data. After obtaining the joint distribution of the error term and the continuous endogenous regressor, the model parameters are estimated using maximum likelihood estimation.

The additional parameters used during model fitting and printed in `summary` hence are:

`rho`

The correlation between the endogenous regressor and the error.

`sigma`

The variance of the model's error.

With more than one continuous endogenous regressor or an endogenous discrete regressor, an alternative approach to the estimation using Gaussian copula should be applied. This approach is similar to the control function approach (Petrin and Train, 2010). The core idea is to apply OLS estimation on the original set of explanatory variables in the model equation above, plus an additional regressor Pt*=Φ-1(H(Pt)). Here, H(Pt) is the marginal distribution of the endogenous regressor P. Including this regressor solves the correlation between the endogenous regressor and the structural error, ε, OLS providing consistent parameter estimates. Due to identification problems, the discrete endogenous regressor cannot have a binomial distribution.

Hence, only in the case of a single continuous endogenous regressor maximum likelihood estimation is used. In all other cases, augmented OLS based on Gaussian copula is applied. This includes cases of multiple endogenous regressors of both discrete and continuous distributions.

In the case of discrete endogenous regressors, a random seed needs to be assigned because the marginal distribution function of the endogenous regressor is a step function in this case. This means that the value of P* lies between 2 values, Φ-1(H(Pt-1)) and Φ-1(H(Pt)). However, the reported upper and lower bounds of the 95% bootstrapped confidence interval gives indication of the variance of the estimates.

Since the inference procedure in both cases, augmented OLS and maximum likelihood, occurs in two stages (first the empirical distribution of the endogenous regressor is computed and then used in constructing the likelihood function), the standard errors are not correct. Therefore, in both cases, the standard errors and the confidence intervals are obtained based on the sampling distributions resulted from bootstrapping. Since the distribution of the bootstrapped parameters is highly skewed, we report the percentile confidence intervals. Moreover, the variance-covariance matrix is also computed based on the bootstrapped parameters, and not based on the Hessian.

#### Formula parameter

The `formula` argument follows a two part notation:

A two-sided formula describing the model (e.g. `y ~ X1 + X2 + P`) to be estimated and a second right-hand side part in which the endogenous regressors and their distributional assumptions are indicated (e.g. `continuous(P)`). These two parts are separated by a single vertical bar (`|`). In the second part, the special functions `continuous`, `discrete`, or a combination of both, are used to indicate the endogenous regressors and their respective distribution. Both functions use the `...` parameter in which the respective endogenous regressors is specified.

Note that no argument to `continuous` or `discrete` is to be supplied as character but as symbols without quotation marks.

See the example section for illustrations on how to specify the `formula` parameter.

## Value

For all cases, an object of classes `rendo.copula.correction`, `rendo.boots`, and `rendo.base` is returned which is a list and contains the following components:

 `formula` The formula given to specify the fitted model. `terms` The terms object used for model fitting. `model` The model.frame used for model fitting. `coefficients` A named vector of all coefficients resulting from model fitting. `names.main.coefs` a vector specifying which coefficients are from the model. For internal usage. `names.vars.continuous` The names of the continuous endogenous regressors. `names.vars.discrete` The names of the discrete endogenous regressors. `fitted.values` Fitted values at the found solution. `residuals` The residuals at the found solution. `boots.params` The bootstrapped coefficients.

For the case of a single continuous endogenous regressor, the returned object further contains the following components:

 `start.params` A named vector with the initial set of parameters used to optimize the log-likelihood function. `res.optimx` The result object returned by the function `optimx` after optimizing the log-likelihood function.

For all other cases, the returned object further contains the following component:

 `res.lm.real.data` The linear model fitted on the original data together with generated p.star data.

The function `summary` can be used to obtain and print a summary of the results. Depending on the returned object, the generic accessor functions `coefficients`, `fitted.values`, `residuals`, `vcov`, `logLik`, `AIC`, `BIC`, and `nobs` are available.

## References

Park, S. and Gupta, S., (2012), "Handling Endogenous Regressors by Joint Estimation Using Copulas", Marketing Science, 31(4), 567-86.

Epanechnikov V (1969). "Nonparametric Estimation of a Multidimensional Probability Density." Teoriya veroyatnostei i ee primeneniya, 14(1), 156–161. Silverman B (1986). "Density Estimation for Statistics and Data Analysis". CRC Monographs on Statistics and Applied Probability. London: Chapman & Hall.

Petrin A, Train K (2010). "A Control Function Approach to Endogeneity in Consumer Choice Models." Journal of Marketing Research, 47(1), 3–13.

`summary` for how fitted models are summarized

`vcov` for how the variance-covariance matrix is derived

`confint` for how confidence intervals are derived

`optimx` for possible elements of parameter `optimx.arg`

## Examples

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66``` ```data("dataCopCont") data("dataCopCont2") data("dataCopDis") data("dataCopDis2") data("dataCopDisCont") ## Not run: # Single continuous: log-likelihood optimization c1 <- copulaCorrection(y~X1+X2+P|continuous(P), num.boots=10, data=dataCopCont) # same as above, with start.parameters and number of bootstrappings c1 <- copulaCorrection(y~X1+X2+P|continuous(P), num.boots=10, data=dataCopCont, start.params = c("(Intercept)"=1, X1=1, X2=-2, P=-1)) # All following examples fit linear model with Gaussian copulas # 2 continuous endogenous regressors c2 <- copulaCorrection(y~X1+X2+P1+P2|continuous(P1, P2), num.boots=10, data=dataCopCont2) # same as above c2 <- copulaCorrection(y~X1+X2+P1+P2|continuous(P1)+continuous(P2), num.boots=10, data=dataCopCont2) # single discrete endogenous regressor d1 <- copulaCorrection(y~X1+X2+P|discrete(P), num.boots=10, data=dataCopDis) # two discrete endogenous regressor d2 <- copulaCorrection(y~X1+X2+P1+P2|discrete(P1)+discrete(P2), num.boots=10, data=dataCopDis2) # same as above but less bootstrap runs d2 <- copulaCorrection(y~X1+X2+P1+P2|discrete(P1, P2), num.boots = 10, data=dataCopDis2) # single discrete, single continuous cd <- copulaCorrection(y~X1+X2+P1+P2|discrete(P1)+continuous(P2), num.boots=10, data=dataCopDisCont) # For single continuous only: use own optimization settings (see optimx()) # set maximum number of iterations to 50'000 res.c1 <- copulaCorrection(y~X1+X2+P|continuous(P), optimx.args = list(itnmax = 50000), num.boots=10, data=dataCopCont) # print detailed tracing information on progress res.c1 <- copulaCorrection(y~X1+X2+P|continuous(P), optimx.args = list(control = list(trace = 6)), num.boots=10, data=dataCopCont) # use method L-BFGS-B instead of Nelder-Mead and print report every 50 iterations res.c1 <- copulaCorrection(y~X1+X2+P|continuous(P), optimx.args = list(method = "L-BFGS-B", control=list(trace = 2, REPORT=50)), num.boots=10, data=dataCopCont) # For coef(), the parameter "complete" determines if only the # main model parameters or also the auxiliary coefficients are returned c1.all.coefs <- coef(res.c1) # also returns rho and sigma # same as above c1.all.coefs <- coef(res.c1, complete = TRUE) # only main model coefs c1.main.coefs <- coef(res.c1, complete = FALSE) ## End(Not run) ```

REndo documentation built on Sept. 5, 2021, 5:37 p.m.