semiBRM: Semiparametric binary response model: Parameter estimation
In henrykye/semiBRM: Semiparametric Binary Response Models in R


This implements quasi maximum likelihood estimation for parameters in semiparametric binary response models.

## Default S3 method:
semiBRM(x, y, r = 1/6.01, tau = 0.025, ...)

## S3 method for class 'formula'
semiBRM(formula, data, r = 1/6.01, tau = 0.025, ...)

`x`	a numeric matrix of explanatory variables.
`y`	an integer, numeric, or factor vector of binary response outcomes, taking the values 1 or 0 only.
`r`	a numeric value that controls the size of Silverman's rule-of-thumb bandwidth, `h = sd(x)*N^(-r)`.
`tau`	a numeric value giving the cut-off level for trimming in `TrimmingIndicator(X,f)`, which assigns 1L to the values in `X` lying between the `tau*100`-th and `(1-tau)*100`-th percentiles and 0L to those outside this range.
`...`	further arguments passed to `maxLik::maxLik`, such as `control`.
`formula`	a formula describing the model to be fitted.
`data`	a data.frame containing variables in `formula`.

This is the main function in the package; it performs parameter estimation for semiparametric binary response models. It accepts either a matrix x and a vector y, or a formula and data (see Examples below). The defaults are reasonable, so supplying only x and y, or only formula and data, is enough in many cases.

Currently, only a single index model for a binary outcome y is allowed. Importantly, the first explanatory variable should be one whose coefficient is strongly believed to be different from zero, because the coefficients of the other variables are rescaled by that of the first explanatory variable in estimation. This rescaling is unavoidable in semiparametric approaches, but it has no impact on the estimation of the conditional probability, Pr{y=1|x}. The parameter estimates are obtained by quasi maximum likelihood estimation using maxLik::maxLik with the BFGS method.
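The effect of the rescaling can be illustrated with a minimal base-R sketch. This is not package code: the probit link, the intercept `alpha`, and all variable names are assumptions made for illustration. Dividing the slopes by the first one changes the index only by a scale and location shift, which an unknown link function can absorb, so the conditional probability is unchanged.

```r
set.seed(1)
N <- 200L
X <- cbind(X1 = rnorm(N), X2 = rnorm(N), X3 = rnorm(N))
beta  <- c(2, 2, -1)   # slopes (illustrative values)
alpha <- -1            # intercept (illustrative value)

# original index and conditional probability under an assumed probit link
V <- as.vector(X %*% beta) + alpha
p_original <- pnorm(V)

# normalized index: coefficient of the first variable fixed at 1
beta_norm <- beta / beta[1]
index <- as.vector(X %*% beta_norm)

# the link function absorbs the scale and the intercept
G <- function(v) pnorm(beta[1] * v + alpha)
p_normalized <- G(index)

# largest discrepancy; should be zero up to floating-point error
max(abs(p_original - p_normalized))
```

Only `beta_norm[-1]` would be reported as estimates in such a setup, which is why the first variable needs a nonzero coefficient.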

The theory in Klein and Spady (1993) requires a well-defined trimming indicator on the index that trims out boundary points to ensure a compact support. To obtain it, the function runs estimation twice, as recommended in the paper: a first 'pilot' estimation followed by the primary one. In the pilot estimation, the initial trimming indicator is built from the original explanatory variables x and drops every observation that is near the boundary in any of them, where observations lying outside the [trimming.level*100, (1-trimming.level)*100] percentile range count as near the boundary; the default is trimming.level = 0.025, set through the argument tau. The pilot coefficient estimates are then used to form the index, the linear combination of the explanatory variables weighted by those estimates. Finally, the trimming indicator for the primary estimation is generated from this index at the same trimming.level and passed to the log-likelihood function for parameter estimation.
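The two trimming steps can be sketched in base R as follows. The helper `trim_indicator` is hypothetical and only mimics the described behaviour of TrimmingIndicator; the inclusive treatment of the percentile bounds and the pilot coefficients `coef_pilot` are assumptions, not output of the package.

```r
# Hypothetical helper: 1L if a value lies within the [tau, 1 - tau] quantile range
trim_indicator <- function(v, tau = 0.025) {
  q <- quantile(v, probs = c(tau, 1 - tau), names = FALSE)
  as.integer(v >= q[1] & v <= q[2])
}

set.seed(1)
N <- 500L
X <- cbind(X1 = rnorm(N), X2 = rnorm(N), X3 = rnorm(N)^2)

# Pilot step: keep an observation only if it is interior in every column of X
trim_pilot <- as.integer(rowSums(apply(X, 2, trim_indicator)) == ncol(X))

# Made-up stand-in for the coefficient estimates of the pilot estimation
coef_pilot <- c(1, 0.9, -0.6)
index <- as.vector(X %*% coef_pilot)

# Primary step: trim on the one-dimensional index instead
trim_primary <- trim_indicator(index)

# share of observations retained in each step
c(pilot = mean(trim_pilot), primary = mean(trim_primary))
```

Because the pilot step trims in every explanatory variable separately, it generally discards more observations than the primary step, which trims only along the index.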

Silverman's rule-of-thumb bandwidth is used for the Nadaraya-Watson estimator, which computes the conditional probabilities. The bandwidth size is controlled by r, with default value r = 1/6.01, which satisfies the conditions for consistency and asymptotic normality.
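As a rough illustration of how the bandwidth and the estimator fit together, here is a plain R sketch of a Gaussian-kernel Nadaraya-Watson estimator with bandwidth h = sd(v)*N^(-r). The functions `bandwidth` and `nw_gaussian` are hypothetical names, not the package's GaussianNadarayaWatsonEstimator, and details such as whether the package leaves out the own observation are not covered here.

```r
# Hypothetical helpers, not the package's implementation
bandwidth <- function(v, r = 1/6.01) sd(v) * length(v)^(-r)

nw_gaussian <- function(v, y, h = bandwidth(v)) {
  # for each evaluation point v0, average y with Gaussian kernel weights
  vapply(v, function(v0) {
    w <- dnorm((v - v0) / h)
    sum(w * y) / sum(w)
  }, numeric(1))
}

set.seed(1)
N <- 500L
v <- rnorm(N)                    # plays the role of the index
y <- as.integer(v >= rnorm(N))   # so that Pr(y = 1 | v) = pnorm(v)

h <- bandwidth(v)
p_hat <- nw_gaussian(v, y, h)

# mean absolute error against the true probability on interior points
interior <- abs(v) < 1.5
mean(abs(p_hat - pnorm(v))[interior])
```

A larger r gives a smaller bandwidth for the same sample, trading lower bias for higher variance in the estimated probabilities.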

The package deploys OpenMP here, parallelizing computation of the Nadaraya-Watson estimator over data points. The default value of the number of threads is parallel::detectCores()-1L. To change it manually, please use set_num_threads(x). Note that this affects all functions that employ the Nadaraya-Watson estimator in the package. If set to be 1L, multithreading will not be used.

An object of class 'semiBRM', similar to that of 'maxLik', with elements:

estimate: estimated parameter values.
log.likelihood: log likelihood at the estimates.
gradient: a gradient vector at the estimates.
hessian: a hessian matrix at the estimates.
code: return code as detailed in maxLik::maxLik.
message: a short message describing return code.
iter: the number of iterations performed for numerical optimization.
control: the optimization control parameters as detailed in maxLik::maxLik.
model: the model frame.
r: the bandwidth parameter for Silverman's rule-of-thumb bandwidth.
trimming.level: the trimming cutoff level, which is tau in function argument.
call: the matched call.
formula: the formula entered for estimation.

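Klein RW, Spady RH (1993). "An Efficient Semiparametric Estimator for Binary Response Models." Econometrica, 61(2), 387-421.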

GaussianNadarayaWatsonEstimator, TrimmingIndicator

# data generating process
N <- 500L
X1 <- rnorm(N)
X2 <- (X1 + 2*rnorm(N))/sqrt(5) + 1
X3 <- rnorm(N)^2/sqrt(2)
X <- cbind(X1, X2, X3)
beta <- c(2, 2, -1, -1)
V <- as.vector(cbind(X, 1)%*%beta)
Y <- ifelse(V >= rnorm(N), 1L, 0L)

# identifiable set of parameters: slopes of X2 and X3 divided by the slope of X1
ests_true <- c(1, -.5)

# using matrix/vector
qmle0 <- semiBRM(x = X, y = Y, control = list(iterlim = 50))

# using formula and data
data <- data.frame(Y, X1, X2, X3)
qmle1 <- semiBRM(Y ~ X1 + X2 + X3, data = data, control = list(iterlim = 50))