LinCDE.boost: LinCDE.boost

View source: R/LinCDE.boost.R

LinCDE.boost    R Documentation

LinCDE.boost

Description

This function implements LinCDE boosting: a boosting algorithm for conditional density estimation that uses shallow LinCDE trees as base learners.

Usage

LinCDE.boost(
  y,
  X = NULL,
  splitPoint = 20,
  basis = "nsTransform",
  splineDf = 10,
  minY = NULL,
  maxY = NULL,
  numberBin = 40,
  df = 4,
  penalty = NULL,
  prior = "Gaussian",
  depth = 1,
  n.trees = 100,
  shrinkage = 0.1,
  terminalSize = 20,
  alpha = 0.2,
  subsample = 1,
  centering = FALSE,
  centeringMethod = "randomForest",
  verbose = TRUE,
  ...
)

Arguments

y

response vector, of length nobs.

X

input matrix, of dimension nobs x nvars; each row represents an observation vector.

splitPoint

either a list of candidate splits of length nvars, or a scalar/vector giving the number of candidate splits. If splitPoint is a list, each element is a vector of one variable's candidate splits (including the left and right endpoints), and the elements should be ordered the same as X's columns. If splitPoint gives numbers of candidate splits, use a scalar when all variables share the same number and a vector of length nvars when the numbers differ; each variable's range is then divided into splitPoint-1 intervals containing approximately the same number of observations. If a variable has fewer unique values than the desired number of intervals, split intervals corresponding to its unique values are created. The minimal accepted splitPoint is 3. Default is 20.
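
The list form of splitPoint can be illustrated with a short base R sketch. The helper candidate_splits is hypothetical (it is not part of LinCDE) and only mimics the equal-frequency rule described above by using empirical quantiles; the package's own handling of ties and of variables with few unique values may differ.

```r
# Hypothetical helper (not part of LinCDE): candidate splits placed at
# approximately equal-frequency quantiles, endpoints included.
candidate_splits <- function(x, n_points = 20) {
  stopifnot(n_points >= 3)                     # the page states a minimum of 3
  probs <- seq(0, 1, length.out = n_points)    # n_points points, n_points - 1 intervals
  unique(as.numeric(quantile(x, probs = probs)))  # drop duplicates caused by ties
}

set.seed(1)
X <- matrix(runif(200), nrow = 100, ncol = 2)

# One vector of candidate splits per column, ordered like the columns of X.
split_list <- lapply(seq_len(ncol(X)),
                     function(j) candidate_splits(X[, j], n_points = 5))
```

A list built this way has one element per column of X, which is the shape the list form of splitPoint describes.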

basis

a character string or a function specifying the sufficient statistics, i.e., the spline basis. For basis = "Gaussian", y and y^2 are used. For basis = "nsTransform", transformed natural cubic splines are used. If basis is a function, it should take a vector of response values and return a basis matrix in which each row corresponds to a response value and each column to a basis function. Default is "nsTransform".
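
A function-valued basis only needs to map a response vector to a matrix with one row per response value. The sketch below reproduces the two sufficient statistics named for basis = "Gaussian"; the name gaussian_stats is made up for illustration, and how splineDf interacts with a user-supplied function is not stated on this page.

```r
# Hypothetical basis function: one row per response value,
# one column per basis function (here y and y^2).
gaussian_stats <- function(y) {
  cbind(y = y, y2 = y^2)
}

B <- gaussian_stats(c(-1, 0, 2))
print(dim(B))  # rows = response values, columns = basis functions
```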

splineDf

the number of sufficient statistics/spline basis functions. If basis = "Gaussian", splineDf is set to 2. Default is 10.

minY

the user-provided left end of the response range. If centering is TRUE, minY is ignored. Default is NULL.

maxY

the user-provided right end of the response range. If centering is TRUE, maxY is ignored. Default is NULL.

numberBin

the number of bins for response discretization. Default is 40. The response range is divided into numberBin equal-width bins.
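
As a rough illustration of equal-width discretization, the base R sketch below builds the bin edges, the bin mid-points, and the per-bin counts for a simulated response. It assumes the range is taken from the observed minimum and maximum; variable names are illustrative and this is not the package's internal code.

```r
set.seed(1)
y <- rnorm(500)
number_bin <- 40

# Equal-width bin edges over the observed response range.
bin_width <- (max(y) - min(y)) / number_bin
breaks <- min(y) + (0:number_bin) * bin_width

# Mid-point of each bin.
mid_points <- (breaks[-1] + breaks[-length(breaks)]) / 2

# Bin index of each response; all.inside = TRUE keeps the maximum in the last bin.
bin <- findInterval(y, breaks, all.inside = TRUE)
counts <- tabulate(bin, nbins = number_bin)
```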

df

approximate degrees of freedom, used to determine the ridge regularization parameter. If basis = "Gaussian", no penalization is applied. If df = splineDf, there is no ridge penalization. Default is 4.

penalty

vector of penalties applied to each sufficient statistic's coefficient. Default is NULL.

prior

a character string or a function specifying the initial carrier density. For prior = "uniform", the uniform distribution over the response range is used. For prior = "Gaussian", the Gaussian distribution with the marginal response mean and standard deviation is used. For prior = "LindseyMarginal", the marginal response density estimated by Lindsey's method from all responses is used. The argument prior can also be a homogeneous or heterogeneous conditional density function. Such a function should take a covariate matrix X and a response vector y, and return a vector of conditional densities f(yi | Xi). See the LinCDE vignette for examples. Default is "Gaussian".
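
A function-valued prior takes (X, y) and returns one density value per observation. The two sketches below use made-up Gaussian parameters purely for illustration: the homogeneous version ignores X, while the heterogeneous version lets the mean depend on the first covariate.

```r
set.seed(1)
n <- 50
X <- matrix(runif(n * 2), nrow = n)
y <- rnorm(n, mean = 2 * X[, 1])

# Homogeneous prior: the same density for every covariate value.
# The mean and sd here are illustrative assumptions.
homogeneous_prior <- function(X, y) {
  dnorm(y, mean = 0, sd = 2)
}

# Heterogeneous prior: the density's location shifts with the first covariate.
heterogeneous_prior <- function(X, y) {
  dnorm(y, mean = 2 * X[, 1], sd = 1)
}

dens_hom <- homogeneous_prior(X, y)
dens_het <- heterogeneous_prior(X, y)
```

Both functions return a vector of length nobs, matching the f(yi | Xi) contract described above.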

depth

the number of splits of each LinCDE tree. The number of terminal nodes is depth + 1. If depth = 1, an additive model is fitted. Default is 1.

n.trees

the number of trees to fit. Default is 100.

shrinkage

the shrinkage parameter applied to each tree in the expansion, value in (0,1]. Default is 0.1.

terminalSize

the minimum number of observations in a terminal node. Default is 20.

alpha

a hyperparameter in (0, 1] that controls early stopping of the boosting. A smaller alpha is more likely to induce early stopping. If alpha = 1, no early stopping is conducted. Default is 0.2.

subsample

subsample ratio of the training samples in (0,1]. Default is 1.

centering

a logical value. If TRUE, a conditional mean model is fitted first, and LinCDE boosting is applied to the residuals. Centering is recommended for responses whose conditional support varies wildly. See the LinCDE vignette for examples. Default is FALSE.

centeringMethod

a character string or a function specifying the conditional mean estimator. If centeringMethod = "linearRegression", a linear regression model is fitted to the response. If centeringMethod = "randomForest", a random forest model is fitted. Hyperparameters used by the centering method can be passed directly to LinCDE.boost, such as nodesize = 10 for centeringMethod = "randomForest". If centeringMethod is a function, the call centeringMethod(y, X) should return a conditional mean model with a predict function. Default is "randomForest". Applies only when centering = TRUE.
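
A function-valued centeringMethod could look like the sketch below: a least-squares fit wrapped in a small S3 class with its own predict method. The class name linearCentering and the function linear_centering are hypothetical, and the sketch assumes predict receives the covariate matrix as its second argument; this page does not state how LinCDE.boost invokes predict.

```r
# Hypothetical centering method: least-squares conditional mean model.
linear_centering <- function(y, X) {
  design <- cbind(1, as.matrix(X))            # add an intercept column
  fit <- lm.fit(x = design, y = y)
  structure(list(coefficients = fit$coefficients),
            class = "linearCentering")
}

# predict method for the hypothetical class; accepts a covariate matrix.
predict.linearCentering <- function(object, newdata, ...) {
  as.numeric(cbind(1, as.matrix(newdata)) %*% object$coefficients)
}

set.seed(1)
X <- matrix(runif(100), nrow = 50)
y <- 1 + 2 * X[, 1] - X[, 2]                  # noiseless linear response

model <- linear_centering(y, X)
fitted_mean <- predict(model, X)
```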

verbose

a logical value. If TRUE, progress and performance are printed. Default is TRUE.

...

other parameters, such as hyperparameters to be passed to the conditional mean estimator.

Value

This function returns a LinCDE object consisting of a list of values.

  • trees: a list of LinCDE trees.

  • importanceScore: a named vector measuring the contribution of each covariate to the objective.

  • splitMidPointY: the vector of discretized bins' mid-points.

  • z: the spline basis matrix.

  • zTransformMatrix: the transformation matrix (of dimension splineDf x splineDf) multiplied by the standard natural cubic spline basis if basis = "nsTransform".

  • prior: the prior function. The call prior(X, Y) should return a vector of prior conditional densities f(yi | Xi).

  • basis/depth/shrinkage/centering/centeringMethod: values inherited from the input arguments. If centering is FALSE, no centeringMethod is returned.

  • centeringModel: a centering model with a predict function. If centering is FALSE, no centeringModel is returned.
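
The following sketch shows one way the function might be called and the returned components inspected. It uses only arguments and components listed on this page, on simulated data whose design is an assumption made for illustration. The call is guarded so that the script still runs when the LinCDE package is not installed.

```r
set.seed(1)
n <- 200
X <- matrix(runif(n * 2), nrow = n,
            dimnames = list(NULL, c("x1", "x2")))
# Simulated response: the mean depends on x1 and the spread on x2.
y <- rnorm(n, mean = X[, 1], sd = 0.5 + X[, 2])

if (requireNamespace("LinCDE", quietly = TRUE)) {
  model <- LinCDE::LinCDE.boost(y = y, X = X,
                                depth = 2, n.trees = 20,
                                shrinkage = 0.1, verbose = FALSE)
  print(names(model))            # components of the returned list
  print(model$importanceScore)   # contribution of each covariate
  print(length(model$trees))     # number of fitted trees
}
```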


ZijunGao/LinCDE documentation built on Jan. 2, 2023, 11:14 p.m.