knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  tidy = FALSE
)

In this vignette, we demonstrate how to use the R package LinCDE for conditional density estimation.

Conditional density estimation and LinCDE

The density of a continuous random variable characterizes its relative likelihood of taking a specific value. In statistics, many questions are essentially questions about density characteristics, e.g., location, variance, modality, and skewness. A useful collection of methods has been proposed for density estimation, such as kernel density estimation, Gaussian mixture models, local likelihood, and log-spline methods. Here we use Lindsey's method.
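
To fix ideas, here is a minimal sketch of Lindsey's method in the unconditional case (an illustration only, not part of the LinCDE package): discretize the response into bins, treat the bin counts as Poisson, and fit a log-linear model in a spline basis; the normalized fitted means estimate the density.

# minimal sketch of (unconditional) Lindsey's method; not part of the package
lindsey = function(y, nBins = 30, df = 5){
  breaks = seq(min(y), max(y), length.out = nBins + 1)
  mids = (breaks[-1] + breaks[-(nBins + 1)]) / 2 # bin midpoints
  counts = as.vector(table(cut(y, breaks, include.lowest = TRUE)))
  fit = glm(counts ~ splines::ns(mids, df = df), family = poisson)
  dens = exp(predict(fit)) # fitted counts, proportional to the density
  dens / sum(dens * diff(breaks)) # normalize to integrate to one
}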

In practice, the density of interest often depends on other variables. For instance, the mean height increases with the age of children, the density of food expenditure has higher variance in a higher-income community than in a lower-income community, and the density of salary is bi-modal among lawyers but unimodal among firefighters. The heterogeneity in conditional densities often carries meaningful messages.

We propose LinCDE.boost to estimate conditional densities: a boosting algorithm based on LinCDE trees. A LinCDE tree partitions the covariate space into subregions with approximately locally homogeneous densities, employs Lindsey's method for density estimation within each subregion, and aggregates the densities from the subregions to form the estimated conditional density. LinCDE boost grows a number of LinCDE trees in a forward-stagewise manner, where each tree fits a type of residual from the current estimate. For more details, please refer to our LinCDE paper.

A toy example

We illustrate the workflow of the LinCDE package via a toy example. Before we begin, let us install and attach the LinCDE package.

# devtools::install_github("ZijunGao/LinCDE")
library("LinCDE")

The toy conditional density is locally Gaussian. We consider $20$ covariates $X = [X_1, \ldots, X_{20}]^\top$. Given a covariate value, the response $y$ is Gaussian with mean $0.5 X_1 + X_1 X_2$ and standard deviation $0.5 + 0.25 X_2$.

# locally Gaussian design (LGD)
  # If y = NULL, density.LGD generates responses y given covariates X.
  # If y != NULL, density.LGD outputs conditional densities f(y | X).
density.LGD = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = (0.5 * X[, 1] + X[, 1] * X[, 2]), sd = (0.5 + 0.25 * X[, 2]))
    return(dens)
  }else{
    y = 0.5 * X[, 1] + X[, 1] * X[, 2] + rnorm(dim(X)[1], 0, 1) * (0.5 + 0.25 * X[, 2])
    return(y)
  }
}

We conduct conditional density estimation based on $1000$ training data points. Covariates $X_1$, $\ldots$, $X_{20}$ are generated independently uniformly from $[-1, 1]$. We also generate independent validation and test samples of size $1000$ for hyper-parameter tuning and performance evaluation, respectively.

set.seed(100)
# training data
n = 1000; d = 20; X = matrix(runif(n * d, -1, 1), ncol = d); colnames(X) = paste("X", seq(1,d), sep = "")
y.LGD = density.LGD(X)
# validation data
nVal = 1000; XVal = matrix(runif(nVal * d, -1, 1), ncol = d); colnames(XVal) = paste("X", seq(1,d), sep = "")
yVal.LGD = density.LGD(XVal)
# test data
nTest = 1000; XTest = matrix(runif(nTest * d, -1, 1), ncol = d); colnames(XTest) = paste("X", seq(1,d), sep = "")
yTest.LGD = density.LGD(XTest)

The lattice plot below visualizes the true conditional densities at $9$ particular landmark covariate values. According to the data generation mechanism, $X_1$ and $X_2$ influence the response's mean, $X_2$ also influences the response's variance, and other covariates are nuisance variables and have no influence on the response distribution. The landmark covariates differ only in $X_1$ and $X_2$.

# landmarks for visualization
XProfile = matrix(0, nrow = 3^2, ncol = d); colnames(XProfile) = paste("X", seq(1,d), sep = "")
X1 = X2 = c(-0.6,0,0.6); XProfile[, c(1,2)] = as.matrix(expand.grid(X1, X2))

# true conditional density plot
densityPlot(X = XProfile, trueDensity = density.LGD, minY = -3, maxY = 3, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))

Tuning

There are a few hyper-parameters critical to the performance of LinCDE boost. We tune the hyper-parameters on a separate validation dataset.

Primary parameters

The two primary parameters of interest are the number of iterations (n.trees) and the depth of the base learner LinCDE tree (depth).

# depth = 1
model.LGD.D1 = LinCDE.boost(X = X, y = y.LGD, depth = 1, n.trees = 100, verbose = FALSE)
predict.LGD.D1 = predict(model.LGD.D1, X = XVal, y = yVal.LGD, densityOnly = FALSE)
# depth = 2
model.LGD.D2 = LinCDE.boost(y = y.LGD, X = X, depth = 2, n.trees = 100, verbose = FALSE) 
predict.LGD.D2 = predict(model.LGD.D2, X = XVal, y = yVal.LGD, densityOnly = FALSE)
matplot(1:100,
        cbind(predict.LGD.D1$testLogLikelihoodHistory,
              predict.LGD.D2$testLogLikelihoodHistory),
        ylab = "log-likelihood", xlab = "n.trees", main = "LGD",
        type = "l", lty = 1, col = c("black", "blue"), lwd = 2)
legend("bottomright", legend = c("depth=1", "depth=2"), lty = 1, lwd = 3, col = c("black", "blue"))

Secondary parameters

There are a number of secondary hyper-parameters. We recommend starting with the default values and making changes only if certain issues are observed.

Standard boosting parameters

There are several hyper-parameters applicable to all boosting-based methods, such as the learning rate (shrinkage) and the minimal terminal node size (terminalSize). The same advice applies: keep the default values unless a specific issue arises.

Centering parameters

We discuss centering for LinCDE boost below; it aims to solve the "disjoint support" issue. The hyper-parameters used by the centering include centering (whether to center the responses) and centeringMethod (the method used to estimate the conditional means).
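
For example, a centered LinCDE boost fit might look as follows (a sketch reusing the toy data above; the centering and centeringMethod arguments reappear in the centering example later in this vignette):

model.centering.sketch = LinCDE.boost(X = X, y = y.LGD, centering = TRUE,
                                      centeringMethod = "linearRegression",
                                      n.trees = 50, verbose = FALSE)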

Estimation

In the above example, we tune all the primary and part of the secondary parameters on the separate validation dataset. We choose the hyper-parameters basis = "Gaussian", shrinkage = 0.02, depth = 3, n.trees = 300, and leave the other parameters at their default values.

We print the trained LinCDE model, which gives us the model formula, the crucial hyper-parameters, and the relative influences/importance scores (discussed in the importance evaluation section below). Note that the boosting procedure stops early, after $216 < 300$ iterations.

# tuned LinCDE boost model
model.LGD.final = LinCDE.boost(y = y.LGD, X = X, basis = "Gaussian", depth = 3, n.trees = 300, shrinkage = 0.02, verbose = FALSE)
print(model.LGD.final)

Prediction

With a LinCDE boost model, we can predict conditional densities of an independent test dataset via the function predict.

predict.LGD.final.simple = predict(model.LGD.final, X = XTest, y = yTest.LGD)

Note we have supplied both X and y --- the latter are the locations where we wish to evaluate the density, one per value of X.
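
Assuming the call above returns the vector of estimated conditional density values, the average test log-likelihood follows directly:

# average test log-likelihood from the predicted density values
mean(log(predict.LGD.final.simple))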

We can also use predict to produce conditional density curves. In this case we supply test covariates only (and no test responses). By default, predict computes conditional densities at the response grid points used in training. We also allow users to supply handcrafted grid points through splitPointYTest.

predict.LGD.final.density = predict(model.LGD.final, X = XTest)
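
For example, a handcrafted grid can be supplied as follows (a sketch, assuming splitPointYTest accepts a numeric vector of response grid points):

yGridTest = seq(-3, 3, length.out = 100) # handcrafted response grid
predict.LGD.grid = predict(model.LGD.final, X = XTest, splitPointYTest = yGridTest)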

Finally, it is possible to use predict to make both types of predictions: conditional density predictions at test covariate-response pairs and conditional density curves at test covariates. Simply set densityOnly = FALSE in predict. Check out the documentation of predict for the full list of returned values.

predict.LGD.final = predict(model.LGD.final, X = XTest, y = yTest.LGD, densityOnly = FALSE)

We evaluate the performance of LinCDE boost. On simulated data, we report the relative improvement in the test log-likelihood \begin{align} \frac{\ell_{\text{LinCDE boost}} - \ell_{\text{null}}}{\ell_{\text{oracle}} - \ell_{\text{null}}}. \end{align} Here the null model is the unconditional Gaussian distribution with the response's marginal mean and standard deviation, and the oracle denotes the true underlying conditional density. The criterion is analogous to the goodness-of-fit measure $R^2$ of linear regression. In this example, LinCDE boost achieves $80.9\%$ of the oracle's improvement over the null model.

# null model
model.LGD.null = LinCDE.boost(X = X, y = y.LGD, basis = "Gaussian", n.trees = 0)
predict.LGD.null = predict(model.LGD.null, X = XTest, y = yTest.LGD,  densityOnly = FALSE)
# oracle
oracle = mean(log(density.LGD(X = XTest, y = yTest.LGD)))
# relative improvement 
(relativeImprovement =   
  (predict.LGD.final$testLogLikelihood - predict.LGD.null$testLogLikelihood)/  
  (oracle - predict.LGD.null$testLogLikelihood))

When the true density is unknown, as on a real dataset, the relative improvement cannot be computed. However, we can still use the test log-likelihood to assess LinCDE boost's performance.
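
For instance, the prediction object computed above with densityOnly = FALSE carries the test log-likelihood:

predict.LGD.final$testLogLikelihood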

Visualization

We plot the estimated conditional densities against the truth at the above $9$ selected feature points. The estimated conditional densities are close to the truth.

# conditional density plot
densityPlot(X = XProfile, trueDensity = density.LGD, model = model.LGD.final, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))

Importance evaluation

To identify the influential covariates, we plot the importance score of each covariate. The importance score is defined as in random forests and gradient boosting: it is computed from the training data and records how much each covariate contributes to the improvement of the objective.

In the above example, the importance score barplot indicates that the first two covariates contribute the most to the improvement in the objective, which agrees with the underlying density model.

# relative influence plot
importance.LGD = summary(model.LGD.final, cBars = 8)
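
Assuming summary returns the scores it plots as a data frame of covariates and relative influences (as in gbm), the top entries can be inspected directly:

head(importance.LGD)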

More examples

The above example shows LinCDE boost's ability to capture location and scale changes. We add two more examples focusing on modality and skewness (asymmetry), respectively. We also include an example with the "disjoint support" issue.

Modality example

For modality, we generate locally Gaussian mixture responses if $X_2 \le 0.2$ and locally Gaussian responses if $X_2 > 0.2$. Meanwhile, we let the responses' locations depend on $X_1$. We follow the aforementioned workflow and set the hyper-parameters at shrinkage = 0.05, depth = 3, n.trees = 200, using a larger df (df = 8) to capture the bi-modality. The rest of the parameters are at their default values.

# data generation mechanism
density.modality = function(X, y = NULL){
  # If y != NULL, return conditional densities f(y | X); otherwise generate responses.
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = 0.25 * X[, 1], sd = 0.8)
    index = which(X[, 2] <= 0.2)
    dens[index] = 0.5 * dnorm(y[index], mean = 0.25 * X[index, 1] + 0.8, sd = 0.3) + 0.5 * dnorm(y[index], mean = 0.25 * X[index, 1] - 0.8, sd = 0.3)
    return(dens)
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, -0.8, 0.3) + (1 - groupIndex) * rnorm(n, 0.8, 0.3)
    y = 0.25 * X[, 1] + (X[, 2] <= 0.2) * y + (X[, 2] > 0.2) * rnorm(n, 0, 0.8)
    return(y)
  }
}
set.seed(1)
y.modality = density.modality(X)
model.modality = LinCDE.boost(y = y.modality, X = X, df = 8,
                                 depth = 3, n.trees = 200, shrinkage = 0.05, verbose = FALSE)

We compare the estimated conditional densities against the truth at the $9$ selected landmarks (left figure below). The estimated conditional densities are clearly bi-modal for $X_2 = -0.6$ and $X_2 = 0$. For $X_2 = 0.6$, the estimated conditional densities are approximately Gaussian with mild curvature in the middle.

We also plot the importance scores (right figure below). In this setting, $X_1$ and $X_2$ are the most influential covariates, consistent with the data generation mechanism.

densityPlot(X = XProfile, trueDensity = density.modality, model = model.modality, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))
importance.modality = summary(model.modality, cBars = 8)

Symmetry example

For skewness, we generate asymmetric Gaussian mixture responses: if $X_2 > 0$, the right mode is sharper; if $X_2 < 0$, the left mode is sharper. The larger $|X_2|$ is, the more asymmetric the conditional distribution. Meanwhile, we let the responses' locations depend on $X_1$. We follow the aforementioned workflow and set the hyper-parameters at shrinkage = 0.1, depth = 3, n.trees = 200, and df = 8. The remaining parameters are at their default values.

# data generation mechanism
density.skewness = function(X, y = NULL){
  # If y != NULL, return conditional densities f(y | X); otherwise generate responses.
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = 0.5 * dnorm(y, mean = 0.5 * X[, 1] + 0.8, sd = 0.5 + 0.45 * X[, 2]) +
      0.5 * dnorm(y, mean = 0.5 * X[, 1] - 0.8, sd = 0.5 - 0.45 * X[, 2])
    return(dens)
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, 0.8, 0.5 + 0.45 * X[, 2]) + (1 - groupIndex) * rnorm(n, -0.8, 0.5 - 0.45 * X[, 2]) + 0.5 * X[, 1]
    return(y)
  }
}

set.seed(1)
y.skewness = density.skewness(X)
model.skewness = LinCDE.boost(X = X, y = y.skewness, df = 8,
                                 depth = 3, n.trees = 200, shrinkage = 0.1, verbose = FALSE)

We compare the estimated conditional densities against the truth at $9$ landmarks (left figure below). The estimated conditional densities are right-skewed for $X_2 = 0.6$, left-skewed for $X_2 = -0.6$, and symmetric for $X_2 = 0$.

As for the importance scores (right figure below), $X_1$ and $X_2$ are correctly identified as important covariates.

densityPlot(X = XProfile, trueDensity = density.skewness, model = model.skewness, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))
importance.skewness = summary(model.skewness, cBars = 8)

Centering example

Centering aims to align the centers of the conditional densities in advance: we first estimate the conditional means and subtract the estimates from the responses, then apply LinCDE boost to the residuals to capture additional distributional structure (a manual sketch of this step appears after the data generation below). The centering step is helpful when the conditional densities at different covariate values differ dramatically in location, i.e., the "disjoint support" issue. For more details regarding LinCDE boost and centering, please refer to our LinCDE paper. In practice, we recommend creating scatter plots of the responses versus (a selected subset of) the covariates to detect the "disjoint support" issue.

In the following example, we generate locally Gaussian mixture responses if $X_2 \le 0.2$, and locally Gaussian responses if $X_2 > 0.2$. Meanwhile, we let the responses' supports differ dramatically as $X_1$ changes, and thus the "disjoint support" problem is present.

# data generation mechanism
density.centering = function(X, y = NULL){
  # If y != NULL, return conditional densities f(y | X); otherwise generate responses.
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = 4 * X[, 1], sd = 0.8)
    index = which(X[, 2] <= 0.2)
    dens[index] = (0.5 * dnorm(y[index], mean = 4 * X[index, 1] + 0.8, sd = 0.3) + 0.5 * dnorm(y[index], mean = 4 * X[index, 1] - 0.8, sd = 0.3))
    return(dens)
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, -0.8, 0.3) + (1 - groupIndex) * rnorm(n, 0.8, 0.3)
    y = 4 * X[, 1] + (X[, 2] <= 0.2) * y + (X[, 2] > 0.2) * rnorm(n, 0, 0.8)
    return(y)
  }
}

# generate data
set.seed(1)
y.centering = density.centering(X)
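
To make the centering step concrete, here is a manual sketch of the idea, using lm() as a stand-in conditional-mean model (an illustration only; LinCDE.boost performs an analogous step internally when centering = TRUE):

# manual centering sketch: estimate the conditional mean, subtract it,
# and density-estimate the residuals
dat.centering = data.frame(y = y.centering, X)
meanFit = lm(y ~ ., data = dat.centering)
residual.centering = y.centering - predict(meanFit)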

We follow the aforementioned workflow. For LinCDE boost without centering, we set the hyper-parameters at shrinkage = 0.02, depth = 2, n.trees = 200, and splineDf = 10 (the remaining parameters are at their default values). For LinCDE boost with centering, we additionally set df = 8 and use linear regression as the centering method by setting centering = TRUE and centeringMethod = "linearRegression". Check out the documentation of LinCDE.boost for the available centering methods.

# without centering
model.centering.no = LinCDE.boost(X = X,  y = y.centering, 
                                  depth = 2, n.trees = 200, shrinkage = 0.02, splineDf = 10, verbose = FALSE)
# with centering
model.centering.OLS = LinCDE.boost(X = X, y = y.centering, 
                                   df = 8, depth = 2, n.trees = 200, shrinkage = 0.02, 
                                   centering = TRUE, centeringMethod = "linearRegression", verbose = FALSE)

We compare LinCDE boost with and without centering. We plot the estimated conditional densities at $9$ landmark covariate values. LinCDE boost without centering does not capture the bi-modality structure. LinCDE boost with centering reflects the bi-modality structure for $X_2 = 0$ or $X_2 = -0.6$.

densityPlot(X = XProfile, trueDensity = density.centering, model = model.centering.no, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""), main = "Estimated and true conditional densities (without centering)")
densityPlot(X = XProfile, trueDensity = density.centering, model = model.centering.OLS, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""), main = "Estimated and true conditional densities (with centering)")

We also compare the relative influence plots. For LinCDE boost without centering, most of the effort is spent learning the influence of $X_1$ on the response's location. For LinCDE boost with centering, we stack the importances from centering (in red) and beyond centering (in blue): $X_1$ accounts for most of the importance in the centering, and $X_2$ is the most important covariate beyond centering. Currently, the decomposition of importances into centering and beyond centering is only available for centeringMethod = "linearRegression" or centeringMethod = "randomForest".

importance.centering.no = summary(model.centering.no, cBars = 8, main = "Without centering")
importance.centering.OLS = summary(model.centering.OLS, cBars = 8, main = "With centering")

California housing example

We read in the benchmark California Housing dataset. The dataset collects California housing prices and $8$ covariates from the $1990$ Census. More details are available on the dataset's website. We remove winsorized responses and divide the responses by $1000$. For time efficiency, we randomly select $1000$ observations as the training data.

cahousing = read.csv("./data/cahousing.csv", header = TRUE)
y.cahousing = cahousing[, 1]; cahousing = cahousing[y.cahousing < max(y.cahousing), ] # remove winsorized responses
y.cahousing = cahousing[, 1] / 1000
X = data.matrix(cahousing[, -1])
set.seed(318); indexTrain = sample(length(y.cahousing), 1000, replace = FALSE)
# log-transform selected positive covariates
X.transform = X
X.transform[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")] = log(X[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")])

As a start, we fit a simple LinCDE boost model with sufficient statistics $y$ and $y^2$. The model incorporates location and variance information, and the resulting conditional densities are locally Gaussian.

model.cahousing.Gaussian = LinCDE.boost(y = y.cahousing[indexTrain], X=X[indexTrain,], 
                                        basis = "Gaussian", depth=3, n.trees=150, 
                                        terminalSize=50, shrinkage = 0.05, minY = 0, maxY = 510, 
                                        verbose = FALSE)

Next we use $10$ transformed natural cubic splines with $6$ degrees of freedom to learn a more complicated LinCDE boost model.

model.cahousing.ns = LinCDE.boost(y=y.cahousing[indexTrain], X=X[indexTrain,], 
                                  basis = "nsTransform", depth=3, splineDf = 10, df = 6, n.trees=300, 
                                  terminalSize=50, shrinkage = 0.05, minY = 0, maxY = 510, 
                                  verbose = FALSE)

We provide the relative influence plots. The importances computed by the two LinCDE boost models are similar: MedInc (median income) is the dominant covariate followed by AveOccup (average occupancy), Longitude, and Latitude.

par(mar = c(4,6,4,4))
importance.cahousing.Gaussian = summary(model.cahousing.Gaussian, cBars = 8, main = "Gaussian basis")
importance.cahousing.ns = summary(model.cahousing.ns, cBars = 8, main = "Natural spline basis")

We plot the conditional density estimates for different (MedInc, AveOccup) pairs. Since we don't know the underlying truth, we plot the histograms of the samples with similar covariate values as our reference. (The similarity of samples to a target point is measured by the covariate Mahalanobis distance weighted by the covariate importances from the LinCDE boost model.)

For both LinCDE boost models, the conditional densities shift to the right as MedInc increases, and the spreads decrease as AveOccup grows in the middle and right columns. For the LinCDE boost model with the natural spline basis, the estimated conditional densities take diverse shapes beyond the Gaussian. In local regions (indicated in the panel subtitles) with median income $2.32$ ($20\%$ quantile) or $3.45$ ($50\%$ quantile), the conditional distributions appear right-skewed. In regions with median income $3.45$ ($50\%$ quantile) and average occupancy $2.36$ ($20\%$ quantile) or $2.84$ ($50\%$ quantile), the conditional distributions have flatter peaks than the Gaussian distribution.

# landmarks for visualization
X.MedInc = quantile(X[,"MedInc"],c(.2,.5,.8)) # 20%, 50%, 80% quantiles of MedInc
X.AveOccup = quantile(X[,"AveOccup"],c(.2,.5,.8))  # 20%, 50%, 80% quantiles of AveOccup
XGrid=matrix(colMeans(X), byrow=TRUE, nrow=9, ncol=ncol(X)); colnames(XGrid) = colnames(X) # other covariates fixed at the sample means
XGrid[,c("MedInc", "AveOccup")] = data.matrix(expand.grid(X.MedInc, X.AveOccup))
yGrid = seq(0,510,length.out = 100) # response grid

# conditional density plot
# Gaussian basis
plot.cahousing.Gaussian = densityPlot(X = XGrid, yGrid = yGrid, model = model.cahousing.Gaussian, plot = FALSE)
# natural spline basis
plot.cahousing.ns = densityPlot(X = XGrid, yGrid = yGrid, model = model.cahousing.ns, plot = FALSE)
# extract the observations in the neighborhoods of the selected landmarks
XGrid.transform = XGrid
XGrid.transform[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")] = log(XGrid[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")])
# weight the Mahalanobis metric by the covariate importances:
# more important covariates contribute more to the distance
importance.matrix = diag((mean(model.cahousing.ns$importanceScore) / model.cahousing.ns$importanceScore)^3)
Sigma = diag((apply(X.transform, 2, quantile, 0.75) - apply(X.transform, 2, quantile, 0.25))^2)
Sigma = importance.matrix %*% Sigma %*% importance.matrix

# plot: overlay estimated densities (red) on histograms of nearby samples
n.sub = rep(c(300, 1000, 1000), 3) # number of neighbors used for each landmark
par(mfrow = c(3, 3)); par(mar = c(4, 4, 6, 4))
for(i in c(7, 8, 9, 4, 5, 6, 1, 2, 3)){ # panel order over the 3 x 3 landmark grid
  dist.matrix = mahalanobis(X.transform, center = XGrid.transform[i, ], cov = Sigma)
  selectIndex = order(dist.matrix, decreasing = FALSE)[1:n.sub[i]]
  hist(y.cahousing[selectIndex], main = paste("MedInc ", round(XGrid[i, "MedInc"], 2), ", ", "AveOccup ", round(XGrid[i, "AveOccup"], 2), sep = ""), xlab = "Y", ylab = "density",
       cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5,
       freq = FALSE, breaks = seq(0, 510, length.out = 20),
       ylim = c(0, 0.015), xlim = c(0, max(y.cahousing)))
  lines(x = plot.cahousing.Gaussian$y[plot.cahousing.Gaussian$group == i],
        y = plot.cahousing.Gaussian$density[plot.cahousing.Gaussian$group == i],
        lwd = 3, col = "red")
}
mtext('Gaussian basis', side=3, line=-2, outer=TRUE, cex = 1.5, font = 2)

for(i in c(7,8,9,4,5,6,1,2,3)){
  dist.matrix = mahalanobis(X.transform, center = XGrid.transform[i,], cov = Sigma)
  selectIndex = order(dist.matrix, decreasing = FALSE)[1:n.sub[i]]
  hist(y.cahousing[selectIndex], main = paste("MedInc ", round(XGrid[i,"MedInc"],2), ", ",  "AveOccup ", round(XGrid[i,"AveOccup"],2), sep = ""), xlab = "Y", ylab = "density", 
       cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5,
       freq = FALSE, breaks = seq(0, 510, length.out=20), 
       ylim = c(0,0.015), xlim = c(0,510))
  lines(x = plot.cahousing.ns$y[plot.cahousing.ns$group == i],
        y = plot.cahousing.ns$density[plot.cahousing.ns$group == i], 
        lwd = 3, col = "red")
}
mtext('Natural spline basis', side=3, line=-2, outer=TRUE, cex = 1.5, font = 2)

Plug in your own data and have fun with LinCDE!


