knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  tidy = FALSE
)

In this vignette, we demonstrate how to use the R package LinCDE for conditional density estimation.

Conditional density estimation and LinCDE

The density of a continuous random variable characterizes its relative likelihood of taking a specific value. In statistics, many questions are essentially questions about density characteristics, e.g., location, variance, modality, and skewness. A useful collection of methods has been proposed for density estimation, such as kernel density estimation, Gaussian mixture models, local likelihood, and log-spline methods. Here we use Lindsey's method.
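
To fix ideas, here is a minimal sketch of Lindsey's method in the unconditional case (an illustration only, not part of the LinCDE package): discretize the response into bins, treat the bin counts as Poisson, and fit a log-linear model in a spline basis; the normalized fitted means estimate the density.

# minimal sketch of (unconditional) Lindsey's method; not part of the package
lindsey = function(y, nBins = 30, df = 5){
  breaks = seq(min(y), max(y), length.out = nBins + 1)
  mids = (breaks[-1] + breaks[-(nBins + 1)]) / 2 # bin midpoints
  counts = as.vector(table(cut(y, breaks, include.lowest = TRUE)))
  fit = glm(counts ~ splines::ns(mids, df = df), family = poisson)
  dens = exp(predict(fit)) # fitted counts, proportional to the density
  dens / sum(dens * diff(breaks)) # normalize to integrate to one
}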

In practice, the density of interest often depends on other variables. For instance, the mean height increases with the age of children, the density of food expenditure has higher variance in a higher-income community than in a lower-income community, and the density of salary is bi-modal among lawyers but unimodal among firefighters. The heterogeneity in conditional densities often carries meaningful messages.

We propose LinCDE.boost to estimate conditional densities: a boosting algorithm based on LinCDE trees. A LinCDE tree partitions the covariate space into subregions with approximately locally homogeneous densities, employs Lindsey's method for density estimation within each subregion, and aggregates the densities from the subregions to form the estimated conditional density. LinCDE boost grows a number of LinCDE trees in a forward-stagewise manner, where each tree fits a type of residual from the current estimate. For more details, please refer to our LinCDE paper.

A toy example

We illustrate the workflow of the LinCDE package via a toy example. Before we begin, let us install and attach the LinCDE package.

# devtools::install_github("ZijunGao/LinCDE")
library("LinCDE")

The toy conditional density is locally Gaussian. We consider $20$ covariates $X = [X_1, \ldots, X_{20}]^\top$. Given a covariate value, the response $y$ is Gaussian with mean $0.5 X_1 + X_1 X_2$ and standard deviation $0.5 + 0.25 X_2$.

# locally Gaussian design (LGD)
  # If y = NULL, density.LGD generates responses y given covariates X.
  # If y != NULL, density.LGD outputs conditional densities f(y | X).
density.LGD = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = (0.5 * X[, 1] + X[, 1] * X[, 2]), sd = (0.5 + 0.25 * X[, 2]))
    return(dens)
  }else{
    y = 0.5 * X[, 1] + X[, 1] * X[, 2] + rnorm(dim(X)[1], 0, 1) * (0.5 + 0.25 * X[, 2])
    return(y)
  }
}

We conduct conditional density estimation based on $1000$ training data points. Covariates $X_1$, $\ldots$, $X_{20}$ are generated independently uniformly from $[-1, 1]$. We also generate independent validation and test samples of size $1000$ for hyper-parameter tuning and performance evaluation, respectively.

set.seed(100)
# training data
n = 1000; d = 20; X = matrix(runif(n * d, -1, 1), ncol = d); colnames(X) = paste("X", seq(1,d), sep = "")
y.LGD = density.LGD(X)
# validation data
nVal = 1000; XVal = matrix(runif(nVal * d, -1, 1), ncol = d); colnames(XVal) = paste("X", seq(1,d), sep = "")
yVal.LGD = density.LGD(XVal)
# test data
nTest = 1000; XTest = matrix(runif(nTest * d, -1, 1), ncol = d); colnames(XTest) = paste("X", seq(1,d), sep = "")
yTest.LGD = density.LGD(XTest)

The lattice plot below visualizes the true conditional densities at $9$ particular landmark covariate values. According to the data generation mechanism, $X_1$ and $X_2$ influence the response's mean, $X_2$ also influences the response's variance, and other covariates are nuisance variables and have no influence on the response distribution. The landmark covariates differ only in $X_1$ and $X_2$.

# landmarks for visualization
XProfile = matrix(0, nrow = 3^2, ncol = d); colnames(XProfile) = paste("X", seq(1,d), sep = "")
X1 = X2 = c(-0.6,0,0.6); XProfile[, c(1,2)] = as.matrix(expand.grid(X1, X2))

# true conditional density plot
densityPlot(X = XProfile, trueDensity = density.LGD, minY = -3, maxY = 3, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))

Tuning

There are a few hyper-parameters critical to the performance of LinCDE boost. We tune the hyper-parameters on a separate validation dataset.

Primary parameters

The two primary parameters of interest are the number of iterations (n.trees) and the depth of the base learner LinCDE tree (depth).

# depth = 1
model.LGD.D1 = LinCDE.boost(X = X, y = y.LGD, depth = 1, n.trees = 100, verbose = FALSE)
predict.LGD.D1 = predict(model.LGD.D1, X = XVal, y = yVal.LGD, densityOnly = FALSE)
# depth = 2
model.LGD.D2 = LinCDE.boost(y = y.LGD, X = X, depth = 2, n.trees = 100, verbose = FALSE) 
predict.LGD.D2 = predict(model.LGD.D2, X = XVal, y = yVal.LGD, densityOnly = FALSE)
matplot(1:100,
        cbind(predict.LGD.D1$testLogLikelihoodHistory,
              predict.LGD.D2$testLogLikelihoodHistory),
        ylab = "log-likelihood", xlab = "n.trees", main = "LGD",
        type = "l", lty = 1, col = c("black", "blue"), lwd = 2)
legend("bottomright", legend = c("depth=1", "depth=2"), lty = 1, lwd = 3, col = c("black", "blue"))

Secondary parameters

There are a number of secondary hyper-parameters. We recommend starting with the default values and making changes only if certain issues are observed.

Standard boosting parameters

There are several hyper-parameters applicable to all boosting-based methods, such as the learning rate (shrinkage) and the minimal terminal node size (terminalSize). The same advice applies: keep the default values unless a specific issue arises.

Centering parameters

We discuss centering for LinCDE boost below; it aims to solve the "disjoint support" issue. The hyper-parameters used by the centering include centering (whether to center the responses) and centeringMethod (the method used to estimate the conditional means).
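
For example, a centered LinCDE boost fit might look as follows (a sketch reusing the toy data above; the centering and centeringMethod arguments reappear in the centering example later in this vignette):

model.centering.sketch = LinCDE.boost(X = X, y = y.LGD, centering = TRUE,
                                      centeringMethod = "linearRegression",
                                      n.trees = 50, verbose = FALSE)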

Estimation

In the above example, we tune all the primary and part of the secondary parameters on the separate validation dataset. We choose the hyper-parameters basis = "Gaussian", shrinkage = 0.02, depth = 3, n.trees = 300, and leave the other parameters at their default values.

We print the trained LinCDE model, which gives us the model formula, the crucial hyper-parameters, and the relative influences/importance scores (discussed in the importance evaluation section below). Note that the boosting procedure stops early, after $216 < 300$ iterations.

# tuned LinCDE boost model
model.LGD.final = LinCDE.boost(y = y.LGD, X = X, basis = "Gaussian", depth = 3, n.trees = 300, shrinkage = 0.02, verbose = FALSE)
print(model.LGD.final)

Prediction

With a LinCDE boost model, we can predict conditional densities of an independent test dataset via the function predict.

predict.LGD.final.simple = predict(model.LGD.final, X = XTest, y = yTest.LGD)

Note we have supplied both X and y --- the latter are the locations where we wish to evaluate the density, one per value of X.
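
Assuming the call above returns the vector of estimated conditional density values, the average test log-likelihood follows directly:

# average test log-likelihood from the predicted density values
mean(log(predict.LGD.final.simple))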

We can also use predict to produce conditional density curves. In this case we supply test covariates only (and no test responses). By default, predict computes conditional densities at the response grid points used in training. We also allow users to supply handcrafted grid points through splitPointYTest.

predict.LGD.final.density = predict(model.LGD.final, X = XTest)
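
For example, a handcrafted grid can be supplied as follows (a sketch, assuming splitPointYTest accepts a numeric vector of response grid points):

yGridTest = seq(-3, 3, length.out = 100) # handcrafted response grid
predict.LGD.grid = predict(model.LGD.final, X = XTest, splitPointYTest = yGridTest)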

Finally, it is possible to use predict to make both types of predictions: conditional density predictions at test covariate-response pairs and conditional density curves at test covariates. Simply set densityOnly = FALSE in predict. Check out the documentation of predict for the full list of returned values.

predict.LGD.final = predict(model.LGD.final, X = XTest, y = yTest.LGD, densityOnly = FALSE)

We evaluate the performance of LinCDE boost. On simulated data, we report the relative improvement in the test log-likelihood \begin{align} \frac{\ell_{\text{LinCDE boost}} - \ell_{\text{null}}}{\ell_{\text{oracle}} - \ell_{\text{null}}}. \end{align} Here the null model is the unconditional Gaussian distribution with the response's marginal mean and standard deviation, and the oracle denotes the true underlying conditional density. The criterion is analogous to the goodness-of-fit measure $R^2$ of linear regression. In this example, LinCDE boost achieves $80.9\%$ of the oracle's improvement over the null model.

# null model
model.LGD.null = LinCDE.boost(X = X, y = y.LGD, basis = "Gaussian", n.trees = 0)
predict.LGD.null = predict(model.LGD.null, X = XTest, y = yTest.LGD,  densityOnly = FALSE)
# oracle
oracle = mean(log(density.LGD(X = XTest, y = yTest.LGD)))
# relative improvement 
(relativeImprovement =   
  (predict.LGD.final$testLogLikelihood - predict.LGD.null$testLogLikelihood)/  
  (oracle - predict.LGD.null$testLogLikelihood))

When the true density is unknown, as on a real dataset, the relative improvement cannot be computed. However, we can still use the test log-likelihood to assess LinCDE boost's performance.
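
For instance, the prediction object computed above with densityOnly = FALSE carries the test log-likelihood:

predict.LGD.final$testLogLikelihood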

Visualization

We plot the estimated conditional densities against the truth at the above $9$ selected feature points. The estimated conditional densities are close to the truth.

# conditional density plot
densityPlot(X = XProfile, trueDensity = density.LGD, model = model.LGD.final, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))

Importance evaluation

To identify the influential covariates, we plot the importance score of each covariate. The importance score is defined as in random forests and gradient boosting: it is computed from the training data and records how much each covariate contributes to the improvement of the objective.

In the above example, the importance score barplot indicates that the first two covariates contribute the most to the improvement in the objective, which agrees with the underlying density model.

# relative influence plot
importance.LGD = summary(model.LGD.final, cBars = 8)
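
Assuming summary returns the scores it plots as a data frame of covariates and relative influences (as in gbm), the top entries can be inspected directly:

head(importance.LGD)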

More examples

The above example shows LinCDE boost's ability to capture location and scale changes. We add two more examples focusing on modality and skewness (asymmetry), respectively. We also include an example with the "disjoint support" issue.

Modality example

For modality, we generate locally Gaussian mixture responses if $X_2 \le 0.2$ and locally Gaussian responses if $X_2 > 0.2$. Meanwhile, we let the responses' locations depend on $X_1$. We follow the aforementioned workflow and set the hyper-parameters at shrinkage = 0.05, depth = 3, n.trees = 200, using a larger df (df = 8) to capture the bi-modality. The rest of the parameters are at their default values.

# data generation mechanism
density.modality = function(X, y = NULL){
  # If y != NULL, return conditional densities f(y | X); otherwise generate responses.
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = 0.25 * X[, 1], sd = 0.8)
    index = which(X[, 2] <= 0.2)
    dens[index] = 0.5 * dnorm(y[index], mean = 0.25 * X[index, 1] + 0.8, sd = 0.3) + 0.5 * dnorm(y[index], mean = 0.25 * X[index, 1] - 0.8, sd = 0.3)
    return(dens)
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, -0.8, 0.3) + (1 - groupIndex) * rnorm(n, 0.8, 0.3)
    y = 0.25 * X[, 1] + (X[, 2] <= 0.2) * y + (X[, 2] > 0.2) * rnorm(n, 0, 0.8)
    return(y)
  }
}
set.seed(1)
y.modality = density.modality(X)
model.modality = LinCDE.boost(y = y.modality, X = X, df = 8,
                                 depth = 3, n.trees = 200, shrinkage = 0.05, verbose = FALSE)

We compare the estimated conditional densities against the truth at the $9$ selected landmarks (left figure below). The estimated conditional densities are clearly bi-modal for $X_2 = -0.6$ and $X_2 = 0$. For $X_2 = 0.6$, the estimated conditional densities are approximately Gaussian with mild curvature in the middle.

We also plot the importance scores (right figure below). In this setting, $X_1$ and $X_2$ are the most influential covariates, consistent with the data generation mechanism.

densityPlot(X = XProfile, trueDensity = density.modality, model = model.modality, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))
importance.modality = summary(model.modality, cBars = 8)

Symmetry example

For skewness, we generate asymmetric Gaussian mixture responses: if $X_2 > 0$, the right mode is sharper; if $X_2 < 0$, the left mode is sharper. The larger $|X_2|$ is, the more asymmetric the conditional distribution. Meanwhile, we let the responses' locations depend on $X_1$. We follow the aforementioned workflow and set the hyper-parameters at shrinkage = 0.1, depth = 3, n.trees = 200, and df = 8. The remaining parameters are at their default values.

# data generation mechanism
density.skewness = function(X, y = NULL){
  # If y != NULL, return conditional densities f(y | X); otherwise generate responses.
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = 0.5 * dnorm(y, mean = 0.5 * X[, 1] + 0.8, sd = 0.5 + 0.45 * X[, 2]) +
      0.5 * dnorm(y, mean = 0.5 * X[, 1] - 0.8, sd = 0.5 - 0.45 * X[, 2])
    return(dens)
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, 0.8, 0.5 + 0.45 * X[, 2]) + (1 - groupIndex) * rnorm(n, -0.8, 0.5 - 0.45 * X[, 2]) + 0.5 * X[, 1]
    return(y)
  }
}

set.seed(1)
y.skewness = density.skewness(X)
model.skewness = LinCDE.boost(X = X, y = y.skewness, df = 8,
                                 depth = 3, n.trees = 200, shrinkage = 0.1, verbose = FALSE)

We compare the estimated conditional densities against the truth at $9$ landmarks (left figure below). The estimated conditional densities are right-skewed for $X_2 = 0.6$, left-skewed for $X_2 = -0.6$, and symmetric for $X_2 = 0$.

As for the importance scores (right figure below), $X_1$ and $X_2$ are correctly identified as important covariates.

densityPlot(X = XProfile, trueDensity = density.skewness, model = model.skewness, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))
importance.skewness = summary(model.skewness, cBars = 8)

Centering example

Centering aims to align the centers of the conditional densities in advance: we first estimate the conditional means and subtract the estimates from the responses, then apply LinCDE boost to the residuals to capture additional distributional structure (a manual sketch of this step appears after the data generation below). The centering step is helpful when the conditional densities at different covariate values differ dramatically in location, i.e., the "disjoint support" issue. For more details regarding LinCDE boost and centering, please refer to our LinCDE paper. In practice, we recommend creating scatter plots of the responses versus (a selected subset of) the covariates to detect the "disjoint support" issue.

In the following example, we generate locally Gaussian mixture responses if $X_2 \le 0.2$, and locally Gaussian responses if $X_2 > 0.2$. Meanwhile, we let the responses' supports differ dramatically as $X_1$ changes, and thus the "disjoint support" problem is present.

# data generation mechanism
density.centering = function(X, y = NULL){
  # If y != NULL, return conditional densities f(y | X); otherwise generate responses.
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = 4 * X[, 1], sd = 0.8)
    index = which(X[, 2] <= 0.2)
    dens[index] = (0.5 * dnorm(y[index], mean = 4 * X[index, 1] + 0.8, sd = 0.3) + 0.5 * dnorm(y[index], mean = 4 * X[index, 1] - 0.8, sd = 0.3))
    return(dens)
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, -0.8, 0.3) + (1 - groupIndex) * rnorm(n, 0.8, 0.3)
    y = 4 * X[, 1] + (X[, 2] <= 0.2) * y + (X[, 2] > 0.2) * rnorm(n, 0, 0.8)
    return(y)
  }
}

# generate data
set.seed(1)
y.centering = density.centering(X)
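
To make the centering step concrete, here is a manual sketch of the idea, using lm() as a stand-in conditional-mean model (an illustration only; LinCDE.boost performs an analogous step internally when centering = TRUE):

# manual centering sketch: estimate the conditional mean, subtract it,
# and density-estimate the residuals
dat.centering = data.frame(y = y.centering, X)
meanFit = lm(y ~ ., data = dat.centering)
residual.centering = y.centering - predict(meanFit)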

We follow the aforementioned workflow. For LinCDE boost without centering, we set the hyper-parameters at shrinkage = 0.02, depth = 2, n.trees = 200, and splineDf = 10 (the remaining parameters are at their default values). For LinCDE boost with centering, we additionally set df = 8 and use linear regression as the centering method by setting centering = TRUE and centeringMethod = "linearRegression". Check out the documentation of LinCDE.boost for the available centering methods.

# without centering
model.centering.no = LinCDE.boost(X = X,  y = y.centering, 
                                  depth = 2, n.trees = 200, shrinkage = 0.02, splineDf = 10, verbose = FALSE)
# with centering
model.centering.OLS = LinCDE.boost(X = X, y = y.centering, 
                                   df = 8, depth = 2, n.trees = 200, shrinkage = 0.02, 
                                   centering = TRUE, centeringMethod = "linearRegression", verbose = FALSE)

We compare LinCDE boost with and without centering. We plot the estimated conditional densities at $9$ landmark covariate values. LinCDE boost without centering does not capture the bi-modality structure. LinCDE boost with centering reflects the bi-modality structure for $X_2 = 0$ or $X_2 = -0.6$.

densityPlot(X = XProfile, trueDensity = density.centering, model = model.centering.no, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""), main = "Estimated and true conditional densities (without centering)")
densityPlot(X = XProfile, trueDensity = density.centering, model = model.centering.OLS, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""), main = "Estimated and true conditional densities (with centering)")

We also compare the relative influence plots. For LinCDE boost without centering, most of the effort is spent learning the influence of $X_1$ on the response's location. For LinCDE boost with centering, we stack the importances from centering (in red) and beyond centering (in blue): $X_1$ accounts for most of the importance in the centering, and $X_2$ is the most important covariate beyond centering. Currently, the decomposition of importances into centering and beyond centering is only available for centeringMethod = "linearRegression" or centeringMethod = "randomForest".

importance.centering.no = summary(model.centering.no, cBars = 8, main = "Without centering")
importance.centering.OLS = summary(model.centering.OLS, cBars = 8, main = "With centering")

California housing example

We read in the benchmark California Housing dataset. The dataset collects California housing prices and $8$ covariates from the $1990$ Census. More details are available on the dataset's website. We remove winsorized responses and divide the responses by $1000$. For time efficiency, we randomly select $1000$ observations as the training data.

cahousing = read.csv("./data/cahousing.csv", header = TRUE)
y.cahousing = cahousing[, 1]; cahousing = cahousing[y.cahousing < max(y.cahousing), ] # remove winsorized responses
y.cahousing = cahousing[, 1] / 1000
X = data.matrix(cahousing[, -1])
set.seed(318); indexTrain = sample(length(y.cahousing), 1000, replace = FALSE)
# log-transform selected positive covariates
X.transform = X
X.transform[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")] = log(X[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")])

As a start, we fit a simple LinCDE boost model with sufficient statistics $y$ and $y^2$. The model incorporates location and variance information, and the resulting conditional densities are locally Gaussian.

model.cahousing.Gaussian = LinCDE.boost(y = y.cahousing[indexTrain], X=X[indexTrain,], 
                                        basis = "Gaussian", depth=3, n.trees=150, 
                                        terminalSize=50, shrinkage = 0.05, minY = 0, maxY = 510, 
                                        verbose = FALSE)

Next we use $10$ transformed natural cubic splines with $6$ degrees of freedom to learn a more complicated LinCDE boost model.

model.cahousing.ns = LinCDE.boost(y=y.cahousing[indexTrain], X=X[indexTrain,], 
                                  basis = "nsTransform", depth=3, splineDf = 10, df = 6, n.trees=300, 
                                  terminalSize=50, shrinkage = 0.05, minY = 0, maxY = 510, 
                                  verbose = FALSE)

We provide the relative influence plots. The importances computed by the two LinCDE boost models are similar: MedInc (median income) is the dominant covariate followed by AveOccup (average occupancy), Longitude, and Latitude.

par(mar = c(4,6,4,4))
importance.cahousing.Gaussian = summary(model.cahousing.Gaussian, cBars = 8, main = "Gaussian basis")
importance.cahousing.ns = summary(model.cahousing.ns, cBars = 8, main = "Natural spline basis")

We plot the conditional density estimates for different (MedInc, AveOccup) pairs. Since we don't know the underlying truth, we plot the histograms of the samples with similar covariate values as our reference. (The similarity of samples to a target point is measured by the covariate Mahalanobis distance weighted by the covariate importances from the LinCDE boost model.)

For both LinCDE boost models, the conditional densities shift to the right as MedInc increases, and the spreads decrease as AveOccup grows in the middle and right columns. For the LinCDE boost model with the natural spline basis, the estimated conditional densities take diverse shapes beyond the Gaussian. In local regions (indicated in the panel subtitles) with median income $2.32$ ($20\%$ quantile) or $3.45$ ($50\%$ quantile), the conditional distributions appear right-skewed. In regions with median income $3.45$ ($50\%$ quantile) and average occupancy $2.36$ ($20\%$ quantile) or $2.84$ ($50\%$ quantile), the conditional distributions have flatter peaks than the Gaussian distribution.

# landmarks for visualization
X.MedInc = quantile(X[,"MedInc"],c(.2,.5,.8)) # 20%, 50%, 80% quantiles of MedInc
X.AveOccup = quantile(X[,"AveOccup"],c(.2,.5,.8))  # 20%, 50%, 80% quantiles of AveOccup
XGrid=matrix(colMeans(X), byrow=TRUE, nrow=9, ncol=ncol(X)); colnames(XGrid) = colnames(X) # other covariates fixed at the sample means
XGrid[,c("MedInc", "AveOccup")] = data.matrix(expand.grid(X.MedInc, X.AveOccup))
yGrid = seq(0,510,length.out = 100) # response grid

# conditional density plot
# Gaussian basis
plot.cahousing.Gaussian = densityPlot(X = XGrid, yGrid = yGrid, model = model.cahousing.Gaussian, plot = FALSE)
# natural spline basis
plot.cahousing.ns = densityPlot(X = XGrid, yGrid = yGrid, model = model.cahousing.ns, plot = FALSE)
# extract the observations in the neighborhoods of the selected landmarks
XGrid.transform = XGrid
XGrid.transform[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")] = log(XGrid[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")])
# weight the Mahalanobis metric by the covariate importances:
# more important covariates contribute more to the distance
importance.matrix = diag((mean(model.cahousing.ns$importanceScore) / model.cahousing.ns$importanceScore)^3)
Sigma = diag((apply(X.transform, 2, quantile, 0.75) - apply(X.transform, 2, quantile, 0.25))^2)
Sigma = importance.matrix %*% Sigma %*% importance.matrix

# plot: overlay estimated densities (red) on histograms of nearby samples
n.sub = rep(c(300, 1000, 1000), 3) # number of neighbors used for each landmark
par(mfrow = c(3, 3)); par(mar = c(4, 4, 6, 4))
for(i in c(7, 8, 9, 4, 5, 6, 1, 2, 3)){ # panel order over the 3 x 3 landmark grid
  dist.matrix = mahalanobis(X.transform, center = XGrid.transform[i, ], cov = Sigma)
  selectIndex = order(dist.matrix, decreasing = FALSE)[1:n.sub[i]]
  hist(y.cahousing[selectIndex], main = paste("MedInc ", round(XGrid[i, "MedInc"], 2), ", ", "AveOccup ", round(XGrid[i, "AveOccup"], 2), sep = ""), xlab = "Y", ylab = "density",
       cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5,
       freq = FALSE, breaks = seq(0, 510, length.out = 20),
       ylim = c(0, 0.015), xlim = c(0, max(y.cahousing)))
  lines(x = plot.cahousing.Gaussian$y[plot.cahousing.Gaussian$group == i],
        y = plot.cahousing.Gaussian$density[plot.cahousing.Gaussian$group == i],
        lwd = 3, col = "red")
}
mtext('Gaussian basis', side=3, line=-2, outer=TRUE, cex = 1.5, font = 2)

for(i in c(7,8,9,4,5,6,1,2,3)){
  dist.matrix = mahalanobis(X.transform, center = XGrid.transform[i,], cov = Sigma)
  selectIndex = order(dist.matrix, decreasing = FALSE)[1:n.sub[i]]
  hist(y.cahousing[selectIndex], main = paste("MedInc ", round(XGrid[i,"MedInc"],2), ", ",  "AveOccup ", round(XGrid[i,"AveOccup"],2), sep = ""), xlab = "Y", ylab = "density", 
       cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5,
       freq = FALSE, breaks = seq(0, 510, length.out=20), 
       ylim = c(0,0.015), xlim = c(0,510))
  lines(x = plot.cahousing.ns$y[plot.cahousing.ns$group == i],
        y = plot.cahousing.ns$density[plot.cahousing.ns$group == i], 
        lwd = 3, col = "red")
}
mtext('Natural spline basis', side=3, line=-2, outer=TRUE, cex = 1.5, font = 2)

Plug in your own data and have fun with LinCDE!


