knitr::opts_chunk$set(collapse = TRUE, comment = "#>", tidy = FALSE)
In this vignette, we introduce how to use the R package LinCDE for conditional density estimation.
The density of a continuous random variable characterizes its relative likelihood of taking a specific value. In statistics, many questions are essentially questions about density characteristics, e.g., location, variance, modality, and skewness. A useful collection of methods has been proposed for density estimation, such as kernel density estimation, Gaussian mixture models, local likelihood, and log-spline methods. Here we use Lindsey's method.
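As a rough, self-contained sketch of Lindsey's idea (not the LinCDE internals): discretize the response into bins, regress the bin counts on a spline basis of the bin centers with a Poisson GLM, and normalize the fitted means into a density. The function name and tuning values below are illustrative only.

# minimal sketch of Lindsey's method for (unconditional) density estimation
lindsey = function(y, nBins = 50, splineDf = 10){
  breaks = seq(min(y), max(y), length.out = nBins + 1)
  counts = as.vector(table(cut(y, breaks, include.lowest = TRUE)))  # bin counts
  centers = (breaks[-1] + breaks[-(nBins + 1)]) / 2                 # bin centers
  basis = splines::ns(centers, df = splineDf)                       # spline basis
  fit = glm(counts ~ basis, family = poisson())                     # Poisson regression
  dens = fit$fitted.values / sum(fit$fitted.values * diff(breaks))  # normalize
  data.frame(y = centers, density = dens)
}
# e.g., plot(lindsey(rnorm(1000)), type = "l")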
In practice, the density of interest often depends on other variables. For instance, the mean height increases with the age of children, the density of food expenditure has a higher variance in a higher-income community than in a lower-income community, and the density of salary is bi-modal among lawyers but unimodal among firefighters. The heterogeneity in conditional densities often carries meaningful messages.
We propose LinCDE.boost to estimate conditional densities: a boosting algorithm based on LinCDE trees. A LinCDE tree partitions the covariate space into subregions with approximately locally homogeneous densities, employs Lindsey's method for density estimation in each subregion, and aggregates the densities from the subregions into the estimated conditional density. LinCDE boost grows a number of LinCDE trees in a forward-stagewise manner, where each tree fits a type of residual from the current estimate. For more details, please refer to our LinCDE paper.
We illustrate the workflow of the LinCDE package via a toy example. Before we begin, let us install and attach the LinCDE package.
# devtools::install_github("ZijunGao/LinCDE")
library("LinCDE")
The toy conditional density is locally Gaussian. We consider $20$ covariates $X = [X_1, \ldots, X_{20}]^\top$. Given a covariate value, the response $y$ is Gaussian with mean $0.5 X_1 + X_1 X_2$ and standard deviation $0.5 + 0.25 X_2$.
# locally Gaussian design (LGD)
# If y = NULL, density.LGD generates responses y given covariates X.
# If y != NULL, density.LGD outputs conditional densities f(y | X).
density.LGD = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = (0.5 * X[, 1] + X[, 1] * X[, 2]),
                 sd = (0.5 + 0.25 * X[, 2]))
    return(dens)
  } else {
    y = 0.5 * X[, 1] + X[, 1] * X[, 2] +
      rnorm(dim(X)[1], 0, 1) * (0.5 + 0.25 * X[, 2])
    return(y)
  }
}
We conduct conditional density estimation based on $1000$ training data points. Covariates $X_1$, $\ldots$, $X_{20}$ are generated independently uniformly from $[-1, 1]$. We also generate independent validation and test samples of size $1000$ for hyper-parameter tuning and performance evaluation, respectively.
set.seed(100)
# training data
n = 1000; d = 20
X = matrix(runif(n * d, -1, 1), ncol = d)
colnames(X) = paste("X", seq(1, d), sep = "")
y.LGD = density.LGD(X)

# validation data
nVal = 1000
XVal = matrix(runif(nVal * d, -1, 1), ncol = d)
colnames(XVal) = paste("X", seq(1, d), sep = "")
yVal.LGD = density.LGD(XVal)

# test data
nTest = 1000
XTest = matrix(runif(nTest * d, -1, 1), ncol = d)
colnames(XTest) = paste("X", seq(1, d), sep = "")
yTest.LGD = density.LGD(XTest)
The lattice plot below visualizes the true conditional densities at $9$ particular landmark covariate values. According to the data generation mechanism, $X_1$ and $X_2$ influence the response's mean, $X_2$ also influences the response's variance, and other covariates are nuisance variables and have no influence on the response distribution. The landmark covariates differ only in $X_1$ and $X_2$.
# landmarks for visualization
XProfile = matrix(0, nrow = 3^2, ncol = d)
colnames(XProfile) = paste("X", seq(1, d), sep = "")
X1 = X2 = c(-0.6, 0, 0.6)
XProfile[, c(1, 2)] = as.matrix(expand.grid(X1, X2))

# true conditional density plot
densityPlot(X = XProfile, trueDensity = density.LGD, minY = -3, maxY = 3,
            factor.levels = paste("X1=", XProfile[, 1], ", X2=", XProfile[, 2], sep = ""))
There are a few hyper-parameters critical to the performance of LinCDE boost. We tune the hyper-parameters on a separate validation dataset.
The two primary parameters of interest are the number of iterations (`n.trees`) and the depth of the base learner LinCDE tree (`depth`).
- `n.trees`: the number of LinCDE trees to fit. We train a LinCDE boost model on the training data with a large number of trees, e.g., $100$, compute the validation log-likelihood after each iteration on a separate validation dataset, and choose the number of iterations with the maximal validation log-likelihood. Note that if, after a number of iterations, additional trees stop contributing effectively to the log-likelihood, the boosting procedure stops before `n.trees` base learners are added.
In the following example with `depth = 2` (the parameter `depth` is discussed below), we specify the maximal number of trees as $100$. We evaluate the validation log-likelihood and plot the curve. Here we use the function `predict` to compute the validation errors; we discuss `predict` further in Section 2.3. The log-likelihood keeps increasing, and we choose `n.trees = 100` for `depth = 2`.
# depth = 1
model.LGD.D1 = LinCDE.boost(X = X, y = y.LGD, depth = 1, n.trees = 100, verbose = FALSE)
predict.LGD.D1 = predict(model.LGD.D1, X = XVal, y = yVal.LGD, densityOnly = FALSE)

# depth = 2
model.LGD.D2 = LinCDE.boost(y = y.LGD, X = X, depth = 2, n.trees = 100, verbose = FALSE)
predict.LGD.D2 = predict(model.LGD.D2, X = XVal, y = yVal.LGD, densityOnly = FALSE)

matplot(1:100, cbind(predict.LGD.D1$testLogLikelihoodHistory,
                     predict.LGD.D2$testLogLikelihoodHistory),
        ylab = "log-likelihood", xlab = "n.trees", main = "LGD",
        type = "l", lty = 1, col = c("black", "blue"), lwd = 2)
legend("bottomright", legend = c("depth=1", "depth=2"),
       lty = 1, lwd = 3, col = c("black", "blue"))
- `depth`: the number of splits of each LinCDE tree. We apply LinCDE boost with a grid of depths. For each depth, we choose the number of iterations as above and record the associated validation log-likelihood. We choose the depth with the maximal validation log-likelihood, as in the sketch below.
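The following is a minimal sketch of this grid search, reusing the training and validation objects defined above (the depth grid itself is illustrative):

depths = c(1, 2, 3)
valLogLik = sapply(depths, function(D){
  model = LinCDE.boost(X = X, y = y.LGD, depth = D, n.trees = 100, verbose = FALSE)
  pred = predict(model, X = XVal, y = yVal.LGD, densityOnly = FALSE)
  max(pred$testLogLikelihoodHistory)  # best iteration's validation log-likelihood
})
depths[which.max(valLogLik)]  # depth with the maximal validation log-likelihood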
In the above example, we experiment with `depth = 1` and `depth = 2`. For `depth = 1`, we choose $100$ iterations, and the associated log-likelihood is $-0.80$. For `depth = 2`, we choose $100$ iterations, and the associated log-likelihood is $-0.76$. Since $-0.76 > -0.80$, we end up with `depth = 2`. Notice that the example's conditional mean involves an interaction term of $X_1$ and $X_2$, and thus `depth = 1` (an additive model in the density's exponent) is too restrictive compared to `depth = 2` (a richer model including first-order interactions).
We remark that in standard boosting, deep trees are problematic due to overfitting. In LinCDE boost, the overfitting issue is even more severe, because the density estimation problem at the terminal nodes is more complicated than regression or classification. As a result, we do not recommend `depth > 5`.
There are a number of secondary hyper-parameters. We recommend starting with the default values and making changes only if certain issues are observed.
- `basis`: the type of spline basis. Default is the cubic natural spline basis. Cubic natural splines are desirable for their flexibility, smoothness, and linear extrapolation beyond the boundary knots. However, if the conditional densities are believed to belong to a certain distribution family, a specific `basis` should be adopted. Here are two examples.
  - If the conditional densities are believed to be Gaussian, `basis = "Gaussian"` is the most appropriate choice.
  - If the conditional densities may be multi-modal, `basis = "Gaussian"` cannot produce bi-modality structures, and a more flexible spline basis should be adopted.
- `splineDf`: the number of spline bases. Default is $10$. If you go with the natural cubic spline basis, `splineDf` specifies the splines' degrees of freedom, i.e., the number of spline bases. A larger `splineDf` is able to characterize more local structures but may produce unnecessary curvature.
- `df`: the ridge Poisson regression's degrees of freedom. Default is $2$. `df` is used to determine the ridge regularization hyper-parameter. A smaller `df` corresponds to a larger regularization parameter and helps avoid computational instabilities in subregions with a limited number of observations.
- `prior`: the type of the initial carrying density. Default is "Gaussian", i.e., the Gaussian distribution with the marginal response mean and standard deviation is used as the universal density initialization. If you set `prior = "uniform"`, the uniform distribution over the response range is used. If you set `prior = "LindseyMarginal"`, the marginal response density estimated by Lindsey's method based on all responses is used. You can also input a homogeneous or heterogeneous conditional density function; it should take a covariate matrix $X$ and a response vector $y$, and output the densities at the pairs $(X, y)$. If the prior conditional density is close to the underlying truth, e.g., a pretrained conditional density estimator, LinCDE boost will require fewer iterations.
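As a hedged illustration of a user-supplied prior (the function name is hypothetical), a homogeneous standard Gaussian prior that ignores the covariates could look like:

# hypothetical custom prior: a homogeneous standard Gaussian, ignoring X;
# per the description above, it returns the density at each pair (X[i, ], y[i])
prior.gaussian0 = function(X, y){ dnorm(y, mean = 0, sd = 1) }
# model.prior = LinCDE.boost(X = X, y = y.LGD, prior = prior.gaussian0,
#                            depth = 2, n.trees = 100, verbose = FALSE)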
There are several hyper-parameters applicable to all boosting-based methods. We recommend starting with the default values and making changes only if certain issues are observed.
- `splitPoint`: a list of candidate splits or numbers of candidate splits. Each element is a vector of a variable's candidate splits (including the left and right endpoints); the list's elements are ordered the same as $X$'s columns. Alternatively, input the numbers of candidate splits: a scalar if all variables share the same number, or a vector of length nvars if variables have different numbers of candidate splits. If numbers of candidate splits are given, each variable's range is divided into `splitPoint - 1` intervals, i.e., `splitPoint` knots, containing approximately the same number of observations. Default is $20$. Note that if a variable has fewer unique values than the desired number of intervals, split intervals corresponding to each unique value are created.
- `shrinkage`: the shrinkage parameter applied to each tree in the expansion, with value in $(0, 1]$. Default is $0.1$. A smaller `shrinkage` leads to less overfitting but slower convergence.
- `terminalSize`: the minimum number of observations in a terminal node. Default is $20$. We do not recommend a very small `terminalSize`, since fitting a density model at each terminal node may involve quite a few parameters and requires a reasonable number of samples.

We will discuss centering for LinCDE boost below, which aims to solve the "disjoint support" issue. Here we list the hyper-parameters used by centering.
- `centering`: if true, a conditional mean model is fitted first, and LinCDE boost is applied to the residuals. Centering is recommended for responses whose conditional support varies wildly; see below for an example. Default is false.
- `centeringMethod`: a conditional mean estimator; applies only when `centering = TRUE`. If `centeringMethod = "linearRegression"`, a linear regression model is fitted to the response; if `centeringMethod = "randomForest"`, a random forest model is fitted. If `centeringMethod` is a function, the call `centeringMethod(y, X)` should return a conditional mean model with a predict function. Default is `"randomForest"`.

In the above example, we tune all the primary and part of the secondary parameters on the separate validation dataset. We choose the hyper-parameters `basis = "Gaussian"`, `shrinkage = 0.02`, `depth = 3`, and `n.trees = 300`, and leave the other parameters at their default values.
We print the trained LinCDE model, which gives the model formula, the crucial hyper-parameters, and the relative influences/importance scores (discussed in Section 2.5). Note that the boosting procedure stops early, after $216 < 300$ iterations.
# tuned LinCDE boost model
model.LGD.final = LinCDE.boost(y = y.LGD, X = X, basis = "Gaussian", depth = 3,
                               n.trees = 300, shrinkage = 0.02, verbose = FALSE)
print(model.LGD.final)
With a LinCDE boost model, we can predict the conditional densities of an independent test dataset via the function `predict`.
predict.LGD.final.simple = predict(model.LGD.final, X = XTest, y = yTest.LGD)
Note that we have supplied both `X` and `y`: the latter are the locations where we wish to evaluate the density, one per value of `X`.
We can also use `predict` to produce conditional density curves. In this case, we supply test covariates only (and no test responses). By default, `predict` computes conditional densities at the response grid points used in training. Users can also supply handcrafted grid points through `splitPointYTest`.
predict.LGD.final.density = predict(model.LGD.final, X = XTest)
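For instance, assuming `splitPointYTest` accepts a numeric grid of response values per the description above, a handcrafted grid could be supplied as follows (the grid itself is arbitrary):

yGridTest = seq(-3, 3, length.out = 100)  # illustrative response grid
predict.LGD.grid = predict(model.LGD.final, X = XTest, splitPointYTest = yGridTest)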
Finally, it is possible to use `predict` to make both types of predictions, i.e., conditional density predictions at test covariate-response pairs and conditional density curves at test covariates: simply set `densityOnly = FALSE`. Check out the documentation of `predict` for the full list of returned values.
predict.LGD.final = predict(model.LGD.final, X = XTest, y = yTest.LGD, densityOnly = FALSE)
We evaluate the performance of LinCDE boost. On simulated data, we report the relative improvement in the test log-likelihood \begin{align} \frac{\ell_{\text{LinCDE boost}} - \ell_{\text{null}}}{\ell_{\text{oracle}} - \ell_{\text{null}}}. \end{align} Here the null model is the universal Gaussian distribution with the response's marginal mean and standard deviation, and the oracle denotes the true underlying conditional density. The criterion is analogous to the goodness-of-fit measure $R^2$ of linear regression. In this example, LinCDE boost achieves $80.9\%$ of the oracle's improvement over the null model.
# null model
model.LGD.null = LinCDE.boost(X = X, y = y.LGD, basis = "Gaussian", n.trees = 0)
predict.LGD.null = predict(model.LGD.null, X = XTest, y = yTest.LGD, densityOnly = FALSE)

# oracle
oracle = mean(log(density.LGD(X = XTest, y = yTest.LGD)))

# relative improvement
(relativeImprovement = (predict.LGD.final$testLogLikelihood - predict.LGD.null$testLogLikelihood) /
   (oracle - predict.LGD.null$testLogLikelihood))
When the true density is unknown, such as on a real dataset, the relative improvement cannot be computed. However, we can still use the test log-likelihood to assess LinCDE boost's performance.
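For example, assuming `predict` returns the vector of conditional densities when given both `X` and `y` (as in the calls above), the average test log-likelihood is a one-liner:

# average test log-likelihood on held-out data
testLogLikelihood = mean(log(predict(model.LGD.final, X = XTest, y = yTest.LGD)))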
We plot the estimated conditional densities against the truth at the above $9$ selected feature points. The estimated conditional densities are close to the truth.
# conditional density plot
densityPlot(X = XProfile, trueDensity = density.LGD, model = model.LGD.final,
            factor.levels = paste("X1=", XProfile[, 1], ", X2=", XProfile[, 2], sep = ""))
To identify the influential covariates, we plot the importance scores of the covariates. The importance score is defined as in random forests and gradient boosting: the scores are computed from the training data and record how much each covariate contributes to the improvement of the objective.
In the above example, the importance score barplot indicates that the first two covariates contribute the most to the improvement in the objective, which agrees with the underlying density model.
# relative influence plot
importance.LGD = summary(model.LGD.final, cBars = 8)
The above example shows LinCDE boost's ability to capture location and scale changes. We add two more examples focusing on modality and skewness (asymmetry), respectively. We also include an example with the "disjoint support" issue.
For modality, we generate locally Gaussian mixture responses if $X_2 \le 0.2$ and locally Gaussian responses if $X_2 > 0.2$. Meanwhile, we let the responses' locations depend on $X_1$. We follow the aforementioned workflow and set the hyper-parameters at `shrinkage = 0.05`, `depth = 3`, and `n.trees = 200`. We use a larger `df` to learn the bi-modality. The rest of the parameters are at their default values.
# data generation mechanism
density.modality = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = 0.25 * X[, 1], sd = 0.8)
    index = which(X[, 2] < 0.2)
    dens[index] = 0.5 * dnorm(y[index], mean = 0.25 * X[index, 1] + 0.8, sd = 0.3) +
      0.5 * dnorm(y[index], mean = 0.25 * X[index, 1] - 0.8, sd = 0.3)
    return(dens)
  } else {
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, -0.8, 0.3) + (1 - groupIndex) * rnorm(n, 0.8, 0.3)
    y = 0.25 * X[, 1] + (X[, 2] <= 0.2) * y + (X[, 2] > 0.2) * rnorm(n, 0, 0.8)
    return(y)
  }
}

set.seed(1)
y.modality = density.modality(X)
model.modality = LinCDE.boost(y = y.modality, X = X, df = 8, depth = 3,
                              n.trees = 200, shrinkage = 0.05, verbose = FALSE)
We compare the estimated conditional densities against the truth at the $9$ selected feature points (left figure below). The estimated conditional densities are clearly bi-modal for $X_2 = -0.6$ and $X_2 = 0$. For $X_2 = 0.6$, the estimated conditional densities are largely Gaussian with mild curvatures in the middle.
We also plot the importance scores (right figure below). In this setting, $X_1$ and $X_2$ are the most influential covariates, consistent with the data generation mechanism.
densityPlot(X = XProfile, trueDensity = density.modality, model = model.modality,
            factor.levels = paste("X1=", XProfile[, 1], ", X2=", XProfile[, 2], sep = ""))
importance.modality = summary(model.modality, cBars = 8)
For skewness, we generate asymmetric Gaussian mixture responses. If $X_2 > 0$, the right mode is sharper; if $X_2 < 0$, the left mode is sharper. The larger $|X_2|$ is, the more asymmetric the conditional distribution. Meanwhile, we let the responses' locations depend on $X_1$. We follow the aforementioned workflow and set the hyper-parameters at `shrinkage = 0.1`, `depth = 3`, `n.trees = 200`, and `df = 8`. The rest of the parameters are at their default values.
# data generation mechanism
density.skewness = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = 0.5 * dnorm(y, mean = 0.5 * X[, 1] + 0.8, sd = 0.5 + 0.45 * X[, 2]) +
      0.5 * dnorm(y, mean = 0.5 * X[, 1] - 0.8, sd = 0.5 - 0.45 * X[, 2])
    return(dens)
  } else {
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, 0.8, 0.5 + 0.45 * X[, 2]) +
      (1 - groupIndex) * rnorm(n, -0.8, 0.5 - 0.45 * X[, 2]) + 0.5 * X[, 1]
    return(y)
  }
}

set.seed(1)
y.skewness = density.skewness(X)
model.skewness = LinCDE.boost(X = X, y = y.skewness, df = 8, depth = 3,
                              n.trees = 200, shrinkage = 0.1, verbose = FALSE)
We compare the estimated conditional densities against the truth at $9$ landmarks (left figure below). The estimated conditional densities are right-skewed for $X_2 = 0.6$, left-skewed for $X_2 = -0.6$, and symmetric for $X_2 = 0$.
As for the importance scores (right figure below), $X_1$ and $X_2$ are correctly identified as important covariates.
densityPlot(X = XProfile, trueDensity = density.skewness, model = model.skewness,
            factor.levels = paste("X1=", XProfile[, 1], ", X2=", XProfile[, 2], sep = ""))
importance.skewness = summary(model.skewness, cBars = 8)
Centering aims to align the centers of the conditional densities in advance: we first estimate the conditional means and subtract the estimates from the responses, and then apply LinCDE boost to the residuals to capture additional distributional structures. The centering step is helpful when a distribution's conditional components differ drastically in location, i.e., the "disjoint support" issue. For more details regarding LinCDE boost and centering, please refer to our LinCDE paper. In practice, we recommend creating scatter plots of the responses versus (a selected subset of) the covariates to detect the "disjoint support" issue.
In the following example, we generate locally Gaussian mixture responses if $X_2 \le 0.2$, and locally Gaussian responses if $X_2 > 0.2$. Meanwhile, we let the responses' supports differ dramatically as $X_1$ changes, and thus the "disjoint support" problem is present.
# data generation mechanism
density.centering = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = 4 * X[, 1], sd = 0.8)
    index = which(X[, 2] < 0.2)
    dens[index] = 0.5 * dnorm(y[index], mean = 4 * X[index, 1] + 0.8, sd = 0.3) +
      0.5 * dnorm(y[index], mean = 4 * X[index, 1] - 0.8, sd = 0.3)
    return(dens)
  } else {
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, -0.8, 0.3) + (1 - groupIndex) * rnorm(n, 0.8, 0.3)
    y = 4 * X[, 1] + (X[, 2] <= 0.2) * y + (X[, 2] > 0.2) * rnorm(n, 0, 0.8)
    return(y)
  }
}

# generate data
set.seed(1)
y.centering = density.centering(X)
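Following the recommendation above, a quick scatter-plot diagnostic (a minimal sketch using base graphics) makes the "disjoint support" issue visible: the response's support shifts dramatically along $X_1$.

# response versus the first two covariates
par(mfrow = c(1, 2))
plot(X[, 1], y.centering, xlab = "X1", ylab = "y", pch = 20, cex = 0.5)
plot(X[, 2], y.centering, xlab = "X2", ylab = "y", pch = 20, cex = 0.5)
par(mfrow = c(1, 1))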
We follow the aforementioned workflow. For LinCDE boost without centering, we set the hyper-parameters at `shrinkage = 0.02`, `depth = 2`, and `n.trees = 200` (the rest of the parameters are at their default values). For LinCDE boost with centering, we use linear regression as the centering method by setting `centeringMethod = "linearRegression"`. Check out the documentation of `LinCDE.boost` for the possible centering methods.
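As a hedged sketch of a user-supplied centering method (the function name is hypothetical): per the `centeringMethod` description, the call `centeringMethod(y, X)` should return a conditional mean model with a predict function, which an ordinary linear model satisfies.

# hypothetical custom centering method: an lm fit has a predict method
myCentering = function(y, X){ lm(y ~ ., data = data.frame(X)) }
# model.centering.custom = LinCDE.boost(X = X, y = y.centering, depth = 2,
#   n.trees = 200, shrinkage = 0.02, centering = TRUE,
#   centeringMethod = myCentering, verbose = FALSE)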
# without centering
model.centering.no = LinCDE.boost(X = X, y = y.centering, depth = 2, n.trees = 200,
                                  shrinkage = 0.02, splineDf = 10, verbose = FALSE)
# with centering
model.centering.OLS = LinCDE.boost(X = X, y = y.centering, df = 8, depth = 2,
                                   n.trees = 200, shrinkage = 0.02, centering = TRUE,
                                   centeringMethod = "linearRegression", verbose = FALSE)
We compare LinCDE boost with and without centering. We plot the estimated conditional densities at $9$ landmark covariate values. LinCDE boost without centering does not capture the bi-modality structure. LinCDE boost with centering reflects the bi-modality structure for $X_2 = 0$ or $X_2 = -0.6$.
densityPlot(X = XProfile, trueDensity = density.centering, model = model.centering.no,
            factor.levels = paste("X1=", XProfile[, 1], ", X2=", XProfile[, 2], sep = ""),
            main = "Estimated and true conditional densities (without centering)")
densityPlot(X = XProfile, trueDensity = density.centering, model = model.centering.OLS,
            factor.levels = paste("X1=", XProfile[, 1], ", X2=", XProfile[, 2], sep = ""),
            main = "Estimated and true conditional densities (with centering)")
We also compare the relative influence plots. For LinCDE boost without centering, most effort is spent on learning the influence of $X_1$ on the response's location. For LinCDE boost with centering, we stack the importances from centering (in red) and beyond centering (in blue): $X_1$ accounts for most of the importance in the centering, and $X_2$ is the most important covariate beyond centering. Currently, the decomposition of importances into centering and beyond centering is only available for `centeringMethod = "linearRegression"` or `centeringMethod = "randomForest"`.
importance.centering.no = summary(model.centering.no, cBars = 8, main = "Without centering")
importance.centering.OLS = summary(model.centering.OLS, cBars = 8, main = "With centering")
We read in the benchmark California Housing dataset, which collects California housing prices and $8$ covariates from the $1990$ Census. More details are available at the dataset's website. We remove winsorized responses and divide the responses by $1000$. For time efficiency, we randomly select $1000$ observations as the training data.
cahousing = read.csv("./data/cahousing.csv", header = TRUE)
y.cahousing = cahousing[, 1]
cahousing = cahousing[y.cahousing < max(y.cahousing), ]
y.cahousing = cahousing[, 1] / 1000
X = data.matrix(cahousing[, -1])
set.seed(318)
indexTrain = sample(length(y.cahousing), 1000, replace = FALSE)
# log-transform the positive, right-skewed covariates
X.transform = X
X.transform[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")] =
  log(X[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")])
We fit a simple LinCDE boost model with sufficient statistics $y$ and $y^2$ as a start. This LinCDE boost model captures location and variance information, and the resulting conditional densities are locally Gaussian.
model.cahousing.Gaussian = LinCDE.boost(y = y.cahousing[indexTrain], X = X[indexTrain, ],
                                        basis = "Gaussian", depth = 3, n.trees = 150,
                                        terminalSize = 50, shrinkage = 0.05,
                                        minY = 0, maxY = 510, verbose = FALSE)
Next, we use $10$ transformed natural cubic spline bases with $6$ degrees of freedom to learn a more flexible LinCDE boost model.
model.cahousing.ns = LinCDE.boost(y = y.cahousing[indexTrain], X = X[indexTrain, ],
                                  basis = "nsTransform", depth = 3, splineDf = 10, df = 6,
                                  n.trees = 300, terminalSize = 50, shrinkage = 0.05,
                                  minY = 0, maxY = 510, verbose = FALSE)
We provide the relative influence plots. The importances computed by the two LinCDE boost models are similar: MedInc (median income) is the dominant covariate followed by AveOccup (average occupancy), Longitude, and Latitude.
par(mar = c(4, 6, 4, 4))
importance.cahousing.Gaussian = summary(model.cahousing.Gaussian, cBars = 8,
                                        main = "Gaussian basis")
importance.cahousing.ns = summary(model.cahousing.ns, cBars = 8,
                                  main = "Natural spline basis")
We plot the conditional density estimates for different (MedInc, AveOccup) pairs. Since we don't know the underlying truth, we plot the histograms of the samples with similar covariate values as our reference. (The similarity of samples to a target point is measured by the covariate Mahalanobis distance weighted by the covariate importances from the LinCDE boost model.)
For both LinCDE boost models, the conditional densities shift to the right as the MedInc increases, and the spreads decrease as the AveOccup grows in the middle and right columns. For the LinCDE boost model with natural spline basis, the estimated conditional densities have diverse shapes beyond Gaussian. In local regions (indicated in the subtitles of panels) of $2.32$ ($20\%$ quantile) or $3.45$ ($50\%$ quantile) median incomes, the conditional distributions appear right-skewed. In regions of $3.45$ ($50\%$ quantile) median income and $2.36$ ($20\%$ quantile), $2.84$ ($50\%$ quantile) average occupancies, the conditional distributions have flatter peaks compared to the Gaussian distribution.
# landmarks for visualization
X.MedInc = quantile(X[, "MedInc"], c(.2, .5, .8))      # 20%, 50%, 80% quantiles of MedInc
X.AveOccup = quantile(X[, "AveOccup"], c(.2, .5, .8))  # 20%, 50%, 80% quantiles of AveOccup
XGrid = matrix(colMeans(X), byrow = TRUE, nrow = 9, ncol = ncol(X))
colnames(XGrid) = colnames(X)  # other covariates fixed at the sample means
XGrid[, c("MedInc", "AveOccup")] = data.matrix(expand.grid(X.MedInc, X.AveOccup))
yGrid = seq(0, 510, length.out = 100)  # response grid

# conditional density plot
# Gaussian basis
plot.cahousing.Gaussian = densityPlot(X = XGrid, yGrid = yGrid,
                                      model = model.cahousing.Gaussian, plot = FALSE)
# natural spline basis
plot.cahousing.ns = densityPlot(X = XGrid, yGrid = yGrid,
                                model = model.cahousing.ns, plot = FALSE)
# extract the observations in the neighborhoods of the selected landmarks
XGrid.transform = XGrid
XGrid.transform[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")] =
  log(XGrid[, c("MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup")])
importance.matrix = diag((mean(model.cahousing.ns$importanceScore) /
                            model.cahousing.ns$importanceScore)^3)
Sigma = diag((apply(X.transform, 2, quantile, 0.75) -
                apply(X.transform, 2, quantile, 0.25))^2)
Sigma = importance.matrix %*% Sigma %*% importance.matrix

# plot
n.sub = rep(c(300, 1000, 1000), 3)
par(mfrow = c(3, 3)); par(mar = c(4, 4, 6, 4))
for(i in c(7, 8, 9, 4, 5, 6, 1, 2, 3)){
  dist.matrix = mahalanobis(X.transform, center = XGrid.transform[i, ], cov = Sigma)
  selectIndex = order(dist.matrix, decreasing = FALSE)[1:n.sub[i]]
  hist(y.cahousing[selectIndex],
       main = paste("MedInc ", round(XGrid[i, "MedInc"], 2), ", ",
                    "AveOccup ", round(XGrid[i, "AveOccup"], 2), sep = ""),
       xlab = "Y", ylab = "density", cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5,
       freq = FALSE, breaks = seq(0, 510, length.out = 20),
       ylim = c(0, 0.015), xlim = c(0, max(y.cahousing)))
  lines(x = plot.cahousing.Gaussian$y[plot.cahousing.Gaussian$group == i],
        y = plot.cahousing.Gaussian$density[plot.cahousing.Gaussian$group == i],
        lwd = 3, col = "red")
}
mtext('Gaussian basis', side = 3, line = -2, outer = TRUE, cex = 1.5, font = 2)

for(i in c(7, 8, 9, 4, 5, 6, 1, 2, 3)){
  dist.matrix = mahalanobis(X.transform, center = XGrid.transform[i, ], cov = Sigma)
  selectIndex = order(dist.matrix, decreasing = FALSE)[1:n.sub[i]]
  hist(y.cahousing[selectIndex],
       main = paste("MedInc ", round(XGrid[i, "MedInc"], 2), ", ",
                    "AveOccup ", round(XGrid[i, "AveOccup"], 2), sep = ""),
       xlab = "Y", ylab = "density", cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5,
       freq = FALSE, breaks = seq(0, 510, length.out = 20),
       ylim = c(0, 0.015), xlim = c(0, 510))
  lines(x = plot.cahousing.ns$y[plot.cahousing.ns$group == i],
        y = plot.cahousing.ns$density[plot.cahousing.ns$group == i],
        lwd = 3, col = "red")
}
mtext('Natural spline basis', side = 3, line = -2, outer = TRUE, cex = 1.5, font = 2)