title: "FlexCoDE: the Flexible Conditional Density Estimator"
author: "Rafael Izbicki"
date: "r Sys.Date()
"
output:
html_document:
toc: true
number_sections: true
vignette: >
%\VignetteIndexEntry{FlexCoDE: the Flexible Conditional Density Estimator}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
```r knitr::opts_chunk$set(echo=TRUE, warning=TRUE, message=TRUE,cache=FALSE, fig.align="center")
# Introduction FlexCode is a flexible approach to conditional density estimation. We observed an idd sample $(\mathbf{x}_1,z_1),\ldots,(\mathbf{x}_n,z_n)$ and we wish to estimate $f(z|\mathbf{x})$, the density of $z \in \mathbb{R}$ given $\mathbf{x} \in \mathbb{R}^d$. The estimator is based on a basis expansion of the conditional density, where each of the expansion coefficients is estimated via a regression method. FlexCode is flexible in that it is open for one to use his favorite regression method. In particular, it adapts to several situations (e.g., sparsity, redundancy of covariate, mixed covariates types, functional covariates etc) if the right regression method is chosen. The FlexCode package currently implements several regression methods: * Sparse Additive Regression (SpAM) (_regressionFunction.SpAM_) * Spectral Series Regression (_regressionFunction.Series_) * Kernel Nearest Neighbors Regression (_regressionFunction.NN_) * Nadaraya-Watson Regression (_regressionFunction.NW_) * Random Forests (_regressionFunction.Forest_) The user may also implement his own regression function. The package also offers tools to combine (stack) various estimators and visualize them. # Fitting Conditional Density Estimators We illustrate the methodology using an artificial data set: for each sample, there are 10 observed covariates that are generated from a standard Gaussian distribution. The response is generated according to $$ Z_i = x_{i,1}+\epsilon_i,$$ with $\epsilon_i \sim N(0,0.1^2)$ iid, $i=1,\ldots,1000$, and $x_{i,1}$ is the first covariate. That is, the only covariate that affects the distribution of the response is the first one. ```r library(FlexCoDE) set.seed(400) # generate data n=1000 d=10 data=matrix(NA,n,d+1) data[,1:d]=matrix(rnorm(n*d),n,d) data[,d+1]=data[,1]+rnorm(n,0,0.1)
Now that we have our dataset, we can split it into training/validation and testing:
# determine sample sizes nTrain=round(0.7*n) nValidation=round(0.25*n) nTest=n-nTrain-nValidation # split data randomIndex=sample(1:n) xTrain=data[randomIndex[1:nTrain],1:d] xValidation=data[randomIndex[(nTrain+1):(nTrain+nValidation)],1:d] xTest=data[randomIndex[(nTrain+nValidation+1):n],1:d] zTrain=data[randomIndex[1:nTrain],d+1] zValidation=data[randomIndex[(nTrain+1):(nTrain+nValidation)],d+1] zTest=data[randomIndex[(nTrain+nValidation+1):n],d+1]
We can now fit our estimators. We start by fitting sparse additive models. This can be done via
#library(FlexCoDE) fit1=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest, regressionFunction = regressionFunction.SpAM,nIMax=50, verbose=FALSE)
It is not necessary to provide test data, however more information (e.g., an estimate of the risk) is provided if it is available. \verb|nIMax| is the maximum number of basis elements to be used. The function automatically tunes the model (e.g, chooses how many components to use and parameters associated to the regression functions). A summary of the fitted model can be found via
print(fit1)
Notice the print function also provides a measure of importante of each covariate for some regression methods. In the case of SpAM, it shows how many of the fitted regressions selected each of the covariates. In this case, it is possible to see that the first covariate is the most important one. In fact, it is the only one that indeed influences the distribution of the response.
We can also plot some examples of estimated densities on new (testing) data. If we use the argument |predictionBandProb|, a second plot with the highest predictive regions (HPG) is also shown:
plot(fit1,xTest,zTest,predictionBandProb=0.95)
You may change default arguments for the regression method by using the \verb|regressionFunction.extra| argument. To see which arguments you can choose, see the help function for the regression method you want to use eg, \verb|regressionFunction.SpAM|.
Finally, you can use the \verb|predict| function to estimate the density on new testing points, as well as to build predictive intervals for new observations. See the function \verb|fitFlexCoDE| for more examples.
Some regression estimators can make use of parallel computing. For example, you can fit sparse additive model using 4 cores via
fit1=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest, regressionFunction = regressionFunction.SpAM, regressionFunction.extra=list(nCores=4), verbose=FALSE)
This will typically take less time than using a naive calculation.
The package provides a nice way to compare conditional estimators estimated by FlexCode via different regression methods. This can be done by binding the various estimators. For instance, let's fit two additional estimators:
# random Forest fit2=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest, regressionFunction = regressionFunction.Forest, regressionFunction.extra=list(nCores=4),nIMax = 50, verbose=FALSE) # Nearest Neighbors estimator fit3=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest, regressionFunction = regressionFunction.NN, regressionFunction.extra=list(nCores=4), verbose=FALSE)
You can bind the three estimators using
fitBind=bindFlexCoDE(fit1,fit2,fit3)
You can now compare easily these estimators:
print(fitBind)
plot(fitBind,xTest,zTest,nPlots = 4)
After binding estimators, it is easy to combine all of them into a singles conditional density estimate. This can be done via the \verb|combineFlexCoDE| function, which looks for the best linear combination of the predictions:
combinedEstimator=combineFlexCoDE(fitBind,xValidation,zValidation,xTest,zTest) combinedEstimator$weights # best weights combinedEstimator$estimatedRisk
plot(combinedEstimator,xTest,zTest,nPlots = 6,predictionBandProb = 0.95) print(combinedEstimator)
You can design your own regression functions to estimate the conditional densities. In order to this, you need to write three functions:
regressionFunction.YourMethod: function that receives three arguments:
x: matrix of covariates used to train the model
predict.YourMethod: function that received three arguments:
object: the output of regressionFunction.YourMethod, the regression function you've designed (an object of the class YourMethod - the name you've assigned to the regressionFunction function)
maxTerms: how many regression functions should be predicted. That is, in regressionFunction.YourMethod, you've created say nIMax regression functions. In predict.YourMethod, you only need to predict the first maxTerms of these functions. The default of maxTerms should be NULL, and your function should make sure it is at most the number of regressions you've fitted via regressionFunction.YourMethod. The output of predict.YourMethod should be a matrix with the same number of rows as xNew and maxTerms column, where column j has the prediction of the j-th regression you've fitted in each of the samples from xNew. We recommend you see the code of some existing regressionFunction methods to see how this can be done, for instance, predict.NN
print.YourMethod: function that receives three arguments:
regressionObject: the output of regressionFunction.YourMethod, the regression function you've designed (an object of the class YourMethod - the name you've assigned to the regressionFunction function)
An alternative way to use FlexCode is by using Gram matrices as inputs instead of the raw covariates. This allows for faster implementations when computing such matrices is time consuming (e.g., this computation can be performed outside R). It also yields flexibility, because the user can input his own kernel.
Currently, the implemented kernelized
Here is one example. First, we simulate the data. In this case, we the domain of the covariate space is in a submanifold of $\mathbb{R}^{50}$, a 2 dimensional circle:
n=1000 t=runif(n,0,2*pi) d=50 vec=c(2/sqrt(d+3),1/sqrt(d+3)*rep(rep(c(-1,1),each=d/2)))[-(d+1)] X=cos(t)%*%t(c(rep(1/sqrt(d-1),d-1),0))+sin(t)%*%t(vec) z <- t+rnorm(n,0,0.5)
We not split the data into training/validation/testing:
nTrain=700 nValidation=150 nTest=n-nValidation-nTrain randomPerm=sample(1:n) zTrain=z[randomPerm[1:nTrain]] zValidation=z[randomPerm[(nTrain+1):(nTrain+nValidation)]] zTest=z[randomPerm[(nTrain+nValidation+1):n]] xTrain=X[randomPerm[1:nTrain],,drop=FALSE] xValidation=X[randomPerm[(nTrain+1):(nTrain+nValidation)],,drop=FALSE] xTest=X[randomPerm[(nTrain+nValidation+1):n],,drop=FALSE]
In order to fit the spectral series estimator using a kernel, we first compute the Gram Matrix. For the sake of illustration, we use a gaussian kernel with bandwidth $\epsilon=0.5$, $K(\vec{x}_i,\vec{x}_j)=e^{{-d^2(\vec{x}_i,\vec{x}_j)/\epsilon}}$
eps=0.5 # gramTrainTrain=exp((fields::rdist(xTrain,xTrain)^2)/eps) gramValidationTrain=exp((fields::rdist(xValidation,xTrain)^2)/eps) gramTestTrain=exp((fields::rdist(xTest,xTrain)^2)/eps) fit=fitFlexCoDEKernel(gramTrainTrain,zTrain,gramValidationTrain,zValidation,gramTestTrain,zTest,regressionFunction = regressionFunction.SeriesKernel)
The same ploting/printing/predicting functions can be used with FlexCoDEKernel:
plot(fit,gramTestTrain,zTest,predictionBandProb = 0.95)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.