title: "FlexCoDE: the Flexible Conditional Density Estimator" author: "Rafael Izbicki" date: "r Sys.Date()" output: html_document: toc: true number_sections: true vignette: > %\VignetteIndexEntry{FlexCoDE: the Flexible Conditional Density Estimator} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}


```r
knitr::opts_chunk$set(echo=TRUE, warning=TRUE, message=TRUE, cache=FALSE,
                      fig.align="center")
```

# Introduction

FlexCoDE is a flexible approach to conditional density estimation.
We observe an i.i.d. sample $(\mathbf{x}_1,z_1),\ldots,(\mathbf{x}_n,z_n)$
and we wish to estimate $f(z|\mathbf{x})$, the density of $z \in \mathbb{R}$
given $\mathbf{x} \in \mathbb{R}^d$.
The estimator is based on a basis expansion of the conditional density,
where each expansion coefficient is estimated via a regression method.
FlexCoDE is flexible in that any regression method can be plugged in. In particular, it adapts to several situations (e.g.,
sparsity, redundant covariates, mixed covariate types, functional covariates) if the right regression method is chosen.
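Concretely, given an orthonormal basis $\{\phi_i\}_{i \geq 1}$ for functions of $z$ (e.g., a Fourier basis), the conditional density is written as
$$f(z|\mathbf{x}) = \sum_{i} \beta_i(\mathbf{x})\,\phi_i(z),$$
and, by orthonormality, each coefficient $\beta_i(\mathbf{x}) = \mathbb{E}\left[\phi_i(Z)|\mathbf{x}\right]$ is a conditional expectation, that is, a regression function that can be estimated by any regression method.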

The FlexCoDE package currently implements several regression methods:

* Sparse Additive Regression (SpAM) (_regressionFunction.SpAM_)
* Spectral Series Regression (_regressionFunction.Series_)
* Kernel Nearest Neighbors Regression (_regressionFunction.NN_)
* Nadaraya-Watson Regression (_regressionFunction.NW_)
* Random Forests (_regressionFunction.Forest_)

The user may also implement their own regression function.

The package also offers tools to combine (stack) various estimators
and visualize them.

# Fitting Conditional Density Estimators

We illustrate the methodology using an artificial data set:
for each sample, there are 10 observed covariates that are generated
from a standard Gaussian distribution. The response is generated according
to
$$ Z_i =  x_{i,1}+\epsilon_i,$$
with $\epsilon_i \sim N(0,0.1^2)$ iid, $i=1,\ldots,1000$, and
$x_{i,1}$ is the first covariate. That is, the only covariate that affects
the distribution of the response is the first one. 


```r
library(FlexCoDE)
set.seed(400)
# generate data
n=1000
d=10
data=matrix(NA,n,d+1)
data[,1:d]=matrix(rnorm(n*d),n,d)
data[,d+1]=data[,1]+rnorm(n,0,0.1)

```

Now that we have our dataset, we can split it into training/validation and testing:

```r
# determine sample sizes
nTrain=round(0.7*n)
nValidation=round(0.25*n)
nTest=n-nTrain-nValidation

# split data
randomIndex=sample(1:n)
xTrain=data[randomIndex[1:nTrain],1:d]
xValidation=data[randomIndex[(nTrain+1):(nTrain+nValidation)],1:d]
xTest=data[randomIndex[(nTrain+nValidation+1):n],1:d]
zTrain=data[randomIndex[1:nTrain],d+1]
zValidation=data[randomIndex[(nTrain+1):(nTrain+nValidation)],d+1]
zTest=data[randomIndex[(nTrain+nValidation+1):n],d+1]

```

We can now fit our estimators. We start by fitting sparse additive models. This can be done via

```r
#library(FlexCoDE)
fit1=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
                 regressionFunction = regressionFunction.SpAM,nIMax=50,
                 verbose=FALSE)

```

It is not necessary to provide test data, but more information (e.g., an estimate of the risk) is provided when it is available. `nIMax` is the maximum number of basis elements to be used. The function automatically tunes the model (e.g., it chooses how many expansion components to use and the parameters associated with the regression functions). A summary of the fitted model can be printed via

```r
print(fit1)

```

Notice that the print function also provides a measure of importance of each covariate for some regression methods. In the case of SpAM, it shows how many of the fitted regressions selected each covariate. Here it is possible to see that the first covariate is the most important one; in fact, it is the only one that actually influences the distribution of the response.

We can also plot some examples of estimated densities on new (testing) data. If we use the argument `predictionBandProb`, a second plot with the highest predictive density (HPD) regions is also shown:

```r
plot(fit1,xTest,zTest,predictionBandProb=0.95)

```

You may change the default arguments of the regression method via the `regressionFunction.extra` argument. To see which arguments are available, check the help page of the regression method you want to use, e.g., `regressionFunction.SpAM`.

Finally, you can use the `predict` function to estimate the density at new testing points, as well as to build predictive intervals for new observations. See `fitFlexCoDE` for more examples.
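For instance, here is a minimal sketch of evaluating the estimated densities on a grid; the grid-size argument `B` and the returned fields `z` and `CDE` follow the package's documented interface, but check `?predict.FlexCoDE` to confirm:

```r
# evaluate the estimated densities on a grid of 500 z-values
predictedValues = predict(fit1, xTest, B = 500)
# predictedValues$z:   the grid of z-values
# predictedValues$CDE: matrix; row i is the estimated f(z | xTest[i, ]) on the grid
plot(predictedValues$z, predictedValues$CDE[1, ], type = "l",
     xlab = "z", ylab = "estimated density")
```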

# Parallel Computing

Some regression estimators can make use of parallel computing. For example, you can fit a sparse additive model using 4 cores via

```r
fit1=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
                 regressionFunction = regressionFunction.SpAM,
                 regressionFunction.extra=list(nCores=4),
                 verbose=FALSE)

```

This will typically take less time than a single-core computation.
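If you are unsure how many cores your machine offers, base R's `parallel` package can tell you (a small sketch; `nCores` is just a local variable name):

```r
# use all but one of the available cores, leaving one free for the system
nCores = max(1, parallel::detectCores() - 1)
```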

# Comparing Conditional Density Estimators

The package provides a convenient way to compare conditional density estimators fit by FlexCoDE via different regression methods. This can be done by binding the various estimators. For instance, let's fit two additional estimators:

```r
# random Forest
fit2=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
                 regressionFunction = regressionFunction.Forest,
                 regressionFunction.extra=list(nCores=4),nIMax = 50,
                 verbose=FALSE)
# Nearest Neighbors estimator
fit3=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
                 regressionFunction = regressionFunction.NN,
                 regressionFunction.extra=list(nCores=4),
                 verbose=FALSE)

```

You can bind the three estimators using

```r
fitBind=bindFlexCoDE(fit1,fit2,fit3)

```

You can now easily compare these estimators:

```r
print(fitBind)
plot(fitBind,xTest,zTest,nPlots = 4)

```

# Combining Conditional Density Estimators

After binding estimators, it is easy to combine them into a single conditional density estimate. This can be done via the `combineFlexCoDE` function, which looks for the best linear combination of the predictions:
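Concretely, with $K$ bound estimators $\hat{f}_1,\ldots,\hat{f}_K$, the combined estimator has the form
$$\hat{f}(z|\mathbf{x}) = \sum_{k=1}^{K} w_k\, \hat{f}_k(z|\mathbf{x}),$$
where the weights $w_k$ are chosen using the validation data so as to minimize an estimate of the risk: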
```r
combinedEstimator=combineFlexCoDE(fitBind,xValidation,zValidation,xTest,zTest)
combinedEstimator$weights # best weights
combinedEstimator$estimatedRisk
plot(combinedEstimator,xTest,zTest,nPlots = 6,predictionBandProb = 0.95)
print(combinedEstimator)

```

# Designing your own regression functions

You can design your own regression functions to estimate the conditional densities. In order to do this, you need to write three functions (a minimal skeleton is given after the list):

1. `regressionFunction.YourMethod`: a function that receives three arguments:
    * `x`: matrix of covariates used to train the model;
    * `responses`: matrix where each row corresponds to a row of `x` and each column corresponds to a different response (in practice, each column will be $\phi_i(z)$, but you do not need to worry about this);
    * `extra`: a list with extra parameters used to fit your regression function, over which you wish the user to have flexibility.

    Your function should output an object of the class `YourMethod` (the name you have assigned to the regression function). This object needs to contain enough information for you to be able to predict the value of each of the regression functions at new data points $\mathbf{x}$. We recommend reading the code of some existing methods to see how this can be done, for instance `regressionFunction.NN`.

2. `predict.YourMethod`: a function that receives three arguments:
    * `object`: the output of `regressionFunction.YourMethod` (an object of the class `YourMethod`);
    * `xNew`: matrix where each row is a new sample and each column is a covariate;
    * `maxTerms`: how many regression functions should be predicted. That is, `regressionFunction.YourMethod` fits, say, `nIMax` regression functions; `predict.YourMethod` only needs to predict the first `maxTerms` of them. The default of `maxTerms` should be `NULL`, and your function should make sure it is at most the number of regressions fitted by `regressionFunction.YourMethod`.

    The output of `predict.YourMethod` should be a matrix with the same number of rows as `xNew` and `maxTerms` columns, where column $j$ contains the predictions of the $j$-th fitted regression at each of the samples of `xNew`. See, for instance, `predict.NN`.

3. `print.YourMethod`: a function that receives three arguments:
    * `regressionObject`: the output of `regressionFunction.YourMethod` (an object of the class `YourMethod`);
    * `bestI`: a number that indicates how many basis functions (regressions) will be used;
    * `nameCovariates`: names of the covariates.

    Your function should then print your favorite statistics about the fitted regression functions. See, for instance, `print.NN` (which does not use `nameCovariates`).

# Kernelized version of the functions

An alternative way to use FlexCoDE is to provide Gram matrices as inputs instead of the raw covariates. This allows for faster implementations when computing such matrices is time consuming (e.g., the computation can be performed outside R). It also yields flexibility, because the user can input their own kernel.

Currently, the implemented kernelized regression methods include the spectral series estimator (`regressionFunction.SeriesKernel`), which is used in the example below.

Here is one example. First, we simulate the data. In this case, the covariates lie on a submanifold of $\mathbb{R}^{50}$: a circle embedded in a two-dimensional plane:

```r
n=1000
t=runif(n,0,2*pi)
d=50
vec=c(2/sqrt(d+3),1/sqrt(d+3)*rep(rep(c(-1,1),each=d/2)))[-(d+1)]
X=cos(t)%*%t(c(rep(1/sqrt(d-1),d-1),0))+sin(t)%*%t(vec)
z <- t+rnorm(n,0,0.5)

```

We now split the data into training/validation/testing:

```r
nTrain=700
nValidation=150
nTest=n-nValidation-nTrain
randomPerm=sample(1:n)


zTrain=z[randomPerm[1:nTrain]]
zValidation=z[randomPerm[(nTrain+1):(nTrain+nValidation)]]
zTest=z[randomPerm[(nTrain+nValidation+1):n]]
xTrain=X[randomPerm[1:nTrain],,drop=FALSE]
xValidation=X[randomPerm[(nTrain+1):(nTrain+nValidation)],,drop=FALSE]
xTest=X[randomPerm[(nTrain+nValidation+1):n],,drop=FALSE]

```

In order to fit the spectral series estimator using a kernel, we first compute the Gram matrices. For the sake of illustration, we use a Gaussian kernel with bandwidth $\epsilon=0.5$, $K(\vec{x}_i,\vec{x}_j)=e^{-d^2(\vec{x}_i,\vec{x}_j)/\epsilon}$:

```r
eps=0.5 # kernel bandwidth
gramTrainTrain=exp(-(fields::rdist(xTrain,xTrain)^2)/eps)
gramValidationTrain=exp(-(fields::rdist(xValidation,xTrain)^2)/eps)
gramTestTrain=exp(-(fields::rdist(xTest,xTrain)^2)/eps)

fit=fitFlexCoDEKernel(gramTrainTrain,zTrain,gramValidationTrain,zValidation,gramTestTrain,zTest,regressionFunction = regressionFunction.SeriesKernel)

```

The same plotting/printing/predicting functions can be used with FlexCoDEKernel objects:

```r
plot(fit,gramTestTrain,zTest,predictionBandProb = 0.95)
```
