
\newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\tx}[1]{\mathrm{#1}} \newcommand{\rv}[3]{#2_{#1},\ldots,#2_{#3}} \newcommand{\XX}{{\bm X}} \newcommand{\xx}{{\bm x}} \newcommand{\zz}{{\bm z}} \newcommand{\ZZ}{{\bm Z}} \newcommand{\uu}{{\bm u}} \newcommand{\RR}{{\bm R}} \newcommand{\II}{{\bm I}} \newcommand{\ud}{\,\mathrm{d}} \newcommand{\iid}{\stackrel{\mathrm{iid}}{\sim}} \newcommand{\argmin}{\mathrm{argmin}} \newcommand{\N}{{\mathcal N}} \renewcommand{\|}{\,|\,}

## Introduction

The GaussCop package provides a simple, unified framework for equipping arbitrary continuous one-dimensional distributions with the basic d/p/q/r functions (density, CDF, quantile, and random sampling). This can be useful for working with various Copula models, including the Gaussian Copula in the [example below](#gCop).

## xDensity Objects

Let $X$ denote an unbounded, continuous random variable with PDF $X \sim f(x)$. An xDensity object essentially consists of the following elements:

  1. The values of $f(x)$ recorded on a grid of equally-spaced points covering a finite range `xrng` containing most of the probability mass.
  2. The mean and standard deviation of a Normal distribution used to extend $f(x)$ beyond the grid endpoints, obtained by the endpoint-matching procedure described in [Appendix A](#appendix).

Based on this representation, the d/p/q/r functions are implemented roughly as follows: dXD() evaluates the gridded density inside `xrng` and the endpoint-matched Normal density in the tails; pXD() accumulates the gridded density into a CDF (again with Normal tails); qXD() inverts this CDF; and rXD() draws samples by inverse-CDF sampling, i.e., by applying qXD() to stats::runif().

## Basic Usage

xDensity objects can be constructed either from iid samples $\rv 1 X n \iid f(x)$ or from a functional representation of $f(x)$ itself. For example, consider the following xDensity representation based on a random sample from the log-Exponential distribution, i.e., the distribution of the log of an $\tx{Expo}(1)$ random variable: $$ e^X \sim \tx{Expo}(1) \qquad \iff \qquad f(x) = \exp(-e^x + x). $$

```r
require(GaussCop) # load package

# simulate data
nsamples <- 1e5
X <- log(rexp(nsamples)) # iid samples from the log-Exponential

xDens <- kernelXD(X) # basic xDensity constructor
names(xDens)
```

The elements of xDens are essentially those described above: the grid range `xrng`, the density values recorded on the grid, and the mean and standard deviation of the endpoint-matched Normal tails.

```r
# check that xDensity d/p/q/r functions match the true distribution,
# by plotting true PDF/CDF vs xDensity representation.
# xDens: xDensity object.
# dtrue/ptrue: single-variable function of the true pdf/cdf.
# dname: name of true distribution (for legend)
# dlpos, plpos: legend position for pdf/cdf plot.
xdens_test <- function(xDens, dtrue, ptrue, dname, dlpos, plpos) {
  op <- par(no.readonly = TRUE)
  par(mfrow = c(1,2), mar = c(4,4,2,.5)+.1)
  # pdf and sampling check
  nsamples <- 1e5
  Xsim <- rXD(nsamples, xDens = xDens) # xDensity: random sampling
  # extend xDens range to show endpoint matching
  xlim <- xDens$xrng
  xlim <- (xlim - mean(xlim)) * 1.10 + mean(xlim)
  hist(Xsim, breaks = 100, freq = FALSE, # xDensity random sample
       xlab = "x", main = "PDF", xlim = xlim)
  curve(dXD(x, xDens = xDens), add = TRUE, col = "red") # xDensity PDF
  curve(dtrue, add = TRUE, col = "blue") # true PDF
  abline(v = xDens$xrng, lty = 2) # grid endpoints
  legend(dlpos, parse(text = c(dname, "dXD", "rXD", "xrng")),
         pch = c(22,22,22,NA), pt.cex = 1.5,
         pt.bg = c("blue", "red", "white", "black"),
         lty = c(NA, NA, NA, 2))
  # cdf check
  curve(pXD(x, xDens), # xDensity CDF
        from = xlim[1], to = xlim[2], col = "red",
        xlab = "x", main = "CDF", ylab = "Cumulative Probability")
  curve(ptrue, add = TRUE, col = "blue") # true CDF
  abline(v = xDens$xrng, lty = 2) # grid endpoints
  legend(plpos, parse(text = c(dname, "pXD", "xrng")),
         pch = c(22,22,NA), pt.cex = 1.5,
         pt.bg = c("blue", "red", "black"),
         lty = c(NA, NA, 2))
  invisible(par(op))
}
```

The d/p/q/r functions for xDensity objects are invoked as in the following examples:

```r
dXD(x, xDens = xDens, log = TRUE) # log-PDF
pXD(x, xDens = xDens, lower.tail = FALSE) # survival function (1-CDF)
qXD(p = .5, xDens = xDens) # median
rXD(n, xDens = xDens) # random sample
```
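
As a quick sanity check on the log-Exponential fit above: since $e^X \sim \tx{Expo}(1)$ has median $\log 2$, the true median of $X$ is $\log(\log 2) \approx -0.367$, which qXD() should approximately recover:

```r
qXD(p = .5, xDens = xDens) # xDensity median estimate
log(log(2)) # true median of the log-Exponential
```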

To assess the quality of the xDensity approximation to the log-Exponential distribution $e^X \sim \tx{Expo}(1)$, consider the following checks:

  1. That the histogram of an rXD() sample matches dXD() and the true PDF (recall that rXD() applies qXD() to stats::runif()).
  2. That pXD() matches the true CDF.

These graphical checks are implemented by the function xdens_test() defined above (see also [Appendix B](#apptest)), as illustrated below:

```r
dlexp <- function(x) exp(-exp(x) + x) # true PDF
plexp <- function(x) pexp(exp(x)) # true CDF

xdens_test(xDens, # xDensity approximation
           dtrue = dlexp, # true PDF
           ptrue = plexp, # true CDF
           dname = expression(e^X%~%"Expo"*(1)), # name of true distribution (for legend)
           dlpos = "topleft", plpos = "topleft") # legend positions for pdf/cdf plots
```

## Constructors for xDensity Objects

The GaussCop package currently implements three methods of constructing xDensity objects. The first two take as input a random sample $\rv 1 X n \iid f(x)$; the third is based on a functional approximation $\hat f(x)$, e.g., one resulting from a parametric model $\hat f(x) = f(x \mid \hat {\bm \theta})$, where $\hat{\bm \theta}$ is the parameter MLE.
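
In code, the three constructors described below are invoked as follows (each returns an xDensity object; the calls are abbreviated versions of the examples in the following subsections):

```r
xDens <- kernelXD(X)            # from an iid sample, via kernel density estimation
xDens <- gc4XD(X)               # from an iid sample, via Gram-Charlier expansion
xDens <- matrixXD(cbind(x, y))  # from density values y on an equally spaced grid x
```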

### Kernel Density Estimation

The kernelXD() function uses stats::density() to compute a Gaussian kernel density estimate from a random sample from $f(x)$. It is the most flexible constructor, but it also requires the most data points for stable estimation. Below, its output is compared to the true distribution for $M$ iid samples from a $\chi^2_{(4)}$ distribution, with $M = 1000$ and $M = 10000$ data points.

```r
# true distribution: chi^2_(4)
df_true <- 4
dtrue <- function(x) dchisq(x, df = df_true)
ptrue <- function(x) pchisq(x, df = df_true)
rtrue <- function(n) rchisq(n, df = df_true)
dname <- paste0("X%~%chi[(",df_true,")]^2")

# simulate data
nsamples <- 1e3
X <- rtrue(nsamples)
xDens <- kernelXD(X) # xDensity constructor

par(oma = c(0,2,0,0)) # put sample size in outer margin
xdens_test(xDens, dtrue = dtrue, ptrue = ptrue,
           dname = dname,
           dlpos = "topright", plpos = "bottomright")
mtext(text = paste0("M = ", nsamples),
      side = 2, line = .5, outer = TRUE)

# increase sample size
nsamples <- 1e4
X <- rtrue(nsamples)
xDens <- kernelXD(X)
xdens_test(xDens, dtrue = dtrue, ptrue = ptrue,
           dname = dname,
           dlpos = "topright", plpos = "bottomright")
mtext(text = paste0("M = ", nsamples),
      side = 2, line = .5, outer = TRUE)
par(oma = c(0,0,0,0)) # remove outer margin
```

### Gram-Charlier Expansion

The gc4XD() constructor implements a moment-based density approximation, which works best for unimodal distributions that are not too far from normal. The estimator proceeds in two steps:

  1. The data $\rv 1 X n \iid f(x)$ are first normalized by applying a Box-Cox transformation to each observation $X_i$, $$Z_i = \tx{BoxCox}(X_i \| \hat \alpha, \hat \lambda), \qquad \tx{BoxCox}(x \| \alpha, \lambda) = \begin{cases} \dfrac{(x + \alpha)^\lambda - 1}{\lambda C^{\lambda-1}} & \lambda \neq 0 \\ C \log(x + \alpha) & \lambda = 0, \end{cases} $$ where $(\hat \alpha, \hat \lambda)$ are obtained by maximum likelihood.
  2. The transformed data $\rv 1 Z n$ are then fit with a 4th order Gram-Charlier expansion [e.g., @draper.tierney72]. That is, let $\hat \mu$, $\hat \sigma$, $\hat \kappa_3$, and $\hat \kappa_4$ denote the sample-based estimates of the first four cumulants of $Z = \tx{BoxCox}(X \| \hat \alpha, \hat \lambda)$. The 4th order Gram-Charlier approximation to the true PDF $f(z)$ is $$ \hat f(z) = \frac{1}{\sqrt{2\pi}\hat \sigma} \exp\left[-\frac{(z-\hat \mu)^2}{2\hat\sigma^2}\right] \times \left[1 + \frac{\hat \kappa_3}{3!\hat \sigma^3} H_3\left(\frac{z-\hat\mu}{\hat\sigma}\right) + \frac{\hat \kappa_4}{4!\hat\sigma^4} H_4\left(\frac{z-\hat\mu}{\hat\sigma}\right)\right], $$ where $H_3(z) = z^3 - 3z$ and $H_4(z) = z^4 - 6z^2 + 3$ are the 3rd and 4th Hermite polynomials. The approximation $\hat f(z)$ is not guaranteed to be positive, but generally is for reasonably unimodal distributions, in which case $\hat f(z)$ is a (normalized) PDF with mean, variance, skewness and kurtosis exactly matching those of the transformed data $\rv 1 Z n$. A direct R transcription of $\hat f(z)$ is sketched after this list.

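As a concrete illustration of the formula in step 2, the following is a minimal R transcription of $\hat f(z)$ (a hypothetical helper for exposition, not the gc4XD() internals):

```r
# fourth-order Gram-Charlier density, given cumulant estimates
# mu, sigma, k3, k4 of the (Box-Cox transformed) data
dgc4 <- function(z, mu, sigma, k3, k4) {
  s <- (z - mu)/sigma
  H3 <- s^3 - 3*s        # 3rd Hermite polynomial
  H4 <- s^4 - 6*s^2 + 3  # 4th Hermite polynomial
  dnorm(z, mean = mu, sd = sigma) *
    (1 + k3/(factorial(3) * sigma^3) * H3 + k4/(factorial(4) * sigma^4) * H4)
}
```

With $\hat\kappa_3 = \hat\kappa_4 = 0$, dgc4() reduces to the $\N(\hat\mu, \hat\sigma^2)$ density.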
The gc4XD() constructor estimates the PDF of the data $\rv 1 X n$ on the original scale by inverting steps 1 and 2 above. The example below illustrates the quality of the approximation for $M = 1000$ observations of $X \sim \chi^2_{(4)}$.

set.seed(1) # reproducible results
nsamples <- 1e3
X <- rtrue(nsamples)
xDens <- gc4XD(X) # xDensity constructor
xdens_test(xDens, dtrue = dtrue, ptrue = ptrue,
           dname = dname,
           dlpos = "topright", plpos = "bottomright")

### Known Density Values

The matrixXD() constructor is used to convert functional density representations into the standardized xDensity format. Its input is a 2-column matrix with equally-spaced $x$-grid values in the first column and the corresponding density values in the second. Below is an illustration using the Student-$t$ distribution with $\nu = 2$ and $\nu = 1$ degrees of freedom (the latter being the Cauchy distribution). For an extremely heavy-tailed distribution such as the Cauchy, the Normal tail approximation used by xDensity clearly underestimates the tail probabilities.

```r
# true distributions: t_(2) and t_(1) (Cauchy)
dtrue <- list(t2 = function(x) dt(x, df = 2),
              Cauchy = function(x) dt(x, df = 1))
ptrue <- list(t2 = function(x) pt(x, df = 2),
              Cauchy = function(x) pt(x, df = 1))
dname <- expression(X%~%t[(2)], X%~%"Cauchy")

# matrix density representation
xgrid <- seq(from = -15, to = 15, len = 500) # equally spaced grid
for(ii in 1:2) {
  ypdf <- dtrue[[ii]](xgrid) # true density values on the grid
  xDens <- matrixXD(cbind(X = xgrid, Y = ypdf)) # xDensity constructor
  xdens_test(xDens, dtrue = dtrue[[ii]], ptrue = ptrue[[ii]],
             dname = dname[ii],
             dlpos = "topright", plpos = "bottomright")
}
```

## Application: The Gaussian Copula Distribution {#gCop}

Consider a $d$-dimensional random variable $\XX = (\rv 1 X d)$ with unknown PDF $\XX \sim f(\xx)$.  A popular approach to statistical dependence modeling is to first transform $\XX$ to have Uniform margins.  That is, if $F_i(x_i)$ is the CDF of $X_i$, then
$$
U_i = F_i(X_i) \sim \tx{Unif}(0,1),
$$
such that the joint PDF of $\XX$ can be written as
$$
f(\xx) = c(\uu) \times \prod_{i=1}^d f_i(x_i), \qquad u_i = F_i(x_i).
$$
Thus, $c(\uu)$ is the joint PDF of marginally Uniform random variables, $U_i \sim \tx{Unif}(0,1)$, and is known as a [Copula distribution](https://en.wikipedia.org/wiki/Copula_(probability_theory)).  Numerous R packages provide various tools for fitting and sampling from Copula distributions, e.g., [**copula**](https://CRAN.R-project.org/package=copula) [@hofert.etal17],  [**VineCopula**](https://CRAN.R-project.org/package=VineCopula) [@schepsmeier.etal17], and many others listed in the [**Distributions**](https://CRAN.R-project.org/view=Distributions) Task View.  The `xDensity` methods for modeling the marginal distributions $f_i(x_i)$ can easily be combined with the Copula dependence models $c(\uu)$ provided by these packages.  **GaussCop** provides a simple and specific interface to the so-called Gaussian Copula distribution, which approximates the true $c(\uu)$ as follows:

1. Let $\varphi(z)$ and $\Phi(z)$ denote the PDF and CDF of the standard normal $\N(0,1)$.  Now, consider the change-of-variables $Z_i = \Phi^{-1}(U_i)$, for which the joint PDF of $\ZZ = (\rv 1 Z d)$ is
$$
g(\zz) = c(\uu) \times \prod_{i=1}^d \varphi(z_i), \qquad u_i = \Phi(z_i).
$$
2.  Since marginally $Z_i \sim \N(0,1)$, the Gaussian Copula approximates the true distribution $g(\zz)$ by the best-matching multivariate normal, in the sense of minimizing the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence),
$$
\hat g(\zz) = \argmin_{\RR} \int \log \left[\frac{g(\zz)}{\varphi(\zz \mid \RR)}\right] \cdot g(\zz) \ud \zz,
$$
where $\varphi(\zz \mid \RR)$ is the joint PDF of a multivariate normal $\N(0, \RR)$ with correlation matrix $\RR$ (i.e., $\RR_{ii} = 1$).  A minimal code sketch of this fit is given after this list.
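
To make step 2 concrete, here is a minimal sketch of the fit under the definitions above (an illustration of the idea only; the package's actual interface is gcopFit(), used below, and the helper name and its assumption that the margins are supplied as CDF functions are mine):

```r
# moment-matched correlation matrix of the normal scores:
# X is an n x d matrix of iid samples, Fhat a list of d marginal CDFs
gcop_sketch <- function(X, Fhat) {
  d <- ncol(X)
  # normal scores Z_i = Phi^{-1}(F_i(X_i)), marginally ~ N(0,1)
  Z <- sapply(1:d, function(ii) qnorm(Fhat[[ii]](X[, ii])))
  cor(Z) # estimate of the correlation matrix R
}
```

Since the margins of $\ZZ$ are exactly $\N(0,1)$, matching second moments in this way corresponds to the Kullback-Leibler minimization above.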

To evaluate the accuracy of the Gaussian Copula approximation, consider applying it to the $p(p+1)/2$-dimensional Cholesky decomposition of a "unit" $p\times p$ [Wishart distribution](https://en.wikipedia.org/wiki/Wishart_distribution),
$$
\XX_{p\times p} \sim \tx{Wishart}(\II, p).
$$
That is, $\XX = \ZZ'\ZZ$, where $\ZZ$ is a $p\times p$ matrix of iid $\N(0,1)$ random variables.

By construction, the Gaussian Copula perfectly captures the marginal distributions of $\tx{chol}(\XX)$.  Therefore, consider instead the error in the marginal distributions of the eigenvalues of $\XX$ induced by approximating the copula dependence between the Cholesky variables as multivariate normal.  In the example below with $p = 4$, the error appears to be relatively small:
```r
# simulate data from unit Wishart distribution
p <- 4 # size of matrix
nsamples <- 1e4
X <- replicate(nsamples, crossprod(matrix(rnorm(p^2),p,p)))

# Cholesky decomposition
Xchol <- apply(X, 3, function(V) chol(V)[upper.tri(V,diag = TRUE)]) 
Xchol <- t(Xchol)

# fit Gaussian Copula approximation
gCop <- gcopFit(X = Xchol, # iid sample from multivariate distribution
                fitXD = "kernel") # xDensity approximation to each margin

# Compare approximation to true distribution of eigenvalues of X

# extract sorted eigenvalues of a positive definite matrix
eig_val <- function(V) {
  sort(eigen(V, symmetric = TRUE)$values, decreasing = TRUE)
}

Xeig <- t(apply(X, 3, eig_val)) # true distribution
Xeig2 <- rgcop(nsamples, gCop = gCop) # Gaussian Copula approximation
Xeig2 <- t(apply(Xeig2, 1, function(U) {
  # recover full matrix
  V <- matrix(0,p,p)
  V[upper.tri(V,diag=TRUE)] <- U
  V <- crossprod(V)
  eig_val(V)
}))

# plot true and approximate eigenvalue distributions
main <- parse(text = paste0("lambda[", 1:p, "]"))
par(mfrow = c(1,p), mar = c(2.5,2.5,2,.1)+.1, oma = c(0,2,0,0))
for(ii in 1:p) {
  dtrue <- density(Xeig[,ii])
  dapprox <- density(Xeig2[,ii])
  xlim <- range(dtrue$x, dapprox$x)
  ylim <- range(dtrue$y, dapprox$y)
  plot(dtrue$x, dtrue$y, type = "l", xlim = xlim, ylim = ylim,
       main = main[ii], xlab = "", ylab = "", col = "blue")
  lines(dapprox$x, dapprox$y, col = "red")
}
mtext(text = "Density",
      side = 2, line = .5, outer = TRUE)
legend("topright", legend = c("True PDF", "gCop Approx."),
       fill = c("blue", "red"))
par(oma = c(0,0,0,0)) # remove outer margin
```

## Appendix A: Endpoint Matching {#appendix}

We wish to find the parameters $(\mu, \sigma)$ of a Normal distribution whose PDF passes through the points $(x_i, y_i)$, $i = 1,2$. That is, we wish to find $(\mu, \sigma)$ satisfying
$$
\log y_i = -\frac 1 2 \left[\frac{(x_i - \mu)^2}{\sigma^2} + \log \sigma^2 + \log 2\pi \right], \qquad i = 1,2.
$$
Subtracting one equation from the other, we obtain
$$
\log(y_1/y_2) = \frac{(x_2 - \mu)^2}{2\sigma^2} - \frac{(x_1 - \mu)^2}{2\sigma^2},
$$
such that for any value of $\sigma$, the required value of $\mu$ is
$$
\mu(\sigma) = \sigma^2\frac{\log(y_1/y_2)}{x_1 - x_2} + \frac{x_1 + x_2}{2} = b \sigma^2 + a,
$$
where $a = (x_1 + x_2)/2$ and $b = \log(y_1/y_2)/(x_1 - x_2)$. Substituting this back into one of the two original equations leads to the root-finding problem
$$
g(\tau) = \frac{A^2}{\tau} + B^2 \tau + \log \tau - C = 0,
$$
where $\tau = \sigma^2$ and
$$
A = x_i - a, \qquad B = b, \qquad C = 2[b(x_i-a) - \log y_i] - \log 2\pi.
$$
A solution to this equation is not guaranteed to exist. For instance, let $x_1 = -1$, $x_2 = 1$, and $y_1 = y_2 = 1$. A Normal PDF passing through these points would have density of at least $1$ everywhere on $(-1, 1)$, and hence total probability greater than $2$: a contradiction. However, in our case the $y_i$ will typically be very small, in which case a root is likely to exist. Even then, the equation may have two roots: one with $\mu$ between $x_1$ and $x_2$, and one with $\mu$ (way) off to one side. We focus here on finding the former.

The root we seek can be calculated numerically with the built-in R function stats::uniroot(), which requires specification of an interval containing the root. To obtain this interval, first note that $g'(\tau) = -\frac{A^2}{\tau^2} + B^2 + \frac{1}{\tau}$, and that the only positive solution of $g'(\tau) = 0$ is $$ \tau_L = \frac{-1 + \sqrt{1 + 4(AB)^2}}{2B^2}. $$ Since $g(\tau) \to \infty$ as $\tau \to 0$ and as $\tau \to \infty$, we conclude that $g(\tau)$ attains its minimum over $\tau \in (0,\infty)$ at $\tau_L$. Let us now find a value $\tau_U < \tau_L$ at which $g(\tau_U) > 0$; searching above $\tau_L$ instead would tend to produce the root with $\mu$ outside of $(x_1, x_2)$.

To find such a lower bound for the root, note that for $0 < \tau < A^2/2$, we have $$ \log \tau > -\frac{A^2/2}{\tau} + D, \qquad D = 1 + \log(A^2/2). $$ This is because the RHS has a larger derivative than the LHS for $\tau < A^2/2$, and thus falls to $-\infty$ faster as $\tau \to 0$; the value of $D$ is chosen such that the two sides are equal at $\tau = A^2/2$. Substituting this inequality into the root-finding equation, we obtain $$ g(\tau) > \frac{A^2}{\tau} + \log \tau - C > \frac{A^2/2}{\tau} + D - C, $$ such that $g(\tau) > 0$ whenever $\tau < \frac{A^2}{2(C-D)}$. A minimal R sketch of the resulting scheme is given below.
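
Putting the pieces together, the endpoint-matching computation might be sketched as follows (a hypothetical helper for illustration, not the package's internal code):

```r
# find (mu, sigma) such that dnorm(x, mu, sigma) passes through
# (x1, y1) and (x2, y2); assumes y1 != y2 and that a root exists
endpoint_match <- function(x1, y1, x2, y2) {
  a <- (x1 + x2)/2               # mu(sigma) = b * sigma^2 + a
  b <- log(y1/y2)/(x1 - x2)
  A <- x1 - a
  B <- b
  C <- 2*(b*(x1 - a) - log(y1)) - log(2*pi)
  g <- function(tau) A^2/tau + B^2*tau + log(tau) - C
  tauL <- (-1 + sqrt(1 + 4*(A*B)^2))/(2*B^2) # minimum of g on (0, Inf)
  D <- 1 + log(A^2/2)
  tauU <- A^2/(2*(C - D))                    # g(tau) > 0 for tau < tauU
  tau <- uniroot(g, interval = c(tauU, tauL))$root # tau = sigma^2
  c(mu = b*tau + a, sigma = sqrt(tau))
}

endpoint_match(-3, dnorm(-3), 2, dnorm(2)) # should recover mu = 0, sigma = 1
```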

Note that when $y_1 = y_2$, the endpoint-matching method breaks down; in this case, the mean and standard deviation of the approximated density are used for the Normally distributed tails.

## Appendix B: Code for Test Function xdens_test {#apptest}

The definition of xdens_test() is listed in the Basic Usage section above, where the function is first used.

## References


