# R code to render vignette rmarkdown::render("xDensity-quicktut.Rmd")
\newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\tx}[1]{\mathrm{#1}} \newcommand{\rv}[3]{#2_{#1},\ldots,#2_{#3}} \newcommand{\XX}{{\bm X}} \newcommand{\xx}{{\bm x}} \newcommand{\zz}{{\bm z}} \newcommand{\ZZ}{{\bm Z}} \newcommand{\uu}{{\bm u}} \newcommand{\RR}{{\bm R}} \newcommand{\II}{{\bm I}} \newcommand{\ud}{\,\mathrm{d}} \newcommand{\iid}{\stackrel{\mathrm{iid}}{\sim}} \newcommand{\argmin}{\mathrm{argmin}} \newcommand{\N}{{\mathcal N}} \renewcommand{\|}{\,|\,}
The GaussCop package proposes a simple, unified framework to provide the basic d/p/q/r
functions for arbitrary continuous one-dimensional distributions. This can be useful for working with various Copula models (including the Gaussian Copula in the example below).
xDensity
ObjectsLet $X$ denote an unbounded, continuous random variable with PDF $X \sim f(x)$. An xDensity
object essentially consists of the following elements:
Based on this representation, the d/p/q/r
functions are implemented as follows:
d
): Step function inside the grid, Normal outside the grid.p
): Piecewise linear inside the grid, Normal outside the grid.q
): Exact numerical inversion of the CDF.r
): Inverse-CDF method. That is, if $F(x)$ is a CDF and $U \sim \tx{Unif}(0,1)$, then $X = F^{-1}(U)$ has CDF $F(x)$.xDensity
objects can be constructed either from iid samples $\rv 1 X n \iid f(x)$ or from a functional representation of $f(x)$ itself. For example, consider the following xDensity
representation based on a random sample from the log-Exponential distribution,
$$
\log(X) \sim \tx{Expo}(1) \qquad \iff \qquad f(x) = \exp(-e^x + x).
$$
require(GaussCop) # load package # simulate data nsamples <- 1e5 X <- log(rexp(nsamples)) # iid samples from the log-Exponential xDens <- kernelXD(X) # basic xDensity constructor names(xDens)
The elements of xDens
are:
ndens
, xrng
: a compact representation of the $x$-value grid, i.e., the number of grid points and the range of the grid.ypdf
, ylpdf
, ycdf
: the PDF, log-PDF, and CDF evaluated at the ndens
grid points. While the latter two elements can be reconstructed from the first, they are stored here in order to speed up subsequent calculations.mean
, sd
: The mean and variance of the Normal distribution used outside the range of the grid. By default, these values are chosen by matching the Normal PDF to the ypdf
values at then endpoints of the grid; details of the calculation can be found in the Appendix .
# check that xDensity d/p/q/r functions match the true distribution, # by plotting true PDF/CDF vs xDensity representation. # xDens: xDensity object. # dtrue/ptrue: single-variable function of the true pdf/cdf. # dname: name of true distribution (for legend) # dlpos, plpos: legend position for pdf/cdf plot. xdens_test <- function(xDens, dtrue, ptrue, dname, dlpos, plpos) { op <- par(no.readonly = TRUE) par(mfrow = c(1,2), mar = c(4,4,2,.5)+.1) # pdf and sampling check nsamples <- 1e5 Xsim <- rXD(nsamples, xDens = xDens) # xDensity: random sampling # extend xDens range to show endpoint matching xlim <- xDens$xrng xlim <- (xlim - mean(xlim)) * 1.10 + mean(xlim) hist(Xsim, breaks = 100, freq = FALSE, # xDensity random sample xlab = "x", main = "PDF", xlim = xlim) curve(dXD(x, xDens = xDens), add = TRUE, col = "red") # xDensity PDF curve(dtrue, add = TRUE, col = "blue") # true PDF abline(v = xDens$xrng, lty = 2) # grid endpoints legend(dlpos, parse(text = c(dname, "dXD", "rXD", "xrng")), pch = c(22,22,22,NA), pt.cex = 1.5, pt.bg = c("blue", "red", "white", "black"), lty = c(NA, NA, NA, 2)) # cdf check curve(pXD(x, xDens), # xDensity CDF from = xlim[1], to = xlim[2], col = "red", xlab = "x", main = "CDF", ylab = "Cumulative Probability") curve(ptrue, add = TRUE, col = "blue") # true CDF abline(v = xDens$xrng, lty = 2) # grid endpoints legend(plpos, parse(text = c(dname, "pXD", "xrng")), pch = c(22,22,NA), pt.cex = 1.5, pt.bg = c("blue", "red", "black"), lty = c(NA, NA, 2)) invisible(par(op)) }
The d/p/q/r
functions for xDensity
objects are invoked by e.g.:
dXD(x, xDens = xDens, log = TRUE) # log-PDF pXD(x, xDens = xDens, lower.tail = FALSE) # survival function (1-CDF) qXD(p = .5, xDens = xDens) # median rXD(n, xDens = xDens) # random sample
To assess the quality of the xDensity
approximation to the log-Exponential distribution $\log(X) \sim \tx{Expo}(1)$, consider the following checks:
rXD()
sample matches dXD()
and the true PDF (recall that rXD()
calls qXD()
on stats::runif()
).pXD()
matches the true CDF.These graphical checks are implemented by the function xdens_test
(R code in Appendix) as illustrated below:
# check that xDensity d/p/q/r functions match the true distribution, # by plotting true PDF/CDF vs xDensity representation. dlexp <- function(x) exp(-exp(x) + x) # true PDF plexp <- function(x) pexp(exp(x)) # true CDF xdens_test(xDens, # xDensity approximation dtrue = dlexp, # true PDF ptrue = plexp, # true CDF dname = expression(log(X)%~%"Expo"*(1)), # name of true distribution (for legend) dlpos = "topleft", plpos = "topleft") # legend position for pdf/cdf plot
xDensity
ObjectsThe GaussCop package currently implements three methods of constructing xDensity
objects. Two take as input a random sample $\rv 1 X n \sim f(x)$. The third is based on a functional approximation $\hat f(x)$, e.g., as resulting from a parametric model $\hat f(x) = f(x \mid \hat {\bm \theta})$, where $\hat{\bm \theta}$ is the parameter MLE.
The kernelXD()
function uses the stats::density()
function to fit a Gaussian kernel to a random sample from $f(x)$. It is the most flexible constructor; however, it also requires the most datapoints for stable estimation. Below is a comparison with $M$ iid samples from a $\chi^2_{(4)}$ distribution, using $M = 1000$ and $M = 10000$ data points.
# true distribution: chi^2_(4) df_true <- 4 dtrue <- function(x) dchisq(x, df = df_true) ptrue <- function(x) pchisq(x, df = df_true) rtrue <- function(n) rchisq(n, df = df_true) dname <- paste0("X%~%chi[(",df_true,")]^2") # simulate data nsamples <- 1e3 X <- rtrue(nsamples) xDens <- kernelXD(X) # xDensity constructor par(oma = c(0,2,0,0)) # put sample size in outer margin xdens_test(xDens, dtrue = dtrue, ptrue = ptrue, dname = dname, dlpos = "topright", plpos = "bottomright") mtext(text = paste0("M = ", nsamples), side = 2, line = .5, outer = TRUE) # increase sample size nsamples <- 1e4 X <- rtrue(nsamples) xDens <- kernelXD(X) xdens_test(xDens, dtrue = dtrue, ptrue = ptrue, dname = dname, dlpos = "topright", plpos = "bottomright") mtext(text = paste0("M = ", nsamples), side = 2, line = .5, outer = TRUE) par(oma = c(0,0,0,0)) # remove outer margin
The gc4XD()
constructor implements a moment-based density approximation, which works best for unimodal distributions that are not too far from normal. The estimator proceeds in two steps:
The gc4XD()
constructor estimates the PDF of the data $\rv 1 X n$ on the original scale by inverting steps 1 and 2 above. The example below illustrates the quality of the approximation for $M = 1000$ observations of $X \sim \chi^2_{(4)}$.
set.seed(1) # reproducible results nsamples <- 1e3 X <- rtrue(nsamples) xDens <- gc4XD(X) # xDensity constructor xdens_test(xDens, dtrue = dtrue, ptrue = ptrue, dname = dname, dlpos = "topright", plpos = "bottomright")
The matrixXD()
constructor is used to convert functional density representations into the standardized xDensity
format. Its input is a 2-column matrix with equally-spaced $x$-grid values in the first column and corresponding density values in the second.
Below is an illustration using the Student-$t$ distributions $X \sim t_{(2)}$ and $X \sim t_{(1)}$ (i.e., Cauchy) distributions. We can see that for extremely heavy-tailed distributions such as Cauchy, the normal approximation used by xDensity
clearly undercuts tail probabilities.
```r$ distributions with $\nu = 1,2$."}
dtrue <- list(t2 = function(x) dt(x, df = 2), Cauchy = function(x) dt(x, df = 1)) ptrue <- list(t2 = function(x) pt(x, df = 2), Cauchy = function(x) pt(x, df = 1)) dname <- expression(X%~%t[(2)], X%~%"Cauchy")
xgrid <- seq(from=-15, to=15, len = 500) # grid for(ii in 1:2) { ypdf <- dtrue[ii] # true density xDens <- matrixXD(cbind(X = xgrid, Y = ypdf)) # xDensity constructor xdens_test(xDens, dtrue = dtrue[[ii]], ptrue = ptrue[[ii]], dname = dname[ii], dlpos = "topright", plpos = "bottomright") }
## Application: The Gaussian Copula Distribution {#gCop} Consider a $d$-dimensional random variable $\XX = (\rv 1 X d)$ with unknown PDF $\XX \sim f(\xx)$. A popular approach to statistical dependence modeling is to do so upon transforming $\XX$ to having univariate margins. That is, if $F_i(x_i)$ is the CDF of $X_i$, then $$ U_i = F_i^{-1}(X_i) \sim \tx{Unif}(0,1), $$ such that the joint PDF of $\XX$ can be written as $$ f(\xx) = c(\uu) \times\prod_{i=1}^n f_i(x_i), \qquad u_i = F_i(x_i). $$ Thus, $c(\uu)$ is the joint PDF of marginally Uniform random variables, $U_i \sim \tx{Unif}(0,1)$, and is known as a [Copula distribution](https://en.wikipedia.org/wiki/Copula_(probability_theory)). Numerous R packages provide various for fitting and sampling from Copula distributions, e.g., [**copula**](https://CRAN.R-project.org/package=copula) [@hofert.etal17], [**VineCopula**](https://CRAN.R-project.org/package=VineCopula) [@schepsmeier.etal17], and many others listed in the [**Distributions**](https://CRAN.R-project.org/view=Distributions) Task View. The `xDensity` methods for modeling the marginal distributions $f_i(x_i)$ can easily be combined with the Copula dependence models $c(\uu)$ provided by these packages. **GaussCop** provides a simple and specific interface to the so-called Gaussian Copula distribution, which approximates the true $c(\uu)$ as follows: 1. Let $\varphi(z)$ and $\Phi(z)$ denote the PDF and CDF of the standard normal $\N(0,1)$. Now, consider the change-of-variables $Z_i = \Phi^{-1}(U_i)$, for which the joint PDF of $\ZZ = (\rv 1 Z n)$ is $$ g(\zz) = c(\uu) \times \prod_{i=1}^n \varphi(z_i), \qquad u_i = \Phi(z_i). $$ 2. Since marginally $Z_i \sim \N(0,1)$, the Gaussian Copula approximates the true distirbution $g(\zz)$ by the best-matching multivariate normal, in the sense of minimizing the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence), $$ \hat g(\zz) = \argmin_{\RR} \int \log \left[\frac{g(\zz)}{\varphi(\zz \mid \RR)}\right] \cdot g(\zz) \ud \zz, $$ where $\varphi(\zz \mid \RR)$ is the joint PDF of a multivariate normal $\N(0, \RR)$ with correlation matrix $\RR_{ii} = 1$. To evaluate the accuracy of the Gaussian Copula appromation, consider applying it to the $p(p+1)/2$-dimensional Cholesky decomposition of a "unit" $p\times p$ [Wishart distribution](https://en.wikipedia.org/wiki/Wishart_distribution), $$ \XX_{p\times p} \sim \tx{Wishart}(\II, p). $$ That is, $\XX = \ZZ'\ZZ$ and $\ZZ$ is a $p\times p$ matrix of iid $\N(0,1)$. By construction, the Gaussian Copula perfectly captures the marginal distributions of $\tx{chol}(\XX)$. Therefore, consider the error on the marginal distributions of the eigenvalues of $\XX$, induced by approximating the copula dependence between the Cholesky variables as multivariate normal. In the example below with $p = 4$, the error appears to be relatively small: ```r(\\II, 4)$, as fitted to $\\tx{chol}(\\XX)$."} # simulate data from unit Wishart distribution p <- 4 # size of matrix nsamples <- 1e4 X <- replicate(nsamples, crossprod(matrix(rnorm(p^2),p,p))) # Cholesky decomposition Xchol <- apply(X, 3, function(V) chol(V)[upper.tri(V,diag = TRUE)]) Xchol <- t(Xchol) # fit Gaussian Copula approximation gCop <- gcopFit(X = Xchol, # iid sample from multivariate distribution fitXD = "kernel") # xDensity approximation to each margin # Compare approximation to true distribution of eigenvalues of X # extract sorted eigenvalues of a positive definite matrix eig_val <- function(V) { sort(eigen(V, symmetric = TRUE)$values, decreasing = TRUE) } Xeig <- t(apply(X, 3, eig_val)) # true distribution Xeig2 <- rgcop(nsamples, gCop = gCop) # Gaussian Copula approximation Xeig2 <- t(apply(Xeig2, 1, function(U) { # recover full matrix V <- matrix(0,p,p) V[upper.tri(V,diag=TRUE)] <- U V <- crossprod(V) eig_val(V) })) # plot true and approximate eigenvalue distributions main <- parse(text = paste0("lambda[", 1:p, "]")) par(mfrow = c(1,p), mar = c(2.5,2.5,2,.1)+.1, oma = c(0,2,0,0)) for(ii in 1:p) { dtrue <- density(Xeig[,ii]) dapprox <- density(Xeig2[,ii]) xlim <- range(dtrue$x, dapprox$x) ylim <- range(dtrue$y, dapprox$y) plot(dtrue$x, dtrue$y, type = "l", xlim = xlim, ylim = ylim, main = main[ii], xlab = "", ylab = "", col = "blue") lines(dapprox$x, dapprox$y, col = "red") } mtext(text = "Density", side = 2, line = .5, outer = TRUE) legend("topright", legend = c("True PDF", "gCop Approx."), fill = c("blue", "red")) par(oma = c(0,0,0,0)) # remove outer margin
We wish to find the parameters $(\mu, \sigma)$ of a normal distribution whose PDF passes through the points $(x_i, y_i)$, $i = 1,2$. Thus we wish to find $(\mu, \sigma)$ satisfying $$ \log y_i = -\frac 1 2 \left[\frac{(x_i - \mu)^2}{\sigma^2} + \log \sigma^2 + \log 2\pi \right], \qquad i = 1,2. $$ Subtracting one equation from the other, we obtain $$ \log(y_1/y_2) = \frac{(x_1 - \mu)^2}{2\sigma^2} - \frac{(x_2 - \mu)^2}{2\sigma^2}, $$ such that for any value of $\sigma$ the required value of $\mu$ is $$ \mu(\sigma) = \sigma^2\frac{\log(y_1/y_2)}{x_1 - x_2} + \frac{x_1 + x_2}{2} = b \sigma^2 + a. $$ Substituting this back into one of the two original equations leads to the root-finding problem $$ g(\tau) = \frac{A^2}{\tau} + B^2 \tau + \log \tau - C = 0, $$ where $\tau = \sigma^2$ and $$ A = x_i - a, \qquad B = b, \qquad C = 2[b(x_i-a) - \log y_i] - \log 2\pi. $$ A solution to this equation is not guaranteed to exist. For instance, let $x_1 = -1$, $x_2 = 1$, and $y_i = 1$. Then a Normal distribution passing through these points must have density at least $2$. However, in our case $y_i$ will typically be very small, in which case a root is likely to exist. Even then, there could be two roots to the equation: one with $\mu$ between $x_1$ and $x_2$, and one with $\mu$ (way) off to one side. We'll focus here on finding the former.
The root we seek can be calculated numerically with the (built-in) R function stats::uniroot()
, which requires specification of an interval containing the root. To obtain this interval, first notice that $g'(\tau) = -\frac{A^2}{\tau^2} + B^2 + \frac{1}{\tau}$, of which the only positive solution is
$$
\tau_L = \frac{-1 + \sqrt{1 + 4(AB)^2}}{2B^2}.
$$
Since $g(\tau) \to \infty$ for $\tau \to 0$ and $\tau \to \infty$, we conclude that $\tau_L$ is the minimum value of $g(\tau)$ on $\tau \in (0,\infty)$. Let us now find a value $\tau_U < \tau_L$ at which $g(\tau_U) > 0$ (doing this for $\tau_U > \tau_L$ is likely to have $\mu$ outside of $(x_1, x_2)$).
In order to find a lower bound for the root, note that for $0 < \tau < A^2/2$, we have $$ \log \tau > -\frac{A^2/2}{\tau} + D, \qquad D = 1 + \log(A^2/2). $$ This is because the derivative of the RHS is larger than that of the LHS for $\tau < A^2/2$, thus falls to $-\infty$ monotonically faster as $\tau \to 0$. The value of $D$ is chosen such that LHS and RHS are equal at $\tau = A^2/2$. Substituting the inequality above into the root-finding equation, we obtain $$ g(\tau) > \frac{A^2}{\tau} + \log \tau - C > \frac{A^2/2}{\tau} + D - C, $$ such that $g(\tau) > 0$ for $\tau < \frac{A^2}{2(C-D)}$.
Note that when $y_1 = y_2$, the endpoint matching method breaks down and the mean and standard deviation of the approximated density is used for the Normally distributed tails.
xdens_test
{#apptest}# simulate data from unit Wishart distribution nsamples <- 1e5 X <- replicate(nsamples, crossprod(matrix(rnorm(4),2,2))) # Cholesky decomposition Xchol <- apply(X, 3, function(V) chol(V)[upper.tri(V,diag = TRUE)]) Xchol <- t(Xchol) # fit Gaussian Copula approximation gCop <- gcopFit(X = Xchol, # iid sample from multivariate distribution fitXD = "kernel") # xDensity approximation to each margin # Compare approximation to true distribution of eigenvalues of X # eigenvalues of 2x2 matrix eig22 <- function(V) { c1 <- V[1,1] + V[2,2] # trace c2 <- V[1,1]*V[2,2] - V[1,2]^2 # determinant c2 <- sqrt(.25 * c1^2 - c2) c(.5 * c1 + c2, .5 * c1 - c2) } Xeig <- t(apply(X, 3, eig22)) # true distribution Xeig2 <- rgcop(nsamples, gCop = gCop) # Gaussian Copula approximation Xeig2 <- t(apply(Xeig2, 1, function(U) { # recover full matrix V <- matrix(0,2,2) V[upper.tri(V,diag=TRUE)] <- U V <- crossprod(V) eig22(V) })) main = expression(lambda[1],lambda[2]) par(mfrow = c(1,2), mar = c(2.5,2.5,2,.1)+.1, oma = c(0,2,0,0)) for(ii in 1:2) { dtrue <- density(Xeig[,ii]) dapprox <- density(Xeig2[,ii]) xlim <- range(dtrue$x, dapprox$x) ylim <- range(dtrue$y, dapprox$y) plot(dtrue$x, dtrue$y, type = "l", xlim = xlim, ylim = ylim, main = main[ii], xlab = "", ylab = "") lines(dapprox$x, dapprox$y, col = "red") } mtext(text = "Density", side = 2, line = .5, outer = TRUE) par(oma = c(0,0,0,0)) # remove outer margin
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.