knitr::opts_chunk$set(echo = TRUE)
library("hyper2",quietly=TRUE)
set.seed(0)

We will numerically verify the gradientn() function for two hyper2 objects.

Chess dataset

The chess dataset is as follows:

chess

We will define two gradient functions: chess_gradient_theoretical(), which calculates the partial derivatives of the log-likelihood using straightforward algebra (specifically the elementary fact that $\partial\left[n\log(x+y)\right]/\partial x=n/(x+y)$), and chess_gradient_numerical(), which calls gradientn() and thereby the package's compiled C code.
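
For example, reading the powers off the definition below, the chess log-likelihood includes the Topalov-dependent terms $30\log(\mathrm{Topalov})-35\log(\mathrm{Topalov}+\mathrm{Anand})-18\log(\mathrm{Topalov}+\mathrm{Karpov})$, so

$$\frac{\partial\ell}{\partial\,\mathrm{Topalov}}=\frac{30}{\mathrm{Topalov}}-\frac{35}{\mathrm{Topalov}+\mathrm{Anand}}-\frac{18}{\mathrm{Topalov}+\mathrm{Karpov}},$$

which is the first component returned by chess_gradient_theoretical():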

chess_gradient_theoretical <- function(x){
    Topalov <- x[1]
    Anand   <- x[2]
    Karpov  <- x[3]

    out <- c(
        Topalov =  30/Topalov -35/(Topalov+Anand) -18/(Topalov+Karpov),
        Anand   = -35/(Topalov+Anand) +36/Anand -35/(Anand+Karpov),
        Karpov  = -18/(Topalov+Karpov) -35/(Anand+Karpov) +22/Karpov
    )

    names(out) <- pnames(chess)
    out

}
chess_gradient_numerical  <- function(x){gradientn(chess,x)}

So in the above we have two ways of calculating the gradient: one theoretical and one numerical. Do they agree?

p_chess <- 
    c(
        Topalov = 0.5,
        Anand   = 0.3,
        Karpov  = 0.2
    )
chess_gradient_theoretical(p_chess) - chess_gradient_numerical(p_chess)

Yes, they agree: the difference is zero to within numerical precision.
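
As a further sanity check (a sketch, not part of the package), we can difference a hand-written log-likelihood directly using central differences; the powers are read off chess_gradient_theoretical() above, and the step size eps is an arbitrary choice:

chess_loglik_byhand <- function(x){
    Topalov <- x[1]; Anand <- x[2]; Karpov <- x[3]
    ( 30*log(Topalov) + 36*log(Anand) + 22*log(Karpov)
     -35*log(Topalov+Anand) -18*log(Topalov+Karpov) -35*log(Anand+Karpov) )
}
eps <- 1e-7    # finite-difference step
fd <- sapply(1:3, function(i){
    up <- p_chess; up[i] <- up[i] + eps
    dn <- p_chess; dn[i] <- dn[i] - eps
    (chess_loglik_byhand(up) - chess_loglik_byhand(dn))/(2*eps)
})
fd - chess_gradient_numerical(p_chess)   # should be small

We would expect this difference to be small too, of the order of the truncation error of central differences.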

Icons

The icons dataset is:

icons

We can do the same thing but the algebra is more involved:

icons_gradient_theoretical <- function(x){
    NB <- x[1]
    L <- x[2]
    PB <- x[3]
    THC <- x[4]
    OA <- x[5]
    WAIS <- x[6]

    out <- c(
        NB = (
            +32/NB
            -20/(NB + L + THC + OA)
            -15/(NB + L + THC + WAIS)
            -09/(NB + L + OA + WAIS)
            -18/(NB + PB + THC + OA)
            -18/(NB + PB + THC + WAIS)
            -08/(NB + PB + OA + WAIS)
        ),
        L = (
            -20/(NB + L + THC + OA)
            -15/(NB + L + THC + WAIS)
            -09/(NB + L + OA + WAIS)
            +24/L
            -11/(L + PB + THC + OA)
            -16/(L + PB + THC + WAIS)
            -18/(L + PB + OA + WAIS)
        ),
        PB = (
            -18/(NB + PB + THC + OA)
            -18/(NB + PB + THC + WAIS)
            -08/(NB + PB + OA + WAIS)
            -11/(L + PB + THC + OA)
            -16/(L + PB + THC + WAIS)
            -18/(L + PB + OA + WAIS)
            +30/PB
        ),
        THC = (
            -20/(NB + L + THC + OA)
            -15/(NB + L + THC + WAIS)
            -18/(NB + PB + THC + OA)
            -18/(NB + PB + THC + WAIS)
            -11/(L + PB + THC + OA)
            -16/(L + PB + THC + WAIS)
            +24/THC
        ),
        OA = (
            -20/(NB + L + THC + OA)
            -09/(NB + L + OA + WAIS)
            -18/(NB + PB + THC + OA)
            -08/(NB + PB + OA + WAIS)
            -11/(L + PB + THC + OA)
            -18/(L + PB + OA + WAIS)
            +14/OA
        ),
        WAIS = (
            -15/(NB + L + THC + WAIS)
            -09/(NB + L + OA + WAIS)
            -18/(NB + PB + THC + WAIS)
            -08/(NB + PB + OA + WAIS)
            -16/(L + PB + THC + WAIS)
            -18/(L + PB + OA + WAIS)
            +9/WAIS
        )
    )
    names(out) <- pnames(icons)
    return(out)
}

icons_gradient_numerical  <- function(x){gradientn(icons,x)}

Then

p_icons <- c(NB=0.3, L=0.1, PB=0.2, THC=0.15, OA=0.15, WAIS=0.1)
icons_gradient_theoretical(p_icons)
icons_gradient_theoretical(p_icons) - icons_gradient_numerical(p_icons)

and again the two methods agree to within numerical precision.

Difference between gradientn() and gradient()

Taking the chess dataset we have

rbind(indep(p_chess), gradient(chess,indep(p_chess)))
rbind(p_chess,gradientn(chess,p_chess))
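
The two functions differ in which coordinates they regard as free. The call above suggests that gradient() differentiates with respect to the first $n-1$ components only, the last being the fillup; if that convention holds (an assumption of this sketch, not package documentation), the chain rule gives $\mathtt{gradient()}_i=\mathtt{gradientn()}_i-\mathtt{gradientn()}_n$, which we can check:

g <- gradientn(chess, p_chess)
gradient(chess, indep(p_chess)) - (g[1:2] - g[3])   # should be near zero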

Indeed, because the powers of chess sum to zero, the derivative in the direction parallel to p_chess should vanish; we verify this below.
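
To see why, note that each bracket in the likelihood is a sum of player strengths and is therefore homogeneous of degree one in $p$; writing $n_i$ for the powers, the log-likelihood satisfies

$$\ell(tp)=\ell(p)+\left(\sum_i n_i\right)\log t,$$

so if the powers sum to zero then $\ell$ is invariant under rescaling of $p$, and differentiating with respect to $t$ at $t=1$ gives $p\cdot\nabla\ell(p)=0$: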

sum(powers(chess))
sum(p_chess*gradientn(chess,p_chess))

which is indeed zero to within numerical precision. The same holds for the icons dataset, whose powers also sum to zero:

sum(powers(icons))
sum(p_icons*gradientn(icons,p_icons))

Case study: icons

Consider the icons likelihood function; we wonder whether the log-likelihood has a well-defined maximum.

(M <- hessian(icons))
is_ok_hessian(M)

This would suggest that the maximum is well-defined; we may ask how sharp it is:

is_ok_hessian(M,give=TRUE)

suggesting a reasonably sharp constrained maximum point. We now demonstrate that the Hessian matrix of second derivatives is numerically correct. First, the unconstrained problem, working with the first five elements. The first step is to look at the likelihood, the gradient, and the Hessian, all evaluated at a certain point, and then to compare a second-order Taylor expansion against a direct evaluation of the log-likelihood.
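
Specifically, for a small perturbation $\delta p$ we expect

$$\ell(p+\delta p)-\ell(p)\approx\nabla\ell(p)\cdot\delta p+\tfrac{1}{2}\,\delta p^{T}M\,\delta p,$$

where $M$ is the Hessian at $p$; the first- and second-order truncations of the right-hand side are dL1 and dL2 in the code below: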

(p <- indep(equalp(icons))) # NB: equal strengths, not the evaluate (the maximum likelihood point)
(G <- gradient(icons,p))
(M <- hessian(icons,p,border=FALSE))
eigen(M,TRUE,TRUE)$values
dp <- rnorm(5)*1e-4        # small perturbation
dL <- loglik(p+dp,icons) - loglik(p,icons)  # delta loglikelihood (exact)

dL1 <- sum(gradient(icons,p)*dp)    # first order approximation
dL2 <- dL1 + t(dp) %*% M %*% dp/2   # second order; could use quad.form(M,dp) from the emulator package

c(dL,dL1,dL2)

c(first_order_error=dL-dL1,second_order_error=dL-dL2)

In the above, see how the first-order approximation is pretty good: its error is of ${\mathcal O}(f''(x)\,\delta x^2)$, which here would be about $1000\times 10^{-8}=10^{-5}$, pretty much what we observe (the $1000$ being the typical size of the elements of the Hessian). The second-order approximation is better, having a smaller error, but not that much better: I would expect the error to be ${\mathcal O}(f'''(x)\,\delta x^3)$, but maybe the third-order derivatives are larger than one might expect. We can perform a similar analysis using bordered Hessians:

dpc <- fillup(dp,0)   # constrained delta p

MC <- hessian(icons,p,border=TRUE)
dLC1 <- sum(gradientn(icons,fillup(p))*dpc)    # first order approximation
dLC2 <- dLC1 + t(dpc) %*% MC %*% dpc/2     # second order

c(dL,dLC1,dLC2)

c(first_order_error=dL-dLC1,second_order_error=dL-dLC2)

showing similar results.


