examples/autodiffr_intro.md

Automatic Differentiation in R by autodiffr

Changcheng Li, John Nash, Hans Werner Borchers 2018/8/10

Package autodiffr provides an R wrapper for the Julia packages ForwardDiff.jl and ReverseDiff.jl through the R package JuliaCall to do automatic differentiation for native R functions. This vignette will walk you through several examples of using autodiffr.

Installation

Julia is needed to use autodiffr. You can download a generic Julia binary from https://julialang.org/downloads/ and add it to the path. Package autodiffr is not on CRAN yet. You can get the development version of autodiffr by

devtools::install_github("Non-Contradiction/autodiffr")

Important: Note that currently Julia v0.6.x, v0.7.0 and v1.0 are all supported by autodiffr, but to use autodiffr with Julia v0.7/1.0, you need to get the development version of JuliaCall by:

devtools::install_github("Non-Contradiction/JuliaCall")

Basic usage

After loading the autodiffr library, it is still necessary to start a Julia process, connect to it, and check whether all needed Julia packages are installed, installing them if they are missing. All of this is done by the ad_setup function. If everything is in place, it will finish in a short time; otherwise it may take several minutes (the first time it is called).

library(autodiffr)

ad_setup()
## Julia version 1.0.0 at location [...]/julia/bin will be used.
## Loading setup script for JuliaCall...
## Finish loading setup script for JuliaCall.
## INFO: Precompiling module DiffResults.
## INFO: Precompiling module ForwardDiff.
## INFO: Precompiling module ReverseDiff.

The first time ad_setup() is called on your system, these three Julia modules / packages are precompiled; later on this will not happen again, except when the Julia packages are installed anew (e.g., after a new version has appeared). Please note that ad_setup is required whenever you load the autodiffr package into R. If Julia is not in the path, autodiffr may fail to locate it. In this case, use ad_setup(JULIA_HOME = "the directory that contains the julia binary").
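For example, the call below points ad_setup at a specific Julia installation; the path is purely hypothetical and should be replaced with the directory that actually contains your julia binary:

## Hypothetical location of the folder containing the julia binary.
ad_setup(JULIA_HOME = "/opt/julia-1.0.0/bin")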

Now automatic differentiation is ready to be applied. autodiffr has ad_grad, ad_jacobian and ad_hessian to calculate the gradient, jacobian and hessian, and has makeGradFunc, makeJacobianFunc and makeHessianFunc to create gradient, jacobian and hessian functions, respectively. We can see the basic usage below:

## Define a target function with vector input and scalar output
f <- function(x) sum(x^2L)

## Calculate gradient of f at [2,3] by
ad_grad(f, c(2, 3))
#> [1] 4 6

## Get a gradient function g
g <- makeGradFunc(f)

## Evaluate the gradient function g at [2,3]
g(c(2, 3))
#> [1] 4 6

## Calculate hessian of f at [2,3] by
ad_hessian(f, c(2, 3))
#>      [,1] [,2]
#> [1,]    2    0
#> [2,]    0    2

## Get a hessian function h
h <- makeHessianFunc(f)

## Evaluate the hessian function h at [2,3]
h(c(2, 3))
#>      [,1] [,2]
#> [1,]    2    0
#> [2,]    0    2

## Define a target function with vector input and vector output
f <- function(x) x^2

## Calculate jacobian of f at [2,3] by
ad_jacobian(f, c(2, 3))
#>      [,1] [,2]
#> [1,]    4    0
#> [2,]    0    6

## Get a jacobian function j
j <- makeJacobianFunc(f)

## Evaluate the jacobian function j at [2,3]
j(c(2, 3))
#>      [,1] [,2]
#> [1,]    4    0
#> [2,]    0    6

What to do if autodiffr cannot handle your function?

autodiffr works by utilizing R’s S3 object system: most R functions are just compositions of many small operations like + and -, and these small operations can be dispatched by the S3 object system to Julia, which does the real work of automatic differentiation behind the scenes. This mechanism works for many R functions, but it also means that autodiffr has the same limitations as R’s S3 object system. For example, autodiffr can’t handle R’s internal C code or some external C and Fortran code. In this section, we will talk about how to detect and deal with such problems. We will first cover the common issues in the next subsection, then look at a small example, and finally show how to use the helper function ad_variant provided by autodiffr to deal with these common problems all at once.
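As a minimal illustration (the function f_ok below is our own example, not taken from the original vignette), a function composed only of basic vectorized operations is handled without any changes:

## sin, exp, * and sum are all dispatched through R's S3 system,
## so autodiffr can differentiate this function directly.
f_ok <- function(x) sum(sin(x) * exp(x))
ad_grad(f_ok, c(1, 2))
## The result should agree with the analytic gradient (cos(x) + sin(x)) * exp(x).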

Basic R functions that have problems with autodiffr

This is a list of the most common basic R functions that have problems with autodiffr; the two that appear in the example below are:

- Assignment into an ordinary numeric vector or matrix, such as y[1] <- x[1]: the values produced during automatic differentiation cannot be stored in a plain numeric container, so the container needs to be wrapped with julia_array first.
- matrix() used to create or reshape such a container: use array() or julia_array() instead.

A little example to demonstrate issues

Let us first define a made-up function to show some of the aforementioned issues:

fun <- function(x) {
    if (length(x) != 4L) stop("'x' must be a vector of 4 elements.")
    y <- rep(0, 4)
    y[1] <- x[1]; y[2] <- x[2]^2
    y[3] <- x[3]; y[4] <- x[4]^3
    y <- matrix(y, 2, 2)
    det(y)
}

If we apply autodiffr directly to fun, we get:

x0 <- c(1.2, 1.4, 1.6, 1.8)

try(ad_grad(fun, x0))

## From the error message
## Error in y[1] <- x[1] : 
##   incompatible types (from environment to double) in subassignment type fix
## and the error list in the previous subsection, we can see that the problem
## comes from the assignment, and we can come up with a solution that deals with the issue:

fun1 <- function(x) {
    if (length(x) != 4L) stop("'x' must be a vector of 4 elements.")
    y <- julia_array(rep(0, 4)) 
    ## we need to use julia_array to wrap the vector
    ## or we can do y <- julia_array(0, 4) which is more elegant.
    y[1] <- x[1]; y[2] <- x[2]^2
    y[3] <- x[3]; y[4] <- x[4]^3
    y <- matrix(y, 2, 2)
    det(y)
}

try(ad_grad(fun1, x0))

## This time we get another error:
## Error in matrix(y, 2, 2) : 
##   'data' must be of a vector type, was 'environment'
## From the error message and the error list in the previous subsection,
## we can see that the problem comes from using matrix to reshape the vector y,
## and we can come up with a solution that deals with the issue:

fun2 <- function(x) {
    if (length(x) != 4L) stop("'x' must be a vector of 4 elements.")
    ## we need to use julia_array to wrap the vector
    ## or we can do y <- julia_array(0, 4) which is more elegant.
    y <- julia_array(rep(0, 4)) 
    y[1] <- x[1]; y[2] <- x[2]^2
    y[3] <- x[3]; y[4] <- x[4]^3
    ## we need to use array to reshape the vector,
    ## or we can do y <- julia_array(y, 2, 2).
    y <- array(y, c(2, 2))
    det(y)
}

## This time we will have the correct solution
ad_grad(fun2, x0)
#> [1]  5.832 -4.480 -1.960 11.664

## and we can use numeric approximation to check our result:
numDeriv::grad(fun2, x0)
#> [1]  5.832 -4.480 -1.960 11.664

Use ad_variant to deal with these issues “automatically”

The above process is a little tedious. Helpfully, autodiffr provides a helper function ad_variant to deal with all of these problems at once. The usage is quite simple and self-explanatory:

funvariant <- ad_variant(fun)
#> ad_variant will try to wrap your function in a special environment, which redefines some R functions to be compatible with autodiffr. Please use it with care.

ad_grad(funvariant, x0)
#> [1]  5.832 -4.480 -1.960 11.664

funvariant1 <- ad_variant(fun, checkArgs = x0)
#> ad_variant will try to wrap your function in a special environment, which redefines some R functions to be compatible with autodiffr. Please use it with care.
#> Start checking...
#> assignment in arrays has been replaced to be compatible with arrays in Julia.
#> matrix has been replaced with array.
#> New function gives the same result as the original function at the given arguments. Still need to use it with care.
#> Checking finished.

ad_grad(funvariant1, x0)
#> [1]  5.832 -4.480 -1.960 11.664

A more complex example: Chebyquad function

This problem was given prominence in the optimization literature by Fletcher (1965).

First, let us define the Chebyquad function. Note that it takes the vector x as input. This function is taken from the package funconstrain.

cyq.f <- function (par) 
{
    n <- length(par)
    if (n < 1) {
        stop("Chebyquad: n must be positive")
    }
    y <- 2 * par - 1
    t0 <- rep(1, n)
    t1 <- y
    ti <- t1
    fsum <- 0
    for (i in 1:n) {
        if (i > 1) {
            ti <- 2 * y * t1 - t0
            t0 <- t1
            t1 <- ti
        }
        fi <- sum(ti)/n
        if (i%%2 == 0) {
            fi <- fi + 1/(i * i - 1)
        }
        fsum <- fsum + fi * fi
    }
    fsum
}

Let us choose a single value for the number of parameters; for illustration we use n = 10.

## cyq.setup
n <- 10
x <- 1:n
x <- x / (n + 1.0) # Initial value suggested by Fletcher

For safety, let us check the function and a numerical approximation to the gradient.

require(numDeriv)
#> Loading required package: numDeriv
cat("Initial parameters:")
#> Initial parameters:
x
#>  [1] 0.09090909 0.18181818 0.27272727 0.36363636 0.45454545 0.54545455
#>  [7] 0.63636364 0.72727273 0.81818182 0.90909091
cat("Initial value of the function is ",cyq.f(x),"\n")
#> Initial value of the function is  0.03376327
gn <- numDeriv::grad(cyq.f, x) # using numDeriv
cat("Approximation to gradient at initial point:")
#> Approximation to gradient at initial point:
gn
#>  [1]  0.7446191 -0.4248347  0.3221248 -0.0246995 -0.2126736  0.2126736
#>  [7]  0.0246995 -0.3221248  0.4248347 -0.7446191

And then we can use autodiffr to get the gradient:

cyq.ag <- makeGradFunc(cyq.f)
cat("Gradient at x")
#> Gradient at x
cyq.ag(x)
#>  [1]  0.7446191 -0.4248347  0.3221248 -0.0246995 -0.2126736  0.2126736
#>  [7]  0.0246995 -0.3221248  0.4248347 -0.7446191

We can see that the two results agree, which indicates that autodiffr gives the correct gradient.
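If you prefer a programmatic check (this snippet is not part of the original vignette), the automatic and numerical gradients can be compared directly with base R's all.equal:

## A loose tolerance accounts for the truncation error of the numerical method.
all.equal(gn, cyq.ag(x), tolerance = 1e-6)
## should return TRUE when the two gradients agree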

Create optimized gradient/jacobian/hessian functions with autodiffr

autodiffr also provides some ways to generate a more optimized version of gradient/jacobian/hessian functions. See ?autodifffunc.

Here we give a simple example of one possible way (and perhaps the most effective one) to generate an optimized gradient function, by setting use_tape = TRUE, using the Chebyquad function:

cyq.ag_with_tape <- makeGradFunc(cyq.f, x = runif(length(x)), use_tape = TRUE)

cyq.ag_with_tape(x)
#>  [1]  0.7446191 -0.4248347  0.3221248 -0.0246995 -0.2126736  0.2126736
#>  [7]  0.0246995 -0.3221248  0.4248347 -0.7446191

From the documentation: use_tape determines whether or not to use a tape for reverse mode automatic differentiation. Note that although use_tape can sometimes greatly improve performance, it can only calculate the derivatives w.r.t. a certain branch, so it may not give the correct result for functions containing things like if. This parameter is only effective when mode = "reverse".
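The following made-up sketch illustrates the branching caveat; f_branch, g_tape and the explicit mode = "reverse" argument are illustrative additions, not taken from the original vignette:

## A function whose result depends on an if branch.
f_branch <- function(x) if (x[1] > 0) sum(x^2) else sum(x^3)
## The tape records only the branch taken at the x supplied here (x[1] > 0),
g_tape <- makeGradFunc(f_branch, x = c(1, 1), use_tape = TRUE, mode = "reverse")
## so evaluating g_tape at an input that takes the other branch, e.g. c(-1, 1),
## would silently reuse the recorded x^2 branch and give a wrong gradient.
## An untaped gradient function, makeGradFunc(f_branch), does not have this limitation.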

Important: For the optimization to have any effect, x must be given and must have the same shape as the potential input(s). Also pay attention to the correctness of the optimized gradient function; here we can see that the optimized gradient function also gives the correct result.

We can then use the package microbenchmark to check the effect of use_tape and to compare it with the numerical approximation from numDeriv.

require(microbenchmark)
#> Loading required package: microbenchmark

cat("cyq.f timing:\n")
#> cyq.f timing:
tcyq.f <- microbenchmark(cyq.f(x))
tcyq.f
#> Unit: microseconds
#>      expr    min      lq     mean  median      uq    max neval
#>  cyq.f(x) 10.505 10.7675 11.25659 10.8595 11.0695 41.397   100

tcyq.ag <- microbenchmark(cyq.ag(x), unit="us" )

tcyq.ag
#> Unit: microseconds
#>       expr      min       lq     mean   median       uq      max neval
#>  cyq.ag(x) 5384.953 5771.823 6952.028 6038.078 6816.669 27676.09   100

tcyq.ag_with_tape <- microbenchmark(cyq.ag_with_tape(x), unit="us" )

tcyq.ag_with_tape
#> Unit: microseconds
#>                 expr     min      lq     mean   median       uq     max
#>  cyq.ag_with_tape(x) 117.252 119.487 136.8331 120.5385 124.5735 386.252
#>  neval
#>    100

tcyq.ng <- microbenchmark(numDeriv::grad(cyq.f, x), unit="us" )

tcyq.ng
#> Unit: microseconds
#>                      expr      min       lq     mean   median       uq
#>  numDeriv::grad(cyq.f, x) 1573.643 1613.349 1935.169 1699.395 1934.216
#>       max neval
#>  6291.453   100

From the timing comparison, we can see that use_tape is very effective and that the optimized gradient function performs much better than the numerical approximation provided by numDeriv in this case.

Bibliography

Fletcher, R. 1965. “Function Minimization Without Calculating Derivatives – a Review.” *Computer Journal* 8: 33–41.

