Basic zoon usage

An Introduction to the zoon package

Zoon is a package to aid reproducibility and between-model comparisons in species distribution modelling. Each step in an analysis is a 'module'. There are five module types: occurrence, covariate, process, model and output.

Getting set up

Zoon is on CRAN and can be installed like this:

install.packages('zoon')

Alternatively, you can install the most up-to-date development version of Zoon from GitHub:

library(devtools)
install_github('zoonproject/zoon')

Then load the package:

library(zoon)

Basic usage

A basic workflow is run using the workflow function. We must choose a module for each type: occurrence, covariate, process, model and output.

work1 <- workflow(occurrence = UKAnophelesPlumbeus,
                  covariate  = UKAir,
                  process    = OneHundredBackground,
                  model      = RandomForest,
                  output     = PrintMap)

[Figure: map produced by the PrintMap output module]

class(work1)
## [1] "zoonWorkflow"
str(work1, 1)
## List of 9
##  $ occurrence.output:List of 1
##  $ covariate.output :List of 1
##  $ process.output   :List of 1
##  $ model.output     :List of 1
##  $ report           :List of 1
##  $ call             : chr "workflow(occurrence = UKAnophelesPlumbeus, covariate = UKAir, process = OneHundredBackground, model = RandomForest, output = Pr"| __truncated__
##  $ call.list        :List of 5
##  $ session.info     :List of 7
##   ..- attr(*, "class")= chr "sessionInfo"
##  $ module.versions  :List of 5
##  - attr(*, "class")= chr "zoonWorkflow"

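In this case we are using one module of each type: UKAnophelesPlumbeus (occurrence), UKAir (covariate), OneHundredBackground (process), RandomForest (model) and PrintMap (output).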

The object returned from the workflow function (work1 in the above example) is an object of class zoonWorkflow. This object is a list with all the data, models and output we collected and created in our analysis.

To access the output of a particular part of the workflow, use the accessor functions, which are named after the module types. For example, to get the data returned from the occurrence module, use the Occurrence() accessor function:

# Use the Occurrence function to get occurrence module
# output from the workflow object

occ_out <- Occurrence(work1)

head(occ_out)
##     longitude latitude value     type fold
## 1  1.01287600 52.37696     1 presence    1
## 2 -0.16003467 51.57146     1 presence    1
## 3 -2.83497900 53.40813     1 presence    1
## 4 -0.62955210 51.55540     1 presence    1
## 5 -3.52534680 56.04848     1 presence    1
## 6  0.01144066 51.58168     1 presence    1

To find out more about the elements returned from each module, see the summary at the end of the 'Building a Module' vignette. In this instance the occurrence module returns a data frame containing all of the occurrence data.
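Because the occurrence output is an ordinary data frame, base R tools can be used to summarise it. The sketch below rebuilds the six rows printed by head(occ_out) by hand, so that it runs without zoon or a network connection, and then counts the records by type and fold. The names occ_demo, type_by_fold and occ_extent are made up for this illustration; with a real workflow you would presumably start from Occurrence(work1) instead.

```r
# Hand-copied from the head(occ_out) output above; this is not a zoon call.
occ_demo <- data.frame(
  longitude = c(1.01287600, -0.16003467, -2.83497900,
                -0.62955210, -3.52534680, 0.01144066),
  latitude  = c(52.37696, 51.57146, 53.40813,
                51.55540, 56.04848, 51.58168),
  value     = rep(1, 6),
  type      = rep("presence", 6),
  fold      = rep(1, 6),
  stringsAsFactors = FALSE
)

# Number of records of each type in each fold
type_by_fold <- table(occ_demo$type, occ_demo$fold)
print(type_by_fold)

# Bounding box of the records, ordered as (xmin, xmax, ymin, ymax)
occ_extent <- c(range(occ_demo$longitude), range(occ_demo$latitude))
print(occ_extent)
```

Only the first six records are used here, so the counts and bounding box describe that sample rather than the full occurrence data set.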

Getting Help

To find a list of modules available on the online repository use

GetModuleList()

To find help on a specific module use

ModuleHelp(LogisticRegression)

Note that you can't use ? for modules: they are held on an online repository, so their documentation files are not included with the basic zoon install.

If you have used zoon in a publication you will need to cite zoon and the modules you have used. There are two different functions for doing this.

# For the zoon package
citation('zoon')

# For zoon modules
ZoonCitation('OptGRaF')

More complex analyses

The syntax for passing arguments to modules is simply ModuleName(parameter = 'value'). For example, to do two-fold crossvalidation we write

work2 <- workflow(occurrence = UKAnophelesPlumbeus,
                  covariate  = UKAir,
                  process    = BackgroundAndCrossvalid(k = 2),
                  model      = LogisticRegression,
                  output     = PerformanceMeasures)
## Occurrence data does not have a "crs" column, zoon will assume it is in the same projection as the covariate data
## There are fewer than 100 cells in the environmental raster.
## Using all available cells (81) instead
## Loading required package: SDMTools
## 
## Attaching package: 'SDMTools'
## The following object is masked from 'package:raster':
## 
##     distance

Here we are providing an argument to the module BackgroundAndCrossvalid. We are setting k (the number of cross validation folds) to 2.

We are using an output module PerformanceMeasures which calculates a number of measures of the effectiveness of our model: AUC, kappa, sensitivity, specificity etc.

Multiple modules with Chain

We might want to combine multiple modules in our analysis. For this we use the function Chain.

work3 <- workflow(occurrence = UKAnophelesPlumbeus,
                  covariate  = UKAir,
                  process    = Chain(OneHundredBackground, Crossvalidate),
                  model      = LogisticRegression,
                  output     = PerformanceMeasures)
## Occurrence data does not have a "crs" column, zoon will assume it is in the same projection as the covariate data
## There are fewer than 100 cells in the environmental raster.
## Using all available cells (81) instead

Here we draw some pseudoabsence background points and then do crossvalidation. This is the same analysis as work2, but explicitly uses the separate modules.

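The effect of Chain depends on the module type. In the examples in this document, chained occurrence modules have their data combined, chained process modules are run one after the other, and each chained output module is run separately on the results of the preceding modules.

Chain can be used for as many module types as required.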

Multiple modules with list

If you want to run separate analyses, for example so that they can be compared, specify a list of modules.

work4 <- workflow(occurrence = UKAnophelesPlumbeus,
                  covariate  = UKAir,
                  process    = OneHundredBackground,
                  model      = list(LogisticRegression, RandomForest),
                  output     = PrintMap)
## Occurrence data does not have a "crs" column, zoon will assume it is in the same projection as the covariate data
## There are fewer than 100 cells in the environmental raster.
## Using all available cells (81) instead

[Figures: two maps produced by PrintMap, one per model]

str(work4, 1)
## List of 9
##  $ occurrence.output:List of 1
##  $ covariate.output :List of 1
##  $ process.output   :List of 1
##  $ model.output     :List of 2
##  $ report           :List of 2
##  $ call             : chr "workflow(occurrence = UKAnophelesPlumbeus, covariate = UKAir, process = OneHundredBackground, model = list(LogisticRegression, "| __truncated__
##  $ call.list        :List of 5
##  $ session.info     :List of 7
##   ..- attr(*, "class")= chr "sessionInfo"
##  $ module.versions  :List of 5
##  - attr(*, "class")= chr "zoonWorkflow"

Here, the analysis is split into two, and both logistic regression and random forest (a machine learning algorithm) are used to model the data. Looking at the structure of the output, we can see that the output from each of the first three modules is a list of length one. Where the analysis splits into two, the output of the modules (in work4$model.output and work4$report) is a list of length two, with one element for each branch of the split analysis.

Repeating a module multiple times

If you want to repeat a module multiple times you can use Replicate. This can be useful when using modules that have a random process such as the creation of pseudoabsences.

work5 <- workflow(occurrence = UKAnophelesPlumbeus,
                  covariate  = UKAir,
                  process    = Replicate(Background(n = 20), n = 3),
                  model      = RandomForest,
                  output     = PrintMap)
## Occurrence data does not have a "crs" column, zoon will assume it is in the same projection as the covariate data

[Figures: three maps produced by PrintMap, one per replicate]

Replicate takes as its first argument the module you want to repeat and as its second argument the number of times you want to repeat it. Here we end up running our model three times, once for each of three different sets of background points.

Auxiliary information in a zoonWorkflow

A zoonWorkflow object (such as work5 above) has a number of auxiliary elements to help you interpret its contents.

# call gives the R call used to create the workflow
work5$call
## [1] "workflow(occurrence = UKAnophelesPlumbeus, covariate = UKAir, process = Replicate(Background(n = 20), n = 3), model = RandomForest, output = PrintMap, forceReproducible = FALSE)"
# session.info gives the session info when the 
# workflow was created
work5$session.info
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] SDMTools_1.1-221    randomForest_4.6-12 dismo_1.1-1        
## [4] zoon_0.5.3          raster_2.5-8        sp_1.2-3           
## [7] knitr_1.15         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7       magrittr_1.5      roxygen2_5.0.1   
##  [4] munsell_0.4.3     colorspace_1.3-0  lattice_0.20-34  
##  [7] R6_2.2.0          highr_0.6         httr_1.2.1       
## [10] stringr_1.1.0     plyr_1.8.4        tools_3.3.2      
## [13] rgdal_1.2-4       grid_3.3.2        gtable_0.2.0     
## [16] R.oo_1.21.0       assertthat_0.1    lazyeval_0.2.0   
## [19] yaml_2.1.14       tibble_1.2        rfigshare_0.3.7  
## [22] crayon_1.3.2      RJSONIO_1.3-0     ggplot2_2.2.0    
## [25] R.utils_2.5.0     bitops_1.0-6      RCurl_1.95-4.8   
## [28] testthat_1.0.2    evaluate_0.10     stringi_1.1.2    
## [31] scales_0.4.1      R.methodsS3_1.7.1 XML_3.98-1.5     
## [34] httpuv_1.3.3
# module versions lists the modules used at each
# step and which version number they were
work5$module.versions
## $occurrence
##         [,1]                 
## module  "UKAnophelesPlumbeus"
## version "1.0"                
## 
## $covariate
##         [,1]   
## module  "UKAir"
## version "1.0"  
## 
## $process
##         [,1]         [,2]         [,3]        
## module  "Background" "Background" "Background"
## version "1.1"        "1.1"        "1.1"       
## 
## $model
##         [,1]          
## module  "RandomForest"
## version "1.0"         
## 
## $output
##         [,1]      
## module  "PrintMap"
## version "1.1"
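For reporting, it can be convenient to flatten the module versions into a single table. The sketch below assumes that module.versions is a named list of character matrices with rows called module and version, as the printed output above suggests. To stay runnable without zoon, it rebuilds that structure by hand; versions_demo, make_entry and flatten_versions are hypothetical names, not part of the zoon package.

```r
# Rebuild one entry by hand: a matrix with rows "module" and "version".
make_entry <- function(module, version) {
  rbind(module = module, version = version)
}

# Values copied from the printed work5$module.versions above.
versions_demo <- list(
  occurrence = make_entry("UKAnophelesPlumbeus", "1.0"),
  covariate  = make_entry("UKAir", "1.0"),
  process    = make_entry(rep("Background", 3), rep("1.1", 3)),
  model      = make_entry("RandomForest", "1.0"),
  output     = make_entry("PrintMap", "1.1")
)

# Turn the list of matrices into one data frame,
# with one row per module run at each step.
flatten_versions <- function(mv) {
  rows <- lapply(names(mv), function(step) {
    data.frame(
      step    = step,
      module  = unname(mv[[step]]["module", ]),
      version = unname(mv[[step]]["version", ]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

versions_table <- flatten_versions(versions_demo)
print(versions_table)
```

If the assumption about the structure holds, calling flatten_versions(work5$module.versions) on the real workflow should give the same table.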

When using lists in a workflow (as in work4 above) the workflow becomes forked. For example, when work4 was created two models were run, leading to two output maps. We can easily trace back the origins of any module output using the attribute call_path:

# work4 has two output maps, find the origins of the first
# using the Output accessor function and the call_path
# attribute
attr(Output(work4)[[1]], which = 'call_path')
## $occurrence
## [1] "UKAnophelesPlumbeus"
## 
## $covariate
## [1] "UKAir"
## 
## $process
## [1] "OneHundredBackground"
## 
## $model
## [1] "LogisticRegression"
## 
## $output
## [1] "PrintMap"
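To label every branch at once rather than inspecting one output at a time, the same attribute can be read in a loop. The sketch below is a minimal illustration using stand-in objects: outputs_demo mimics a list of two outputs that each carry a call_path attribute shaped like the one printed above, and the second branch is assumed to come from RandomForest. The names make_path, outputs_demo and branch_models are invented for this example.

```r
# Build a call_path like the one printed above, varying only the model.
make_path <- function(model) {
  list(occurrence = "UKAnophelesPlumbeus",
       covariate  = "UKAir",
       process    = "OneHundredBackground",
       model      = model,
       output     = "PrintMap")
}

# Stand-ins for the two outputs of work4: empty lists carrying
# a call_path attribute (assumed shape; not real zoon output).
outputs_demo <- list(
  structure(list(), call_path = make_path("LogisticRegression")),
  structure(list(), call_path = make_path("RandomForest"))
)

# Return the model module named in each output's call_path
branch_models <- function(outputs) {
  vapply(outputs,
         function(x) attr(x, which = "call_path")$model,
         character(1))
}

print(branch_models(outputs_demo))
```

With a real forked workflow, branch_models(Output(work4)) would be the equivalent call, provided every element returned by Output() carries the call_path attribute.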

A larger example

Here is an example of a larger analysis.

work6 <- workflow(occurrence = Chain(SpOcc(species = 'Eresus kollari', 
                                       extent = c(-10, 10, 45, 65)),
                                     SpOcc(species = 'Eresus sandaliatus', 
                                       extent = c(-10, 10, 45, 65))),
                  covariate = UKAir,
                  process = BackgroundAndCrossvalid(k = 2),
                  model = list(LogisticRegression,
                               RandomForest),
                  output = Chain(PrintMap(plot = FALSE),
                                 PerformanceMeasures)
         )
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following objects are masked from 'package:raster':
## 
##     area, select
## Loading required package: spocc
## Occurrence data does not have a "crs" column, zoon will assume it is in the same projection as the covariate data
## There are fewer than 100 cells in the environmental raster.
## Using all available cells (81) instead
# Take a look at the structure of the workflow object
str(work6, 1)
## List of 9
##  $ occurrence.output:List of 1
##  $ covariate.output :List of 1
##  $ process.output   :List of 1
##  $ model.output     :List of 2
##  $ report           :List of 4
##  $ call             : chr "workflow(occurrence = Chain(SpOcc(species = \"Eresus kollari\", extent = c(-10, 10, 45,      65)), SpOcc(species = \"Eresus san"| __truncated__
##  $ call.list        :List of 5
##  $ session.info     :List of 7
##   ..- attr(*, "class")= chr "sessionInfo"
##  $ module.versions  :List of 5
##  - attr(*, "class")= chr "zoonWorkflow"
# Create some custom plots using the raster returned from 
# the output module
par(mfrow = c(2,1), mar = c(3,4,6,4))
plot(Output(work6)[[1]], 
     main = paste('Logistic Regression: AUC = ', 
             round(Output(work6)[[2]]$auc, 2)),
     xlim = c(-10, 10))
plot(Output(work6)[[3]],
  main = paste('Random forest: AUC = ', 
             round(Output(work6)[[4]]$auc, 2)))

[Figure: the two custom plots created by the code above]

Here we are collecting occurrence data for two species, Eresus kollari and E. sandaliatus, and combining them (having presumably decided that this is ecologically appropriate). We are using the air temperature data from NCEP again. We are sampling 100 pseudoabsence points and running two-fold crossvalidation.

We run logistic regression and random forest on the data separately. We then predict the model back over the extent of our environmental data and calculate some measures of how good the models are. Collating the output into one plot we can see the very different forms of the models and can see that the random forest has a higher AUC (implying it predicts the data better).



zoon documentation built on May 29, 2017, 10:45 a.m.