humboldt.top.env: Select top environmental variables for PCA
In jasonleebrown/humboldt: Analysis of Species in Environmental Space

humboldt.top.env

R Documentation

Select top environmental variables for PCA

Description

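Selects the top environmental variables for inclusion in a PCA, based on their relative influence in species distribution models built for each species.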

Usage

humboldt.top.env(
  env1,
  env2,
  sp1,
  sp2,
  rarefy.dist = 0,
  rarefy.units = "km",
  env.reso,
  learning.rt1 = 0.01,
  learning.rt2 = 0.01,
  e.var,
  pa.ratio = 4,
  steps1 = 50,
  steps2 = 50,
  method = "contrib",
  nvars.save = 5,
  contrib.greater = 5
)

Arguments

`env1`	environmental variables for all sites of study area 1 (env1). Column names should be x, y, X1, X2, ..., Xn, where X1-Xn can be any string labels. If both species share one study area (env1 = env2), input the same file twice
`env2`	environmental variables for all sites of study area 2 (env2). Column names should be x, y, X1, X2, ..., Xn, where X1-Xn can be any string labels. If both species share one study area (env1 = env2), input the same file twice
`sp1`	occurrence sites for species/population 1 in study area 1 (env1). Column names should be 'sp', 'x', 'y'
`sp2`	occurrence sites for species/population 2 in study area 2 (env2). Column names should be 'sp', 'x', 'y'
`rarefy.dist`	removes occurrences that fall within a minimum distance (specified here) of each other (this uses the humboldt.occ.rarefy function). Values must be in km (recommended) or decimal degrees; see the associated parameter rarefy.units. Note: rarefy.dist=0 removes no occurrences. The default value is 0
`rarefy.units`	the units of rarefy.dist parameter, either "km" for kilometers or "dd" for decimal degrees
`env.reso`	the resolution of the input environmental data grid in decimal degrees
`learning.rt1`	learning-rate value from 0.01 to 0.001 for building the SDM for sp1. Start with 0.01 and, if prompted, change to 0.001. The default value is 0.01
`learning.rt2`	learning-rate value from 0.01 to 0.001 for building the SDM for sp2. Start with 0.01 and, if prompted, change to 0.001. The default value is 0.01
`e.var`	selection of variables to include in the evaluation for each species. The example below passes a range of column indices, e.var=(3:21)
`pa.ratio`	ratio of pseudoabsences to occurrence points; typically this is 4. The default value is 4
`steps1`	number of trees to add at each cycle when modelling sp1. Start with 50 and, if you run into problems, gradually decrease it, stopping at 1. The default value is 50
`steps2`	number of trees to add at each cycle when modelling sp2. Start with 50 and, if you run into problems, gradually decrease it, stopping at 1. The default value is 50
`method`	determines how the important environmental variables are selected. There are three options: "estimate", "contrib", "nvars". If method="estimate", the boosted regression tree algorithm chooses the number of variables to include by systematically removing variables until the average change in the model exceeds the original standard error of deviance explained. This is the most computationally intensive method. If method="contrib", variables at or above a relative influence value are kept; see the associated parameter 'contrib.greater'. If method="nvars", a fixed, user-specified number of variables is kept. The kept variables are those with the highest relative influence; see the associated parameter 'nvars.save'. The default value is "contrib"
`nvars.save`	required if method="nvars". It is the number of top variables to save per species. The kept variables are those with the highest relative influence in predicting the species' distribution. The total number of variables retained is often lower than twice this value, because identical variables may be selected for both species. The default value is 5. This value is ignored if method="estimate" or "contrib"
`contrib.greater`	required if method="contrib". The kept variables are selected by their relative influence in predicting the species' distribution: users keep variables whose model contribution is equal to or above the input value. The default value is 5 (= variables with a contribution of 5 percent or higher to the model of either species are kept). This value is ignored if method="estimate" or "nvars"
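
The "contrib" and "nvars" selection rules described above can be sketched in base R. This is only an illustration of the rules as documented, not humboldt's internal code: the relative-influence values are invented, and keep_contrib and keep_nvars are hypothetical helper names.

```r
# Hypothetical relative-influence tables (percent), one per species.
# These numbers are invented for illustration; they are not humboldt output.
infl_sp1 <- c(bio1 = 42, bio12 = 30, bio4 = 15, bio15 = 9, bio7 = 4)
infl_sp2 <- c(bio12 = 50, bio1 = 25, bio7 = 12, bio4 = 8, bio15 = 5)

# method = "contrib": keep variables at or above a contribution threshold.
keep_contrib <- function(infl, contrib.greater = 5) {
  names(infl)[infl >= contrib.greater]
}

# method = "nvars": keep the n highest-contributing variables.
keep_nvars <- function(infl, nvars.save = 5) {
  ranked <- sort(infl, decreasing = TRUE)
  names(ranked)[seq_len(min(nvars.save, length(ranked)))]
}

# Variables kept for either species are pooled, so shared picks count once.
top_contrib <- union(keep_contrib(infl_sp1, 10), keep_contrib(infl_sp2, 10))
top_nvars   <- union(keep_nvars(infl_sp1, 2), keep_nvars(infl_sp2, 2))

print(top_contrib)
print(top_nvars)
```

With two variables requested per species, the pooled "nvars" set here holds only two names rather than four, because both species select the same two variables. This is the reason the total retained can be lower than expected.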

Value

This function runs generalized boosted regression models (a machine-learning SDM algorithm) to select the top variables for inclusion in the PCA. This matters because the principal components should reflect variables that are relevant to the species' distributions. Alternatively, you can run Maxent outside of R and manually curate the variables you include (also recommended). As the example below shows, the reduced environmental data are accessed as $env1 and $env2 of the returned object.
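
To illustrate why the variable set matters for the principal components, here is a minimal base R sketch that subsets a toy environmental table to a few retained variables and runs a PCA on them with prcomp. The data are simulated and top_vars is a hypothetical selection; this is not how humboldt performs its own PCA downstream.

```r
set.seed(1)

# Toy environmental table in the documented layout: x, y, then variables.
env <- data.frame(
  x = runif(100, -80, -70),
  y = runif(100, -10, 0),
  bio1 = rnorm(100),
  bio4 = rnorm(100),
  bio12 = rnorm(100),
  bio15 = rnorm(100)
)

# Hypothetical set of retained variables (an assumption for this sketch).
top_vars <- c("bio1", "bio12")

# Keep the coordinates plus the retained variables only.
env_reduced <- env[, c("x", "y", top_vars)]

# PCA on the retained variables, centred and scaled to unit variance.
pca <- prcomp(env_reduced[, top_vars], center = TRUE, scale. = TRUE)
print(summary(pca))
```

Only the retained variables load on the components, so variables with little bearing on either species no longer shape the axes.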

Examples

library(humboldt)

## load environmental variables for all sites of study area 1 (env1). Column names should be x,y,X1,X2,...,Xn
env1<-read.delim("env1.txt",header=TRUE,sep="\t")

## load environmental variables for all sites of study area 2 (env2). Column names should be x,y,X1,X2,...,Xn
env2<-read.delim("env2.txt",header=TRUE,sep="\t")

## remove NAs and make sure all variables are imported as numbers
env1<-humboldt.scrub.env(env1)
env2<-humboldt.scrub.env(env2)

## load occurrence sites for the species in study area 1 (env1). Column names should be 'sp','x','y'
occ.sp1<-na.exclude(read.delim("sp1.txt",header=TRUE,sep="\t"))

## load occurrence sites for the species in study area 2 (env2). Column names should be 'sp','x','y'
occ.sp2<-na.exclude(read.delim("sp2.txt",header=TRUE,sep="\t"))

## perform modeling to determine important variables
reduc.vars<-humboldt.top.env(env1=env1, env2=env2, sp1=occ.sp1, sp2=occ.sp2,
  rarefy.dist=40, rarefy.units="km", env.reso=0.0833338,
  learning.rt1=0.01, learning.rt2=0.01, e.var=(3:21), pa.ratio=4,
  steps1=50, steps2=50, method="contrib", contrib.greater=5)

## use the reduced variables (reduc.vars$env1 and reduc.vars$env2) as you normally would use env1/env2 (input above)

## for example, use them as input when converting geographic space to espace
##zz<-humboldt.g2e(env1=reduc.vars$env1, env2=reduc.vars$env2....
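
Before running the model, it can help to confirm that the input tables follow the column layouts listed under Arguments. The sketch below uses only base R. check_inputs is a hypothetical helper, not part of humboldt, and the toy tables are invented. Treating e.var as column indices that start after x and y is an assumption based on the example above.

```r
# Hypothetical helper (not part of humboldt): checks the documented layouts.
check_inputs <- function(env, sp) {
  stopifnot(is.data.frame(env), is.data.frame(sp))
  # env: x and y first, followed by at least one environmental variable
  env_ok <- ncol(env) >= 3 && identical(names(env)[1:2], c("x", "y"))
  # sp: must contain the columns 'sp', 'x' and 'y'
  sp_ok <- all(c("sp", "x", "y") %in% names(sp))
  # env: every column should be numeric
  num_ok <- all(vapply(env, is.numeric, logical(1)))
  c(env_columns = env_ok, sp_columns = sp_ok, env_numeric = num_ok)
}

# Invented toy tables in the documented layout.
env_toy <- data.frame(x = c(-75.1, -75.2), y = c(-4.0, -4.1), bio1 = c(25.3, 24.8))
sp_toy  <- data.frame(sp = "sp1", x = -75.1, y = -4.0)

print(check_inputs(env_toy, sp_toy))

# Assumption: e.var indexes the environmental columns, which start at
# column 3 when x and y come first.
e.var <- 3:ncol(env_toy)
```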

jasonleebrown/humboldt documentation built on Jan. 4, 2024, 7:46 a.m.