Factors: Identification of the factors affecting species richness

View source: R/Factors.R

FactorsR Documentation

Identification of the factors affecting species richness

Description

This function provides tools for the identification of most likely causal factors significantly correlated with species richness by following these steps: 1) quantifying the severity of multicollinearity among the factors, 2) estimating the contribution of factors to species richness by using hierarchical partitioning, 3) selecting the factors with higher contribution among auto-correlated factors and finally, 4) among all factors selected in the previous steps, selecting those factors significantly correlated with species richness using a stepwise multiple regression. Predictions and residuals of a support vector machine (SVM) model performed with the factors selected are shown on a map. Negative residuals may be potential areas with undiscovered and/or unregistered species, or areas with decreased species richness due to the negative effect of anthropogenic factors. This function uses an output file of ModestR with data of richness of the species and environmental variables in a cell size defined by the user.

Usage

Factors(data, varY, varX, Area="World", outliers=FALSE, quant1=0.05, quant2 = 0.95,
stepwise=FALSE, direction="both", deplog=FALSE, indlog=FALSE, zero=TRUE,
vifmethod="hier.part", threshold=10, PAIRS=TRUE, CoVa=FALSE, SVM=TRUE, Map=TRUE, TXT=TRUE,
jpg=TRUE, jpg1="Correlation matrix.jpg", jpg2="Contribution in the regression model.jpg",
jpg3="Residuals.jpg", jpg4="Actual richness.jpg", jpg5="Predicted richness.jpg",
jpg6="Contribution by Hierarchical Partitioning-All variables.jpg", 
jpg7="Contribution by Hierarchical Partitioning-Selected variables.jpg", pairs=NULL,
colhist="#00FFFFFF", ksvm=NULL, typeV="lmg", rela=TRUE, b=1000, names.abbrev=15,
ylimV=NULL, mainV=NULL, cex.title=1.5, minLon, maxLon, minLat, maxLat, colbg="#FFFFFF",
colcon="#C8C8C8",colf="black", pro=TRUE, inc=0.005, exclude=NULL, colexc=NULL,
colfexc="black", colscale1=c("#FFFFFFFF","#FF0080FF","#FF00FFFF","#00FFFFFF",
"#64FF64FF","#C8FF00FF","#FFFF00FF","#FFC800FF","#FF6400FF","#FF0000FF"),
colscale2=c("#FFFFFFFF","#C8FFFFFF","#64FFFFFF","#00FFFFFF","#64FF64FF",
"#C8FF00FF", "#FFFF00FF","#FFC800FF","#FF6400FF","#FF0000FF"), legend.pos="y",
breaks=10, xl=0, xr=0, yb=0, yt=0, asp, lab=NULL, xlab="Longitude",
ylab="Latitude", main1="Residuals of SVM model", main2="Actual richness",
main3="Richness predicted by SVM model",  cex.main=1.6, cex.lab=1.4,
cex.axis=1.2, family="sans", font.main=2, font.lab=1, font.axis=1, lwdP=0.6,
lwdC=0.1, trans=c(1,1), log=c(0,0), ndigits=0, ini, end=NULL, file1="Output.txt",
CSV1="Residuals of SVM model.csv", CSV2="Actual richness.csv",
CSV3="Richness predicted by SVM model.csv", 
CSV4="Residuals of the multiple regression model.csv", na="NA", dec=",",
row.names=FALSE)

Arguments

data

A matrix or data frame with the species richness and, environmental variables and/or biogeographical indexes (see details section).

varY

Dependent variable. In the CSV obtained using the software ModestR, this variable is Species.Richness.

varX

Independent variables.

Area

A character with the name of the administrative area or a vector with several administrative areas (see details).

outliers

If TRUE the outliers are removed.

quant1

Lower quantile for removing outliers.

quant2

Higher quantile for removing outliers.

stepwise

If TRUE applies a stepwise regression to eliminate those variables that are not significant in the multiple regression. It uses the Akaike Information Criterion (AIC)) to define what are the variables that are excluded. The variables selected are also used in the SVM model.

direction

The mode of stepwise search, can be one of: "both" (default), "backward" or "forward".

deplog

If TRUE the logarithm is applied to the dependent variable using the following equation: ln(variable-(min(variable)-1)).

indlog

If TRUE the logarithm of all independent variables is calculated and these transformed variables are added to the other variables in order to perform the multiple regression and SVM models with the transformed and untransformed independent variables. It uses the equation described in the argument deplog.

zero

If TRUE (default) the zero values of the dependent variable (cells without species) are included in both multiple regression and SVM models.

vifmethod

There are four methods for selecting variables uncorrelated: "hier.part" (default), "vif", "vifcor" and "vifstep" (see details). If NULL correlated variables are not eliminated and, therefore, included in both the multiple regression and SVM models.

threshold

A number specifying the VIF threshold(see details section).

PAIRS

If TRUE (default) a matrix of scatterplots is produced.

CoVa

If TRUE the plot showing the relative contribution of the independent variables in the multiple regression model is shown (only if the number of independent variables is greater than two). It is a slow process and, therefore, it is advisable to use a small number of variables (maximum 15).

SVM

If TRUE (default) the Support Vector Machine (SVM) analysis is performed. It is a time-consuming process.

Map

If TRUE (default) the maps with the residuals of the SVM model, actual richness and richness predicted by SVM model are shown.

TXT

If TRUE (default) the TXT file with the statistical analyses is generated.

jpg

If TRUE (default) the plots are exported to jpg files instead of using the windows device.

jpg1

Name of the jpg file with the matrix of scatterplots.

jpg2

Name of the jpg file with the plot of the contribution of the independent variables in the regression model.

jpg3

Name of the jpg file with the map of the residuals of the SVM model.

jpg4

Name of the jpg file with the map of the actual richness.

jpg5

Name of the jpg file with the map of the richness predicted by SVM model.

jpg6

Name of the jpg file with the plot of the contribution of all independent variables estimated by hierarchical partitioning.

jpg7

Name of the jpg file with the plot of the contribution of selected independent variables estimated by hierarchical partitioning after removing auto-correlated variables.

pairs

It allows access to the function pairs to modify the arguments of the matrix of scatterplots.

colhist

The filling color for the histograms in the matrix of scatterplots.

ksvm

It allows access to the function ksvm to modify the arguments of the SVM model.

typeV

Plot with the relative contribution of each independent variable of the argument CoVa. It can be a character string, character vector or list of character strings. It is the collection of metrics that are to be calculated. Available metrics: lmg, pmvd (non-US version only), last, first, betasq, pratt, genizi and car. For brief sketches of their meaning see details section of the function calc.relimp.

rela

Plot with the relative contribution of each independent variable of the argument CoVa. It is a logical argument requesting relative importances summing to 100% (rela=TRUE, default). If rela is FALSE, some of the metrics sum to R^2 (lmg, pmvd, pratt), others do not have a meaningful sum (last, first, betasq).

b

Plot with the relative contribution of each independent variable of the argument CoVa. It is the number of bootstrap runs requested on boot.relimp (default: b=1000). Make sure to set this to a lower number, if you are simply testing code.

names.abbrev

Plot with the relative contribution of each independent variable of the argument CoVa. It is an integer that provides the number of characters to which the bar labels are shortened.

ylimV

Plot with the relative contribution of each independent variable of the argument CoVa. The plot routines try to use appropriate scaling. If adjustments are needed, ylim can be used as usually on a plot.

mainV

Plot with the relative contribution of each independent variable of the argument CoVa. The plot routine uses a default title based on the reponse name. If adjustments are desired, main can be used for specifying a different title. Note that only the first title is affected.

cex.title

Plot with the relative contribution of each independent variable of the argument CoVa. It specifies the text size for the overall title. The par option cex.main can be used for specifying the size of individual plot titles.

minLon, maxLon

Maps. Optionally it is possible to define the minimum and maximum longitude (see details).

minLat, maxLat

Maps. Optionally it is possible to define the minimum and maximum latitude (see details).

colbg

Maps. Background color of the map (in some cases this is the sea).

colcon

Maps. Background color of the administrative areas.

colf

Maps. Color of administrative areas border.

pro

Maps. If it is TRUE an automatic calculation is made in order to correct the aspect ratio y/x along latitude.

inc

Maps. Adds some room along the map margins with the limits x and y thus not exactly the limits of the selected areas.

exclude

Maps. A character with the name of the administrative area or a vector with several administrative areas that may be plotted with a different color on the map.

colexc

Maps. Background color of areas selected in the argument exclude.

colfexc

Maps. Color of borders of the areas selected in the argument exclude.

colscale1

Maps. Palette color of the map with the residuals of the SVM model.

colscale2

Maps. Palette color of the maps with the actual and richness predicted by SVM model.

legend.pos

Maps. Whether to have a horizontal "x" or vertical "y" color gradient.

breaks

Maps. Number of breakpoints of the color legend.

xl,xr,yb,yt

Maps. The lower left and upper right coordinates of the color legend in user coordinates.

asp

Maps. The y/x aspect ratio.

lab

Maps. A numerical vector of the form c(x, y) which modifies the default way that axes are annotated. The values of x and y give the (approximate) number of tickmarks on the x and y axes.

xlab

Maps. A title for the x axis.

ylab

Maps. A title for the y axis.

main1

Maps. An overall title for the map of the residuals.

main2

Maps. An overall title for the map of the actual richness.

main3

Maps. An overall title for the map of the richness predicted by SVM model.

cex.main

Maps. The magnification to be used for main titles relative to the current setting of cex.

cex.lab

Maps. The magnification to be used for x and y labels relative to the current setting of cex.

cex.axis

Maps. The magnification to be used for axis annotation relative to the current setting of cex.

family

Maps. The name of a font family for drawing text.

font.main

Maps. The font to be used for plot main titles.

font.lab

Maps. The font to be used for x and y labels.

font.axis

Maps. The font to be used for axis annotation.

lwdP

Maps. Line width of the plot.

lwdC

Maps. Line width of the borders.

trans

Maps. It is possible to multiply or divide the dataset by a value. For a vector with two values, the first may be 0 (divide) or 1 (multiply), and the second number is the value of the division or multiplication.

log

It is possible to apply a logarithmic transformation to the dataset. For a vector with two values, the first may be 0 (do not log transform) or 1 (log transformation), and the second number is the value to be added in case of log transformation.

ndigits

Maps. Number of decimals in legend of the color scale.

ini

Maps. Minimum to be considered in the maps of the actual and predicted richness.

end

Maps. Maximum to be considered in the maps of the actual and predicted richness.

file1

TXT file. Name of the txt file with the results of the statistical analysis.

CSV1

CSV file. Name of the CSV file with the residuals of SVM model.

CSV2

CSV file. Name of the CSV file with the values of the actual richness.

CSV3

CSV file. Name of the CSV file with the values of richness predicted by SVM model.

CSV4

CSV file. Name of the CSV file with the residuals of the multiple regression model.

na

CSV file. The string to use for missing values in the data.

dec

CSV file. The string to use for decimal points in numeric or complex columns: must be a single character.

row.names

CSV file. Either a logical value indicating whether the row names of x are to be written along with x, or a character vector of row names to be written.

Details

The file required in the argument data may be obtained using ModestR, as it is shown in the following screenshot (Export/Export maps of the select branch/To RWizard Applications/To FactorsR).

The menu shown in the following screenshot is obtained, where it is possible to select one or several biogeographical indexes (Rarity index, AOO index, etc.) and one or several environmental variables, in addition to species richness.

The format of the CSV file obtained is shown in the following screenshot, and this CSV file is the one required in this argument.

If the argument Area = "World" (default) the entire world is plotted in the maps of the residuals, actual richness and predicted richness.

Clicking on the icon world, as it is shown in the following screenshot, would display a menu with all countries and their regions. It is possible to select one or several countries and/or regions and the selected administrative areas would be only shown on the map.

If the arguments minLon, maxLon, minLat and maxLat are not specified, they are calculated automatically based on the countries and/or regions selected. The latitude and longitude of the map may be delimited, by just specifying the arguments minLon, maxLon, minLat and maxLat.

In the function four different methods for selecting uncorrelated variables can be chosen. If "vif" is selected, a pop-up window comes out where variables with VIF above threshold indicated by the user in the argument threshold are reported, and the user decides which variables are removed. The advantage of the manual method is that the user decides which variables to delete from the correlated pairs.

In the "vifcor" and "vifstep" options, an automatic selection of the variables to exclude is made, which is obviously much faster and comfortable. If "vifcor" is selected, you must first find the pair of variables that has the maximum linear correlation, and eliminate those having the greater VIF, and the process should be repeated until there is no variable with a high correlation coefficient with another variable. If "vifstep" is selected, the VIF must be calculated for all variables, and exclude which has the greater VIF, always considering the threshold defined by the user in the argument threshold, the process should be repeated until no variable is correlated with another. The problem of "vifcor" and "vifstep" methods is that these are not taken into account when eliminating variables, which is the one that has a greater contribution to explain the observed variability in the dependent variable and, therefore, are not the best methods for obtaining the best regression model.

The default is "hier.part" in which the contribution of all independent variables to the regression model is calculated by the hierarchical partitioning method (Chevan & Sutherland, 1991) and subsequently between the variables that are above VIF threshold, eliminating those that contribute less to the regression model. Therefore, among the auto-correlated variables, those that best explain the observed variability in the dependent variable are selected. The problem is that one can only use this method if the number of dependent variables, including those log-transformed and untransformed, is not greater than 12. If the number is greater, then the "vifstep" method is automatically applied.

The default option of the argument outliers=TRUE, means that outliers are eliminated.

If the argument stepwise=TRUE is selected, it means that a regression is performed by steps using Akaike Information Criterion (AIC) to define which are the variables that are excluded from the regression model, since they do not contribute to explain the dependent variable. The criterion of Akaike (AIC) is used to decide how many explanatory variables are chosen, and what they should be. The method of Akaike values the goodness of the model by setting a penalty because of their complexity, so a simpler model is preferable to another with more independent variables that only explains a small additional portion of the variability of the dependent variable. In general AIC=2k-2ln(L), where k is the number of explanatory variables or parameters of the model and L the likelihood or probability associated with the sample used in accordance with model, so that AIC is smaller than the lower the number of variables and the greater the likelihood; between alternative models the one with the lowest value of AIC must be chosen. At each step the variable that, when it is incorporated into the model, makes smaller the value AIC is chosen, until any unused variables allows to reduce it.

If the argument indlog = TRUE, the logarithm of all independent variables were calculated using the following equation LnVariable = ln (variable-(min (variable) -1)), and all these variables are added to the non-transformed ones. The aim is to perform a regression model in which the transformed and non-transformed variables are included, to find the best model, because sometimes the transformed variable shows a better relationship with the dependent variable that the not transformed logarithmically.

FUNCTIONS

The goodness of fit measures for the entire hierarchy of models using all combinations of N independent variables is calculated using the function hier.part of the package hier.part (Walsh & Mac Nally, 2014).

The function lm is used for performing the multiple regression and the function step for applying the stepwise regression, both of the package stats.

The function lillie.test of the package nortest (Gross, 2013) is used for performing Kolmogorov-Smirnov Normality test with Lilliefors correction.

The function dwtest of the package lmtest (Hothorn et al., 2013) that tests autocorrelation with Durbin-Watson test and the function bptest of the package lmtest (Hothorn et al., 2013) that tests homoscedasticity with Breusch-Pagan test are also used.

The functions vif, vifcor and vifstep of the package usdm (Naimi, 2014; Naimi et al., 2014) and vif of the package car (Fox & Weisberg, 2011; Fox et al., 2014) are used to estimate the variance inflation factor (VIF).

The function calc.relimp is used for the estimation of the relative contribution of the independent variables in the multiple regression (Gr<f6>mping, 2006; 2013).

The functions panel.hist and panel.reg of the package SciViews (Grosjean, 2014) and pairs are used for the matrix of scatterplots.

The function ksvm is used for the estimation of the SVM model of the package kernlab (Karatzoglou et al., 2004; Karatzoglou, 2014)

Finally, for the maps the functions image and color.legend of the package plotrix (Lemon et al., 2014) is necessary.

EXAMPLE

The range maps of shark species were elaborated following the methodology described by Guisande et al. (2013) and these maps were inputted into ModestR (Garc<ed>a-Rosell<f3> et al., 2013).

The Bio-ORACLE global dataset 90_N 90_S real values were used as environmental factors (Tyberghein et al., 2012). The variables used were sea surface temperature, nitrate, salinity and depth.

In addition to the variables mentioned above, the mean area of occupancy was also included (AOO index, see Guisande et al. 2013 for further details). In this example, the number of variables used was lower than it in Guisande et al. (2013) to shorten the execution of the script.

In the script the argument indlog=TRUE, so the logarithm of all independent variables is calculated and these transformed variables are added to the other variables in order to select, among transformed and untransformed independent variables, the variables with higher contribution to the model. The argument threshold=12, so higher than the defult value.

The first table of the following screenshot shows VIF values higher than 12 and, therefore, there is multicollinearity among the independent variables, which is logical for this to happen because there is the same variable transformed and untransformed, for instance <<Temperature>> and <<LNTemperature>>. The second table shows the contribution of the variables to the model estimated by using hierarchical partitioning. As the script uses the default option vifmethod="hier.part", the auto-correlated variables (variables with VIF values higher than the threshold) are removed following the criteria of selecting the variable with higher contribution to the model between two auto-correlated variables. The third table shows the variables selected, which are not auto-correlated.

The multiple regression shows that sea surface temperature, nitrate, salinity, depth and AOO index explained 66.8% of the variance observed in species richness of sharks. The following figures show the contribution of all variables to the model and the contribution of the variables selected once those auto-correlated variables have been removed. The contribution is estimated by using hierarchical partitioning. Temperature was the variable with a higher contribution to the variance observed in the species richness of sharks.

The following figure shows the matrix of scatterplots among the dependent variable (Species richness) and the dependent variables selected in the stepwise regression (in this case all the independent variables used for the analysis).

The following map shows the species richness of sharks around the world with a raster of 60<b4> x 60<b4>.

The following map shows the species richness of sharks around the world with a raster of 60<b4> x 60<b4> predicted by SVM model. The coefficient of determination of the model performed with SVM showed that the variables used in the model explained the 96.7% of the variance observed in the species richness of sharks.

Residuals of the SVM model around the world in cells of 60<b4>x 60<b4> are shown in the following map. Negative residuals of the SVM model may be a potential indicator of areas with undiscovered and/or unregistered species, or areas with lower species richness due to the negative effect of anthropogenic factors. See Guisande et al. (2013) for further details about this example.

Value

A TXT file is obtained with the VIF values of all variables, the VIF of the selected variables after removing those with colinearity, the VIF of the variables that are finally in the regression model, the normality test, homogeneity of variances and homoscedasticity and the regression model. A matrix of scatterplots, a plot with the relative contribution of the independent variables estimated by using hierarchical partitioning and a plot with the relative contribution of the independent variables in the multiple regression if the argument CoVa=TRUE. Also maps of the residuals obtained from the SVM and, actual richness and richness predicted by the SVM model.

Author(s)

C<e1>stor Guisande Gonz<e1>lez, Universidad de Vigo, Spain.

References

Durbin, J. & Watson G.S. (1951) Testing for serial correlation in least squares regression. Biometrika, 38: 159-171.

Fox, J. & Weisberg, S. (2011) An R Companion to Applied Regression. Second Edition, Sage.

Fox, J., Weisberg, S., Adler, D., Bates, D., Baud-Bovy, G., Ellison, S., Firth, D., Friendly. M., Gorjanc, G., Graves, S., Heiberger, R., Laboissiere, R., Monette, G., Murdoch, D., Nilsson, H., Ogle, D., Ripley, B., Venables, W. & Zeileis, A. (2013) Companion to Applied Regression. R package version 2.0-21. Available at: https://CRAN.R-project.org/package=car.

Garc<ed>a-Rosell<f3>, E., Guisande, C., Gonz<e1>lez-Dacosta, J., Heine, J., Pelayo-Villamil, P., Manjarr<e9>s-Hern<e1>ndez, A., Vaamonde, A. & Granado-Lorencio, C. (2013) ModestR: a software tool for managing and analyzing species distribution map databases. Ecography, 36: 1202-1207.

Groemping, U. (2006) Relative Importance for Linear Regression in R: The Package relaimpo. Journal of Statistical Software, 17: 1-27.

Gr<f6>mping, U. (2014)Relative importance of regressors in linear models. R package version 2.2-2. Available at: https://CRAN.R-project.org/package=relaimpo.

Grosjean, P. (2014) SciViews GUI API - Main package. R package version 0.9-5. Available at: https://CRAN.R-project.org/package=SciViews.

Gross, J. (2014) Tests for Normality. R package version 1.0-2. Available at: https://CRAN.R-project.org/package=nortest.

Guisande, C., Patti, B., Vaamonde, A., Manjarr<e9>s-Hern<e1>ndez, A., Pelayo-Villamil, P., Garc<ed>a-Rosell<f3>, E., Gonz<e1>lez-Dacosta, J., Heine, J. & Granado-Lorencio, C. (2013) Factors affecting species richness of marine elasmobranchs. Biodiversity and Conservation, 22: 1703-1714.

Hothorn, T. et al., (2014) Testing Linear Regression Models R package version 0.9-33. Available at: https://CRAN.R-project.org/package=lmtest.

Karatzoglou, A., Smola, A., Hornik, K. & Zeileis, A. (2004). kernlab - An S4 Package for Kernel Methods in R. Journal of Statistical Software, 11: 1-20.

Karatzoglou, A. (2014) Kernel-based Machine Learning Lab. R package version 0.9-19. Available at: https://CRAN.R-project.org/package=kernlab.

Lemon, J., Bolker, B., Oom, S., Klein, E., Rowlingson, B., Wickham, H., Tyagi, A., Eterradossi, O., Grothendieck, G., Toews, M., Kane, J., Turner, R., Witthoft, C., Stander, J., Petzoldt, T., Duursma, R., Biancotto, E., Levy, O., Dutang, C., Solymos, P., Engelmann, R., Hecker, M., Steinbeck, F., Borchers, H., Singmann, H., Toal, T. & Ogle, D. (2014). Various plotting functions. R package version 3.5-7. Available at: https://CRAN.R-project.org/package=plotrix.

Naimi, B. (2014) Uncertainty analysis for species distribution models. R package version 3.5-0. Available at: https://CRAN.R-project.org/package=usdm.

Naimi, B., Hamm, N.A.S., Groen, T.A., Skidmore, A.K., & Toxopeus, A.G. (2014) Where is positional uncertainty a problem for species distribution modelling? Ecography, 37: 191-203.

Tyberghein L, Verbruggen H, Pauly K, Troupin C, Mineur F, de Clerck O (2012) Bio-ORACLE: a global environmental dataset for marine species distribution modelling. Global Ecology and Biogeography, 21: 272-281.

Walsh, C. & Mac Nally, R. (2014) Hierarchical Partitioning. R package version 1.0-4. Available at: https://CRAN.R-project.org/package=hier.part.

Examples

## Not run: 

data(Sharks)
data(adworld)

Factors(data = Sharks, varY = "Species.Richness", varX = c("AOO.Index",
"Bathymetry","Nitrate","Salinity","Temperature"), outliers=TRUE, stepwise=TRUE,
indlog = TRUE, threshold=12)

#Remove the data set
rm(Sharks)
rm(adworld)

## End(Not run)

FactorsR documentation built on March 18, 2022, 6:59 p.m.

Related to Factors in FactorsR...