knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.retina = 3 ) suppressPackageStartupMessages(library(ggplot2)) theme_set(theme_light())
Density Surface Modelling (DSM) consist in adjusting a Generalized Additive Model (GAM) on count data, while taking simultaneously into account imperfect detection (with distance sampling data) and covariates to obtain smoothed spatial maps of density. Once a model is adjusted, if its fit to the data is good enough, it can be used to predict the response variable (density of animals) over large areas, and possibly outside of the surveyed area.
geffaeR uses the DSM package where more information and a clear explanation on the use of Density Surface Models can be found (e.g. http://distancesampling.org/R/vignettes/mexico-analysis.html). In this vignette it is presented how to fit a simple DSM with geffaeR and what is needed to do it applied on an applied example.
The vignette is organized in two main parts:
To illustrate how to fit DSMs in geffaeR, it is proposed to use the DSM_pack_MOLMOL included as data within geffaeR. It consist in a list of list bringing together all files needed for a DSM analysis.
suppressPackageStartupMessages(library(geffaeR)) str(DSM_pack_MOLMOL[[1]], max.level = 1) str(DSM_pack_MOLMOL[[2]], max.level = 3) str(DSM_pack_MOLMOL[[3]], max.level = 2) str(DSM_pack_MOLMOL[[4]], max.level = 2) str(DSM_pack_MOLMOL[[5]], max.level = 1)
The sunfish, Mola mola, is a marine fish species that can reach up to 1 ton! It can live in temperate and tropical waters. It have a particular shape that makes it easy to recognize :
knitr::include_graphics("Mola_mola.jpg")
There are some speculations that sunfish densities in the oceans have increased, but the evidence is scarce and indirect. Here the aim is to estimate the density of sunfish is the North-Western Mediterranean Sea.
The data were collected during the SAMM surveys (Suivi Aérien de la Mégafaune Marine, https://www.observatoire-pelagis.cnrs.fr/observatoire/Suivi-en-mer/suivi-aerien/) in 2011 and 2012. These surveys consist in flying over the sea in a plane (equiped with bubble windows) at a height of $180$ metres above sea level and at a speed of $180$ kilometers per hour along pre-defined transects. As the transects are flown over, two observers on either side of the plane scan the sea surface for sightings of marine megafauna.
suppressPackageStartupMessages(library(sf)) suppressPackageStartupMessages(library(tidyverse)) distdata <- DSM_pack_MOLMOL$list_prepare_obs_MOLMOL$MOLMOL_obs_output$distdata distdata_detected <- distdata %>% filter(detected == 1) ggplot() + geom_sf(data = NEA_simplified_FR %>% st_as_sf(), mapping = aes()) + geom_rect(aes(xmin = 2, xmax = 12, ymin = 40, ymax = 45), fill = alpha("white",0), colour = "black", size = 2) + ggtitle("Global view of the study area") + theme(plot.margin = margin(0, 0, 0, 0, "cm"))
segdata <- DSM_pack_MOLMOL$effort_output$segdata ggplot() + geom_sf(data = NEA_simplified_FR %>% st_as_sf(), mapping = aes()) + coord_sf(xlim = c(2,12), ylim = c(40,45)) + geom_point(data = segdata, aes(x = longitude, y = latitude), colour = alpha("black",0.5)) + ggtitle("Sampling effort in SAMM campaign \nin the mediterranean sea") + theme(plot.margin = margin(0, 0, 0, 0, "cm")) ggplot() + geom_sf(data = NEA_simplified_FR %>% st_as_sf(), mapping = aes()) + coord_sf(xlim = c(2,12), ylim = c(40,45)) + geom_point(data = distdata_detected, aes(x = longitude, y = latitude), colour = alpha("black",0.5)) + ggtitle("Observation of MOLMOL in SAMM \ncampaign in the mediterranean sea") + theme(plot.margin = margin(0, 0, 0, 0, "cm"))
suppressPackageStartupMessages( expr = { library(calendR) library(lubridate) } ) # 2011 segdata_2011 <- segdata %>% select(date, session) %>% mutate(day = yday(date), year = year(date)) %>% filter(year == 2011) %>% group_by(day, session) %>% summarise(n=n(), .groups = 'drop') spday_2011 <- rep(NA, yday("2011-12-31")) # Add the events to the desired days spday_2011[segdata_2011$day] <- as.character(segdata_2011$session) cal_2011 <- calendR( year = 2011, special.days = spday_2011, special.col = "blue", # Add as many colors as events legend.pos = "bottom", orientation = "portrait", start = "M") # 2012 segdata_2012 <- segdata %>% select(date, session) %>% mutate(day = yday(date), year = year(date)) %>% filter(year == 2012) %>% group_by(day, session) %>% summarise(n=n(), .groups = 'drop') spday_2012 <- rep(NA, yday("2012-12-31")) # Add the events to the desired days spday_2012[segdata_2012$day] <- as.character(segdata_2012$session) cal_2012 <- calendR( year = 2012, special.days = spday_2012, special.col = c("blue","red"), # Add as many colors as events legend.pos = "bottom", orientation = "portrait", start = "M") print(cal_2011) print(cal_2012)
The SAMM surveys were carried out twice so far but in this vignette, only data from 2011 and 2012 will be used. These data correspond to two seasons: winter (blue) and summer (red).
To fit some simple DSM with geffaeR, $4$ objects are necessary:
prepare_effort()
)prepare_obs()
)plot_detection()
To also predict, a fifth object will be needed:
The object collates all the locations at which we may want to predict sunfish densities from a fitted DSM.
$6$ functions in geffaeR are used in the DSM process (and $1$ is optional for using a soap smooth, dashed in the flowchart below) :
knitr::include_graphics("DSM_functions.png")
The first step consist in imputing missing values of covariates of segdata and gridata. Missing data imputation is done either using the Amelia package (https://cran.r-project.org/web/packages/Amelia/index.html) or the missMDA package (https://cran.r-project.org/web/packages/missMDA/). Missing data imputation is used to ensure that a complete dataset with no missing values for any of the covariates will be used during model fitting. It is up to the user to check that the imputed values are realistic and plausible. Here we will consider two physiographic covariates, seafloor depth and distance to the nearest coastline, and one environmental covariate, chlorophyll a concentration averaged over a month. We can choose to use a logarithm transformation during missing data imputation to ensure that only positive values will be imputed. Here the missMDA package is used by specifying imputation = "PCA"
below.
library(geffaeR) # get segdata segdata <- DSM_pack_MOLMOL$effort_output$segdata # get grid grid <- DSM_pack_MOLMOL$gridata # get shape shape <- DSM_pack_MOLMOL$shape_med predata_output <- prep_predata(segdata = segdata, gridfile_name = grid, varenviro = "CHL_month", do_log_enviro = "CHL_month", varphysio = c("depth","distCot"), do_log_physio = c("depth", "distCot"), imputation = "PCA", shape = shape, verbose = F ) # predata_output$predata %>% # ggplot() + # geom_point(aes(x = longitude, y = latitude))
The output of a call to this function is a wee bit complicated:
str(predata_output, max.level = 1)
Once a complete dataset for covariates is available and satisfactory, sightings of the species of interest, namely the sunfish can be join to these data. Note that the effort data here has been segmented in a pre-processing step. The function to join sightings to the covariate data is named, pardon my French! ajout_obs:
effort_w_obs <- ajout_obs( countdata_leg = DSM_pack_MOLMOL$list_prepare_obs_MOLMOL$MOLMOL_obs_output$countdata_leg, legdata = DSM_pack_MOLMOL$effort_output$legdata, countdata_seg = DSM_pack_MOLMOL$list_prepare_obs_MOLMOL$MOLMOL_obs_output$countdata_seg, segdata = predata_output$segdata ) str(effort_w_obs, max.level = 2)
The dataframe segdata in effort_w_obs has now two additional 2 column: one called "n" corresponding to the number of sightings made on segment of effort; and "y" corresponding to the total number of individuals counted.
The next step consist in fitting all possible models and choose the best model to use for DSM with fit_all_dsm
function. The minimum requirements to use this function are :
prep_predata
) and with observation (output of ajout_obs
).plot_detection
, (see the vignette : I_Preparation_of_data_CDS_and_kriging.Rmd)# get segdata_completed_w_obs segdata_completed_w_obs <- effort_w_obs$segdata_obs # get distFit distFit <- DSM_pack_MOLMOL$list_plot_detection_MOLMOL$plot_detectionMOLMOL$all_session$distFit # get obsdata obsdata <- DSM_pack_MOLMOL$list_prepare_obs_MOLMOL$MOLMOL_obs_output$obsdata # get covariates on which we fit gam model (predictors) predictors <- c("CHL_month", "distCot", "depth") # fit models fit_dsm_basic <- fit_all_dsm(distFit = distFit, segdata_obs = segdata_completed_w_obs, obsdata = obsdata, predictors = predictors ) obsdata_w_coords <- obsdata %>% left_join(segdata %>% select(longitude, latitude, Seg), by = "Seg") ggplot() + geom_sf(data = shape %>% st_as_sf()) + geom_point(data = obsdata_w_coords, aes(x = longitude, y = latitude)) + facet_wrap(vars(session)) dsm_model <- fit_dsm_basic$best_models4plotting[[1]] plot(dsm_model)
A call to fit_all_dsm() yields as output a table that sums up model fitting according to different indicators and 2 list of the k best models (5 by default) sorted by their "stacking_weights". Both lists are the same but one has standardized covariates (that is mean-centered and unit-variance, labelled best_models) and the other has the raw values (best_models4plotting, which will be used for plotting purposes).
suppressPackageStartupMessages( library(kableExtra) ) fit_dsm_basic$all_fits_binded %>% kable("html") %>% kable_styling(font_size = 12)
One does want to visualize splines of each variable fitted with fit_all_dsm
, and it is possible with pred_splines
.
# get first model of fit_dsm fit_dsm_basic_1 <-fit_dsm_basic$best_models4plotting[[1]] splines <- pred_splines( segdata = segdata_completed_w_obs, dsm_model = fit_dsm_basic_1 ) str(splines, max.level = 1)
# Visualize variable splines splines$g_splines
This step consists in getting predicted values from models fitted with the output of fit_all_dsm.
# get predata predata <- predata_output$predata predicted_models <- predict_all( listdsm = fit_dsm_basic, predata = predata ) str(predicted_models)
predict_all
gives a data.frame of predicted values and standard error of the k best models of fit_all_dsm
with a weighted combination of the k best named 'stacking' in the model column. By default, stacking is used on the 5 models with the smallest AIC.
predict_map
allows to visualize prediction of predict_all
.
# get combination of all models pred_best_1 <- predicted_models %>% filter(model == "best_1") # Set predata as sf object predata_custom <- predata %>% mutate(stack_pred = pred_best_1$mean) %>% st_as_sf(coords = c("longitude", "latitude"), crs = 4326) # Build prediction map map_pred <- pred_map_dsm( predata = predata_custom, grid = DSM_pack_MOLMOL$gridata %>% st_transform("+proj=aea +lat_1=43 +lat_2=62 +lat_0=30 +lon_0=10 +x_0=0 +y_0=0 +ellps=intl +units=m +no_defs"), var = "stack_pred", title = substitute(expr = paste("Density prediction of ", italic("Mola mola"), " for 2011 and 2012 during SAMM campaign")) # title = "Density prediction of Mola mola for 2011 and 2012 during SAMM campaign" ) map_pred
The sampling occurred during 2 seasons (summer and winter) and we can expect difference in sunfish densities between summer and winter. Covariate effects may also be different between winter and summer, and we may want to allow for this possibility during modelling.
library(stringr) # get distdata distdata <- DSM_pack_MOLMOL$list_prepare_obs_MOLMOL$MOLMOL_obs_output$distdata # map of Mola mola occurence per season distdata %>% filter(detected != 0) %>% mutate(session = stringr::str_extract(session, "[:alpha:]+")) %>% ggplot() + geom_point(aes(x = X, y = Y, size = size, colour = size)) + geom_sf(data = NEA_simplified_FR %>% st_as_sf(), mapping = aes()) + coord_sf(xlim = c(2,12), ylim = c(40,45)) + facet_grid(.~session) + scale_colour_viridis_c() + theme_bw()
What has to be added to the simple DSM process ?
prep_predata
with col2keep = "session"fit_all_dsm
with splines_by = "session"pred_splines
with splines_by = "session"pred_map_dsm
with facet_param = "session"library(geffaeR) library(tidyverse) # get segdata segdata <- DSM_pack_MOLMOL$effort_output$segdata %>% mutate(session = fct_drop(session)) # get grid grid <- DSM_pack_MOLMOL$gridata %>% rename(session = session_) %>% mutate(session = fct_drop(session)) # get shape shape <- DSM_pack_MOLMOL$shape_med # build predata outputs predata_output <- prep_predata(segdata = segdata, gridfile_name = grid, varenviro = "CHL_month", do_log_enviro = "CHL_month", varphysio = c("depth", "distCot"), do_log_physio = c("depth", "distCot"), shape = shape, imputation = "PCA", col2keep = "session", verbose = F ) effort_w_obs <- ajout_obs(countdata_leg = DSM_pack_MOLMOL$list_prepare_obs_MOLMOL$MOLMOL_obs_output$countdata_leg, legdata = DSM_pack_MOLMOL$effort_output$legdata, countdata_seg = DSM_pack_MOLMOL$list_prepare_obs_MOLMOL$MOLMOL_obs_output$countdata_seg, segdata = predata_output$segdata) # get segdata_completed_w_obs segdata_completed_w_obs <- effort_w_obs$segdata_obs # get distFit distFit <- DSM_pack_MOLMOL$list_plot_detection_MOLMOL$plot_detectionMOLMOL$all_session$distFit # get obsdata obsdata <- DSM_pack_MOLMOL$list_prepare_obs_MOLMOL$MOLMOL_obs_output$obsdata # get covariates on which we fit gam model (predictors) predictors <- c("CHL_month","distCot","depth") # fit models fit_dsm_splines_by <- fit_all_dsm(distFit = distFit, segdata_obs = segdata_completed_w_obs, obsdata = obsdata, predictors = predictors, splines_by = "session" ) # get first model of fit_dsm fit_dsm__splines_by_1 <- fit_dsm_splines_by$best_models4plotting[[1]] splines <- pred_splines( segdata = segdata_completed_w_obs, dsm_model = fit_dsm__splines_by_1, splines_by = "session" ) splines$g_splines plot(fit_dsm__splines_by_1)
Now we can visualize the differences between winter and summer in the estimated relationships with covariates. Predictions can also be made for each season.
library(sf) # get predata predata <- predata_output$predata predicted_models <- predict_all( listdsm = fit_dsm_splines_by, predata = predata ) # get combination of all models pred_stack <- predicted_models %>% filter(model == "best_1") # Set predata as sf object predata_custom <- predata %>% mutate(stack_pred = pred_stack$mean) %>% st_as_sf(coords = c("longitude", "latitude"), crs = 4326) # Build prediction map map_pred_splines_by <- pred_map_dsm( predata = predata_custom, grid = DSM_pack_MOLMOL$gridata %>% st_transform("+proj=aea +lat_1=43 +lat_2=62 +lat_0=30 +lon_0=10 +x_0=0 +y_0=0 +ellps=intl +units=m +no_defs"), var = "stack_pred", title = substitute(expr = paste("Density prediction of ", italic("Mola mola"), " for 2011 and 2012 during SAMM campaign")), facet_param = "session" ) map_pred_splines_by
tab_basic <- fit_dsm_basic$all_fits_binded %>% slice(1) %>% mutate(type_model = "no_splines") tab_wo_weights <- fit_dsm_splines_by$all_fits_binded %>% slice(1) %>% mutate(type_model = "splined_by") tab_splines_by_vs_no_splines_by <- tab_basic %>% bind_rows(tab_wo_weights) tab_splines_by_vs_no_splines_by %>% kableExtra::kable("html") %>% kableExtra::kable_styling(font_size = 12)
Defining splines by session improve model AIC and explained deviance drastically in the case of Mola mola.
An option is implemented in fit_all_dsm
to give weights to observation. We used Counterfactuals methodology to give weights to observation, to summarise the principles, it considers that points which are far from the other points will have less credibility than points that are net to each other. Doing so, it will minimize speculation on outlying points and give more confidence on grouped points. We implemented this using package which allows to estimate weigths on points.
While using fit_all_dsm
it is just needed to add weighted = TRUE
, and that's it !
# fit models fit_dsm_splines_by_weighted <- fit_all_dsm(distFit = distFit, segdata_obs = segdata_completed_w_obs, obsdata = obsdata, predictors = predictors, splines_by = "session", weighted = TRUE ) # get first model of fit_dsm fit_dsm_splines_by_weighted_1 <- fit_dsm_splines_by_weighted$best_models4plotting[[1]] splines <- pred_splines( segdata = segdata_completed_w_obs, dsm_model = fit_dsm_splines_by_weighted_1, splines_by = "session" ) splines$g_splines
library(sf) # get predata predata <- predata_output$predata predicted_models <- predict_all( listdsm = fit_dsm_splines_by_weighted, predata = predata ) # get combination of all models pred_stack <- predicted_models %>% filter(model == "best_1") # Set predata as sf object predata_custom <- predata %>% mutate(stack_pred = pred_stack$mean) %>% st_as_sf(coords = c("longitude", "latitude"), crs = 4326) # Build prediction map map_pred_splines_by_weighted <- pred_map_dsm( predata = predata_custom, grid = DSM_pack_MOLMOL$gridata %>% st_transform("+proj=aea +lat_1=43 +lat_2=62 +lat_0=30 +lon_0=10 +x_0=0 +y_0=0 +ellps=intl +units=m +no_defs"), var = "stack_pred", title = substitute(expr = paste("Density prediction of ", italic("Mola mola"), " for 2011 and 2012 during SAMM campaign")), facet_param = "session" ) map_pred_splines_by_weighted
tab_basic <- fit_dsm_basic$all_fits_binded %>% slice(1) %>% mutate(type_model = "no_splines_no_weights") tab_wo_weights <- fit_dsm_splines_by$all_fits_binded %>% slice(1) %>% mutate(type_model = "splined_by_without_weights") tab_w_weigths <- fit_dsm_splines_by_weighted$all_fits_binded %>% slice(1) %>% mutate(type_model = "splined_by_with_weights") tab_weights_vs_no_weights <- tab_basic %>% bind_rows(tab_wo_weights, tab_w_weigths) tab_weights_vs_no_weights %>% kable("html") %>% kable_styling(font_size = 12)
In the case of Mola mola, giving weights to data seems to slightly improve the model.
Soap-film smoother allows to integrate the irregularity of the study area and consider it in the modelling process. When the study area is perfectly regular (rectangle or circle shape) with no island, it is not necessary to use soap-film smoother, but in the other cases it could be interesting. A good example of the utility of soap-film smoother is presented on this .
In the case of Mola mola :
library(mgcViz) library(ggforce) get_isocline_plot <- function(dsm_model) { class(dsm_model) <- class(dsm_model)[class(dsm_model) != "dsm"] plot_viz <- getViz(dsm_model) %>% sm(1) %>% plot() plot_viz } pred_wo_soap <- get_isocline_plot(fit_dsm_splines_by_weighted$best_models4plotting[[1]]) + labs(title = "Corsica and Sardinia are not considered in dsm model") test <- pred_map_dsm( predata = predata %>% filter(session=="1winter") %>% mutate(col_null = 1), grid = st_crs("+proj=aea +lat_1=43 +lat_2=62 +lat_0=30 +lon_0=10 +x_0=0 +y_0=0 +ellps=intl +units=m +no_defs"), var = "col_null", title = "Map built with predata") + guides(fill = "none") + labs(subtitle = "Corsica and Sardinia are present in predata") pred_wo_soap test
We can see that island and edges of the study area are not considered while adjusting the DSM model, it can be source of error. To consider those subtilities, it is necessary to proceed to a soap-film smoother.
It is necessary to give knots and boundary of the area to dsm model. This require to create those two object and it is possible with geffaeR.
Boundary and Knots are created with prepare_soap
function. This function will create the boundary of the study area by considering the range of the coordinates given in segdata. To do so, it will use land contour and simplify it, and crop with the area created with the range of segdata coordinates. Because on boundary simplifcation, some observation can be discarded because those observation can be outside the boundary. In the same time, it will place knots in the boundary. Knots that are too close to boundary edges will be discarded (it causes issues while adjusting dsm model). We used autocruncher
, a function built by Simon Wood and David Miller, all credits goes to them.
If the study area is in North East Europe otherwise it is necessary to provide a contour of the study area or a polygon delimiting the study area : boundary (it is explained in the prepare_soap
function).
prep_soap <- prepare_soap(data = predata_output$segdata, N = 10, ratio_simplify = 1e-3) prep_soap$gg_cropped_data
prep_soap$gg_cropped_data
give a synthesis map with points and knots discarded, with :
prep_soap$df_cropped_data
prep_soap$knots
prep_soap$knots
prep_soap$bnd
and as sf object prep_soap$sf_ocean
.Next, it is just necessary to perform the same process previously done, specifying in fit_all_dsm
functions bnd and knots and use segdata of prep_soap
.
effort_w_obs <- ajout_obs( countdata_leg = DSM_pack_MOLMOL$list_prepare_obs_MOLMOL$MOLMOL_obs_output$countdata_leg, legdata = DSM_pack_MOLMOL$effort_output$legdata, countdata_seg = DSM_pack_MOLMOL$list_prepare_obs_MOLMOL$MOLMOL_obs_output$countdata_seg, segdata = prep_soap$df_cropped_data ) bnd <- prep_soap$bnd # necessary to put bnd in the Global Environment fit_dsm_splines_by_weighted_soap <- fit_all_dsm( distFit = distFit, segdata_obs = effort_w_obs$segdata_obs, obsdata = obsdata, predictors = predictors, splines_by = "session", weighted = TRUE, soap = list(knots = prep_soap$knots, bnd = bnd) )
tab_soap <- fit_dsm_splines_by_weighted_soap$all_fits_binded %>% slice(1) %>% mutate(type_model = "splines_weigths_soap") tab_soap_vs_no_soap <- tab_weights_vs_no_weights %>% bind_rows(tab_soap) tab_soap_vs_no_soap %>% kableExtra::kable("html") %>% kableExtra::kable_styling(font_size = 12)
isocline_soap <- fit_dsm_splines_by_weighted_soap$best_models4plotting[[1]] %>% get_isocline_plot() + labs(title = "Soap-film smoother") isocline_no_soap <- fit_dsm_splines_by_weighted$best_models4plotting[[1]] %>% get_isocline_plot() + labs(title = "NO Soap-film smoother") isocline_soap isocline_no_soap # obsdata %>% # left_join(segdata %>% # select(longitude, latitude, Seg), # by = "Seg") %>% # ggplot() + # geom_point(aes(x = longitude, y = latitude))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.