```r
library(knitr)
knitr::opts_chunk$set(eval=TRUE, echo=TRUE, message=FALSE, warning=FALSE,
                      progress=FALSE)
```
The `dsm` package expects data to be in a specific format. In this example I'll show how to get data into that format and give some explanation as to why each of these requirements exists.
I recommend getting to grips with fitting models in `dsm` before reading this tutorial, so you know what the end point will look like before trying to grapple with the data. An example for Gulf of Mexico pantropical spotted dolphins is available here.
To run the code below, you will need to have the `ggplot2` and `sf` packages installed. You can do this with the following code:

```r
install.packages(c("ggplot2", "sf"))
```
The Rhode Island data required for this example can be downloaded below. These files should be placed in a folder called `data` for the code below to work.
When we collect distance sampling data, we need to record (1) the distance to an observation (along with other detectability covariates like weather), (2) the transect we are currently on, and (3) how much effort we expended on that transect (its length or number of visits).
If we want to fit a DSM, that information needs to be expanded to include where on the transect we are, so we can build the spatial model. For point transects this is simple, as we are always at the point, but for line transects we also need to know the position of the detection (or equivalently, how far along the transect we are).
More abstractly, we need information about both detections and effort, and we need to link these two. In `dsm` we use three objects to hold this information:

- The detection function model (`ddf.obj`) holds all the detections but ignores spatial information.
- The segment data (`segment.data`) holds the chunks of effort (parts of the transect or point visits) and their locations. It might also include environmental covariate information.
- The observation data (`observation.data`) links the detections to the segments, making sure each detection has a corresponding segment.

[Figure: the three tables and how they correspond to data collected along a transect.]
The above image gives an overview of how the different `data.frame`s interact, along with how that corresponds to the real-life situation of collecting data on a transect.
The rest of this example goes through each table in turn, explaining their construction.
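To see where we're heading, the three objects eventually slot into a `dsm` call something like the following sketch (assuming a fitted detection function `df_ht`, and the `segs_df` and `obs` tables built below):

```r
# a sketch of the end point: the three data objects passed to dsm()
library(dsm)
mod <- dsm(count ~ s(X, Y), ddf.obj = df_ht,
           segment.data = segs_df, observation.data = obs)
```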
The data requirements are very similar to those for the packages `mrds` and `Distance`. In both cases, each detection has a corresponding row. For an analysis using `dsm` we have additional requirements in the data (these can be auto-generated by `Distance`, but we recommend against that, to ensure that you know records link up correctly). For `Distance` we need:
- `distance`: the distance to the detection
- `object`: a unique detection identifier (which will link to the observation table)
- `size`: the number of individuals in the detection (1 if objects occur alone)

`mrds` requires the following extra columns (to allow for double observer surveys):
- `detected`: 1 if detected by the observer and 0 if missed (always 1 for single observer surveys)
- `observer`: observer number, 1 or 2 (always 1 for single observer surveys)

If other covariates that affect detectability are recorded (e.g., sex or weather), these can be included in this data for analysis in `mrds` or `Distance`.
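As a concrete (entirely hypothetical) illustration of this format, a few rows of single-observer detection data might look like:

```r
# a made-up illustration of the detection data format described above;
# these values are not from the Rhode Island survey
distdata <- data.frame(object   = 1:4,                   # unique detection IDs
                       distance = c(70, 320, 150, 910),  # metres
                       size     = c(1, 2, 1, 5))         # group sizes
```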
Once the model is fitted, we deal only with that fitted model object in the `dsm` analysis.
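To make that concrete, a minimal detection function fit with `Distance` might look something like the following sketch (not part of the Rhode Island analysis; the hazard-rate key and the hypothetical `distdata` table above are illustrative assumptions):

```r
# a minimal sketch of fitting a detection function to the hypothetical
# distdata above; df_ht is the fitted object that dsm() will use later
library(Distance)
df_ht <- ds(distdata, truncation = 1000, key = "hr")
```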
I will use the term "segment" here to refer both to points for point transects and to small chunks of transect in the line transect case. I'll first describe the data format, assuming that you've already segmented your lines, then go on to an example of how to segment line transect data.
The `data.frame` required for the segments needs the following columns:

- `Effort`: the effort expended on this segment (either the length of the segment for lines, or the number of visits for points)
- `Sample.Label`: a unique identifier for the segment. If you are engaged in a complicated survey, this might take a format like "`YEAR-AREA-LINE-SEGMENT`", giving labels like "`2020-North-A-15`", which can be useful for keeping track of where data came from later.

In addition, environmental covariates like location and other relevant covariates (sea surface temperature, foliage type, altitude, bathymetry, etc.) can be included in this `data.frame` if they are to be used in the spatial model.
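For illustration, a couple of hypothetical segment rows might look like this (the `depth` covariate is just an example, not part of the data used below):

```r
# a made-up illustration of the segment data format
segdata <- data.frame(Sample.Label = c("2020-North-A-1", "2020-North-A-2"),
                      Effort       = c(2000, 1864),      # lengths in metres
                      X            = c(-71.50, -71.50),  # centroid coordinates
                      Y            = c(41.10, 41.12),
                      depth        = c(30, 42))          # example covariate
```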
If you have data as lines and you want to chop them up into segments, this section gives some sample R code to do so. The details are extremely dependent on the input data, but hopefully this provides at least a template for what you want to do.
Before you jump in: if you are fluent in ArcGIS or QGIS, I recommend doing this task in the software you are familiar with, at least at first. A simple guide to doing this task in QGIS is here, and for ArcGIS I recommend the MGET toolbox.
As an example, we will look at an aerial survey of seabirds off the coast of Rhode Island, USA. Data from these surveys were collected by the University of Rhode Island and analysed in @winiarski_spatially_2013, @winiarski_integrating_2014 and @winiarski_spatial_2014. I have the transects saved as a shapefile, which I will then "chop up" into segments. To start, let's plot the data with the coastline:
```r
library(ggplot2)
library(sf)

# coast data, just for reference
coastline <- read_sf("data/ri_coast.shp")
# our transect lines
transects <- read_sf("data/ri_transects.shp")

# now make the plot
p <- ggplot() +
  # geom_sf knows what to do with spatial data
  geom_sf(data=coastline) +
  geom_sf(data=transects, aes(colour=Transect)) +
  # chop to be on a better scale
  coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
  # make the plot look simpler
  theme_minimal()
print(p)
```
In the above code we have loaded the `sf` package. This is the package we will use to do most of our spatial work in segmenting the data. It has most of the tools expected from a GIS, though the package is still being developed, so its functionality is always increasing.
Investigating the `transects` data we loaded, we can see it consists of a `data.frame` with some extra bits: the spatial information, including locations (`geometry`) and projection (`CRS`):
```r
transects
```
We have one row per transect (for a total of 24 transects), each of which consists of a line joining the start and end points that make up the transect. If we want, we can plot individual rows via `plot(transects[1,])`; this can be useful to check that each row is a single line, or whether further processing is needed (this can be tricky and is not covered in this tutorial, but see "More information" below).
This data looks in good shape, so we can use the `st_segmentize` function to take each transect and make segments from it. To make sure everything works correctly, we need to project the data first. Here I'm using an appropriate projected coordinate system for the study area, EPSG code 6348 (NAD83(2011) / UTM zone 18N).
```r
# project transects
transects <- st_transform(transects, 6348)

# do the segmenting
segs <- st_segmentize(transects, dfMaxLength=units::set_units(2000, "metres"))

# transform back to lat/long
segs <- st_transform(segs, 4326)
transects <- st_transform(transects, 4326)
```
Here we know the truncation used for the detection function was 1000m (distances were collected in bins and this was the furthest bin). Since the truncation applies to either side of the transect and we're trying to make our segments approximately square, we set the segment length to twice the truncation: 2000m.
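Just to spell out the arithmetic (a trivial sketch; these variables are only for illustration and aren't used later):

```r
# the covered strip extends one truncation width either side of the line,
# so a roughly square segment is twice the truncation long
truncation <- 1000            # metres, the furthest distance bin
seg_length <- 2 * truncation  # 2000 m
```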
Looking at what that did:
```r
segs
```
Each row of `segs` now has many coordinates attached to it. Looking at just the first row (transect 1) and comparing the coordinates between `transects` and `segs`:

```r
st_coordinates(transects[1,])
```

`transects` has just two coordinates (the start and end points of the line), whereas:

```r
st_coordinates(segs[1,])
```

`segs` has 5, corresponding to our segment cut points.
We now need to break up the rows of `segs` into multiple rows, one per segment. This is a bit fiddly, so we use a function provided by dieghernan on their site to do this. We first load the function (don't worry about understanding the code!):
```r
# the function below uses the pipe operator, so we need magrittr loaded
library(magrittr)

stdh_cast_substring <- function(x, to = "MULTILINESTRING") {
  ggg <- st_geometry(x)

  if (!unique(st_geometry_type(ggg)) %in% c("POLYGON", "LINESTRING")) {
    stop("Input should be LINESTRING or POLYGON")
  }

  for (k in 1:length(st_geometry(ggg))) {
    sub <- ggg[k]
    geom <- lapply(
      1:(length(st_coordinates(sub)[, 1]) - 1),
      function(i) rbind(
        as.numeric(st_coordinates(sub)[i, 1:2]),
        as.numeric(st_coordinates(sub)[i + 1, 1:2])
      )
    ) %>%
      st_multilinestring() %>%
      st_sfc()

    if (k == 1) {
      endgeom <- geom
    } else {
      endgeom <- rbind(endgeom, geom)
    }
  }

  endgeom <- endgeom %>% st_sfc(crs = st_crs(x))

  if (class(x)[1] == "sf") {
    endgeom <- st_set_geometry(x, endgeom)
  }

  if (to == "LINESTRING") {
    endgeom <- endgeom %>% st_cast("LINESTRING")
  }

  return(endgeom)
}
```
We can then use it to separate the segments and look at the results:
```r
segs <- stdh_cast_substring(segs, to="LINESTRING")
segs
```
We now have `r nrow(segs)` segments between 1km and 2km long. We can check the lengths using the `st_length` function and plot a histogram of the segment lengths:
```r
hist(st_length(segs), xlab="Segment lengths (m)", main="")
```
Note that in setting the `dfMaxLength` argument to `st_segmentize` we are giving a rough guide to the segment length; the algorithm inside `st_segmentize` tries to make the segments as equal in length as possible (note the $x$ axis in the histogram).
Note also that the `Transect` column in `segs` now has duplicate entries, as each transect contains multiple segments. We can now create the columns required by `dsm`.
First, the `Effort` column can be generated using `st_length`:
```r
segs$Effort <- st_length(segs)
```
Then we can generate the `Sample.Label`s. Here I'll use a more complicated naming scheme to show how that's done, but one could simply write `segs$Sample.Label <- 1:nrow(segs)` and that would be sufficient. Instead, I would like my `Sample.Label` to follow the "`YEAR-AREA-LINE-SEGMENT`" scheme I suggested above.
There are fancier ways to do this, but for clarity, let's use a `for` loop:
```r
# create a dummy column that we can fill in as we go
segs$Sample.Label <- NA

# set the year and area once for this data
year <- 2010
area <- "RI"

# loop over the transect IDs
for(this_transect in unique(segs$Transect)){
  # how many segments in this transect?
  n_segs <- nrow(subset(segs, Transect==this_transect))
  # generate the n_segs labels that we need
  segs$Sample.Label[segs$Transect==this_transect] <- paste(year, area,
                                                           this_transect,
                                                           1:n_segs, sep="-")
}
```
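(For the curious, one of those "fancier ways": a vectorized sketch equivalent to the loop above, using `ave()` to number the segments within each transect.)

```r
# a sketch of a vectorized alternative to the loop: ave() generates the
# within-transect segment numbers, then paste() builds the labels in one go
seg_num <- ave(seq_len(nrow(segs)), segs$Transect, FUN = seq_along)
segs$Sample.Label <- paste(year, area, segs$Transect, seg_num, sep = "-")
```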
Now we can check that looks right:
```r
segs
```
With the required columns taken care of, we can also calculate the centroids of our segments, giving the location of each segment for use in the spatial model. Handily, we can use `st_centroid` to do this in one step and keep all our data at the same time. Calculating centroids on latitude/longitude coordinates produces a warning that they may not be correct, so again we need to project the data (to EPSG 6348) first, using `st_transform`.
```r
# save the line version of segs for plotting later
segs_lines <- segs

# project segs
segs <- st_transform(segs, 6348)
# find centroids
segs <- st_centroid(segs)
# project back to lat/long
segs <- st_transform(segs, 4326)

# how does that look?
segs
```
Notice that our geometry type is now `POINT`. We can plot these centroids on our previous map:
```r
p <- ggplot() +
  # geom_sf knows what to do with spatial data
  geom_sf(data=coastline) +
  geom_sf(data=transects) +
  geom_sf(data=segs) +
  # chop to be on a better scale
  coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
  # make the plot look simpler
  theme_minimal()
print(p)
```
At present, the `dsm` package doesn't know how to talk to `sf` objects, so as a final step we need to simplify `segs` to a regular `data.frame` that `dsm` can interpret. We can extract the coordinates of the centroids using `st_coordinates` and append them onto a `data.frame` version of the object with the geometry dropped, like so:
```r
segs_df <- cbind(as.data.frame(st_drop_geometry(segs)),
                 st_coordinates(segs))
```
We can double check that looks right:
```r
head(segs_df)
```
and we are done. See the next section for how to link the segments and the detections.
If we want to include spatial covariate information in the segment table, we could then use (for example) `rerddap` to obtain remotely-sensed data. For in situ data (i.e., weather conditions recorded while on effort), this can be more complicated and will require the use of `summarize`; see this thread for more information on how to deal with this situation.
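As a rough sketch of the remote-sensing route (not run here; the ERDDAP dataset ID, field name and dates are guesses for illustration only):

```r
# a sketch (not run) of pulling sea surface temperature for the study area
# with rerddap; "jplMURSST41", "analysed_sst" and the dates are assumptions
library(rerddap)
sst_info <- info("jplMURSST41")
sst <- griddap(sst_info,
               latitude  = c(40.8, 41.5),
               longitude = c(-72, -70.8),
               time      = c("2010-06-01", "2010-06-01"),
               fields    = "analysed_sst")
# each segment centroid would then be matched to its nearest SST grid cell
```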
To link together the data in the detection function and the spatial data giving the effort, we use the observation `data.frame`. This is really just a cross-reference between those two tables, ensuring each detection lives in one segment. The observation `data.frame` must have (at least) the following columns:
- `object`: a unique object identifier (corresponding to the identifiers in the detection function)
- `Sample.Label`: the identifier for the segment in which the observation occurred
- `size`: the size of each observed group (e.g., 1 if all animals occurred individually)
- `distance`: the distance to the observation

The observation data should have as many rows as the data used to fit the detection function.
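Continuing the hypothetical illustration from earlier, an observation table matching those four detections might look like:

```r
# a made-up illustration of the observation table; the labels follow the
# "YEAR-AREA-LINE-SEGMENT" scheme used for the segments
obsdata <- data.frame(object       = 1:4,  # matches the detection data
                      Sample.Label = c("2020-North-A-1", "2020-North-A-1",
                                       "2020-North-A-2", "2020-North-B-1"),
                      size         = c(1, 2, 1, 5),
                      distance     = c(70, 320, 150, 910))
```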
There are a few methods for building the observation table. Here I'll illustrate one: assigning detections to segments based on their distance (i.e., each detection is associated with its closest segment centroid). This is appropriate when you can't see very far forwards, so detections are unlikely to be misassigned. This is true, for instance, in an aerial survey where observers look out of windows on the sides of the plane, but it might be inappropriate for a shipboard survey where observers on the flying bridge can see far ahead of the ship.
First, we load the detection data for this survey (this is what we would use to fit the detection function, after processing):
```r
obs <- read.csv("data/loons.csv")
head(obs)
```
We can see there are columns for latitude and longitude (`Lat` and `Long`). We can make this `data.frame` an `sf` object by telling `sf` that these columns contain spatial coordinates (in `sf` lingo, "geometry"):
```r
obs <- st_as_sf(obs, coords=c("Long", "Lat"))
# set the coordinate system
st_crs(obs) <- 4326
obs
```
We can overlay this on our previous map of transects:
```r
p <- ggplot() +
  # geom_sf knows what to do with spatial data
  geom_sf(data=coastline) +
  geom_sf(data=transects) +
  geom_sf(data=obs, size=0.5) +
  # chop to be on a better scale
  coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
  # make the plot look simpler
  theme_minimal()
print(p)
```
Going back to our `segs` object above, we can use the `st_join` function, combined with the `st_nearest_feature` function, to control how the tables are linked. Again, we need to project the data first to avoid issues.
```r
# project segs
segs <- st_transform(segs, 6348)
obs <- st_transform(obs, 6348)

# do the join
obs <- st_join(obs, segs, join=st_nearest_feature)

# project back to lat/long
segs <- st_transform(segs, 4326)
obs <- st_transform(obs, 4326)

# how does that look?
obs
```
Now `obs` has acquired additional columns from `segs`, including `Sample.Label` (which we want) and `Effort` (which we don't). To check this worked okay, we can plot `obs` and `segs_lines` (the line version of the segments we saved earlier), colour coded by `Sample.Label`. I randomized the colour order so it would be easier to tell if observations were misallocated.
```r
# make a random colour palette to avoid similar colours being near each other
pal <- rainbow(nrow(segs), s=.6, v=.9)[sample(1:nrow(segs), nrow(segs))]

p <- ggplot() +
  # geom_sf knows what to do with spatial data
  geom_sf(data=coastline) +
  geom_sf(data=segs_lines, aes(colour=Sample.Label), pch=21) +
  geom_sf(data=obs, size=0.5, aes(colour=Sample.Label)) +
  # chop to be on a better scale
  coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
  scale_colour_manual(values=pal) +
  # make the plot look simpler
  theme_minimal() +
  theme(legend.position="none")
print(p)
```
As with `segs`, we can remove the unnecessary columns and geometry, keeping only the required columns:
```r
# get rid of "spatialness"
obs <- st_drop_geometry(obs)
# select the columns we need
obs <- obs[, c("object", "Sample.Label", "size", "Bin")]
head(obs)
```
(Note that since we have binned distance data here, we keep the `Bin` column; it will need further processing later.)
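That further processing might look something like the following sketch, assuming `Bin` is an integer bin index (the cut points here are made up, not the survey's actual bins):

```r
# hypothetical sketch: expand an integer Bin index into the distbegin and
# distend columns that mrds/Distance expect for binned distances
cutpoints <- c(0, 250, 500, 1000)  # made-up bin limits in metres
obs$distbegin <- cutpoints[obs$Bin]
obs$distend   <- cutpoints[obs$Bin + 1]
```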
A different way to approach this would be if there were timestamps on waypoints from the GPS, which could be related to the segments. One could then look at whether an observation was made between the start and end times of a segment.
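As a sketch of that idea (entirely hypothetical: this survey's data has no such timestamp columns, so `time`, `start_time` and `end_time` are assumptions):

```r
# hypothetical time-based allocation of observations to segments
for (i in seq_len(nrow(segs))) {
  in_seg <- obs$time >= segs$start_time[i] & obs$time < segs$end_time[i]
  obs$Sample.Label[in_seg] <- segs$Sample.Label[i]
}
```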
We have provided some information about segmenting on this wiki page, as part of the US Navy-funded project DenMod.
This information is summarized at `?"dsm-data"` in the `dsm` package. It may also be useful to look at the example dataset `mexdolphins`, which can be loaded with `data(mexdolphins)`. The vignette for an analysis of those data is available here.
It is hard to give very general advice on how to segment lines, as to some extent it depends on how the data were originally formatted (going all the way back to the make and model of the GPS unit used). More information on dealing with spatial data in R using the `sf` package is available at the R spatial website (see the "Articles" drop-down in the header).
Another source of help is the distance sampling mailing list. Be sure to search the archives prior to posting as there have been several threads on segmenting previously that might be helpful.
Thanks to Phil Bouchet for providing helpful comments to an early version of this document and to Iúri Correia for finding an important bug.