```r
library(knitr)
knitr::opts_chunk$set(eval=TRUE, echo=TRUE, message=FALSE, warning=FALSE,
                      progress=FALSE)
```
The `dsm` package expects data to be in a specific format. In this example I'll show how to get data into that format and give some explanation as to why each of these requirements exists.
I recommend getting to grips with fitting models in `dsm` before reading this tutorial, so you know what the end point will look like before trying to grapple with the data. An example for Gulf of Mexico pantropical spotted dolphins is available here.
To run the code below, you will need to have the `ggplot2` and `sf` packages installed. You can do this with the following code:

```r
install.packages(c("ggplot2", "sf"))
```
The Rhode Island data required for this example can be downloaded below. These files should be placed in a folder called `data` for the code below to work.
When we collect distance sampling data, we need to record (1) the distance to an observation (along with other detectability covariates like weather), (2) the transect we are currently on, and (3) how much effort we expended on that transect (its length or number of visits).
If we want to fit a DSM, that information needs to be expanded to include where on the transect we are, so we can build the spatial model. For point transects this is simple, as we are always at the point, but for line transects we also need to know the position of the detection (or equivalently, how far along the transect we are).
More abstractly, we need information about both detections and effort, and we need to link these two. In `dsm` we use three objects to hold this information:

- The detection function model (`ddf.obj`) holds all the detections but ignores spatial information.
- The segment data (`segment.data`) holds the chunks of effort (parts of the transect or point visits) and their locations. It might also include environmental covariate information.
- The observation data (`observation.data`) links the detections to the segments, making sure each detection has a corresponding segment.

[Figure: the three tables and how they correspond to data collected along a transect.]
The above image gives an overview of how the different `data.frame`s interact, along with how that corresponds to the real-life situation of collecting data on a transect.
The rest of this example goes through each table in turn, explaining their construction.
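To see where we're heading, the three objects eventually slot into a `dsm` call something like the following sketch (assuming a fitted detection function `df_ht`, and the `segs_df` and `obs` tables built below):

```r
# a sketch of the end point: the three data objects passed to dsm()
library(dsm)
mod <- dsm(count ~ s(X, Y), ddf.obj = df_ht,
           segment.data = segs_df, observation.data = obs)
```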
The data requirements are very similar to those for the packages `mrds` and `Distance`. In both cases, each detection has a corresponding row. For an analysis using `dsm` we have additional requirements in the data (these can be auto-generated by `Distance`, but we recommend against that, to ensure that you know records link up correctly). For `Distance` we need:
- `distance`: the distance to the detection
- `object`: a unique detection identifier (which will link to the observation table)
- `size`: the number of individuals in the detection (1 if objects occur alone)

`mrds` requires the following extra columns (to allow for double observer surveys):
- `detected`: 1 if detected by the observer and 0 if missed (always 1 for single observer surveys)
- `observer`: observer number, 1 or 2 (always 1 for single observer surveys)

If other covariates that affect detectability are recorded (e.g., sex or weather), these can be included in this data for analysis in `mrds` or `Distance`.
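As a concrete (entirely hypothetical) illustration of this format, a few rows of single-observer detection data might look like:

```r
# a made-up illustration of the detection data format described above;
# these values are not from the Rhode Island survey
distdata <- data.frame(object   = 1:4,                   # unique detection IDs
                       distance = c(70, 320, 150, 910),  # metres
                       size     = c(1, 2, 1, 5))         # group sizes
```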
Once the model is fitted, we deal only with that fitted model object in the `dsm` analysis.
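To make that concrete, a minimal detection function fit with `Distance` might look something like the following sketch (not part of the Rhode Island analysis; the hazard-rate key and the hypothetical `distdata` table above are illustrative assumptions):

```r
# a minimal sketch of fitting a detection function to the hypothetical
# distdata above; df_ht is the fitted object that dsm() will use later
library(Distance)
df_ht <- ds(distdata, truncation = 1000, key = "hr")
```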
I will use the term "segment" here to refer both to points for point transects and to small chunks of transect in the line transect case. I'll first describe the data format, assuming that you've already segmented your lines, then go on to an example of how to segment line transect data.
The `data.frame` required for the segments needs the following columns:

- `Effort`: the effort expended on this segment (either the length of the segment for lines, or the number of visits for points)
- `Sample.Label`: a unique identifier for the segment. If you are engaged in a complicated survey, this might take a format like "`YEAR-AREA-LINE-SEGMENT`", giving labels like "`2020-North-A-15`", which can be useful for keeping track of where data came from later.

In addition, environmental covariates like location and other relevant covariates (sea surface temperature, foliage type, altitude, bathymetry, etc.) can be included in this `data.frame` if they are to be used in the spatial model.
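For illustration, a couple of hypothetical segment rows might look like this (the `depth` covariate is just an example, not part of the data used below):

```r
# a made-up illustration of the segment data format
segdata <- data.frame(Sample.Label = c("2020-North-A-1", "2020-North-A-2"),
                      Effort       = c(2000, 1864),      # lengths in metres
                      X            = c(-71.50, -71.50),  # centroid coordinates
                      Y            = c(41.10, 41.12),
                      depth        = c(30, 42))          # example covariate
```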
If you have data as lines and you want to chop them up into segments, this section gives some sample R code to do so. The details are extremely dependent on the input data, but hopefully this provides at least a template for what you want to do.
Before you jump in: if you are fluent in ArcGIS or QGIS, I recommend doing this task in the software you are familiar with, at least at first. A simple guide to doing this task in QGIS is here, and for ArcGIS I recommend the MGET toolbox.
As an example, we will look at an aerial survey of seabirds off the coast of Rhode Island, USA. Data from these surveys were collected by the University of Rhode Island and analysed in @winiarski_spatially_2013, @winiarski_integrating_2014 and @winiarski_spatial_2014. I have the transects saved as a shapefile, which I will then "chop up" into segments. To start, let's plot the data with the coastline:
```r
library(ggplot2)
library(sf)

# coast data, just for reference
coastline <- read_sf("data/ri_coast.shp")
# our transect lines
transects <- read_sf("data/ri_transects.shp")

# now make the plot
p <- ggplot() +
  # geom_sf knows what to do with spatial data
  geom_sf(data=coastline) +
  geom_sf(data=transects, aes(colour=Transect)) +
  # chop to be on a better scale
  coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
  # make the plot look simpler
  theme_minimal()
print(p)
```
In the above code we have loaded the `sf` package. This is the package we will use to do most of our spatial work in segmenting the data. It has most of the tools expected from a GIS, though the package is still being developed, so its functionality is always increasing.
Investigating the `transects` data we loaded, we can see it consists of a `data.frame` with some extra bits: the spatial information, including locations (`geometry`) and projection (`CRS`):
```r
transects
```
We have one row per transect (for a total of 24 transects), each of which consists of a line joining the start and end points that make up the transect. If we want, we can plot individual rows via `plot(transects[1,])`; this can be useful to check that each row is a single line, or whether further processing is needed (this can be tricky and is not covered in this tutorial, but see "More information" below).
This data looks in good shape, so we can use the `st_segmentize` function to take each transect and make segments from it. To make sure everything works correctly, we need to project the data first. Here I'm using an appropriate projected coordinate system for the study area, EPSG code 6348 (NAD83(2011) / UTM zone 18N).
```r
# project transects
transects <- st_transform(transects, 6348)

# do the segmenting
segs <- st_segmentize(transects, dfMaxLength=units::set_units(2000, "metres"))

# transform back to lat/long
segs <- st_transform(segs, 4326)
transects <- st_transform(transects, 4326)
```
Here we know the truncation used for the detection function was 1000m (distances were collected in bins and this was the furthest bin). Since the truncation applies to either side of the transect and we're trying to make our segments approximately square, we set the segment length to twice the truncation: 2000m.
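Just to spell out the arithmetic (a trivial sketch; these variables are only for illustration and aren't used later):

```r
# the covered strip extends one truncation width either side of the line,
# so a roughly square segment is twice the truncation long
truncation <- 1000            # metres, the furthest distance bin
seg_length <- 2 * truncation  # 2000 m
```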
Looking at what that did:
```r
segs
```
Each row of `segs` now has many coordinates attached to it. Looking at just the first row (transect 1) and comparing the coordinates between `transects` and `segs`:

```r
st_coordinates(transects[1,])
```

`transects` has just two coordinates (the start and end points of the line), whereas:

```r
st_coordinates(segs[1,])
```

`segs` has 5, corresponding to our segment cut points.
We now need to break up the rows of `segs` into multiple rows, one per segment. This is a bit fiddly, so we use a function provided by dieghernan on their site to do this. We first load the function (don't worry about understanding the code!):
```r
# the function below uses the pipe operator, so we need magrittr loaded
library(magrittr)

stdh_cast_substring <- function(x, to = "MULTILINESTRING") {
  ggg <- st_geometry(x)

  if (!unique(st_geometry_type(ggg)) %in% c("POLYGON", "LINESTRING")) {
    stop("Input should be LINESTRING or POLYGON")
  }

  for (k in 1:length(st_geometry(ggg))) {
    sub <- ggg[k]
    geom <- lapply(
      1:(length(st_coordinates(sub)[, 1]) - 1),
      function(i) rbind(
        as.numeric(st_coordinates(sub)[i, 1:2]),
        as.numeric(st_coordinates(sub)[i + 1, 1:2])
      )
    ) %>%
      st_multilinestring() %>%
      st_sfc()

    if (k == 1) {
      endgeom <- geom
    } else {
      endgeom <- rbind(endgeom, geom)
    }
  }

  endgeom <- endgeom %>% st_sfc(crs = st_crs(x))

  if (class(x)[1] == "sf") {
    endgeom <- st_set_geometry(x, endgeom)
  }

  if (to == "LINESTRING") {
    endgeom <- endgeom %>% st_cast("LINESTRING")
  }

  return(endgeom)
}
```
We can then use it to separate the segments and look at the results:
```r
segs <- stdh_cast_substring(segs, to="LINESTRING")
segs
```
We now have `r nrow(segs)` segments between 1km and 2km long. We can check the lengths using the `st_length` function and plot a histogram of the segment lengths:
```r
hist(st_length(segs), xlab="Segment lengths (m)", main="")
```
Note that in setting the `dfMaxLength` argument to `st_segmentize` we are giving a rough guide to the segment length; the algorithm inside `st_segmentize` tries to make the segments as equal in length as possible (note the $x$ axis in the histogram).
Note also that the `Transect` column in `segs` now has duplicate entries, as each transect contains multiple segments. We can now create the columns required by `dsm`.
First, the `Effort` column can be generated using `st_length`:
```r
segs$Effort <- st_length(segs)
```
Then we can generate the `Sample.Label`s. Here I'll use a more complicated naming scheme to show how that's done, but one could simply write `segs$Sample.Label <- 1:nrow(segs)` and that would be sufficient. Instead, I would like my `Sample.Label` to follow the "`YEAR-AREA-LINE-SEGMENT`" scheme I suggested above.
There are fancier ways to do this, but for clarity, let's use a `for` loop:
```r
# create a dummy column that we can fill in as we go
segs$Sample.Label <- NA

# set the year and area once for this data
year <- 2010
area <- "RI"

# loop over the transect IDs
for(this_transect in unique(segs$Transect)){
  # how many segments in this transect?
  n_segs <- nrow(subset(segs, Transect==this_transect))
  # generate the n_segs labels that we need
  segs$Sample.Label[segs$Transect==this_transect] <- paste(year, area,
                                                           this_transect,
                                                           1:n_segs, sep="-")
}
```
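(For the curious, one of those "fancier ways": a vectorized sketch equivalent to the loop above, using `ave()` to number the segments within each transect.)

```r
# a sketch of a vectorized alternative to the loop: ave() generates the
# within-transect segment numbers, then paste() builds the labels in one go
seg_num <- ave(seq_len(nrow(segs)), segs$Transect, FUN = seq_along)
segs$Sample.Label <- paste(year, area, segs$Transect, seg_num, sep = "-")
```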
Now we can check that looks right:
```r
segs
```
With the required columns taken care of, we can also calculate the centroids of our segments, giving the location of each segment for use in the spatial model. Handily, we can use `st_centroid` to do this in one step and keep all our data at the same time. Calculating centroids on latitude/longitude coordinates produces a warning that they may not be correct, so again we need to project the data (to EPSG 6348) first, using `st_transform`.
```r
# save the line version of segs for plotting later
segs_lines <- segs

# project segs
segs <- st_transform(segs, 6348)
# find centroids
segs <- st_centroid(segs)
# project back to lat/long
segs <- st_transform(segs, 4326)

# how does that look?
segs
```
Notice that our geometry type is now `POINT`. We can plot these centroids on our previous map:
```r
p <- ggplot() +
  # geom_sf knows what to do with spatial data
  geom_sf(data=coastline) +
  geom_sf(data=transects) +
  geom_sf(data=segs) +
  # chop to be on a better scale
  coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
  # make the plot look simpler
  theme_minimal()
print(p)
```
At present, the `dsm` package doesn't know how to talk to `sf` objects, so as a final step we need to simplify `segs` to a regular `data.frame` that `dsm` can interpret. We can extract the coordinates of the centroids using `st_coordinates` and append them onto a `data.frame` version of the object with the geometry dropped, like so:
```r
segs_df <- cbind(as.data.frame(st_drop_geometry(segs)),
                 st_coordinates(segs))
```
We can double check that looks right:
```r
head(segs_df)
```
and we are done. See the next section for how to link the segments and the detections.
If we want to include spatial covariate information in the segment table, we could then use (for example) `rerddap` to obtain remotely-sensed data. For in situ data (i.e., weather conditions recorded while on effort), this can be more complicated and will require the use of `summarize`; see this thread for more information on how to deal with this situation.
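As a rough sketch of the remote-sensing route (not run here; the ERDDAP dataset ID, field name and dates are guesses for illustration only):

```r
# a sketch (not run) of pulling sea surface temperature for the study area
# with rerddap; "jplMURSST41", "analysed_sst" and the dates are assumptions
library(rerddap)
sst_info <- info("jplMURSST41")
sst <- griddap(sst_info,
               latitude  = c(40.8, 41.5),
               longitude = c(-72, -70.8),
               time      = c("2010-06-01", "2010-06-01"),
               fields    = "analysed_sst")
# each segment centroid would then be matched to its nearest SST grid cell
```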
To link together the data in the detection function and the spatial data giving the effort, we use the observation `data.frame`. This is really just a cross-reference between those two tables, ensuring each detection lives in one segment. The observation `data.frame` must have (at least) the following columns:
- `object`: a unique object identifier (corresponding to the identifiers in the detection function)
- `Sample.Label`: the identifier for the segment in which the observation occurred
- `size`: the size of each observed group (e.g., 1 if all animals occurred individually)
- `distance`: the distance to the observation

The observation data should have as many rows as the data used to fit the detection function.
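Continuing the hypothetical illustration from earlier, an observation table matching those four detections might look like:

```r
# a made-up illustration of the observation table; the labels follow the
# "YEAR-AREA-LINE-SEGMENT" scheme used for the segments
obsdata <- data.frame(object       = 1:4,  # matches the detection data
                      Sample.Label = c("2020-North-A-1", "2020-North-A-1",
                                       "2020-North-A-2", "2020-North-B-1"),
                      size         = c(1, 2, 1, 5),
                      distance     = c(70, 320, 150, 910))
```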
There are a few methods for building the observation table. Here I'll illustrate one: assigning detections to segments based on their distance (i.e., each detection is associated with its closest segment centroid). This is appropriate when you can't see very far forwards, so detections are unlikely to be misassigned. This is true, for instance, in an aerial survey where observers look out of windows on the sides of the plane, but it might be inappropriate for a shipboard survey where observers on the flying bridge can see far ahead of the ship.
First, we load the detection data for this survey (this is what we would use to fit the detection function, after processing):
```r
obs <- read.csv("data/loons.csv")
head(obs)
```
We can see there are columns for latitude and longitude (`Lat` and `Long`). We can make this `data.frame` an `sf` object by telling `sf` that these columns contain spatial coordinates (in `sf` lingo, "geometry"):
```r
obs <- st_as_sf(obs, coords=c("Long", "Lat"))
# set the coordinate system
st_crs(obs) <- 4326
obs
```
We can overlay this on our previous map of transects:
```r
p <- ggplot() +
  # geom_sf knows what to do with spatial data
  geom_sf(data=coastline) +
  geom_sf(data=transects) +
  geom_sf(data=obs, size=0.5) +
  # chop to be on a better scale
  coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
  # make the plot look simpler
  theme_minimal()
print(p)
```
Going back to our `segs` object above, we can use the `st_join` function, combined with the `st_nearest_feature` function, to control how the tables are linked. Again, we need to project the data first to avoid issues.
```r
# project segs
segs <- st_transform(segs, 6348)
obs <- st_transform(obs, 6348)

# do the join
obs <- st_join(obs, segs, join=st_nearest_feature)

# project back to lat/long
segs <- st_transform(segs, 4326)
obs <- st_transform(obs, 4326)

# how does that look?
obs
```
Now `obs` has acquired additional columns from `segs`, including `Sample.Label` (which we want) and `Effort` (which we don't). To check this worked okay, we can plot `obs` and `segs_lines` (the line version of the segments we saved earlier), colour coded by `Sample.Label`. I randomized the colour order so it would be easier to tell if observations were misallocated.
```r
# make a random colour palette to avoid similar colours being near each other
pal <- rainbow(nrow(segs), s=.6, v=.9)[sample(1:nrow(segs), nrow(segs))]

p <- ggplot() +
  # geom_sf knows what to do with spatial data
  geom_sf(data=coastline) +
  geom_sf(data=segs_lines, aes(colour=Sample.Label), pch=21) +
  geom_sf(data=obs, size=0.5, aes(colour=Sample.Label)) +
  # chop to be on a better scale
  coord_sf(xlim=c(-72, -70.8), ylim=c(40.8, 41.5), expand=FALSE) +
  scale_colour_manual(values=pal) +
  # make the plot look simpler
  theme_minimal() +
  theme(legend.position="none")
print(p)
```
As with `segs`, we can remove the unnecessary columns and geometry, keeping only the required columns:
```r
# get rid of "spatialness"
obs <- st_drop_geometry(obs)
# select the columns we need
obs <- obs[, c("object", "Sample.Label", "size", "Bin")]
head(obs)
```
(Note that since we have binned distance data here, we keep the `Bin` column; it will need further processing later.)
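That further processing might look something like the following sketch, assuming `Bin` is an integer bin index (the cut points here are made up, not the survey's actual bins):

```r
# hypothetical sketch: expand an integer Bin index into the distbegin and
# distend columns that mrds/Distance expect for binned distances
cutpoints <- c(0, 250, 500, 1000)  # made-up bin limits in metres
obs$distbegin <- cutpoints[obs$Bin]
obs$distend   <- cutpoints[obs$Bin + 1]
```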
A different way to approach this would be if there were timestamps on waypoints from the GPS, which could be related to the segments. One could then look at whether an observation was made between the start and end times of a segment.
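As a sketch of that idea (entirely hypothetical: this survey's data has no such timestamp columns, so `time`, `start_time` and `end_time` are assumptions):

```r
# hypothetical time-based allocation of observations to segments
for (i in seq_len(nrow(segs))) {
  in_seg <- obs$time >= segs$start_time[i] & obs$time < segs$end_time[i]
  obs$Sample.Label[in_seg] <- segs$Sample.Label[i]
}
```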
We have provided some information about segmenting on this wiki page, as part of the US Navy-funded project DenMod.
This information is summarized at `?"dsm-data"` in the `dsm` package. It may also be useful to look at the example dataset `mexdolphins`, which can be loaded with `data(mexdolphins)`. The vignette for an analysis of those data is available here.
It is hard to give very general advice on how to segment lines, as to some extent it depends on how the data were originally formatted (going all the way back to the make and model of the GPS unit used). More information on dealing with spatial data in R using the `sf` package is available at the R spatial website (see the "Articles" drop-down in the header).
Another source of help is the distance sampling mailing list. Be sure to search the archives prior to posting as there have been several threads on segmenting previously that might be helpful.
Thanks to Phil Bouchet for providing helpful comments to an early version of this document and to Iúri Correia for finding an important bug.