Aim and setup

The spread of invasive species is due both to diffusive spread and human-assisted jump dispersal, often caused by species hitchhiking on vehicles. The aim of this vignette is to differentiate diffusive spread from jump dispersal in invasive species based on occurrence data. This identification required the development of a directional analysis of species presence. We demonstrate this method using occurrence data of the spotted lanternfly, Lycorma delicatula, an invasive Hemiptera in the United States.

We load the necessary packages.

library(magrittr)
library(tidyverse)
library(sf)
library(ggplot2)
library(here)
library(dplyr)
library(slfjumps)

1. Data overlook

The SLF dataset is available from the lydemapr companion package. In the download_data folder, then in the v2_2023 subfolder, download the lyde_data_v2.zip file and place it in the data folder of your local jumpID project. At the date at which this vignette is written, lydemapr contains data up to year 2022.

We load this dataset that contains all the occurrences of the spotted lanternfly, and take a look at it.

slf <- read.csv(file.path(here(), "data", "lyde_data_v2", "lyde.csv"))
slf %<>% select(bio_year, latitude, longitude, lyde_established) 
tibble::tibble(slf)

The dataset contains 831,039 rows, each corresponding to a SLF survey at a specific date and location. The columns that we will using are: bio_year: the biological year of the occurrence (see lydemapr for a description of the difference between year and bio_year)
latitude and longitude (XY coordinates). The precise location of SLF surveys is here summarized in 1-km² cells. All coordinates are in WGS84 (EPSG 4326)
* lyde_established: whether several live SLF were found in this survey

2. Data formatting

First, to use the jumpID package, we need to rename the columns with generic names expected by the package.

slf %<>% rename(year = bio_year,
                   established = lyde_established)

Second, multiple surveys were sometimes conducted at the same location during the same year, resulting in a redundant dataset for the purpose of our analysis. We summarize the information available at each location every year, so that a row now represents the detection status at a location for a given year.

Note: when several surveys indicate that SLF are "present" the same year at the same location, it could be tempting to categorize SLF as "established". However, the category "present" often refers to dead individuals, although this information is not explicitly available. We use a conservative approach and consider SLF as established in a cell only if one of the surveys considered it as established.

grid_data <- slf %>%
  dplyr::group_by(year, latitude, longitude) %>%
  dplyr::summarise(established = any(established)) %>% 
  ungroup()

tibble::tibble(grid_data)

The dataset now has 130,015 rows, corresponding to r length(unique(grid_data$latitude)) x r length(unique(grid_data$longitude)) cells. Let's look at these data on a map.

# extract a map of the States and recodes state labels to show the two-letter code rather than the full state name.

# obtaining simple feature objects for states and finding centroids for label positioning
states <- sf::st_as_sf(maps::map("state", plot = FALSE, fill = TRUE)) %>% st_transform(crs = 4326)
sf::sf_use_s2(FALSE)
states <- cbind(states, st_coordinates(st_centroid(states)))
sf::sf_use_s2(TRUE)

# making table key for state 2-letter abbreviations
# the vectors state.abb and state.name contains strings of all
# US states and abbreviations
state_abbr <- tibble(state.name = stringr::str_to_lower(state.name), state.abb) %>%
  dplyr::left_join(tibble(ID = states$ID), ., by = c(ID = "state.name")) %>%
  dplyr::mutate(state.abb = replace_na(state.abb, ""))

# adding 2-letter codes to sf
states$code <- state_abbr$state.abb

# Map all surveys
ggplot(data = states) +
  geom_point(data = grid_data, aes(x = longitude, y = latitude)) +
  xlab("Longitude") + ylab("Latitude") +
  geom_sf(data = states, alpha = 0) +
  facet_wrap(~year) +
  theme_classic()

Cells were predominantly surveyed in the Eastern US, but there are also cells documented in other parts of the US, starting in 2019. Positive cells (established SLF) are only in the Eastern US, so maps will be zoomed in on the Eastern US from here.

Every year, there is a large number of cells surveyed with positive (SLF established) or negative results (SLF not established). A zoomed-in map allows to see the cell grid.

ggplot(data = states) +
  geom_point(data = grid_data, aes(x = longitude, y = latitude, 
                                   col = established), size = 1) +
  scale_color_manual(values = c("gray", "black")) +
  labs(col = "SLF established") +
  xlab("Longitude") + ylab("Latitude") +
  geom_sf(data = states, alpha = 0) +
  geom_text(data = states, aes(X, Y, label = code), size = 2) +
  coord_sf(xlim = c(-78.5, -78), ylim = c(39, 39.5), expand = FALSE) +
  theme(legend.position = "bottom")

The progression of the invasion will be measured starting from the introduction point that has been documented in Barringer et al. 2015. We store the coordinates of the introduction point in an object for the next analyses. If the precise introduction point is unknown, it may be replaced by the centroid of the invasion at the time the invasive species was discovered in the invasive range.

# Coordinates of the introduction site, extracted from Barringer et al. 2015
# As a table (for distance calculations):
centroid_df <- data.frame(longitude = -75.675340, 
                        latitude = 40.415240) 


# Replace by the centroid of the invasion in the first years of the invasion if the introduction point is unknown:
# centroid_df <- slf %>% filter(year == 2014) %>%
#   summarise(latitude = mean(.$latitude),
#             longitude = mean(.$longitude))

To finish preparing the dataset for analysis, we calculate the distance between each survey cell and the introduction point.

grid_data %<>% 
  dplyr::mutate(DistToIntro = geosphere::distGeo(grid_data[,c(3,2)], centroid_df)/1000) 
# the distance is obtained in meters, and we divide it by 1000 to obtain kilometers given the scale of this invasion.

summary(grid_data)

# save this dataset
write.csv(grid_data, file.path(here(), "exported-data", "grid_data.csv"), row.names = F)

The dataset is now ready for the jump analysis.

3. Differentiating diffusive spread and jump dispersal: principle

A set of four successive functions will be run to separate cells where SLF is established due to continuous, diffusive spread, from cells where SLF is established due to jump dispersal. We will first run the functions on a tentative set of parameters to understand how the analysis works. The optimization of the parameters will be described later in this vignette.

a- attribute_sectors()

Considering that the expansion of the invasion is heterogeneous in space, the jump analyses requires the invasion range to be divided into sectors with the introduction site as the origin. The first function, attribute_sectors() attributes a sector number to each cell of the dataset, to be used in the jump analysis. Here, we divide the invasion range into 16 sectors. See vignette #2 "Decreasing calculation time" to learn how changing the number of sectors may make your analysis faster.

grid_data_sectors <- slfjumps::attribute_sectors(dataset = grid_data, # dataset to be explored
                             nb_sectors = 16, # number of sectors to divide the invasion range
                             centroid = c(-75.675340, 40.415240)) # vector containing the centroid coordinates as long/lat

# Map results
ggplot(data = states) +
  geom_point(data = grid_data_sectors, 
             aes(x = longitude, y = latitude, 
                 col = as.factor(sectors_nb))) +
  geom_sf(data = states, alpha = 0) +
  geom_text(data = states, aes(X, Y, label = code), size = 2) +
  theme(legend.position = "bottom") +
  xlab("Longitude") + ylab("Latitude") +
  coord_sf(xlim = c(-82, -72), ylim = c(38, 43), expand = FALSE) +
  labs(col = "Sector number")

The analysis will go through each sector successively.

b- find_thresholds()

The function find_thresholds() will then search for the first discontinuity in the SLF distribution. Points are sorted by increasing distance to the introduction point, and the function goes through each consecutive pair of points, starting from the introduction point and going outwards. It stops once it identifies a distance larger than a defined gap_size between two consecutive points, marking a discontinuity in the SLF distribution. The last point before this discontinuity marks the putative limit of the diffusive spread. The function runs independently for each sector. It is considered that populations do not go extinct, and data are cumulated over time. The limit of diffusive spread cannot go back towards the introduction point over time.

The input dataset is the output of attribute_sectors(). The gap_size parameter defines the minimal distance between two consecutive points that marks a discontinuity in the SLF distribution, in km, and it is here set to 15 km. If the option "negatives" is set to TRUE (default), the function will check that there are negative surveys in the discontinuity, so that the absence of SLF established is not due to an absence of surveys in the area.

Results_thresholds <- slfjumps::find_thresholds(dataset = grid_data_sectors, 
                                                gap_size = 15, 
                                                negatives = T)

# store resulting objects
preDist <- Results_thresholds$preDist # list of threshold points = limits of the 
#diffusive spread. Will be completed in find_secDiff()
potJumps <- Results_thresholds$potJumps # list of potential jumps. Will be
#pruned in find_jumps()

# Make a single object for the map
thresholds_map <- dplyr::bind_rows(preDist %>% dplyr::mutate(Type = "Diffusion threshold"),
                        potJumps %>% dplyr::mutate(Type = "Potential jump"))

# Map results
ggplot(data = states) +
  geom_point(data = grid_data %>% filter(established == T), 
             aes(x = longitude, y = latitude), col = "gray") +
  geom_point(data = thresholds_map, aes(x = longitude, y = latitude, col = Type), size = 1) +
  geom_sf(data = states, alpha = 0) +
  facet_wrap(~year) +
  geom_text(data = states, aes(X, Y, label = code), size = 2) +
  theme_classic() +
  theme(legend.position = "bottom") +
  scale_color_manual(values = c("blue", "green")) +
  xlab("Longitude") + ylab("Latitude") + labs(col = "Identification") +
  coord_sf(xlim = c(-82, -72), ylim = c(38, 43), expand = FALSE)

On this map, blue points indicate the thresholds identified by find_thresholds(), i.e., the limits of the diffusive spread in every direction, for every year. At this point of the analysis, all points farther than these thresholds are stored as potential jumps (green points).

c- find_jumps()

The function find_jumps() prunes the list of potential jumps of cells that are less than from a positive cell from a previous year, as SLF likely spread from this other cell. The final list of jumps is obtained in the object Jumps. The pruned points may be examined in the last function, find_secDiff(). The analysis may also stop here if only jumps are of biological interest.

Results_jumps <- slfjumps::find_jumps(grid_data = grid_data, 
                       potJumps = potJumps,
                       gap_size = 15, 
                       crs = 4326)

# store resulting objects
Jumps <- Results_jumps$Jumps # final list of jumps
diffusers <- Results_jumps$diffusers # list of positive cells stemming from diffusive spread
potDiffusion <- Results_jumps$potDiffusion # list of cells containing a mix of secondary diffusion and additional threshold points. Will be pruned in find_secDiff()

# Make a single object for the map
jumps_map <- dplyr::bind_rows(preDist %>% mutate(Type = "Diffusion threshold"),
                       Jumps %>% mutate(Type = "Jump"),
                       potDiffusion %>% mutate(Type = "Potential diffusion"))

# Map results
ggplot(data = states) +
  geom_point(data = grid_data %>% filter(established == T), 
             aes(x = longitude, y = latitude), col = "gray") +
  geom_point(data = jumps_map, aes(x = longitude, y = latitude, col = Type)) +
  geom_sf(data = states, alpha = 0) +
  facet_wrap(~year) +
  geom_text(data = states, aes(X, Y, label = code), size = 2) +
  theme_classic() +
  theme(legend.position = "bottom") +
  scale_color_manual(values = c("blue", "green", "orange")) +
  xlab("Longitude") + ylab("Latitude") + labs(col = "Identification") +
  coord_sf(xlim = c(-82, -72), ylim = c(38, 43), expand = FALSE)

d- find_secDiff()

Cells that were discarded from the jump list are either additional threshold points or secondary diffusion from a previous jump. find_secDiff() attributes them to the correct category by checking whether they are close from a previous jump, or a diffusion point.

Results_secDiff <- slfjumps::find_secDiff(potDiffusion = potDiffusion,
                                Jumps = Jumps,
                                diffusers = diffusers,
                                Dist = preDist,
                                gap_size = 15,
                                crs = 4326)

# store resulting objects
secDiff <- Results_secDiff$secDiff
Dist <- Results_secDiff$Dist

# Make a single object for the map
secDiff_map <- dplyr::bind_rows(Dist %>% mutate(Type = "Diffusion threshold"),
                       Jumps %>% mutate(Type = "Jump"),
                       secDiff %>% mutate(Type = "Secondary diffusion"))

# Map results
ggplot(data = states) +
  geom_point(data = grid_data %>% filter(established == T), 
             aes(x = longitude, y = latitude), col = "gray") +
  geom_point(data = jumps_map, aes(x = longitude, y = latitude, col = Type)) +
  geom_sf(data = states, alpha = 0) +
  facet_wrap(~year) +
  geom_text(data = states, aes(X, Y, label = code), size = 2) +
  theme_classic() +
  theme(legend.position = "bottom") +
  scale_color_manual(values = c("blue", "green", "orange")) +
  xlab("Longitude") + ylab("Latitude") + labs(col = "Identification") +
  coord_sf(xlim = c(-82, -72), ylim = c(38, 43), expand = FALSE)

4. Wrapper for jump analysis

All these analyses can be done in one bout using the wrapper function.

jumps_wrapper <- slfjumps::find_jumps_wrapper(dataset = grid_data, 
                   nb_sectors = 16,
                   centroid = c(-75.675340, 40.415240),
                   gap_size = 15,
                   negatives = T,
                   crs = 4326)

# store resulting objects
Dist <- jumps_wrapper$Dist
Jumps <- jumps_wrapper$Jumps
secDiff <- jumps_wrapper$secDiff

# save them
write.csv(Dist, file.path(here(), "exported-data", "thresholds.csv"), row.names = F)
write.csv(Jumps, file.path(here(), "exported-data", "jumps.csv"), row.names = F)
write.csv(secDiff, file.path(here(), "exported-data", "secDiffusion.csv"), row.names = F)

We plot the results on a map.

# Make a single object for the map
jumps_wrapper_map <- dplyr::bind_rows(Dist %>% dplyr::mutate(Type = "Diffusion threshold"),
                       Jumps %>% dplyr::mutate(Type = "Jump"),
                       secDiff %>% dplyr::mutate(Type = "Secondary diffusion"))

# Map results
ggplot(data = states) +
  geom_point(data = grid_data %>% filter(established == F), 
             aes(x = longitude, y = latitude), col = "gray90") +
  geom_point(data = grid_data %>% filter(established == T), 
             aes(x = longitude, y = latitude), col = "gray30") +
  geom_point(data = jumps_wrapper_map, aes(x = longitude, y = latitude, col = Type)) +
  geom_sf(data = states, alpha = 0) +
  facet_wrap(~year) +
  geom_text(data = states, aes(X, Y, label = code), size = 2) +
  theme_classic() +
  theme(legend.position = "bottom") +
  scale_color_manual(values = c("blue", "orange", "green")) +
  xlab("Longitude") + ylab("Latitude") + labs(col = "Identification") +
  coord_sf(xlim = c(-82, -72), ylim = c(38, 43), expand = FALSE)

5. Rarefy jump clusters

Jumps <- read.csv(file.path(here(), "exported-data", "jumps.csv"))

Some jumps were clustered in the same area. Jump clusters may stem from independent dispersal jumps, i.e. SLF hitchhiked multiple times to these locations the same year. Alternatively, jump clusters may result from SLF quickly spreading around a single dispersal jump. Finally, they can be a mix between these two hypotheses. We identify these jump clusters to better understand jump locations.

First, we delineate these jump clusters as jumps located less than the gap size from each other. Check how many points there are per jump cluster.

a- group_jumps()

Jump_groups <- slfjumps::group_jumps(Jumps, gap_size = 15)

Groups <- Jump_groups %>% dplyr::group_by(year, Group) %>% dplyr::summarise(Nb = n()) %>% 
  dplyr::arrange(-Nb) %>% 
  dplyr::filter(Nb > 1)

ggplot(Groups, aes(x = as.factor(year), y = Nb)) +
  geom_boxplot() + theme_classic() +
  ylab("Number of jumps per cluster") + xlab("Year")

b- rarefy_groups()

We summarize each jump cluster by summarizing each of them by their most central point.

Jump_clusters <- slfjumps::rarefy_groups(Jump_groups) %>% dplyr::mutate(Rarefied = TRUE)

write.csv(Jump_clusters, 
          file.path(here(), "exported-data", "jump_clusters.csv"), 
          row.names = F)


# Map these groups and the rarefied points
ggplot(data = states) +
  geom_sf(data = states, fill = "white") +
  coord_sf(xlim = c(-82, -72), ylim = c(38, 43), expand = FALSE) +
  geom_point(data = Jump_groups,
             aes(x = longitude, y = latitude, 
                 col = as.factor(Group)), shape = 19, size = 3, 
             show.legend = F) +
  geom_point(data = Jump_clusters, 
             aes(x = longitude, y = latitude)) +
  labs(x = "Longitude", y = "Latitude")

These r dim(Jump_groups)[1] jumps were rarefied into r dim(Jump_clusters)[1] jump clusters.

6. Data is ready for biological analysis and interpretation!

-- end of vignette --



nbelouard/slfjumps documentation built on July 27, 2024, 8:28 a.m.