Select, reshape, and filter data


The forcis package provides functions to select, filter, and reshape FORCIS data. This vignette shows how to use them. Except for select_taxonomy(), all functions presented here are optional: whether you need them depends on your research questions. You can filter data by species, time range, ocean, etc.

Setup

First, let's import the required packages.

library(forcis)

Before proceeding, let's download the latest version of the FORCIS database.

# Create a data/ folder ----
dir.create("data")

# Download latest version of the database ----
download_forcis_db(path = "data", version = NULL)

The vignette will use the plankton nets data of the FORCIS database. Let's import the latest release of the data.

# Import net data ----
net_data <- read_plankton_nets_data(path = "data")

# Alternatively, import the sample shipped with the package ----
file_name <- system.file(
  file.path("extdata", "FORCIS_net_sample.csv"),
  package = "forcis"
)
net_data <- read.csv(file_name)

NB: In this vignette, we use a subset of the plankton nets data, not the whole dataset.

Selecting columns

Select a taxonomy

The FORCIS database provides three different taxonomies: the lumped taxonomy (LT), the validated taxonomy (VT), and the original taxonomy (OT).

See the associated data paper for further information.

After importing the data and before going any further, the next step involves choosing the taxonomic level for the analyses. This is mandatory to avoid duplicated records.

Let's use the function select_taxonomy() to select the VT taxonomy (validated taxonomy):

# Select taxonomy ----
net_data_vt <- net_data |>
  select_taxonomy(taxonomy = "VT")

net_data_vt
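To illustrate why this step prevents duplicated records, here is a minimal base R sketch on a toy table. It is not the package's implementation: the table, the helper keep_taxonomy(), and the `_OT` and `_LT` column suffixes are assumptions made by analogy with the `_VT` suffix used in this vignette.

```r
# Toy table: one species reported under three taxonomies.
# Column names and the _OT/_LT suffixes are hypothetical.
toy <- data.frame(
  sample_id      = c("s1", "s2"),
  g_glutinata_OT = c(3, 0),
  g_glutinata_VT = c(3, 0),
  g_glutinata_LT = c(3, 0)
)

# Summing over all taxon columns counts each individual three times
taxon_cols <- grep("_(OT|VT|LT)$", names(toy), value = TRUE)
total_all  <- sum(toy[, taxon_cols])

# Hypothetical helper: keep metadata plus the columns of one taxonomy
keep_taxonomy <- function(data, taxonomy) {
  all_taxa <- grepl("_(OT|VT|LT)$", names(data))
  wanted   <- grepl(paste0("_", taxonomy, "$"), names(data))
  data[, !all_taxa | wanted, drop = FALSE]
}

toy_vt   <- keep_taxonomy(toy, "VT")
total_vt <- sum(toy_vt$g_glutinata_VT)
```

In this toy case the total drops from the inflated sum over all taxonomy columns to the sum over the single `_VT` column, which is the behaviour the mandatory selection step is meant to guarantee.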

Select required columns

Because FORCIS data contain more than 100 columns, the function select_forcis_columns() can be used to reduce the data to the columns you need, which makes it easier to handle and speeds up some computations.

By default, only the species columns and the required columns listed in get_required_columns() (needed by some functions of the package, such as compute_*() and plot_*()) are kept.

# Remove non-required columns (optional) ----
net_data_vt <- net_data_vt |>
  select_forcis_columns()

net_data_vt

You can also use the argument cols to keep additional columns.
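The idea behind the cols argument can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy column names, the `required` vector (a stand-in for the output of get_required_columns()), and the helper keep_columns() are all hypothetical, and the real function may order or validate columns differently.

```r
# Toy table; every column name except the _VT suffix is hypothetical
toy <- data.frame(
  sample_id      = c("s1", "s2"),
  depth_min      = c(0, 100),
  cruise_name    = c("A", "B"),
  mesh_size      = c(100, 150),
  g_glutinata_VT = c(2, 5)
)

# Stand-in for the vector returned by get_required_columns()
required <- c("sample_id", "depth_min")

# Hypothetical helper: keep required columns, any extra columns
# requested through cols, and the species columns
keep_columns <- function(data, cols = NULL) {
  species <- grep("_VT$", names(data), value = TRUE)
  data[, unique(c(required, cols, species)), drop = FALSE]
}

default_names <- names(keep_columns(toy))
extra_names   <- names(keep_columns(toy, cols = "cruise_name"))
```

Here `default_names` contains only the required and species columns, while `extra_names` also retains `cruise_name`.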

Filtering rows

The filter_by_*() functions are optional and their use depends on your research questions.

Filter by month of data collection

The filter_by_month() function filters observations based on the month of sampling. It requires two arguments: the data and a numeric vector with values between 1 and 12.

# Filter data by sampling month ----
net_data_vt_july_aug <- net_data_vt |>
  filter_by_month(months = 7:8)

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_july_aug)

Filter by year of data collection

The filter_by_year() function filters observations based on the year of sampling. It requires two arguments: the data and a numeric vector with the years of interest.

# Filter data by sampling year ----
net_data_vt_9020 <- net_data_vt |>
  filter_by_year(years = 1990:2020)

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_9020)
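Conceptually, the month and year filters both keep the rows whose sampling date falls in a set of allowed values. The base R sketch below shows this on a toy table. It does not reproduce the package's code, and the column name `sampling_date` is a hypothetical stand-in for the date column of the FORCIS data.

```r
# Toy samples; the column name sampling_date is hypothetical
toy <- data.frame(
  sample_id     = c("s1", "s2", "s3", "s4"),
  sampling_date = as.Date(c("1985-07-14", "1995-08-02",
                            "2005-01-20", "2010-07-30"))
)

# Extract month and year as integers
toy$month <- as.integer(format(toy$sampling_date, "%m"))
toy$year  <- as.integer(format(toy$sampling_date, "%Y"))

# Keep samples collected in July or August
toy_july_aug <- toy[toy$month %in% 7:8, ]

# Keep samples collected between 1990 and 2020
toy_9020 <- toy[toy$year %in% 1990:2020, ]

# Apply both conditions at once
toy_both <- toy[toy$month %in% 7:8 & toy$year %in% 1990:2020, ]
```

Because `%in%` tests set membership, the allowed values do not need to be contiguous: `c(1, 7, 12)` would work as well as `7:8`.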

Filter by bounding box

The function filter_by_bbox() can be used to filter FORCIS data by a spatial bounding box (argument bbox).

Let's filter the plankton net data by a spatial rectangle located in the Indian Ocean.

# Filter by spatial bounding box ----
net_data_vt_bbox <- net_data_vt |>
  filter_by_bbox(bbox = c(45, -61, 82, -24))

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_bbox)

Note that the argument bbox can be either an object of class bbox (package sf) or a vector of four numeric values defining a rectangular bounding box. If a vector of numeric values is provided, coordinates must be defined in the WGS 84 system (EPSG:4326).
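As a rough illustration of what filtering by a bounding box means, the base R sketch below keeps the toy sites whose coordinates fall inside the rectangle used above. It assumes the four values are ordered xmin, ymin, xmax, ymax (the convention used by sf, and the only ordering consistent with the example values); the column names `longitude` and `latitude` are hypothetical, and this is not the package's implementation.

```r
# Bounding box in WGS 84, assumed order: xmin, ymin, xmax, ymax
bbox <- c(xmin = 45, ymin = -61, xmax = 82, ymax = -24)

# Toy sampling sites; column names are hypothetical
toy <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  longitude = c(60, 10, 80),
  latitude  = c(-40, -40, 5)
)

# A site is kept when both coordinates lie within the rectangle
inside <- toy$longitude >= bbox["xmin"] & toy$longitude <= bbox["xmax"] &
  toy$latitude >= bbox["ymin"] & toy$latitude <= bbox["ymax"]

toy_bbox <- toy[inside, ]
```

Only the first site is retained: the second lies west of the rectangle and the third lies north of it.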

Let's check the spatial extent by converting these two tibbles into spatial layers (sf objects) with the function data_to_sf().

# Convert to sf objects ----
net_data_vt_sf <- net_data_vt |>
  data_to_sf()

net_data_vt_bbox_sf <- net_data_vt_bbox |>
  data_to_sf()

# Original spatial extent ----
sf::st_bbox(net_data_vt_sf)

# Spatial extent of filtered records ----
sf::st_bbox(net_data_vt_bbox_sf)

Filter by ocean

The function filter_by_ocean() can be used to filter FORCIS data by one or several oceans (argument ocean).

Let's keep only the plankton net data located in the Indian Ocean.

# Filter by ocean name ----
net_data_vt_indian <- net_data_vt |>
  filter_by_ocean(ocean = "Indian Ocean")

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_indian)

Use the function get_ocean_names() to retrieve the names of the world's oceans according to the IHO Sea Areas dataset version 3 (used in this package).

# Get ocean names ----
get_ocean_names()

Filter by spatial polygon

The function filter_by_polygon() can be used to filter FORCIS data by a spatial polygon (argument polygon).

Let's filter the plankton net data by a spatial polygon defining the boundaries of the Indian Ocean.

# Import spatial polygon ----
file_name <- system.file(
  file.path("extdata", "IHO_Indian_ocean_polygon.gpkg"),
  package = "forcis"
)

indian_ocean <- sf::st_read(file_name, quiet = TRUE)

# Filter by polygon ----
net_data_vt_poly <- net_data_vt |>
  filter_by_polygon(polygon = indian_ocean)

# Number of original records ----
nrow(net_data_vt)

# Number of filtered records ----
nrow(net_data_vt_poly)

Filter by species

The filter_by_species() function allows users to filter FORCIS data for one or more species.

It takes a data.frame (or a tibble) and a vector of species names (argument species).

Let's subset the plankton net data to keep only two species: G. glutinata and C. nitida.

# Filter by species ----
net_data_vt_glutinata_nitida <- net_data_vt |>
  filter_by_species(species = c("g_glutinata_VT", "c_nitida_VT"))

# Dimensions of original data ----
dim(net_data_vt)

# Dimensions of filtered data ----
dim(net_data_vt_glutinata_nitida)

Important: The filter_by_species() function does not remove rows (samples) but columns: it removes the columns of all other species. To keep only samples where at least one of these two species has been detected, we can use:

# Keep samples with positive counts ----
net_data_vt_glutinata_nitida <- net_data_vt_glutinata_nitida |>
  dplyr::filter(g_glutinata_VT > 0 | c_nitida_VT > 0)

# Number of filtered records ----
nrow(net_data_vt_glutinata_nitida)
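The choice of logical operator in the row filter matters. The base R sketch below, on a toy table with made-up counts, contrasts keeping samples where at least one species is present (`|`) with keeping samples where both are present (`&`).

```r
# Toy counts for the two species (values are made up)
toy <- data.frame(
  sample_id      = c("s1", "s2", "s3", "s4"),
  g_glutinata_VT = c(4, 0, 0, 2),
  c_nitida_VT    = c(0, 0, 1, 3)
)

# At least one of the two species detected (logical OR)
either <- toy[toy$g_glutinata_VT > 0 | toy$c_nitida_VT > 0, ]

# Both species detected in the same sample (logical AND)
both <- toy[toy$g_glutinata_VT > 0 & toy$c_nitida_VT > 0, ]
```

Note that if counts contain missing values, base R indexing returns rows of NA for them, whereas dplyr::filter() drops rows where the condition evaluates to NA.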

Reshaping

Convert to long format

The convert_to_long_format() function converts FORCIS data into a long format.

# Convert to long format ----
net_data_long <- convert_to_long_format(net_data)

# Dimensions of original data ----
dim(net_data)

# Dimensions of reshaped data ----
dim(net_data_long)

Two columns have been created: taxa (taxon names) and counts (taxon counts).

# Column names ----
colnames(net_data_long)
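To make the wide-to-long reshaping concrete, here is a minimal base R sketch on a toy table. It only mimics the general idea (one row per sample and taxon, with the `taxa` and `counts` columns mentioned above); the toy data are made up, and the real function may handle metadata columns, taxon names, or missing values differently.

```r
# Toy wide table: one row per sample, one column per taxon
toy <- data.frame(
  sample_id      = c("s1", "s2"),
  g_glutinata_VT = c(4, 0),
  c_nitida_VT    = c(1, 3)
)

species <- c("g_glutinata_VT", "c_nitida_VT")

# Long table: one row per sample-taxon pair
toy_long <- data.frame(
  sample_id = rep(toy$sample_id, times = length(species)),
  taxa      = rep(species, each = nrow(toy)),
  counts    = unlist(toy[species], use.names = FALSE)
)

# Dimensions before and after reshaping
dim(toy)
dim(toy_long)
```

The long table has as many rows as the number of samples multiplied by the number of taxa, and the species columns are replaced by the two columns `taxa` and `counts`.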


forcis documentation built on June 8, 2025, 10:37 a.m.