Data describing species observations are referred to as occurrence data. In Living Atlases like the Atlas of Living Australia (ALA), this is the default type of data stored.

Using occurrence-based datasets assumes that all observations are independent of each other. The benefit of this assumption is that observational data can remain simple in structure: every observation is made at a specific place and time. This simplicity allows occurrence-based data from many sources to be aggregated and used together.
Let's see how to build an occurrence-based dataset using galaxias.
Let's use a small example dataset of bird observations taken at 4 different sites. This dataset contains several other types of data, like landscape type and age class. Importantly for standardising to Darwin Core, it contains the scientific name (species), coordinate location (lat & lon) and date of observation (date).
#| warning: false
#| message: false
library(galaxias)
library(dplyr)
library(readr)

obs <- read_csv("dummy-dataset-sb.csv", show_col_types = FALSE) |>
  janitor::clean_names()

obs |>
  gt::gt() |>
  gt::opt_interactive(page_size_default = 5)
To determine what we need to do to standardise our dataset, let's use suggest_workflow(). The output tells us we have one matching Darwin Core term in our data already (sex), but that we are missing all minimum required Darwin Core terms.
obs |> suggest_workflow()
Under "Suggest workflow", the output above suggests a series of piped set_ functions that we can use to rename, modify or add columns that are missing from obs but required by Darwin Core. set_ functions are specialised wrappers around dplyr::mutate(), with additional functionality to support using the Darwin Core Standard.
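To see the relationship, here is a minimal sketch using the example dataset's species column: the set_ call creates the Darwin Core column much as mutate() would, but additionally runs validity checks on it.

# Create a `scientificName` column and run Darwin Core checks on it
obs |>
  set_scientific_name(scientificName = species)

# Roughly equivalent column creation with dplyr alone (no checks)
obs |>
  mutate(scientificName = species)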
For simplicity, let's do the easy part first: renaming columns we already have in our dataset to use accepted standard Darwin Core terms. set_ functions will automatically check that each column is correctly formatted. We'll save our modified dataframe as obs_dwc.
obs_dwc <- obs |>
  set_scientific_name(scientificName = species) |>
  set_coordinates(decimalLatitude = lat, decimalLongitude = lon) |>
  set_datetime(eventDate = lubridate::ymd(date)) # specify year-month-day format
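It can be worth eyeballing the renamed columns before moving on; here's a quick, optional check with standard dplyr:

# Peek at the newly standardised columns
obs_dwc |>
  select(scientificName, decimalLatitude, decimalLongitude, eventDate) |>
  head()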
Running suggest_workflow() again will reflect our progress and show us what's left to do. Now the output tells us that we still need to add several columns to our dataset to meet minimum Darwin Core requirements.
obs_dwc |> suggest_workflow()
Here's a rundown of the columns we need to add:
- occurrenceID: A unique identifier for each record. This ensures that we can identify the specific record for any future updates or corrections. We can use composite_id(), sequential_id() or random_id() to add a unique ID to each row.
- basisOfRecord: The type of record (e.g. human observation, specimen, machine observation). See a list of acceptable values with corella::basisOfRecord_values().
- geodeticDatum: The Coordinate Reference System (CRS) projection of your data (for example, the CRS of Google Maps is "WGS84").
- coordinateUncertaintyInMeters: The area of uncertainty around your observation. You may know this value based on your method of data collection.

Now let's add these columns using set_occurrences() and set_coordinates(). We can also add the suggested function set_individual_traits(), which will automatically identify the matched column name sex and check the column's format.
obs_dwc <- obs_dwc |>
  set_occurrences(
    occurrenceID = composite_id(sequential_id(), site, landscape),
    basisOfRecord = "humanObservation"
  ) |>
  set_coordinates(
    geodeticDatum = "WGS84",
    coordinateUncertaintyInMeters = 30
    # coordinateUncertaintyInMeters = with_uncertainty(method = "phone")
  ) |>
  set_individual_traits()
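occurrenceID values must be unique for each record to remain identifiable. If you'd like to verify this yourself (an optional check, not part of the suggested workflow), plain dplyr will do:

# Any occurrenceID appearing more than once indicates a problem
obs_dwc |>
  count(occurrenceID) |>
  filter(n > 1) # expect zero rows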
Running suggest_workflow() once more will confirm that our dataset is ready to be used in a Darwin Core Archive!
obs_dwc |> suggest_workflow()
To submit our dataset, let's select columns with valid occurrence term names and save this dataframe to the file occurrences.csv. Importantly, we will save our csv in a folder called data-processed, which galaxias looks for automatically when building a Darwin Core Archive.
obs_dwc <- obs_dwc |>
  select(any_of(occurrence_terms())) # select any matching terms

obs_dwc |>
  gt::gt() |>
  gt::opt_interactive(page_size_default = 5)
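If you're curious which column names count as valid occurrence terms, occurrence_terms() returns the accepted Darwin Core terms as a character vector; a quick look:

# The first few Darwin Core terms accepted for occurrence data
occurrence_terms() |>
  head()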
Our final step is to save this to our 'publishing' directory:
#| eval: false
# Save in ./data-processed
use_data_occurrences(obs_dwc)
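use_data_occurrences() handles the file name and location for you. If you prefer to manage files manually, a roughly equivalent save (a sketch, assuming the data-processed folder already exists and no extra processing is needed) would be:

#| eval: false
# Manual alternative: write the csv where galaxias expects it
readr::write_csv(obs_dwc, "data-processed/occurrences.csv")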
All done! See the Quick start guide vignette for how to build a Darwin Core Archive.