knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 6 )
As any modeling technique, ecological niche modeling depends on the quality of the input data, particularly species occurrence records. Cleaning these data is a critical step to minimize biases, reduce errors, and ensure meaningful model outcomes. This vignette introduces tools available in the kuenm2 package to facilitate the cleaning of occurrence data prior to modeling. It guides users through inspecting data, applying cleaning functions, and saving cleaned datasets, all within a reproducible R workflow.
We want to highlight that additional data cleaning and filtering steps (e.g., spatial thinning) may be necessary depending on the type of model or modeling approach the user intends to adopt. The tools presented here are designed to assist with the most basic steps of cleaning data for modeling.
If kuenm has not been installed yet, please do so. See the Main guide for installation instructions.
Load kuenm and any other required packages, and define a working directory (if needed). In general, setting a working directory in R is considered good practice, as it provides better control over where files are read from or saved to. If users are not working within an R project, we recommend setting a working directory, since at least one file will be saved at later stages of this guide.
Note: functions from other packages (i.e., not from base R or kuenm) used in this guide will be displayed as package::function().
# Load packages library(kuenm2) library(terra) # Current directory getwd() # Define new directory #setwd("YOUR/DIRECTORY") # uncomment and modify if setting a new directory # Saving original plotting parameters original_par <- par(no.readonly = TRUE)
We will use occurrence records provided within the kuenm package. Most example data sets in the package were derived from Trindade & Marques (2024). The occ_data_noclean object contains 51 valid occurrences of Myrcia hatschbachii (a tree endemic to Southern Brazil) and a group of erroneous records to demonstrate cleaning steps.
# Import occurrences data(occ_data_noclean, package = "kuenm2") # Check data structure str(occ_data_noclean)
A raster layer included in the package will also be used in this example. This is a bioclimatic variable from WorldClim 2.1 at 10 arc-minute resolution. This layer has been masked using a polygon generated by drawing a minimum convex polygon around the records with a 300 km buffer.
# Import raster layers var <- terra::rast(system.file("extdata", "Current_variables.tif", package = "kuenm2")) # Keep only one layer var <- var$bio_1 # Check variable terra::plot(var)
Visualize occurrences records in geography:
# Visualize occurrences on one variable ## Create an extent based on the layer and the records to see all errors vext <- terra::ext(var) # extent of layer pext <- apply(occ_data_noclean[, 2:3], 2, range, na.rm = TRUE) # extent of records allext <- terra::ext(c(min(pext[1, 1], vext[1]), max(pext[2, 1], vext[2]), min(pext[1, 2], vext[3]), max(pext[2, 2], vext[4]))) + 1 # plotting records on the variable terra::plot(var, ext = allext, main = "Bio 1") points(occ_data_noclean[, c("x", "y")])
The basic data cleaning steps implemented in kuenm help to: remove missing data, eliminate duplicates, exclude typically (though not always) erroneous coordinates with 0 longitude and 0 latitude, and filter out records with low coordinate precision based on the number of decimal places.
Below is an example of cleaning missing data. This example uses a data.frame containing only the columns "Species", "x", and "y" (where "x" and "y" represent longitude and latitude, respectively). If the data.frame includes additional columns and not all of them should be considered when identifying missing values, users can specify which columns to use via the columns argument (default = NULL, which includes all columns). As with any function, we recommend consulting the documentation for more detailed explanations (e.g., help(remove_missing)).
# remove missing data mis <- remove_missing(data = occ_data_noclean, columns = NULL, remove_na = TRUE, remove_empty = TRUE) # quick check nrow(occ_data_noclean) nrow(mis)
The code below uses the previous results and continues the process of cleaning data by removing duplicates. The argument columns can be used as explained before. See full documentation with help(remove_duplicates).
# remove exact duplicates mis_dup <- remove_duplicates(data = mis, columns = NULL, keep_all_columns = TRUE) # quick check nrow(mis) nrow(mis_dup)
Continue the process by removing coordinates with values of 0 (zero) for longitude and latitude (not always needed, as this location is valid if working with some marine species). See full documentation with help(remove_corrdinates_00).
# remove records with 0 for x and y coordinates mis_dup_00 <- remove_corrdinates_00(data = mis_dup, x = "x", y = "y") # quick check nrow(mis_dup) nrow(mis_dup_00)
The following lines of code take the previous result and remove coordinates with low precision. If the longitude and latitude contain no or only a few decimal places, they may have been rounded, which can be problematic in some areas. This a step recommended when users know coordinate rounding is an issue. This filtering process can also be applied to longitude and latitude independently. See full documentation with help(filter_decimal_precision).
# remove coordinates with low decimal precision. mis_dup_00_dec <- filter_decimal_precision(data = mis_dup_00, x = "x", y = "y", decimal_precision = 2) # quick check nrow(mis_dup_00) nrow(mis_dup_00_dec)
Users can perform all this steps with a single function (initial_cleaning()) as follows:
# all basic cleaning steps clean_init <- initial_cleaning(data = occ_data_noclean, species = "species", x = "x", y = "y", remove_na = TRUE, remove_empty = TRUE, remove_duplicates = TRUE, by_decimal_precision = TRUE, decimal_precision = 2) # quick check nrow(occ_data_noclean) # original data nrow(clean_init) # data after all basic cleaning steps # a final plot to check par(mfrow = c(2, 2)) ## initial data terra::plot(var, ext = allext, main = "Initial data") points(occ_data_noclean[, c("x", "y")]) ## data after basic cleaning steps terra::plot(var, ext = allext, main = "After basic cleaning") points(clean_init[, c("x", "y")]) terra::plot(var, main = "After basic cleaning (zoom)") points(clean_init[, c("x", "y")])
Two additional cleaning steps are implemented in kuenm, removing cell duplicates and moving points to valid cells.
Removing cell duplicates involves excluding records that are not exact coordinate duplicates but are located within the same pixel. The process randomly selects one record from each cell to retain. See full documentation with help(remove_cell_duplicates).
# exclude duplicates based on raster cell (pixel) celldup <- remove_cell_duplicates(data = clean_init, x = "x", y = "y", raster_layer = var) # quick check nrow(clean_init) # data after all basic cleaning steps nrow(celldup) # plus removing cell duplicates
The following lines of code help adjust records that fall just outside valid raster cells to prevent data loss. Given the nature and resolution of raster layers, valid records are sometimes perceived as being outside the boundaries of cells with data. In such cases, an alternative is to move these records to the nearest valid cell. A distance limit is applied to avoid relocating records that are too far from the study area. See below for an example of how to use this functionality of kuenm. See full documentation with help(move_2closest_cell).
# move records to valid pixels moved <- move_2closest_cell(data = celldup, x = "x", y = "y", raster_layer = var, move_limit_distance = 10) # quick check nrow(celldup) # basic cleaning and no cell duplicates nrow(moved[moved$condition != "Not_moved", ]) # plus moved to valid cells
The function advanced_cleaning() facilitates the two processes from above in a single step:
# move records to valid pixels clean_data <- advanced_cleaning(data = clean_init, x = "x", y = "y", raster_layer = var, cell_duplicates = TRUE, move_points_inside = TRUE, move_limit_distance = 10) # exclude points not moved clean_data <- clean_data[clean_data$condition != "Not_moved", 1:3] # quick check nrow(occ_data_noclean) # original data nrow(clean_init) # data after all basic cleaning steps nrow(clean_data) # data after all basic cleaning steps # a final plot to check par(mfrow = c(3, 2)) ## initial data terra::plot(var, ext = allext, main = "Initial") points(occ_data_noclean[, c("x", "y")]) ## data after basic cleaning steps terra::plot(var, ext = allext, main = "Basic cleaning") points(clean_init[, c("x", "y")]) terra::plot(var, main = "Basic cleaning (zoom)") points(clean_init[, c("x", "y")]) ## data after basic cleaning steps terra::plot(var, main = "Final data") points(clean_data[, c("x", "y")]) ## zoom to a particular area, initial data terra::plot(var, xlim = c(-48, -50), ylim = c(-26, -25), main = "Initial (zoom +)") points(occ_data_noclean[, c("x", "y")]) ## zoom to a particular area, final data terra::plot(var, xlim = c(-48, -50), ylim = c(-26, -25), main = "Final (zoom +)") points(clean_data[, c("x", "y")])
Notes:
# Reset plotting parameters par(original_par)
The results of the data cleaning steps in kuenm are simple data.frames that may include a few additional columns and fewer records than the original dataset. An easy way to save these results is by writing them to CSV files. Although multiple options exist for saving this type of data, another useful alternative is to save it as an RDS file in your directory. See the examples below:
# Save as CSV write.csv(clean_data, file = "Clean_data.csv", row.names = FALSE) # Save as RDS saveRDS(clean_data, file = "Clean_data.rds")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.