split_data: Split occurrence data into training and testing data
In marlonecobos/ellipsenm: Ecological Niche's Characterizations Using Ellipsoids

split_data

R Documentation

Description

split_data splits occurrence records into training and testing sets using one of two methods: a random split or a spatial block split.

Usage

split_data(data, method = "random", longitude, latitude,
           train_proportion, raster_layer = NULL,
           background_n = 10000, save = FALSE, name = "occurrences")

Arguments

`data`	data.frame of occurrence records containing at least longitude and latitude columns.
`method`	(character) method for selecting training and testing data. Options are: "random" and "block"; default = "random".
`longitude`	(character) if `method` = "block", name of the column with longitude data.
`latitude`	(character) if `method` = "block", name of the column with latitude data.
`train_proportion`	(numeric) proportion (from 0 to 1) of data to be used as training occurrences. The remaining data will be used for testing. Default = 0.5 if `method` = "random", or 0.75 if `method` = "block".
`raster_layer`	optional RasterLayer to prepare background data if `method` = "block".
`background_n`	(numeric) optional number of coordinates to be extracted using the `raster_layer`. Default = 10000.
`save`	(logical) whether or not to save the results in the working directory. Default = FALSE.
`name`	(character) if `save` = TRUE, name of the csv files to be written (common name for all files). A suffix will be added depending on the type of data: complete set, training set, or testing set of occurrences. Format (.csv) is automatically added; default = "occurrences".

Value

A list containing all, training, and testing occurrences. If save = TRUE, three csv files will be written in the working directory according to the name defined in name plus the suffix _all for all records, _train for the training set, and _test for the testing set.
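
The shape of the result for a random split can be illustrated with a minimal base R sketch. This is not the package's implementation: `random_split_sketch` is a hypothetical helper, and the element names `all`, `train`, and `test` are an assumption based on the file suffixes described above.

```r
# Hypothetical helper, not part of ellipsenm: a base R random split that
# returns a list shaped like the one described above.
random_split_sketch <- function(data, train_proportion = 0.5) {
  stopifnot(train_proportion > 0, train_proportion < 1)
  n <- nrow(data)
  # sample row indices for the training set, without replacement
  train_rows <- sample(n, size = round(n * train_proportion))
  # every row not sampled for training goes to the testing set
  test_rows <- setdiff(seq_len(n), train_rows)
  list(all = data,
       train = data[train_rows, , drop = FALSE],
       test = data[test_rows, , drop = FALSE])
}

# simulated occurrences (made-up coordinates, for illustration only)
set.seed(1)
occ <- data.frame(longitude = runif(20, -100, -90),
                  latitude = runif(20, 15, 25))

parts <- random_split_sketch(occ, train_proportion = 0.7)
sapply(parts, nrow)
```

With 20 records and a proportion of 0.7, the training set holds 14 rows and the testing set the remaining 6.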

If method = "block", an additional data.frame containing all data and an extra column with the ID of the block each record belongs to will be added to the resulting list. If save = TRUE, this data.frame will be written with the suffix _block. If a raster layer is given in raster_layer, background coordinates will also be returned as part of this list, named bg_all, bg_train, bg_test, and bg_block for all, training, testing, and all background with assigned blocks, respectively.
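
This page does not describe how blocks are delimited. One common way to form spatial blocks, shown here only as an assumption and not necessarily what split_data does, is to cut the records at the median latitude and then at the median longitude within each half, which gives four blocks of roughly equal size. Holding one block out for testing then leaves about 75% of the records for training. `block_split_sketch` and the element names in its result are hypothetical.

```r
# Hypothetical helper, not the ellipsenm algorithm: four blocks by median cuts.
block_split_sketch <- function(data, longitude, latitude) {
  lon <- data[[longitude]]
  lat <- data[[latitude]]
  # split records into southern and northern halves at the median latitude
  north <- lat > median(lat)
  # within each half, split into west and east at that half's median longitude
  east <- logical(nrow(data))
  east[north] <- lon[north] > median(lon[north])
  east[!north] <- lon[!north] > median(lon[!north])
  # block IDs: 1 = SW, 2 = SE, 3 = NW, 4 = NE
  block <- 1L + 2L * as.integer(north) + as.integer(east)
  # hold out one block for testing (block 4 here, an arbitrary choice)
  is_test <- block == 4L
  list(all = data,
       train = data[!is_test, , drop = FALSE],
       test = data[is_test, , drop = FALSE],
       block = cbind(data, block = block))
}

# simulated occurrences (made-up coordinates, for illustration only)
set.seed(1)
occ <- data.frame(longitude = runif(40, -100, -90),
                  latitude = runif(40, 15, 25))

parts <- block_split_sketch(occ, "longitude", "latitude")
table(parts$block$block)
sapply(parts, nrow)
```

Unlike a random split, the testing records here all come from one region, so the evaluation is spatially separated from the training data.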

Examples

# reading data
occurrences <- read.csv(system.file("extdata", "occurrences.csv",
                                    package = "ellipsenm"))

# random split: 50% for training and 50% for testing
data_split <- split_data(occurrences, train_proportion = 0.5)

names(data_split)
lapply(data_split, head)
lapply(data_split, dim)

# random split: 70% for training and 30% for testing
data_split1 <- split_data(occurrences, train_proportion = 0.7)

names(data_split1)
lapply(data_split1, head)
lapply(data_split1, dim)

# block split: 75% for training and 25% for testing
data_split2 <- split_data(occurrences, method = "block", longitude = "longitude",
                          latitude = "latitude", train_proportion = 0.75)

names(data_split2)
lapply(data_split2, head)
lapply(data_split2, dim)

# split data using blocks and preparing background
r_layer <- raster::raster(system.file("extdata", "bio_1.tif",
                                      package = "ellipsenm"))

data_split3 <- split_data(occurrences, method = "block", longitude = "longitude",
                          latitude = "latitude", train_proportion = 0.75,
                          raster_layer = r_layer)

# saving data
data_split4 <- split_data(occurrences, train_proportion = 0.7, save = TRUE,
                          name = "occs")

# checking the working directory for the written csv files
dir()

marlonecobos/ellipsenm documentation built on Oct. 18, 2023, 8:09 a.m.