EG_selection: Selection of survey sites maximizing uniformity in...
In biosurvey: Tools for Biological Survey Planning


Selection of sites to be sampled in a survey, with the goal of maximizing uniformity of points in environmental space while considering the geographic patterns of the data. Sets of points that are environmentally similar but have a disjoint pattern in geography are sampled twice (two survey sites are placed, one in each of the two largest geographic clusters).

EG_selection(master, n_blocks, guess_distances = TRUE, initial_distance = NULL,
             increase = NULL, max_n_samplings = 1, replicates = 10,
             use_preselected_sites = TRUE, select_point = "E_centroid",
             cluster_method = "hierarchical", median_distance_filter = NULL,
             sample_for_distance = 250, set_seed = 1,
             verbose = TRUE, force = FALSE)

`master`	master_matrix object derived from the function `prepare_master_matrix` or master_selection object derived from functions `random_selection`, `uniformG_selection`, or `uniformE_selection`.
`n_blocks`	(numeric) number of blocks to be selected and used as the base for further explorations. No default is set in the usage above, so a value must be provided.
`guess_distances`	(logical) whether or not to use internal algorithm to automatically select `initial_distance` and `increase`. Default = TRUE. If FALSE, `initial_distance` and `increase` must be defined.
`initial_distance`	(numeric) Euclidean distance to be used for a first process of thinning and detection of remaining blocks. See details in `point_thinning`. Default = NULL.
`increase`	(numeric) initial value to be added to or subtracted from `initial_distance` until reaching the number of blocks defined in `n_blocks`. Default = NULL.
`max_n_samplings`	(numeric) maximum number of samples to be chosen after performing all thinning `replicates`. Default = 1.
`replicates`	(numeric) number of thinning replicates performed to select blocks uniformly. Default = 10.
`use_preselected_sites`	(logical) whether to use sites that were defined as part of the selected sites before any selection. The object in `master` must contain the preselected site(s) in an element named "preselected_sites" for this argument to be effective. Default = TRUE. See details for more information on the approach used.
`select_point`	(character) how or which point will be selected for each block or cluster. Three options are available: "random", "E_centroid", and "G_centroid". E_ or G_ centroid indicates that the point(s) closest to the respective centroid (environmental or geographic) will be selected. Default = "E_centroid".
`cluster_method`	(character) name of the method to be used for detecting geographic clusters of points inside each block. Options are "hierarchical" and "k-means"; default = "hierarchical". See details in `find_clusters`.
`median_distance_filter`	(character) optional filter, based on the median distance among selected sites, used to choose among sets of sampling sites. Options are: "max" and "min". The default, NULL, does not apply such a filter. See details.
`sample_for_distance`	(numeric) sample to be considered when measuring the geographic distances among points in blocks created in environmental space. The distances measured are then used to test whether points are distributed uniformly or not in the geography. Default = 250.
`set_seed`	(numeric) integer value to specify an initial seed. Default = 1.
`verbose`	(logical) whether or not to print messages about the process. Default = TRUE.
`force`	(logical) whether to replace an existing set of sites selected with this method in `master`. Default = FALSE.
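
As an illustration of the "E_centroid" and "G_centroid" options of `select_point`, the following base R sketch picks the point closest to a centroid. It is not the package's internal code: the data frame `block` and the helper `closest_to_centroid` are hypothetical, and plain Euclidean distance on longitude and latitude is used only to keep the example short.

```r
# Hypothetical points of one block, with geographic (Longitude, Latitude)
# and environmental (PC1, PC2) coordinates. Values are made up.
block <- data.frame(
  Longitude = c(-100, -99, -97, -98),
  Latitude  = c(20, 21, 20, 22),
  PC1       = c(0, 1, 2, 9),
  PC2       = c(0, 1, 2, 1)
)

# Hypothetical helper: return the row closest (Euclidean distance) to the
# centroid of the columns named in `cols`.
closest_to_centroid <- function(data, cols) {
  xy <- as.matrix(data[, cols])
  centroid <- colMeans(xy)
  d <- sqrt(rowSums(sweep(xy, 2, centroid)^2))
  data[which.min(d), , drop = FALSE]
}

# Analogous to select_point = "E_centroid" (environmental space)
e_site <- closest_to_centroid(block, c("PC1", "PC2"))

# Analogous to select_point = "G_centroid" (geographic space)
g_site <- closest_to_centroid(block, c("Longitude", "Latitude"))
```

With these values the two options return different rows, which shows that the choice of centroid can change the site selected within the same block.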

Two important steps are needed before using this function: 1) exploring data in environmental and geographic spaces, and 2) performing a regionalization of the environmental space. Exploring the data can be done using the function explore_data_EG. This step is optional but strongly recommended, as important decisions depend on the configuration of the data in the two spaces. A regionalization of the environmental space of the region of interest helps in defining important parts of your region that should be considered when selecting sites. This can be done using the function make_blocks. Later, the regions created in environmental space will be used for selecting one or more sampling sites per block, depending on the geographic pattern of such environmental combinations.

The process of survey-site selection with this function is the most complex among all functions in this package. The complexity derives from the aim of the function, which is to select sites that appropriately sample the environmental combinations in the region of interest (environmental space), while considering the geographic patterns of such environmental regions (geographic space).

In this approach, the first step is to select candidate blocks (from the ones obtained with make_blocks) that are uniformly distributed in environmental space. The geographic configuration of points in such blocks is explored to detect whether they are clustered (i.e., similar environmental conditions are present in distant places in the region of interest). For blocks with points that present one cluster in geography, only one survey site is selected, and for those with multiple clusters in geographic space, two survey sites are selected considering the two largest clusters.
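
The idea of placing two sites in the two largest geographic clusters of a block can be sketched with base R. This is only an illustration under stated assumptions, not the algorithm implemented in find_clusters: the points are simulated, the number of clusters (3) is assumed to be known, and one site per cluster is taken as the point closest to the cluster's geographic centroid.

```r
set.seed(1)

# Simulated block: the same environmental conditions occur in three
# distant areas (30 points in the west, 20 in the east, 5 in the north).
geo <- rbind(
  cbind(rnorm(30, -105, 0.2), rnorm(30, 20, 0.2)),
  cbind(rnorm(20, -95, 0.2), rnorm(20, 20, 0.2)),
  cbind(rnorm(5, -100, 0.2), rnorm(5, 28, 0.2))
)
colnames(geo) <- c("Longitude", "Latitude")

# Hierarchical clustering on geographic distances, cut into 3 groups
# (the number of groups is an assumption made for this example).
tree <- hclust(dist(geo), method = "complete")
groups <- cutree(tree, k = 3)

# Identify the two largest clusters
sizes <- sort(table(groups), decreasing = TRUE)
two_largest <- as.integer(names(sizes)[1:2])

# One site per cluster: the point closest to the cluster centroid
sites <- do.call(rbind, lapply(two_largest, function(g) {
  pts <- geo[groups == g, , drop = FALSE]
  ctr <- colMeans(pts)
  pts[which.min(rowSums(sweep(pts, 2, ctr)^2)), , drop = FALSE]
}))

sites
```

The small northern cluster receives no site, which mirrors the rule described above: at most two sites per block, placed in the largest clusters.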

If use_preselected_sites is TRUE and such sites are included as an element in the object in master, the approach for selecting sites in environmental space considering geographic patterns is a little different. User-preselected sites will always be part of the sites selected. Other points are selected based on an algorithm that searches for sites that are uniformly distributed in environmental space but at a distance from preselected sites that helps in maintaining uniformity among the environmental blocks selected. Note that preselected sites are not processed; therefore, uniformity of the blocks representing such points cannot be guaranteed.

As multiple sets could result from selection, the argument median_distance_filter can be used to select the set of sites with the maximum ("max") or minimum ("min") median geographic distance among selected sites. Option "max" will increase the geographic distance among sampling sites, which could be desirable if the goal is to cover the region of interest more broadly. The other option, "min", could be used when the goal is to reduce the resources and time needed to sample such sites.
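
The logic of the median distance filter can be sketched as follows. This is a simplified illustration, not the package's code: the three candidate sets are made up, `median_distance` is a hypothetical helper, and Euclidean distance on the coordinates stands in for whatever geographic distance the package computes.

```r
# Hypothetical candidate sets of selected sites
candidate_sets <- list(
  set_1 = data.frame(Longitude = c(0, 1, 2), Latitude = c(0, 0, 0)),
  set_2 = data.frame(Longitude = c(0, 5, 10), Latitude = c(0, 0, 0)),
  set_3 = data.frame(Longitude = c(0, 2, 4), Latitude = c(0, 0, 0))
)

# Hypothetical helper: median of all pairwise distances among sites
median_distance <- function(sites) {
  median(as.numeric(dist(sites[, c("Longitude", "Latitude")])))
}

med <- vapply(candidate_sets, median_distance, numeric(1))

# Analogous to median_distance_filter = "max": broadest coverage
widest <- names(med)[which.max(med)]

# Analogous to median_distance_filter = "min": least travel among sites
tightest <- names(med)[which.min(med)]
```

Here `widest` identifies the set whose sites are most spread out and `tightest` the most compact one; with no filter, all candidate sets would remain available.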

A master_selection object (S3) with a special element called selected_sites_EG containing one or more sets of selected sites depending on max_n_samplings and median_distance_filter.

uniformG_selection, uniformE_selection, random_selection, make_blocks, plot_sites_EG

# Data
data("m_matrix", package = "biosurvey")

# Making blocks for analysis
m_blocks <- make_blocks(m_matrix, variable_1 = "PC1", variable_2 = "PC2",
                        n_cols = 10, n_rows = 10, block_type = "equal_area")

# Checking column names
colnames(m_blocks$data_matrix)

# Selecting sites uniformly in E and G spaces
EG_sel <- EG_selection(master = m_blocks, n_blocks = 10,
                       initial_distance = 1.5, increase = 0.1,
                       replicates = 1, max_n_samplings = 1,
                       select_point = "E_centroid",
                       cluster_method = "hierarchical",
                       sample_for_distance = 100)

head(EG_sel$selected_sites_EG[[1]])
dim(EG_sel$selected_sites_EG[[1]])