run_sampler: Aggregated and overdispersed sampling
In BrunoVilela/sampler: Aggregated and overdispersed sampling

Description Usage Arguments Details Value Author(s) See Also Examples

Generates aggregated (underdispersed) or overdispersed samples of values from any given distance matrix (class matrix).

1 2	run_sampler(x, n, alpha, n_start = 1, return_start = FALSE, starting = NULL)

`x`	`matrix` indicating the distance (any unit) between sample units. Row and column names should be given.
`n`	A positive integer number indicating the sample size.
`alpha`	Number indicating the strength of aggregation (if negative) or overdispersion (if positive). When alpha = 0 sample is random.
`n_start`	Number of initial selected points. Default is one starting point.
`return_start`	if `TRUE` the starting point is returned.
`starting`	Character vector indicating the starting point. If not provided random starting value(s) is(are) selected.

Given a distance matrix (x), run_sampler resample n sample units with an attraction (positive) or repulsive (negative) effect determined by alpha(α). The algorithm begins selecting one random starting point i. The following sample unit is then selected based on the probability given by the distance of i to each remaining units raised to the power of alpha (pr(j | i) = d_{i,j} ^ α). The following selections will then use a joint probability. The first calculated as the average distance d of the remaining unit j to the selected ones k (pr1(j | k) = d_{k,j} ^ α). The second as the minimum distance dmin of the remaining units to the selected ones (pr2(j | k) = dmin_{k,j} ^ α). The second probability guarantees that representativeness is achieved. The procedure is repeated until the selected points reach n. Positive values of alpha generate overdispersed sample designs, as sample units distant from the selected unit(s) will have a higher probability of being selected. Inversely, negative values will generate an aggregated design. Note that as alpha approximate the infinity (+ or -), the sample design becomes more deterministic.

The function returns a vector indicating the selected rows. If return_start is TRUE, a list is returned with the first element being the Sampling_selection - selected sampling units - and Starting_points - selected starting point(s).

Bruno Vilela

run_sampler_phy

run_sampler_geo

# Phylogeny example:
## Generate a random tree
require(ape)
tree <- rcoal(10)
## Calculate the distance
dist <- cophenetic(tree)
## Highly overdispersed 50% resample design (alpha = 50)
selection <- run_sampler(x = dist, n = 5, alpha = 50, starting = "t10")
## Highly aggregated 50% resample design (alpha = -50)
selection2 <- run_sampler(x = dist, n = 5, alpha = -50, starting = "t10")
## Random 50% resample design (alpha = 0)
selection3 <- run_sampler(x = dist, n = 5, alpha = 0, starting = "t10")
## Plot to compare
par(mfrow = c(1, 3))
plot(tree,tip.color=ifelse(tree$tip.label %in% selection, "red","black"),
 main = "Overdispersed 50% sampling (red were selected)", cex = 1)
 axis(1)
plot(tree,tip.color=ifelse(tree$tip.label %in% selection2, "blue","black"),
main = "Aggregated 50% sampling (blue were selected)", cex = 1)
axis(1)
plot(tree,tip.color=ifelse(tree$tip.label %in% selection3, "green","black"),
main = "Random 50% sampling (green were selected)", cex = 1)
axis(1)

# Geography example
require(sp)
require(maptools)
require(fields)
data(wrld_simpl)  # World map
Brazil <- wrld_simpl[wrld_simpl$NAME == "Brazil", ]  # Brazil (polygon)
coords <- slot(spsample(Brazil, 100, "regular"), "coords")
rownames(coords) <- paste0("t", 1:nrow(coords))
## Calculate the geographic distance
dist.geo <- rdist.earth(coords)
## Subsample 50%
### Overdispersed
selection.geo <- run_sampler(x = dist.geo, n = 25, alpha = 100, starting = "t10")
### Aggregated
selection.geo2 <- run_sampler(x = dist.geo, n = 25, alpha = -100, starting = "t10")
### Random
selection.geo3 <- run_sampler(x = dist.geo, n = 25, alpha = 0, starting = "t10")

## Plot
par(mfrow = c(1, 3), mar = c(1, 1, 15, 1))
plot(Brazil, main = "Overdispersed 50% sampling (red were selected)")
points(coords, cex = 2, pch = 19,
col = ifelse(rownames(coords) %in% selection.geo, "red","gray"))
plot(Brazil, main = "Aggregated 50% sampling (blue were selected)")
points(coords, cex = 2, pch = 19,
col = ifelse(rownames(coords) %in% selection.geo2, "blue","gray"))
plot(Brazil, main = "Random 50% sampling (green were selected)")
points(coords, cex = 2, pch = 19,
col = ifelse(rownames(coords) %in% selection.geo3, "green","gray"))

# Trait example
## Fake body size
set.seed <- 1
body_size <- runif(1000)
# Biased sample towards large species
set.seed <- 1
body_size_bias <- sample(body_size, 500, prob = body_size)
par(mfrow = c(1, 3))
hist(body_size, main = "Species body size distribution\n(n = 1000)",  xlab = "Body size")
hist(body_size_bias, main = "Biased samplig towards larger species\n(n = 500)",
xlab = "Body size")
# Use sampler to reduce the bias
dist_bs <- as.matrix(dist(body_size_bias))
rownames(dist_bs) <- colnames(dist_bs) <- 1:length(body_size_bias)
selection.bs <- run_sampler(x = dist_bs, n = 100, alpha = 100)
hist(body_size_bias[as.numeric(selection.bs)],
 main = "Overdispersed sampling of biased information \n(n = 100)",
 xlab = "Body size")


# Real time simulation
require(raster)
par(mfrow = c(1, 1))
r <- raster(res = 25)
values(r) <- runif(ncell(r))
plot(r)
coords <- xyFromCell(r, 1:ncell(r))
rownames(coords) <- 1:ncell(r)
dist.geo <- as.matrix(dist(coords))
startingI <- c(1)
# Change alpha and see how it works
for(i in (length(startingI)+1):30) {
selection.geo <- run_sampler(x = dist.geo, n = i, alpha = 100,
starting = startingI)
startingI <- as.numeric(selection.geo)
r2 <- r
values(r2)[as.numeric(selection.geo)] <- 1
values(r2)[-as.numeric(selection.geo)] <- NA
plot(r2, col = "gray")
Sys.sleep(time = .2)
}