knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(hypervolume)
library(ggplot2)
library(gridExtra)
set.seed(123)
data(iris)
data(quercus)

Introduction to Resampling Hypervolumes

When working with the hypervolume package, it is important to understand the statistical significance of the resulting hypervolume or hypervolumes. Constructing a hypervolume with the default parameters is a non-deterministic process. The methods introduced in this update are meant to characterize both variance from sampling and variance due to non-deterministic model construction.

This update to the package provides the following functionalities:
- an interface for generating large resamples of hypervolumes
- methods for generating non-parametric confidence intervals for hypervolume parameters and null distributions for overlap statistics

The purpose of this document is to provide use cases and explain best practices when using the new methods. The examples are chosen to highlight the considerations that go into interpreting results.

Use case 1: Effect of sample size on volume

The following code demonstrates the effect of sample size on hypervolumes constructed using Gaussian kernels. Fifty hypervolumes are constructed per sample size.

To plot a parameter of a hypervolume, a function must be passed to the func field of hypervolume_funnel. A user-supplied function must take a hypervolume object as input and return a number. By default, func = get_volume. When using hypervolume_funnel to plot the output of hypervolume_resample, a ggplot object is returned, so further plot elements can be added to the result.

# Run time with cores = 2
# We only resample up to size 50 for sake of computational ease
hv = hypervolume(iris[,1:4], verbose = FALSE)
resample_seq_path = hypervolume_resample("iris_hvs", hv, method = "bootstrap seq", n = 50, seq = c(10, 20, 30, 40, 50), cores = 2)

hypervolume_funnel(resample_seq_path, title = "Volume of Hypervolumes at Different Resample Sizes") + ylab("Volume")

plot1 = hypervolume_funnel(resample_seq_path, title = "Mean of Sepal Length at Different Resample Sizes",
                   func = function(x) {get_centroid(x)["Sepal.Length"]}) +
  ylab("Sepal Length")

plot2 = hypervolume_funnel(resample_seq_path, title = "Mean of Sepal Width at Different Resample Sizes",
                   func = function(x) {get_centroid(x)["Sepal.Width"]}) +
  ylab("Sepal Width")

plot3 = hypervolume_funnel(resample_seq_path, title = "Mean of Petal Length at Different Resample Sizes",
                   func = function(x) {get_centroid(x)["Petal.Length"]}) +
  ylab("Petal Length")

plot4 = hypervolume_funnel(resample_seq_path, title = "Mean of Petal Width at Different Resample Sizes",
                   func = function(x) {get_centroid(x)["Petal.Width"]}) +
  ylab("Petal Width")

grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)

The default construction of hypervolumes uses kde.bandwidth = estimate_bandwidth(data, method = "silverman"). The first plot shows that volume decreases with sample size because the Silverman bandwidth decreases with sample size. In fact, the Silverman bandwidth is not appropriate for multimodal data such as iris. The plot demonstrates this fact and shows that at small sample sizes, the hypervolume overestimates the true volume. Other methods for estimating bandwidth may be more accurate, but are computationally infeasible for data with more than three dimensions. The estimated volume converges to the true volume of the population as sample size increases; however, at 150 data points, the result from hypervolume_funnel suggests that the volume is still being overestimated.
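
To see this effect directly, estimate_bandwidth can be called on subsamples of different sizes; the per-dimension Silverman bandwidths shrink as the sample grows, which in turn shrinks the estimated volume. A quick check (the subsample size of 30 is arbitrary):

# Silverman bandwidths for a small subsample vs. the full iris data
estimate_bandwidth(iris[sample(1:150, 30), 1:4], method = "silverman")
estimate_bandwidth(iris[, 1:4], method = "silverman")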

On the other hand, the plots for the mean of each column of the data show that the centroid of the data is preserved by hypervolume construction using Gaussian kernels.

The confidence intervals in the plots are generated non-parametrically by taking quantiles at each sample size. In this example, each confidence interval is computed from the quantiles of 50 bootstrapped hypervolumes. Improving the accuracy of the intervals requires generating more hypervolumes per sample size, which proportionally increases run time. It is recommended to use more cores so that hypervolumes are generated in parallel; however, by default, cores = 1 and the function runs sequentially.
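
For instance, a two-sided 95% interval for the volume at a single resample size can be computed directly from the stored hypervolumes. A minimal sketch, assuming to_hv_list reads back the directory written by hypervolume_resample (the name "iris_boot" is arbitrary):

# Bootstrap 50 hypervolumes, each at the full sample size of 150 points
boot_path = hypervolume_resample("iris_boot", hv, method = "bootstrap", n = 50, cores = 2)
# Read the stored hypervolumes back in and take empirical quantiles of the volume
boot_hvs = to_hv_list(boot_path)
quantile(sapply(boot_hvs@HVList, get_volume), c(0.025, 0.975))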

Use case 2: Effect of simulating bias when resampling

The following code demonstrates the effect of applying a bias while resampling data. In this example, we use the iris data to construct a hypervolume object, then resample it while biasing toward large petal sizes. This is done by weighting points closer to the maximum observed values more heavily when resampling.

Weights can be applied when resampling points either by passing a user-defined function to the weight_func field of hypervolume_resample, or by specifying the mu and sigma fields. When using mu and sigma, the weight function is a multivariate normal distribution: mu is its mean, while sigma gives its covariance (in the example below, sigma is a vector of per-column variances, corresponding to a diagonal covariance matrix). cols_to_bias specifies which columns to use as the input of the weight function.

hv = hypervolume(iris[,1:4], verbose = FALSE)
cols_to_bias = c("Petal.Length", "Petal.Width")
mu = apply(hv@Data, 2, max)[cols_to_bias]
sigma = apply(hv@Data, 2, var)[cols_to_bias]*2
biased_path = hypervolume_resample("Petal bias", hv, method = "biased bootstrap", n = 1, mu = mu, sigma = sigma, cols_to_bias = cols_to_bias)

# Read in hypervolume object from file
biased_hv = readRDS(file.path(biased_path, "resample 1.rds"))

combined_dat = data.frame(rbind(hv@Data, biased_hv@Data))
combined_dat['Type'] = rep(c('original', 'biased'), each = 150)
plot1 = ggplot(combined_dat, aes(y = after_stat(density))) + geom_histogram(aes(x = Petal.Width, fill = Type), bins = 20) +
  facet_wrap(~Type) +
  ggtitle("Distribution of Petal Width", "Biased resample vs Original sample")
plot2 = ggplot(combined_dat, aes(y = after_stat(density))) + geom_histogram(aes(x = Petal.Length, fill = Type), bins = 20) +
  facet_wrap(~Type) +
  ggtitle("Distribution of Petal Length", "Biased resample vs Original sample")
grid.arrange(plot1, plot2, ncol = 2)
plot1 = ggplot(combined_dat, aes(y = after_stat(density))) + geom_histogram(aes(x = Sepal.Width, fill = Type), bins = 20) +
  facet_wrap(~Type) +
  ggtitle("Distribution of Sepal Width", "Biased resample vs Original sample")
plot2 = ggplot(combined_dat, aes(y = after_stat(density))) + geom_histogram(aes(x = Sepal.Length, fill = Type), bins = 20) +
  facet_wrap(~Type) +
  ggtitle("Distribution of Sepal Length", "Biased resample vs Original sample")
grid.arrange(plot1, plot2, ncol = 2)

The result shows that a bias is induced, but as a consequence, variance in all dimensions decreases because fewer unique points are sampled. The volume will also be significantly reduced if the applied bias is too strong. In this example, sigma is chosen arbitrarily as twice the variance of the original columns. Changing sigma controls the strength of the bias: the larger sigma is, the weaker the bias, and vice versa.
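
For biases that a normal kernel cannot express, a custom function can be passed to weight_func instead of mu and sigma. A minimal sketch, assuming weight_func receives a named row restricted to cols_to_bias and returns a non-negative weight:

# Weight points in proportion to Petal.Length, so longer petals
# are proportionally more likely to be resampled
linear_bias = function(x) {
  as.numeric(x["Petal.Length"])
}
linear_path = hypervolume_resample("Linear petal bias", hv, method = "biased bootstrap",
                                   n = 1, weight_func = linear_bias,
                                   cols_to_bias = "Petal.Length")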

Use case 3: Using overlap statistics to test for similarity of populations

The following code demonstrates how to test the null hypothesis that two samples come from the same distribution. In this example, we treat the quercus data as the population and independently draw two samples of 50 points each.

To test whether the two samples come from the same distribution, there are two approaches. In the first approach, we use the combined sample data as an approximation of the true distribution. To generate the null distribution of the overlap statistics, hypervolumes of size 50 are resampled from the combined data of the original samples. The overlaps of the resampled hypervolumes are used to generate the distribution of the overlap statistics. If the sizes of the two samples are the same, the function takes half of the hypervolumes and overlaps them with each of the other hypervolumes. (See the documentation for the case when the sample sizes are not the same.)

The second approach is a permutation test. For this method, the combined sample data is rearranged and split. A pair of hypervolumes is generated from each split, and overlap statistics are computed for each pair.

The benefit of the first method is the ability to generate multiple overlap statistics per hypervolume. If both methods generate $N$ hypervolumes, the first method yields $\frac{N^2}{4}$ overlap statistics while the second method yields $\frac{N}{2}$; for example, with $N = 100$ hypervolumes, the first method produces $2500$ overlap statistics while the second produces only $50$. Since hypervolume generation can be a non-deterministic process, method one accounts for more of the variance from generating the hypervolumes. However, when the sample size is small, the combined data may not be a good approximation of the population. In that case, it is better to use the second method, since it makes no assumptions about the population, and generating more hypervolumes is cheaper for smaller sample sizes.

data("quercus")
qsample1 = quercus[sample(1:nrow(quercus), 50),]
qsample2 = quercus[sample(1:nrow(quercus), 50),]

hv1 = hypervolume(qsample1[,2:3])
hv2 = hypervolume(qsample2[,2:3])

# Method 1
combined_sample = rbind(qsample1, qsample2)
population_hat = hypervolume(combined_sample[,2:3])

method1_path = hypervolume_resample("quercus_50_boot", population_hat, "bootstrap", n = 50, points_per_resample = 50, cores = 2)

result1 = hypervolume_overlap_test(hv1, hv2, method1_path, cores = 2)

# Method 2
method2_path = hypervolume_permute("quercus_50_perm", hv1, hv2, n = 100, cores = 2)

result2 = hypervolume_overlap_test(hv1, hv2, method2_path, cores = 2)

# Graphical results of the Sorensen statistic
plot1 = result1$plots$sorensen + ggtitle("Method 1") + xlab("Sorensen Index")
plot2 = result2$plots$sorensen + ggtitle("Method 2") + xlab("Sorensen Index")
grid.arrange(plot1, plot2, ncol=2)

For our example, the red point shows the observed value of the Sorensen index. Both plots show that obtaining the observed value or less is highly likely; therefore, the null hypothesis cannot be rejected.
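
The conclusion can also be checked numerically; a quick look, assuming the list returned by hypervolume_overlap_test exposes a p_values element named like its plots:

# p values for the Sorensen index under each null distribution
result1$p_values$sorensen
result2$p_values$sorensen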

# Remove the hypervolume objects written to disk by the examples above
unlink("./Objects", recursive = TRUE)

