library(hypervolume)
#> Loading required package: Rcpp
library(palmerpenguins)
library(ggplot2)
library(gridExtra)
library(raster) # needed later for getData(), crop(), extract(), and cellStats()
set.seed(123)
data(penguins)
data(quercus)
When working with the package hypervolume, it is
important to understand the statistical significance of the resulting
hypervolume or hypervolumes. The methods introduced in this update are
meant to characterize both variance from data sampling and variance due
to non-deterministic behavior in the hypervolume algorithms.
This update to the package provides the following
functionalities:
- an interface for generating large resamples of hypervolumes
- methods for generating non-parametric confidence intervals for
hypervolume parameters and null distributions for overlap
statistics
- formal statistical tests based on hypervolumes
The purpose of this document is to provide use cases and explain best practices when using the new methods. The examples are chosen to highlight all the considerations that go into interpreting results.
The following code demonstrates visualizing the effect of sample size on hypervolumes constructed using Gaussian kernels. Thirty hypervolumes are constructed per sample size.
To plot how a summary statistic describing a hypervolume varies with
sample size, a function must be passed to the func argument of
hypervolume_funnel. A user-supplied function must take a
hypervolume object as input and return a single
numeric. By default, func = get_volume. The
confidence intervals in the plots are generated non-parametrically by
taking quantiles at each sample size. When using
hypervolume_funnel to plot the output of
hypervolume_resample, a ggplot object is returned, so
further plot elements can be added to the result.
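As an illustration of the required signature, here is a hypothetical summary function of our own (not part of the package) that could be passed as func:
# A sketch of a custom summary statistic: the largest bill length among
# the hypervolume's uniformly sampled random points. Any function that
# takes a hypervolume and returns a single numeric can be passed as func.
max_bill_length = function(hv) {
  max(hv@RandomPoints[, "bill_length_mm"])
}
# later: hypervolume_funnel(resample_seq_path, func = max_bill_length)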
# Run time with cores = 20 is around 25 minutes
hv = hypervolume(na.omit(penguins)[,3:4], verbose = FALSE)
resample_seq_path = hypervolume_resample("penguins_hvs", hv, method = "bootstrap seq", n = 30, seq = c(100, 125, 150, 175, 200, 225, 250, 275, 300), cores = 20)
hypervolume_funnel(resample_seq_path, title = "Volume of Hypervolumes at Different Resample Sizes") + ylab("Volume")
plot1 = hypervolume_funnel(resample_seq_path, title = "Mean of Bill Length at Different Resample Sizes",
func = function(x) {get_centroid(x)["bill_length_mm"]}) +
ylab("Bill Length (mm)")
plot2 = hypervolume_funnel(resample_seq_path, title = "Mean of Bill Depth at Different Resample Sizes",
func = function(x) {get_centroid(x)["bill_depth_mm"]}) +
ylab("Bill Depth (mm)")
grid.arrange(plot1, plot2, nrow = 2)
The default construction of hypervolumes uses
kde.bandwidth = estimate_bandwidth(data, method = "silverman").
The first plot shows that volume decreases with sample size because
the Silverman bandwidth shrinks as sample size grows. In fact, the
Silverman estimator is not appropriate for multimodal data such as
penguins; the plot demonstrates this and shows that at
small sample sizes the hypervolume overestimates the true volume. Other
bandwidth estimators may be more accurate, but they are
computationally infeasible for data with more than three dimensions. The
estimated volume converges toward the true volume of the population as
sample size increases; however, even at 300 data points, the result from
hypervolume_funnel suggests that the volume is still being
overestimated.
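As a point of comparison, the snippet below is a minimal sketch (not run) of constructing the same hypervolume with an alternative bandwidth estimator; estimate_bandwidth also accepts method = "plugin" and method = "cross-validation", which tend to be practical only in low dimensions.
# A sketch (not run): a cross-validation bandwidth estimate instead of
# the Silverman default. Potentially more accurate, but slow beyond
# roughly 3 dimensions.
penguin_dat = na.omit(penguins)[,3:4]
hv_cv = hypervolume(penguin_dat,
                    kde.bandwidth = estimate_bandwidth(penguin_dat, method = "cross-validation"),
                    verbose = FALSE)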
In contrast, the plots for the mean of each column of data show that the centroid of the data is preserved by hypervolume construction using Gaussian kernels.
In the example, each confidence interval is a quantile of 30
resampled values. Improving the accuracy of the intervals requires more
resamples per sample size, which drastically increases run time. It is
recommended to use more cores so that hypervolumes are generated in
parallel; however, by default, cores = 1 and the function runs sequentially.
The following code demonstrates the effect of applying a bias while
resampling data. We use the penguins data to
construct a hypervolume object, then resample it while biasing towards
large beak sizes. Here, this is done by more strongly
weighting the points closer to the maximum observed values when
resampling.
Weights can be applied when resampling points either by passing a
user-defined function to the weight_func argument of
hypervolume_resample, or by specifying the mu
and sigma parameters. When using mu and
sigma, the weight function is a multivariate normal
density: mu is its mean, and sigma is its
diagonal covariance matrix. cols_to_bias
specifies which columns are used as the input to the weight function.
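For illustration, the snippet below sketches the weight_func route; bias_toward_long_bills is our own hypothetical function, and we assume weight_func receives one data point (restricted to cols_to_bias, all columns by default) and returns a non-negative numeric weight.
# A sketch (not run): a user-defined weight function that upweights
# points with longer bills; the denominator controls bias strength.
bias_toward_long_bills = function(point) {
  unname(exp(point["bill_length_mm"] / 10))
}
custom_biased_path = hypervolume_resample("Custom bill bias", hv,
                                          method = "biased bootstrap", n = 1,
                                          weight_func = bias_toward_long_bills)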
hv = hypervolume(na.omit(penguins)[,3:6], verbose = FALSE)
#> Warning in hypervolume(na.omit(penguins)[, 3:6], verbose = FALSE):
#> Consider removing some axes.
#> Warning in hypervolume(na.omit(penguins)[, 3:6], verbose = FALSE): Some dimensions have much higher standard deviations than others:
#> bill_length_mm 5.47
#> bill_depth_mm 1.97
#> flipper_length_mm 14.02
#> body_mass_g 805.22
#> Consider rescaling axes before analysis.
#> Note that the formula used for the Silverman estimator differs in version 3 compared to prior versions of this package.
#> Use method='silverman-1d' to replicate prior behavior.
cols_to_bias = c("bill_length_mm", "bill_depth_mm")
mu = apply(hv@Data, 2, max)[cols_to_bias]
sigma = apply(hv@Data, 2, var)[cols_to_bias]*2
biased_path = hypervolume_resample("Bill bias", hv, method = "biased bootstrap", n = 1, mu = mu, sigma = sigma, cols_to_bias = cols_to_bias)
#> Warning: executing %dopar% sequentially: no parallel backend registered
#> Note that the formula used for the Silverman estimator differs in version 3 compared to prior versions of this package.
#> Use method='silverman-1d' to replicate prior behavior.
# Read in hypervolume object from file
biased_hv = readRDS(file.path(biased_path, "resample 1.rds"))
combined_dat = data.frame(rbind(hv@Data, biased_hv@Data))
combined_dat['Type'] = rep(c('original', 'biased'), each = nrow(hv@Data))
plot1 = ggplot(combined_dat, aes(y = ..density..)) + geom_histogram(aes(x = bill_depth_mm, fill = Type), bins = 20) +
facet_wrap(~Type) +
ggtitle("Distribution of Bill Depth", "Biased resample vs Original sample") +
xlab("bill depth (mm)")
plot2 = ggplot(combined_dat, aes(y = ..density..)) + geom_histogram(aes(x = bill_length_mm, fill = Type), bins = 20) +
facet_wrap(~Type) +
ggtitle("Distribution of Bill Length", "Biased resample vs Original sample") +
xlab("bill length(mm)")
grid.arrange(plot1, plot2, nrow = 2)
plot1 = ggplot(combined_dat, aes(y = ..density..)) + geom_histogram(aes(x = flipper_length_mm, fill = Type), bins = 20) +
facet_wrap(~Type) +
ggtitle("Distribution of Flipper Length", "Biased resample vs Original sample") +
xlab("flipper length (mm)")
plot2 = ggplot(combined_dat, aes(y = ..density..)) + geom_histogram(aes(x = body_mass_g, fill = Type), bins = 20) +
facet_wrap(~Type) +
ggtitle("Distribution of Body Mass", "Biased resample vs Original sample") +
xlab("body mass (g)")
grid.arrange(plot1, plot2, nrow = 2)
The result shows that a bias is induced, but as a side effect, the variance of every dimension decreases because fewer unique points are sampled. The volume will also be significantly reduced if the applied bias is strong. Therefore, it is recommended to only apply a strong bias to larger datasets. In this example, sigma is chosen arbitrarily as twice the variance of the original columns. The larger sigma is, the weaker the bias, and vice versa.
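To make the sigma-to-strength relationship concrete, the following sketch (not run) halves sigma to induce a stronger bias toward the same mu; the name "Strong bill bias" is arbitrary.
# A sketch (not run): a smaller sigma concentrates resampling weights
# around mu, producing a stronger bias toward large bills.
strong_sigma = apply(hv@Data, 2, var)[cols_to_bias] / 2
strong_biased_path = hypervolume_resample("Strong bill bias", hv,
                                          method = "biased bootstrap", n = 1,
                                          mu = mu, sigma = strong_sigma,
                                          cols_to_bias = cols_to_bias)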
The following code demonstrates how to test the null hypothesis that
two samples come from the same distribution. In this example, we map the
longitude and latitude data from quercus to a four-dimensional
climate space, as in demo(quercus).
To test whether the two species Quercus rubra and Quercus alba have the same climate niche, there are two approaches. In the first approach, we use the combined sample data as an approximation of the true distribution. To generate the null distribution of the overlap statistics, we treat all of the data as having the same label and bootstrap hypervolumes from the combined data. The overlaps of the resampled hypervolumes are then used to build the distribution of the overlap statistics. If the two samples are the same size, the function takes half of the hypervolumes and overlaps each of them with each of the remaining hypervolumes. In this case, since the sample sizes of Quercus rubra and Quercus alba differ, we need to bootstrap an equal number of hypervolumes at each sample size.
The second approach is a permutation test. For this method, the labels of the data are rearranged, then the data is split by label. A pair of hypervolumes is generated from each split, and overlap statistics are computed for each pair.
The benefit of the first method is the ability to generate multiple overlap statistics per hypervolume. If both methods generate \(N\) hypervolumes, the first method yields \(\frac{N^2}{4}\) overlap statistics (each of the \(N/2\) hypervolumes at one sample size is overlapped with each of the \(N/2\) at the other), while the second method yields only \(\frac{N}{2}\) overlap statistics, one per permuted pair. Since both hypervolume construction and overlap can be non-deterministic processes, method one accounts for more of the variance from generating the overlap. However, when sample size is small, the combined data may not be a good approximation of the population. In that case, it is better to use method two, because it makes no assumptions about the population, and generating more hypervolumes is fast for smaller sample sizes.
data("quercus")
data_alba = subset(quercus, Species=="Quercus alba")[,c("Longitude","Latitude")]
data_rubra = subset(quercus, Species=="Quercus rubra")[,c("Longitude","Latitude")]
climatelayers <- getData('worldclim', var='bio', res=10, path=tempdir())
# z-transform climate layers to make axes comparable
climatelayers_ss = climatelayers[[c(1,4,12,15)]]
for (i in 1:nlayers(climatelayers_ss))
{
climatelayers_ss[[i]] <- (climatelayers_ss[[i]] - cellStats(climatelayers_ss[[i]], 'mean')) / cellStats(climatelayers_ss[[i]], 'sd')
}
climatelayers_ss_cropped = crop(climatelayers_ss, extent(-150,-50,15,60))
# extract transformed climate values
climate_alba = extract(climatelayers_ss_cropped, data_alba)
climate_rubra = extract(climatelayers_ss_cropped, data_rubra)
# Generate Hypervolumes
hv_alba = hypervolume(climate_alba,name='alba',samples.per.point=10)
hv_rubra = hypervolume(climate_rubra,name='rubra',samples.per.point=10)
# Method 1: 2hr runtime with 12 threads
combined_sample = rbind(climate_alba, climate_rubra)
population_hat = hypervolume(combined_sample)
# Create bootstrapped hypervolumes of both sample sizes
method1_path_size_1669 = hypervolume_resample("quercus_1669_boot", population_hat, "bootstrap", n = 100, points_per_resample = 1669, cores = 12)
method1_path_size_2110 = hypervolume_resample("quercus_2110_boot", population_hat, "bootstrap", n = 100, points_per_resample = 2110, cores = 12)
result1 = hypervolume_overlap_test(hv_alba, hv_rubra, c(method1_path_size_1669, method1_path_size_2110), cores = 12)
# Method 2: 9hr runtime with 12 threads
method2_path = hypervolume_permute("rubra_alba_permutation", hv_alba, hv_rubra, n = 1357, cores = 12)
result2 = hypervolume_overlap_test(hv_alba, hv_rubra, method2_path, cores = 2)
# Graphical results of the null Sorensen statistic
plot1 = result1$plots$sorensen + ggtitle("Method 1", as.character(result1$p_values$sorensen)) + xlab("Sorensen Index")
plot2 = result2$plots$sorensen + ggtitle("Method 2", as.character(result2$p_values$sorensen)) + xlab("Sorensen Index")
grid.arrange(plot1, plot2, ncol=2)
In our example, the red line shows the observed value of the Sorensen overlap index. Method one results in a substantially lower p-value, but method two also yields a low p-value. Since p is less than 0.05 in both cases, we can reject the null hypothesis that the two Quercus species have identical climate niches.