In this section you will learn how to prepare the data to train models using SDMtune. We will use the virtualSp dataset included in the package and environmental predictors from the WorldClim dataset.
First let’s load some packages that we will use to visualize the data:
library(ggplot2) # To plot locations
library(maps) # To access useful maps
library(rasterVis) # To plot raster objectsWe use the climate data of WorldClim version 1.4 (Hijmans et al. 2005) and the terrestrial ecoregions from WWF (Olson et al. 2001) included in the dismo package:
files <- list.files(path = file.path(system.file(package = "dismo"), "ex"),
pattern = "grd", full.names = TRUE)We convert these files in a raster stack object that will be used later in the analysis:
There are nine environmental variables, eight continuous and one categorical:
We can plot bio1 using the gplot function from the rasterVis package:
gplot(predictors$bio1) +
geom_tile(aes(fill = value)) +
coord_equal() +
scale_fill_gradientn(colours = c("#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"),
na.value = "transparent",
name = "°C x 10") +
labs(title = "Annual Mean Temperature",
x = "longitude",
y = "latitude") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank())Let’s load the SDMtune package:
For demonstrating how to use SDMtune we use the random generated virtualSp dataset included in the package.
Plot the study area together with the presence locations:
ggplot(map_data("world"), aes(long, lat)) +
geom_polygon(aes(group = group), fill = "grey95", color = "gray40", size = 0.2) +
geom_jitter(data = p_coords, aes(x = x, y = y), color = "red",
alpha = 0.4, size = 1) +
labs(x = "longitude", y = "latitude") +
theme_minimal() +
theme(legend.position = "none") +
coord_fixed() +
scale_x_continuous(limits = c(-125, -32)) +
scale_y_continuous(limits = c(-56, 40))To plot the background locations run the following code:
ggplot(map_data("world"), aes(long, lat)) +
geom_polygon(aes(group = group), fill = "grey95", color = "gray40", size = 0.2) +
geom_jitter(data = as.data.frame(bg_coords), aes(x = x, y = y),
color = "blue", alpha = 0.4, size = 0.5) +
labs(x = "longitude", y = "latitude") +
theme_minimal() +
theme(legend.position = "none") +
coord_fixed() +
scale_x_continuous(limits = c(-125, -32)) +
scale_y_continuous(limits = c(-56, 40))Before training a model we have to prepare the data in the correct format. The prepareSWD function creates an SWD object that stores the species’ name, the coordinates of the species at presence and absence/background locations and the value of the environmental variables at the locations. The argument categorical indicates which environmental variables are categorical. In our example biome is categorical (we can pass a vector if we have more than one categorical environmental variable). The function extracts the value of the environmental variables for each location and excludes those locations that have NA value for at least one environmental variable.
Let’s have a look at the SWD object that we have just created:
When we print an SWD object we get a bunch of information:
The object contains four slots: @species, @coords @data and @pa. @pa contains a vector with 1 for presence and 0 for absence/background locations. To visualize the data we run:
We can visualize the coordinates with:
or the name of the species with:
We can save the SWD object in a .csv file using the function swd2csv (the function saves the file in the working directory). There are two possibilities:
SDMtune supports four methods for model training:
nnet package (Venables and Ripley 2002);gbm package (Greenwell et al. 2019);dismo package (Hijmans et al. 2017);maxnet package (Phillips 2017);randomForest package (Liaw and Wiener 2002).The code necessary to train a model is the same for all the implementations. We will show how to train a Maxent model, you can adapt the code for the other methods or check the presence absence models vignette.
We use the function train to train a Maxent model. We need to provide two arguments:
method: “Maxent” in our case;data: the SWD object with the presence and background locations.The function trains the model using default settings that are:
We will see later how to change the default settings, for the moment let’s have a look at the model object.
The output of the function train is an object of class SDMmodel. Let’s print the model object we’ve just created:
When we print an SDMmodel object we get the following information:
An SDMmodel object has two slots:
SWD object with the presence absence/background locations used to train the model;Maxent object, in our case, with all the model configurations.The slot model contains the configurations of the model plus other information used to make predictions.
The most important are: fc, reg and iter that contain the values of the model configuration.
The function train() accepts optional arguments that can be used to change the default model settings. In our previous example we could have trained the same model using:
In the following example we train a model using linear and hinge as feature class combination, 0.5 as regularization multiplier and 700 iterations:
By default Maxent models are trained using the arguments “removeduplicates=false” and “addsamplestobackground=false”. The user should have the full control of the data used to train the model, so is expected that duplicated locations are already removed and that the presence locations are already included in the background locations, when desired. You can use the function thinData to remove duplicated locations and the function addSamplesToBg to add the presence locations to the background locations.
New locations are predicted with the function predict. The function takes three main arguments:
SDMmodel object;data.frame, an SWD object or a raster stack object);Next we get the prediction for our training locations using the cloglog output type:
The output in this case is a vector containing all the predicted values for the training locations:
We can get the prediction only for the presence location with:
For models trained with the Maxent method, the function performs the prediction in R without calling the MaxEnt Java software. This results in a faster computation for large datasets and might result in a slightly different output compared to the Java software.
We can use the same function to create a distribution map starting from the predictors raster `stack’ object:
In this case the output is a raster object:
The map can be saved in a file directly when running the prediction, we just have to pass additional arguments to the predict function. In the next example we save the map in a file called “my_file” in the GeoTIFF format:
The function predict has other arguments useful when predicting large datasets:
"text" to visualize a progress bar;In the next example we restrict the prediction to Chile and plot the prediction:
To plot the distribution map we can use the function plotPred:
The function plotPred plots a map with a color ramp similar to the one used by the MaxEnt Java software. We can pass additional arguments to customize the map. In the next example we provide a custom color ramp and we add a title to the legend:
To plot a presence/absence map we need a threshold value that splits the prediction in presence and absence values. The function thresholds returns some commonly used threshold values starting from an SDMmodel object. In the next example we print the threshold values using the type "cloglog":
For example we want to create a presence/absence map using the threshold that maximize the training sensitivity plus specificity. We use the function plotPA passing the threshold value as argument:
We can also save the map in a file with the following code:
Both functions plotPred and plotPA have the argument hrto plot the the map with high resolution, useful when the map is used in a scientific publication.
SDMtune implements three evaluation metrics:
We will compute the value of the metrics on the training dataset.
The AUC can be calculated using the function auc:
We can also plot the ROC curve using the function plotROC:
For the AICc we use the function aicc. In this case we need to pass to the env argument the predictors raster stack object:
It’s always a good practice to split the species locations into two parts and use one part to train the model and the remaining part to evaluate it. We can use the trainValTest function for this purpose. Let’s say we want to use 80% of the species locations to train our model and 20% as testing dataset to evaluate it:
library(zeallot) # For unpacking assignment
c(train, test) %<-% trainValTest(data, test = 0.2, only_presence = TRUE,
seed = 25)Now we train the model using the train dataset:
The only_presence argument is used to split only the presence and not the background locations. We can now evaluate the model using the testing dataset that has not been used to train the model:
We can plot the ROC curve for both, training and testing datasets, with:
This approach is valid when we have a large dataset. In our case, with only few observations, the evaluation depends strongly on how we split our presence locations. Let’s run a small experiment in which we perform different train/test splits and we compute the AUC:
# Create an empty data.frame
output <- data.frame(matrix(NA, nrow = 10, ncol = 3))
colnames(output) <- c("seed", "trainAUC", "testAUC")
# Create 10 different random seeds
set.seed(25)
seeds <- sample.int(1000, 10)
# Loop through the seeds
for (i in 1:length(seeds)) {
# Make the train/test split
c(train, test) %<-% trainValTest(data, test = 0.2, seed = seeds[i],
only_presence = TRUE)
# train the model
m <- train("Maxent", data = train)
# Populate the output data.frame
output[i, 1] <- seeds[i]
output[i, 2] <- auc(m)
output[i, 3] <- auc(m, test = test)
}
# Print the output
output
# compute the range of the testing AUC
range(output[, 3])The testing AUC varies for the different train/test partitions. When we have to deal with a small dataset a better approach is the cross validation.
To perform a cross validation in SDMtune we have to pass the fold argument to the train function. First we have to create the folds. There are several way to create them, here we explain how to make a random partition of 4 folds using the function randomFolds:
The output of the function is a list containing two matrices, the first for the training and the second for the testing locations. Each column of one matrix represents a fold with TRUE for the locations included in and FALSE excluded from the partition.
Let’s perform a 4 fold cross validation using the Maxent method (note that we use the full dataset):
The output in this case is an SDMmodelCV object. It contains the four trained models in the models slot and the fold partitions in the folds slot. We can compute the AUC of a SDMmodelCV object using:
this returns the AUC value averaged across the four different models.
The train() function accepts folds created with two other packages:
ENMeval (Muscarella et al. 2014)blockCV (Valavi et al. 2019)The function will convert internally the created folds into the correct format for SDMtune. These packages have specific function to create folds partitions that are spatially or environmentally independent.
Block partition using the ENMeval package:
library(ENMeval)
block_folds <- get.block(occ = data@coords[data@pa == 1, ],
bg.coords = data@coords[data@pa == 0, ])
model <- train(method = "Maxent", data = data, fc = "l", reg = 0.8,
folds = block_folds)Checkerboard1 partition using the ENMeval package:
cb_folds <- get.checkerboard1(occ = data@coords[data@pa == 1, ],
env = predictors,
bg.coords = data@coords[data@pa == 0, ],
aggregation.factor = 4)
model <- train(method = "Maxent", data = data, fc = "l", reg = 0.8,
folds = cb_folds)Environmental block using the package blockCV:
library(blockCV)
library(raster)
# Create spatial points data frame
sp_df <- SpatialPointsDataFrame(data@coords,
data = as.data.frame(data@pa),
proj4string = crs(predictors))
e_folds <- envBlock(rasterLayer = predictors, speciesData = sp_df,
species = "data@pa", k = 4, standardization = "standard",
rasterBlock = FALSE, numLimit = 100)
model <- train(method = "Maxent", data = data, fc = "l", reg = 0.8,
folds = e_folds)Allouche, Omri, Asaf Tsoar, and Ronen Kadmon. 2006. “Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS).” Journal of Applied Ecology 43 (6): 1223–32. https://doi.org/10.1111/j.1365-2664.2006.01214.x.
Fielding, Alan H., and John F. Bell. 1997. “A review of methods for the assessment of prediction errors in conservation presence/absence models.” Environmental Conservation 24 (1): 38–49.
Greenwell, Brandon, Bradley Boehmke, Jay Cunningham, and GBM Developers. 2019. “gbm: Generalized Boosted Regression Models. R package version 2.1.5.” https://CRAN.R-project.org/package=gbm.
Hijmans, Robert J., Susan E. Cameron, Juan L. Parra, Peter G. Jones, and Andy Jarvis. 2005. “Very high resolution interpolated climate surfaces for global land areas.” International Journal of Climatology 25 (15): 1965–78. https://doi.org/10.1002/joc.1276.
Hijmans, Robert J., Steven Phillips, John Leathwick, and Jane Elith. 2017. “dismo: Species Distribution Modeling. R package version 1.1-4.” https://cran.r-project.org/package=dismo.
Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22.
Muscarella, Robert, Peter J. Galante, Mariano Soley-Guardia, Robert A. Boria, Jamie M. Kass, María Uriarte, and Robert P. Anderson. 2014. “ENMeval: An R package for conducting spatially independent evaluations and estimating optimal model complexity for Maxent ecological niche models.” Methods in Ecology and Evolution 5 (11): 1198–1205. https://doi.org/10.1111/2041-210X.12261.
Olson, David M., Eric Dinerstein, Eric D. Wikramanayake, Neil D. Burgess, George V. N. Powell, Emma C. Underwood, Jennifer A. D’amico, et al. 2001. “Terrestrial Ecoregions of the World: A New Map of Life on Earth.” BioScience 51 (11): 933. https://doi.org/10.1641/0006-3568(2001)051[0933:TEOTWA]2.0.CO;2.
Phillips, Steven. 2017. maxnet: Fitting ’Maxent’ Species Distribution Models with ’glmnet’. R package version 0.1.2. https://cran.r-project.org/package=maxnet.
Valavi, Roozbeh, Jane Elith, José J. Lahoz-Monfort, and Gurutzeta Guillera-Arroita. 2019. “block CV: An r package for generating spatially or environmentally separated folds for k -fold cross-validation of species distribution models.” Methods in Ecology and Evolution 10 (2): 225–32. https://doi.org/10.1111/2041-210X.13107.
Venables, W N, and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth Edi. New York, NY: Springer.
Warren, Dan L., and Stephanie N. Seifert. 2011. “Ecological niche modeling in Maxent: the importance of model complexity and the performance of model selection criteria.” Ecological Applications 21 (2): 335–42. https://doi.org/10.1890/10-1171.1.