Abstract
The ‘DHARMa’ package uses a simulation-based approach to create readily interpretable scaled (quantile) residuals for fitted (generalized) linear mixed models. Currently supported are linear and generalized linear (mixed) models from ‘lme4’ (classes ‘lmerMod’, ‘glmerMod’), ‘glmmTMB’ and ‘spaMM’, generalized additive models (‘gam’ from ‘mgcv’), ‘glm’ (including ‘negbin’ from ‘MASS’, but excluding quasi-distributions) and ‘lm’ model classes. Moreover, externally created simulations, e.g. posterior predictive simulations from Bayesian software such as ‘JAGS’, ‘STAN’, or ‘BUGS’ can be processed as well. The resulting residuals are standardized to values between 0 and 1 and can be interpreted as intuitively as residuals from a linear regression. The package also provides a number of plot and test functions for typical model misspecification problems, such as over/underdispersion, zero-inflation, and residual spatial and temporal autocorrelation.
Residual interpretation for generalized linear mixed models (GLMMs) is often problematic. As an example, here are two Poisson GLMMs, one that is lacking a quadratic effect, and one that fits the data perfectly. I show three standard residual diagnostics for each. Which is the misspecified model?
Just for completeness - it was the first one. But don’t get too excited if you got it right. Either you were lucky, or you noted that the first model seems a bit overdispersed (range of the Pearson residuals). But even when noting that, would you have added a quadratic effect, instead of adding an overdispersion correction? The point here is that misspecifications in GL(M)Ms cannot reliably be diagnosed with standard residual plots, and GLMMs are thus often not as thoroughly checked as LMs.
One reason why GL(M)M residuals are harder to interpret is that the expected distribution of the data changes with the fitted values. Reweighting with the expected variance, as done in Pearson residuals, or using deviance residuals, helps a bit, but does not lead to visually homogeneous residuals even if the model is correctly specified. As a result, standard residual plots, when interpreted in the same way as for linear models, seem to show all kinds of problems, such as non-normality and heteroscedasticity, even if the model is correctly specified. Questions on the R mailing lists and forums show that practitioners are regularly confused about whether such patterns in GL(M)M residuals are a problem or not.
But even experienced statistical analysts currently have few options to diagnose misspecification problems in GLMMs. In my experience, the current standard practice is to eyeball the residual plots for major misspecifications, potentially have a look at the random effect distribution, and then run a test for overdispersion, which is usually positive, after which the model is modified towards an overdispersed / zero-inflated distribution. This approach, however, has a number of problems, notably:
Overdispersion often comes from missing or misspecified predictors. Standard residual plots make it difficult to test for residual patterns against the predictors to check for candidates.
Not all overdispersion is the same. For count data, the negative binomial creates a different distribution than adding observation-level random effects to the Poisson. Once overdispersion is corrected, such violations of distributional assumptions are not detectable with standard overdispersion tests (because the tests only look at total dispersion), and nearly impossible to see visually from standard residual plots.
Dispersion frequently varies with predictors (heteroscedasticity). This can have a significant effect on the inference. While it is standard to test for heteroscedasticity in linear regressions, heteroscedasticity is currently hardly ever tested for in GLMMs, although it is likely as frequent and influential.
Moreover, if residuals are checked, they are usually checked conditional on the fitted random effect estimates. Thus, standard checks only check the final level of the random structure in a GLMM. One can perform extra checks on the random effects, but it is somewhat unsatisfactory that there is no check on the entire model structure.
DHARMa aims at solving these problems by creating readily interpretable residuals for generalized linear (mixed) models that are standardized to values between 0 and 1, and that can be interpreted as intuitively as residuals for the linear model. This is achieved by a simulation-based approach, similar to the Bayesian p-value or the parametric bootstrap, that transforms the residuals to a standardized scale. The basic steps are:
Simulate new data from the fitted model for each observation.
For each observation, calculate the empirical cumulative distribution function for the simulated observations, which describes the possible values (and their probability) at the predictor combination of the observed value, assuming the fitted model is correct.
The residual is then defined as the value of the empirical cumulative distribution function at the value of the observed data, so a residual of 0 means that all simulated values are larger than the observed value, and a residual of 0.5 means half of the simulated values are larger than the observed value.
These steps are visualized in the following figure
The key advantage of this definition is that the so-defined residuals always have the same, known distribution, independent of the model that is fit, if the model is correctly specified. To see this, note that, if the observed data was created from the same data-generating process that we simulate from, all values of the cumulative distribution should appear with equal probability. That means we expect the distribution of the residuals to be flat, regardless of the model structure (Poisson, binomial, random effects and so on).
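The three steps and the flatness argument can be sketched by hand in base R. This is only an illustration under stated assumptions, not the package's internal code: the data, the variable names, and the uniform tie-breaking for integer responses are my own choices for the example.

```r
set.seed(42)

# Toy data from a Poisson regression (all names are illustrative only)
n <- 200
x <- runif(n, -1, 1)
y <- rpois(n, lambda = exp(0.5 + 1.0 * x))
fit <- glm(y ~ x, family = poisson())

# Step 1: simulate new data from the fitted model (one column per simulation)
nSim <- 250
sims <- as.matrix(simulate(fit, nsim = nSim))

# Steps 2 and 3: evaluate the empirical cumulative distribution of the
# simulations at each observed value. Ties (common for counts) are broken
# uniformly at random so that the residuals become continuous.
below <- rowMeans(sims < y)
equal <- rowMeans(sims == y)
scaledRes <- below + runif(n) * equal

# For a correctly specified model, the four bins should hold similar counts
range(scaledRes)
table(cut(scaledRes, breaks = seq(0, 1, by = 0.25), include.lowest = TRUE))
```

Because the model that is fit here is the one that generated the data, the counts per bin should be roughly equal, which is the flat distribution described above.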
I am currently preparing a more exact statistical justification for the approach in an accompanying paper, but if you must provide a reference in the meantime, I would suggest citing
Dunn, P. K., and Smyth, G. K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics 5, 236-244.
Gelman, A., and Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
p.s.: DHARMa stands for “Diagnostics for HierArchical Regression Models” - which, strictly speaking, would make DHARM. But in German, Darm means intestines; plus, the meaning of DHARMa in Hinduism makes the current abbreviation so much more suitable for a package that tests whether your model is in harmony with your data:
From Wikipedia, 28/08/16: In Hinduism, dharma signifies behaviours that are considered to be in accord with rta, the order that makes life and universe possible, and includes duties, rights, laws, conduct, virtues and ‘right way of living’.
If you haven’t installed the package yet, either run install.packages("DHARMa") to get the current release from CRAN, or follow the instructions on https://github.com/florianhartig/DHARMa to install a development version.
Loading and citation
library(DHARMa)
citation("DHARMa")
## 
## To cite package 'DHARMa' in publications use:
## 
##   Florian Hartig (2020). DHARMa: Residual Diagnostics for
##   Hierarchical (Multi-Level / Mixed) Regression Models. R package
##   version 0.3.0. http://florianhartig.github.io/DHARMa/
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models},
##     author = {Florian Hartig},
##     year = {2020},
##     note = {R package version 0.3.0},
##     url = {http://florianhartig.github.io/DHARMa/},
##   }
Let’s assume we have a fitted model that is supported by DHARMa.
library(lme4)
testData = createData(sampleSize = 250)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) , family = "poisson", data = testData)
Most functions in DHARMa can be called directly on the fitted model. So, for example, if you are only interested in testing dispersion, you could calculate
testDispersion(fittedModel)
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted
##  vs. simulated
## 
## data:  simulationOutput
## ratioObsSim = 0.86249, p-value = 0.952
## alternative hypothesis: two.sided
In this case, the randomized quantile residuals are calculated on the fly. However, residual calculation can take a while, and would have to be repeated for every other test you call. It is therefore highly recommended to first calculate the residuals once, using the simulateResiduals() function. This function returns a DHARMa object, which can then be passed on to all other plot and test functions.
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
Using the simulateResiduals function has the added benefit that you can modify the way residuals are calculated. For example, the default number of simulations to run is 250, which proved to be a reasonable compromise between computation time and precision, but if high precision is desired, n should be raised to 1000 at least.
What the function does is a) create n new synthetic datasets by simulating from the fitted model, b) calculate the cumulative distribution of simulated values for each observed value, and c) return the quantile value that corresponds to the observed value.
For example, a scaled residual value of 0.5 means that half of the simulated data are higher than the observed value, and half of them lower. A value of 0.99 would mean that nearly all simulated data are lower than the observed value. The minimum/maximum values for the residuals are 0 and 1.
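A minimal sketch of such a call with a higher number of simulations, assuming lme4 and DHARMa are installed. The object names follow the ones used in this vignette; n = 1000 is just the higher-precision value mentioned above, not a general recommendation.

```r
library(DHARMa)
library(lme4)

testData <- createData(sampleSize = 250)
fittedModel <- glmer(observedResponse ~ Environment1 + (1 | group),
                     family = "poisson", data = testData)

# More simulations give a finer-grained empirical distribution per observation,
# at the cost of computation time
simulationOutput <- simulateResiduals(fittedModel = fittedModel, n = 1000)

# One scaled residual per observation, each between 0 and 1
length(simulationOutput$scaledResiduals)
range(simulationOutput$scaledResiduals)
```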
The calculated residuals can be accessed via residuals(simulationOutput).
As discussed above, for a correctly specified model we would expect
a uniform (flat) distribution of the overall residuals
uniformity in y direction if we plot against any predictor.
Note: the expected uniform distribution is the only difference to the linear regression that one has to keep in mind when interpreting DHARMa residuals. If you cannot get used to this and you must have residuals that behave exactly like those of a linear regression, you can transform the uniform distribution to another distribution, for example normal.
These normal residuals will behave exactly like the residuals of a linear regression. However, for reasons of a) numeric stability with a low number of simulations and b) my conviction that it is much easier to visually detect deviations from uniformity than normality, I would STRONGLY advise against using this transformation.
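To make the numeric stability point concrete, here is a sketch that does the transformation by hand on the scaled residuals. It assumes a plain Poisson GLM fitted to data from createData(); the cap of -7 / 7 for residuals of exactly 0 or 1 is an arbitrary illustrative choice of mine.

```r
library(DHARMa)

testData <- createData(sampleSize = 250, randomEffectVariance = 0)
fittedModel <- glm(observedResponse ~ Environment1, family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)

u <- simulationOutput$scaledResiduals
z <- qnorm(u)  # map the uniform scale to the standard normal scale

# Residuals of exactly 0 or 1 (simulation outliers) would map to -Inf / Inf,
# so they are capped at an arbitrary finite value here
z[u == 0] <- -7
z[u == 1] <- 7

summary(z)
```

With few simulations, more observations fall outside the simulated range, so more values end up at the arbitrary cap. That is the instability referred to above.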
The main plot function for DHARMa residuals is the plot.DHARMa() function
The plot function creates two plots, which can also be called separately
plotQQunif(simulationOutput) # left plot in plot.DHARMa()
plotResiduals(simulationOutput) # right plot in plot.DHARMa()
plotQQunif creates a qq-plot to detect overall deviations from the expected distribution, by default with added tests for uniformity, dispersion and outliers.
plotResiduals produces a plot of the residuals against the predicted value (or alternatively, another variable). Simulation outliers (data points that are outside the range of simulated values) are highlighted as red stars. These points should be carefully interpreted, because we actually don’t know “how much” these values deviate from the model expectation. Note also that the probability of an outlier depends on the number of simulations (in fact, it is 1/(nSim +1) for each side), so whether the existence of outliers is a reason for concern depends also on the number of simulations.
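The 1/(nSim + 1) figure can be checked with a small Monte Carlo experiment in base R. The sketch assumes a continuous response (standard normal draws as a stand-in) and uses a deliberately small nSim so that the probability is easy to estimate. All names are illustrative.

```r
set.seed(1)
nSim <- 19     # small on purpose, so that 1 / (nSim + 1) = 0.05
nRep <- 20000  # number of independent "observations"

# For each replicate: is the observation larger than all nSim simulations?
isUpperOutlier <- replicate(nRep, {
  obs  <- rnorm(1)
  sims <- rnorm(nSim)
  obs > max(sims)
})

mean(isUpperOutlier)  # Monte Carlo estimate
1 / (nSim + 1)        # theoretical value for one side
```

The reasoning is that, if observation and simulations come from the same distribution, the observation is equally likely to take any of the nSim + 1 ranks, so it is the largest (or smallest) with probability 1/(nSim + 1).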
To provide a visual aid in detecting deviations from uniformity in y-direction, the plot function calculates an (optional) quantile regression, which compares the empirical 0.25, 0.5 and 0.75 quantiles in y direction (red solid lines) with the theoretical 0.25, 0.5 and 0.75 quantiles (dashed black line), and provides a p-value for the deviation from the expected quantile.
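If you want to plot the residuals against other predictors (highly recommended), you can use plotResiduals() with the predictor as second argument, e.g. plotResiduals(simulationOutput, testData$Environment1).
You can also generate a histogram of the residuals via hist(simulationOutput).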
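To support the visual inspection of the residuals, the DHARMa package provides a number of specialized goodness-of-fit tests on the simulated residuals:
testUniformity() - tests if the overall distribution conforms to expectations
testOutliers() - tests if there are more simulation outliers than expected
testDispersion() - tests if the simulated dispersion is equal to the observed dispersion
testQuantiles() - fits a quantile regression of the residuals against a predictor and tests for deviations from the expected quantiles
testZeroInflation() - tests if there are more zeros than expected
testGeneric() - tests if a generic, user-defined summary statistic deviates from model expectations
testTemporalAutocorrelation() - tests for temporal autocorrelation in the residuals
testSpatialAutocorrelation() - tests for spatial autocorrelation in the residuals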
See the help of the functions and further comments below for a more detailed description. The wrapper function testResiduals calculates the first three tests, including their graphical outputs
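The tests can also be called one by one. The following sketch assumes that the test functions accept plot = FALSE to suppress the graphical output (true for the versions I am aware of, but check the help of your installed version) and that they return standard "htest" objects.

```r
library(DHARMa)
library(lme4)

testData <- createData(sampleSize = 250)
fittedModel <- glmer(observedResponse ~ Environment1 + (1 | group),
                     family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)

# Run two of the tests without their plots and keep the result objects
uniformity <- testUniformity(simulationOutput, plot = FALSE)
dispersion <- testDispersion(simulationOutput, plot = FALSE)

# The usual htest components are available, e.g. the p-values
c(uniformity = uniformity$p.value, dispersion = dispersion$p.value)
```

Storing the results this way is convenient if you want to collect p-values across many models, for example in a simulation study.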
There are a few important technical details regarding how the simulations are performed, in particular regarding the treatment of random effects and integer responses. It is strongly recommended to read the help of simulateResiduals (?simulateResiduals). The first option is refit:
if refit = F (default), new data is simulated from the fitted model, and residuals are calculated by comparing the observed data to the new data
if refit = T, a parametric bootstrap is performed, meaning that the model is refit to the new data, and residuals are created by comparing observed residuals against refitted residuals
Refitting (refit = T) is much slower, and also seemed to have lower power in some tests I ran. It is therefore not recommended for standard residual diagnostics! I only recommend using it if you know what you are doing, and have particular reasons, for example if you suspect that the tested model is biased. A bias could, for example, arise in small data situations, or when estimating models with shrinkage estimators that include a purposeful bias, such as ridge/lasso, random effects or the splines in GAMs. My idea was then that simulated data would not fit to the observations, but that residuals for model fits on simulated data would have the same patterns/bias as model fits on the observed data.
Note also that refit = T can sometimes run into numerical problems, if the fitted model does not converge on the newly simulated data.
The second option is the treatment of the stochastic hierarchy. In a hierarchical model, several layers of stochasticity are placed on top of each other. Specifically, in a GLMM, we have a lower level stochastic process (random effect), whose result enters into a higher level (e.g. Poisson distribution). For other hierarchical models, such as state-space models, similar considerations apply, but the hierarchy can be more complex. When simulating, we have to decide if we want to re-simulate all stochastic levels, or only a subset of those. For example, in a GLMM, it is common to only simulate the last stochastic level (e.g. Poisson) conditional on the fitted random effects, meaning that the random effects are set on the fitted values.
For controlling how many levels should be re-simulated, the simulateResiduals function allows you to pass on parameters to the simulate function of the fitted model object. Please refer to the help of the different simulate functions (e.g. ?simulate.merMod) for details. For merMod (lme4) model objects, the relevant parameters are “use.u” and “re.form”.
If the model is correctly specified and the fitting procedure is unbiased (disclaimer: GLMM estimators are not always unbiased), the simulated residuals should be flat regardless of how many hierarchical levels we re-simulate. The most thorough procedure would therefore be to test all possible options. If testing only one option, I would recommend re-simulating all levels, because this essentially tests the model structure as a whole. This is the default setting in the DHARMa package. A potential drawback is that re-simulating the random effects creates more variability, which may reduce power for detecting problems in the upper-level stochastic processes.
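As a sketch of the two extremes for an lme4 model: the first call uses the default (re-simulate all levels), the second passes re.form = NULL through to simulate.merMod, which in lme4 means simulating conditional on all fitted random effects. That the argument is forwarded unchanged is an assumption based on the description above, so please check ?simulateResiduals for your installed version.

```r
library(DHARMa)
library(lme4)

testData <- createData(sampleSize = 250)
fittedModel <- glmer(observedResponse ~ Environment1 + (1 | group),
                     family = "poisson", data = testData)

# Default: random effects are re-simulated together with the Poisson level
simDefault <- simulateResiduals(fittedModel = fittedModel, n = 250)

# Conditional on the fitted random effects: only the Poisson level is re-simulated
simConditional <- simulateResiduals(fittedModel = fittedModel, n = 250, re.form = NULL)

# Both sets of residuals live on the same 0-1 scale and can be compared
summary(simDefault$scaledResiduals)
summary(simConditional$scaledResiduals)
```

Both objects can be passed to the same plot and test functions, which makes it easy to check whether a problem shows up only at one level of the hierarchy.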
A third option is the treatment of integer responses. The background of this option is that, for integer-valued variables, some additional steps are necessary to make sure that the residual distribution becomes flat (essentially, we have to smooth away the integer nature of the data). The idea is explained in Dunn and Smyth (1996), cited above.
The simulateResiduals function will automatically check if the family is integer-valued, and apply randomization if that is the case. I see no reason why one would not want to randomize for an integer-valued response, so the parameter (integerResponse) should usually not be changed.
In many situations, it can be useful to look at residuals per group, e.g. to see how much the model over / underpredicts per plot, year or subject. To do this, use the recalculateResiduals() function, together with a grouping variable.
Afterwards, you can keep using the simulation output as before. Note, however, that items such as simulationOutput$scaledResiduals now have as many entries as you have groups, so if you perform plots by hand, you have to aggregate predictors in the same way. For the latter purpose, recalculateResiduals adds a function aggregateByGroup to the output.
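A short sketch of the grouping step, assuming the group column that createData() adds to the simulated data. The name groupedOutput is my own, chosen so that the ungrouped object is kept for comparison.

```r
library(DHARMa)
library(lme4)

testData <- createData(sampleSize = 250)
fittedModel <- glmer(observedResponse ~ Environment1 + (1 | group),
                     family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)

# Aggregate observed and simulated values per group, then recompute the residuals
groupedOutput <- recalculateResiduals(simulationOutput, group = testData$group)

length(simulationOutput$scaledResiduals)  # one residual per observation
length(groupedOutput$scaledResiduals)     # one residual per group
length(unique(testData$group))            # number of groups in the data
```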
As DHARMa uses simulations to calculate the residuals, a naive implementation of the algorithm would mean that residuals would look slightly different each time a DHARMa calculation is executed. This might both be confusing and bear the danger that a user would run the simulation several times and take the result that looks better (which would amount to multiple testing / p-hacking).
By default, DHARMa therefore fixes the random seed to the same value every time a simulation is run, and afterwards restores the random state to the old value. This means that you will get exactly the same residual plot each time. If you want to avoid this behavior, for example for simulation experiments on DHARMa, use seed = NULL (no seed is set, but the random state will be restored) or seed = F (no seed is set, and the random state will not be restored). Whether or not you fix the seed, the setting for the random seed and the random state are stored in simulationOutput$randomState.
If you want to reproduce simulations for such a run, set the variable .Random.seed by hand, and simulate with seed = NULL.
Moreover (general advice), to ensure reproducibility, it’s advisable to add a set.seed() at the beginning, and a sessionInfo() at the end of your script. The latter will list the version number of R and all loaded packages.
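The seed behaviour described above can be checked directly. This sketch assumes a plain Poisson GLM to keep it fast; the run1 to run4 names are illustrative.

```r
library(DHARMa)

testData <- createData(sampleSize = 200, randomEffectVariance = 0)
fittedModel <- glm(observedResponse ~ Environment1, family = "poisson", data = testData)

# Default seed handling: two independent calls give the same residuals
run1 <- simulateResiduals(fittedModel = fittedModel)
run2 <- simulateResiduals(fittedModel = fittedModel)
identical(run1$scaledResiduals, run2$scaledResiduals)

# seed = F: nothing is fixed and the random state is not restored,
# so consecutive runs will generally differ
run3 <- simulateResiduals(fittedModel = fittedModel, seed = FALSE)
run4 <- simulateResiduals(fittedModel = fittedModel, seed = FALSE)
identical(run3$scaledResiduals, run4$scaledResiduals)
```

Note that seed = NULL is not the right choice for this comparison: because the random state is restored after each call, two consecutive calls start from the same state unless other random numbers are drawn in between.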
In all plots / tests that were shown so far, the model was correctly specified, resulting in “perfect” residual plots. In this section, we discuss how to recognize and interpret model misspecifications in the scaled residuals. Note, however, that
The fact that none of the tests presented here shows a misspecification problem doesn’t prove that the model is correctly specified. There are likely a large number of structural problems that will not show a pattern in the standard residual plots.
Conversely, while a clear pattern in the residuals indicates with good reliability that the observed data would not be likely to originate from the fitted model, it doesn’t necessarily indicate that the model results are not usable. There are many cases where it is common practice to work with “wrong models”. For example, random effect estimates (in particular in GLMMs) are often slightly biased, especially if the model is fit with MLE. For that reason, DHARMa will often show a slight pattern in the residuals even if the model is correctly specified, and tests for this can get significant for large sample sizes. Another example is data that is missing at random (MAR) (see here). It is known that this phenomenon does not create a bias on the fixed effect estimates, and it is therefore common practice to fit this data with mixed models. Nevertheless, DHARMa recognizes that the observed data looks different from what would be expected under the model assumptions, and flags the model as problematic.
Important conclusion: DHARMa only flags a difference between the observed and expected data - the user has to decide whether this difference is actually a problem for the analysis!
The most common concerns for GLMMs are overdispersion, underdispersion and zero-inflation.
Over/underdispersion refers to the phenomenon that residual variance is larger/smaller than expected under the fitted model. Over/underdispersion can appear for any distributional family with fixed variance, in particular for Poisson and binomial models.
A few general rules of thumb
This is what overdispersion looks like in the DHARMa residuals
testData = createData(sampleSize = 500, overdispersion = 2, family = poisson())
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) , family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)
Note that we get more residuals around 0 and 1, which means that more residuals are in the tail of distribution than would be expected under the fitted model.
This is an example of underdispersion
testData = createData(sampleSize = 500, intercept=0, fixedEffects = 2, overdispersion = 0, family = poisson(), roundPoissonVariance = 0.001, randomEffectVariance = 0)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) , family = "poisson", data = testData)
summary(fittedModel)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: observedResponse ~ Environment1 + (1 | group)
##    Data: testData
## 
##      AIC      BIC   logLik deviance df.resid 
##    985.3    998.0   -489.7    979.3      497 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -0.6049 -0.3503 -0.1084  0.2300  1.0025 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  group  (Intercept) 0        0       
## Number of obs: 500, groups:  group, 10
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.19891    0.06187  -3.215   0.0013 ** 
## Environment1  2.29164    0.08927  25.670   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## Environmnt1 -0.836
## convergence code: 0
## boundary (singular) fit: see ?isSingular
# plotConventionalResiduals(fittedModel)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.2035, p-value < 2.2e-16
## alternative hypothesis: two-sided
Here, we get too many residuals around 0.5, which means that we are getting fewer residuals in the tails of the distribution than would be expected under the fitted model.
Although, as discussed above, over/underdispersion will show up in the residuals, and it’s possible to detect it with the testUniformity function, simulations show that this test is less powerful than more targeted tests.
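DHARMa therefore contains two dispersion tests that compare the dispersion of the simulated residuals to that of the observed residuals: the default test on the simulated residuals, and a variant on refitted residuals (refit = T).
You can call these tests as follows:
testDispersion(simulationOutput)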
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted
##  vs. simulated
## 
## data:  simulationOutput
## ratioObsSim = 0.2552, p-value < 2.2e-16
## alternative hypothesis: two.sided
# Refitted variant (much slower, hence only n = 20 simulations here)
simulationOutput2 <- simulateResiduals(fittedModel = fittedModel, refit = T, n = 20)
testDispersion(simulationOutput2)
## 
##  DHARMa nonparametric dispersion test via mean deviance residual
##  fitted vs. simulated-refitted
## 
## data:  simulationOutput2
## dispersion = 0.15097, p-value < 2.2e-16
## alternative hypothesis: two.sided
Note: previous versions of DHARMa (< 0.2.0) discouraged the simulated overdispersion test in favor of the refitted and parametric tests. I have since changed the test function, and simulations show that it is as powerful as the refitted or parametric test. Because of the generality and speed of this option, I see no good reason for either refitting or running parametric tests. Therefore:
My recommendation for testing dispersion is to simply use the standard dispersion test, based on the simulated residuals.
It’s not clear to me if the refitted test is better … but it’s available.
In my simulations, parametric tests, such as AER::dispersiontest, didn’t provide higher power. Because of that, and because of the higher generality of the simulated tests, I no longer provide parametric tests in DHARMa. However, you can see various implementations of the parametric tests in the DHARMa GitHub repo (under Code/DHARMaPerformance/Power).
Below is an example from there, which compares the four options to test for overdispersion (2 options to use DHARMa::testDispersion, AER::dispersiontest, and DHARMa::testUniformity) for a Poisson glm
Comparison of power from simulation studies
A word of warning that applies also to all other tests that follow: significance in hypothesis tests depends on at least 2 ingredients: strength of the signal, and number of data points. Hence, the p-value alone is not a good indicator of the extent to which your residuals deviate from assumptions. Specifically, if you have a lot of data points, residual diagnostics will nearly inevitably become significant, because having a perfectly fitting model is very unlikely. That, however, doesn’t necessarily mean that you need to change your model. The p-values confirm that there is a deviation from your null hypothesis. It is, however, at your discretion to decide whether this deviation is worth worrying about. If you see a dispersion parameter of 1.01, I would not worry, even if the test is significant. A significant value of 5, however, is clearly a reason to move to a model that accounts for overdispersion.
A common special case of overdispersion is zero-inflation, which is the situation when more zeros appear in the observation than expected under the fitted model. Zero-inflation requires special correction steps.
More generally, we can also have too few zeros, or too many or too few of any other value. We’ll discuss that at the end of this section.
Here is an example of a typical zero-inflated count dataset, plotted against the environmental predictor
testData = createData(sampleSize = 500, intercept = 2, fixedEffects = c(1), overdispersion = 0, family = poisson(), quadraticFixedEffects = c(-3), randomEffectVariance = 0, pZeroInflation = 0.6)
par(mfrow = c(1,2))
plot(testData$Environment1, testData$observedResponse, xlab = "Environmental Predictor", ylab = "Response")
hist(testData$observedResponse, xlab = "Response", main = "")
We see a hump-shaped dependence on the environment, but with too many zeros.
In the normal DHARMa residual plots, zero-inflation will look pretty much like overdispersion
fittedModel <- glmer(observedResponse ~ Environment1 + I(Environment1^2) + (1|group) , family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)
The reason is that the model will usually try to find a compromise between the zeros and the other values, which will lead to excess variance in the residuals.
DHARMa has a special test for zero-inflation, which compares the distribution of expected zeros in the data against the observed zeros
testZeroInflation(simulationOutput)
## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 1.8572, p-value < 2.2e-16
## alternative hypothesis: two.sided
This test is likely better suited for detecting zero-inflation than the standard plot, but note that overdispersion will also lead to excess zeros, so only seeing too many zeros is not a reliable diagnostic for moving towards a zero-inflated model. A reliable differentiation between overdispersion and zero-inflation will usually only be possible when directly comparing alternative models, e.g. through residual comparison / model selection of a model with / without zero-inflation, or by simply fitting a model with zero-inflation and looking at the parameter estimate for the zero-inflation.
A good option is the R package glmmTMB, which is also supported by DHARMa. We can use it to fit a Poisson model with a zero-inflation term
library(glmmTMB)
fittedModel <- glmmTMB(observedResponse ~ Environment1 + I(Environment1^2) + (1|group), ziformula = ~1 , family = "poisson", data = testData)
summary(fittedModel)
##  Family: poisson  ( log )
## Formula:          
## observedResponse ~ Environment1 + I(Environment1^2) + (1 | group)
## Zero inflation:                    ~1
## Data: testData
## 
##      AIC      BIC   logLik deviance df.resid 
##   1201.3   1222.4   -595.7   1191.3      495 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  group  (Intercept) 0.001049 0.03239 
## Number of obs: 500, groups:  group, 10
## 
## Conditional model:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        2.02853    0.04897   41.42   <2e-16 ***
## Environment1       1.17399    0.13412    8.75   <2e-16 ***
## I(Environment1^2) -3.29006    0.22936  -14.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Zero-inflation model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4999     0.1071   4.667 3.06e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To test for generic excess / deficits of particular values, we have the function testGeneric, which compares the value of a generic, user-provided summary statistic between the observed and the simulated data.
Choose one of alternative = c("greater", "two.sided", "less") to test for inflation, deficit, or both. The default is "greater" (inflation).
countOnes <- function(x) sum(x == 1)  # testing for number of 1s
testGeneric(simulationOutput, summary = countOnes, alternative = "greater") # 1-inflation
## 
##  DHARMa generic simulation test
## 
## data:  simulationOutput
## ratioObsSim = 1.2109, p-value = 0.156
## alternative hypothesis: greater
So far, most of the things that we have tested could also have been detected with parametric tests. Here, we come to the first issue that is difficult to detect with current tests, and that is usually neglected.
Heteroscedasticity means that there is a systematic dependency of the dispersion / variance on another variable in the model. It is not sufficiently appreciated that binomial or Poisson models can also show heteroscedasticity. Basically, it means that the level of over/underdispersion depends on another parameter. Here is an example where we create such data
testData = createData(sampleSize = 500, intercept = 0, overdispersion = function(x){return(rnorm(length(x), sd = 2 * abs(x)))}, family = poisson(), randomEffectVariance = 0)
fittedModel <- glm(observedResponse ~ Environment1 , family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)
## 
##  Test for location of quantiles via qgam
## 
## data:  simulationOutput
## p-value < 2.2e-16
## alternative hypothesis: both
Adding a simple overdispersion correction will try to find a compromise between the different levels of dispersion in the model. The qq plot looks better now, but there is still a pattern in the residuals
testData = createData(sampleSize = 500, intercept = 0, overdispersion = function(x){return(rnorm(length(x), sd = 2*abs(x)))}, family = poisson(), randomEffectVariance = 0)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) + (1|ID), family = "poisson", data = testData)
# plotConventionalResiduals(fittedModel)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)
To remove this pattern, you would need to make the dispersion parameter dependent on a predictor (e.g. in JAGS), or apply a transformation on the data.
A second test that is typically run for LMs, but not for GL(M)Ms, is to plot residuals against the predictors in the model (or potentially predictors that were not in the model) to detect possible misspecifications. Doing this is highly recommended. For that purpose, you can retrieve the residuals via simulationOutput$scaledResiduals.
Note again that the residual values are scaled between 0 and 1. If you plot the residuals against predictors, space or time, the resulting plots should not only show no systematic dependency of those residuals on the covariates, but they should also again be flat for each fixed situation. That means that if you have, for example, a categorical predictor (treatment / control), the distribution of residuals for each level of the predictor should be flat as well.
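For a categorical predictor, the per-level flatness can be checked by hand on the retrieved residuals. The sketch below is only an illustration: the treatment factor is a hypothetical variable that I add to the simulated data (it is unrelated to the response), and the Kolmogorov-Smirnov test per level is my choice of a simple uniformity check.

```r
library(DHARMa)

testData <- createData(sampleSize = 400, randomEffectVariance = 0)

# Hypothetical two-level factor, unrelated to the response, for illustration only
testData$treatment <- factor(rep(c("control", "treatment"), length.out = nrow(testData)))

fittedModel <- glm(observedResponse ~ Environment1, family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)

# Within each level, the residuals should look uniform on [0, 1]
pValues <- tapply(simulationOutput$scaledResiduals, testData$treatment,
                  function(r) ks.test(r, "punif")$p.value)
pValues
```

As with all tests in this vignette, a small p-value for one level only says that the residuals in that level deviate from uniformity, not how much this matters for the analysis.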
Here is an example with a missing quadratic effect in the model and two predictors:
testData = createData(sampleSize = 200, intercept = 1, fixedEffects = c(1,2), overdispersion = 0, family = poisson(), quadraticFixedEffects = c(-3,0))
fittedModel <- glmer(observedResponse ~ Environment1 + Environment2 + (1|group) , family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
# plotConventionalResiduals(fittedModel)
plot(simulationOutput, quantreg = T)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.089554, p-value = 0.08087
## alternative hypothesis: two-sided
In the general plot, it is difficult to see that there is a problem at all, but it becomes clear if we plot the residuals against the environmental predictors:
par(mfrow = c(1,2))
plotResiduals(simulationOutput, testData$Environment1)
plotResiduals(simulationOutput, testData$Environment2)
A special case of plotting residuals against predictors is the plot against time and space, which should always be performed if those variables are present in the model. Let’s create some temporally autocorrelated data:
testData = createData(sampleSize = 100, family = poisson(), temporalAutocorrelation = 5)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group), data = testData, family = poisson() )
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
The function testTemporalAutocorrelation performs a Durbin-Watson test from the package lmtest on the uniform residuals to test for temporal autocorrelation, and additionally plots the residuals against time.
The function also has an option to perform the test against randomized time (H0). The purpose of this option is to allow simulations that check whether the test has correct error rates in the respective situation, i.e. that it is not oversensitive (too high a sensitivity has sometimes been reported for Durbin-Watson).
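The two kinds of test described above (observed time versus randomized time) can be sketched as follows. This is a sketch under assumptions: the data are simulated stand-ins with an AR(1) error rather than the createData output used in the text, a plain glm replaces the glmer fit, and the randomized time order is produced by explicitly permuting the time vector. The object names dwObserved and dwRandom are hypothetical.

```r
library(DHARMa)

set.seed(42)
n <- 100
time <- seq_len(n)
x <- runif(n, -1, 1)

# Stand-in data: Poisson counts with an AR(1) error on the log scale
ar1Error <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- rpois(n, lambda = exp(0.5 + x + 0.5 * ar1Error))

fit <- glm(y ~ x, family = poisson)
sim <- simulateResiduals(fit)

# Durbin-Watson test against the observed time order
dwObserved <- testTemporalAutocorrelation(sim, time = time, plot = FALSE)

# Same test against a randomly permuted time order (H0: no temporal structure)
dwRandom <- testTemporalAutocorrelation(sim, time = sample(time), plot = FALSE)

print(dwObserved)
print(dwRandom)
```

Repeating the permuted version many times and counting how often it rejects gives a rough impression of the error rate of the test for the data at hand.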
## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 1.5546, p-value = 0.02448
## alternative hypothesis: true autocorrelation is not 0
## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 1.9543, p-value = 0.8173
## alternative hypothesis: true autocorrelation is not 0
Note the general caveats about the DW test mentioned in the help of testTemporalAutocorrelation(). In general, as for spatial autocorrelation, it is difficult to specify a single test, because temporal and spatial autocorrelation can appear in many flavors: short-scale and long-scale, homogeneous or not, and so on. The pre-defined functions in DHARMa are a starting point, but they are not something you should rely on blindly.
Here is an example with spatial autocorrelation:
testData = createData(sampleSize = 100, family = poisson(), spatialAutocorrelation = 5)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group), data = testData, family = poisson() )
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
The spatial autocorrelation test performs the Moran.I test from the package ape and plots the residuals against space.
An additional test against randomized space (H0) can be performed, for the same reasons as explained above.
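As a sketch of the two spatial tests (observed coordinates versus randomized coordinates), the following uses simulated stand-in data with an exponentially decaying spatial covariance instead of the createData output used in the text, and a plain glm instead of glmer. The coordinate names xCoord and yCoord and the result names are hypothetical, and the randomization is done here by explicitly permuting the coordinates.

```r
library(DHARMa)

set.seed(42)
n <- 100
xCoord <- runif(n)
yCoord <- runif(n)

# Stand-in data: spatially correlated error with exponential covariance
covMat <- exp(-as.matrix(dist(cbind(xCoord, yCoord))) / 0.3)
spatialError <- as.numeric(t(chol(covMat)) %*% rnorm(n))
env <- runif(n, -1, 1)
y <- rpois(n, lambda = exp(0.5 + env + spatialError))

fit <- glm(y ~ env, family = poisson)
sim <- simulateResiduals(fit)

# Moran's I test against the observed coordinates
moranObserved <- testSpatialAutocorrelation(sim, x = xCoord, y = yCoord, plot = FALSE)

# Same test with coordinates randomly reassigned to observations (H0)
perm <- sample(n)
moranRandom <- testSpatialAutocorrelation(sim, x = xCoord[perm], y = yCoord[perm], plot = FALSE)

print(moranObserved)
print(moranRandom)
```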
## 
##  DHARMa Moran's I test for spatial autocorrelation
## 
## data:  
## observed = 0.081611, expected = -0.010101, sd = 0.021744, p-value
## = 2.466e-05
## alternative hypothesis: Spatial autocorrelation
## 
##  DHARMa Moran's I test for spatial autocorrelation
## 
## data:  
## observed = -0.034568, expected = -0.010101, sd = 0.020376, p-value
## = 0.2298
## alternative hypothesis: Spatial autocorrelation
The usual caveats for Moran.I apply, in particular that it may miss non-local and heterogeneous (non-stationary) spatial autocorrelation. The former should be better detectable visually in the spatial plot, or via regressions on the pattern.
Note: more real-world examples can be found on the DHARMa GitHub repository.
This example comes from Jochen Fruend. The response is the number of parasitized observations, with population density as a covariate.
Let’s fit the data with a regular binomial k/n GLM:
mod1 <- glm(cbind(N_parasitized, N_adult) ~ logDensity, data = data, family=binomial)
simulationOutput <- simulateResiduals(fittedModel = mod1)
plot(simulationOutput)
We see various signals of overdispersion.
OK, so let’s add overdispersion through an individual-level random effect
mod2 <- glmer(cbind(N_parasitized, N_adult) ~ logDensity + (1|ID), data = data, family=binomial)
simulationOutput <- simulateResiduals(fittedModel = mod2)
plot(simulationOutput)
The overdispersion looks better, but you can see that the residuals still look a bit irregular (although the tests are n.s.). The raw data looks a bit hump-shaped, so we might be tempted to add a quadratic effect.
mod3 <- glmer(cbind(N_parasitized, N_adult) ~ logDensity + I(logDensity^2) + (1|ID), data = data, family=binomial)
simulationOutput <- simulateResiduals(fittedModel = mod3)
plot(simulationOutput)
The residuals look perfect now. That being said, we don’t have a lot of data, and we have to be sure we’re not overfitting. A likelihood ratio test tells us that the quadratic effect is not significantly supported.
## Data: data
## Models:
## mod2: cbind(N_parasitized, N_adult) ~ logDensity + (1 | ID)
## mod3: cbind(N_parasitized, N_adult) ~ logDensity + I(logDensity^2) + 
## mod3:     (1 | ID)
##      Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## mod2  3 214.68 217.95 -104.34   208.68                           
## mod3  4 213.54 217.90 -102.77   205.54 3.1401      1    0.07639 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We learn from this: increasing model complexity always improves the residuals, but according to standard statistical arguments (power, bias-variance trade-off) it’s not always advisable to get them perfect, just good enough!
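The comparison above is a likelihood ratio test between two nested models. The same kind of test can be sketched on simulated stand-in data. The data frame dat and the model names modLinear and modQuad are hypothetical, and plain binomial GLMs are used here instead of the mixed models from the example.

```r
set.seed(7)
n <- 40
trials <- 30

# Stand-in data: k successes out of 30 trials, truly linear in logDensity
logDensity <- runif(n, 0, 3)
k <- rbinom(n, size = trials, prob = plogis(-1 + 0.8 * logDensity))
dat <- data.frame(k = k, fail = trials - k, logDensity = logDensity)

# Nested models: without and with a quadratic effect
modLinear <- glm(cbind(k, fail) ~ logDensity,
                 family = binomial, data = dat)
modQuad <- glm(cbind(k, fail) ~ logDensity + I(logDensity^2),
               family = binomial, data = dat)

# Likelihood ratio (chi-square) test of the additional quadratic term
lrt <- anova(modLinear, modQuad, test = "Chisq")
print(lrt)
```

For glmer fits such as mod2 and mod3, lme4's anova(mod2, mod3) produces a table of the form shown above.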
The next example uses the fairly well-known Owls dataset, which is provided in glmmTMB (see ?Owls for more info about the data).
The following shows a sequence of models, all checked with DHARMa. The example is discussed in a talk at ISEC 2018, see slides here.
m1 <- glm(SiblingNegotiation ~ FoodTreatment*SexParent + offset(log(BroodSize)), data=Owls , family = poisson)
res <- simulateResiduals(m1)
plot(res)
m2 <- glmer(SiblingNegotiation ~ FoodTreatment*SexParent + offset(log(BroodSize)) + (1|Nest), data=Owls , family = poisson)
res <- simulateResiduals(m2)
plot(res)
m3 <- glmmTMB(SiblingNegotiation ~ FoodTreatment*SexParent + offset(log(BroodSize)) + (1|Nest), data=Owls , family = nbinom1)
res <- simulateResiduals(m3)
plot(res)
summary(m3)
##  Family: nbinom1  ( log )
## Formula:          
## SiblingNegotiation ~ FoodTreatment * SexParent + offset(log(BroodSize)) +  
##     (1 | Nest)
## Data: Owls
## 
##      AIC      BIC   logLik deviance df.resid 
##   3400.8   3427.2  -1694.4   3388.8      593 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  Nest   (Intercept) 0.1265   0.3556  
## Number of obs: 599, groups:  Nest, 27
## 
## Overdispersion parameter for nbinom1 family (): 7.05 
## 
## Conditional model:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                          0.67674    0.11340   5.968 2.41e-09
## FoodTreatmentSatiated               -0.87038    0.13964  -6.233 4.58e-10
## SexParentMale                        0.04469    0.10712   0.417    0.677
## FoodTreatmentSatiated:SexParentMale  0.12173    0.17520   0.695    0.487
##                                        
## (Intercept)                         ***
## FoodTreatmentSatiated               ***
## SexParentMale                          
## FoodTreatmentSatiated:SexParentMale    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted
##  vs. simulated
## 
## data:  simulationOutput
## ratioObsSim = 0.79972, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 1.2488, p-value = 0.064
## alternative hypothesis: two.sided
m4 <- glmmTMB(SiblingNegotiation ~ FoodTreatment*SexParent + offset(log(BroodSize)) + (1|Nest), ziformula = ~ FoodTreatment + SexParent,  data=Owls , family = nbinom1)
summary(m4)
##  Family: nbinom1  ( log )
## Formula:          
## SiblingNegotiation ~ FoodTreatment * SexParent + offset(log(BroodSize)) +  
##     (1 | Nest)
## Zero inflation:                      ~FoodTreatment + SexParent
## Data: Owls
## 
##      AIC      BIC   logLik deviance df.resid 
##   3361.0   3400.6  -1671.5   3343.0      590 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  Nest   (Intercept) 0.07114  0.2667  
## Number of obs: 599, groups:  Nest, 27
## 
## Overdispersion parameter for nbinom1 family (): 4.07 
## 
## Conditional model:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                          0.79147    0.09841   8.042 8.82e-16
## FoodTreatmentSatiated               -0.42028    0.14476  -2.903  0.00369
## SexParentMale                       -0.06593    0.09886  -0.667  0.50481
## FoodTreatmentSatiated:SexParentMale  0.11693    0.16948   0.690  0.49022
##                                        
## (Intercept)                         ***
## FoodTreatmentSatiated               ** 
## SexParentMale                          
## FoodTreatmentSatiated:SexParentMale    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Zero-inflation model:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.0325     0.3084  -6.590 4.40e-11 ***
## FoodTreatmentSatiated   1.5427     0.2998   5.146 2.66e-07 ***
## SexParentMale          -0.4902     0.2740  -1.789   0.0736 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted
##  vs. simulated
## 
## data:  simulationOutput
## ratioObsSim = 0.90405, p-value = 0.168
## alternative hypothesis: two.sided
## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 1.0389, p-value = 0.616
## alternative hypothesis: two.sided
m5 <- glmmTMB(SiblingNegotiation ~ FoodTreatment*SexParent + offset(log(BroodSize)) + (1|Nest), dispformula = ~ FoodTreatment , ziformula = ~ FoodTreatment + SexParent,  data=Owls , family = nbinom1)
summary(m5)
##  Family: nbinom1  ( log )
## Formula:          
## SiblingNegotiation ~ FoodTreatment * SexParent + offset(log(BroodSize)) +  
##     (1 | Nest)
## Zero inflation:                      ~FoodTreatment + SexParent
## Dispersion:                          ~FoodTreatment
## Data: Owls
## 
##      AIC      BIC   logLik deviance df.resid 
##   3353.0   3397.0  -1666.5   3333.0      589 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  Nest   (Intercept) 0.08695  0.2949  
## Number of obs: 599, groups:  Nest, 27
## 
## Conditional model:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                          0.79825    0.09511   8.393  < 2e-16
## FoodTreatmentSatiated               -0.47113    0.16647  -2.830  0.00465
## SexParentMale                       -0.08524    0.09024  -0.945  0.34484
## FoodTreatmentSatiated:SexParentMale  0.12765    0.18960   0.673  0.50079
##                                        
## (Intercept)                         ***
## FoodTreatmentSatiated               ** 
## SexParentMale                          
## FoodTreatmentSatiated:SexParentMale    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Zero-inflation model:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.8392     0.2912  -6.317 2.67e-10 ***
## FoodTreatmentSatiated   1.0184     0.4131   2.465   0.0137 *  
## SexParentMale          -0.5722     0.3319  -1.724   0.0847 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Dispersion model:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             1.1061     0.1460   7.578  3.5e-14 ***
## FoodTreatmentSatiated   0.8267     0.2714   3.046  0.00232 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The main concern in Poisson data is dispersion. Poisson regressions are nearly always overdispersed. If you address this problem with quasi-Poisson models, you will not be able to test the model with DHARMa. It is in any case better to move to a negative binomial model, or to add an observation-level random effect.
Once that is done, you should check for heteroskedasticity (via the standard plot, and also against all predictors), and for zero-inflation. As noted, zero-inflation tests often come out negative, with the zero-inflation showing up as underdispersion instead. Work through the Owls example above.
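The sequence of checks described above can be sketched as follows, assuming simulated stand-in data with about 30% structural zeros that the fitted Poisson GLM ignores. The variable names and the zero-inflation mechanism are assumptions for illustration; testDispersion and testZeroInflation are the DHARMa tests whose output is shown in the Owls example.

```r
library(DHARMa)

set.seed(123)
n <- 300
x <- runif(n, -1, 1)

# Stand-in data: Poisson counts where roughly 30% of values are forced to zero
lambda <- exp(1 + x)
y <- ifelse(runif(n) < 0.3, 0, rpois(n, lambda))

# Plain Poisson GLM that does not account for the extra zeros
fit <- glm(y ~ x, family = poisson)
sim <- simulateResiduals(fit)

# Step 1: dispersion test
disp <- testDispersion(sim, plot = FALSE)

# Step 2: zero-inflation test (observed vs. simulated number of zeros)
zi <- testZeroInflation(sim, plot = FALSE)

print(disp)
print(zi)
```

In a plain Poisson model like this one, unmodelled extra zeros typically also inflate the dispersion statistic, so the two tests should be read together rather than in isolation.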
Proportional data is often modelled with beta regressions. Those can be tested with DHARMa. Note that beta regressions are often 0- or 1-inflated. Both should be tested with testZeroInflation or testGeneric.
Note: discrete proportions, of the type k/n, should NOT be modelled with the beta regression. Use the binomial (see below).
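A one-inflation check with testGeneric could look like the following sketch. It assumes stand-in data and a known beta distribution in place of a fitted beta regression, so the simulations are drawn by hand and passed to createDHARMa. The summary function countOnes is a hypothetical helper, not part of DHARMa.

```r
library(DHARMa)

set.seed(3)
n <- 200
nSim <- 250

# Stand-in observations: beta-distributed proportions with 20 exact ones added
obs <- rbeta(n, 2, 2)
obs[sample(n, 20)] <- 1

# Simulations from the assumed model without one-inflation (n rows, nSim columns)
sims <- replicate(nSim, rbeta(n, 2, 2))

res <- createDHARMa(simulatedResponse = sims,
                    observedResponse = obs,
                    fittedPredictedResponse = apply(sims, 1, median),
                    integerResponse = FALSE)

# Hypothetical summary statistic: number of exact ones in a response vector
countOnes <- function(x) sum(x == 1)

# Compare observed and simulated number of ones
oneInflation <- testGeneric(res, summary = countOnes,
                            alternative = "greater", plot = FALSE)
print(oneInflation)
```

The same pattern applies to zero-inflation in proportions, with a summary function that counts exact zeros.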
There are a lot of rumors about what can and cannot be checked with binomial 0/1 data. Note that binomial data behaves slightly differently when you have a 0/1 response than when you have a k/n response.
Let’s consider a clearly misspecified binomial model with 0/1 response data:
testData = createData(sampleSize = 500, overdispersion = 0, fixedEffects = 5, family = binomial(), randomEffectVariance = 3, numGroups = 25)
fittedModel <- glm(observedResponse ~ 1, family = "binomial", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
One rumor that is true is that, unlike in k/n or count data, such a misspecification will not produce overdispersion if tested directly. The reason is that there is basically no “dispersion” in a 0/1 signal: the variance of a 0/1 variable is fully determined by its mean.
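The point about dispersion in a 0/1 signal can be illustrated by hand in base R. The sketch below assumes stand-in data with strong group-level variation that an intercept-only model ignores; all variable names are hypothetical. It compares observed and expected variance once for the raw 0/1 values and once for the sums per group.

```r
set.seed(1)
nGroups <- 25
perGroup <- 20
group <- rep(seq_len(nGroups), each = perGroup)

# Stand-in data: group-level random intercepts on the logit scale
p <- plogis(rnorm(nGroups, sd = sqrt(3)))[group]
y <- rbinom(length(p), size = 1, prob = p)

# Fitted probability of an intercept-only binomial model
pHat <- mean(y)

# Ungrouped 0/1 data: the observed variance equals pHat * (1 - pHat) by construction
varObserved01 <- mean((y - pHat)^2)
varExpected01 <- pHat * (1 - pHat)

# Grouped data: the sum per group is a k/n response with n = perGroup
k <- tapply(y, group, sum)
varObservedKN <- var(k)
varExpectedKN <- perGroup * pHat * (1 - pHat)

print(c(observed01 = varObserved01, expected01 = varExpected01))
print(c(observedKN = varObservedKN, expectedKN = varExpectedKN))
```

For the 0/1 values the two variances coincide whatever the model misses, whereas the variance of the group sums can exceed the binomial expectation, which is what a dispersion test picks up.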
However, you can still clearly see the misfit if you plot, e.g.
Moreover, you will see overdispersion from the misfit if you group your data. Grouping basically transforms the 0/1 response into a k/n response. Here, I show the difference in the dispersion test for the same data, once ungrouped (first output) and once grouped according to the random effects group (second output).
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted
##  vs. simulated
## 
## data:  simulationOutput
## ratioObsSim = 1.0011, p-value = 0.536
## alternative hypothesis: two.sided
simulationOutput = recalculateResiduals(simulationOutput , group = testData$group)
testDispersion(simulationOutput)
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted
##  vs. simulated
## 
## data:  simulationOutput
## ratioObsSim = 2.2163, p-value < 2.2e-16
## alternative hypothesis: two.sided
lm and glm and MASS::glm.nb are fully supported.
lme4 model classes are fully supported.
mgcv is partly supported. Non-standard distributions are not supported, because mgcv doesn’t implement a simulate function for those.
glmmTMB is nearly fully supported since DHARMa 0.2.7 and glmmTMB 1.0.0. A remaining limitation is that you can’t adjust whether simulations are conditional or not, so simulateResiduals(model, re.form = NULL) will have no effect; simulations will always be done from the full model.
spaMM is supported by DHARMa since 0.2.1
See my general comments about adding new R packages to DHARMa
As noted there, if you want to use DHARMa for a specific case, you could write a custom simulate function for the specific model you are working with. This will usually involve using the predict function and adding the random distribution, plus potentially drawing new data for the random effects or other hierarchical levels.
As an example, for a Poisson GLM, a simulate function could be programmed as in the following example, which also shows how the results are read into DHARMa and plotted (see also the following section):
testData = createData(sampleSize = 200, overdispersion = 0.5, family = poisson())
fittedModel <- glm(observedResponse ~ Environment1, family = "poisson", data = testData)
simulatePoissonGLM <- function(fittedModel, n){
  pred = predict(fittedModel, type = "response")
  nObs = length(pred)
  sim = matrix(nrow = nObs, ncol = n)
  for(i in 1:n) sim[,i] = rpois(nObs, pred)
  return(sim)
}
sim = simulatePoissonGLM(fittedModel, 100)
DHARMaRes = createDHARMa(simulatedResponse = sim, observedResponse = testData$observedResponse, 
             fittedPredictedResponse = predict(fittedModel), integerResponse = T)
plot(DHARMaRes, quantreg = F)
As mentioned earlier, the quantile residuals defined in DHARMa are the frequentist equivalent of the so-called “Bayesian p-values”, i.e. residuals created from posterior predictive simulations in a Bayesian analysis.
To make the plots and tests in DHARMa also available for Bayesian analysis, DHARMa provides the option to convert externally created posterior predictive simulations into a DHARMa object with the createDHARMa function.
What is provided as fittedPredictedResponse is up to the user, but median posterior predictions seem most sensible to me. After the conversion, all DHARMa plots can be used. Note, however, that Bayesian p-values != DHARMa residuals, because in the Bayesian analysis, parameters are varied as well.
Important: as DHARMa doesn’t know the distribution of the fitted model, it is vital to specify the integerResponse option by hand (see above / ?simulateResiduals for details).
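As a minimal sketch of this conversion, the following assumes a toy Bayesian model that can be sampled without external software: Poisson observations with a conjugate Gamma prior on the rate. In practice the matrix of posterior predictive simulations would come from JAGS, Stan or similar. The names ppSims, lambdaPost, a0 and b0 are hypothetical.

```r
library(DHARMa)

set.seed(2)
n <- 100
nSim <- 250

# Stand-in observations
y <- rpois(n, lambda = 3)

# Conjugate Gamma(a0, b0) prior on the Poisson rate gives a Gamma posterior
a0 <- 1
b0 <- 1
lambdaPost <- rgamma(nSim, shape = a0 + sum(y), rate = b0 + n)

# Posterior predictive simulations: n rows (observations), nSim columns (draws);
# the parameter differs between columns, unlike in a frequentist simulation
ppSims <- sapply(lambdaPost, function(l) rpois(n, l))

res <- createDHARMa(simulatedResponse = ppSims,
                    observedResponse = y,
                    fittedPredictedResponse = apply(ppSims, 1, median),
                    integerResponse = TRUE)   # counts, so set by hand

# Any DHARMa test can now be applied to the converted object
unif <- testUniformity(res, plot = FALSE)
print(unif)
```

The simulation matrix needs one row per observation and one column per posterior draw, in the same order as observedResponse.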