There are several methods to estimate the parameters of a hidden Markov model (HMM). To be complete, we discuss three methods typically used when assumed that the data is generated by one (uni-level) model. That is, the Maximum Likelihood approach, the Baum Welch algorithm utilizing the forward backward probabilities, and the Bayesian estimation method. Note that all these methods assume that the number of states is known from the context of the application, i.e., specified by the user. The issue of determining the number of states is discussed in the vignette “tutorial-mhmm”.
After discussing the simplified case of estimating the parameters where the data consists of only one observed sequence (or, with multiple sequences, assuming that all data is generated by one identical model), we proceed with elaborating on estimating the parameters of a multilevel hidden Markov model.
ML estimation can be used to estimate the parameters of the HMM. The relevant likelihood function has a convenient form: \[\begin{equation} L_T = \mathbf{\delta P}(o_1) \mathbf{\Gamma P}(o_2)\mathbf{\Gamma P}(o_3) \ldots \mathbf{\Gamma P}(o_T) \mathbf{1'}. \end{equation}\] Here, \(\mathbf{P}(o_t)\) denotes a diagonal matrix with the state-dependent conditional probabilities of observing \(O_t = o\) as entries, \(\mathbf{\delta}\) denotes the distribution of the initial probabilities \(\pi_i\), \(\mathbf{\Gamma}\) denotes the transition probability matrix, and \(\mathbf{1'}\) is a column vector consisting of \(m\) (i.e., the number of distinct states) elements which all have the value one. See the vignette “tutorial-mhmm” for an explanation of these quantities. Direct maximization of the log-likelihood poses no problems even for very long sequences, provided measures are taken to avoid numerical underflow.
The EM algorithm (Dempster, Laird, and Rubin 1977), in this context also known as the Baum-Welch algorithm (Baum et al. 1970; Rabiner 1989), can also be used to maximize the log-likelihood function. Here, the unobserved latent states are treated as missing data, and quantities known as the forward and the backward probabilities are used to obtain the ‘complete-data log-likelihood’ of the HMM parameters: the log-likelihood based on both the observed event sequence and the unobserved, or “missing”, latent states. The forward probabilities \(\boldsymbol{\alpha}_t (i)\) denote the joint probability of the observed event sequence from time point 1 to \(t\) and state \(S\) at time point \(t\) being \(i\):
\[\begin{equation} \label{forward} \boldsymbol{\alpha}_t (i) = Pr(O_1 = o_1, O_2 = o_2, \ldots, O_t = o_t, S_t = i). \end{equation}\]
The name “forward probabilities” derives from the fact that when computing the forward probabilities \(\boldsymbol{\alpha}_t\), one evaluates the sequence of hidden states in the chronological order (i.e., forward in time) until time point \(t\). The backward probabilities \(\boldsymbol{\beta}_t (i)\) denote the conditional probability of the observed event sequence after time point \(t\) until the end, so from \(t+1, t+2, \ldots ,T\), given that state \(S\) at time point \(t\) equals \(i\):
\[\begin{equation} \boldsymbol{\beta}_t (i) = Pr(O_{t+1} = o_{t+1}, O_{t+2} = o_{t+2}, \ldots, O_T = o_T \mid S_t = i). \end{equation}\]
When computing the backward probabilities \(\boldsymbol{\beta}_t\), one evaluates the sequence of hidden states in the reversed order, i.e., from \(S_T, S_{T-1}, \ldots, S_{t+1}\). The forward and backward probabilities together cover the complete event sequence from \(t = 1\) to \(T\), and combined give the joint probability of the complete event sequence and state \(S\) at time point \(t\) being \(i\):
\[\begin{equation} \boldsymbol{\alpha}_t (i)\boldsymbol{\beta}_t (i) = Pr(O_{1} = o_{1}, O_{2} = o_{2}, \ldots, O_T = o_T, S_t = i). \end{equation}\]
We refer to Cappé (2005) for a discussion on the advantages of combining forward and backward probability information in the EM algorithm over direct maximization of the likelihood for the HMM.
A third approach is to use Bayesian estimation to infer the parameters of the HMM. We refer to e.g., Gelman et al. (2014), Lynch (2007), Rossi, Allenby, and McCulloch (2012) for an in-depth exposition of Bayesian statistics. In general terms, the difference between frequentist and Bayesian estimation is the following. In frequentist estimation, we view the parameters as fixed entities in the population, which are subject only to sampling fluctuation (as quantified in the standard error of the estimate). In Bayesian estimation, however, we assume that each parameter follows a given distribution. The general shape of this distribution is determined beforehand, using a prior distribution. This prior distribution not only determines the shape of the parameter distribution, but also allows for giving some information to the model with respect to the most likely values of the parameter that is estimated. To arrive at the final distribution for the parameter - which is called the posterior distribution -, the prior distribution is combined with the likelihood function of the data using Bayes’ theorem. Here, the likelihood function provides us with the probability of the data given the parameters. While any aspect of these distributions may be of interest, the emphasis is usually on the mean (or median) of the posterior distribution, which serves as the point estimate of the parameter of interest (analogous to the frequentist parameter estimates). In the event that one has no or vague expectations about the possible parameter values, one can specify “non-informative” priors (e.g., uniform distribution). That is, one can choose parameters of the prior distributions, so called hyper-parameters, such that the parameters may assume a wide range of possible values. Non-informative priors therefore express a lack of knowledge.
In the implemented hidden Markov model, both the transitions from state \(i\) at time point \(t\) to any of the other states at time point \(t + 1\) and the observed outcomes within state \(i\) follow a categorical distribution, with parameter sets \(\mathbf{\Gamma}_i\) (i.e., the probabilities in row \(i\) of the transition probability matrix \(\mathbf{\Gamma}\)) and \(\boldsymbol{\theta}_i\) (i.e., the state \(i\) dependent probabilities of observing an act). A convenient (conjugate) prior distribution on the parameters of the categorical distribution is a (symmetric) Dirichlet distribution. We assume that the rows of \(\mathbf{\Gamma}\) and the state-dependent probabilities \(\boldsymbol{\theta}_i\) are independent. That is,
\[\begin{align} S_{t = 2, \ldots, T} \: \sim \mathbf{\Gamma}_{S_{t-1}} \quad &\text{with} \quad \mathbf{\Gamma}_i \: \sim \text{Dir}(\mathbf{a}_{10}) \quad \text{and} \label{gammaDirPrior}\\ O_{t = 1, \ldots, T} \: \sim\boldsymbol{\theta}_{S_t} \quad &\text{with} \quad \boldsymbol{\theta}_i \: \sim \text{Dir}(\mathbf{a}_{20}), \label{pDirPrior} \end{align}\]
where the probability distribution of the current state \(S_t\) is given by the row of the transition probability matrix \(\mathbf{\Gamma}\) corresponding to the previous state in the hidden state sequence \(S_{t-1}\). The probability distribution of \(S_t\) given by \(\mathbf{\Gamma}\) holds for states after the first time point, i.e., t starts at 2 as there is no previous state in the hidden state sequence for state \(S\) at \(t = 1\). The probability of the first state in the hidden state sequence \(S_1\) is given by the initial probabilities of the states \(\pi_i\). The probability distribution of the observed event \(O_t\) is given by state-dependent probabilities \(\boldsymbol{\theta}_i\) corresponding to the current state \(S_t\). The hyper-parameter \(\mathbf{a}_{10}\) of the prior Dirichlet distribution on \(\mathbf{\Gamma}_i\) is a vector with length equal to the number of states \(m\), and the hyper-parameter \(\mathbf{a}_{20}\) of the prior Dirichlet distribution on \(\boldsymbol{\theta}_i\) is a vector with length equal to the number of categorical outcomes \(q\). Note that in this model, the hyper-parameter values are assumed invariant over the states \(i\). The initial probabilities of the states \(\pi_i\) are assumed to coincide with the stationary distribution of \(\boldsymbol{\Gamma}\) and are therefore not independent (to-be-estimated) parameters.
Given these distributions, our goal is to construct the joint posterior distribution of the hidden state sequence and the parameter estimates, given the observed event sequence and the hyper-parameters \[\begin{equation} \Pr\big((S_t), \mathbf{\Gamma}_i , \boldsymbol{\theta}_i \mid (O_t)\big) \propto \Pr\big((O_t) \mid (S_t), \boldsymbol{\theta}_i \big) \Pr\big((S_t) \mid \mathbf{\Gamma}_i \big) \Pr\big(\mathbf{\Gamma}_i \mid \mathbf{a}_{10} \big) \Pr\big(\boldsymbol{\theta}_i \mid \mathbf{a}_{20} \big) % \mathbf{a}_{10}, \mathbf{a}_{20} \end{equation}\] by drawing samples from the posterior distribution. By applying a Gibbs sampler, we can iteratively sample from the appropriate conditional posterior distributions of \(S_t\), \(\mathbf{\Gamma}_i\) and \(\boldsymbol{\theta}_i\), given the remaining parameters in the model. In short, the Gibbs sampler iterates between the following two steps: first the hidden state sequence \(S_1, S_2, \ldots, S_T\) is sampled, given, the observed event sequence \(O_1, O_2, \ldots, O_T\), and the current values of the parameters \(\mathbf{\Gamma}\) and \(\boldsymbol{\theta}_i\). Subsequently, the remaining parameters in the model (\(\mathbf{\Gamma}_i\) and \(\boldsymbol{\theta}_i\)) are updated by sampling them conditional on the sampled hidden state sequence \(S_1, S_2, \ldots, S_T\) and observed event sequence \(O_1, O_2, \ldots, O_T\).
Sampling the hidden state sequence of the HMM by means of the Gibbs sampler can be performed in various ways. Here, we use the approach outlined by Scott (2002). That is, we use the forward-backward Gibbs sampler, in which first the forward probabilities \(\boldsymbol{\alpha}_{t}(i)\) (i.e., the joint probability of state \(S = i\) at time point \(t\) and the observed event sequence from time point 1 to \(t\), see above equation) are obtained, after which the hidden state sequence is sampled in a backward run (i.e., drawing \(S_T, S_{T-1}, \ldots, S_1\)) using the corresponding forward probabilities \(\boldsymbol{\alpha}_{T:1}\). The forward-backward Gibbs sampler produces sampled values that rapidly represent the complete area of the posterior distribution, and produces useful quantities as byproducts, such as the log-likelihood of the observed data given the current draws of the parameters in each iteration (Scott 2002). In the section “” , we provide a more detailed description of how the Gibbs sampler proceeds for the HMM.
As it generally takes a number of iterations before the Gibbs sampler converges to the appropriate region of the posterior distribution, the initial iterations are usually discarded as a ‘burn-in’ period. The remaining sampled values of \(\mathbf{\Gamma}_i\) and \(\boldsymbol{\theta}_i\) provide the posterior distributions of their respective parameters.
A problem that can arise when using Bayesian estimation in this context is “label switching”, i.e., as the hidden states of the HMM have no a priori ordering or interpretation, their labels (i.e., which state represents what) can switch over the iterations of the Gibbs sampler, without affecting the likelihood of the model (see e.g., Scott 2002; Jasra, Holmes, and Stephens 2005). As a result, the marginal posterior distributions of the parameters are impossible to interpret because they represent the distribution of multiple states. Sometimes, using reasonable starting values (i.e., the user-specified parameter values of the “zero-th” iteration used to start the MCMC sampler) suffices to prevent label switching. Otherwise, possible solutions are to set constraints on the parameters of the state-dependent distribution, or use (weakly) informative priors on the state-dependent distributions (Scott 2002). Hence, before making inferences from the obtained marginal distributions, one should first assess if the problem of label switching is present (e.g., by using plots of the sampled parameter values of the state-dependent distributions over the iterations), and if necessary, take steps to prevent the problem of label switching. In our own experience, the use of reasonable starting values always sufficed to prevent label switching.
Both EM and Bayesian Gibbs sampling are viable inferential procedures for HMMs, but for more complex HMMs such as multilevel HMMs, the Bayesian estimation method has several advantages (e.g., lower computational cost, and less computation time) over the EM algorithm. We refer to Rydén (2008) for a comparison on frequentist (i.e., the EM algorithm) and Bayesian approaches.
Bayesian estimation is particularly suited to model multilevel models. In the multilevel model, we have a multi-layered structure in the parameters. For the HMM, we have subject level parameters at the first level pertaining to the observations within a subject, and group level parameters at the second level that describe the mean and variation within the group, as inferred from the sample of subjects. To illustrate the multilevel model, suppose that we have \(K\) subjects for which we have each \(H\) observations on their number of cups of coffee consumed per day \(y\), i.e., subject \(k \in \{1, 2, \ldots, K\}\) and observation \(h \in \{1, 2, \ldots, H\}\). Hence, at the first level, we have daily observations on coffee consumption within subjects: \(y_{11}\), \(y_{12}\), \(\ldots\), \(y_{1H}\), \(y_{21}\), \(y_{22}\), \(\ldots\), \(y_{2H}\), \(y_{K1}\), \(y_{K2}\), \(\ldots\), \(y_{KH}\). Using a multilevel model, the observations of each subject are distributed according to the same distribution \(Q\), but each subject has its own parameter set \(\theta_k\). That is:
\[\begin{equation} y_{kh} \sim Q(\theta_k). \end{equation}\]
In addition, the subject-specific parameter sets \(\theta_k\) are realizations of a common group level distribution \(W\) with parameter set \(\Lambda\):
\[\begin{equation} \theta_k \sim W(\Lambda). \end{equation}\]
That is, in the multilevel model, the subject level model parameters that pertain to the observations within a subject are assumed to be random draws from a given distribution, and, as such, are denoted as “random”, independent of the used estimation method (i.e., Bayesian or classical frequentist estimation). This multi-layered structure fits naturally into a Bayesian paradigm since in Bayesian estimation, model parameters are by definition viewed as random. That is, parameters follow a given distribution, where the prior distribution expresses the prior expectations with respect to the most likely values of the model parameters. In the multilevel model, the prior expectations of the subject level model parameter values are reflected in the group level distribution. Hence, in Bayesian estimation, the prior distribution for the subject level parameters, is given by the group level distribution. The group level distribution provides information on the location (e.g., mean) of the subject level (i.e., subject-specific) parameters, and on the variation in the subject level parameters. As the Normal distribution is a flexible distribution with parameters that easily relate to this interpretation, the group level distribution is often taken to be a normal distribution.
To illustrate the notion of the group level (prior) distribution, suppose we assume a Poisson distribution for the observations on daily coffee consumption within each subject k, and a Normal group level distribution on the Poisson mean. In this case, the set of hyper-parameters (i.e., the parameters of the group level distribution, here the mean (\(\Lambda_\mu\)) and variance (\(\Lambda_{\sigma^2}\)) of the Normal distribution) on the Poisson mean denote the group mean number of cups of coffee consumed per day over subjects, and the variation in the mean number of cups of coffee consumed per day between subjects.
Finally, in fitting the multilevel model using Bayesian estimation, a prior distribution is placed on each of these hyper-parameters. Prior distributions on hyper-parameters are referred to as hyper-priors and allow the hyper-parameters to have a distribution instead of being fixed. That is, as the parameters that characterize the group level prior distribution (i.e., the hyper-parameters) are now also quantities of interest (i.e., to-be-estimated), they are viewed as random in Bayesian estimation methods. The randomness in the hyper-parameters is thus specific to the Bayesian estimation method of the multilevel model, in contrast to the randomness in the subject level parameters.
To continue our example, the hyper-prior on the mean of the Normal prior distribution for the subject level mean of cups of coffee consumed daily denote our prior belief on the mean number of cups of coffee consumed per day in the group. The hyper-prior on the variance of the Normal prior distribution for the subject level mean of cups of coffee consumed per day denote our prior belief on how much this mean number of cups of coffee varies over subjects. Often, the hyper-prior distribution and its values are chosen to be vague (i.e., not informative), like a uniform distribution:
\[\begin{align} \Lambda_\mu & \: \sim U(0, 20),\\ \nonumber \Lambda_{\sigma^2} & \: \sim U(0, 500). \end{align}\]
See e.g., Gelman et al. (2014), Lynch (2007), Rossi, Allenby, and McCulloch (2012) for an in-depth exposition of various multilevel Bayesian models and e.g. Snijders and Bosker (2011), Hox, Moerbeek, and Schoot (2017), Goldstein (2011) for coverage of the classical, frequentist approach to multilevel (also called hierarchical or random effects) models.
The present implemented multilevel model pertains only to data comprised of (multivariate) categorical observations, and, possibly, time invariant covariates (i.e., for each covariate, we have one value per subject). As such, data is comprised of \(O_{d, k, t}\) observations on the categorical outcome(s) for categorical outcome variable \(d = 1, 2, \ldots, D\) for subject \(k = 1, 2, \ldots, K\) at time point \(t = 1, 2, \ldots, T\). In addition, we have a matrix \(\boldsymbol{X}\) that consists of \(k\) covariate vectors with length \(p \times 1\), \(\boldsymbol{X}_k = (X_{k1}, X_{k2}, \ldots, X_{kp})\). As yet, the explanation of the estimation procedure in this vignette is restricted to the univariate case for simplicity. That is, to the instance that we only have one observed categorical outcome variable per subject, and our outcome data is comprised of \(O_{kt}\) observations. However, the explanation extents quite naturally to the multivariate case.
Given these observations, we construct a multilevel model for each of the parameters in the HMM with \(q\) observable categorical outcomes, and \(m\) hidden states, possibly predicted by \(p\) covariates. Using the multilevel framework, each parameter is assumed to follow a given distribution, and the parameter value of a given subject represents a draw from this common (i.e., group level) distribution. Hence, in the multilevel Bayesian HMM, the parameters are: the subject-specific transition probability matrix \(\boldsymbol{\Gamma}_k\) with transition probabilities \(\gamma_{kij}\) and the subject-specific state-dependent probability distribution denoting the subject-specific probabilities \(\boldsymbol{\theta}_{ki}\) of categorical outcomes within state \(i\). The initial probabilities of the states \(\pi_{k,j}\) are not estimated as \(\pi_{k}\) is assumed to be the stationary distribution of \(\boldsymbol{\Gamma}_k\). Subsequently, the parameter values of the subjects are assumed to be realizations of a model component and state specific (multivariate) Normal distribution. We discuss the multilevel model for the two components of the HMM (\(\boldsymbol{\Gamma}_k\) and \(\boldsymbol{\theta}_{ki}\) separately. Below we provide an overview of the used symbols in the multilevel models related to the two components of the HMM. We use the subscript 0 to denote values of the hyper-prior distribution parameters.
\(k\): subject \(k \in \{1, 2, \ldots, K \}\), where \(K\) is the total number of subjects in the dataset.
\(t\): Time point \(t \in \{1, 2, \ldots, T \}\), where \(T\) is the total length of each sequence of observations. In the current notation, \(T\) is assumed equal over subjects. Within the R package mHMMbayes this is not a requirement, however.
\(q\): Number of distinct observation categories.
\(m\): Number of distinct states.
\(p\): Number of (time invariant) covariates.
\(O_{kt}\): Observation on the categorical outcome for subject \(k\) at time point \(t\).
\(S_{kt}\): State for subject \(k\) at time point \(t\) for \(S \in \{1, 2, \ldots, m \}\).
\(i\): Realization of the current state \(S_t\), where \(i \in \{1, 2, \ldots , m\}\).
\(\boldsymbol{X}\): Matrix of \(k\) covariate vectors with length \(p \times 1\), \(\boldsymbol{X}_k = (X_{k1}, X_{k2}, \ldots, X_{kp})\).
\(\boldsymbol{\Gamma_k} = [\gamma_{kij}]\): Subject-specific transition probability matrix between states with the probabilities \(\gamma_{kij}\) of transitioning from state \(S_{kt} = i\) to state \(S_{k(t+1)} = j\).
\(\boldsymbol{\alpha}_{(S)ki}\): subject \(k\) and state \(i\) specific batch of \(m-1\) random intercepts that model the transitions from state \(i\) to the next state \(j\).
\(\boldsymbol{\bar{\alpha}}_{(S)i}\): State \(i\) specific group mean vector over the subject \(k\) batches of the \(m-1\) random intercepts \(\boldsymbol{\alpha}_{(S)ki}\).
\(\boldsymbol{\beta}_{(S)i}\): State \(i\) specific fixed regression coefficients to predict the random intercepts \(\boldsymbol{\alpha}_{(S)ki}\).
\(\Psi_i\): State \(i\) specific covariance between the subject \(k\) batches of the \(m-1\) random intercepts \(\boldsymbol{\alpha}_{(S)ki}\).
\(\boldsymbol{\alpha}_{(S)0}\), \(\boldsymbol{\beta}_{(S)0}\), \(K_0\): Values of the parameters of the hyper-prior on the group mean vector \(\boldsymbol{\bar{\alpha}}_{(S)i}\) and fixed regression coefficients \(\boldsymbol{\beta}_{(S)i}\) .
\(\Psi_0\), \(df_0\): Values of the parameters of the hyper-prior on the group covariance \(\Psi_i\).
\(O_{kt}\): Observed event for subject \(k\) at time point \(t\) for \(O \in \{1, 2, \ldots q\}\).
\(l\): Realization of current event \(O_{kt}\), where \(l \in \{1, 2, \ldots, q \}\).
\(\boldsymbol{\theta}_{ki} = [\theta_{kil}]\): Subject-specific state \(i\) categorical conditional distribution, with the probabilities \(\text{p}_{kil}\) of observing the categorical outcome \(O_{kt} = l\) in state \(S_{kt} = i\).
\(\boldsymbol{\alpha}_{(O)ki}\): Subject \(k\) state \(i\) specific batch of \(q-1\) random intercepts that model the probability of a categorical outcome \(O_{kt}\) within state \(i\).
\(\boldsymbol{\bar{\alpha}}_{(O)i}\): State \(i\) specific group mean vector over the subject \(k\) batches of the \(q-1\) random intercepts \(\boldsymbol{\alpha}_{(O)ki}\).
\(\boldsymbol{\beta}_{(O)i}\): State \(i\) specific fixed regression coefficients to predict the subject-specific random intercepts \(\boldsymbol{\alpha}_{(O)ki}\).
\(\Phi_{i}\): State \(i\) specific covariance between the subject \(k\) batches of the \(q-1\) random intercepts \(\boldsymbol{\alpha}_{(O)ki}\).
\(\boldsymbol{\alpha}_{(O)0}\), \(\boldsymbol{\beta}_{(O)0}\), \(K_0\): Values of the parameters of the hyper-prior on the group mean vector \(\boldsymbol{\bar{\alpha}}_{(O)i}\) and the fixed regression coefficients \(\boldsymbol{\beta}_{(O)i}\).
\(\Phi_0\), \(df_0\): Values of the parameters of the hyper-prior on the group covariance \(\Phi_{i}\).
In the standard (non-multilevel) Bayesian HMM estimation, we specified a Dirichlet prior distribution on the state-dependent probabilities \(\boldsymbol{\theta}_{i}\). To provide a flexible model that allows for the inclusion of random effects and (time invariant) covariates, we follow Altman (2007) and extend the subject-specific state-dependent probabilities \(\boldsymbol{\theta}_{ki}\) to a Multinomial logit (MNL) model. Hence, we utilize a linear predictor function to estimate the probability of observing categorical outcome \(l\) within state \(i\). The state \(i\) specific linear predictor function at the subject level consists of \(q-1\) random intercepts (i.e., each subject has its own intercept). That is, each categorical outcome \(l\) has its own intercept, with the exception of the first categorical outcome in the set for which the intercept is omitted for reasons of model identification (i.e., not all probabilities can be estimated freely as within subject \(k\) and state \(i\), the probabilities need to add up to 1). By making the intercepts random (i.e., each subject has its own intercept), we accommodate heterogeneity between subjects in their state conditional probabilities. Hence, in the MNL model for \(\boldsymbol{\theta}_{ki}\), subject \(k\)’s probabilities of observing categorical outcome \(l \in \{1, 2, \ldots, q \}\) within a state \(i \in \{1, 2, \ldots , m\}\), \({\theta}_{kil}\), are modeled using \(m\) batches of \(q-1\) random intercepts, \(\boldsymbol{\alpha}_{(O)ki} = (\alpha_{(O)ki2}, \alpha_{(O)ki3}, \ldots, \alpha_{(O)kiq})\). That is,
\[\begin{equation} {\theta}_{kil} = \frac{\text{exp}(\alpha_{(O)kil})}{1 + \sum_{\bar{l} = 2}^q \text{exp}(\alpha_{(O)ki\bar{l}})}, \end{equation}\]
where \(K\), \(m\), and \(q\) are the number of subjects, states, and
categorical outcomes, respectively. The numerator is set equal to one
for \(l = 1\), making the first
categorical outcome in the set the baseline category in every
state.
At the group level, these subject-level intercepts are (possibly) partly
determined by covariates that differentiate between subjects. Thus, in
addition to the subject level random intercepts \(\boldsymbol{\alpha}_{(O)ki}\), we have
\(m\) matrices of \(p * (q-1)\) fixed regression coefficients,
\(\boldsymbol{\beta}_{(O)i}\), where
\(p\) denotes the number of used
covariates. The columns of \(\boldsymbol{\beta}_{(O)i}\) are \(\boldsymbol{\beta}_{(O)il} = (\beta_{(O)il1},
\beta_{(O)il2}, \ldots, \beta_{(O)ilp})\) to model the random
intercepts for state \(i\) and
categorical outcome \(l\) given \(p\) covariates. Combining both terms, each
batch of random intercepts \(\boldsymbol{\alpha}_{(O)ki}\) (i.e., the
batch of \(q-1\) intercepts for the
state \(i\) conditional probabilities
of a categorical outcome for subject \(k\)) come from a state \(i\) specific population level multivariate
Normal distribution, with mean vector \(\boldsymbol{\bar{\alpha}}_{(O)i} +
\boldsymbol{X_{k}^\top} \boldsymbol{\beta_{(O)il}}\) that has
length \(q-1\), and covariance \(\Phi_{i}\) that denotes the covariance
between the \(q-1\) state \(i\) specific intercepts over subjects and
models the dependence of the probabilities of categorical outcomes
within state \(i\) (i.e., we specify a
state specific multivariate Normal prior distribution on the
subject-specific \(\boldsymbol{\alpha_{(O)ki}}\) parameters).
A convenient hyper-prior on the hyper-parameters of the group level
prior distribution is a multivariate Normal distribution for the mean
vector \(\boldsymbol{\bar{\alpha}}_{(O)i}\) and the
fixed regression coefficients \(\boldsymbol{\beta_{(O)il}}\), and an
Inverse Wishart distribution for the covariance \(\Phi_{i}\) (see
e.g., Gelman et al. 2014). That is,
\[\begin{align} O_{kt} \: \sim \boldsymbol{\theta}_{k, S_{kt}} & \quad \text{with} \quad \boldsymbol{\theta}_{ki} \: \sim \text{MNL}\big(\boldsymbol{\alpha}_{(O)ki}\big), \label{pPriorHier}\\ \boldsymbol{\alpha}_{(O)ki} \: \sim \text{N}\big(\boldsymbol{\bar{\alpha}}_{(O)i} + \boldsymbol{X_{k}^\top} \boldsymbol{\beta_{(O)il}}, \Phi_{i}\big) & \quad \text{with} \quad \boldsymbol{\bar{\alpha}}_{(O)i} \: \sim \text{N}\big(\boldsymbol{\alpha}_{(O)0}, \tfrac{1}{K_0}\Phi_{i} \big), \label{popPHier}\\ & \quad \text{and} \quad \boldsymbol{\beta}_{(O)i} \: \sim \text{N}\big(\boldsymbol{\beta}_{(O)0}, \tfrac{1}{K_0}\Phi_{i} \big), \\ & \quad \text{and} \quad \: \Phi_{i} \: \sim \text{IW}\big(\Phi_0, df_0 \big), \nonumber \end{align}\]
where the probability distribution of the observed categorical outcomes \(O_{kt}\) is given by the subject \(k\) specific state-dependent probabilities \(\boldsymbol{\theta}_{ki}\) corresponding to the current state \(S_{kt}\) (where \(t\) indicates the time point). The parameters \(\boldsymbol{\alpha}_{(O)0}\), \(\boldsymbol{\beta}_{(O)0}\), and \(K_0\) denote the values of the parameters of the hyper-prior on the group (mean) vector \(\boldsymbol{\bar{\alpha}}_{(O)i}\) and \(\boldsymbol{\beta}_{(O)i}\), respectively. Here, \(\boldsymbol{\alpha}_{(O)0}\) and \(\boldsymbol{\beta}_{(O)0}\) represent a vector of means and \(K_0\) denotes the number of observations (i.e., the number of hypothetical prior subjects) on which the prior mean vector \(\boldsymbol{\alpha}_{(O)0}\) and \(\boldsymbol{\beta}_{(O)0}\) are based, i.e., \(K_0\) determines the weight of the prior on \(\boldsymbol{\bar{\alpha}}_{(O)i}\) and \(\boldsymbol{\beta}_{(O)i}\). The parameters \(\Phi_0\) and \(df_0\), respectively, denote the values of the covariance and the degrees of freedom of the hyper-prior Inverse Wishart distribution on the population variance \(\Phi_{i}\) of the subject-specific random intercepts \(\boldsymbol{\alpha}_{(O)ki}\). Note that we chose the values of the parameters of the hyper-prior distributions that result in uninformative hyper-prior distributions, as such the values of the parameters of the hyper-priors are assumed invariant over the states \(i\).
Similar to the state-dependent probabilities \(\boldsymbol{\theta}_{ki}\), we extend each set of state \(i\) specific state transition probabilities \(\gamma_{kij}\) to a MNL model to allow for the inclusion of random effects and (time invariant) covariates. Hence, we use a linear predictor function to estimate the probability to transition from behavioral state \(i\) to state \(j\). The linear predictor function consists of \(m-1\) random intercepts to allow for heterogeneity between subjects in their probabilities to switch between states. That is, within row \(i\) of the transition probability matrix \(\boldsymbol{\Gamma_k}\), each state \(j\) has its own intercept, where the intercept that relates to transitioning to the first state in the set is omitted for reasons of model identification (i.e., not all probabilities can be estimated freely as the row-probabilities need to add up to 1). Hence, each subject’s probability to transition from behavioral state \(i \in\{1, 2, \ldots, m \}\) to state \(j \in \{1, 2, \ldots, m \}\) is modeled using \(m\) batches of \(m-1\) random intercepts, \(\boldsymbol{\alpha}_{(S)ki} = (\alpha_{(S)k13}, \ldots, \alpha_{(S)k1m}, \alpha_{(S)k23}, \ldots, \alpha_{(S)k2m},\) \(\ldots, \alpha_{(S)km2},\) \(\dots,\) \(\alpha_{(S)km(m-1)})\). That is,
\[\begin{equation} \gamma_{kij} = \frac{\text{exp}(\alpha_{(S)kij})}{1 + \sum_{\bar{j} \in Z} \text{exp}(\alpha_{(S)ki\bar{j}})}, \end{equation}\] \[\begin{equation} \text{where} \ Z \in \{2, \ldots, m\} \nonumber \end{equation}\]
where \(K\) and \(m\) are again the number of subjects in the dataset, and the distinct number of states, respectively. The numerator is set equal to 1 for \(j = 1\), making the first state of every row of the transition probability matrix \(\boldsymbol{\Gamma}_k\) the baseline category.
At the group level, these subject-level intercepts are (possibly) partly determined by covariates that differentiate between subjects. Thus, in addition to the subject level random intercepts \(\boldsymbol{\alpha}_{(S)ki}\), we have \(m\) matrices of \(p * (q-1)\) fixed regression coefficients, \(\boldsymbol{\beta}_{(S)i}\), where \(p\) denotes the number of used covariates. The columns of \(\boldsymbol{\beta}_{(S)i}\) are \(\boldsymbol{\beta}_{(S)ij} = (\beta_{(S)ij1}, \beta_{(S)ij2}, \ldots, \beta_{(S)ijp})\) to model the random intercepts denoting the probability of transitioning from behavioral state \(i\) to state \(j\) given \(p\) covariates. Combining both terms, each batch of random intercepts \(\boldsymbol{\alpha}_{(S)ki}\) come from a state \(i\) specific population level multivariate Normal distribution, with mean vector \(\boldsymbol{\bar{\alpha}}_{(S)i} + \boldsymbol{X_{k}^\top} \boldsymbol{\beta_{(S)ij}}\) that has length \(q-1\), and covariance \(\Psi_i{i}\) that denotes the covariance between the \(q-1\) state \(i\) specific intercepts over subjects, and models the dependency between the probabilities of states within random intercept batch \(\boldsymbol{\alpha}_{(S)ki}\) (i.e., we specify a state specific multivariate Normal prior distribution on the subject-specific \(\boldsymbol{\alpha}_{(S)ki}\) parameters). A convenient hyper-prior on the hyper-parameters of the group level prior distribution is a multivariate Normal distribution for the mean vector \(\boldsymbol{\bar{\alpha}}_{(S)i}\) and the fixed regression coefficients \(\boldsymbol{\beta_{(S)il}}\) and an Inverse Wishart distribution for the covariance \(\Psi_i\). That is,
\[\begin{align} S_{k, t = 2, \ldots, T} \: \sim \boldsymbol{\Gamma}_{k, S_{k, t-1}} \quad &\text{with} \quad \boldsymbol{\Gamma}_{k, i} \: \sim \text{MNL}\big(\boldsymbol{\alpha}_{(S)ki}\big), \label{tpmPriorHier}\\ \boldsymbol{\alpha}_{(S)ki} \: \sim \text{N}\big(\boldsymbol{\bar{\alpha}}_{(S)i} + \boldsymbol{X_{k}^\top} \boldsymbol{\beta_{(S)il}}, \Psi_i\big) \quad &\text{with} \quad \boldsymbol{\bar{\alpha}}_{(S)i} \: \sim \text{N}\big(\boldsymbol{\alpha}_{(S)0}, \tfrac{1}{K_0}\Psi_i \big), \label{popTmpHier} \\ &\text{and} \quad \boldsymbol{\beta}_{(S)i} \: \sim \text{N}\big(\boldsymbol{\beta}_{(S)0}, \tfrac{1}{K_0}\Psi_{i} \big), \\ &\text{and} \quad \: \Psi_i \: \sim \text{IW}\big(\Psi_0, df_0 \big), \nonumber \end{align}\]
where the subject-specific probability distribution of the current state \(S_{kt}\) is given by the row of the transition probability matrix \(\mathbf{\Gamma}_k\) corresponding to the previous state in the hidden state sequence \(S_{k, t-1}\). The probability distribution of \(S_{kt}\) given by \(\mathbf{\Gamma}_k\) holds for states after the first time point, i.e., \(t\) starts at 2 as there is no previous state in the hidden state sequence for state \(S_{kt}\) at \(t\) = 1. The probability of the first state in the hidden state sequence \(S_{k,1}\) is given by the initial probabilities of the states \(\pi_{k,j}\). The parameters \(\boldsymbol{\alpha}_{(S)0}\), \(\boldsymbol{\beta}_{(S)0}\), and \(K_0\) denote the values of the parameters of the hyper-prior on the group (mean) vector \(\boldsymbol{\bar{\alpha}}_{(S)i}\) and \(\boldsymbol{\beta}_{(S)i}\), respectively. Here, \(\boldsymbol{\alpha}_{(S)0}\) and \(\boldsymbol{\beta}_{(S)0}\) represent a vector of means and \(K_0\) denotes the number of observations (i.e., the number of hypothetical prior subjects) on which the prior mean vector \(\boldsymbol{\alpha}_{(S)0}\) and \(\boldsymbol{\beta}_{(S)0}\) are based. The parameters \(\Psi_0\) and \(df_0\), respectively, denote values of the covariance and the degrees of freedom of the hyper-prior Inverse Wishart distribution on the group variance \(\Psi_i\) of the subject-specific random intercepts \(\boldsymbol{\alpha}_{(S)ki}\).
Given the above distributions, our goal is to construct the joint posterior distribution of the parameters - i.e., the subject-specific hidden state sequences, the subject level (i.e., subject-specific) parameters and the group level parameter estimates - given the observations (i.e., the observed event sequences for all \(k\) subjects that are analyzed simultaneously as one group, and the hyper-prior parameter values)
\[\begin{align} & \Pr\big(S_{kt}, \mathbf{\Gamma}_{ki}, \boldsymbol{\alpha}_{(S)ki}, \boldsymbol{\bar{\alpha}}_{(S)i}, \boldsymbol{\beta}_{(S)i}, \Psi_i, \boldsymbol{\theta}_{ki}, \boldsymbol{\alpha}_{(O)ki}, \boldsymbol{\bar{\alpha}}_{(O)i},\boldsymbol{\beta}_{(O)ki}, \Phi_{i} \mid O_{kt}, \boldsymbol{X}\big) \nonumber \\ & \quad \propto \Pr\big(O_{kt} \mid S_{kt}, \boldsymbol{\theta}_{ki}\big) \Pr\big(S_{kt} \mid \mathbf{\Gamma}_{ki}\big) \Pr\big( \boldsymbol{\theta}_{i} \mid \boldsymbol{\alpha}_{(O)ki} \big) \Pr\big(\mathbf{\Gamma}_i \mid \boldsymbol{\alpha}_{(S)ki} \big) \Pr\big( \boldsymbol{\alpha}_{(O)ki} \mid \boldsymbol{\bar{\alpha}}_{(O)i}, \boldsymbol{X}, \boldsymbol{\beta}_{(O)i}, \Phi_{i}\big) \nonumber \\ & \quad \quad \quad \quad \Pr\big(\boldsymbol{\alpha}_{(S)ki} \mid \boldsymbol{\bar{\alpha}}_{(S)i}, \boldsymbol{X}, \boldsymbol{\beta}_{(S)i}, \Psi_i \big) \Pr\big(\boldsymbol{\bar{\alpha}}_{(O)i} \mid \boldsymbol{\alpha}_{(O)0}, K_0, \Phi_i \big) \Pr\big(\boldsymbol{{\beta}}_{(O)i} \mid \boldsymbol{\beta}_{(O)0}, K_0, \Phi_i \big) \Pr\big(\Phi_{i} \mid \Phi_0, df_0 \big) \nonumber \\ & \quad \quad \quad \quad \quad \quad \Pr\big(\boldsymbol{\bar{\alpha}}_{(S)i} \mid \boldsymbol{\alpha}_{(S)0}, K_0, \Psi_{i} \big) \Pr\big(\boldsymbol{{\beta}}_{(S)i} \mid \boldsymbol{\beta}_{(S)0}, K_0, \Psi_{i} \big) \Pr\big(\Psi_i \mid \Psi_0, df_0 \big) \nonumber \\ \end{align}\]
by drawing samples from the posterior distribution. We follow a MCMC sampler algorithm to iteratively sample from the appropriate conditional posterior distributions of \(\boldsymbol{\alpha}_{(O)ki}\), \(\boldsymbol{\alpha}_{(S)ki}\), \(\boldsymbol{\bar{\alpha}}_{(O)i}\), \(\boldsymbol{{\beta}}_{(O)i}\), \(\Phi_{i}\), \(\boldsymbol{\bar{\alpha}}_{(S)i}\), \(\boldsymbol{{\beta}}_{(S)i}\), and \(\Psi_i\) given the remaining parameters in the model (see below). The conditional posterior distributions of all parameters are provided in the Section “”.
In Bayesian estimation, it is preferable to use the natural conjugate prior as prior distribution, as this conveniently results in a closed form expression of the (conditional) posterior distribution(s), making Gibbs sampling possible. However, as the non-conjugate Normal prior provides a much more intuitive interpretation of the prior group level distribution compared to using the natural conjugate prior of the MNL model, and since the asymptotic Normal approximation is excellent for the MNL likelihood (Rossi, Allenby, and McCulloch 2012), we opt for the former and do not use the conjugate prior of the MNL model. Therefore, we cannot use a Gibbs sampler to update the parameters of the subject-specific state-dependent distributions and the subject-specific transition probabilities, \(\boldsymbol{\alpha}_{(O)ki}\) and \(\boldsymbol{\alpha}_{(S)ki}\), respectively. Instead, we use a combination of the Gibbs sampler and the Metropolis algorithm, i.e., a Hybrid Metropolis within Gibbs sampler. That is, we use a Metropolis sampler to update \(\boldsymbol{\alpha}_{(O)ki}\) and \(\boldsymbol{\alpha}_{(S)ki}\), and we use a Gibbs sampler to update all other model parameters. There are various types of Metropolis algorithms, and each type involves specific choices. Simulation studies showed that, in line with Rossi, Allenby, and McCulloch (2012), the Random Walk (RW) Metropolis sampler outperformed the Independence Metropolis sampler in terms of efficiency for estimating the parameters of the (multilevel) HMM, we chose to use the RW Metropolis sampler to update the parameters of the subject-specific state-dependent distributions (\(\boldsymbol{\alpha}_{(O)ki}\)) and subject-specific state transition probabilities (\(\boldsymbol{\alpha}_{(S)ki}\)) in our Hybrid Metropolis within Gibbs sampler.
The Hybrid Metropolis within Gibbs sampler for the multilevel HMM proceeds in a similar fashion as the Gibbs sampler for the HMM: first the hidden state sequences are sampled (for each subject separately), after which the (subject level and group level) parameters are sampled given the observed event sequence (for each subject, \(O_{kt}\)), the sampled hidden state sequences (of each subject, \(S_{kt}\)), and the current values of the remaining parameters in the model. We provide a stepwise walkthrough of the hybrid Metropolis within Gibbs sampler for the multilevel HMM below.
The Hybrid Metropolis within Gibbs sampler used to fit the multilevel HMM proceeds as described below. We use the subscript \(c\) to denote the current (i.e., updated using a combination of the value of the hyper-prior and the data) parameters of the conditional posterior distributions.
These steps are repeated for a large number of iterations, and, after discarding the first iterations as a “burn-in” period, the sampled parameter estimates provide the empirical posterior distribution of the model parameters.
Regarding the acceptance rate of the RW Metropolis sampler for the subject-specific sets of intercepts \(\boldsymbol{\alpha}_{(O)ki}\) and \(\boldsymbol{\alpha}_{(S)ki}\) (i.e., related to the subject-specific state-dependent probabilities \(\boldsymbol{\theta}_{ki}\) and the subject-specific state transition probability matrix \(\boldsymbol{\Gamma}_k\), respectively), an acceptance rate of \(\sim23\%\) is considered optimal when many parameters are being updated at once (Gelman et al. 2014). Within the R package mHMMbayes, the number of accepted draws of a model are stored in emiss_naccept and gamma_naccept for the conditional distributions and the transition probabilities, respectively.
To obtain the scale parameter \(\Sigma_{\alpha(O)ki}\) and \(\Sigma_{\alpha(S) ki}\) of the proposal
distributions of the RW Metropolis sampler for \(\boldsymbol{\alpha}_{(O)ki}\) and \(\boldsymbol{\alpha}_{(S)ki}\),
respectively, we followed the method outlined by Rossi, Allenby, and McCulloch (2012), which has
several advantages as discussed below.
The general challenge of the RW Metropolis sampler is that it has to be
“tuned” by choosing the scale of the symmetric proposal distribution
(e.g., the variance or covariance of a Normal or multivariate Normal
proposal distribution, respectively). The scale of the proposal
distribution is composed of a covariance matrix \(\Sigma\), which is then tuned by
multiplying it by a scaling factor \(s^2\). Hence we denote the scale of the
proposal distribution by \(s^2\Sigma\).
The scale \(s^2\Sigma\) has to be set
such that the drawn parameter estimates cover the entire area of the
posterior distribution (i.e., the scale \(\Sigma\) should not be set too narrow
because then only candidate parameters in close proximity of the current
parameter will be drawn), but remains reasonably efficient (i.e., the
scale \(\Sigma\) should not be set too
wide because then many candidate parameters will be rejected resulting
in a slowly progressing chain of drawn parameter estimates).
There are various options for the covariance matrix \(\Sigma\). Often, the covariance matrix
\(\Sigma\) is set such that it
resembles the covariance matrix of the actual posterior distribution. To
capture the curvature of each subject’s conditional posterior
distribution, the scale of the RW Metropolis proposal distribution
should be customized to each subject. This also facilitates the
possibility to let the amount of information available within the data
of a subject for a parameter determine to which degree the group level
distribution dominates the estimation of the subject-specific
parameters. Hence, to approximate the conditional posterior distribution
of each subject, the covariance matrix is set to be a combination of the
covariance matrix obtained from the subject data and the group level
covariance matrix \(\Phi_{i}\) or \(\Psi_i\). To estimate the covariance matrix
from the subject data, which is only used for the proposal distribution
of the RW Metropolis sampler, we simply use a Maximum Likelihood
Estimate (MLE), as this quantity is only used for the purpose of scaling
the proposal distribution and is not part of the estimated parameter
values that constitute the posterior distribution. The MLE estimate of
the covariance matrix is obtained by maximizing the likelihood of the
Multinomial Logit (MNL) model on the data, and retrieving the Hessian
matrix \(H_{ki}\) (i.e., the second
order partial derivatives of the likelihood function with respect to the
parameters). The covariance matrix of the parameters is the inverse of
the Hessian matrix, \(H_{ki}^{-1}\).
Hence, the covariance matrices for \(\boldsymbol{\alpha}_{(O)ki}\) and \(\boldsymbol{\alpha}_{(S)ki}\), are defined
by \(\Sigma_{\boldsymbol{\alpha}_{(O)ki}} =
(H_{\alpha(O) ki} + \Phi_{i}^{-1})^{-1}\) and \(\Sigma_{\boldsymbol{\alpha}_{(S)ki}} =
(H_{\alpha(S) ki} + \Psi_i^{-1})^{-1}\), respectively. For \(\boldsymbol{\alpha}_{(O)ki}\), the data on
which the Hessian is obtained is the frequency with which a categorical
outcome is observed in state \(i\) of
subject \(k\). For \(\boldsymbol{\alpha}_{(S)ki}\), this data is
the frequency with which state \(i\)
transitions to another state within subject \(k\). Hence, the subject-specific covariance
matrix (i.e., the inverse of the Hessian matrix) is based on the sampled
hidden state sequence. Therefore, the MLE estimates of the
subject-specific covariance matrices that are used for the RW Metropolis
proposal distributions have to be obtained in each iteration, as the
sampled hidden state sequence changes in each iteration.
A potential problem with maximizing the log-likelihood of each subject’s
data, is that a certain state might not be sampled for a subject. To
circumvent this problem, we modify the subject likelihood function by
adding a so-called regularizing likelihood function that has a defined
maximum to the subject-level likelihood function. We maximize the
resulting pooled likelihood function in order to obtain the MLE
estimates. Here, we use the likelihood function of the combined data
over all subjects that are considered to be part of one group as the
regularizing likelihood function. The pooled likelihood function is
scaled by \(1-w \times \text{subject-level
likelihood} + w \times \text{overall likelihood}^{n.obs_k /
N.obs}\), so that the overall likelihood function does not
dominate the subject-level likelihood function, and where \(n.obs_k\) is the number of data
observations for subject \(k\) and
\(N.obs\) is the total number of data
observations over all subjects in a group.
Now that we defined the covariance matrix \(\Sigma\) for the scale of the RW Metropolis
sampler proposal distribution, we have to define the scalar factor \(s^2\) to obtain the scale \(s^2\Sigma\) of the proposal distribution.
As in Rossi, Allenby, and McCulloch
(2012)}, we adopt the scaling proposal of Roberts, Rosenthal, et al. (2001), and set
scaling to \(s^2 = (2.93 /
\sqrt{\text{n.param}})^2\), where \(n.param\) is the number of parameters to be
estimated for \(\boldsymbol{\alpha}_{(S)ki}\) or \(\boldsymbol{\alpha}_{(O)ki}\) in the RW
Metropolis sampler, which equals \(m -
1\) in case of \(\boldsymbol{\alpha}_{(S)ki}\) (where \(m\) denotes the number of states) and \(q - 1\) in case of \(\boldsymbol{\alpha}_{(O)ki}\) (where \(q\) denotes the number of categorical
outcomes).
In summary, the scale parameter \(s^2\Sigma_{\alpha(O) ki}\) and \(s^2\Sigma_{\alpha(S) ki}\) of the proposal
distributions of the RW Metropolis sampler for \(\boldsymbol{\alpha}_{(O)ki}\) and \(\boldsymbol{\alpha}_{(S)ki}\) are defined
as: \[\begin{align}
s^2\Sigma_{\alpha(O) ki} = (2.93 / \sqrt{q-1})^2 & \: \times \:
(H_{\alpha(O) ki} + \Phi_{i}^{-1})^{-1}, \ \text{and} \\
s^2\Sigma_{\alpha(S) ki} = (2.93 / \sqrt{m-1})^2 & \: \times \:
(H_{\alpha(S) ki} + \Psi_i^{-1})^{-1} ,
\end{align}\] where \(H_{\alpha(O)
ki}\) is the Hessian of the \(k^{th}\) subject’s data of the frequency
with which a categorical outcome is observed within state \(i\) evaluated at the MLE of the pooled
likelihood, \(H_{\alpha(S) ki}\) is the
Hessian of the \(k^{th}\) subject’s
data of the frequency with which state \(i\) transitions to another state evaluated
at the MLE of the pooled likelihood, and \(\Phi_{i}^{-1}\) and \(\Psi_i^{-1}\) are the inverses of the group
level covariance matrices. This provides us with \(m\) pairs of scale parameters that closely
resemble the scale of the subject-level conditional posterior
distribution, and that 1) are automatically tuned (i.e., we do not
require experimentation to determine \(s^2\) to tune the covariance matrix), 2)
allow the amount of information available within the data of a specific
subject to determine the degree to which the group level distribution
dominates the estimation of that subject’s level parameters, and 3) do
not require each state to be sampled in the hidden state sequence as not
each subject-level likelihood is required to have a maximum.
In the hybrid Metropolis within Gibbs sampler, all level 2 model parameters are directly sampled from their full conditional posterior distributions. The full conditional posterior distributions are obtained by applying Bayes theorem, combining the (hyper-)prior distribution of the model parameter and the likelihood function. Direct sampling from the conditional posterior distributions for these model parameters is possible, as the choice of the (hyper-)prior distribution results in a closed form expression of the full conditional posterior distribution. That is:
For the random intercepts \(\boldsymbol{\alpha}_{(O)ki}\) and \(\boldsymbol{\alpha}_{(S)ki}\), related to
the subject-specific state-dependent probabilities of observing a
categorical outcome \(\boldsymbol{\theta}_{ki}\) and the
subject-specific state transition probability matrix \(\boldsymbol{\Gamma}_{k}\), respectively,
the choice of prior distributions does not result in closed form
expressions of the full conditional posterior distributions. That is,
for the subject-specific sets of intercepts \(\boldsymbol{\alpha}_{(O)ki}\) related to
the subject-specific state-dependent probabilities of observing a
categorical outcome within state \(i\),
the full conditional posterior distribution when we assess a standard
multivariate normal prior is: \[\begin{align}
Pr(\boldsymbol{\alpha}_{(O)ki} \mid ) \: & \propto
L\big(\boldsymbol{\alpha}_{(O)ki} \mid O_{kt}, S_{kt} = i\big)
\Pr\big(\boldsymbol{\alpha}_{(O)ki} \mid
\boldsymbol{\bar{\alpha}}_{(O)i}, \boldsymbol{\beta_{(O)i}},
\Phi_{i}\big),\\
\Pr\big(\boldsymbol{\alpha}_{(O)ki} \mid
\boldsymbol{\bar{\alpha}}_{(O)i}, \boldsymbol{\beta_{(O)i}},
\Phi_{i}\big) \: & \sim \text{N}( \boldsymbol{\bar{\alpha}}_{(O)i} +
\boldsymbol{X^\top} \boldsymbol{\beta_{(O)i}}, \Phi_{i}\big)), \nonumber
\end{align}\] and the likelihood is the product of the
probabilities of the observed outcomes \(O_{kt} = l \in \{1, 2, \ldots, q\}\) within
sampled states \(S = i\) in subject
\(k\) over time points \(t\): \[\begin{align}
L\big(\boldsymbol{\alpha}_{(O)ki} \mid O_{kt}, S_{kt} = i\big) \:
& = \prod_{{t}} Pr( O_{k,{t}} = l \mid S_{kt} = i,
\boldsymbol{\alpha}_{(O)ki}), \\
Pr( O_{k,{t}} = l \mid S_{kt} = i, \boldsymbol{\alpha}_{(O)ki}) \:
& = \frac{\text{exp}(\alpha_{(O)kil})}{1 + \sum_{\bar{l} = 2}^q
\text{exp}(\alpha_{(O)ki\bar{l}})}, \nonumber
\end{align}\] where the product is restricted to the set of
time points that coincide with the sampled state \(S\) for subject \(k\) at time point \(t\) being \(i\), and \(q\) is the number of categorical outcomes.
The numerator is set equal to one for \(l =
1\), making the first categorical outcome in the set the baseline
category in every state.
For the subject-specific sets of intercepts \(\boldsymbol{\alpha}_{(S)ki}\) related to
the state-transition probabilities to transition from state \(i\) to any of the other states \(j \in \{1, 2, \ldots,\) \(m\}\), the full conditional posterior
distribution when we assess a standard multivariate normal prior is:
\[\begin{align}
Pr(\boldsymbol{\alpha}_{(S)ki} \mid ) \: & \propto
L\big(\boldsymbol{\alpha}_{(S)ki} \mid S_{kt}, S_{k(t-1)} = i \big)
\Pr\big(\boldsymbol{\alpha}_{(S)ki} \mid
\boldsymbol{\bar{\alpha}}_{(S)i}, \boldsymbol{\beta_{(S)i}},
\Psi_i\big),\\
\Pr\big(\boldsymbol{\alpha}_{(S)ki} \mid
\boldsymbol{\bar{\alpha}}_{(S)i}, \boldsymbol{\beta_{(S)i}},
\Psi_{i}\big) \: & \sim \text{N}(\boldsymbol{\bar{\alpha}}_{(S)i} +
\boldsymbol{X^\top} \boldsymbol{\beta_{(S)i}}, \Psi_{i}), \nonumber
\end{align}\] and the likelihood is the product of the
probabilities of the observed transitions from state \(i\) in the previous time point \(t-1\) to any of the other states \(S_{kt} = j\) over time points \(t\) in subject \(k\):
\[\begin{align}
L\big(\boldsymbol{\alpha}_{(S)ki} \mid S_{kt}, S_{k(t-1)} = i \big) \:
& = \prod_{{n}} Pr( S_{k,{t}} = j \mid S_{k(t-1)} = i,
\boldsymbol{\alpha}_{(S)ki}), \\
Pr( S_{k,{t}} = j \mid S_{k(t-1)} = i, \boldsymbol{\alpha}_{(S)ki}) \:
& = \frac{\text{exp}(\alpha_{(S)kij})}{1 + \sum_{\bar{j} \in Z}
\text{exp}(\alpha_{(S)k i\bar{j}})}, \nonumber
\end{align}\] \[\begin{equation}
\text{where} \ Z \in \{1, 2, \ldots, m, \ Z \neq 1 \} \nonumber
\end{equation}\] where the product is restricted to the set of
time points that coincide with the sampled state \(S\) in the previous time point \(t-1\) being \(i\) for subject \(k\), and \(m\) is the number of states. The numerator
is set equal to 1 for \(j = 1\), making
the first state of every row of the transition probability matrix \(\boldsymbol{\Gamma}_k\) the baseline
category.
As the conditional posterior distributions for \(\boldsymbol{\alpha}_{(O)ki}\) and \(\boldsymbol{\alpha}_{(S)ki}\) do not result
in a closed form expression of a know distribution, we cannot directly
sample values of \(\boldsymbol{\alpha}_{(O)ki}\) and \(\boldsymbol{\alpha}_{(S)ki}\) from their
conditional posterior distributions with pre-defined equations on how to
obtain the current (i.e., updated using a combination of the value of
the hyper-prior and the data) parameters of the conditional posterior
distributions. Instead, new values for \(\boldsymbol{\alpha}_{(O)ki}\) and \(\boldsymbol{\alpha}_{(S)ki}\) are sampled
using a RW Metropolis sampler, as described above.