Our understanding of dynamical biological systems such as developmental processes or transitions into disease states is limited by our ability to reverse-engineer these systems using available data. Most medium- and high-throughput experimental protocols (e.g. single cell RNA-seq) are destructive in nature, generating cross-sectional time series in which it is not possible to track the progress of one cell through the system. In typical systems, individual cells progress at different rates and a sample’s experimental capture time will not accurately reflect how far it has progressed. In these cases, cross-sectional data can appear particularly noisy.
We propose a probabilistic model that uses smoothness assumptions to estimate and correct for this effect. Each cell is assigned a pseudotime that represents its progress through the system. These pseudotimes are related to but not determined by the cells’ experimental capture times. Replacing capture times with pseudotimes gives us a more representative view of the underlying system, improving downstream analyses.
We model the smoothness assumption on each gene’s expression profile \(y_g\) using a Gaussian process over pseudotime. Gene-specific parameters in the covariance function represent intrinsic measurement noise \(\omega_g\) and variation of the profile over time \(\psi_g\). Each sample’s pseudotime \(\tau_c\) is given a normal prior centred on its capture time. \[latex \begin{aligned} y_{g} &\sim \mathcal{GP}(\phi_g, \Sigma_g) \\ \Sigma_g(\tau_1, \tau_2) &= \psi_g \Sigma_\tau(\tau_1, \tau_2) + \omega_g \delta_{\tau_1,\tau_2} \\ \log \psi_g &\sim \mathcal{N}(\mu_\psi, \sigma_\psi) \\ \log \omega_g &\sim \mathcal{N}(\mu_\omega, \sigma_\omega) \\ \Sigma_\tau(\tau_1, \tau_2) &= \textrm{Matern}_{3/2}\bigg(r=\frac{|\tau_1 - \tau_2|}{l}\bigg) = (1 + \sqrt{3}r) \exp[-\sqrt{3}r] \\ \tau_c &\sim \mathcal{N}(k_c, \sigma_\tau) \end{aligned} \] This model is effectively a one-dimensional Gaussian process latent variable model with a structured prior on the latent variable (pseudotime).
Guo et al. assayed single cell expression values at 7 time points in mouse embryonic cells and the data is contained in the DeLorean package. We will load this data and analyse a subset of it corresponding to the epiblast lineage. First we must build a de.lorean object with the correct data. Load the data.
library(DeLorean)
library(dplyr)
data(GuoDeLorean)
# Limit number of cores to 2 for CRAN
options(DL.num.cores=min(default.num.cores(), 2))Create a de.lorean object from the full data set.
dl <- de.lorean(guo.expr, guo.gene.meta, guo.cell.meta)Estimate hyperparameters for the model from the whole data set. Here we set the width of the normal prior on the pseudotimes to be 0.5.
dl <- estimate.hyper(
    dl,
    sigma.tau=0.5,
    length.scale=1.5,
    model.name='simple-model')DeLorean also offers slight variations of the model, we could use model.name=lowrank or model.name=simple-model. See the documentation for estimate.hyper for more details.
Choose a few cells from each capture point from the epiblast lineage.
num.at.each.stage <- 5
epi.sampled.cells <- guo.cell.meta %>%
    filter(capture < "32C" |
           "EPI" == cell.type |
           "ICM" == cell.type) %>%
    group_by(capture) %>%
    do(sample_n(., num.at.each.stage))
dl <- filter.cells(dl, cells=epi.sampled.cells$cell)We only have data for a few genes and can easily model them all which is typical for qPCR data. RNA-seq data often has far too many genes for the model to fit. In any case most are probably irrelevant. In these cases we recommend an analysis of variance across the capture times to choose those genes whose means vary most across time. These are most likely to be relevant for fitting the model.
dl <- aov.dl(dl)The most temporally variable genes (by p-value) are at the head of the result:
head(dl$aov)## Source: local data frame [6 x 7]
## 
##     gene    term    df    sumsq    meansq statistic      p.value
##   (fctr)   (chr) (dbl)    (dbl)     (dbl)     (dbl)        (dbl)
## 1   Bmp4 capture     6 321.5776  53.59627  64.81288 4.009178e-15
## 2 Tspan8 capture     6 274.3294  45.72156  59.26683 1.269996e-14
## 3  DppaI capture     6 500.6138  83.43564  49.58779 1.231703e-13
## 4   Actb capture     6  42.0375   7.00625  47.18512 2.303513e-13
## 5  Grhl1 capture     6 777.4955 129.58258  40.69546 1.458516e-12
## 6  Sall4 capture     6 124.9205  20.82008  39.48431 2.117974e-12The least temporally variable genes (by p-value) are at the tail of the result:
tail(dl$aov)## Source: local data frame [6 x 7]
## 
##      gene    term    df    sumsq    meansq statistic    p.value
##    (fctr)   (chr) (dbl)    (dbl)     (dbl)     (dbl)      (dbl)
## 1   Esrrb capture     6 57.25519  9.542531  3.294826 0.01400297
## 2   Tcf23 capture     6 98.87686 16.479476  2.243248 0.06821476
## 3  Atp12a capture     6 32.66503  5.444171  2.186209 0.07449330
## 4   Eomes capture     6 49.19526  8.199210  2.150825 0.07867960
## 5   Sox17 capture     6 55.95551  9.325919  1.881317 0.11939631
## 6 Tcfap2a capture     6 39.57599  6.595999  1.242028 0.31536827and for instance you could run the model on the 20 most variable genes by executing
dl <- filter.genes(dl, genes=head(dl$aov, 20)$gene)otherwise do not call filter.genes and DeLorean will use all the genes.
Now we have the data we can fit our model using Stan’s ADVI variational Bayes algorithm. To run the No-U-Turn sampler use method='sample'.
dl <- fit.dl(dl, method='vb')If running a sampler, Stan provides \(\hat{R}\) statistics that can aid detecting convergence problems. This makes no sense for ADVI but we show how to produce the boxplots here for users of the samplers.
dl <- examine.convergence(dl)
plot(dl, type='Rhat')Plot the pseudotimes from the best sample (best in the sense of highest likelihood). The prior means for the capture points are shown as dashed lines.
plot(dl, type='pseudotime')Plot the expression data over the pseudotimes from the best sample.
dl <- make.predictions(dl)
plot(dl, type='profiles')