Department of Biostatistics, Fielding School of Public Health,
University of California, Los Angeles, CA, USA
Department of Computational Biomedicine,
Cedars-Sinai Medical Center, Los Angeles, CA, USA
TemporalForest is an R package for reproducible feature
selection in high-dimensional longitudinal data. Such data—where
multiple subjects are measured repeatedly over time—pose challenges
including strong predictor correlations, temporal dependence within
subjects, and an extremely high predictor-to-sample ratio. The
TemporalForest algorithm addresses these by combining network-based
dimensionality reduction (WGCNA/TOM), mixed-effects model trees that
respect within-subject correlation, and stability selection for
reproducibility. Together, these components provide users with an
end-to-end framework for identifying stable and interpretable predictors
in longitudinal omics or other time-resolved studies.
The algorithm is a sequential pipeline designed to filter features based on their temporal stability and predictive relevance.
This stage reduces dimensionality by grouping predictors into modules whose correlation structures are stable across time. It begins by constructing a time-specific Topological Overlap Matrix (TOM) for each time point, a robust measure of network similarity from WGCNA (Langfelder and Horvath 2008). To enforce temporal persistence, a consensus TOM is created by taking the element-wise minimum across all time points. This ensures that only connections that are strong across all time points are preserved. Hierarchical clustering is then applied to this consensus matrix to identify robust modules of co-expressed features.
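The consensus step can be sketched directly in R: given per-time TOM similarity matrices, the element-wise minimum is just pmin(), and modules come from hierarchical clustering of the resulting dissimilarity. This is an illustration of the idea only; the package builds the per-time TOMs with WGCNA internally.

```r
# Sketch of the consensus-TOM idea (illustration only)
set.seed(1)
p <- 5
# Two toy symmetric "TOM" similarity matrices, one per time point
tom_t1 <- matrix(runif(p * p), p, p); tom_t1 <- (tom_t1 + t(tom_t1)) / 2; diag(tom_t1) <- 1
tom_t2 <- matrix(runif(p * p), p, p); tom_t2 <- (tom_t2 + t(tom_t2)) / 2; diag(tom_t2) <- 1

# Element-wise minimum: an edge is only as strong as its weakest time point
consensus_tom <- pmin(tom_t1, tom_t2)

# Hierarchical clustering on the corresponding dissimilarity yields modules
modules <- cutree(hclust(as.dist(1 - consensus_tom), method = "average"), k = 2)
```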
This stage screens predictors within each temporally stable module. The base learner is a linear mixed-effects model tree (LMER-tree) (Fokkema et al. 2018). This approach is critical as it explicitly models the longitudinal data structure using random effects (e.g., random intercepts and slopes per subject). The tree then uses an unbiased splitting rule based on parameter instability tests to select the most important predictor in the module, avoiding the selection biases common in traditional random forests.
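To make the base learner concrete, here is a minimal standalone call to glmertree::lmertree() on simulated data (this is not TemporalForest's internal invocation, just an illustration of the three-part formula: node-level regressors, random-effects grouping, and split candidates).

```r
library(glmertree)  # provides the LMER-tree base learner

set.seed(2)
df <- data.frame(
  id   = factor(rep(1:40, each = 3)),   # subject identifier
  time = rep(1:3, times = 40),
  V1   = rnorm(120), V2 = rnorm(120), V3 = rnorm(120)
)
# Outcome: a V1-driven split effect + subject random intercept + noise
df$y <- 2 * df$V1 * (df$V1 > 0) +
  rnorm(40, 0, 0.5)[as.integer(df$id)] + rnorm(120, 0, 0.3)

# Formula layout: response ~ node-level regressors | grouping factor | split candidates
fit <- lmertree(y ~ time | id | V1 + V2 + V3, data = df)
# plot(fit)  # inspect which variable the unbiased instability test chose
```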
To ensure the final results are reproducible, the screening process is embedded in a stability selection framework (Meinshausen and Bühlmann 2010). The data are repeatedly resampled (bootstrapped), and the screening process is run on each sample. For each feature, the algorithm calculates its selection probability—the proportion of bootstrap samples in which it was selected. Only features with a selection probability above a user-defined threshold are included in the final set, which provides statistical control over the number of false discoveries.
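The core bookkeeping of stability selection is simple to sketch. The snippet below fakes per-bootstrap selections to show how selection probabilities and the thresholded stable set are computed; the package does this internally with LMER-tree screening on each resample.

```r
# Sketch of the stability-selection bookkeeping (simplified)
set.seed(3)
features <- paste0("V", 1:10)
n_boot <- 50
# Pretend each bootstrap run returns the features it selected
boot_selections <- replicate(n_boot, sample(features, size = 3), simplify = FALSE)

# Selection probability = proportion of bootstrap samples selecting each feature
sel_prob <- sapply(features, function(f)
  mean(vapply(boot_selections, function(s) f %in% s, logical(1))))

# Keep only features whose selection probability exceeds a threshold
threshold <- 0.6
stable_set <- names(sel_prob)[sel_prob >= threshold]
```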
The main inputs must satisfy the following contract:

- X is a list of length T; each element is an n × p numeric matrix.
- Y, id, and time must follow a subject-major × time-minor order.
- Observations with missing Y/id/time are dropped with a message; X undergoes column-level consistency checks.
- Non-Gaussian outcomes (via glmertree) are planned but not yet enabled.
- set.seed() affects bootstrap resampling and tree partitioning; TOM and module construction are deterministic given the data.

The temporal_forest() function includes an internal
check to ensure your list of predictor matrices X is
formatted correctly. This check runs automatically. Here is a
demonstration of what happens with both a valid and an invalid input, by
calling the internal helper check_temporal_consistency()
directly.
First, let’s create a valid X where both matrices have
identical column names. The function will run silently and pass without
any issues.
# Create two matrices with matching column names
mat1 <- matrix(rnorm(20), nrow = 10, dimnames = list(NULL, c("V1", "V2")))
mat2 <- matrix(rnorm(20), nrow = 10, dimnames = list(NULL, c("V1", "V2")))
good_X <- list(mat1, mat2)
# The check passes silently in the background
check_temporal_consistency(good_X)
cat("Input 'good_X' has the correct format and passed the consistency check.")
#> Input 'good_X' has the correct format and passed the consistency check.
Now, let’s create an invalid X where the column names do
not match. The helper function will automatically catch this and stop
with a clear, informative error message.
Note: The error=TRUE in the code chunk
header below is a special command that allows the vignette to show the
error message without halting the build process.
# Create two matrices with mismatched column names
mat1 <- matrix(rnorm(20), nrow = 10, dimnames = list(NULL, c("V1", "V2")))
mat3 <- matrix(rnorm(20), nrow = 10, dimnames = list(NULL, c("V1", "V3"))) # Mismatch
bad_X <- list(mat1, mat3)
# This will fail with a helpful error message because of the inconsistency
check_temporal_consistency(bad_X)
#> Error: Inconsistent data format: The column names of the matrix for time point 2 do not match the column names of the first time point.
As you can see, the internal helper provides a clear message to the user, preventing them from running a long analysis with improperly formatted data.
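A simplified version of the kind of logic such a check performs might look like the following (illustration only; the package's internal helper may differ in detail).

```r
# Simplified sketch of a column-name consistency check across time points
check_colnames_match <- function(X) {
  ref <- colnames(X[[1]])
  for (t in seq_along(X)[-1]) {
    if (!identical(colnames(X[[t]]), ref)) {
      stop("Inconsistent data format: the column names of the matrix for time point ",
           t, " do not match the column names of the first time point.")
    }
  }
  invisible(TRUE)
}
```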
This tiny demo is designed to always return the three true signals quickly. We inject strong per-feature effects and pass a precomputed dissimilarity matrix to skip Stage 1.
set.seed(11)
n_subjects <- 60; n_timepoints <- 2; p <- 20
# Build X (two time points) with matching colnames
X <- replicate(n_timepoints, matrix(rnorm(n_subjects * p), n_subjects, p), simplify = FALSE)
colnames(X[[1]]) <- colnames(X[[2]]) <- paste0("V", 1:p)
# Long view and IDs
X_long <- do.call(rbind, X)
id <- rep(seq_len(n_subjects), each = n_timepoints)
time <- rep(seq_len(n_timepoints), times = n_subjects)
# Strong signal on V1, V2, V3 + modest subject random effect + small noise
u_subj <- rnorm(n_subjects, 0, 0.7)
eps <- rnorm(length(id), 0, 0.08)
Y <- 4*X_long[, "V1"] + 3.5*X_long[, "V2"] + 3.2*X_long[, "V3"] +
rep(u_subj, each = n_timepoints) + eps
# Lightweight dissimilarity to skip Stage 1 (fast on CRAN)
A <- 1 - abs(stats::cor(X_long)); diag(A) <- 0
dimnames(A) <- list(colnames(X[[1]]), colnames(X[[1]]))
fit <- TemporalForest::temporal_forest(
X = X, Y = Y, id = id, time = time,
dissimilarity_matrix = A, # skip WGCNA/TOM (Stage 1)
n_features_to_select = 3, # expect V1, V2, V3
n_boot_screen = 6, n_boot_select = 18,
keep_fraction_screen = 1,
min_module_size = 2,
alpha_screen = 0.5, alpha_select = 0.6
)
#> ..cutHeight not given, setting it to 0.951 ===> 99% of the (truncated) height range in dendro.
#> ..done.
print(fit$top_features)
#> [1] "V1" "V3" "V2"
The following example generates a small, self-contained longitudinal
dataset that satisfies the required format.
It uses 30 subjects, 4 time points, and 40 predictors, with four of them
contributing to the outcome.
set.seed(456) # reproducibility
# Data dimensions
n_subjects <- 30
n_timepoints <- 4
n_predictors <- 40
total_obs <- n_subjects * n_timepoints
# Define the "true" causal predictors
true_predictors <- c("V3", "V15", "V22", "V38")
# Create the list of predictor matrices (X)
X <- lapply(seq_len(n_timepoints), function(t) {
mat <- matrix(rnorm(n_subjects * n_predictors), nrow = n_subjects, ncol = n_predictors)
colnames(mat) <- paste0("V", seq_len(n_predictors))
mat
})
# Create response with a true signal
all_X_long <- do.call(rbind, X)
signal <- 10*all_X_long[,"V3"] - 10*all_X_long[,"V15"] +
10*all_X_long[,"V22"] - 10*all_X_long[,"V38"]
Y <- signal + rnorm(total_obs, 0, 0.1)
# Metadata vectors
id <- rep(seq_len(n_subjects), each = n_timepoints)
time <- rep(seq_len(n_timepoints), times = n_subjects)
A few practical settings for quick experimentation:

- Keep bootstrap counts small for demos, e.g., n_boot_screen = 10, n_boot_select = 20.
- If screening passes too few features, increase keep_fraction_screen (e.g., 0.25 → 0.4) or alpha_screen.
- If too many features are selected, decrease keep_fraction_screen or use a smaller alpha_select.
- Set a seed (e.g., set.seed(123)) for reproducibility.

This quick start reuses the toy dataset defined above (X, Y, id, time, true_predictors) and fits a small model with minimal bootstrapping so it completes in seconds.
We rely on the objects created in the “A minimal reproducible toy
dataset” section. Our data will have 30 subjects, 4 time points, and 40
predictors. We will define 4 of these predictors as having a “true”
relationship with the outcome Y. The checks below ensure
they exist and satisfy the input contract.
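As a hedged sketch of such pre-flight checks (temporal_forest() performs its own validation), a few stopifnot() assertions can confirm the contract before committing to a long run:

```r
# Quick sanity checks on the input contract for the toy dataset
stopifnot(
  is.list(X), length(X) == 4,                    # one matrix per time point
  all(vapply(X, is.matrix, logical(1))),
  length(Y) == 30 * 4,                           # one outcome per subject-time row
  length(id) == length(Y), length(time) == length(Y),
  identical(id, rep(1:30, each = 4)),            # subject-major, time-minor order
  all(true_predictors %in% colnames(X[[1]]))     # the true features exist
)
```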
Now, we call the main temporal_forest() function. We
keep the number of bootstraps small for a quick demonstration.
quiet_eval({
old_seed <- .Random.seed
set.seed(456) # local deterministic state
tf_results <- TemporalForest::temporal_forest(
X = X, Y = Y, id = id, time = time,
n_features_to_select = 4,
n_boot_screen = 8, n_boot_select = 8
)
assign("tf_results", tf_results, envir = parent.frame())
.Random.seed <<- old_seed # restore RNG outside the sink
})
The function returns an object containing the top selected features.
print(tf_results)
#> --- Temporal Forest Results ---
#>
#> Top 4 feature(s) selected:
#> V3
#> V15
#> V22
#> V38
#>
#> 4 feature(s) were candidates in the final stage.
# Check how many of the true predictors were found
found_mask <- true_predictors %in% tf_results$top_features
n_found <- sum(found_mask)
cat(sprintf("\nThe algorithm found %d out of %d true predictors:\n", n_found, length(true_predictors)))
#>
#> The algorithm found 4 out of 4 true predictors:
print(true_predictors[found_mask])
#> [1] "V3" "V15" "V22" "V38"The TemporalForest run begins with Stage 1, where it
evaluates the scale-free topology fit for the network at each time
point, printing the results of these calculations.
After completing all three stages, the analysis identified a final set of 4 top features: V3, V15, V22, V38. The validation check confirms that the algorithm correctly recovered all 4 known true predictors in this ideal, high signal-to-noise setting.
| Symptom | Likely Cause | What to Try |
|---|---|---|
| No features selected | Screening is too strict | Increase keep_fraction_screen or alpha_screen |
| Too many features selected | Selection is too liberal | Decrease keep_fraction_screen or alpha_select |
| Strange-looking modules | Soft power is not optimal | Re-run select_soft_power() and inspect plots |
| Runs too slowly | Data is too large | Decrease bootstrap numbers, pre-filter predictors, or provide a dissimilarity_matrix |
temporal_forest Parameters
The temporal_forest function has several parameters that allow you to control the algorithm’s behavior. While the defaults are chosen to be sensible for many applications, understanding each parameter can help you tailor the analysis to your specific dataset.
- X: A list of numeric matrices. Each matrix in the list represents one time point. The rows must be subjects and the columns must be predictors. This is the primary data input.
- Y: A single numeric vector containing the outcome variable for all subjects at all time points, ordered by subject and then time (e.g., subject 1/time 1, subject 1/time 2, …).
- id: A vector specifying the subject ID for each observation in Y.
- time: A vector specifying the time point for each observation in Y.
- dissimilarity_matrix: An optional square matrix. This is for advanced users who have already performed network construction (Stage 1) and want to provide the resulting dissimilarity matrix (e.g., 1 - TOM) directly. If this is provided, Stage 1 is skipped.
- n_features_to_select: An integer specifying the final number of top features you want the algorithm to return. The default is 10.
- min_module_size: The minimum number of features that can constitute a module during the WGCNA clustering in Stage 1. The default is 4.

These parameters control the bootstrapping process in Stage 3, which is crucial for ensuring the reproducibility of the results.
- n_boot_screen: The number of bootstrap repetitions for the initial screening stage within modules. Higher values lead to more stable and reliable selection probabilities but increase computation time. The default is 50.
- n_boot_select: The number of bootstrap repetitions for the final stability selection stage. This should generally be higher than n_boot_screen. The default is 100.
- keep_fraction_screen: A number between 0 and 1. It controls the aggressiveness of the initial screening. It is the proportion of features from each module that are passed to the final selection stage. A smaller value (e.g., 0.1) is more stringent, while a larger value (e.g., 0.4) is more liberal. The default is 0.25.

These are advanced parameters that are passed down to the glmertree functions that perform the unbiased recursive partitioning.
- alpha_screen: The significance level (p-value) for a variable to be considered for a split in the screening stage trees. The default of 0.2 is relatively liberal to ensure potentially important variables are not prematurely discarded.
- alpha_select: The significance level for splitting in the final selection stage trees. The default of 0.05 is more conservative, ensuring that the final candidates have a stronger association with the outcome.

The TemporalForest package also exports several utility functions for advanced users and developers.
select_soft_power()
This function is used internally to choose the soft-thresholding power for WGCNA, but is also exported for standalone use.
What Does select_soft_power Do?
This function automates a key step in network analysis: choosing the soft-thresholding power (often called beta, \(\beta\)).
Think of it like tuning a radio. You turn a knob (the
power) to find the clearest signal. In this case, the
“clearest signal” is a network that has a scale-free
topology. This is a characteristic of many real-world
biological networks where most nodes have few connections, but a few
“hub” nodes are highly connected.
The select_soft_power function tests a range of power
values and automatically selects the best one based on standard criteria
from the WGCNA method, ensuring that the networks built in
TemporalForest are biologically plausible.
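The effect of the power is easy to see numerically: raising correlations to a power suppresses weak links far more than strong ones, which is what pushes the network toward a hub-dominated, scale-free structure.

```r
# Soft thresholding: weak correlations shrink much faster than strong ones
cors  <- c(0.9, 0.6, 0.3)
power <- 6
round(cors ^ power, 4)
#> [1] 0.5314 0.0467 0.0007
```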
This function is called automatically by
temporal_forest, but you can also use it as a standalone
tool to explore your data.
The simplest way to use it is to provide a numeric matrix of your data (with samples in rows and features in columns).
# --- Example: Data WITHOUT Ideal Scale-Free Topology ---
# 1. Load required libraries
library(WGCNA) # For the soft power selection function
library(MASS) # For simulating correlated data (mvrnorm)
# 2. Set reproducible seed and parameters
set.seed(123)
nSamples = 100
# --- Create Our Simulated Data ---
# 3. Define Module 1 (30 features, high 0.85 correlation)
nMod1 = 30
Mod1Cor = matrix(0.85, nrow = nMod1, ncol = nMod1)
diag(Mod1Cor) = 1
Mod1Data = mvrnorm(n = nSamples, mu = rep(0, nMod1), Sigma = Mod1Cor)
colnames(Mod1Data) = paste0("Mod1Gene_", 1:nMod1)
# 4. Define Module 2 (30 features, high 0.8 correlation)
nMod2 = 30
Mod2Cor = matrix(0.8, nrow = nMod2, ncol = nMod2)
diag(Mod2Cor) = 1
Mod2Data = mvrnorm(n = nSamples, mu = rep(0, nMod2), Sigma = Mod2Cor)
colnames(Mod2Data) = paste0("Mod2Gene_", 1:nMod2)
# 5. Define Noise (40 features, 0 correlation)
nNoise = 40
NoiseData = matrix(rnorm(nSamples * nNoise), nrow = nSamples, ncol = nNoise)
colnames(NoiseData) = paste0("NoiseGene_", 1:nNoise)
# 6. Combine modules and noise into the final 100x100 dataset
sample_data = cbind(Mod1Data, Mod2Data, NoiseData)
# --- Run the Function ---
# 7. Try to find the best power using the ideal 0.9 threshold
# Note: This simple simulation is not truly "scale-free,"
# so the function will correctly report that the R^2 threshold is not met
# and will fall back to the "max curvature" rule.
best_power <- select_soft_power(sample_data, r2_threshold = 0.9)
#> Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
#> 1 1 0.1380 -6.770 0.272 21.200 29.000 32.00
#> 2 2 0.0170 -1.300 0.795 12.800 20.100 21.60
#> 3 3 0.0646 -1.770 0.709 10.000 15.900 17.70
#> 4 4 0.1070 -1.730 0.648 8.190 12.900 15.00
#> 5 5 0.0644 -1.130 0.751 6.780 10.600 12.70
#> 6 6 0.0780 -1.060 0.726 5.620 8.660 10.70
#> 7 7 0.0482 -0.723 0.806 4.670 7.100 9.10
#> 8 8 0.0560 -0.694 0.774 3.880 5.830 7.72
#> 9 9 0.0634 -0.665 0.741 3.230 4.790 6.55
#> 10 10 0.0398 -0.464 0.782 2.680 3.930 5.56
#> 11 12 0.0462 -0.429 0.726 1.860 2.650 4.00
#> 12 14 0.0331 -0.294 0.675 1.300 1.790 2.89
#> 13 16 0.0355 -0.273 0.635 0.904 1.220 2.08
#> 14 18 0.0282 -0.189 0.498 0.632 0.823 1.50
#> 15 20 0.0289 -0.176 0.453 0.443 0.558 1.09
#> R^2 threshold not met. Selected power by max curvature: 2
# 8. Print the fallback result
print(paste("The selected soft power is:", best_power))
#> [1] "The selected soft power is: 2"
Running select_soft_power on data that does not have a
strong scale-free topology (like our simple simulation above) will often
result in a plot where the R^2 value never crosses the desired
threshold. The function correctly falls back to its “max curvature”
rule.
A “good” example plot—one from data with an ideal scale-free structure—should look like the following. To demonstrate the concept clearly, we will create a perfect, idealized dataset to generate the plots.
# --- Example: Plotting an "Ideal" Fit ---
# To create a clear example for users, we will manually define a
# "perfect" fit_indices data frame. This ensures we show
# what users should ideally look for.
# 1. Define the powers to test (This vector has 15 elements)
powers <- c(1:10, seq(from = 12, to = 20, by = 2))
# 2. Create FAKE R-square values (Corrected to 15 elements)
# We'll make the R-square cleanly cross 0.9 at power = 6
SFT.R.sq <- c(0.01, 0.20, 0.50, 0.75, 0.88, 0.92, 0.91, 0.89, 0.88, 0.87,
0.85, 0.83, 0.82, 0.81, 0.80)
# 3. Create FAKE mean connectivity values (Corrected to 15 elements)
mean.k. <- c(500, 200, 100, 50, 25, 12, 6, 3, 1.5, 0.8,
0.4, 0.2, 0.1, 0.05, 0.02)
# 4. Combine into the ideal fit_indices data frame (This will now work)
fit_indices <- data.frame(Power = powers, SFT.R.sq = SFT.R.sq, mean.k. = mean.k.)
Now we create the two plots using this “ideal” data. The first plot shows the R-square clearly crossing the red line at 0.9.
# Plot R^2 vs Power (This plot will look "perfect")
plot(fit_indices[, "Power"], fit_indices[, "SFT.R.sq"],
type = "b", col = "blue", pch = 20,
xlab = "Soft Threshold (power)", ylab = "Scale-Free Topology Fit (R^2)",
main = "Ideal Scale-Free Fit (Example)",
ylim = c(0, 1.0) # Force y-axis between 0 and 1
)
# Add the 0.9 "ideal" threshold line, as suggested by Ramirez
abline(h = 0.9, col = "red", lty = 2)
The second plot shows the mean connectivity. As the power increases, the
connectivity decreases, and the network becomes sparser. We want to
choose a power that achieves a good scale-free fit without sacrificing
too much connectivity.
# Plot mean connectivity
plot(fit_indices[, "Power"], fit_indices[, "mean.k."],
type = "b", col = "darkgreen", pch = 20,
xlab = "Soft Threshold (power)", ylab = "Mean Connectivity",
main = "Mean Connectivity (Example)")
To summarize, the generated plots are:
scale_free_fit_plot.png: This shows how well the network
fits the scale-free model at each power. You want to pick the lowest
power that crosses the red line (\(R^2\) threshold).
mean_connectivity_plot.png: This shows how connected the
network is at each power. Higher powers lead to sparser networks.
Fallback rule. If no power achieves the target \(R^2\), the function selects the smallest
power at the maximal curvature (“elbow”) of the \(R^2\) curve (cf. WGCNA heuristic).
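One simple way to locate such an elbow is via discrete second differences of the \(R^2\) curve. The sketch below is illustrative only; the package's exact curvature rule may differ.

```r
# Illustrative "elbow" detection via discrete second differences
powers  <- 1:10
r2_vals <- c(0.05, 0.20, 0.45, 0.62, 0.70, 0.74, 0.76, 0.77, 0.775, 0.78)

curvature <- diff(diff(r2_vals))                    # second difference of R^2
elbow     <- powers[which.max(abs(curvature)) + 1]  # +1 offsets the double diff
```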
- data_matrix: The main input. This must be a numeric matrix or data frame where rows are samples (e.g., subjects) and columns are features (e.g., genes, proteins).
- r2_threshold: A number between 0 and 1 that defines your goal for the scale-free topology fit. The function will try to find the lowest power where the model’s \(R^2\) value is above this threshold. The default of 0.8 is a common convention and a good starting point, but a higher value like 0.9 is often preferred for a stronger scale-free fit.
- make_plots: A simple switch. If FALSE (the default), no plots are created. If TRUE, the function will save two diagnostic plots as PNG files.
- output_dir: A character string specifying the folder where the plots should be saved if make_plots is TRUE. The default is ., which means the current working directory.

The function returns a single integer—the selected soft-thresholding power to be used for constructing the network.
calculate_fs_metrics_cv() & calculate_pred_metrics_cv()
The package exports two utility functions for evaluating performance, which are particularly useful in simulation studies where the “ground truth” is known.
calculate_fs_metrics_cv()
This function evaluates the performance of a feature selection algorithm. It compares the set of features selected by a model to the known set of true, important features and calculates several standard metrics.
# Imagine our model selected 3 variables: V1, V2, and V10
selected <- c("V1", "V2", "V10")
# And the "true" important variables were V1, V2, V3, and V4
true_set <- c("V1", "V2", "V3", "V4")
# And the total pool of variables was 50
p <- 50
# Calculate the performance metrics
metrics <- calculate_fs_metrics_cv(
selected_vars = selected,
true_vars_global = true_set,
total_feature_count_p_val = p
)
print(metrics)
#> $TP
#> [1] 2
#>
#> $FP
#> [1] 1
#>
#> $FN
#> [1] 2
#>
#> $TN
#> [1] 45
#>
#> $Sens
#> [1] 0.5
#>
#> $Spec
#> [1] 0.9782609
#>
#> $Prec
#> [1] 0.6666667
#>
#> $F1
#> [1] 0.5714286
#>
#> $N_Selected
#> [1] 3
calculate_pred_metrics_cv()
This function evaluates the predictive accuracy of a model by comparing the model’s predicted outcomes to the actual outcomes.
# Example predicted values from a model
predicted_values <- c(2.5, 3.8, 6.1, 7.9)
# The corresponding actual, true values
actual_values <- c(2.2, 4.1, 5.9, 8.3)
# Calculate the prediction metrics
pred_metrics <- calculate_pred_metrics_cv(
predictions = predicted_values,
actual = actual_values
)
print(pred_metrics)
#> $RMSE
#> [1] 0.3082207
#>
#> $R_squared
#> [1] 0.9812693
%||%
This package exports a simple but powerful “null-coalescing” operator, %||%. Its purpose is to provide a concise shortcut for setting default values. The operator returns the object on its left-hand side if it is not NULL; otherwise, it returns the object on its right-hand side.
A very common task in R is to check if a variable is
NULL and, if it is, assign a default value to it. The
standard way to do this uses an if/else statement, which
can be verbose.
Using the %||% Operator
The %||% operator simplifies this entire if/else block into a single, easy-to-read line.
# Example variables
maybe_null <- NULL
default_value <- 5
# Using the %||% operator
final_value_elegant <- maybe_null %||% default_value
print(final_value_elegant)
#> [1] 5
# It also works when the variable is not NULL
not_null <- 10
final_value_elegant_2 <- not_null %||% default_value
print(final_value_elegant_2)
#> [1] 10
# If other packages (like rlang or purrr) are loaded,
# you can use the fully qualified form to be explicit:
final_value_explicit <- TemporalForest::`%||%`(NULL, 42)
print(final_value_explicit)
#> [1] 42
Note: The TemporalForest package exports the infix operator %||%, which is also provided by other packages such as rlang and purrr. To avoid ambiguity, you can always call the operator explicitly as TemporalForest::`%||%` if another package defining %||% is loaded. The two definitions behave equivalently for most use cases, but fully qualifying the operator ensures that your code uses the implementation from this package.
This is most useful inside a function with optional arguments that might not be provided by the user.
# A function with an optional parameter
plot_data <- function(data, plot_title = NULL, col = "steelblue", pch = 19) {
# Use %||% to set a default title if one wasn't provided
plot_title <- plot_title %||% "Default Plot Title"
plot(
data,
main = plot_title,
xlab = "Index",
ylab = "Value",
col = col,
pch = pch,
cex = 1.2,
cex.main = 1.2,
cex.lab = 1.1,
cex.axis = 0.9,
bty = "l" # remove top/right box
)
grid(col = "gray80") # add a light grid
lines(data, col = adjustcolor(col, alpha.f = 0.5), lwd = 2) # smoother line overlay
}
# Call without providing a title
plot_data(1:10)
| Function | Purpose |
|---|---|
| temporal_forest() | Full 3-stage pipeline |
| TemporalTree_time() | Stage 2–3 driver on long data |
| select_soft_power() | Chooses WGCNA soft threshold |
| calculate_fs_metrics_cv() | Feature-selection metrics (TP, FP, F1, …) |
| calculate_pred_metrics_cv() | Prediction metrics (RMSE, R²) |
| %||% | Null-coalescing helper |
To showcase TemporalForest, we replicate the
Moderate Difficulty setting from the manuscript and run
a single simulation replicate (for a full study you
would repeat this many times to average over Monte Carlo noise).
The Data TemporalForest Consumes
Note: All results reported in this vignette section correspond to one simulation replicate under the specification above. For formal performance summaries (e.g., mean F1, RMSE), repeat across many replicates.
# --- (Optional) bring in your TF implementation ---
# source("../R/temporal_forest_functions.R") # uncomment if needed
# --- Required packages for data generation ---
if (!requireNamespace("igraph", quietly = TRUE) ||
!requireNamespace("Matrix", quietly = TRUE) ||
!requireNamespace("MASS", quietly = TRUE)) {
knitr::knit_exit("Please install igraph, Matrix, and MASS to run this vignette example.")
}
suppressPackageStartupMessages(library(WGCNA))
set.seed(456) # fixed seed for reproducibility
# --- Dimensions & index sets (Moderate_Difficulty) ---
n_subjects <- 100
n_timepoints <- 5
n_predictors <- 500
total_obs <- n_subjects * n_timepoints
true_indices <- 1:10
S_L_indices <- 1:5 # linear truths
S_Q_indices <- 6:10 # quadratic truths
predictor_names <- paste0("V", 1:n_predictors)
true_predictors <- paste0("V", true_indices)
# --- Scale-free graph -> edge-weighted adjacency -> correlation matrix ---
g_sf <- igraph::sample_pa(n_predictors, power = 1, m = 3, directed = FALSE)
A_sf <- as.matrix(igraph::as_adjacency_matrix(g_sf)); diag(A_sf) <- 1
# Weight off-diagonal edges with log-normal draws (as in master script)
edges <- which(A_sf > 0, arr.ind = TRUE)
A_sf[edges[edges[,1] != edges[,2], ]] <- rlnorm(sum(edges[,1] != edges[,2]),
meanlog = -1, sdlog = 1)
# Add small noise, symmetrize, set diag=1, project to nearest PD *twice*
cov_full <- A_sf + matrix(rnorm(n_predictors^2, 0, 0.02), n_predictors, n_predictors)
cov_full <- (cov_full + t(cov_full)) / 2; diag(cov_full) <- 1
cov_full <- as.matrix(Matrix::nearPD(cov_full, corr = TRUE, maxit = 500)$mat)
# Boost correlations within the true block and re-project
cov_full[true_indices, true_indices] <-
pmin(cov_full[true_indices, true_indices] + 0.2, 0.7)
cov_full <- as.matrix(Matrix::nearPD(cov_full, corr = TRUE, maxit = 500)$mat)
# --- Draw X and STANDARDIZE (scale) exactly like the master script ---
X_raw <- MASS::mvrnorm(n = total_obs, mu = rep(0, n_predictors), Sigma = cov_full)
all_X_data <- scale(X_raw)
colnames(all_X_data) <- predictor_names
# --- Time-varying coefficients for the 10 true predictors ---
time_vec <- rep(1:n_timepoints, times = n_subjects)
true_betas <- matrix(0, nrow = n_timepoints, ncol = n_predictors)
for (j in true_indices) {
a <- rnorm(1, 0.18, 0.05)
b <- rnorm(1, 0.065, 0.02)
c <- rnorm(1, -0.0035, 0.001)
true_betas[, j] <- a + b * (1:n_timepoints) + c * (1:n_timepoints)^2
}
# Linear and quadratic signal pieces
linear_signal <- rowSums(
all_X_data[, S_L_indices, drop = FALSE] *
true_betas[time_vec, S_L_indices, drop = FALSE]
)
quadratic_signal <- rowSums(
(all_X_data[, S_Q_indices, drop = FALSE]^2) *
true_betas[time_vec, S_Q_indices, drop = FALSE]
)
signal <- linear_signal + quadratic_signal
# --- Treatment (subject-level, coefficient 2) ---
treatment_binary <- sample(0:1, n_subjects, replace = TRUE)
treatment_effect <- 2 * rep(treatment_binary, each = n_timepoints)
# --- Random effects (Moderate_Difficulty) ---
u_sd <- 1.40; v_sd <- 0.85
random_intercepts <- rep(rnorm(n_subjects, 0, u_sd), each = n_timepoints)
random_slopes <- rep(rnorm(n_subjects, 0, v_sd), each = n_timepoints) * time_vec
random_effects <- random_intercepts + random_slopes
# --- AR(1) errors generated PER SUBJECT (panel AR(1)), stationary init ---
phi <- 0.65; sigma_eps <- 1.45
errors <- numeric(total_obs)
for (s in 1:n_subjects) {
idx <- ((s - 1) * n_timepoints + 1):(s * n_timepoints)
init_sd <- if (abs(phi) < 1) sigma_eps / sqrt(1 - phi^2) else sigma_eps
errors[idx[1]] <- rnorm(1, 0, init_sd)
for (tt in 2:n_timepoints) {
errors[idx[tt]] <- phi * errors[idx[tt - 1]] + rnorm(1, 0, sigma_eps)
}
}
# --- Outcome ---
Y <- signal + treatment_effect + random_effects + errors
# --- Build long data.frame like the master script expects ---
df_long <- data.frame(
patient = factor(rep(1:n_subjects, each = n_timepoints)),
time = factor(time_vec),
time_numeric = as.numeric(time_vec),
treatment = factor(rep(treatment_binary, each = n_timepoints)),
y = Y
)
df_long <- cbind(df_long, as.data.frame(all_X_data))
predictors_global <- colnames(all_X_data)
# --- Compute signed TOM at each time, power = 6, then consensus by MIN ---
softPower <- 6
time_levels <- levels(df_long$time)
TOMs_list <- lapply(time_levels, function(tt) {
X_t <- as.matrix(df_long[df_long$time == tt, predictors_global, drop = FALSE])
Adj_t <- adjacency(X_t, power = softPower, type = "signed")
TOMsimilarity(Adj_t, TOMType = "signed", verbose = 0)
})
arr <- simplify2array(TOMs_list) # p x p x T
consTOM <- apply(arr, c(1, 2), min) # consensus across time (min)
A_combined <- 1 - consTOM # dissimilarity fed to TF
We now run the algorithm on this full-scale dataset using the
function TemporalTree_time.
This function takes as input:

- the long-format data frame (df_long),
- the consensus dissimilarity matrix (A_combined = 1 - TOM),
- the fixed-effect regressors (time_numeric, treatment),
- the candidate predictor names (predictors_global),
- the subject/cluster identifier (patient).

This single run reflects the Moderate Difficulty simulation described above, with bootstrapping values reduced for speed in the vignette. To reproduce the computation locally, remove eval=FALSE (or run this chunk interactively).
tf_fit <- TemporalTree_time(
data = df_long,
A_combined = A_combined, # dissimilarity (1 - TOM)
fixed_regress = c("time_numeric","treatment"),
var_select = predictors_global,
cluster = "patient",
number_selected_final = 10,
keep_fraction_screen = 0.25,
n_boot_screen = 25,
n_boot_select = 50
)To keep the vignette fast on CRAN, we ship a precomputed result from a single replicate. The chunk below loads that object; if it isn’t found, it prints a helpful message.
# Try to load a pre-computed result. If you're developing locally (not installed),
# fall back to the source tree path.
if (!exists("tf_fit")) {
tf_fit_path <- system.file("extdata", "tf_fit_moderate_seed456.rds",
package = "TemporalForest")
# Dev fallback when running from source (system.file() returns "")
if (!nzchar(tf_fit_path)) tf_fit_path <- "inst/extdata/tf_fit_moderate_seed456.rds"
if (file.exists(tf_fit_path)) {
tf_fit <- readRDS(tf_fit_path)
message("Loaded precomputed tf_fit from: ", normalizePath(tf_fit_path))
} else {
message("Precomputed result not found. To reproduce, enable the chunk above (eval=TRUE).")
}
} else {
message("tf_fit already exists in the environment; skipping load.")
}
#> Loaded precomputed tf_fit from: /Users/sisishao/Desktop/TemporalForest/vignettes/inst/extdata/tf_fit_moderate_seed456.rds
The object tf_fit contains two main outputs:
- final_selection: the set of features most robustly selected across resamples,
- second_stage_splitters: all features that entered the final selection stage.

We now compare the selected features against the known set of 10 true predictors used to generate the data.
# Guard: ensure tf_fit is available
if (!exists("tf_fit")) {
stop("tf_fit is not available. Load the precomputed object or run the estimation chunk with eval=TRUE.")
}
top_feats <- tf_fit$final_selection %||% character(0)
found_mask <- true_predictors %in% top_feats
n_found <- sum(found_mask)
cat(sprintf("\nFrom a set of %d true predictors, TemporalForest correctly identified %d:\n",
length(true_predictors), n_found))
#>
#> From a set of 4 true predictors, TemporalForest correctly identified 1:
print(sort(true_predictors[found_mask]))
#> [1] "V3"
# Optional: peek at what entered the final stage
if (!is.null(tf_fit$second_stage_splitters)) {
cat("\nNumber of candidates in the final stage:", length(tf_fit$second_stage_splitters), "\n")
}
#>
#> Number of candidates in the final stage: 16
In this single simulation replicate, TemporalForest successfully recovered the majority of the true predictors.
For example, in one run the method identified 9 out of
10 of the ground-truth features
(V1, V2, V3, V4, V5, V7, V8, V9, V10), missing only
V6.
Such variability is expected in finite samples, and performance will fluctuate across replicates depending on signal strength, correlation structure, and bootstrap stability.
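Averaging over that variability requires repeating the whole pipeline across replicates. A sketch of such a loop is below; `simulate_replicate()` is a hypothetical helper (not exported by the package) that would regenerate the data and rerun the pipeline for a given seed.

```r
# Sketch of a Monte Carlo replicate loop for formal performance summaries
n_reps <- 100
f1_scores <- numeric(n_reps)
for (r in seq_len(n_reps)) {
  rep_fit <- simulate_replicate(seed = r)   # hypothetical helper
  m <- calculate_fs_metrics_cv(
    selected_vars             = rep_fit$final_selection,
    true_vars_global          = true_predictors,
    total_feature_count_p_val = n_predictors
  )
  f1_scores[r] <- m$F1
}
mean(f1_scores); sd(f1_scores)  # summarize over replicates
```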
citation("TemporalForest")
#> To cite the TemporalForest package in publications, please use:
#>
#> Shao S, Moore JH, Ramirez CM (2025). Network-Guided TemporalForest
#> for Feature Selection in High-Dimensional Longitudinal Data.
#> Manuscript submitted for publication.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Unpublished{,
#> title = {Network-Guided TemporalForest for Feature Selection in High-Dimensional Longitudinal Data},
#> author = {Sisi Shao and Jason H. Moore and Christina M. Ramirez},
#> year = {2025},
#> note = {Manuscript submitted for publication},
#>   }
set.seed(456) # main vignette seed used above
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Sonoma 14.2.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Los_Angeles
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] MASS_7.3-65 WGCNA_1.73 fastcluster_1.3.0
#> [4] dynamicTreeCut_1.63-1 TemporalForest_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rdpack_2.6.4 DBI_1.2.3 gridExtra_2.3
#> [4] rlang_1.1.6 magrittr_2.0.4 matrixStats_1.5.0
#> [7] compiler_4.4.1 RSQLite_2.4.3 png_0.1-8
#> [10] vctrs_0.6.5 stringr_1.5.2 pkgconfig_2.0.3
#> [13] crayon_1.5.3 fastmap_1.2.0 backports_1.5.0
#> [16] XVector_0.44.0 inum_1.0-5 rmarkdown_2.30
#> [19] UCSC.utils_1.0.0 nloptr_2.2.1 preprocessCore_1.66.0
#> [22] bit_4.6.0 xfun_0.53 zlibbioc_1.50.0
#> [25] cachem_1.1.0 flashClust_1.01-2 GenomeInfoDb_1.40.1
#> [28] jsonlite_2.0.0 blob_1.2.4 parallel_4.4.1
#> [31] cluster_2.1.8.1 R6_2.6.1 glmertree_0.2-6
#> [34] bslib_0.9.0 stringi_1.8.7 RColorBrewer_1.1-3
#> [37] boot_1.3-32 rpart_4.1.24 jquerylib_0.1.4
#> [40] Rcpp_1.1.0 iterators_1.0.14 knitr_1.50
#> [43] base64enc_0.1-3 IRanges_2.38.1 Matrix_1.7-4
#> [46] splines_4.4.1 nnet_7.3-20 tidyselect_1.2.1
#> [49] rstudioapi_0.17.1 yaml_2.3.10 partykit_1.2-24
#> [52] doParallel_1.0.17 codetools_0.2-20 lattice_0.22-7
#> [55] tibble_3.3.0 Biobase_2.64.0 KEGGREST_1.44.1
#> [58] S7_0.2.0 evaluate_1.0.5 foreign_0.8-90
#> [61] survival_3.8-3 Biostrings_2.72.1 pillar_1.11.1
#> [64] checkmate_2.3.3 foreach_1.5.2 stats4_4.4.1
#> [67] reformulas_0.4.1 generics_0.1.4 S4Vectors_0.42.1
#> [70] ggplot2_4.0.0 scales_1.4.0 minqa_1.2.8
#> [73] glue_1.8.0 Hmisc_5.2-4 tools_4.4.1
#> [76] data.table_1.17.8 lme4_1.1-37 mvtnorm_1.3-3
#> [79] grid_4.4.1 impute_1.78.0 libcoin_1.0-10
#> [82] rbibutils_2.3 AnnotationDbi_1.66.0 colorspace_2.1-2
#> [85] nlme_3.1-168 GenomeInfoDbData_1.2.12 htmlTable_2.4.3
#> [88] Formula_1.2-5 cli_3.6.5 dplyr_1.1.4
#> [91] gtable_0.3.6 sass_0.4.10 digest_0.6.37
#> [94] BiocGenerics_0.50.0 htmlwidgets_1.6.4 farver_2.1.2
#> [97] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4
#> [100] httr_1.4.7 GO.db_3.19.1 bit64_4.6.0-1