This vignette focuses on two practical knobs in the MetaHunt
pipeline: the latent rank K and the d-fSPA denoising
parameters (N, Delta). For the broader setup — the four
assumptions, the three-step pipeline, and the running notation — see
vignette("metahunt-intro", package = "MetaHunt").
Choosing K is the single most consequential decision in
a MetaHunt fit. Picking K too small underfits: real
cross-study heterogeneity gets squashed into a low-rank approximation
that cannot represent the data, and downstream predictions are biased.
Picking K too large inflates variance and risks recovering
spurious “bases” that fit noise. The denoising step in d-fSPA controls
finite-sample variance in a complementary way: averaging each study with
its near neighbours before basis hunting smooths over per-study
estimation error, at the cost of a small smoothing bias.
set.seed(1)                       # make the simulated data reproducible
m <- 30; G <- 20; K_true <- 3     # studies, grid points, true rank
x <- seq(0, 1, length.out = G)
basis <- rbind(sin(pi * x), cos(pi * x), x)       # K_true x G true bases
W <- data.frame(w1 = rnorm(m), w2 = rnorm(m))     # study-level metadata
beta <- cbind(c(1, -0.8), c(-0.5, 1.2), c(0, 0))  # metadata -> basis weights
pi_true <- exp(as.matrix(W) %*% beta); pi_true <- pi_true / rowSums(pi_true)
F_hat <- pi_true %*% basis + matrix(rnorm(m * G, sd = 0.05), m, G)

The elbow plot tracks how well the recovered bases
reconstruct the observed F_hat as a function of
K. It is unsupervised — it does not use W —
and is fast.
elbow <- reconstruction_error_curve(F_hat, K_range = 2:6,
dfspa_args = list(denoise = FALSE))
plot(elbow$K, elbow$error, type = "b",
xlab = "K", ylab = "reconstruction error",
main = "Reconstruction error vs K",
     ylim = c(0, max(elbow$error, na.rm = TRUE) * 1.05))

The CV prediction-error curve uses the metadata
W to predict held-out studies’ functions and reports the
average prediction error. This is supervised and tends to identify a
tighter elbow when the metadata is informative.
cv <- cv_error_curve(F_hat, W, K_range = 2:6, n_folds = 4,
dfspa_args = list(denoise = FALSE), seed = 1)
plot(cv$K, cv$cv_error, type = "b",
xlab = "K", ylab = "CV prediction error",
main = "CV prediction error vs K",
     ylim = c(0, max(cv$cv_error, na.rm = TRUE) * 1.05))

Both curves should dip near K = 3, the true rank in this
simulation.
dfspa() averages each study with its near neighbours
before running the projection algorithm. Two parameters control this:
N (the neighbourhood size, in number of studies) and
Delta (a distance threshold). Larger N and
Delta smooth more aggressively.
In clean simulations or with small m, the simplest
choice is to bypass denoising entirely. This avoids the small-sample
failure mode where aggressive denoising prunes too many studies.
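Bypassing denoising in a direct dfspa() call might look like the following sketch. The denoise flag matches the dfspa_args used in the diagnostic curves above; the K argument name is an assumption here, so check ?dfspa for the actual interface.

```r
# Sketch: run the d-fSPA step with denoising bypassed entirely.
# `denoise = FALSE` is the setting used in dfspa_args above;
# the `K` argument name is assumed.
fit <- dfspa(F_hat, K = 3, denoise = FALSE)
```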
If you have a sense of scale for the within-study estimation error,
pass N and Delta directly. These two calls
illustrate a hand-tuned and a near-default configuration on the same
data.
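The two calls might look like the sketch below. The values of N and Delta are purely illustrative (they are not recommendations, and the package defaults are not shown in this vignette), and the K argument name is assumed.

```r
# Hand-tuned: a small neighbourhood and a tight distance threshold
# (the values 3 and 0.1 are illustrative only).
fit_hand <- dfspa(F_hat, K = 3, N = 3, Delta = 0.1)

# Near-default: leave N and Delta at the package defaults.
fit_near <- dfspa(F_hat, K = 3)
```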
select_denoising_params() cross-validates over a grid of
(N, Delta) combinations at fixed K. With small
m, the search will frequently warn that some combinations
prune everything (“Only 0 studies survive denoising but K = 3…”). These
warnings are expected: aggressive (N, Delta) on small
training folds is too strong. The function records those folds as
failures and returns the best surviving combination.
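A call might look like the following sketch. Only the function name and its (N, Delta, K) search are attested in this vignette; the grid and fold argument names, and the returned field, are assumptions to be checked against ?select_denoising_params.

```r
# Sketch: cross-validate (N, Delta) at fixed K = 3.
# N_grid, Delta_grid, n_folds, and the `best` field are assumed names.
sel <- select_denoising_params(F_hat, K = 3,
                               N_grid = c(2, 4, 8),
                               Delta_grid = c(0.05, 0.1, 0.2),
                               n_folds = 4, seed = 1)
sel$best   # the best surviving (N, Delta) combination
```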
A practical checklist:

- Start with the unsupervised elbow plot to shortlist candidate values of K.
- Refine with the CV curve if W is informative.
- With small m (say m < 30), bypass denoising (denoise = FALSE) and pick K from the CV curve.
- With larger m, leave the d-fSPA defaults on or tune (N, Delta) with select_denoising_params().
- Treat warnings from select_denoising_params() as informative, not fatal. The reported best is the best surviving combination.
- Inspect the fitted bases with plot(fit). Bases that look like noise are a sign of K set too high.

Further reading:

- vignette("metahunt-intro", package = "MetaHunt") — the full pipeline and key assumptions.
- ?metahunt — the wrapper around the three pipeline steps.
- ?dfspa — d-fSPA basis hunting and its denoising arguments.
- ?reconstruction_error_curve — the unsupervised elbow diagnostic.
- ?cv_error_curve — the supervised CV diagnostic.
- ?select_denoising_params — cross-validating (N, Delta) at fixed K.