This vignette demonstrates integration of corrselect into complete
modeling workflows, from raw predictors to final models under
correlation constraints. Four applied settings illustrate interface
selection and workflow composition. See
vignette("quickstart") for interface descriptions and
vignette("theory") for mathematical foundations.
Functions used: corrPrune(),
modelPrune() — see vignette("quickstart") for
full signatures and parameter details.
Each workflow showcases different aspects of the package:
Goal: Build an interpretable species distribution model from highly correlated bioclimatic variables.
Challenge: WorldClim’s 19 bioclimatic variables contain many temperature and precipitation metrics that are mathematically related (e.g., mean temperature, minimum temperature, maximum temperature). Using all variables leads to multicollinearity, unstable coefficients, and poor model interpretation.
Strategy: Two-stage pruning:
corrPrune() removes pairwise correlations > 0.7modelPrune() refines further using variance inflation
factors (VIF)This approach balances model fit with interpretability and numerical stability.
data(bioclim_example)
# Data structure
dim(bioclim_example)
#> [1] 100 20
head(names(bioclim_example))
#> [1] "species_richness" "BIO1" "BIO2" "BIO3"
#> [5] "BIO4" "BIO5"
# Response variable
summary(bioclim_example$species_richness)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 5.0 96.0 118.5 120.2 144.2 206.0The dataset contains 100 sampling locations with 19 bioclimatic predictors and a species richness response. The bioclimatic variables are standard WorldClim metrics (BIO1-BIO19) measuring temperature and precipitation patterns.
We start by removing variables with pairwise correlations exceeding
0.7. The mode = "auto" setting uses the exact algorithm
(Bron-Kerbosch) to enumerate all maximal subsets, then selects the
largest one.
# Remove highly correlated predictors
bio_clean <- corrPrune(
data = bioclim_example[, -1], # Exclude response
threshold = 0.7,
mode = "auto"
)
# How much did we reduce?
cat(sprintf("Reduced from %d → %d variables\n",
ncol(bioclim_example) - 1,
ncol(bio_clean)))
#> Reduced from 19 → 12 variables
# Which variables were kept?
head(attr(bio_clean, "selected_vars"), 10)
#> [1] "BIO1" "BIO3" "BIO6" "BIO9" "BIO11" "BIO12" "BIO13" "BIO14" "BIO16"
#> [10] "BIO17"The pruning successfully reduced dimensionality while ensuring no remaining pair exceeds the 0.7 threshold. The selected variables span both temperature and precipitation domains, maintaining ecological interpretability.
The histogram below shows how corrPrune() reshapes the
correlation structure. Before pruning (red), many variable pairs exceed
the 0.7 threshold. After pruning (blue), all pairwise correlations fall
below the threshold.
cor_before <- cor(bioclim_example[, -1])
cor_after <- cor(bio_clean)
vals_before <- abs(cor_before[upper.tri(cor_before)])
vals_after <- abs(cor_after[upper.tri(cor_after)])
# Common x limits
xlim <- c(0, 1)
# Shared breaks for fair comparison
breaks <- seq(0, 1, length.out = 30)
# Before histogram
hist(vals_before,
breaks = breaks,
freq = FALSE,
main = "Distribution of Absolute Correlations",
xlab = "Absolute Correlation",
col = rgb(0.8, 0.2, 0.2, 0.4),
border = "white",
xlim = xlim)
# After histogram
hist(vals_after,
breaks = breaks,
freq = FALSE,
col = rgb(0.2, 0.5, 0.8, 0.4),
border = "white",
add = TRUE)
# Threshold line
abline(v = 0.7, col = "black", lty = 2, lwd = 2)
legend("topright",
legend = c("Before", "After", "Threshold"),
fill = c(rgb(0.8, 0.2, 0.2, 0.4),
rgb(0.2, 0.5, 0.8, 0.4),
NA),
border = c("white", "white", NA),
lty = c(NA, NA, 2),
lwd = c(NA, NA, 2),
col = c(NA, NA, "black"),
bty = "o",
bg = "white")We fit three models to compare the effect of pruning:
The VIF criterion complements correlation-based pruning by detecting multicollinearity involving more than two variables simultaneously.
# Model 1: Full model (19 variables)
model_full <- lm(species_richness ~ ., data = bioclim_example)
# Model 2: After corrPrune (correlation-based pruning)
bio_clean_full <- data.frame(
species_richness = bioclim_example$species_richness,
bio_clean
)
model_corrprune <- lm(species_richness ~ ., data = bio_clean_full)
# Model 3: Sequential VIF-based refinement
bio_final <- modelPrune(
formula = species_richness ~ .,
data = bio_clean_full,
limit = 5
)
model_final <- attr(bio_final, "final_model")The table below shows that pruning dramatically improves numerical stability (condition number κ) while maintaining model fit (adjusted R²). A lower κ indicates better-conditioned matrices and more stable coefficient estimates.
# Variable counts
n_full <- 19
n_corrprune <- length(attr(bio_clean, "selected_vars"))
n_final <- length(attr(bio_final, "selected_vars"))
# Compute condition numbers (measure of collinearity)
X_full <- model.matrix(model_full)[, -1]
X_corrprune <- model.matrix(model_corrprune)[, -1]
X_final <- model.matrix(model_final)[, -1]
kappa_full <- kappa(X_full, exact = TRUE)
kappa_corrprune <- kappa(X_corrprune, exact = TRUE)
kappa_final <- kappa(X_final, exact = TRUE)
# Summary table
comparison <- data.frame(
Step = c("Full", "corrPrune", "+ modelPrune"),
Predictors = c(n_full, n_corrprune, n_final),
Adj_R2 = c(
summary(model_full)$adj.r.squared,
summary(model_corrprune)$adj.r.squared,
summary(model_final)$adj.r.squared
),
Kappa = c(kappa_full, kappa_corrprune, kappa_final)
)
print(comparison)
#> Step Predictors Adj_R2 Kappa
#> 1 Full 19 0.9821862 914.8146
#> 2 corrPrune 12 0.8962848 538.1157
#> 3 + modelPrune 12 0.8962848 538.1157Key insights:
The plot below illustrates the pruning tradeoff: as we remove variables, condition number (κ) drops dramatically while adjusted R² remains high. This demonstrates that many of the original 19 variables were redundant for prediction.
# Extract data
n_vars <- comparison$Predictors
adj_r2 <- comparison$Adj_R2
kappa <- comparison$Kappa
# Left y-axis: Adjusted R²
par(mar = c(5, 4, 4, 4)) # extra space on the right for second axis
plot(
n_vars, adj_r2,
type = "b",
pch = 19,
cex = 1.5,
col = rgb(0.2, 0.5, 0.8, 1),
lwd = 2,
xlab = "Number of Predictors",
ylab = "Adjusted R²",
ylim = c(0, 1),
main = "Pruning Reduces Collinearity While Preserving Fit"
)
# Right y-axis: log10(κ)
log_kappa <- log10(kappa)
# Nice ylim for log10(κ)
ylim_right <- range(log_kappa, finite = TRUE)
ylim_right <- ylim_right * c(0.9, 1.1)
par(new = TRUE)
plot(
n_vars, log_kappa,
type = "b",
pch = 17,
cex = 1.5,
col = rgb(0.8, 0.2, 0.2, 1),
lwd = 2,
xaxt = "n",
yaxt = "n",
xlab = "",
ylab = "",
ylim = ylim_right
)
# Right-hand axis ticks using pretty() on log scale
log_ticks <- pretty(log_kappa)
kappa_labels <- round(10^log_ticks)
axis(4, at = log_ticks, labels = kappa_labels)
mtext("Condition Number (κ)", side = 4, line = 3)
# Legend centered at top
legend(
"top",
inset = 0.02,
legend = c("Adjusted R² (higher better)", "κ (lower better)"),
col = c(
rgb(0.2, 0.5, 0.8, 1),
rgb(0.8, 0.2, 0.2, 1)
),
pch = c(19, 17),
lwd = 2,
horiz = TRUE,
bty = "o",
bg = "white",
x.intersp = 0.8
)Multicollinearity inflates coefficient variance, making estimates unstable. The plot below compares coefficients between the full model (19 variables) and the final pruned model. Notice how:
# Extract coefficients (excluding intercept)
coef_full <- coef(model_full)[-1]
coef_final <- coef(model_final)[-1]
# Use *all* variables (not just common ones)
all_vars <- union(names(coef_full), names(coef_final))
# Align coefficients to the same full variable list
vals_full <- coef_full[all_vars]
vals_pruned <- coef_final[all_vars]
# Replace missing values (variables dropped in a model) with 0
vals_full[is.na(vals_full)] <- 0
vals_pruned[is.na(vals_pruned)] <- 0
# Colours
col_full <- rgb(0.8, 0.2, 0.2, 0.5)
col_pruned <- rgb(0.2, 0.5, 0.8, 0.5)
x <- seq_along(all_vars)
# Symmetric Y range
ylim <- range(c(vals_full, vals_pruned)) * 1.15
# Empty plot first
plot(
x, vals_full,
type = "n",
xaxt = "n",
xlab = "",
ylab = "Coefficient",
main = "Coefficient Comparison (Full vs modelPrune)",
ylim = ylim
)
axis(1, at = x, labels = all_vars, las = 2, cex.axis = 0.7)
# Full model bars
rect(
x - 0.4, 0,
x + 0.4, vals_full,
col = col_full, border = NA
)
# Pruned model bars
rect(
x - 0.4, 0,
x + 0.4, vals_pruned,
col = col_pruned, border = NA
)
legend(
"topright",
legend = c(
sprintf("Full (%d vars)", n_full),
sprintf("Final (%d vars)", n_final)
),
fill = c(col_full, col_pruned),
border = "white",
bty = "o",
bg = "white"
)Goal: Reduce questionnaire length while preserving construct coverage and protecting key demographic variables.
Challenge: Survey instruments often contain redundant items within constructs (e.g., multiple satisfaction questions that are highly correlated). Reducing the number of items improves response rates and reduces respondent burden without losing measurement quality.
Strategy: Use corrPrune() with
force_in to:
data(survey_example)
# Data structure
dim(survey_example)
#> [1] 200 35
str(survey_example[, 1:10]) # First 10 columns
#> 'data.frame': 200 obs. of 10 variables:
#> $ respondent_id : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ age : num 38 32 18 18 19 39 33 26 26 42 ...
#> $ gender : Factor w/ 3 levels "Female","Male",..: 2 3 1 2 2 1 1 2 2 1 ...
#> $ education : Ord.factor w/ 4 levels "High School"<..: 3 1 4 2 2 1 1 1 2 3 ...
#> $ overall_satisfaction: num 58 40 44 40 58 67 61 49 51 52 ...
#> $ satisfaction_1 : Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: 6 3 5 3 4 5 5 4 4 5 ...
#> $ satisfaction_2 : Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: 6 3 4 3 4 6 6 4 4 3 ...
#> $ satisfaction_3 : Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: 6 3 4 3 3 4 5 3 4 4 ...
#> $ satisfaction_4 : Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: 6 3 4 4 4 5 4 3 2 4 ...
#> $ satisfaction_5 : Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: 7 4 5 5 5 4 3 6 4 6 ...The dataset contains 200 respondents, 30 Likert-scale items (10 each for satisfaction, engagement, loyalty), plus demographics (age, gender, education) and an overall satisfaction outcome.
We use force_in = "age" to ensure age remains in the
analysis regardless of its correlation with other variables. This is
useful when domain knowledge identifies theoretically important
covariates that must not be removed.
# Exclude respondent_id, overall_satisfaction, and factor variables
survey_predictors <- survey_example[, !(names(survey_example) %in%
c("respondent_id", "overall_satisfaction",
"gender", "education"))]
# Convert ordered factors (Likert items 1-7) to numeric for correlation analysis
survey_numeric <- as.data.frame(lapply(survey_predictors, function(x) {
if (is.ordered(x)) as.numeric(as.character(x)) else as.numeric(x)
}))
# Prune with protected variables
survey_clean <- corrPrune(
data = survey_numeric,
threshold = 0.6,
force_in = "age"
)
# How many items remain?
cat(sprintf("Reduced from %d → %d variables\n",
ncol(survey_numeric),
ncol(survey_clean)))
#> Reduced from 31 → 4 variables
# Which items were kept?
selected <- attr(survey_clean, "selected_vars")
print(selected)
#> [1] "age" "satisfaction_9" "engagement_1" "loyalty_3"The pruning reduced the questionnaire substantially while ensuring age was retained. The remaining items span all three constructs, avoiding the loss of entire domains.
It’s important to verify that pruning didn’t eliminate entire constructs. We check how many items from each domain (satisfaction, engagement, loyalty) survived the correlation threshold.
# Count items per construct
satisfaction_kept <- sum(grepl("satisfaction_", selected))
engagement_kept <- sum(grepl("engagement_", selected))
loyalty_kept <- sum(grepl("loyalty_", selected))
cat(sprintf("Satisfaction: %d/10 items kept\n", satisfaction_kept))
#> Satisfaction: 1/10 items kept
cat(sprintf("Engagement: %d/10 items kept\n", engagement_kept))
#> Engagement: 1/10 items kept
cat(sprintf("Loyalty: %d/10 items kept\n", loyalty_kept))
#> Loyalty: 1/10 items keptGood balance: all three constructs retained representation, ensuring the reduced questionnaire still measures all intended dimensions.
The barplots show (1) how many items survived pruning within each construct and (2) the overall variable reduction.
par(mfrow = c(1, 2))
# Items kept per construct
construct_data <- rbind(
c(10, 10, 10),
c(satisfaction_kept, engagement_kept, loyalty_kept)
)
barplot(construct_data,
beside = TRUE,
names.arg = c("Satisfaction", "Engagement", "Loyalty"),
col = c("lightgray", "lightblue"),
legend.text = c("Original (10)", "After pruning"),
args.legend = list(x = "topright", bty = "n"),
main = "Items per Construct",
ylab = "Number of Items",
ylim = c(0, 12))
# Percentage reduction
barplot(c(ncol(survey_numeric), ncol(survey_clean)),
names.arg = c("Before", "After"),
col = c("salmon", "lightgreen"),
main = "Total Variables",
ylab = "Count",
ylim = c(0, max(ncol(survey_numeric)) * 1.2))
text(0.7, ncol(survey_numeric) + 1, ncol(survey_numeric), pos = 3)
text(1.9, ncol(survey_clean) + 1, ncol(survey_clean), pos = 3)Now we fit a regression model predicting overall satisfaction from the pruned item set. Despite using fewer predictors, the model should maintain good explanatory power because we removed only redundant items.
# Add response back
survey_model_data <- data.frame(
overall_satisfaction = survey_example$overall_satisfaction,
survey_clean
)
# Fit regression model
model_survey <- lm(overall_satisfaction ~ ., data = survey_model_data)
# Summary
summary(model_survey)
#>
#> Call:
#> lm(formula = overall_satisfaction ~ ., data = survey_model_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17.0653 -4.1416 -0.1467 4.2400 22.2405
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 24.85040 2.03665 12.202 <2e-16 ***
#> age 0.01951 0.04109 0.475 0.6354
#> satisfaction_9 4.96640 0.34387 14.443 <2e-16 ***
#> engagement_1 0.24937 0.32131 0.776 0.4386
#> loyalty_3 0.54190 0.30190 1.795 0.0742 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.372 on 195 degrees of freedom
#> Multiple R-squared: 0.6893, Adjusted R-squared: 0.683
#> F-statistic: 108.2 on 4 and 195 DF, p-value: < 2.2e-16Comparing the pruned model against the full 33-variable model shows that we retain most of the explanatory power (R²) while dramatically reducing model complexity.
# Full model (all 30 items + demographics)
full_survey_data <- data.frame(
overall_satisfaction = survey_example$overall_satisfaction,
survey_predictors
)
model_full_survey <- lm(overall_satisfaction ~ ., data = full_survey_data)
# Compare
data.frame(
Model = c("Full (33 vars)", "Pruned (10 vars)"),
R2 = c(summary(model_full_survey)$r.squared,
summary(model_survey)$r.squared),
Adj_R2 = c(summary(model_full_survey)$adj.r.squared,
summary(model_survey)$adj.r.squared),
Num_Predictors = c(33, 10)
)
#> Model R2 Adj_R2 Num_Predictors
#> 1 Full (33 vars) 0.9790144 0.7679931 33
#> 2 Pruned (10 vars) 0.6893327 0.6829600 10Goal: Reduce dimensionality in a gene expression dataset where the number of predictors far exceeds the number of samples (p >> n).
Challenge: With 200 genes and only 100 samples, exact enumeration of all maximal subsets becomes computationally expensive. Standard regression is also impossible due to rank deficiency.
Strategy: Use mode = "greedy" for fast,
approximate pruning:
data(genes_example)
# Data structure
dim(genes_example)
#> [1] 100 202
# Disease prevalence
table(genes_example$disease_status)
#>
#> Healthy Disease
#> 2 98The dataset contains gene expression measurements for 200 genes across 100 samples, with a binary disease outcome. This is a classic p >> n scenario where regularization or dimensionality reduction is essential.
The greedy algorithm iteratively selects variables, prioritizing those with the lowest maximum correlation to already-selected variables. This heuristic runs in O(p²) time compared to the exponential complexity of exact methods.
library(microbenchmark)
# Extract gene expression data (exclude ID and outcome)
gene_expr <- genes_example[, -(1:2)]
# Greedy pruning with timing
greedy_timing <- microbenchmark(
genes_pruned <- corrPrune(
data = gene_expr,
threshold = 0.8,
mode = "greedy" # Fast for large p
),
times = 3
)
greedy_ms <- median(greedy_timing$time) / 1e6
# Reduction
cat(sprintf("Reduced from %d → %d genes (%.1f ms)\n",
ncol(gene_expr),
ncol(genes_pruned),
greedy_ms))
#> Reduced from 200 → 177 genes (3.8 ms)The greedy algorithm completed in milliseconds while ensuring all pairwise correlations remain below 0.8.
# Barplot showing reduction
reduction_data <- c(ncol(gene_expr), ncol(genes_pruned))
barplot(reduction_data,
names.arg = c("Original", "After Pruning"),
main = "Gene Dimensionality Reduction",
ylab = "Number of Genes",
col = c("salmon", "lightblue"),
ylim = c(0, max(reduction_data) * 1.2))
text(0.7, reduction_data[1] + 10, paste(reduction_data[1], "genes"), pos = 3)
text(1.9, reduction_data[2] + 10, paste(reduction_data[2], "genes\n(",
round(100 * reduction_data[2] / reduction_data[1], 1), "% retained)"), pos = 3)To demonstrate the speed advantage of the greedy algorithm, we benchmark both approaches on a smaller subset. The performance gap widens dramatically as the number of variables increases.
# Subset for comparison (use smaller subset for vignette build speed)
gene_subset <- gene_expr[, 1:20] # Reduced from 50 to 20 for faster builds
# Benchmark exact mode
exact_time <- median(microbenchmark(
exact_result <- corrPrune(gene_subset, threshold = 0.8, mode = "exact"),
times = 3,
unit = "ms"
)$time) / 1e6 # Convert nanoseconds to milliseconds
# Benchmark greedy mode
greedy_time <- median(microbenchmark(
greedy_result <- corrPrune(gene_subset, threshold = 0.8, mode = "greedy"),
times = 3,
unit = "ms"
)$time) / 1e6 # Convert nanoseconds to milliseconds
# Run once more to get actual results for comparison
exact_result <- corrPrune(gene_subset, threshold = 0.8, mode = "exact")
greedy_result <- corrPrune(gene_subset, threshold = 0.8, mode = "greedy")
# Compare
cat(sprintf("Exact mode: %d genes kept (%.1f ms)\n", ncol(exact_result), exact_time))
#> Exact mode: 11 genes kept (1.8 ms)
cat(sprintf("Greedy mode: %d genes kept (%.1f ms)\n", ncol(greedy_result), greedy_time))
#> Greedy mode: 10 genes kept (0.4 ms)
cat(sprintf("Speedup: %.1fx faster\n", exact_time / greedy_time))
#> Speedup: 4.9x fasterThe greedy mode is substantially faster. For the full 200-gene dataset, exact enumeration would be prohibitively slow, while greedy mode completes in milliseconds.
Finally, we demonstrate that the pruned gene set is suitable for downstream classification, despite the dramatic dimensionality reduction.
# Prepare classification data
classification_data <- data.frame(
disease_status = genes_example$disease_status,
genes_pruned
)
# Logistic regression
model_genes <- glm(disease_status ~ .,
data = classification_data,
family = binomial())
# Prediction accuracy
predictions <- ifelse(predict(model_genes, type = "response") > 0.5,
"Disease", "Healthy")
accuracy <- mean(predictions == genes_example$disease_status)
cat(sprintf("Classification accuracy: %.1f%%\n", accuracy * 100))
#> Classification accuracy: 100.0%Goal: Apply correlation-based pruning to fixed effects in a mixed-effects model with longitudinal data.
Challenge: Longitudinal data has hierarchical structure (observations nested within subjects, subjects nested within sites). Standard VIF calculations don’t account for random effects, but we still need to control multicollinearity among fixed-effect predictors.
Strategy: Use modelPrune() with
engine = "lme4":
(1|subject) and (1|site)
are preserved in the model formulaNote: This workflow requires the lme4
package and is shown with eval=FALSE for portability.
data(longitudinal_example)
# Data structure
dim(longitudinal_example)
#> [1] 500 25
head(longitudinal_example)
#> obs_id subject site time outcome x1 x2 x3
#> 1 1 1 1 1 12.10893 -0.62045078 -0.8629274 -0.39625483
#> 2 2 1 1 2 13.97195 0.04255998 0.4086801 0.27069591
#> 3 3 1 1 3 15.71622 0.20445067 -1.0457672 1.69557480
#> 4 4 1 1 4 12.61343 0.38437289 2.0301256 -1.17894923
#> 5 5 1 1 5 15.73139 -0.06402543 1.1645219 -0.19757260
#> 6 6 1 1 6 13.19571 0.65208588 -1.5040345 0.07671762
#> x4 x5 x6 x7 x8 x9
#> 1 -0.4557709 -1.2550770 -0.35966846 -1.9288176 0.8393628 0.151111346
#> 2 -0.7652493 -1.2136755 -0.05181211 -0.4759841 0.9063848 1.429621637
#> 3 0.5569423 -1.4962408 0.23986408 -1.7077822 1.0343806 0.797974670
#> 4 0.1604224 -0.8192580 -0.94858863 -1.9314095 -0.0639552 0.887413164
#> 5 -1.3331022 -0.6162473 -0.61339605 -0.9439486 -0.1220047 -0.796244469
#> 6 0.5205094 -0.8547461 0.24120308 -1.3083776 0.8716209 0.005423635
#> x10 x11 x12 x13 x14 x15 x16
#> 1 0.9681971 -0.1384342 0.2598923 0.37082493 -0.1829715 -1.1853721 -0.72703262
#> 2 2.0989531 0.6303914 0.6211064 -0.91103843 -0.1125705 -0.3458058 -0.44331165
#> 3 0.2329831 -0.8618361 2.0854289 0.02634099 -1.3522055 -1.2151804 0.00311954
#> 4 1.5491815 1.7595441 0.8198899 0.18115372 -0.5638401 -1.3637072 -0.55274223
#> 5 2.3485234 -0.1267969 2.6230203 -1.13918808 -1.1559678 -0.3214032 -0.87713586
#> 6 -0.1860167 -0.4620834 1.6075140 -1.81586413 -0.0172986 -0.1041232 -0.56268259
#> x17 x18 x19 x20
#> 1 0.91333399 1.2398417 1.2973339 1.0590334
#> 2 -0.09799793 -1.3670758 -1.0050198 0.9990451
#> 3 -0.14957945 -0.8573087 0.9004974 0.5399732
#> 4 0.55650397 0.7458841 0.7415159 1.7067212
#> 5 0.36401595 -0.4019171 1.1669136 0.7799264
#> 6 0.30234421 -0.6646296 -0.6615620 1.5092899
# Study design
cat(sprintf("Subjects: %d\n", length(unique(longitudinal_example$subject))))
#> Subjects: 50
cat(sprintf("Sites: %d\n", length(unique(longitudinal_example$site))))
#> Sites: 5
cat(sprintf("Observations per subject: %d\n",
nrow(longitudinal_example) / length(unique(longitudinal_example$subject))))
#> Observations per subject: 10The dataset has 500 observations from 50 subjects across 5 sites, with 10 measurements per subject. We have 5 correlated fixed-effect predictors (x1-x5).
The modelPrune() function with
engine = "lme4" respects the random-effects structure. Only
the fixed effects (x1-x5) are candidates for removal; the random
intercepts for subject and site remain untouched.
# Note: This example requires lme4 package
library(lme4)
# Define formula with random effects
# Note: Only fixed effects (x1-x5) will be pruned
# Random effects (1|subject), (1|site) are preserved
pruned_mixed <- modelPrune(
formula = outcome ~ x1 + x2 + x3 + x4 + x5 + (1|subject) + (1|site),
data = longitudinal_example,
engine = "lme4",
limit = 5
)
# Which fixed effects were kept?
selected_fixed <- attr(pruned_mixed, "selected_vars")
cat("Fixed effects kept:\n")
print(selected_fixed)
# Which were removed?
removed_fixed <- attr(pruned_mixed, "removed_vars")
cat("\nFixed effects removed:\n")
print(removed_fixed)The algorithm sequentially removes fixed effects with VIF > 5 until all remaining predictors satisfy the limit. The random effects structure is never modified.
The final model contains only the fixed effects that passed the VIF threshold, along with the original random effects.
We can manually verify that the pruning successfully reduced multicollinearity by comparing VIF values before and after pruning.
# Note: This example requires lme4 package
library(lme4)
# Fit full model
full_formula <- as.formula(paste("outcome ~",
paste(paste0("x", 1:5), collapse = " + "),
"+ (1|subject) + (1|site)"))
model_full_mixed <- lmer(full_formula, data = longitudinal_example)
# Extract fixed effects design matrices
X_full <- getME(model_full_mixed, "X")
X_pruned <- getME(final_mixed, "X")
# Compute VIF
compute_vif <- function(X) {
X_scaled <- scale(X[, -1]) # Remove intercept
sapply(seq_len(ncol(X_scaled)), function(i) {
r2 <- summary(lm(X_scaled[, i] ~ X_scaled[, -i]))$r.squared
1 / (1 - r2)
})
}
vif_full <- compute_vif(X_full)
vif_pruned <- compute_vif(X_pruned)
# Compare
comparison_vif <- data.frame(
Predictor = colnames(X_pruned)[-1],
VIF_Before = vif_full,
VIF_After = vif_pruned
)
print(comparison_vif)All remaining predictors now have VIF < 5, indicating acceptable multicollinearity levels for mixed-effects modeling.
The plot below shows how pruning reduced VIF values across the retained fixed effects:
# Extract VIF values
predictors <- comparison_vif$Predictor
vif_before <- comparison_vif$VIF_Before
vif_after <- comparison_vif$VIF_After
# Set up bar positions
x <- seq_along(predictors)
width <- 0.35
# Create plot
par(mar = c(5, 4, 4, 2))
plot(
x, vif_before,
type = "n",
xaxt = "n",
xlab = "Fixed Effects",
ylab = "VIF",
main = "VIF Reduction After Pruning (Mixed Model)",
ylim = c(0, max(vif_before) * 1.15)
)
# Add horizontal line at VIF = 5 threshold
abline(h = 5, col = "red", lty = 2, lwd = 2)
# Before bars (darker)
rect(
x - width, 0,
x, vif_before,
col = rgb(0.8, 0.2, 0.2, 0.6),
border = "white"
)
# After bars (lighter)
rect(
x, 0,
x + width, vif_after,
col = rgb(0.2, 0.5, 0.8, 0.6),
border = "white"
)
# Add x-axis labels
axis(1, at = x, labels = predictors, las = 2)
# Add legend
legend(
"topright",
legend = c("Before Pruning", "After Pruning", "VIF = 5 Threshold"),
fill = c(rgb(0.8, 0.2, 0.2, 0.6), rgb(0.2, 0.5, 0.8, 0.6), NA),
border = c("white", "white", NA),
lty = c(NA, NA, 2),
lwd = c(NA, NA, 2),
col = c(NA, NA, "red"),
bty = "o",
bg = "white"
)These four workflows demonstrate how corrselect integrates into diverse analytical pipelines:
Key takeaways:
force_in protects
theoretically important variables while reducing redundancyWorkflow selection guide:
| Scenario | Function | Key Parameters |
|---|---|---|
| Small to moderate p (< 50) | corrPrune() |
mode = "auto" or "exact" |
| High-dimensional (p >> n) | corrPrune() |
mode = "greedy" |
| Model-based refinement | modelPrune() |
limit (VIF threshold) |
| Protected variables | corrPrune() |
force_in |
| Mixed-effects models | modelPrune() |
engine = "lme4" |
When to use each approach:
vignette("quickstart") - Interface overviewvignette("advanced") - Custom engines and algorithmic
controlvignette("comparison") - Comparison with
alternativesvignette("theory") - Mathematical foundationsThresholds:
Methods: - See package documentation and JOSS paper for algorithm details
sessionInfo()
#> R version 4.5.1 (2025-06-13 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Europe/Luxembourg
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] igraph_2.1.4 car_3.1-3 carData_3.0-5
#> [4] microbenchmark_1.5.0 corrselect_3.0.2
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 timeDate_4051.111 dplyr_1.1.4
#> [4] farver_2.1.2 S7_0.2.0 fastmap_1.2.0
#> [7] pROC_1.19.0.1 caret_7.0-1 digest_0.6.37
#> [10] rpart_4.1.24 timechange_0.3.0 lifecycle_1.0.4
#> [13] survival_3.8-3 magrittr_2.0.4 compiler_4.5.1
#> [16] rlang_1.1.6 sass_0.4.10 tools_4.5.1
#> [19] yaml_2.3.10 Boruta_9.0.0 data.table_1.17.8
#> [22] knitr_1.50 plyr_1.8.9 RColorBrewer_1.1-3
#> [25] abind_1.4-8 withr_3.0.2 purrr_1.2.0
#> [28] nnet_7.3-20 grid_4.5.1 stats4_4.5.1
#> [31] future_1.67.0 ggplot2_4.0.0 globals_0.18.0
#> [34] scales_1.4.0 iterators_1.0.14 MASS_7.3-65
#> [37] cli_3.6.5 rmarkdown_2.30 generics_0.1.4
#> [40] rstudioapi_0.17.1 future.apply_1.20.0 reshape2_1.4.4
#> [43] cachem_1.1.0 stringr_1.5.2 splines_4.5.1
#> [46] parallel_4.5.1 vctrs_0.6.5 hardhat_1.4.2
#> [49] glmnet_4.1-10 Matrix_1.7-4 jsonlite_2.0.0
#> [52] Formula_1.2-5 listenv_0.9.1 systemfonts_1.3.1
#> [55] foreach_1.5.2 gower_1.0.2 jquerylib_0.1.4
#> [58] recipes_1.3.1 glue_1.8.0 parallelly_1.45.1
#> [61] codetools_0.2-20 lubridate_1.9.4 stringi_1.8.7
#> [64] gtable_0.3.6 shape_1.4.6.1 tibble_3.3.0
#> [67] pillar_1.11.1 htmltools_0.5.8.1 ipred_0.9-15
#> [70] lava_1.8.2 R6_2.6.1 textshaping_1.0.3
#> [73] evaluate_1.0.5 lattice_0.22-7 bslib_0.9.0
#> [76] class_7.3-23 Rcpp_1.1.0 svglite_2.2.2
#> [79] nlme_3.1-168 prodlim_2025.04.28 ranger_0.17.0
#> [82] xfun_0.53 ModelMetrics_1.2.2.2 pkgconfig_2.0.3