# Install from CRAN
install.packages("corrselect")
# Or install development version from GitHub
# install.packages("devtools")
devtools::install_github("GillesColling/corrselect")

Suggested packages (for extended functionality):
- lme4, glmmTMB: Mixed-effects models in modelPrune()
- WGCNA: Biweight midcorrelation (bicor)
- energy: Distance correlation
- minerva: Maximal information coefficient

corrselect identifies and removes redundant variables based on pairwise correlation or association. Given a threshold \(\tau\), it finds subsets where all pairwise associations satisfy \(|a_{ij}| < \tau\) (see vignette("theory") for mathematical formulation).
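The threshold condition can be checked by hand with base R. The sketch below assumes Pearson correlation as the association measure; `satisfies_threshold` is an illustrative helper, not a corrselect function.

```r
# Absolute Pearson correlations between all mtcars columns
a <- abs(cor(mtcars))

# TRUE when every distinct pair in `vars` has |a_ij| < tau
# (illustrative helper, not part of corrselect)
satisfies_threshold <- function(vars, a, tau) {
  sub <- a[vars, vars, drop = FALSE]
  all(sub[upper.tri(sub)] < tau)
}

satisfies_threshold(c("mpg", "drat", "qsec", "gear", "carb"), a, tau = 0.7)
satisfies_threshold(c("mpg", "wt"), a, tau = 0.7)
```

The first call checks a five-variable subset in which no pair reaches 0.7; the second fails because mpg and wt are strongly negatively correlated.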
corrselect provides three levels of interface, exposed through five functions:
- corrPrune() - Removes redundant predictors based on pairwise correlation
- modelPrune() - Reduces VIF in regression models
- corrSelect() - Returns all maximal subsets (numeric data; see vignette("theory"))
- assocSelect() - Returns all maximal subsets (mixed-type data)
- MatSelect() - Direct matrix input

data(mtcars)
# Remove correlated predictors (threshold = 0.7)
pruned <- corrPrune(mtcars, threshold = 0.7)
# Results
cat(sprintf("Reduced from %d to %d variables\n", ncol(mtcars), ncol(pruned)))
#> Reduced from 11 to 5 variables
names(pruned)
#> [1] "mpg" "drat" "qsec" "gear" "carb"

Variables removed (the mtcars columns absent from the output above): cyl, disp, hp, wt, vs, am
How corrPrune() selects among multiple maximal subsets:
When multiple maximal subsets exist (which is common),
corrPrune() returns the subset with the lowest
average absolute correlation. This selection criterion balances
three goals:
To explore all maximal subsets instead of just the
optimal one, use corrSelect() (see below).
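To make the "average absolute correlation" score concrete, here is a base R sketch that computes it for two candidate subsets taken from the corrSelect() listing shown in this document. `avg_abs_cor` is an illustrative helper, not a corrselect function, and Pearson correlation is assumed.

```r
a <- abs(cor(mtcars))

# Mean absolute correlation over the distinct pairs of a subset
# (illustrative helper, not part of corrselect)
avg_abs_cor <- function(vars, a) {
  sub <- a[vars, vars, drop = FALSE]
  mean(sub[upper.tri(sub)])
}

candidates <- list(
  c("mpg", "drat", "qsec", "gear", "carb"),
  c("cyl", "drat", "qsec", "gear", "carb")
)
scores <- vapply(candidates, avg_abs_cor, numeric(1), a = a)
round(scores, 3)

# Candidate with the lower score
candidates[[which.min(scores)]]
```

Both candidates contain five variables. The corrSelect() listing also shows four-variable subsets with lower averages (for example 0.373), so subset size presumably enters the choice as well; that is an inference from the printed output, not a documented rule.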
# Prune based on VIF (limit = 5)
model_data <- modelPrune(
  formula = mpg ~ .,
  data = mtcars,
  limit = 5
)
# Results
cat("Variables kept:", paste(attr(model_data, "selected_vars"), collapse = ", "), "\n")
#> Variables kept: drat, qsec, vs, am, gear, carb
cat("Variables removed:", paste(attr(model_data, "removed_vars"), collapse = ", "), "\n")
#> Variables removed: disp, cyl, wt, hp

results <- corrSelect(mtcars, threshold = 0.7)
show(results)
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: pearson
#> Threshold: 0.700
#> Subsets: 15 maximal subsets
#> Data Rows: 32 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] mpg, drat, qsec, gear, carb 0.416 0.700 5
#> [ 2] cyl, drat, qsec, gear, carb 0.434 0.700 5
#> [ 3] mpg, drat, vs, gear, carb 0.466 0.700 5
#> [ 4] wt, qsec, am, carb 0.373 0.692 4
#> [ 5] wt, qsec, gear, carb 0.388 0.656 4
#> ... (10 more combinations)

Inspect subsets:
as.data.frame(results)[1:5, ] # First 5 subsets
#> VarName01 VarName02 VarName03 VarName04 VarName05
#> Subset01 [avg=0.416] mpg drat qsec gear carb
#> Subset02 [avg=0.434] cyl drat qsec gear carb
#> Subset03 [avg=0.466] mpg drat vs gear carb
#> Subset04 [avg=0.373] wt qsec am carb <NA>
#> Subset05 [avg=0.388] wt qsec gear carb <NA>

Extract a specific subset:
# Create mixed-type data
df <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  cat1 = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  ord1 = ordered(sample(1:5, 100, replace = TRUE))
)
# Handle mixed types automatically
results_mixed <- assocSelect(df, threshold = 0.5)
show(results_mixed)
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: mixed
#> AssocMethod: numeric_numeric = pearson, numeric_factor = eta, numeric_ordered
#> = spearman, factor_ordered = cramersv
#> Threshold: 0.500
#> Subsets: 1 maximal subsets
#> Data Rows: 100 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] x1, x2, cat1, ord1 0.103 0.198 4
# Verify all pairwise associations are below threshold
cat("Max pairwise association:", max(results_mixed@max_corr), "\n")
#> Max pairwise association: 0.1981817

Use force_in to ensure specific variables are always retained:
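A minimal sketch of that usage, assuming the corrPrune() signature listed in the function reference. The choice of protected variables is hypothetical, and the package call is guarded so the snippet still runs when corrselect is not installed. The base R pre-check reflects the requirement that forced variables must themselves stay below the threshold.

```r
a    <- abs(cor(mtcars))
keep <- c("mpg", "qsec")   # hypothetical choice of protected variables
tau  <- 0.7

# Forced variables must be mutually compatible at this threshold
sub       <- a[keep, keep, drop = FALSE]
forced_ok <- all(sub[upper.tri(sub)] < tau)
forced_ok

if (forced_ok && requireNamespace("corrselect", quietly = TRUE)) {
  pruned_forced <- corrselect::corrPrune(mtcars, threshold = tau, force_in = keep)
  print(names(pruned_forced))
}
```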
Common thresholds: 0.5 (strict), 0.7 (moderate, recommended default), 0.9 (lenient).
Lower thresholds are stricter because they allow fewer variable pairs to coexist, resulting in smaller subsets. Higher thresholds permit stronger correlations, retaining more variables.
For detailed threshold selection strategies including visualization
techniques, VIF guidelines, and sensitivity analysis, see
vignette("advanced").
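The effect of the threshold can be seen by counting how many variable pairs are allowed to coexist at each common setting. This is a base R sketch on mtcars with Pearson correlation; it does not call corrselect.

```r
a <- abs(cor(mtcars))
pair_vals <- a[upper.tri(a)]   # one value per distinct variable pair

thresholds <- c(0.5, 0.7, 0.9)
allowed <- vapply(thresholds, function(tau) sum(pair_vals < tau), integer(1))

data.frame(
  threshold     = thresholds,
  pairs_allowed = allowed,
  pairs_total   = length(pair_vals)
)
```

The number of allowed pairs can only grow as the threshold rises, which is why higher thresholds tend to retain more variables.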
| Scenario | Function | Key Parameters |
|---|---|---|
| Quick dimensionality reduction | corrPrune() | threshold, mode |
| Model-based refinement | modelPrune() | limit (VIF threshold), engine |
| Enumerate all maximal subsets | corrSelect() | threshold |
| Mixed-type data | assocSelect() | threshold |
| Precomputed matrices | MatSelect() | threshold, method |
| Protect key variables | Any function | force_in |
Removes redundant predictors based on pairwise correlation.
corrPrune(data, threshold = 0.7, measure = "auto", mode = "auto",
          force_in = NULL, by = NULL, group_q = 1, max_exact_p = 100)

| Parameter | Description | Default |
|---|---|---|
| data | Data frame or matrix | required |
| threshold | Maximum allowed correlation | 0.7 |
| measure | Correlation type: "auto", "pearson", "spearman", "kendall" | "auto" |
| mode | Algorithm: "auto", "exact", "greedy" | "auto" |
| force_in | Variables that must be retained | NULL |

Returns: Data frame with pruned variables.
Attributes: selected_vars, removed_vars.
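For intuition about what a greedy mode might do, the sketch below repeatedly drops the variable involved in the most threshold violations until none remain. This is an illustration in base R under assumed rules (Pearson correlation, ties broken by mean absolute correlation); it is not the algorithm corrPrune() uses internally, and `greedy_prune` is a hypothetical name.

```r
# Illustrative greedy pruning; NOT corrPrune()'s internal algorithm
greedy_prune <- function(data, threshold = 0.7) {
  a <- abs(cor(data))
  diag(a) <- 0
  keep <- colnames(a)
  while (length(keep) > 1 && max(a[keep, keep]) >= threshold) {
    sub <- a[keep, keep]
    # Count violations per variable; break ties by mean |r|
    n_viol <- rowSums(sub >= threshold)
    worst  <- order(-n_viol, -rowMeans(sub))[1]
    keep   <- keep[-worst]
  }
  keep
}

kept <- greedy_prune(mtcars, threshold = 0.7)
kept
```

A greedy pass like this is fast but offers no guarantee of returning the largest admissible subset, which is the usual trade-off against an exact search.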
Iteratively removes predictors with high VIF from a regression model.
modelPrune(formula, data, engine = "lm", criterion = "vif",
           limit = 5, force_in = NULL, max_steps = NULL, ...)

| Parameter | Description | Default |
|---|---|---|
| formula | Model formula (e.g., y ~ .) | required |
| data | Data frame | required |
| engine | "lm", "glm", "lme4", "glmmTMB", or custom | "lm" |
| limit | Maximum allowed VIF | 5 |
| force_in | Variables that must be retained | NULL |

Returns: Pruned data frame. Attributes:
selected_vars, removed_vars,
final_model.
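The idea of iterative VIF pruning can be sketched in base R for a linear model with numeric predictors. The helpers `vif_values` and `vif_prune` are illustrative names; modelPrune()'s exact procedure (engines, tie handling, force_in) may differ.

```r
# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
# of the inverse of the predictors' correlation matrix
vif_values <- function(x) {
  v <- diag(solve(cor(x)))
  names(v) <- colnames(x)
  v
}

# Illustrative stepwise pruning: drop the highest-VIF predictor
# until every remaining VIF is at or below the limit
vif_prune <- function(x, limit = 5) {
  removed <- character(0)
  while (ncol(x) > 1 && max(vif_values(x)) > limit) {
    worst   <- names(which.max(vif_values(x)))
    removed <- c(removed, worst)
    x       <- x[, setdiff(colnames(x), worst), drop = FALSE]
  }
  list(selected = colnames(x), removed = removed)
}

predictors <- mtcars[, setdiff(names(mtcars), "mpg")]
result <- vif_prune(predictors, limit = 5)
result
```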
corrSelect() enumerates all maximal subsets satisfying the correlation threshold (numeric data).

| Parameter | Description | Default |
|---|---|---|
| df | Data frame (numeric columns only) | required |
| threshold | Maximum allowed correlation | 0.7 |
| method | Algorithm: "bron-kerbosch", "els" | auto |
| cor_method | "pearson", "spearman", "kendall", "bicor", "distance", "maximal" | "pearson" |
| force_in | Variables required in all subsets | NULL |

Returns: CorrCombo S4 object with
slots: subset_list, avg_corr,
min_corr, max_corr.
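What "maximal subset" means can be illustrated with a brute-force search in base R, which is feasible for the 11 mtcars columns (2047 non-empty subsets). This sketch assumes Pearson correlation and the strict inequality \(|a_{ij}| < \tau\); it is not the Bron-Kerbosch or ELS implementation used by the package.

```r
a <- abs(cor(mtcars))
diag(a) <- 0                   # ignore self-correlation
p   <- ncol(a)
tau <- 0.7

is_valid <- function(idx) all(a[idx, idx, drop = FALSE] < tau)

# A valid subset is maximal if no remaining variable can be added to it
is_maximal <- function(idx) {
  rest <- setdiff(seq_len(p), idx)
  is_valid(idx) &&
    !any(vapply(rest, function(j) is_valid(c(idx, j)), logical(1)))
}

# Decode each integer 1..(2^p - 1) into a set of column indices
subsets <- lapply(seq_len(2^p - 1), function(m) {
  which(as.integer(intToBits(m))[seq_len(p)] == 1L)
})
maximal <- Filter(is_maximal, subsets)

length(maximal)
lapply(maximal[1:3], function(idx) colnames(a)[idx])
```

If the package applies the same strict inequality, the count should agree with the number of maximal subsets reported in the corrSelect() output shown earlier, though that has not been verified here. Exhaustive search doubles in cost with every added variable, which is why dedicated clique-enumeration algorithms are used in practice.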
Enumerates all maximal subsets for mixed-type data (numeric, factor, ordered).
assocSelect(df, threshold = 0.7, method = NULL, force_in = NULL,
            method_num_num = "pearson", method_num_ord = "spearman",
            method_ord_ord = "spearman", ...)

| Parameter | Description | Default |
|---|---|---|
| df | Data frame (any column types) | required |
| threshold | Maximum allowed association | 0.7 |
| method_num_num | Numeric-numeric: "pearson", "spearman", etc. | "pearson" |
| method_num_ord | Numeric-ordered: "spearman", "kendall" | "spearman" |
| method_ord_ord | Ordered-ordered: "spearman", "kendall" | "spearman" |

Returns: CorrCombo S4 object.
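The assocSelect() output earlier in this document lists eta as the numeric-factor measure. As a rough illustration of that kind of measure, the sketch below computes the correlation ratio in base R using its textbook definition; the package's own implementation may differ in detail, and `eta` here is an illustrative helper.

```r
# Correlation ratio (eta) between a numeric vector and a factor:
# sqrt(between-group sum of squares / total sum of squares)
eta <- function(x, g) {
  g <- droplevels(as.factor(g))
  group_means <- as.numeric(tapply(x, g, mean))
  group_sizes <- as.numeric(table(g))
  ss_between  <- sum(group_sizes * (group_means - mean(x))^2)
  ss_total    <- sum((x - mean(x))^2)
  sqrt(ss_between / ss_total)
}

# Association between fuel economy and cylinder count treated as a factor
eta(mtcars$mpg, mtcars$cyl)
```

Like an absolute correlation, eta lies between 0 and 1, so it can be compared against the same threshold as the numeric-numeric measures.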
MatSelect() is the direct matrix interface for precomputed correlation/association matrices.

| Parameter | Description | Default |
|---|---|---|
| mat | Symmetric correlation/association matrix | required |
| threshold | Maximum allowed value | 0.7 |
| method | Algorithm: "bron-kerbosch", "els" | auto |
| force_in | Variables required in all subsets | NULL |

Returns: CorrCombo S4 object.
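A sketch of preparing a precomputed matrix, assuming the MatSelect() parameters listed in the table above. The sanity checks are plain base R; the package call is guarded so the snippet runs even when corrselect is not installed. Spearman correlation is an arbitrary choice for the example.

```r
# Precompute an association matrix with any method you like
mat <- cor(mtcars, method = "spearman")

# Basic sanity checks before handing the matrix over
mat_ok <- isSymmetric(mat) &&
  !is.null(colnames(mat)) &&
  all(abs(diag(mat) - 1) < 1e-8)
mat_ok

if (mat_ok && requireNamespace("corrselect", quietly = TRUE)) {
  res_mat <- corrselect::MatSelect(mat, threshold = 0.7)
  methods::show(res_mat)
}
```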
Extracts a specific subset from a CorrCombo result.
| Parameter | Description | Default |
|---|---|---|
| res | CorrCombo object from corrSelect/assocSelect/MatSelect | required |
| df | Original data frame | required |
| which | Subset index or "best" (lowest avg correlation) | "best" |
| keepExtra | Include non-numeric columns in output? | FALSE |

Returns: Data frame containing only the selected variables.
"No valid subsets found" error
- Threshold too strict: all variable pairs exceed it
- Solution: Increase threshold or use force_in to keep at least one variable

VIF computation fails in modelPrune()
- Perfect multicollinearity (R² = 1) present
- Solution: Use corrPrune(threshold = 0.99) first to remove near-duplicates

Forced variables conflict
- Variables in force_in are too highly correlated with each other
- Solution: Increase threshold or reduce force_in set

Slow performance with many variables
- Exact mode is exponential for large p
- Solution: Use mode = "greedy" for p > 25

For comprehensive troubleshooting with code examples, see
vignette("advanced"), Section 5.
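Perfect multicollinearity can be detected before fitting a model. The base R sketch below builds a deliberately duplicated predictor (a hypothetical `wt_kg` column) and shows two quick checks: a rank test on the predictor matrix and a scan for pairs with absolute correlation effectively equal to 1.

```r
x <- mtcars[, c("disp", "hp", "wt")]
x$wt_kg <- x$wt * 453.592      # exact linear copy of wt (1000 lb -> kg)

# Rank deficiency means some column is a linear combination of others
rank_x <- qr(as.matrix(x))$rank
rank_x < ncol(x)

# Pairs with |r| effectively equal to 1 are the usual culprits
a <- abs(cor(x))
diag(a) <- 0
which(a > 1 - 1e-8, arr.ind = TRUE)
```

The rank test also catches dependencies involving three or more columns, which a pairwise correlation scan would miss.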
- vignette("workflows") - Complete real-world workflows (ecological, survey, genomic, mixed models)
- vignette("advanced") - Algorithmic control and custom engines
- vignette("comparison") - Comparison with caret, Boruta, glmnet
- vignette("theory") - Theoretical foundations and formulation
- ?corrPrune, ?modelPrune, ?corrSelect, ?assocSelect, ?MatSelect

sessionInfo()
#> R version 4.5.1 (2025-06-13 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Europe/Luxembourg
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] car_3.1-3 carData_3.0-5 microbenchmark_1.5.0
#> [4] corrselect_3.0.2
#>
#> loaded via a namespace (and not attached):
#> [1] shape_1.4.6.1 gtable_0.3.6 xfun_0.53
#> [4] bslib_0.9.0 ggplot2_4.0.0 recipes_1.3.1
#> [7] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1
#> [10] generics_0.1.4 stats4_4.5.1 parallel_4.5.1
#> [13] tibble_3.3.0 pkgconfig_2.0.3 ModelMetrics_1.2.2.2
#> [16] Matrix_1.7-4 data.table_1.17.8 RColorBrewer_1.1-3
#> [19] S7_0.2.0 lifecycle_1.0.4 compiler_4.5.1
#> [22] farver_2.1.2 stringr_1.5.2 textshaping_1.0.3
#> [25] codetools_0.2-20 Boruta_9.0.0 htmltools_0.5.8.1
#> [28] class_7.3-23 sass_0.4.10 glmnet_4.1-10
#> [31] yaml_2.3.10 Formula_1.2-5 prodlim_2025.04.28
#> [34] pillar_1.11.1 jquerylib_0.1.4 MASS_7.3-65
#> [37] cachem_1.1.0 gower_1.0.2 iterators_1.0.14
#> [40] abind_1.4-8 rpart_4.1.24 foreach_1.5.2
#> [43] nlme_3.1-168 parallelly_1.45.1 lava_1.8.2
#> [46] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7
#> [49] future_1.67.0 dplyr_1.1.4 reshape2_1.4.4
#> [52] purrr_1.2.0 listenv_0.9.1 splines_4.5.1
#> [55] fastmap_1.2.0 grid_4.5.1 cli_3.6.5
#> [58] magrittr_2.0.4 survival_3.8-3 future.apply_1.20.0
#> [61] withr_3.0.2 scales_1.4.0 lubridate_1.9.4
#> [64] timechange_0.3.0 rmarkdown_2.30 globals_0.18.0
#> [67] nnet_7.3-20 timeDate_4051.111 ranger_0.17.0
#> [70] evaluate_1.0.5 knitr_1.50 hardhat_1.4.2
#> [73] caret_7.0-1 rlang_1.1.6 Rcpp_1.1.0
#> [76] glue_1.8.0 pROC_1.19.0.1 ipred_0.9-15
#> [79] svglite_2.2.2 rstudioapi_0.17.1 jsonlite_2.0.0
#> [82] R6_2.6.1 plyr_1.8.9 systemfonts_1.3.1