Quick Start

Gilles Colling

2025-11-28

Installation

# Install from CRAN
install.packages("corrselect")

# Or install development version from GitHub
# install.packages("devtools")
devtools::install_github("GillesColling/corrselect")

Suggested packages (for extended functionality):

What corrselect Does

corrselect identifies and removes redundant variables based on pairwise correlation or association. Given a threshold \(\tau\), it finds subsets where all pairwise associations satisfy \(|a_{ij}| < \tau\) (see vignette("theory") for mathematical formulation).
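To make the threshold condition concrete, here is a minimal base-R sketch that checks whether a given set of variables satisfies \(|a_{ij}| < \tau\) for every pair. It uses only cor() and Pearson correlation; satisfies_threshold() is a hypothetical helper written for illustration and is not part of corrselect.

```r
# Hypothetical helper (not a corrselect function): TRUE if every pairwise
# absolute Pearson correlation among `vars` is strictly below `tau`.
satisfies_threshold <- function(data, vars, tau) {
  a <- abs(cor(data[, vars, drop = FALSE]))
  all(a[upper.tri(a)] < tau)   # upper triangle = each pair counted once
}

data(mtcars)

# A subset in which all pairs stay below 0.7
ok <- satisfies_threshold(mtcars, c("mpg", "drat", "qsec", "gear", "carb"), 0.7)

# cyl and disp are strongly correlated, so this pair violates the threshold
bad <- satisfies_threshold(mtcars, c("cyl", "disp"), 0.7)

c(ok = ok, bad = bad)
```

A subset is "maximal" when no further variable can be added without breaking this check.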

Interface Hierarchy

corrselect provides three levels of interface:

Level 1: Simple Pruning

corrPrune() - Removes redundant predictors based on pairwise correlation.

modelPrune() - Reduces VIF in regression models.

Level 2: Structured Subset Selection

corrSelect() - Returns all maximal subsets (numeric data).

assocSelect() - Returns all maximal subsets (mixed-type data).

Level 3: Low-Level Matrix Interface

MatSelect() - Direct matrix input.

Quick Examples

corrPrune(): Association-Based Pruning

data(mtcars)

# Remove correlated predictors (threshold = 0.7)
pruned <- corrPrune(mtcars, threshold = 0.7)

# Results
cat(sprintf("Reduced from %d to %d variables\n", ncol(mtcars), ncol(pruned)))
#> Reduced from 11 to 5 variables
names(pruned)
#> [1] "mpg"  "drat" "qsec" "gear" "carb"

Variables removed:

attr(pruned, "removed_vars")
#> [1] "cyl"  "disp" "hp"   "wt"   "vs"   "am"

How corrPrune() selects among multiple maximal subsets:

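When multiple maximal subsets exist (which is common), corrPrune() returns a single one. In the mtcars example above it is the 5-variable subset with the lowest average absolute correlation (0.416), even though some 4-variable subsets have a lower average (see the corrSelect() output below); subset size therefore appears to take precedence, with average absolute correlation deciding among subsets of equal size. This selection criterion balances three goals: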

  1. Minimize redundancy: Lower average correlation means more independent variables
  2. Maximize information: Prefers diverse variable combinations over tightly clustered ones
  3. Deterministic behavior: Always returns the same result for the same data

To explore all maximal subsets instead of just the optimal one, use corrSelect() (see below).
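As a rough illustration of the selection criterion, the sketch below computes the average absolute correlation in base R for three of the subsets listed in the corrSelect() output further down. Sorting by size first and average second is an assumption chosen to mirror the order of that printed listing, not a confirmed description of corrPrune() internals; avg_abs_cor() is a hypothetical helper.

```r
# Hypothetical helper: mean absolute Pearson correlation over all pairs
avg_abs_cor <- function(data, vars) {
  a <- abs(cor(data[, vars, drop = FALSE]))
  mean(a[upper.tri(a)])
}

data(mtcars)

# Three candidate subsets copied from the corrSelect() listing
candidates <- list(
  c("mpg", "drat", "qsec", "gear", "carb"),
  c("cyl", "drat", "qsec", "gear", "carb"),
  c("wt", "qsec", "am", "carb")
)

avgs <- vapply(candidates, function(v) avg_abs_cor(mtcars, v), numeric(1))

summary_tbl <- data.frame(
  subset = vapply(candidates, paste, character(1), collapse = ", "),
  size   = lengths(candidates),
  avg    = round(avgs, 3)
)

# Assumed ranking: larger subsets first, lower average breaks ties
ranked <- summary_tbl[order(-summary_tbl$size, summary_tbl$avg), ]
ranked
```

Under this ordering the first row is the subset that corrPrune() returned in the example above.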

modelPrune(): VIF-Based Pruning

# Prune based on VIF (limit = 5)
model_data <- modelPrune(
  formula = mpg ~ .,
  data = mtcars,
  limit = 5
)

# Results
cat("Variables kept:", paste(attr(model_data, "selected_vars"), collapse = ", "), "\n")
#> Variables kept: drat, qsec, vs, am, gear, carb
cat("Variables removed:", paste(attr(model_data, "removed_vars"), collapse = ", "), "\n")
#> Variables removed: disp, cyl, wt, hp
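For readers who want to see what a VIF limit means numerically, the following base-R sketch computes variance inflation factors from the textbook definition \(\mathrm{VIF}_j = 1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. It is a sketch of the standard formula only; how modelPrune() computes VIF internally is not shown here, and vif_by_hand() is a hypothetical helper.

```r
# Hypothetical helper: VIF for each predictor via auxiliary regressions
vif_by_hand <- function(data, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on all remaining predictors
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

data(mtcars)
predictors <- setdiff(names(mtcars), "mpg")   # all columns except the response
vifs <- vif_by_hand(mtcars, predictors)

# Predictors sorted from most to least inflated
round(sort(vifs, decreasing = TRUE), 2)
```

An iterative pruning scheme would typically drop the predictor with the largest VIF, recompute, and repeat until every value is at or below the limit.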

corrSelect(): Enumerate All Maximal Subsets

results <- corrSelect(mtcars, threshold = 0.7)
show(results)
#> CorrCombo object
#> -----------------
#>   Method:      bron-kerbosch
#>   Correlation: pearson
#>   Threshold:   0.700
#>   Subsets:     15 maximal subsets
#>   Data Rows:   32 used in correlation
#>   Pivot:       TRUE
#> 
#> Top combinations:
#>   No.  Variables                          Avg    Max    Size
#>   ------------------------------------------------------------
#>   [ 1] mpg, drat, qsec, gear, carb       0.416  0.700     5
#>   [ 2] cyl, drat, qsec, gear, carb       0.434  0.700     5
#>   [ 3] mpg, drat, vs, gear, carb         0.466  0.700     5
#>   [ 4] wt, qsec, am, carb                0.373  0.692     4
#>   [ 5] wt, qsec, gear, carb              0.388  0.656     4
#>   ... (10 more combinations)

Inspect subsets:

as.data.frame(results)[1:5, ]  # First 5 subsets
#>                      VarName01 VarName02 VarName03 VarName04 VarName05
#> Subset01 [avg=0.416]       mpg      drat      qsec      gear      carb
#> Subset02 [avg=0.434]       cyl      drat      qsec      gear      carb
#> Subset03 [avg=0.466]       mpg      drat        vs      gear      carb
#> Subset04 [avg=0.373]        wt      qsec        am      carb      <NA>
#> Subset05 [avg=0.388]        wt      qsec      gear      carb      <NA>

Extract a specific subset:

subset_data <- corrSubset(results, mtcars, which = 1)
names(subset_data)
#> [1] "mpg"  "drat" "qsec" "gear" "carb"

assocSelect(): Mixed-Type Data

# Create mixed-type data (no seed is set, so exact values vary between runs)
df <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  cat1 = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  ord1 = ordered(sample(1:5, 100, replace = TRUE))
)

# Handle mixed types automatically
results_mixed <- assocSelect(df, threshold = 0.5)
show(results_mixed)
#> CorrCombo object
#> -----------------
#>   Method:      bron-kerbosch
#>   Correlation: mixed
#>   AssocMethod: numeric_numeric = pearson, numeric_factor = eta, numeric_ordered
#>                = spearman, factor_ordered = cramersv
#>   Threshold:   0.500
#>   Subsets:     1 maximal subsets
#>   Data Rows:   100 used in correlation
#>   Pivot:       TRUE
#> 
#> Top combinations:
#>   No.  Variables                          Avg    Max    Size
#>   ------------------------------------------------------------
#>   [ 1] x1, x2, cat1, ord1                0.103  0.198     4

# Verify all pairwise associations are below threshold
cat("Max pairwise association:", max(results_mixed@max_corr), "\n")
#> Max pairwise association: 0.1981817
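The show() output above names eta for numeric-factor pairs and cramersv for factor pairs. The base-R sketch below computes both from their textbook definitions so the scale of these values is easier to interpret. The helper names are hypothetical, and the package's own implementation may differ in details such as bias corrections.

```r
# Correlation ratio (eta) between a numeric vector x and a factor g:
# square root of between-group sum of squares over total sum of squares.
eta_num_factor <- function(x, g) {
  grand_mean  <- mean(x)
  group_means <- ave(x, g)                       # group mean per observation
  ss_between  <- sum((group_means - grand_mean)^2)
  ss_total    <- sum((x - grand_mean)^2)
  sqrt(ss_between / ss_total)
}

# Cramer's V between two categorical vectors, from the chi-squared statistic
cramers_v <- function(a, b) {
  tab  <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  k    <- min(nrow(tab), ncol(tab)) - 1
  sqrt(as.numeric(chi2) / (sum(tab) * k))
}

# Small deterministic checks: perfect association gives 1
x <- c(1, 1, 2, 2)
g <- factor(c("a", "a", "b", "b"))
eta_perfect <- eta_num_factor(x, g)

f <- factor(c("A", "A", "B", "B", "C", "C"))
v_perfect <- cramers_v(f, f)

c(eta = eta_perfect, cramers_v = v_perfect)
```

Both measures range from 0 (no association) to 1 (perfect association), which is what allows a single threshold to be applied across variable types.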

Protecting Variables

Use force_in to ensure specific variables are always retained:

# Force "mpg" to remain in all subsets
pruned_force <- corrPrune(
  data = mtcars,
  threshold = 0.7,
  force_in = "mpg"
)

# Verify forced variable is present
"mpg" %in% names(pruned_force)
#> [1] TRUE

Threshold Selection

Common thresholds: 0.5 (strict), 0.7 (moderate, recommended default), 0.9 (lenient).

Lower thresholds are stricter because they allow fewer variable pairs to coexist, resulting in smaller subsets. Higher thresholds permit stronger correlations, retaining more variables.
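The effect of the threshold can be previewed without running any selection at all. This base-R sketch counts how many variable pairs in mtcars are allowed to coexist at each of the three common thresholds; it illustrates the strictness argument above and is not a substitute for the sensitivity analysis described in the advanced vignette.

```r
data(mtcars)

a <- abs(cor(mtcars))
pair_values <- a[upper.tri(a)]          # one value per variable pair

thresholds <- c(0.5, 0.7, 0.9)

# Number of pairs whose absolute correlation is below each threshold
compatible <- vapply(thresholds,
                     function(tau) sum(pair_values < tau),
                     integer(1))

data.frame(
  threshold        = thresholds,
  compatible_pairs = compatible,
  total_pairs      = length(pair_values)
)
```

The count of compatible pairs can only grow as the threshold rises, which is why higher thresholds retain more variables.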

For detailed threshold selection strategies including visualization techniques, VIF guidelines, and sensitivity analysis, see vignette("advanced").

Interface Selection Guide

Scenario                        Function        Key Parameters
------------------------------------------------------------------------------
Quick dimensionality reduction  corrPrune()     threshold, mode
Model-based refinement          modelPrune()    limit (VIF threshold), engine
Enumerate all maximal subsets   corrSelect()    threshold
Mixed-type data                 assocSelect()   threshold
Precomputed matrices            MatSelect()     threshold, method
Protect key variables           Any function    force_in

Quick Reference

corrPrune()

Removes redundant predictors based on pairwise correlation.

corrPrune(data, threshold = 0.7, measure = "auto", mode = "auto",
          force_in = NULL, by = NULL, group_q = 1, max_exact_p = 100)

Parameter   Description                                                  Default
-----------------------------------------------------------------------------------
data        Data frame or matrix                                         required
threshold   Maximum allowed correlation                                  0.7
measure     Correlation type: "auto", "pearson", "spearman", "kendall"   "auto"
mode        Algorithm: "auto", "exact", "greedy"                         "auto"
force_in    Variables that must be retained                              NULL

Returns: Data frame with pruned variables. Attributes: selected_vars, removed_vars.
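To give a feel for what a greedy mode trades away, here is a minimal base-R sketch of one generic greedy strategy: repeatedly drop the variable involved in the most threshold violations until none remain. This is an illustrative assumption, not the algorithm corrPrune() uses for mode = "greedy", and greedy_prune() is a hypothetical helper.

```r
# Hypothetical greedy pruning sketch (not the corrselect implementation)
greedy_prune <- function(data, threshold = 0.7) {
  a <- abs(cor(data))
  diag(a) <- 0                      # ignore self-correlation
  keep <- colnames(a)
  repeat {
    sub <- a[keep, keep, drop = FALSE]
    violations <- rowSums(sub >= threshold)
    if (all(violations == 0)) break
    # Drop the variable with the most violations;
    # ties are broken by the highest mean absolute correlation
    worst <- order(-violations, -rowMeans(sub))[1]
    keep <- keep[-worst]
  }
  data[, keep, drop = FALSE]
}

data(mtcars)
pruned_greedy <- greedy_prune(mtcars, threshold = 0.7)
names(pruned_greedy)
```

A greedy pass like this runs in polynomial time but gives no guarantee of finding the largest valid subset, which is the usual reason to prefer an exact search when the number of variables is small.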

modelPrune()

Iteratively removes predictors with high VIF from a regression model.

modelPrune(formula, data, engine = "lm", criterion = "vif",
           limit = 5, force_in = NULL, max_steps = NULL, ...)

Parameter   Description                                  Default
------------------------------------------------------------------
formula     Model formula (e.g., y ~ .)                  required
data        Data frame                                   required
engine      "lm", "glm", "lme4", "glmmTMB", or custom    "lm"
limit       Maximum allowed VIF                          5
force_in    Variables that must be retained              NULL

Returns: Pruned data frame. Attributes: selected_vars, removed_vars, final_model.

corrSelect()

Enumerates all maximal subsets satisfying correlation threshold (numeric data).

corrSelect(df, threshold = 0.7, method = NULL, force_in = NULL,
           cor_method = "pearson", ...)

Parameter    Description                                                        Default
------------------------------------------------------------------------------------------
df           Data frame (numeric columns only)                                  required
threshold    Maximum allowed correlation                                        0.7
method       Algorithm: "bron-kerbosch", "els"                                  NULL (auto)
cor_method   "pearson", "spearman", "kendall", "bicor", "distance", "maximal"   "pearson"
force_in     Variables required in all subsets                                  NULL

Returns: CorrCombo S4 object with slots: subset_list, avg_corr, min_corr, max_corr.

assocSelect()

Enumerates all maximal subsets for mixed-type data (numeric, factor, ordered).

assocSelect(df, threshold = 0.7, method = NULL, force_in = NULL,
            method_num_num = "pearson", method_num_ord = "spearman",
            method_ord_ord = "spearman", ...)

Parameter        Description                                    Default
---------------------------------------------------------------------------
df               Data frame (any column types)                  required
threshold        Maximum allowed association                    0.7
method_num_num   Numeric-numeric: "pearson", "spearman", etc.   "pearson"
method_num_ord   Numeric-ordered: "spearman", "kendall"         "spearman"
method_ord_ord   Ordered-ordered: "spearman", "kendall"         "spearman"

Returns: CorrCombo S4 object.

MatSelect()

Direct matrix interface for precomputed correlation/association matrices.

MatSelect(mat, threshold = 0.7, method = NULL, force_in = NULL, ...)

Parameter   Description                                Default
------------------------------------------------------------------
mat         Symmetric correlation/association matrix   required
threshold   Maximum allowed value                      0.7
method      Algorithm: "bron-kerbosch", "els"          NULL (auto)
force_in    Variables required in all subsets          NULL

Returns: CorrCombo S4 object.

corrSubset()

Extracts a specific subset from a CorrCombo result.

corrSubset(res, df, which = "best", keepExtra = FALSE)

Parameter   Description                                                   Default
------------------------------------------------------------------------------------
res         CorrCombo object from corrSelect/assocSelect/MatSelect        required
df          Original data frame                                           required
which       Subset index or "best" (lowest avg correlation)               "best"
keepExtra   Include non-numeric columns in output?                        FALSE

Returns: Data frame containing only the selected variables.

Troubleshooting

“No valid subsets found” error

  - Threshold too strict: all variable pairs exceed it
  - Solution: Increase threshold or use force_in to keep at least one variable

VIF computation fails in modelPrune()

  - Perfect multicollinearity (R² = 1) present
  - Solution: Use corrPrune(threshold = 0.99) first to remove near-duplicates

Forced variables conflict

  - Variables in force_in are too highly correlated with each other
  - Solution: Increase threshold or reduce force_in set

Slow performance with many variables

  - Exact mode is exponential for large p
  - Solution: Use mode = "greedy" for p > 25
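
Regarding the VIF failure case, it can help to confirm that near-duplicate columns are actually present before pruning. The base-R sketch below builds a small toy data frame in which x2 is an exact linear copy of x1 and lists every pair whose absolute correlation exceeds 0.99. The data and variable names are made up for illustration.

```r
# Toy data: x2 is an exact linear function of x1, x3 is merely related
df_dup <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x3 = c(2, 1, 4, 3, 6, 5)
)
df_dup$x2 <- 2 * df_dup$x1 + 1

a <- abs(cor(df_dup))
a[lower.tri(a, diag = TRUE)] <- 0     # keep each pair once, drop the diagonal

# Row/column positions of pairs above the near-duplicate cutoff
idx <- which(a > 0.99, arr.ind = TRUE)

near_duplicates <- data.frame(
  var1    = rownames(a)[idx[, "row"]],
  var2    = colnames(a)[idx[, "col"]],
  abs_cor = a[idx]
)
near_duplicates
```

For such a pair the auxiliary regression has R² = 1, so the VIF formula 1 / (1 - R²) divides by zero; removing one member of each listed pair resolves this.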

For comprehensive troubleshooting with code examples, see vignette("advanced"), Section 5.

Session Info

sessionInfo()
#> R version 4.5.1 (2025-06-13 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: Europe/Luxembourg
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] car_3.1-3            carData_3.0-5        microbenchmark_1.5.0
#> [4] corrselect_3.0.2    
#> 
#> loaded via a namespace (and not attached):
#>  [1] shape_1.4.6.1        gtable_0.3.6         xfun_0.53           
#>  [4] bslib_0.9.0          ggplot2_4.0.0        recipes_1.3.1       
#>  [7] lattice_0.22-7       vctrs_0.6.5          tools_4.5.1         
#> [10] generics_0.1.4       stats4_4.5.1         parallel_4.5.1      
#> [13] tibble_3.3.0         pkgconfig_2.0.3      ModelMetrics_1.2.2.2
#> [16] Matrix_1.7-4         data.table_1.17.8    RColorBrewer_1.1-3  
#> [19] S7_0.2.0             lifecycle_1.0.4      compiler_4.5.1      
#> [22] farver_2.1.2         stringr_1.5.2        textshaping_1.0.3   
#> [25] codetools_0.2-20     Boruta_9.0.0         htmltools_0.5.8.1   
#> [28] class_7.3-23         sass_0.4.10          glmnet_4.1-10       
#> [31] yaml_2.3.10          Formula_1.2-5        prodlim_2025.04.28  
#> [34] pillar_1.11.1        jquerylib_0.1.4      MASS_7.3-65         
#> [37] cachem_1.1.0         gower_1.0.2          iterators_1.0.14    
#> [40] abind_1.4-8          rpart_4.1.24         foreach_1.5.2       
#> [43] nlme_3.1-168         parallelly_1.45.1    lava_1.8.2          
#> [46] tidyselect_1.2.1     digest_0.6.37        stringi_1.8.7       
#> [49] future_1.67.0        dplyr_1.1.4          reshape2_1.4.4      
#> [52] purrr_1.2.0          listenv_0.9.1        splines_4.5.1       
#> [55] fastmap_1.2.0        grid_4.5.1           cli_3.6.5           
#> [58] magrittr_2.0.4       survival_3.8-3       future.apply_1.20.0 
#> [61] withr_3.0.2          scales_1.4.0         lubridate_1.9.4     
#> [64] timechange_0.3.0     rmarkdown_2.30       globals_0.18.0      
#> [67] nnet_7.3-20          timeDate_4051.111    ranger_0.17.0       
#> [70] evaluate_1.0.5       knitr_1.50           hardhat_1.4.2       
#> [73] caret_7.0-1          rlang_1.1.6          Rcpp_1.1.0          
#> [76] glue_1.8.0           pROC_1.19.0.1        ipred_0.9-15        
#> [79] svglite_2.2.2        rstudioapi_0.17.1    jsonlite_2.0.0      
#> [82] R6_2.6.1             plyr_1.8.9           systemfonts_1.3.1