| Title: | Correlation-Based and Model-Based Predictor Pruning |
| Version: | 3.0.2 |
| Description: | Provides functions for predictor pruning using association-based and model-based approaches. Includes corrPrune() for fast correlation-based pruning, modelPrune() for VIF-based regression pruning, and exact graph-theoretic algorithms (Eppstein–Löffler–Strash, Bron–Kerbosch) for exhaustive subset enumeration. Supports linear models, GLMs, and mixed models ('lme4', 'glmmTMB'). |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| LinkingTo: | Rcpp |
| Imports: | Rcpp, methods, stats |
| Suggests: | svglite, GO.db, WGCNA, preprocessCore, impute, energy, minerva, lme4, glmmTMB, MASS, caret, car, carData, microbenchmark, igraph, Boruta, glmnet, corrplot, knitr, rmarkdown, testthat (≥ 3.0.0), tibble |
| VignetteBuilder: | knitr |
| URL: | https://gillescolling.com/corrselect/ |
| BugReports: | https://github.com/gcol33/corrselect/issues |
| Depends: | R (≥ 3.5) |
| LazyData: | true |
| NeedsCompilation: | yes |
| Packaged: | 2025-11-28 19:34:15 UTC; Gilles Colling |
| Author: | Gilles Colling [aut, cre] |
| Maintainer: | Gilles Colling <gilles.colling051@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2025-11-29 16:40:02 UTC |
CorrCombo S4 class
Description
Holds the result of corrSelect or MatSelect: a list of valid variable combinations
and their correlation statistics.
This class stores all subsets of variables that meet the specified correlation constraint, along with metadata such as the algorithm used, correlation method(s), variables forced into every subset, and summary statistics for each combination.
Usage
## S4 method for signature 'CorrCombo'
show(object)
Arguments
object |
A CorrCombo object. |
Slots
- subset_list
A list of character vectors. Each vector is a valid subset (variable names).
- avg_corr
A numeric vector. Average absolute correlation within each subset.
- min_corr
A numeric vector. Minimum pairwise absolute correlation within each subset.
- max_corr
A numeric vector. Maximum pairwise absolute correlation within each subset.
- names
Character vector of all variable names used for decoding.
- threshold
Numeric scalar. The correlation threshold used during selection.
- forced_in
Character vector. Variable names that were forced into each subset.
- search_type
Character string. One of "els" or "bron-kerbosch".
- cor_method
Character string. Either a single method (e.g. "pearson") or "mixed" if multiple methods were used.
- n_rows_used
Integer. Number of rows used for computing the correlation matrix (after removing missing values).
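The per-subset statistics stored in avg_corr, min_corr, and max_corr can be illustrated with a short base R sketch. The helper subset_stats() and the toy matrix are hypothetical and only mirror the slot descriptions above; this is not the package's internal code.

```r
# Toy correlation matrix (hypothetical values)
cmat <- matrix(c(1.0,  0.2, -0.4,
                 0.2,  1.0,  0.1,
                -0.4,  0.1,  1.0),
               nrow = 3,
               dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Hypothetical helper: summary statistics over the absolute pairwise
# correlations of one subset (upper triangle only, diagonal excluded)
subset_stats <- function(cmat, vars) {
  sub  <- abs(cmat[vars, vars, drop = FALSE])
  vals <- sub[upper.tri(sub)]
  c(avg_corr = mean(vals), min_corr = min(vals), max_corr = max(vals))
}

stats_abc <- subset_stats(cmat, c("A", "B", "C"))
stats_abc
```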
See Also
corrSelect, MatSelect, corrSubset
Examples
show(new("CorrCombo",
subset_list = list(c("A", "B"), c("A", "C")),
avg_corr = c(0.2, 0.3),
min_corr = c(0.1, 0.2),
max_corr = c(0.3, 0.4),
names = c("A", "B", "C"),
threshold = 0.5,
forced_in = character(),
search_type = "els",
cor_method = "mixed",
n_rows_used = as.integer(5)
))
Select Variable Subsets with Low Correlation or Association (Matrix Interface)
Description
Identifies all maximal subsets of variables from a symmetric matrix (typically a correlation matrix) such that all pairwise absolute values stay below a specified threshold. Implements exact algorithms such as Eppstein–Löffler–Strash (ELS) and Bron–Kerbosch (with or without pivoting).
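The idea behind the description above can be sketched in base R: build a graph whose edges join pairs with absolute value below the threshold, then enumerate its maximal cliques. The function bron_kerbosch() below is a hypothetical, minimal Bron–Kerbosch without pivoting, written for illustration on a toy matrix. It is not the package's compiled implementation, and the strict comparison `<` is an assumption.

```r
# Toy symmetric matrix: A and B are strongly associated, the rest are not
m <- matrix(c(1.0, 0.9, 0.1, 0.2,
              0.9, 1.0, 0.3, 0.1,
              0.1, 0.3, 1.0, 0.2,
              0.2, 0.1, 0.2, 1.0),
            nrow = 4,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))
threshold <- 0.5

# Compatibility graph: an edge means the pair may share a subset
adj <- abs(m) < threshold
diag(adj) <- FALSE

# Minimal Bron-Kerbosch (no pivoting): r_set is the growing clique,
# p_set the candidates, x_set the vertices already processed
bron_kerbosch <- function(adj, r_set = integer(0),
                          p_set = seq_len(nrow(adj)), x_set = integer(0)) {
  if (length(p_set) == 0 && length(x_set) == 0) return(list(r_set))
  out <- list()
  for (v in p_set) {
    nb  <- unname(which(adj[v, ]))
    out <- c(out, bron_kerbosch(adj, c(r_set, v),
                                intersect(p_set, nb), intersect(x_set, nb)))
    p_set <- setdiff(p_set, v)
    x_set <- c(x_set, v)
  }
  out
}

cliques <- lapply(bron_kerbosch(adj), function(i) colnames(m)[sort(i)])
cliques
```

On this toy input the two maximal subsets are A, C, D and B, C, D: A and B can never appear together, and each of the two subsets cannot be extended without breaking the constraint.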
Usage
MatSelect(mat, threshold = 0.7, method = NULL, force_in = NULL, ...)
Arguments
mat |
A numeric, symmetric matrix with 1s on the diagonal (e.g. correlation matrix). Column names (if present) are used to label output variables. |
threshold |
A numeric scalar in (0, 1). Maximum allowed absolute pairwise value.
Defaults to 0.7. |
method |
Character. Selection algorithm to use. One of "els" or "bron-kerbosch". |
force_in |
Optional integer vector of 1-based column indices to force into every subset. |
... |
Additional arguments passed to the backend, e.g., use_pivot for the Bron–Kerbosch method. |
Value
An object of class CorrCombo, containing all valid subsets and their
correlation statistics.
Examples
set.seed(42)
mat <- matrix(rnorm(100), ncol = 10)
colnames(mat) <- paste0("V", 1:10)
cmat <- cor(mat)
# Default method (Bron-Kerbosch)
res1 <- MatSelect(cmat, threshold = 0.5)
# Bron–Kerbosch without pivot
res2 <- MatSelect(cmat, threshold = 0.5, method = "bron-kerbosch", use_pivot = FALSE)
# Bron–Kerbosch with pivoting
res3 <- MatSelect(cmat, threshold = 0.5, method = "bron-kerbosch", use_pivot = TRUE)
# Force variable 1 into every subset (with warning if too correlated)
res4 <- MatSelect(cmat, threshold = 0.5, force_in = 1)
Coerce CorrCombo to a Data Frame
Description
Converts a CorrCombo object into a data frame of variable combinations.
Usage
## S3 method for class 'CorrCombo'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
Arguments
x |
A CorrCombo object. |
row.names |
Optional row names for the output data frame. |
optional |
Logical. Passed to |
... |
Additional arguments passed to |
Value
A data frame where each row corresponds to a subset of variables. Columns are named
VarName01, VarName02, ..., up to the size of the largest subset. Subsets shorter than the
maximum length are padded with NA.
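The padding layout described above can be reproduced with a few lines of base R. This is a sketch of the output shape only, using hypothetical subsets; it does not call the package method.

```r
# Hypothetical subsets of unequal length
subsets <- list(c("A", "B", "C"), c("A", "D"))

# Pad every subset with NA up to the size of the largest one
width  <- max(lengths(subsets))
padded <- lapply(subsets, function(s) c(s, rep(NA_character_, width - length(s))))

# One row per subset, columns named VarName01, VarName02, ...
out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(out) <- sprintf("VarName%02d", seq_len(width))
out
```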
Examples
set.seed(1)
mat <- matrix(rnorm(100), ncol = 10)
colnames(mat) <- paste0("V", 1:10)
res <- MatSelect(cor(mat), threshold = 0.5)
as.data.frame(res)
Select Variable Subsets with Low Association (Mixed-Type Data Frame Interface)
Description
Identifies combinations of variables of any common data type (numeric,
ordered factors, or unordered factors) whose pairwise association does not
exceed a user-supplied threshold.
The routine wraps MatSelect() and handles all pre-processing
(type conversion, missing-row removal, constant-column checks) for typical
data-frame/tibble/data-table inputs.
Usage
assocSelect(
df,
threshold = 0.7,
method = NULL,
force_in = NULL,
method_num_num = c("pearson", "spearman", "kendall", "bicor", "distance", "maximal"),
method_num_ord = c("spearman", "kendall"),
method_ord_ord = c("spearman", "kendall"),
...
)
Arguments
df |
A data frame (or tibble / data.table). May contain any mix of numeric columns, ordered factors, and unordered factors.
|
threshold |
Numeric in |
method |
Character; the subset-search algorithm. One of "els" or "bron-kerbosch".
|
force_in |
Optional character vector or column indices specifying variables that must appear in every returned subset. |
method_num_num |
Association measure for numeric–numeric pairs.
One of "pearson", "spearman", "kendall", "bicor", "distance", or "maximal". Defaults to "pearson". |
method_num_ord |
Association measure for numeric–ordered pairs.
One of "spearman" or "kendall". |
method_ord_ord |
Association measure for ordered–ordered pairs.
One of "spearman" or "kendall". |
... |
Additional arguments passed unchanged to MatSelect(). |
Details
A single call can therefore screen a data set that mixes continuous and categorical features and return every subset whose internal associations are “sufficiently low” under the metric(s) you choose.
Rows containing NA are dropped with a warning; constant columns are
treated as having zero association with every other variable.
The default association measure for each variable-type combination is:
- numeric – numeric
method_num_num (default "pearson")
- numeric – ordered
method_num_ord
- numeric – unordered
"eta" (ANOVA η²)
- ordered – ordered
method_ord_ord
- ordered – unordered
"cramersv"
- unordered – unordered
"cramersv"
All association measures are rescaled to [0,1] before thresholding.
External packages are required for
"bicor" (WGCNA),
"distance" (energy),
and "maximal" (minerva); an informative error is thrown if they
are missing.
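For readers unfamiliar with the two categorical measures named above, the following base R sketch computes Cramér's V and the ANOVA eta squared from their textbook definitions. The helpers cramers_v() and eta_squared() are hypothetical names, and the package may compute or rescale these measures differently.

```r
set.seed(1)
g <- factor(rep(c("a", "b", "c"), each = 10))          # unordered factor
h <- factor(sample(c("x", "y"), 30, replace = TRUE))   # unordered factor
y <- rnorm(30) + as.integer(g)                         # numeric, depends on g

# Cramer's V: sqrt(chi-squared / (n * (min(rows, cols) - 1)))
cramers_v <- function(a, b) {
  tab  <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  as.numeric(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Eta squared: between-group sum of squares / total sum of squares
eta_squared <- function(x, f) {
  ss_total   <- sum((x - mean(x))^2)
  ss_between <- sum(tapply(x, f, function(v) length(v) * (mean(v) - mean(x))^2))
  ss_between / ss_total
}

v_gh   <- cramers_v(g, h)     # factor vs factor
eta_yg <- eta_squared(y, g)   # numeric vs factor
```

Both quantities lie in [0, 1], which is what allows a single threshold to be applied across variable-type combinations.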
Value
A CorrCombo S4 object containing:
all valid subsets,
their summary association statistics,
metadata (algorithm used, rows kept, forced-in variables, etc.).
The object’s show() method prints the association metrics that were
actually used for this data set.
See Also
corrSelect(), MatSelect(), corrSubset()
Examples
set.seed(42)
df <- data.frame(
height = rnorm(15, 170, 10),
weight = rnorm(15, 70, 12),
group = factor(rep(LETTERS[1:3], each = 5)),
score = ordered(sample(c("low","med","high"), 15, TRUE))
)
## keep every subset whose internal associations <= 0.6
assocSelect(df, threshold = 0.6)
## use Kendall for all rank-based comparisons and force 'height' to appear
assocSelect(df,
threshold = 0.5,
method_num_num = "kendall",
method_num_ord = "kendall",
method_ord_ord = "kendall",
force_in = "height")
Example Bioclimatic Data for Ecological Modeling
Description
A simulated dataset with the 19 WorldClim bioclimatic variables (https://www.worldclim.org/data/bioclim.html) measured at 100 geographic locations, with species richness as the response variable. Variables are organized into correlated blocks representing temperature (BIO1-BIO11) and precipitation (BIO12-BIO19).
Usage
bioclim_example
Format
A data frame with 100 rows and 20 variables:
- species_richness
Integer. Number of species observed (response variable)
- BIO1
Numeric. Annual Mean Temperature
- BIO2
Numeric. Mean Diurnal Range
- BIO3
Numeric. Isothermality
- BIO4
Numeric. Temperature Seasonality
- BIO5
Numeric. Max Temperature of Warmest Month
- BIO6
Numeric. Min Temperature of Coldest Month
- BIO7
Numeric. Temperature Annual Range
- BIO8
Numeric. Mean Temperature of Wettest Quarter
- BIO9
Numeric. Mean Temperature of Driest Quarter
- BIO10
Numeric. Mean Temperature of Warmest Quarter
- BIO11
Numeric. Mean Temperature of Coldest Quarter
- BIO12
Numeric. Annual Precipitation
- BIO13
Numeric. Precipitation of Wettest Month
- BIO14
Numeric. Precipitation of Driest Month
- BIO15
Numeric. Precipitation Seasonality
- BIO16
Numeric. Precipitation of Wettest Quarter
- BIO17
Numeric. Precipitation of Driest Quarter
- BIO18
Numeric. Precipitation of Warmest Quarter
- BIO19
Numeric. Precipitation of Coldest Quarter
Details
This dataset demonstrates a common problem in ecological modeling: bioclimatic predictors are highly correlated within groups (temperature variables BIO1-BIO11 are highly correlated; precipitation variables BIO12-BIO19 are moderately correlated), leading to multicollinearity issues. The species richness response depends on a subset of predictors.
Use case: Demonstrating corrPrune() and modelPrune() for reducing correlated
environmental predictors before fitting species distribution models.
Source
Simulated data based on the 19 WorldClim bioclimatic variables
Examples
data(bioclim_example)
# The 19 WorldClim bioclimatic variables (https://www.worldclim.org/data/bioclim.html)
# Many are highly correlated, making them ideal for pruning
# Remove highly correlated variables
pruned <- corrPrune(bioclim_example[, -1], threshold = 0.7)
ncol(pruned) # Reduced from 19 to ~8 variables
# Model-based pruning with VIF
model_data <- modelPrune(species_richness ~ .,
data = bioclim_example,
limit = 5)
attr(model_data, "selected_vars")
Example Correlation Matrix with Block Structure
Description
A 20x20 correlation matrix with known block structure designed for demonstrating threshold selection, algorithm comparison, and visualization examples in vignettes.
Usage
cor_example
Format
A 20x20 numeric correlation matrix with row and column names V1-V20. The matrix has four distinct correlation blocks:
- Block 1 (V1-V5)
High correlation: mean = 0.81, range = (0.75, 0.95)
- Block 2 (V6-V10)
Moderate correlation: mean = 0.57, range = (0.5, 0.7)
- Block 3 (V11-V15)
Low correlation: mean = 0.28, range = (0.2, 0.4)
- Block 4 (V16-V20)
Minimal correlation: mean = 0.06, range = (0.0, 0.15)
Between-block correlations are low (range = (0.0, 0.3)). The matrix is guaranteed to be positive definite.
Details
This dataset provides a controlled correlation structure useful for:
Threshold sensitivity analysis (comparing results at tau = 0.5, 0.7, 0.9)
Algorithm comparison (exact vs greedy modes)
Visualization examples (heatmaps, correlation distributions)
Reproducible benchmarks across vignettes
Expected behavior with different thresholds:
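tau = 0.5: Blocks 1-2 require pruning (Block 1 pairs all exceed 0.75; Block 2 pairs fall between 0.5 and 0.7)
tau = 0.7: Only Block 1 requires pruning (Block 2 pairs do not exceed 0.7)
tau = 0.9: Only the Block 1 pairs above 0.9 require pruning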
Source
Generated with data-raw/create_cor_example.R using
seed 20250125.
Examples
data(cor_example)
# Matrix dimensions
dim(cor_example)
# Visualize structure
if (requireNamespace("corrplot", quietly = TRUE)) {
corrplot::corrplot(cor_example, method = "color", type = "upper",
tl.col = "black", tl.cex = 0.7)
}
# Distribution of correlations
hist(cor_example[upper.tri(cor_example)],
breaks = 30,
main = "Distribution of Correlations in cor_example",
xlab = "Correlation",
col = "steelblue")
# Use with MatSelect
library(corrselect)
results <- MatSelect(cor_example, threshold = 0.7, method = "els")
show(results)
Association-Based Predictor Pruning
Description
corrPrune() performs model-free variable subset selection by iteratively
removing predictors until all pairwise associations fall below a specified
threshold. It returns a single pruned data frame with predictors that satisfy
the association constraint.
Usage
corrPrune(
data,
threshold = 0.7,
measure = "auto",
mode = "auto",
force_in = NULL,
by = NULL,
group_q = 1,
max_exact_p = 100,
...
)
Arguments
data |
A data.frame containing candidate predictors. |
threshold |
Numeric scalar. Maximum allowed pairwise association (default: 0.7). Must be non-negative. |
measure |
Character string specifying the association measure to use.
Options: |
mode |
Character string specifying the search algorithm. Options:
|
force_in |
Character vector of variable names that must be retained in the final subset. Default: NULL. |
by |
Character vector naming one or more grouping variables. If provided,
associations are computed separately within each group, then aggregated
using the quantile specified by group_q. |
group_q |
Numeric scalar in (0, 1]. Quantile used to aggregate
associations across groups when by is provided. |
max_exact_p |
Integer. Maximum number of predictors for which exact
mode is used when mode = "auto". |
... |
Additional arguments (reserved for future use). |
Details
corrPrune() identifies a subset of predictors whose pairwise associations
are all below threshold. The function works in several stages:
- Variable type detection: Identifies numeric vs. categorical predictors
- Association measurement: Computes appropriate pairwise associations
- Grouping (optional): If by is specified, computes associations within each group and aggregates using the specified quantile
- Feasibility check: Verifies that force_in variables satisfy the threshold constraint
- Subset selection: Uses either exact or greedy search to find a valid subset
Grouped Pruning: When by is provided, the function ensures the selected
predictors satisfy the threshold constraint across groups. For example, with
group_q = 1 (default), the returned predictors will have pairwise associations
below threshold in all groups. With group_q = 0.9, they will satisfy
the constraint in at least 90% of groups.
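The quantile aggregation behind grouped pruning can be sketched in base R. The data, the column names, and the helper aggregate_q() are hypothetical, and Pearson correlation is assumed as the association measure. With a quantile of 1 the aggregate is the element-wise maximum across groups, so a pair passes only if it is acceptable in every group.

```r
set.seed(7)
d <- data.frame(g  = rep(c("g1", "g2", "g3"), each = 40),
                x1 = rnorm(120), x2 = rnorm(120), x3 = rnorm(120))
# Make x1 and x2 strongly related in group g3 only
d$x2[d$g == "g3"] <- d$x1[d$g == "g3"] + rnorm(40, sd = 0.1)

# Absolute correlation matrix within each group
per_group <- lapply(split(d[, c("x1", "x2", "x3")], d$g),
                    function(part) abs(cor(part)))

# Stack into a p x p x n_groups array and aggregate each cell by quantile
arr <- simplify2array(per_group)
aggregate_q <- function(q) apply(arr, c(1, 2), quantile, probs = q, names = FALSE)

agg_max <- aggregate_q(1)     # worst case over groups
agg_med <- aggregate_q(0.5)   # tolerant of a single outlying group
agg_max["x1", "x2"]
agg_med["x1", "x2"]
```

In this toy setup the x1/x2 pair is flagged under the maximum but looks much weaker under the median, which shows how lowering the quantile relaxes the constraint.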
Mode Selection: Exact mode guarantees finding all maximal subsets and returns the largest one (with deterministic tie-breaking). Greedy mode is faster but approximate, using a deterministic removal strategy based on association scores.
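A deterministic greedy removal strategy of the kind described above might look like the following base R sketch. The function greedy_prune() and its scoring rule (most violations first, then highest mean association, then last position) are assumptions made for illustration; the package's actual scoring may differ.

```r
# Hypothetical greedy pruning on an association matrix
greedy_prune <- function(a, threshold) {
  a <- abs(a)
  diag(a) <- 0
  keep <- colnames(a)
  while (length(keep) > 1 && max(a[keep, keep]) > threshold) {
    sub    <- a[keep, keep]
    n_viol <- rowSums(sub > threshold)          # violations per variable
    worst  <- which(n_viol == max(n_viol))      # candidates for removal
    score  <- rowMeans(sub)[worst]              # tie-break 1: mean association
    drop   <- names(score)[max(which(score == max(score)))]  # tie-break 2: last
    keep   <- setdiff(keep, drop)
  }
  keep
}

kept <- greedy_prune(cor(mtcars), threshold = 0.7)
kept
```

Unlike exact enumeration, this returns one valid subset and offers no guarantee that it is the largest possible one.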
Value
A data.frame containing the pruned subset of predictors. The result has the following attributes:
- selected_vars
Character vector of retained variable names
- removed_vars
Character vector of removed variable names
- mode
Character string indicating which mode was used ("exact" or "greedy")
- measure
Character string indicating which association measure was used
- threshold
The threshold value used
See Also
corrSelect for exhaustive subset enumeration,
assocSelect for mixed-type data subset enumeration,
modelPrune for model-based predictor pruning.
Examples
# Basic numeric data pruning
data(mtcars)
pruned <- corrPrune(mtcars, threshold = 0.7)
names(pruned)
# Force certain variables to be included
pruned <- corrPrune(mtcars, threshold = 0.7, force_in = "mpg")
# Use greedy mode for faster computation
pruned <- corrPrune(mtcars, threshold = 0.7, mode = "greedy")
Select Variable Subsets with Low Correlation (Data Frame Interface)
Description
Identifies combinations of numeric variables in a data frame such that all pairwise absolute correlations
fall below a specified threshold. This function is a wrapper around MatSelect()
and accepts data frames, tibbles, or data tables with automatic preprocessing.
Usage
corrSelect(
df,
threshold = 0.7,
method = NULL,
force_in = NULL,
cor_method = c("pearson", "spearman", "kendall", "bicor", "distance", "maximal"),
...
)
Arguments
df |
A data frame. Only numeric columns are used. |
threshold |
A numeric value in (0, 1). Maximum allowed absolute correlation. Defaults to 0.7. |
method |
Character. Selection algorithm to use. One of "els" or "bron-kerbosch". |
force_in |
Optional character vector or numeric indices of columns to force into all subsets. |
cor_method |
Character string indicating which correlation method to use.
One of "pearson", "spearman", "kendall", "bicor", "distance", or "maximal". |
... |
Additional arguments passed to MatSelect(). |
Details
Only numeric columns are used for correlation analysis. Non‐numeric columns (factors, characters,
logicals, etc.) are ignored, and their names and types are printed to inform the user. These can be
optionally reattached later using corrSubset() with keepExtra = TRUE.
Rows with missing values are removed before computing correlations. A warning is issued if any rows are dropped.
The cor_method controls how the correlation matrix is computed:
- "pearson": Standard linear correlation.
- "spearman": Rank-based monotonic correlation.
- "kendall": Kendall's tau.
- "bicor": Biweight midcorrelation (WGCNA::bicor).
- "distance": Distance correlation (energy::dcor).
- "maximal": Maximal information coefficient (minerva::mine).
For "bicor", "distance", and "maximal", the corresponding
package must be installed.
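The preprocessing described in this section (keep numeric columns, drop incomplete rows, then compute the correlation matrix) can be approximated in base R as follows. The toy data frame is hypothetical and the sketch omits the messages and warnings the function issues.

```r
df <- data.frame(a   = c(1, 2, NA, 4, 5),
                 b   = c(2.1, 3.9, 6.2, 8.1, 9.8),
                 lab = factor(c("u", "v", "u", "v", "u")))

# Keep numeric columns only; 'lab' is set aside
num <- df[vapply(df, is.numeric, logical(1))]

# Drop rows with any missing value before computing correlations
complete <- num[stats::complete.cases(num), , drop = FALSE]

# Correlation matrix with the chosen method (here rank-based)
cmat <- cor(complete, method = "spearman")
cmat
```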
Value
An object of class CorrCombo, containing selected subsets and correlation statistics.
See Also
assocSelect(), MatSelect(), corrSubset()
Examples
set.seed(42)
n <- 100
# Create 20 variables in 5 blocks of 4: block2 is strongly correlated, the others are independent noise
block1 <- matrix(rnorm(n * 4), ncol = 4)
block2 <- matrix(rnorm(n), ncol = 1)
block2 <- matrix(rep(block2, 4), ncol = 4) + matrix(rnorm(n * 4, sd = 0.1), ncol = 4)
block3 <- matrix(rnorm(n * 4), ncol = 4)
block4 <- matrix(rnorm(n * 4), ncol = 4)
block5 <- matrix(rnorm(n * 4), ncol = 4)
df <- as.data.frame(cbind(block1, block2, block3, block4, block5))
colnames(df) <- paste0("V", 1:20)
# Add a non-numeric column to be ignored
df$label <- factor(sample(c("A", "B"), n, replace = TRUE))
# Basic usage
corrSelect(df, threshold = 0.8)
# Try Bron–Kerbosch with pivoting
corrSelect(df, threshold = 0.6, method = "bron-kerbosch", use_pivot = TRUE)
# Force in a specific variable and use Spearman correlation
corrSelect(df, threshold = 0.6, force_in = "V10", cor_method = "spearman")
Extract Variable Subsets from a CorrCombo Object
Description
Extracts one or more variable subsets from a CorrCombo object as data frames.
Typically used after corrSelect or MatSelect to obtain filtered
versions of the original dataset containing only low‐correlation variable combinations.
Usage
corrSubset(res, df, which = "best", keepExtra = FALSE)
Arguments
res |
A CorrCombo object. |
df |
A data frame or matrix. Must contain all variables listed in |
which |
Subsets to extract. One of: "best" (default), a single integer index, an integer vector of indices, or "all".
Subsets are ranked by decreasing size, then increasing average correlation. |
keepExtra |
Logical. If |
Value
A data frame if a single subset is extracted, or a list of data frames if multiple subsets are extracted. Each data frame contains the selected variables (and optionally extras).
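The ranking rule stated for the which argument (decreasing size, then increasing average correlation) corresponds to a two-key sort. A minimal base R sketch with hypothetical subsets and averages:

```r
subsets  <- list(c("A", "B"), c("A", "C", "D"), c("B", "C", "D"))
avg_corr <- c(0.10, 0.30, 0.20)

# Larger subsets first; among equal sizes, lower average correlation first
rank_order <- order(-lengths(subsets), avg_corr)
best <- subsets[[rank_order[1]]]
best
```

Here the two three-variable subsets outrank the pair even though the pair has the lowest average correlation, and the tie between them is resolved in favour of the one with the lower average.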
Note
A warning is issued if any rows contain missing values in the selected variables.
See Also
corrSelect, MatSelect, CorrCombo
Examples
# Simulate input data
set.seed(123)
df <- as.data.frame(matrix(rnorm(100), nrow = 10))
colnames(df) <- paste0("V", 1:10)
# Compute correlation matrix
cmat <- cor(df)
# Select subsets from the correlation matrix using MatSelect
res <- MatSelect(cmat, threshold = 0.5)
# Extract the best subset (default)
corrSubset(res, df)
# Extract the second-best subset
corrSubset(res, df, which = 2)
# Extract the first three subsets
corrSubset(res, df, which = 1:3)
# Extract all subsets
corrSubset(res, df, which = "all")
# Extract best subset and retain additional numeric column
df$CopyV1 <- df$V1
corrSubset(res, df, which = 1, keepExtra = TRUE)
Example Gene Expression Data for Bioinformatics
Description
A simulated gene expression dataset with 200 genes measured across 100 samples, organized into co-expression modules with a binary disease outcome.
Usage
genes_example
Format
A data frame with 100 rows and 202 variables:
- sample_id
Character. Unique sample identifier
- disease_status
Factor. Disease status (Healthy, Disease)
- GENE001, GENE002, GENE003, GENE004, GENE005, GENE006, GENE007, GENE008, GENE009, GENE010, GENE011, GENE012, GENE013, GENE014, GENE015, GENE016, GENE017, GENE018, GENE019, GENE020, GENE021, GENE022, GENE023, GENE024, GENE025, GENE026, GENE027, GENE028, GENE029, GENE030, GENE031, GENE032, GENE033, GENE034, GENE035, GENE036, GENE037, GENE038, GENE039, GENE040, GENE041, GENE042, GENE043, GENE044, GENE045, GENE046, GENE047, GENE048, GENE049, GENE050, GENE051, GENE052, GENE053, GENE054, GENE055, GENE056, GENE057, GENE058, GENE059, GENE060, GENE061, GENE062, GENE063, GENE064, GENE065, GENE066, GENE067, GENE068, GENE069, GENE070, GENE071, GENE072, GENE073, GENE074, GENE075, GENE076, GENE077, GENE078, GENE079, GENE080, GENE081, GENE082, GENE083, GENE084, GENE085, GENE086, GENE087, GENE088, GENE089, GENE090, GENE091, GENE092, GENE093, GENE094, GENE095, GENE096, GENE097, GENE098, GENE099, GENE100, GENE101, GENE102, GENE103, GENE104, GENE105, GENE106, GENE107, GENE108, GENE109, GENE110, GENE111, GENE112, GENE113, GENE114, GENE115, GENE116, GENE117, GENE118, GENE119, GENE120, GENE121, GENE122, GENE123, GENE124, GENE125, GENE126, GENE127, GENE128, GENE129, GENE130, GENE131, GENE132, GENE133, GENE134, GENE135, GENE136, GENE137, GENE138, GENE139, GENE140, GENE141, GENE142, GENE143, GENE144, GENE145, GENE146, GENE147, GENE148, GENE149, GENE150, GENE151, GENE152, GENE153, GENE154, GENE155, GENE156, GENE157, GENE158, GENE159, GENE160, GENE161, GENE162, GENE163, GENE164, GENE165, GENE166, GENE167, GENE168, GENE169, GENE170, GENE171, GENE172, GENE173, GENE174, GENE175, GENE176, GENE177, GENE178, GENE179, GENE180, GENE181, GENE182, GENE183, GENE184, GENE185, GENE186, GENE187, GENE188, GENE189, GENE190, GENE191, GENE192, GENE193, GENE194, GENE195, GENE196, GENE197, GENE198, GENE199, GENE200
Numeric. Gene expression values (log-transformed)
Details
This dataset simulates a high-dimensional, low-sample scenario common in genomics. Genes are organized into four co-expression modules:
Module 1 (GENE001-GENE050): Highly correlated (r ~= 0.80), disease-associated
Module 2 (GENE051-GENE100): Moderately correlated (r ~= 0.60)
Module 3 (GENE101-GENE150): Weakly correlated (r ~= 0.40)
Module 4 (GENE151-GENE200): Independent (r ~= 0)
Disease outcome depends on a subset of genes from Module 1.
Use case: Demonstrating corrPrune() with mode = "greedy" for handling
high-dimensional data efficiently.
Source
Simulated data based on typical gene expression microarray structures
Examples
data(genes_example)
# Greedy pruning for high-dimensional data
gene_data <- genes_example[, -(1:2)] # Exclude ID and outcome
pruned <- corrPrune(gene_data, threshold = 0.8, mode = "greedy")
ncol(pruned) # Reduced from 200 to ~50 genes
# Use pruned genes for classification
pruned_with_outcome <- data.frame(
disease_status = genes_example$disease_status,
pruned
)
Example Longitudinal Data for Clinical Research
Description
A simulated longitudinal study dataset with 50 subjects measured at 10 timepoints each, with 20 correlated predictors and nested random effects (subject and site).
Usage
longitudinal_example
Format
A data frame with 500 rows and 25 variables:
- obs_id
Integer. Observation identifier (1-500)
- subject
Factor. Subject identifier (1-50)
- site
Factor. Study site identifier (1-5)
- time
Integer. Measurement timepoint (1-10)
- outcome
Numeric. Continuous outcome variable
- x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20
Numeric. Correlated predictor variables
Details
This dataset represents a typical longitudinal study with repeated measures. Predictors are correlated both within and between subjects:
Predictors x1-x10: Highly correlated (r ~= 0.75)
Predictors x11-x20: Moderately correlated (r ~= 0.50)
The outcome depends on time (linear trend), random effects (subject and site), and a subset of fixed-effect predictors (x1, x5, x15).
Use case: Demonstrating modelPrune() with mixed models (lme4 engine)
to prune fixed effects while preserving random effects structure.
Source
Simulated data based on typical clinical trial designs
Examples
data(longitudinal_example)
## Not run:
# Prune fixed effects in mixed model (requires lme4)
if (requireNamespace("lme4", quietly = TRUE)) {
pruned <- modelPrune(
outcome ~ x1 + x2 + x3 + x4 + x5 + (1|subject) + (1|site),
data = longitudinal_example,
engine = "lme4",
limit = 5
)
# Random effects preserved, only fixed effects pruned
attr(pruned, "selected_vars")
}
## End(Not run)
Model-Based Predictor Pruning
Description
modelPrune() performs iterative removal of fixed-effect predictors based on
model diagnostics (e.g., VIF) until all remaining predictors satisfy a
specified threshold. It supports linear models, generalized linear models,
and mixed models.
Usage
modelPrune(
formula,
data,
engine = "lm",
criterion = "vif",
limit = 5,
force_in = NULL,
max_steps = NULL,
...
)
Arguments
formula |
A model formula specifying the response and predictors.
May include random effects for mixed models (e.g., (1|subject)). |
data |
A data.frame containing the variables in the formula. |
engine |
Either a character string for built-in engines, or a list defining a custom engine. Built-in engines (character string):
Custom engine (named list with required components):
|
criterion |
Character string specifying the diagnostic criterion for pruning.
For built-in engines, only "vif" is supported. |
limit |
Numeric scalar. Maximum allowed value for the criterion. Predictors with diagnostic values exceeding this limit are iteratively removed. Default: 5 (common VIF threshold). |
force_in |
Character vector of predictor names that must be retained in the final model. These variables will not be removed during pruning. Default: NULL. |
max_steps |
Integer. Maximum number of pruning iterations. If NULL (default), pruning continues until all diagnostics are below the limit or no more removable predictors remain. |
... |
Additional arguments passed to the modeling function (e.g., family for GLMs). |
Details
modelPrune() works by:
Parsing the formula to identify fixed-effect predictors
Fitting the initial model
Computing diagnostics for each fixed-effect predictor
Checking feasibility of force_in constraints
Iteratively removing the predictor with the worst diagnostic value (excluding force_in variables) until all diagnostics <= limit
Returning the pruned data frame
Random Effects: For mixed models (lme4, glmmTMB), only fixed-effect predictors are considered for pruning. Random-effect structure is preserved exactly as specified in the original formula.
VIF Computation: Variance Inflation Factors are computed from the fixed-effects design matrix. For categorical predictors, VIF represents the inflation for the entire factor (not individual dummy variables).
Determinism: The algorithm is deterministic. Ties in diagnostic values are broken by removing the predictor that appears last in the formula.
Force-in Constraints: If variables in force_in violate the diagnostic
threshold, the function will error. This ensures that the constraint is
feasible before pruning begins.
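The pruning loop can be sketched in base R for the simplest case of a linear model with numeric predictors, using the textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The helpers vif_values() and prune_by_vif() are hypothetical; factor-level VIFs, force_in handling, and mixed models are not covered, and the package's own computation from the design matrix may differ in detail.

```r
# VIF of each predictor from an auxiliary regression on the others
vif_values <- function(data, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

# Drop the worst predictor until every VIF is at or below the limit;
# ties are broken by removing the predictor that appears last
prune_by_vif <- function(data, predictors, limit = 5) {
  while (length(predictors) > 1) {
    v <- vif_values(data, predictors)
    if (max(v) <= limit) break
    predictors <- predictors[-max(which(v == max(v)))]
  }
  predictors
}

kept <- prune_by_vif(mtcars, setdiff(names(mtcars), "mpg"), limit = 5)
kept
```

Recomputing all VIFs after each removal matters, because dropping one collinear predictor usually lowers the VIFs of the predictors it was entangled with.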
Value
A data.frame containing only the retained predictors (and response). The result has the following attributes:
- selected_vars
Character vector of retained predictor names
- removed_vars
Character vector of removed predictor names (in order of removal)
- engine
Character string indicating which engine was used (for custom engines, this is the engine's name field)
- criterion
Character string indicating which criterion was used
- limit
The threshold value used
- final_model
The final fitted model object (optional)
See Also
corrPrune for association-based predictor pruning,
corrSelect for exhaustive subset enumeration.
Examples
# Linear model with VIF-based pruning
data(mtcars)
pruned <- modelPrune(mpg ~ ., data = mtcars, engine = "lm", limit = 5)
names(pruned)
# Force certain predictors to remain
pruned <- modelPrune(mpg ~ ., data = mtcars, force_in = "drat", limit = 20)
# GLM example (requires family argument)
pruned <- modelPrune(am ~ ., data = mtcars, engine = "glm",
family = binomial(), limit = 5)
## Not run:
# Custom engine example (INLA)
inla_engine <- list(
name = "inla",
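fit = function(formula, data, ...) {
# Fall back to a Gaussian likelihood when no family is supplied
# (avoids %||%, which base R only provides from version 4.4.0)
family <- list(...)$family
if (is.null(family)) family <- "gaussian"
inla::inla(formula = formula, data = data,
family = family,
control.compute = list(config = TRUE))
},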
diagnostics = function(model, fixed_effects) {
scores <- model$summary.fixed[, "sd"]
names(scores) <- rownames(model$summary.fixed)
scores[fixed_effects]
}
)
pruned <- modelPrune(y ~ x1 + x2 + x3, data = df,
engine = inla_engine, limit = 0.5)
## End(Not run)
Example Survey Data for Social Science Research
Description
A simulated questionnaire dataset with 30 Likert-scale items measuring three latent constructs (satisfaction, engagement, loyalty), plus demographic variables and an overall satisfaction score.
Usage
survey_example
Format
A data frame with 200 rows and 35 variables:
- respondent_id
Integer. Unique respondent identifier
- age
Integer. Respondent age (18-75 years)
- gender
Factor. Gender (Male, Female, Other)
- education
Ordered factor. Education level (High School, Bachelor, Master, PhD)
- overall_satisfaction
Integer. Overall satisfaction score (0-100)
- satisfaction_1, satisfaction_2, satisfaction_3, satisfaction_4, satisfaction_5, satisfaction_6, satisfaction_7, satisfaction_8, satisfaction_9, satisfaction_10
Ordered factor. Satisfaction items (1-7 Likert scale)
- engagement_1, engagement_2, engagement_3, engagement_4, engagement_5, engagement_6, engagement_7, engagement_8, engagement_9, engagement_10
Ordered factor. Engagement items (1-7 Likert scale)
- loyalty_1, loyalty_2, loyalty_3, loyalty_4, loyalty_5, loyalty_6, loyalty_7, loyalty_8, loyalty_9, loyalty_10
Ordered factor. Loyalty items (1-7 Likert scale)
Details
This dataset represents a common scenario in survey research: multiple items measuring similar constructs lead to redundancy and multicollinearity. Items within each construct are correlated (satisfaction, engagement, loyalty), and the constructs themselves are inter-correlated.
Use case: Demonstrating assocSelect() for identifying redundant questionnaire
items in mixed-type data (ordered factors + numeric variables).
Source
Simulated data based on typical customer satisfaction survey structures
Examples
data(survey_example)
# This dataset has mixed types: numeric (age, overall_satisfaction),
# an unordered factor (gender), and ordered factors (education, Likert items)
str(survey_example[, 1:10])
# Use assocSelect() for mixed-type data pruning
# This may take a few seconds with 34 variables
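res <- assocSelect(survey_example[, -1], # Exclude respondent_id
threshold = 0.8,
method_ord_ord = "spearman")
# assocSelect() returns a CorrCombo object, so inspect its subset_list slot
max(lengths(res@subset_list)) # size of the largest valid subset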