| Type: | Package |
| Title: | Evolutionary Feature Engineering |
| Version: | 0.1.0 |
| Description: | Automates feature engineering using evolutionary algorithms inspired by genetic programming. Starting from raw input features, the package evolves candidate transformation recipes through selection, crossover, and mutation, evaluating fitness via cross-validation or train/validation splits with gradient-boosted tree models ('LightGBM' or 'XGBoost'). Built-in transformers include arithmetic, logarithmic, and power operations, interaction terms, target encoding, quantile and log-based binning, principal component analysis, truncated singular value decomposition, Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction, and minimum spanning tree (MST) graph-based clustering. The evolutionary search yields an optimised feature recipe that can be applied to new data for prediction. Methods are described in McInnes et al. (2018) <doi:10.21105/joss.00861>, Ke et al. (2017) <https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-framework>, Chen and Guestrin (2016) <doi:10.1145/2939672.2939785>, Gagolewski (2021) <doi:10.1016/j.softx.2021.100722>, Gagolewski (2026) <doi:10.32614/CRAN.package.lumbermark>, and Gagolewski (2026) <doi:10.32614/CRAN.package.deadwood>. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| Imports: | data.table, lightgbm, xgboost, stats, digest, uwot, quitefastmst, genieclust |
| Suggests: | RhpcBLASctl, testthat, knitr, rmarkdown, lumbermark, deadwood |
| VignetteBuilder: | knitr |
| Config/roxygen2/version: | 8.0.0 |
| RoxygenNote: | 7.3.2 |
| NeedsCompilation: | no |
| Packaged: | 2026-06-04 20:51:19 UTC; vero |
| Author: | Gustavo Pereira [aut, cre] |
| Maintainer: | Gustavo Pereira <tanopereira@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-09 15:50:14 UTC |
Apply a single gene to a dataset
Description
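Applies one gene's transformation to the training data and, if supplied, the validation data, appending the resulting feature column.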
Usage
apply_gene(
gene,
train_data,
val_data = NULL,
target_col = NULL,
state_cache = NULL,
data_hash = NULL
)
Arguments
gene |
A gene list representing a feature transformation. |
train_data |
A data.frame or data.table representing the training data. |
val_data |
Optional validation data.frame or data.table. |
target_col |
Name of the target column. |
state_cache |
Optional environment to cache full-dataset fitted states of stateful transformers. |
data_hash |
Optional pre-computed xxhash64 digest of the target column, to avoid redundant hashing when applying multiple genes. |
Value
A list with three elements: train (the modified training
data.table with the new gene column appended), val (the
modified validation data.table or NULL), and gene
(the gene list, with its state element populated if the transformer
is stateful).
Apply an entire individual's recipe to data
Description
Applies every gene in an individual's recipe to the training data and, if supplied, the validation data.
Usage
apply_individual(
ind,
train_data,
val_data = NULL,
target_col = NULL,
state_cache = NULL
)
Arguments
ind |
An evo_individual object. |
train_data |
A data.frame or data.table representing the training data. |
val_data |
Optional validation data.frame or data.table. |
target_col |
Name of the target column. |
state_cache |
Optional environment to cache full-dataset fitted states of stateful transformers. |
Value
A list with three elements: train (the transformed training
data.table with all gene columns applied), val (the
transformed validation data.table or NULL), and ind
(the updated evo_individual whose genes now carry fitted states).
Create a single gene
Description
Create a single gene
Usage
create_gene(transformer_name, input_cols)
Arguments
transformer_name |
Name of the transformer |
input_cols |
Vector of input column names |
Value
A gene list with elements transformer_name, input_cols,
params (transformer-specific parameters), state (NULL
until fitted), and output_col (auto-generated column name).
Create an individual
Description
Create an individual
Usage
create_individual(
genes = list(),
numeric_cols = character(0),
categorical_cols = character(0)
)
Arguments
genes |
List of genes |
numeric_cols |
Vector of numeric column names |
categorical_cols |
Vector of categorical column names |
Value
An evo_individual S3 object: a list with elements
genes (topologically sorted), numeric_cols,
categorical_cols, and fitness (initialised to
NA_real_).
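"Topologically sorted" means that a gene consuming another gene's output is placed after the gene that produces it. The base R sketch below illustrates one such ordering. The helper topo_sort_genes and the example genes are hypothetical and not part of the package, whose own implementation may differ.

```r
# Hypothetical helper: order genes so that producers precede consumers.
topo_sort_genes <- function(genes) {
  outs <- vapply(genes, function(g) g$output_col, character(1))
  sorted <- list()
  done <- character(0)
  remaining <- genes
  while (length(remaining) > 0) {
    # A gene is ready when every gene-derived input it needs is already placed.
    ready <- vapply(remaining, function(g) {
      deps <- intersect(g$input_cols, outs)
      all(deps %in% done)
    }, logical(1))
    if (!any(ready)) stop("cyclic dependency between genes")
    sorted <- c(sorted, remaining[ready])
    done <- c(done, vapply(remaining[ready], function(g) g$output_col, character(1)))
    remaining <- remaining[!ready]
  }
  sorted
}

# Example genes (assumed structure): the second feeds the first.
g_sqrt <- list(transformer_name = "sqrt", input_cols = "log(x)",
               output_col = "sqrt(log(x))")
g_log  <- list(transformer_name = "log", input_cols = "x",
               output_col = "log(x)")

sorted <- topo_sort_genes(list(g_sqrt, g_log))
sorted_outputs <- vapply(sorted, function(g) g$output_col, character(1))
```

Here sorted_outputs lists "log(x)" before "sqrt(log(x))", so the chained gene can read the column it depends on.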
Create a transformer definition
Description
Create a transformer definition
Usage
create_transformer(
name,
type,
input_type = "numeric",
output_type = "numeric",
fit_func = NULL,
apply_func,
name_generator,
allow_replace = FALSE
)
Arguments
name |
Transformer name |
type |
Transformer type: one of "unary", "binary", or "supervised_unary". |
input_type |
Type of input: "numeric" or "categorical" |
output_type |
Type of output: "numeric" or "categorical" |
fit_func |
Optional function(data, input_cols, target_col = NULL) that returns the fitted state. |
apply_func |
A function(data, input_cols, state = NULL) that returns the new column as a vector. |
name_generator |
A function(input_cols) that returns the output column name. |
allow_replace |
Logical. Whether column sampling allows replacement. |
Value
An evo_transformer S3 object: a list with elements
name, type, input_type, output_type,
fit_func, apply_func, name_generator, and
allow_replace.
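The three function arguments can be illustrated with a z-score transformer written in base R. This is a sketch under assumptions: the names zscore_fit, zscore_apply, and zscore_name are hypothetical, the output name format is a guess, and the create_transformer() call is left as a comment because it needs the package attached.

```r
# fit_func: learn the state (mean and sd) from the training data only.
zscore_fit <- function(data, input_cols, target_col = NULL) {
  x <- data[[input_cols[1]]]
  list(mean = mean(x, na.rm = TRUE), sd = stats::sd(x, na.rm = TRUE))
}

# apply_func: use the stored state to build the new column.
zscore_apply <- function(data, input_cols, state = NULL) {
  x <- data[[input_cols[1]]]
  # Guard against a zero or missing standard deviation (constant column).
  s <- if (is.na(state$sd) || state$sd == 0) 1 else state$sd
  (x - state$mean) / s
}

# name_generator: derive the output column name from the inputs.
zscore_name <- function(input_cols) paste0("zscore(", input_cols[1], ")")

train   <- data.frame(x = c(1, 2, 3, 4, 5))
newdata <- data.frame(x = c(3, 6))

state   <- zscore_fit(train, "x")
train_z <- zscore_apply(train, "x", state)
new_z   <- zscore_apply(newdata, "x", state)  # reuses the training state

# With the package attached, the pieces could be registered like this
# (the choice of type = "unary" is an assumption):
# zscore <- create_transformer(
#   name = "zscore", type = "unary",
#   fit_func = zscore_fit, apply_func = zscore_apply,
#   name_generator = zscore_name
# )
```

Keeping the fit and apply steps separate is what lets a state learned on training rows be reused on validation or new data without leaking information.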
Crossover two individuals
Description
Crossover two individuals
Usage
crossover(ind1, ind2, verbose = FALSE)
Arguments
ind1 |
Parent 1 |
ind2 |
Parent 2 |
verbose |
Logical. Whether to print crossover details. |
Value
An evo_individual child created by randomly sampling genes
from both parents with duplicate gene outputs removed.
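The idea of sampling genes from two parents and dropping duplicate outputs can be sketched in base R. This is not the package's code: sketch_crossover is a hypothetical helper, and keeping each pooled gene with probability 0.5 is an assumption made for illustration.

```r
# Hypothetical sketch: pool both parents' genes, drop duplicate outputs,
# then keep a random subset.
sketch_crossover <- function(genes1, genes2, keep_prob = 0.5) {
  pool <- c(genes1, genes2)
  outs <- vapply(pool, function(g) g$output_col, character(1))
  pool <- pool[!duplicated(outs)]
  keep <- stats::runif(length(pool)) < keep_prob
  pool[keep]
}

gene <- function(out) list(output_col = out)  # minimal stand-in for a gene
parent1 <- list(gene("log(x)"), gene("sqrt(y)"))
parent2 <- list(gene("log(x)"), gene("x*y"))

set.seed(1)
child <- sketch_crossover(parent1, parent2)
child_outputs <- vapply(child, function(g) g$output_col, character(1))
```

Whatever the random draw, child_outputs contains no repeated name and only outputs that appear in one of the parents.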
Evaluate the fitness of an individual
Description
Evaluate the fitness of an individual
Usage
evaluate_fitness(
ind,
data,
target_col,
task = "classification",
cv_folds = 3,
evaluation_strategy = "cv",
split_ids = NULL,
shared_splits = NULL,
evaluator = "lightgbm",
fold_ids = NULL,
shared_folds = NULL,
shared_full = NULL,
state_cache = NULL,
threads = 2
)
Arguments
ind |
An evo_individual object. |
data |
A data.frame or data.table containing the dataset. |
target_col |
Name of the target column. |
task |
"classification" or "regression". |
cv_folds |
Number of cross-validation folds. |
evaluation_strategy |
Character string, either "cv" (cross-validation) or "split" (train/validation split). |
split_ids |
Optional vector of pre-defined split assignments (e.g. "train", "val", "holdout"). |
shared_splits |
Optional list of shared data.table splits for in-place caching. |
evaluator |
The ML model to use ("lightgbm" or "xgboost"). |
fold_ids |
Optional vector of pre-defined fold assignments. |
shared_folds |
Optional list of shared data.table CV folds for in-place caching. |
shared_full |
Optional data.table of the full dataset for in-place caching. |
state_cache |
Optional environment to cache full-dataset fitted states of stateful transformers. |
threads |
Number of threads to use for parallel execution (default 2) |
Value
The input evo_individual with its fitness field set to
the computed score (higher is better), importances set to a named
numeric vector of feature importances, holdout_fitness set to
NULL, and genes updated with fitted transformer states.
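Supplying fold_ids keeps fold membership identical across individuals, so their fitness scores are comparable. The sketch below builds one such vector in base R. It assumes integer labels from 1 to cv_folds, a format the documentation does not specify, and the commented call assumes objects named ind and df exist.

```r
set.seed(42)
n <- 100        # number of rows in the dataset
cv_folds <- 3

# Balanced fold labels in random order, one per row.
fold_ids <- sample(rep(seq_len(cv_folds), length.out = n))
fold_sizes <- table(fold_ids)

# With the package attached, the vector could be passed on like this:
# ind <- evaluate_fitness(ind, df, target_col = "y",
#                         cv_folds = cv_folds, fold_ids = fold_ids)
```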
Evaluate holdout fitness for an individual
Description
Evaluate holdout fitness for an individual
Usage
evaluate_holdout_fitness(
ind,
data,
split_ids,
shared_splits,
target_col,
task,
evaluator,
threads,
state_cache,
classes,
num_class
)
Evaluate all unevaluated individuals in a population
Description
Evaluate all unevaluated individuals in a population
Usage
evaluate_pop(
pop,
data,
target_col,
task,
cv_folds,
evaluation_strategy,
split_ids,
shared_splits,
evaluator,
fold_ids,
shared_folds,
shared_full,
state_cache,
fitness_cache,
threads,
verbose,
running_best_fitness
)
Built-in feature transformers
Description
A list of default transformer definitions available for feature engineering.
Usage
evo_transformers
Value
A named list of evo_transformer objects, each defining a
feature transformation (e.g. log, pca, target_encode).
Run evolutionary feature engineering
Description
Run evolutionary feature engineering
Usage
evolve_features(
data,
target_col,
task = "classification",
generations = 10,
pop_size = 10,
cv_folds = 3,
evaluation_strategy = "cv",
split_ratio = c(0.6, 0.2, 0.2),
split_ids = NULL,
early_stopping_rounds = 3,
evaluator = "lightgbm",
dynamic_population = TRUE,
crossover_type = "both",
threads = 2,
max_clustering_size = 5000,
verbose = TRUE
)
Arguments
data |
A data.frame or data.table |
target_col |
Name of the target column |
task |
"classification" or "regression" |
generations |
Number of generations (max iterations) |
pop_size |
Population size |
cv_folds |
Number of cross-validation folds |
evaluation_strategy |
"cv" or "split". Strategy to evaluate candidate recipes. |
split_ratio |
A numeric vector of length 2 or 3 defining train/validation/holdout proportions (e.g. c(0.6, 0.2, 0.2)). |
split_ids |
An optional character vector of split assignments (e.g. "train", "val", "holdout"). |
early_stopping_rounds |
Stop if fitness doesn't improve for this many generations |
evaluator |
The ML model to use ("lightgbm" or "xgboost") |
dynamic_population |
Logical. If TRUE, population expands dynamically during stagnation. |
crossover_type |
Crossover type: "both" (default, 50% random / 50% union), "random", or "union" |
threads |
Number of threads to use for parallel execution (default 2) |
max_clustering_size |
Maximum number of unique training rows to cluster (default 5000; 0 or NULL for no limit). |
verbose |
Logical. If TRUE, prints progress. |
Value
An evo_recipe S3 object: a list with elements
best_individual (the top-scoring evo_individual),
history (list of all evaluated individuals across generations),
task, best_model (the trained model object),
evaluator, and classes (class levels for multiclass tasks,
otherwise NULL).
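A typical workflow might look like the sketch below, which builds a toy dataset and an explicit split_ids vector matching the default split_ratio. The data and column names are made up, treating a 0/1 integer target as valid for classification is an assumption, and the package calls run only if evolve_features() is available in the session.

```r
set.seed(1)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = runif(n, 1, 10))
# Binary target driven by a nonlinear interaction of the two features.
df$y <- as.integer(df$x1 * log(df$x2) + rnorm(n, sd = 0.5) > 0)

# Explicit split assignments in the proportions 0.6 / 0.2 / 0.2.
split_ratio <- c(0.6, 0.2, 0.2)
split_ids <- sample(rep(c("train", "val", "holdout"),
                        times = round(split_ratio * n)))

# Run the search only when the package is attached.
if (exists("evolve_features", mode = "function")) {
  recipe <- evolve_features(
    data = df, target_col = "y", task = "classification",
    generations = 5, pop_size = 8,
    evaluation_strategy = "split", split_ids = split_ids,
    evaluator = "lightgbm", threads = 2, verbose = FALSE
  )
  engineered <- predict(recipe, df)        # engineered feature columns
  preds      <- predict_model(recipe, df)  # model predictions
}
```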
Convert a gene to a formula string
Description
Convert a gene to a formula string
Usage
gene_to_formula(gene)
Arguments
gene |
A gene list |
Value
A character string representing the gene as a human-readable
formula, e.g. "log(col1)" or "pca2(col1, col2)".
Convert a gene to a formula string for state caching (ignoring component index)
Description
Convert a gene to a formula string for state caching (ignoring component index)
Usage
gene_to_state_formula(gene)
Arguments
gene |
A gene list |
Value
A character string representing the gene formula suitable for state caching. For multi-component transformers (PCA, SVD, UMAP) the component index is omitted so that all components share one cache key.
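The benefit of a shared cache key is that a multi-component decomposition is fitted once and reused for every component. The base R sketch below illustrates the idea with an environment as the cache. The helpers state_key and get_state are hypothetical, and the key format (dropping the digits after "pca", "svd", or "umap") is an assumption about how the names look.

```r
# Hypothetical key: "pca2(a, b)" and "pca1(a, b)" both map to "pca(a, b)".
state_key <- function(formula) {
  sub("^(pca|svd|umap)[0-9]+\\(", "\\1(", formula)
}

state_cache <- new.env()
n_fits <- 0  # counts how often the expensive fit actually runs

get_state <- function(formula, fit) {
  key <- state_key(formula)
  if (!exists(key, envir = state_cache, inherits = FALSE)) {
    n_fits <<- n_fits + 1
    assign(key, fit(), envir = state_cache)
  }
  get(key, envir = state_cache, inherits = FALSE)
}

fit_pca <- function() {
  stats::prcomp(data.frame(a = c(1, 2, 3, 4), b = c(2, 1, 4, 3)))
}

s1 <- get_state("pca1(a, b)", fit_pca)  # fits and stores the state
s2 <- get_state("pca2(a, b)", fit_pca)  # served from the cache
```

After both calls n_fits is 1, because the second component reuses the state fitted for the first.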
Convert an individual to a recipe string of formulas
Description
Convert an individual to a recipe string of formulas
Usage
individual_to_recipe_string(ind)
Arguments
ind |
An evo_individual |
Value
A character string listing all gene formulas in bracket notation,
e.g. "[log(x), sqrt(y)]", or "[Original features only]"
when the individual has no genes.
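The bracket notation described above can be reproduced in a few lines of base R. The helper sketch_recipe_string is hypothetical and takes formula strings directly rather than an evo_individual.

```r
# Hypothetical helper: join gene formulas into the documented bracket notation.
sketch_recipe_string <- function(formulas) {
  if (length(formulas) == 0) {
    return("[Original features only]")
  }
  paste0("[", paste(formulas, collapse = ", "), "]")
}

with_genes <- sketch_recipe_string(c("log(x)", "sqrt(y)"))
no_genes   <- sketch_recipe_string(character(0))
```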
Initialize a population
Description
Initialize a population
Usage
initialize_population(
pop_size,
numeric_cols,
categorical_cols,
initial_genes = 2,
task = "classification"
)
Arguments
pop_size |
Population size. |
numeric_cols |
Vector of numeric column names. |
categorical_cols |
Vector of categorical column names. |
initial_genes |
Number of initial genes per individual. |
task |
Task type ("classification", "regression", or "multiclass"). |
Value
A list of evo_individual objects of length pop_size.
The first individual is a baseline with no genes; the remaining individuals
each carry initial_genes randomly generated genes.
Check whether a candidate individual is a duplicate or known-inferior
Description
Check whether a candidate individual is a duplicate or known-inferior
Usage
is_invalid_individual(c_ind, pop_list, cache, best_fit)
Mutate an individual
Description
Mutate an individual
Usage
mutate(
ind,
verbose = FALSE,
force_add = FALSE,
importances = numeric(0),
temperature = 1,
task = "classification",
tested_gene_outputs = NULL
)
Arguments
ind |
An evo_individual. |
verbose |
Logical. Whether to print mutation details. |
force_add |
Logical. If TRUE, forces adding a new gene. |
importances |
A numeric vector of feature importances. |
temperature |
A numeric temperature value controlling selection weights. |
task |
The task type ("classification", "regression", or "multiclass") |
tested_gene_outputs |
Character vector of gene output names that have been evaluated in a previous generation and are safe for chaining. When NULL (default), all existing gene outputs are available. Pass character(0) to block all chaining (e.g. during initialization). |
Value
An evo_individual with the mutation applied (gene added,
removed, or modified) and fitness reset to NA_real_.
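One common way for a temperature to control selection weights is a softmax over the importances, sketched below in base R. Whether mutate() uses this exact formula is not stated in the documentation, so treat selection_weights as a hypothetical illustration of the concept.

```r
# Hypothetical softmax weighting: low temperature sharpens the preference
# for important features, high temperature flattens it.
selection_weights <- function(importances, temperature = 1) {
  z <- importances / temperature
  z <- z - max(z)  # subtract the maximum for numerical stability
  w <- exp(z)
  w / sum(w)
}

imp <- c(x1 = 0.6, x2 = 0.3, x3 = 0.1)
w_cold <- selection_weights(imp, temperature = 0.1)  # close to greedy
w_hot  <- selection_weights(imp, temperature = 10)   # close to uniform
```

Under this scheme, a high temperature early in the search would favour exploration, while a low temperature concentrates mutations on features the model already finds useful.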
Apply feature engineering recipe to new data
Description
Apply feature engineering recipe to new data
Usage
## S3 method for class 'evo_recipe'
predict(object, newdata, ...)
Arguments
object |
An evo_recipe object |
newdata |
A data.frame or data.table |
... |
Additional arguments |
Value
A data.table containing the engineered feature columns
(original plus all gene-derived columns) for newdata, ready for
downstream modelling.
Predict target values using the fully evolved model
Description
Predict target values using the fully evolved model
Usage
predict_model(object, newdata, ...)
Arguments
object |
An evo_recipe object containing the trained model and best individual |
newdata |
A data.frame or data.table to make predictions on |
... |
Additional arguments (currently unused) |
Value
For binary classification and regression tasks a numeric vector of predictions. For multiclass tasks a numeric matrix with one column per class (columns named after class levels).
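For multiclass tasks, the returned matrix can be turned into predicted labels by picking the column with the largest value in each row. The sketch below uses a made-up matrix in place of real predict_model() output, and it assumes the values are per-class scores where larger means more likely.

```r
# Stand-in for a multiclass prediction matrix (columns named after classes).
scores <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# Index of the best-scoring class per row, mapped back to class names.
labels <- colnames(scores)[max.col(scores, ties.method = "first")]
```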
Stratified or random splitting helper
Description
Stratified or random splitting helper
Usage
stratified_split(y, ratio)
Train a boosted tree model
Description
Internal helper that encapsulates LightGBM / XGBoost parameter construction and training. Returns the fitted model, optional predictions on validation data, and feature importances.
Usage
train_model(
x_train,
y_train,
x_val = NULL,
task = "classification",
evaluator = "lightgbm",
threads = 2,
num_class = NULL,
nrounds = 50
)
Arguments
x_train |
Numeric matrix of training features. |
y_train |
Numeric vector of training labels. |
x_val |
Optional numeric matrix of validation features. |
task |
Task type: "classification", "multiclass", or "regression". |
evaluator |
Model type: "lightgbm" or "xgboost". |
threads |
Number of threads. |
num_class |
Number of classes (required for multiclass). |
nrounds |
Number of boosting rounds. |
Value
A list with elements model, predictions (NULL when
x_val is NULL), and importances (named numeric vector or
NULL).
Union Crossover of two individuals
Description
Union Crossover of two individuals
Usage
union_crossover(ind1, ind2, verbose = FALSE)
Arguments
ind1 |
Parent 1 |
ind2 |
Parent 2 |
verbose |
Logical. Whether to print crossover details. |
Value
An evo_individual child created by taking the union of all
genes from both parents with duplicate gene outputs removed.
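Unlike random crossover, the union variant keeps every distinct gene from both parents. A base R sketch of that behaviour follows. The helper sketch_union and the minimal gene lists are hypothetical, and duplicates are identified by output_col, as the Value description suggests.

```r
# Hypothetical sketch: concatenate both parents' genes and keep the first
# occurrence of each output column.
sketch_union <- function(genes1, genes2) {
  pool <- c(genes1, genes2)
  outs <- vapply(pool, function(g) g$output_col, character(1))
  pool[!duplicated(outs)]
}

parent1 <- list(list(output_col = "log(x)"), list(output_col = "sqrt(y)"))
parent2 <- list(list(output_col = "log(x)"), list(output_col = "x*y"))

child <- sketch_union(parent1, parent2)
child_outputs <- vapply(child, function(g) g$output_col, character(1))
```

The shared gene "log(x)" appears once in the child, followed by the genes unique to each parent.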