Benchmark genomic-selection models — classic and machine-learning — from SNP marker data, through one interface, with breeding-relevant cross-validation and honest accuracy reporting.

The problem GSbench addresses: people increasingly throw `glmnet`, `ranger`, or `xgboost` at marker matrices, but hand-roll the cross-validation (often incorrectly) and compare models on unequal footing. GSbench fits the standard baselines (GBLUP, ridge marker effects) and the ML methods behind a single `gs_fit()`/`predict()` API, runs them through the same CV, and reports predictive ability you can actually trust, plus a stacked ensemble that combines them.

```r
# install.packages("remotes")
remotes::install_github("mqfarooqi1/GSbench")
```

Only `graphics`, `stats` and `withr` are required. The ML backends (`glmnet`, `ranger`, `xgboost`) are optional (Suggests); install whichever you want to use.

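
To see which of the optional backends are already installed before fitting anything, a base-R check along these lines should work. This is only a sketch using `requireNamespace()`; inside the package, `available_models()` is the intended way to ask the same question.

```r
# Optional ML backends named in this README (listed under Suggests).
backends <- c("glmnet", "ranger", "xgboost")

# requireNamespace() returns TRUE/FALSE and does not attach the package.
installed <- vapply(backends, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Report the backends that are still missing, if any.
missing_backends <- backends[!installed]
if (length(missing_backends) > 0) {
  message("Not installed: ", paste(missing_backends, collapse = ", "))
  # install.packages(missing_backends)  # uncomment to install from CRAN
}
```
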
```r
library(GSbench)

sim <- simulate_population(n = 300, m = 2000, h2 = 0.5, seed = 1)

# one model
fit <- gs_fit(sim$pheno, sim$geno, model = "gblup")
gebv <- predict(fit, sim$geno)

# compare every available model (incl. the stacked ensemble) under one CV
bench <- gs_benchmark(sim$pheno, sim$geno, k = 5, seed = 1)
bench
plot(bench)
```

```
         model  mean    sd n_folds
   elastic_net 0.367 0.187       5
         gblup 0.334 0.189       5
      ensemble 0.328 0.165       5
 random_forest 0.269 0.185       5
       xgboost 0.185 0.318       5
(accuracy = predictive ability, cor(pred, observed) on held-out data)
```

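
For readers who want to see what "predictive ability on held-out data" means mechanically, here is a minimal base-R sketch of k-fold cross-validation. It does not use GSbench and is not the package's implementation: the toy data, the ridge helper `ridge_predict()`, the function `cv_predictive_ability()` and the penalty `lambda = 100` are all illustrative assumptions. The point it demonstrates is that centring and fitting use training rows only, and the correlation is computed only on rows the model never saw.

```r
set.seed(1)

# Toy data (assumption: genotypes coded 0/1/2, purely additive effects).
n <- 200; m <- 500
geno <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n, ncol = m)
g <- as.vector(geno %*% rnorm(m, sd = 0.1))   # true genetic values
pheno <- g + rnorm(n, sd = sd(g))             # roughly h2 = 0.5

# Hypothetical helper: ridge regression on marker effects (dual form).
# Everything it learns (mean, column centres, effects) comes from training rows.
ridge_predict <- function(X_train, y_train, X_test, lambda) {
  mu <- mean(y_train)
  centers <- colMeans(X_train)
  Z_train <- sweep(X_train, 2, centers)
  Z_test <- sweep(X_test, 2, centers)
  K <- tcrossprod(Z_train)
  alpha <- solve(K + diag(lambda, nrow(K)), y_train - mu)
  as.vector(mu + Z_test %*% crossprod(Z_train, alpha))
}

# One accuracy per fold: cor(predicted, observed) on the held-out rows.
cv_predictive_ability <- function(y, X, fold_id, lambda = 100) {
  vapply(unique(fold_id), function(f) {
    test <- fold_id == f
    pred <- ridge_predict(X[!test, , drop = FALSE], y[!test],
                          X[test, , drop = FALSE], lambda)
    cor(pred, y[test])
  }, numeric(1))
}

k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))  # random k-fold assignment
acc <- cv_predictive_ability(pheno, geno, fold_id)
round(c(mean = mean(acc), sd = sd(acc)), 3)
```

Because the folds are just a label per individual, passing a family or environment label as `fold_id` turns the same loop into leave-one-group-out validation.
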
Core (base R, no compiled code, no heavy deps):

| Function | Purpose |
|---|---|
| `simulate_population()` | Reproducible SNP + phenotype simulator with known h² |
| `qc_markers()`, `impute_markers()` | Call-rate / MAF / monomorphic filtering, mean imputation |
| `Gmatrix()` | VanRaden additive genomic relationship matrix |
| `gblup()` | GBLUP by REML, validated to match `rrBLUP::mixed.solve` to 6×10⁻⁵ |

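
As a rough illustration of what the marker-preparation and relationship-matrix steps involve, the sketch below does mean imputation, drops monomorphic markers, and builds a VanRaden-style matrix G = ZZ' / (2 Σ p(1 − p)) in base R. It is written from the textbook formula, not from the package source, so `Gmatrix()` may differ in details such as how allele frequencies are estimated; the toy matrix `M` and all variable names are assumptions.

```r
set.seed(1)

# Toy genotype matrix (assumption: 0/1/2 coding) with some missing calls.
n <- 50; m <- 400
M <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n, ncol = m)
M[sample(length(M), 100)] <- NA

# Mean imputation: replace each missing call with its marker (column) mean.
col_means <- colMeans(M, na.rm = TRUE)
na_idx <- which(is.na(M), arr.ind = TRUE)
M[na_idx] <- col_means[na_idx[, "col"]]

# Allele frequencies; drop monomorphic markers (they carry no information).
p <- colMeans(M) / 2
keep <- p > 0 & p < 1
M <- M[, keep, drop = FALSE]
p <- p[keep]

# VanRaden additive relationship matrix: centre by 2p, scale by 2 * sum(p * q).
Z <- sweep(M, 2, 2 * p)
G <- tcrossprod(Z) / (2 * sum(p * (1 - p)))

dim(G)          # n x n
mean(diag(G))   # close to 1 for a population near Hardy-Weinberg proportions
```
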
Modelling & evaluation:

| Function | Purpose |
|---|---|
| `gs_fit()` / `predict()` | Unified interface: `"gblup"`, `"elastic_net"`, `"random_forest"`, `"xgboost"`, `"ensemble"` |
| `gs_cv()` | Cross-validation: random k-fold (CV1) or leave-one-group-out (family/environment) |
| `gs_ensemble()` | Stacked super-learner: combines base models with non-negative CV-learned weights |
| `gs_benchmark()` + `plot()` | Run all available models through one CV and compare |
| `available_models()` | Which models are usable in your session |

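
The idea behind stacking with non-negative, CV-learned weights can be sketched in a few lines of base R. This is an illustration of the general technique, not the code inside `gs_ensemble()`: the matrix `P` stands in for out-of-fold predictions from three base models (simulated here), `stack_weights()` is a hypothetical helper, and the non-negativity constraint is imposed with `optim()`'s box-constrained `"L-BFGS-B"` method rather than a dedicated NNLS solver.

```r
set.seed(1)

# Stand-ins for out-of-fold predictions from three base models (assumption):
# two informative models with different noise levels and one that is pure noise.
n <- 300
signal <- rnorm(n)
y <- signal + rnorm(n)
P <- cbind(
  model_a = signal + rnorm(n, sd = 0.5),
  model_b = signal + rnorm(n, sd = 1.0),
  model_c = rnorm(n)
)

# Hypothetical helper: least-squares stacking weights constrained to be >= 0.
stack_weights <- function(P, y) {
  yc <- y - mean(y)
  Pc <- scale(P, center = TRUE, scale = FALSE)
  loss <- function(w) sum((yc - Pc %*% w)^2)
  grad <- function(w) as.vector(-2 * crossprod(Pc, yc - Pc %*% w))
  fit <- optim(rep(1 / ncol(P), ncol(P)), loss, grad,
               method = "L-BFGS-B", lower = 0)
  setNames(fit$par, colnames(P))
}

w <- stack_weights(P, y)
round(w, 3)

# Ensemble prediction: weighted combination of the centred base predictions.
stacked <- as.vector(scale(P, center = TRUE, scale = FALSE) %*% w) + mean(y)
cor(stacked, y)
```

Learning the weights from out-of-fold predictions, rather than from in-sample fits, keeps a base model from earning weight simply by overfitting the training data.
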

`gblup()` is validated against `rrBLUP` in the test suite: same variance components, GEBVs correlating at 1.0.

Muhammad Farooqi · https://github.com/mqfarooqi1