GSbench

Benchmark genomic-selection models — classic and machine-learning — from SNP marker data, through one interface, with breeding-relevant cross-validation and honest accuracy reporting.

The problem GSbench addresses: people increasingly throw glmnet, ranger, or xgboost at marker matrices, but hand-roll the cross-validation (often incorrectly) and compare models on unequal footing. GSbench fits the standard baselines (GBLUP, ridge marker effects) and the ML methods behind a single gs_fit()/predict() API, runs them through the same CV, and reports predictive ability you can actually trust — plus a stacked ensemble that combines them.

Installation

# install.packages("remotes")
remotes::install_github("mqfarooqi1/GSbench")

Only graphics, stats and withr are required. The ML backends — glmnet, ranger, xgboost — are optional (Suggests); install whichever you want to use.

Quick start

library(GSbench)

sim <- simulate_population(n = 300, m = 2000, h2 = 0.5, seed = 1)

# one model
fit <- gs_fit(sim$pheno, sim$geno, model = "gblup")
gebv <- predict(fit, sim$geno)

# compare every available model (incl. the stacked ensemble) under one CV
bench <- gs_benchmark(sim$pheno, sim$geno, k = 5, seed = 1)
bench
plot(bench)

         model  mean    sd n_folds
   elastic_net 0.367 0.187       5
         gblup 0.334 0.189       5
      ensemble 0.328 0.165       5
 random_forest 0.269 0.185       5
       xgboost 0.185 0.318       5
  (accuracy = predictive ability, cor(pred, observed) on held-out data)

What’s in it

Core (base R, no compiled code, no heavy deps):

Function	Purpose
`simulate_population()`	Reproducible SNP + phenotype simulator with known h²
`qc_markers()`, `impute_markers()`	Call-rate / MAF / monomorphic filtering, mean imputation
`Gmatrix()`	VanRaden additive genomic relationship matrix
`gblup()`	GBLUP by REML — validated to match `rrBLUP::mixed.solve` to 6×10⁻⁵

Modelling & evaluation:

Function	Purpose
`gs_fit()` / `predict()`	Unified interface: `"gblup"`, `"elastic_net"`, `"random_forest"`, `"xgboost"`, `"ensemble"`
`gs_cv()`	Cross-validation: random k-fold (CV1) or leave-one-group-out (family/environment)
`gs_ensemble()`	Stacked super-learner — combines base models with non-negative CV-learned weights
`gs_benchmark()` + `plot()`	Run all available models through one CV and compare
`available_models()`	Which models are usable in your session

Why the methods are trustworthy

GBLUP is built from scratch in base R (spectral REML, the Endelman 2011 / EMMA method) and is numerically validated against rrBLUP in the test suite — same variance components, GEBVs correlating at 1.0.
Cross-validation is the part people get wrong, so it’s the part GSbench is opinionated about: correct fold construction, leave-group-out for family/environment structure, and accuracy aggregated across folds.
The stacked ensemble is the Breiman / van der Laan super-learner: base models are combined by weights fit to their out-of-fold predictions (non-negative, summing to one). It tends to match or beat the best single model without you having to know which that is in advance.

Honest limitations

Single trait, single environment. Multi-trait and GxE (CV2) models are not here yet — that’s the obvious next direction.
Pure-R performance. The GBLUP solver eigendecomposes an n×n matrix; fine for typical breeding populations (hundreds–few thousand lines), but very large panels would want a C++ backend.
Imputation is simple (marker means); model-based imputation upstream is better for real data.
The simulator is for demos/tests — bring your own genotypes and phenotypes for real work.

References

VanRaden, P. M. (2008) J. Dairy Sci. 91:4414–4423. doi:10.3168/jds.2007-0980
Endelman, J. B. (2011) Plant Genome 4:250–255. doi:10.3835/plantgenome2011.08.0024
Meuwissen, Hayes & Goddard (2001) Genetics 157:1819–1829. doi:10.1093/genetics/157.4.1819
van der Laan, Polley & Hubbard (2007) Stat. Appl. Genet. Mol. Biol. 6:Art.25. doi:10.2202/1544-6115.1309

Muhammad Farooqi · https://github.com/mqfarooqi1