| Title: | Semi-Supervised Deconvolution of Bulk RNA-Seq Data with Hyperparameter Optimization |
| Version: | 1.0.2 |
| Description: | Performs semi-supervised deconvolution of bulk RNA sequencing (RNA-seq) data. Known cell-type proportions are estimated using supervised methods – 'CIBERSORTx' (CSx), 'CIBERSORT' (CS), 'FARDEEP' (Fast And Robust DEconvolution of Expression Profiles), and 'DCQ' (Digital Cell Quantifier) – while unknown components are inferred using non-negative matrix factorization ('NMF') with limited-memory Broyden-Fletcher-Goldfarb-Shanno with bounds ('L-BFGS-B') optimization. Hyperparameters are selected automatically using a Pareto-frontier-based approach with knee-point detection, allowing application when the reference signature matrix is incomplete. More details about 'DICEpro' can be found in Ba et al. (2026) <doi:10.64898/2026.06.17.732876>. |
| Depends: | R (≥ 4.1.0) |
| Encoding: | UTF-8 |
| License: | MIT + file LICENSE |
| LazyData: | true |
| LazyDataCompression: | xz |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| LinkingTo: | Rcpp (≥ 1.0.12), RcppEigen |
| Imports: | Rcpp (≥ 1.0.12), lbfgsb3c, KraljicMatrix, ggplot2, patchwork, FARDEEP, ComICS, lme4, e1071, preprocessCore, parallel |
| Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
| URL: | https://github.com/kalidouBA/dicepro |
| BugReports: | https://github.com/kalidouBA/dicepro/issues |
| Config/roxygen2/version: | 8.0.0 |
| RoxygenNote: | 7.3.3 |
| NeedsCompilation: | yes |
| Packaged: | 2026-06-24 10:46:56 UTC; kalidou |
| Author: | Kalidou BA [aut, cre], Boris P. Hejblum [aut] |
| Maintainer: | Kalidou BA <sistm.soft.maintain@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-30 12:20:14 UTC |
dicepro: Semi-Supervised Deconvolution of Bulk RNA-Seq Data with Hyperparameter Optimization
Description
Performs semi-supervised deconvolution of bulk RNA sequencing (RNA-seq) data. Known cell-type proportions are estimated using supervised methods – 'CIBERSORTx' (CSx), 'CIBERSORT' (CS), 'FARDEEP' (Fast And Robust DEconvolution of Expression Profiles), and 'DCQ' (Digital Cell Quantifier) – while unknown components are inferred using non-negative matrix factorization ('NMF') with limited-memory Broyden-Fletcher-Goldfarb-Shanno with bounds ('L-BFGS-B') optimization. Hyperparameters are selected automatically using a Pareto-frontier-based approach with knee-point detection, allowing application when the reference signature matrix is incomplete. More details about 'DICEpro' can be found in Ba et al. (2026) doi:10.64898/2026.06.17.732876.
Author(s)
Maintainer: Kalidou BA sistm.soft.maintain@gmail.com
Authors:
Boris P. Hejblum boris.hejblum@u-bordeaux.fr
See Also
Useful links:
BlueCode Reference Signature Matrix
Description
Gene expression reference matrix for bulk RNA-seq deconvolution, derived from sorted cell populations spanning Immune, Stromal, Endothelial, epithelial, and muscle compartments. BlueCode serves as the default reference for dicepro and directly informs the hierarchical Dirichlet simulation framework.
Usage
data(BlueCode)
Format
A numeric matrix of dimensions G \times 34, where G
denotes the number of genes after quality filtering. Rows are labeled
by HGNC gene symbol; columns are labeled by cell-type name. All values
are non-negative read counts prior to normalization; apply
.normalize_zscore_per_gene or TPM scaling before use in
downstream analyses.
Details
Construction
BlueCode was built from publicly available sorted bulk RNA-seq profiles downloaded from the ENCODE and GEO repositories. For each cell population, replicate profiles were averaged after quantile normalization. Genes with zero variance across all 34 profiles, or with missing values in more than 10 \ log-transformed so that it can be used directly with both Gaussian and Poisson deconvolution models.
Compartment structure
The 34 cell types are partitioned into five major tissue compartments. The column ordering in the matrix follows this partition exactly (columns 1–9, 10–17, 18–20, 21–25, 26–34).
Immune compartment – columns 1–9:
-
B.cell.naive -
B.cell.memory -
T.Cell.CD4 -
T.cell.CD4.memory -
T.cell.CD8.memory -
NK.cell.CD56 -
Monocyte.CD14 -
Macrophage -
Mature.neutrophil
Stromal compartment – columns 10–17:
-
Fibroblast.cardiac.ventricle -
Fibroblast.of.arm -
Fibroblast.of.lung -
Fibroblast.of.the.aortic.adventitia -
Fibroblast.or.papilla.dermal.cell -
MSC.like.pluripotent.cell -
Chondrocyte.articular -
Osteoblast
Endothelial compartment – columns 18–20:
-
Endothelial.large.blood.vessel -
Endothelial.microvascular.mammary.or.endometrial -
Endothelial.microvascular.non.reproductive
Epithelial compartment – columns 21–25:
-
Epithelial.cell.mammary -
Epithelial.renal.cortical -
Epithelial.respiratory -
Hair.follicular.keratinocyte -
Melanocyte.of.skin
Muscle compartment – columns 26–34:
-
Smooth.muscle.cell.aortic -
Smooth.muscle.cell.bronchial -
Smooth.muscle.cell.coronary.artery -
Smooth.muscle.cell.pulmonary.artery -
Smooth.muscle.cell.of.bladder -
Smooth.muscle.cell.of.trachea -
Smooth.muscle.cell.uterine -
Myocyte.regular.cardiac -
Myometrial.cell
Link to the simulation framework
This compartment structure directly informs the hierarchical Dirichlet
model in generateProp with scenario = "hierarchical".
The Dirichlet concentration parameters at each level
(\alpha_\text{Immune} = 6, \alpha_\text{Stromal} = 2,
\alpha_\text{Endothelial} = \alpha_\text{Epithelial} = 1.5,
\alpha_\text{Muscle} = 1) reflect the expected relative abundance
of each compartment in mixed tissue samples, so that simulated proportions
are biologically realistic with respect to BlueCode cell-type coverage.
Recommended usage
data(BlueCode) # Quick inspection dim(BlueCode) # G x 34 colnames(BlueCode) # 34 cell-type labels # Use directly in dicepro out <- dicepro( reference = BlueCode, bulk = my_bulk_matrix, methodDeconv = "FARDEEP" )
Source
BlueCode reference matrix constructed for the dicepro benchmarking framework from ENCODE (https://www.encodeproject.org) and GEO (https://www.ncbi.nlm.nih.gov/geo/) sorted RNA-seq profiles. See the accompanying manuscript for full details on data sources, replicate averaging, quality filtering, and normalization.
CellMixtures Bulk RNA-seq Dataset
Description
Experimentally constructed bulk RNA-seq mixtures of sorted cell populations, designed for benchmarking deconvolution algorithms. Each sample is a known mixture of cell types drawn from the BlueCode reference panel, making ground-truth proportions available for quantitative evaluation.
Usage
data(CellMixtures)
Format
A numeric matrix of dimensions G \times 12, where G
denotes the number of genes (~31 400 after quality filtering). Rows are
labeled by HGNC gene symbol; columns are labeled A through
L, corresponding to 12 experimentally mixed samples. Values are
raw read counts prior to normalization.
Details
Experimental design
The 12 mixture samples (A–L) were prepared by combining sorted cell
populations at known proportions across the five tissue compartments
represented in BlueCode. Each sample targets a distinct
mixture composition, providing a controlled benchmark spanning a wide
range of cell-type abundance profiles – from immune-dominated samples
to Stromal- or muscle-enriched configurations.
Gene coverage
CellMixtures covers approximately 31 400 gene symbols, a super-set of
those present in BlueCode. dicepro automatically intersects the two
gene sets before deconvolution; the effective number of informative
genes is therefore determined by the BlueCode reference (see
BlueCode for details on its gene filtering criteria).
A gene-overlap check prior to running dicepro is
recommended:
data(BlueCode)
data(CellMixtures)
n_common <- length(intersect(rownames(BlueCode), rownames(CellMixtures)))
cat(sprintf("Common genes: %d / %d reference genes\n",
n_common, nrow(BlueCode)))
normalization
Raw counts should be normalized before deconvolution.
dicepro applies .normalize_zscore_per_gene internally;
no preprocessing is required when using the main function.
For exploratory analyses, log-CPM or TPM normalization is recommended:
# log2-CPM normalization cpm <- sweep(CellMixtures, 2, colSums(CellMixtures) / 1e6, FUN = "/") log2_cpm <- log2(cpm + 1)
Recommended usage
data(BlueCode) data(CellMixtures) out <- dicepro( reference = BlueCode, bulk = CellMixtures, methodDeconv = "FARDEEP", bulkName = "CellMixtures", refName = "BlueCode", hp_max_evals = 100, hspaceTechniqueChoose = "all" ) # Inspect estimated proportions head(out$H)
Relationship to BlueCode
CellMixtures and BlueCode are designed as a paired benchmark: the reference matrix is derived from the same cell populations used to construct the mixtures, ensuring that all cell types present in the bulk are represented in the reference. This controlled setting enables direct evaluation of the reconstruction accuracy of dicepro.
Source
CellMixtures was constructed for the DICEPro benchmarking framework. Sorted cell populations were obtained from ENCODE (https://www.encodeproject.org) and mixed in silico at predefined proportions from experimentally validated RNA-seq profiles. See the accompanying manuscript for full details on mixture design, library preparation, and sequencing.
Select optimal hyper-parameters using a Pareto frontier
Description
Identifies the best (\lambda, \gamma) pair from a set of
optimization trials by:
Filtering trials where
|1 - \text{constraint}|exceedsconstraint_threshold.Computing
abs_constraint = |1 - constraint|.Extracting the Pareto frontier (minimize
frobNormANDabs_constraint) viaKraljicMatrix::get_frontierwithquadrant = "bottom.left".Selecting the knee point via maximum perpendicular distance to the utopia-nadir line in normalized objective space.
Usage
best_hyperParams(trials_df, W, H, constraint_threshold = 0.1)
Arguments
trials_df |
data.frame of optimization trials as returned
by |
W |
List of |
H |
List of |
constraint_threshold |
Numeric scalar. Maximum allowed
|
Value
Named list, or invisible(NULL) when no valid configuration
survives filtering:
- hyperparameters
List with
lambdaandgamma.- metrics
List with
frobNorm,abs_constraint, andconstraintfor the selected trial.- trials
Filtered trials data.frame (constraint-passing rows only).
- W
The
Wmatrix of the selected trial.- H
The
Hmatrix of the selected trial.- plot
A
ggplot2Pareto frontier figure.
Check if a value contains NaN or Inf
Description
Check if a value contains NaN or Inf
Usage
contains_nan_or_inf(value)
Arguments
value |
Numeric, vector, matrix, or data frame to check. |
Value
Logical. TRUE if NaN or Inf is present, FALSE otherwise.
Visualize the gamma–lambda hyper-parameter search space
Description
Generates a ggplot2 scatter plot of \gamma vs \lambda
sampled from the search space defined by hspaceTechniqueChoose,
with feasibility bounds overlaid for "restrictionEspace".
Usage
create_gamma_lambda_plot(
hspaceTechniqueChoose = c("all", "restrictionEspace"),
n_samples = 200,
seed = NULL
)
Arguments
hspaceTechniqueChoose |
Character scalar.
|
n_samples |
Positive integer. Configurations to draw (default |
seed |
Integer. Random seed used for full pipeline reproducibility.
Defaults to |
Value
A ggplot2 figure object.
Examples
create_gamma_lambda_plot(hspaceTechniqueChoose = "all")
create_gamma_lambda_plot(hspaceTechniqueChoose = "restrictionEspace")
Semi-supervised bulk RNA-seq deconvolution with hyper-parameter optimization
Description
Combines supervised estimation of known cell types with unsupervised discovery of latent components, with automatic Pareto-frontier-based hyper-parameter optimization.
Usage
dicepro(
reference,
bulk,
methodDeconv = "CS",
cibersortx_email = NULL,
cibersortx_token = NULL,
cibersort_perm = 0,
cibersort_QN = TRUE,
W_prime = 0,
bulkName = "Bulk",
refName = "Reference",
hp_max_evals = 100,
N_unknownCT = 1L,
algo_select = "random",
output_path = NULL,
hspaceTechniqueChoose = "gamma_dominant",
gamma_ratio_min = 10,
constraint_threshold = 0.1,
out_Decon = NULL,
normalize = TRUE,
seed = NULL
)
Arguments
reference |
Numeric matrix (genes * cell types). |
bulk |
Numeric matrix (genes * samples). |
methodDeconv |
Character. One of |
cibersortx_email |
Character. CIBERSORTx email (required for
|
cibersortx_token |
Character. CIBERSORTx token (required for
|
cibersort_perm |
Non-negative integer. Number of permutations
for p-value estimation in the built-in |
cibersort_QN |
Logical. Apply quantile normalisation to the
bulk mixture in the built-in |
W_prime |
Initial unknown-signature matrix or |
bulkName |
Character scalar. Label for the bulk data-set. |
refName |
Character scalar. Label for the reference. |
hp_max_evals |
Positive integer. Number of hyper-parameter trials. |
N_unknownCT |
Positive integer. Number of unknown cell types. |
algo_select |
Character. One of |
output_path |
Character scalar. Root output directory. Defaults to tempdir(); pass an explicit directory to persist results across sessions. |
hspaceTechniqueChoose |
Character. One of
\item{\code{"all"}}{Full independent log-uniform grid for
\code{lambda_}, \code{gamma}, \code{p_prime}.}
\item{\code{"restrictionEspace"}}{Restricted space where
\code{lambda_ = gamma * lambda_factor},
\code{lambda_factor} \eqn{\in [2, 100]}.}
|
gamma_ratio_min |
Positive numeric. Minimum ratio
|
constraint_threshold |
Positive numeric. Maximum allowed deviation
from the constraint (|1 - constraint|) when filtering hyper-parameter
trials before Pareto selection (default
|
out_Decon |
Optional pre-computed deconvolution matrix (samples * cell types). When provided, the supervised deconvolution step is skipped entirely. |
normalize |
Logical. Apply z-score normalisation per
gene (default |
seed |
Integer. Random seed used for full pipeline reproducibility.
Defaults to |
Details
When out_Decon is provided, the supervised step is skipped.
Gene matrices are optionally z-score normalized per gene, and only
intersecting genes are retained before optimization.
The built-in "CIBERSORT" method closely follows Newman et al.
(2015): nu-SVR with three candidate nu values (0.25, 0.50, 0.75),
optional quantile normalisation, and optional permutation-based p-values.
It requires packages e1071, parallel, and
preprocessCore. No external account or Docker installation is
needed, unlike "CSx".
Value
An object of class "dicepro" (a named list) containing:
-
hyper-parameters– selected\lambdaand\gamma -
metrics– loss and constraint of the best trial -
trials– all evaluated configurations -
W– estimated unknown signature matrix -
H– estimated proportion matrix -
plot– Pareto frontier ggplot2 figure -
plot_hyperopt– hyper-parameter space ggplot2 figure
Returns invisible(NULL) with a warning when no valid configuration
is found.
References
Newman AM et al. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12(5), 453-457. doi:10.1038/nmeth.3337
See Also
running_method, run_experiment,
best_hyperParams
Examples
sim_data <- simulation(
loi = "gauss",
scenario = "hierarchical",
nSample = 5,
nGenes = 50,
nCellsType = 5,
sigma_bio = 0.07,
sigma_tech = 0.07,
seed = 2101
)
out <- dicepro(
reference = as.matrix(sim_data$W)[, -c(1)],
bulk = as.matrix(sim_data$B),
methodDeconv = "FARDEEP",
W_prime = 0,
bulkName = "SimBulk",
refName = "SimRef",
hp_max_evals = 5,
algo_select = "random",
output_path = tempdir(),
hspaceTechniqueChoose = "all",
normalize = FALSE
)
Full agreement metrics via mixed-effects modeling
Description
Fits a two-way mixed-effects model (populations and subjects as random
effects) and derives ICC(3,1), Concordance correlation coefficient, and relative RMSE, complemented by
sample-wise correlation and RMSE from samplewise_metrics.
Usage
full_metrics(x, y)
Arguments
x |
Numeric matrix of observed values (rows = subjects, columns = variables / populations). |
y |
Numeric matrix of predicted values; same dimensions as |
Value
Named list:
- Correlation_mean
Mean sample-wise Pearson correlation.
- RMSE_mean
Mean sample-wise RMSE.
- ICC3_adapted
ICC(3,1) from the mixed model (
NAfor constant data).- CCC_adapted
Concordance correlation coefficient (
NAfor constant data).- rRMSE_adapted
Relative RMSE
\sqrt{\sigma^2_\varepsilon / V_T}(NAfor constant data).
Generate Cell-Type Proportion Matrix
Description
Simulates cell-type proportion matrices for bulk RNA-seq deconvolution benchmarking. Four scenarios are supported: equal proportions across cell types, uniform random sampling, hierarchical Dirichlet sampling reflecting realistic tissue compartment organization, or a BlueCode-specific hierarchical model that preserves real cell-type names.
Usage
generateProp(
n_cell_types = NULL,
nSample,
nCell = 500,
scenario = NULL,
alpha_groups = c(Immune = 4, Stromal = 2.5, Endothelial = 1.8, Epithelial = 1.8, Muscle
= 1.5),
alpha_subtypes = list(Immune = 8, Stromal = 8, Endothelial = 8, Epithelial = 8, Muscle
= 8),
seed = NULL
)
Arguments
n_cell_types |
Integer. Number of distinct cell types. Ignored when
|
nSample |
Integer. Number of samples to generate. |
nCell |
Integer. Total number of cells per sample (used for rounding precision in the uniform scenario). |
scenario |
Character. Proportion generation strategy. One of:
|
alpha_groups |
Named numeric vector of length 5. Dirichlet
concentration parameters for the five tissue compartments
(Immune, Stromal, Endothelial, Epithelial, Muscle). Only used when
|
alpha_subtypes |
Named list with one scalar per compartment. Controls
the concentration of sub-type Dirichlet draws within each compartment.
Higher values produce more even sub-type distributions. Only used when
|
seed |
Integer. Random seed. Only used when |
Value
A numeric matrix of dimensions nSample x n_cell_types where
each row sums to 1. For all scenarios except "bluecode", row names
are Sample_1, ..., Sample_nSample and column names are
CellType_1, ..., CellType_n_cell_types. For
"bluecode", row names are sample_1, ...,
sample_nSample and column names are the real BlueCode cell-type
names ordered by compartment.
Examples
set.seed(2101)
# Equal proportions
prop_even <- generateProp(nSample = 20, n_cell_types = 10, scenario = "even")
# Generic hierarchical Dirichlet
prop_hier <- generateProp(nSample = 20, n_cell_types = 34,
scenario = "hierarchical")
all(abs(rowSums(prop_hier) - 1) < 1e-8) # TRUE
# BlueCode hierarchical Dirichlet (real cell-type names)
prop_bc <- generateProp(nSample = 20, scenario = "bluecode")
colnames(prop_bc) # real BlueCode cell-type names
Generate a Reference Signature Matrix
Description
Constructs a synthetic gene-by-cell-type reference matrix using either Poisson-based (via Gaussian copula) or log-normal gene expression models, with optional block sparsity and TPM normalization.
Usage
generate_ref_matrix(
loi = "rpois",
tpm = FALSE,
bloc = FALSE,
nGenesByCellType = 50,
nCell = 500,
nCellsType = 10,
nGenes = 500,
lambda_vec = NULL,
corr = NULL,
sparse = FALSE,
prob_sparse = NULL
)
Arguments
loi |
Character. Expression model: |
tpm |
Logical. If |
bloc |
Logical. If |
nGenesByCellType |
Integer. Number of genes per cell-type block
(used when |
nCell |
Integer. Total number of cells (unused directly; retained for API compatibility). |
nCellsType |
Integer. Number of cell types. Default: |
nGenes |
Integer. Number of genes. Default: |
lambda_vec |
Numeric vector of length |
corr |
Numeric matrix. Inter-cell-type correlation matrix.
If |
sparse |
Logical. If |
prob_sparse |
Numeric. Probability of a non-zero entry when
|
Details
For loi = "rpois", correlated Poisson counts are generated via a
Gaussian copula: multivariate normal draws with covariance corr are
mapped to uniform marginals via pnorm, then to Poisson quantiles via
qpois. This preserves the inter-cell-type correlation structure
without requiring SimMultiCorrData.
For any other value of loi, a log-normal model is used: multivariate
normal draws are exponentiated after a location shift of 6.
Value
A numeric matrix of dimensions nGenes x nCellsType.
Row names are Gene_1, ..., Gene_nGenes; column names are
CellType_1, ..., CellType_nCellsType.
Examples
set.seed(2101)
ref_pois <- generate_ref_matrix(loi = "rpois", nGenes = 50, nCellsType = 5)
ref_gauss <- generate_ref_matrix(loi = "gauss", nGenes = 50, nCellsType = 5)
dim(ref_pois) # 50 x 5
dim(ref_gauss) # 50 x 5
Heatmap of cell-type abundances
Description
Generates a raster heatmap showing the estimated abundance of each cell type across NMF iterations (or samples). Color encodes abundance using the viridis scale (reversed: dark = high abundance).
Usage
heatmap_abundances(res2plot, base_size = 9, title = NULL, midpoint = NULL)
Arguments
res2plot |
A
|
base_size |
Numeric. Base font size in points (default |
title |
Character. Plot title (default |
midpoint |
Numeric or |
Value
A ggplot object. It can be printed or saved using
ggplot2::ggsave().
Examples
df <- data.frame(
Iterate = 1:10,
CellTypeA = runif(10),
CellTypeB = runif(10)
)
heatmap_abundances(df)
heatmap_abundances(df, title = "Estimated abundances", midpoint = 0.5)
Build a performance metrics table for composition matrices
Description
Aligns predicted and observed matrices to their common populations, row-normalizes both, then computes global and sample-wise agreement metrics.
Usage
makeTable1Tool(pred_mat, obs_mat)
Arguments
pred_mat |
Numeric matrix of predicted compositions (rows = samples, columns = populations). |
obs_mat |
Numeric matrix of observed compositions; must share at
least one column name with |
Details
Populations absent from pred_mat but present in obs_mat
(and vice versa) are silently dropped after the intersection step.
Both matrices are row-normalized via row_norm_pos before
any metric is computed.
Value
A list with one element:
- Perf
A one-row
data.framecontaining:Correlation,RMSE,Correlation_mean,RMSE_mean,NRMSE,ICC3,CCC. All fields areNAwhen computation is not possible.
See Also
Performance metric line plot across iterations
Description
Plots the evolution of one or more scalar performance metrics (e.g.,
NRMSE, R^2, loss) over NMF iterations. Useful for monitoring
convergence and diagnosing early stopping.
Usage
metric_plot(
perf2plot,
ylab = "Error between folds",
title = NULL,
base_size = 9,
add_smooth = FALSE,
show_points = FALSE
)
Arguments
perf2plot |
A
Optionally a |
ylab |
Character. Y-axis label
(default |
title |
Character. Plot title (default |
base_size |
Numeric. Base font size in points (default |
add_smooth |
Logical. When |
show_points |
Logical. When |
Value
A ggplot object.
Examples
df <- data.frame(
Iterate = 1:50,
metric = cumsum(runif(50, -0.01, 0.1))
)
metric_plot(df)
metric_plot(df, ylab = "NRMSE", add_smooth = TRUE)
# Multi-group
df2 <- data.frame(
Iterate = rep(1:20, 2),
metric = c(cumsum(runif(20, 0, 0.1)), cumsum(runif(20, 0, 0.05))),
group = rep(c("Fold 1", "Fold 2"), each = 20)
)
metric_plot(df2)
NMF with L-BFGS-B optimization
Description
Performs non-negative matrix factorization (NMF) using L-BFGS-B optimization. The objective combines a Gaussian log-likelihood with an augmented Lagrangian penalty that enforces the sum-to-one constraint on cell-type proportions.
Usage
nmf_lbfgsb(
r_dataset,
W_prime = 0,
p_prime = 0,
lambda_ = 10,
gamma_par = 100,
N_unknownCT = 1L,
con = list(maxit = 3000)
)
Arguments
r_dataset |
Named list with three elements:
|
W_prime |
Numeric scalar or matrix. Initial value(s) for the
unknown cell-type column(s) in |
p_prime |
Numeric scalar or matrix. Initial value(s) for the
unknown cell-type column(s) in |
lambda_ |
Numeric scalar. Augmented Lagrangian multiplier
initialization (default |
gamma_par |
Numeric scalar. Penalty coefficient for the sum-to-one
constraint (default |
N_unknownCT |
Positive integer. Number of latent cell types to
estimate (default |
con |
Named list of control parameters forwarded to
|
Value
A named list on success:
- W
Optimized signature matrix (N_gene × N_cells_Type).
- frob_W
Frobenius norm of
W.- var_W
Variance of the unknown column of
W.- H
Optimized proportion data.frame (N_sample × N_cells_Type + 1), with a leading
Mixturecolumn.- frob_H
Frobenius norm of
H.- var_H
Variance of the unknown column of
H.- loss
Gaussian log-likelihood (objective terms 1 + 2).
- frobNorm
Objective term 1 (data-fit component).
- constNorm
Objective term 2 (log-normalization component).
- c1
Penalty term 1 (linear Lagrangian).
- c2
Penalty term 2 (quadratic penalty).
- objectiveValue
Total objective value (
loss).- penalty
Total penalty (
c1 + c2).- cvrge
Convergence code from
lbfgsb3c.- constraint
Absolute constraint violation:
|1 - \sum h_i|.
Returns NULL when the optimizer fails or the solution is degenerate.
NMF L-BFGS-B wrapper for hyper-parameter optimization
Description
Thin wrapper around nmf_lbfgsb that normalizes the
p_prime argument and cleans the returned matrices.
Usage
nmf_lbfgsb_hyperOpt(
dataset,
W_prime = NULL,
p_prime = NULL,
lambda_ = 10,
gamma_par = 100
)
Arguments
dataset |
List containing matrices |
W_prime |
Optional numeric matrix. Initial |
p_prime |
Optional numeric matrix or scalar. Initial |
lambda_ |
Numeric. Regularization parameter |
gamma_par |
Numeric. Regularization parameter |
Value
List output from nmf_lbfgsb with H and
W cleaned by .clean_nmf_matrix.
Normalized Root Mean Square Error (NRMSE)
Description
Computes RMSE between back-transformed pred and obs vectors,
then normalizes by a summary statistic of the observations.
Usage
nrmse(
pred,
obs,
method = "sd",
transformation = "none",
trans_function = "none"
)
Arguments
pred |
Numeric vector of predicted values (on the transformed scale). |
obs |
Numeric vector of observed values (same scale as
|
method |
Normalization denominator: |
transformation |
Back-transformation applied before computing RMSE.
One of |
trans_function |
R expression string used when
|
Value
Non-negative numeric scalar, or NA_real_ (with a message)
when NRMSE is undefined (fewer than 2 complete observations, or zero
denominator).
Examples
set.seed(1)
obs <- rnorm(100)
pred <- obs + rnorm(100, sd = 0.1)
nrmse(pred, obs)
nrmse(pred, obs, method = "maxmin")
Objective function for hyper-parameter optimization
Description
Runs one NMF trial for a given (\lambda, \gamma, p') configuration
and returns a structured result list, or NULL when the trial
produces invalid values.
Usage
objective_opt(
dataset,
config = list(),
lambda_ = NULL,
gamma_factor = NULL,
gamma = NULL,
p_prime = NULL,
W_prime = 0
)
Arguments
dataset |
List with matrices |
config |
List. Configuration object (needs |
lambda_ |
Numeric. Regularization parameter |
gamma_factor |
Numeric or |
gamma |
Numeric. |
p_prime |
Numeric matrix ( |
W_prime |
Numeric matrix or scalar. Initial |
Value
A named list with elements loss, constraint,
status, current_params, W, H, cvrge;
or NULL when the trial is invalid.
Plot cell abundance heatmap and error plot
Description
Combines a cell-abundance heatmap with a fold-error plot using
patchwork. Requires at least two unique iterations in
x$Matrix_prediction.
Usage
## S3 method for class 'dicepro'
plot(x, ...)
Arguments
x |
A |
... |
Additional arguments (currently unused; reserved for future use). |
Value
A patchwork figure, or invisible(NULL) with a warning
when only one iteration is present.
See Also
heatmap_abundances, metric_plot
Plot hyperparameter optimization report
Description
Generic function for plotting the hyperparameter search report stored in a
dicepro object. Dispatches to plot_hyperopt.dicepro.
Builds a scatter-matrix of all evaluated (\lambda, \gamma, p')
configurations, color-coded by loss value, with violin/bar marginals for
the top 5\
Usage
plot_hyperopt(x, ...)
## S3 method for class 'dicepro'
plot_hyperopt(
x,
params,
metric = "loss",
loss_metric = "loss",
loss_behaviour = "min",
not_log = NULL,
categorical = NULL,
max_deviation = NULL,
title = NULL,
...
)
Arguments
x |
A |
... |
Currently unused. Reserved for future extensions. |
params |
Character vector of hyperparameter column names to
display (e.g. |
metric |
Character scalar. Column used for point size
(default |
loss_metric |
Character scalar. Column used as the loss axis
(default |
loss_behaviour |
Character scalar. Direction of the loss:
|
not_log |
Character vector of parameter names that should
not be log-scaled on their axis (default |
categorical |
Character vector of categorical parameter names;
these are displayed as bar charts in the marginal row
(default |
max_deviation |
Numeric scalar. Trials with
|
title |
Character scalar. Optional title displayed above the
combined figure (default |
Value
Whatever the dispatched method returns (a patchwork figure
for dicepro objects).
A patchwork object which can be printed, saved with
ggplot2::ggsave(), or embedded in R Markdown documents.
See Also
plot_hyperopt.dicepro
Examples
sim_data <- simulation(
loi = "gauss",
scenario = "hierarchical",
nSample = 5,
nGenes = 150,
nCellsType = 10,
sigma_bio = 0.07,
sigma_tech = 0.07,
seed = 2101
)
out <- dicepro(
reference = as.matrix(sim_data$W)[, -c(1, 5, 10)],
bulk = as.matrix(sim_data$B),
methodDeconv = "FARDEEP",
W_prime = 0,
bulkName = "SimBulk",
refName = "SimRef",
hp_max_evals = 10,
algo_select = "random",
output_path = tempdir(),
hspaceTechniqueChoose = "all",
normalize = FALSE
)
plot_hyperopt(out, params = c("lambda_", "gamma", "p_prime"))
Hyper-parameter optimization loop for dicepro
Description
Runs hp_max_evals NMF trials by sampling from hp_space,
using either random or TPE sampling, and collects results.
Usage
research_hyperOpt(
objective_opt,
dataset,
config,
hp_space = NULL,
W_prime = NULL,
seed = NULL
)
Arguments
objective_opt |
Function passed verbatim to
|
dataset |
List with |
config |
List. Configuration object; must contain
|
hp_space |
prebuilt named list of parsed hyper-parameter specs.
If |
W_prime |
Optional initial |
seed |
Integer. Random seed used for full pipeline reproducibility. |
Value
A named list:
- trials
data.frame of all successful trial results.
- W
List of
Wmatrices, one per successful trial.- H
List of
Hmatrices, one per successful trial.
Row-normalize a matrix with non-negative clamping
Description
Clamps negative values to 0, then divides each row by its sum so that
rows sum to 1. Rows whose sum equals 0 become all-NaN (degenerate
samples are flagged rather than silently zeroed).
Usage
row_norm_pos(mat)
Arguments
mat |
Numeric matrix. |
Value
Numeric matrix of the same dimensions as mat, with
each row summing to 1 (or all-NaN when the row sum is 0).
Examples
m <- matrix(c(-1, 2, 3, 0, 0, 0, 1, 1, 1), nrow = 3, byrow = TRUE)
row_norm_pos(m)
Run CIBERSORTx Deconvolution Method
Description
This function runs the CIBERSORTx deconvolution method using Docker with the provided email and token. The method is used to estimate cell type proportions in bulk RNA-seq data.
Usage
run_CSx(
bulk,
reference,
cibersortx_email,
cibersortx_token,
work_dir = tempdir()
)
Arguments
bulk |
A matrix of bulk RNA-seq data. |
reference |
A matrix of reference cell type gene expression profiles. |
cibersortx_email |
The CIBERSORTx account email. |
cibersortx_token |
The CIBERSORTx account token. |
work_dir |
Character. Path to the working directory where CIBERSORTx
output files are written. Defaults to |
Details
The function performs the following steps:
Runs CIBERSORTx with the provided email and token.
Reads the output results from CIBERSORTx.
Extracts cell type proportions and returns them as a matrix.
Value
A matrix containing the estimated cell type proportions.
Run a dicepro hyperparameter optimization experiment
Description
Builds the hyperparameter search space from hspaceTechniqueChoose
and runs research_hyperOpt(), returning the collected trials.
Usage
run_experiment(
dataset,
W_prime = 0,
bulkName,
refName,
hp_max_evals,
algo_select,
output_base_dir = tempdir(),
hspaceTechniqueChoose,
gamma_ratio_min = 10,
seed
)
Arguments
dataset |
List containing at least |
W_prime |
Numeric matrix (or |
bulkName |
Character scalar. Identifier for the bulk dataset (used in output path construction). |
refName |
Character scalar. Identifier for the reference dataset (used in output path construction). |
hp_max_evals |
Positive integer. Number of hyperparameter trials to run. |
algo_select |
Character scalar. Sampling algorithm passed to
the configuration (e.g. |
output_base_dir |
Character scalar. Root directory for all outputs
(default |
hspaceTechniqueChoose |
Character scalar. Search-space strategy:
|
gamma_ratio_min |
Positive numeric. Minimum ratio
|
seed |
Integer. Random seed used for full pipeline reproducibility. |
Value
The list returned by research_hyperOpt:
trials, W, and H.
Run cell-type deconvolution
Description
Performs cell-type deconvolution on bulk RNA-seq data using one of four supported methods. All methods receive gene-intersected, numeric matrices and return a proportion matrix normalised to sum to 1 per sample (columns).
Usage
running_method(
bulk,
reference,
methodDeconv = "CS",
cibersortx_email = NULL,
cibersortx_token = NULL,
cibersort_perm = 0,
cibersort_QN = TRUE
)
Arguments
bulk |
Numeric matrix (genes x samples). Bulk RNA-seq expression data. Row names must be gene symbols. |
reference |
Numeric matrix (genes x cell types). Reference cell-type signature profiles. Row names must be gene symbols. |
methodDeconv |
Character scalar. Deconvolution method:
|
cibersortx_email |
Character. CIBERSORTx account email (required
when |
cibersortx_token |
Character. CIBERSORTx account token (required
when |
cibersort_perm |
Non-negative integer. Number of permutations for
CIBERSORT p-values (default |
cibersort_QN |
Logical. Apply quantile normalisation in
CIBERSORT (default |
Details
Gene intersection is performed automatically: only genes present in both
bulk and reference are used. Both matrices are coerced to
numeric before processing.
For "CS", the built-in implementation closely follows the
original Newman et al. (2015) algorithm: nu-SVR with three candidate nu
values (0.25, 0.50, 0.75), optional quantile normalisation, and optional
permutation-based p-values. Packages e1071, parallel, and
preprocessCore must be installed.
Value
Numeric matrix (cell types x samples) of estimated cell-type proportions. Each column sums to 1.
References
Newman AM et al. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12(5), 453-457. doi:10.1038/nmeth.3337
See Also
Sample-wise Pearson correlation and RMSE
Description
For each row (sample) of two matrices, computes the Pearson correlation and RMSE between predicted and observed values, then returns their cross-sample means.
Usage
samplewise_metrics(obs_mat, pred_mat)
Arguments
obs_mat |
Numeric matrix of observations (rows = samples, columns = variables). |
pred_mat |
Numeric matrix of predictions; must have the same
dimensions as |
Details
Rows where either obs_mat or pred_mat are constant
(zero standard deviation) yield NA for correlation but still
contribute to the RMSE mean.
Value
Named list with two elements:
- Correlation_mean
Mean Pearson correlation across samples (ignoring
NArows).- RMSE_mean
Mean RMSE across samples.
See Also
Simulate Bulk RNA-seq Data with Biological and Technical Noise
Description
Generates synthetic bulk RNA-seq expression matrices by linearly mixing reference cell-type profiles according to simulated proportions, with optional biological and technical noise. The noise model follows the Gaussian likelihood framework underlying the dicepro objective function: biological noise is multiplicative, whereas technical noise is additive. Consequently, normalization to a Gaussian scale prior to deconvolution is required.
Usage
simulation(
W = NULL,
prop = NULL,
nSample = 50,
nCell = 500,
nCellsType = 50,
nGenes = 500,
lambda_vec = NULL,
corr = NULL,
scenario = NULL,
loi = "rpois",
tpm = FALSE,
bloc = FALSE,
nGenesByCellType = 50,
sparse = FALSE,
prob_sparse = 0.5,
sigma_bio = 0.07,
sigma_tech = 0.07,
seed = NULL
)
Arguments
W |
Numeric matrix (genes x cell types) or |
prop |
Numeric matrix (samples x cell types) or |
nSample |
Integer. Number of samples. Default: |
nCell |
Integer. Total cells per sample. Default: |
nCellsType |
Integer. Number of cell types. Default: |
nGenes |
Integer. Number of genes. Default: |
lambda_vec |
Numeric vector of length |
corr |
Numeric matrix. Inter-cell-type correlation.
If |
scenario |
Character. Proportion scenario passed to
|
loi |
Character. Expression law for reference generation.
|
tpm |
Logical. TPM normalization of reference.
Default: |
bloc |
Logical. Block sparsity in reference.
Default: |
nGenesByCellType |
Integer. Genes per block (when |
sparse |
Logical. Random sparsity in reference.
Default: |
prob_sparse |
Numeric. Sparsity probability. Default: |
sigma_bio |
Numeric. Standard deviation of multiplicative
biological noise. Default: |
sigma_tech |
Numeric. Standard deviation of additive technical
noise (proportional to expression level). Default: |
seed |
Integer. Random seed. Default: |
Details
The bulk expression matrix is generated as:
\mathbf{B} = \mathbf{W} \cdot \mathbf{P}^\top
Noise is then applied as:
B^*_{gn} = B_{gn}(1 + \varepsilon^{\text{bio}}_{gn}) +
\varepsilon^{\text{tech}}_{gn} \cdot B_{gn}
where \varepsilon^{\text{bio}} \sim \mathcal{N}(0, \sigma^2_{\text{bio}})
and \varepsilon^{\text{tech}} \sim \mathcal{N}(0, \sigma^2_{\text{tech}}).
Negative values are clipped to zero.
Value
A named list with three elements:
pdata.frame. True cell-type proportions (samples x cell types).
Wdata.frame. Reference signature matrix (genes x cell types).
Bdata.frame. Noisy bulk expression matrix (genes x samples).
Examples
set.seed(2026)
sim <- simulation(
scenario = "hierarchical",
nSample = 20,
nGenes = 100,
nCellsType = 10,
sigma_bio = 0.07,
sigma_tech = 0.07,
seed = 2101
)
dim(sim$p) # 20 x 10
dim(sim$B) # 100 x 20
Simulate Bulk RNA-seq Data Using the BlueCode Reference Matrix
Description
Generates synthetic bulk RNA-seq expression matrices by mixing the bundled BlueCode reference signature matrix with hierarchically simulated cell-type proportions, then applying biological and technical noise.
Usage
simulation_bluecode(
nSample = 50,
alpha_groups = c(Immune = 4, Stromal = 2.5, Endothelial = 1.8, Epithelial = 1.8, Muscle
= 1.5),
alpha_subtypes = list(Immune = 8, Stromal = 8, Endothelial = 8, Epithelial = 8, Muscle
= 8),
sigma_bio = 0.15,
sigma_tech = 0.02,
seed = NULL
)
Arguments
nSample |
Integer. Number of samples to simulate.
Default: |
alpha_groups |
Named numeric vector of length 5. Dirichlet
concentration parameters for the five tissue compartments
(Immune, Stromal, Endothelial, Epithelial, Muscle). Larger values
produce more even compartment proportions.
Default: |
alpha_subtypes |
Named list with one numeric scalar per compartment.
Controls the concentration of sub-type Dirichlet draws within each
compartment. Higher values produce more even sub-type distributions.
Default: all compartments set to |
sigma_bio |
Numeric. Standard deviation of multiplicative
biological noise (log-normal). Default: |
sigma_tech |
Numeric. Standard deviation of additive technical
noise, expressed as a fraction of the global expression standard
deviation. Default: |
seed |
Integer. Random seed for full reproducibility.
Proportion sampling uses |
Details
The proportion model follows a two-level Dirichlet hierarchy that mirrors the biological organization of human tissues:
-
Compartment level – five major tissue compartments (Immune, Stromal, Endothelial, Epithelial, Muscle) are drawn from
\text{Dirichlet}(\alpha_{\text{groups}}). -
Cell-type level – within each compartment, sub-type proportions are drawn from a symmetric Dirichlet controlled by
alpha_subtypesand scaled by the compartment proportion.
This two-level structure induces realistic positive within-compartment and negative between-compartment correlations between cell types.
The bulk matrix is computed as B = W \cdot P^\top, then noise is
applied as:
B^*_{gn} = B_{gn} \cdot \exp(\varepsilon^{\text{bio}}_{gn}) +
\varepsilon^{\text{tech}}_{gn}
where
\varepsilon^{\text{bio}} \sim \mathcal{N}(0, \sigma^2_{\text{bio}})
and
\varepsilon^{\text{tech}} \sim
\mathcal{N}(0,\, \sigma^2_{\text{tech}} \cdot \mathrm{sd}(B)).
Negative values are clipped to zero.
The function validates at runtime that the cell-type names in
.bluecode_cell_groups exactly match colnames(BlueCode),
ensuring consistency between the proportion model and the reference matrix.
Value
A named list with three elements:
pdata.frame (samples x 34 cell types). True cell-type proportions with real BlueCode cell-type names. Each row sums to 1.
Wdata.frame (genes x 34 cell types). BlueCode reference signature matrix, column-ordered to match
p.Bdata.frame (genes x samples). Noisy simulated bulk expression matrix.
Examples
sim <- simulation_bluecode(
nSample = 30,
sigma_bio = 0.15,
sigma_tech = 0.02
)
dim(sim$p)
dim(sim$W)
dim(sim$B)
all(abs(rowSums(sim$p) - 1) < 1e-8)