Analyses can slow to a crawl when models need hours to run. In this
article you will find a few tricks to prevent this bottleneck when using
orsf().
controlThe default control for orsf() is
NULL because, if unspecified, orsf() will pick
the fastest possible control for you depending on the type
of forest being grown. The default control run-time
compared to other approaches can be striking. For example:
time_fast <- system.time(
expr = orsf(pbc_orsf,
formula = time+status~. -id,
n_tree = 5)
)
time_net <- system.time(
expr = orsf(pbc_orsf,
formula = time+status~. -id,
control = orsf_control_survival(method = 'net'),
n_tree = 5)
)
# unspecified control is much faster
time_net['elapsed'] / time_fast['elapsed']
#> elapsed
#> 102n_threadThe n_thread argument uses multi-threading to run
aorsf functions in parallel when possible. If you know how
many threads you want, e.g. you want exactly 5, set
n_thread = 5. If you aren’t sure how many threads you have
available but want to use a feasible amount, using
n_thread = 0 (the default) tells aorsf to do
that for you.
# automatically pick number of threads based on amount available
orsf(pbc_orsf,
formula = time+status~. -id,
n_tree = 5,
n_thread = 0)Note: sometimes multi-threading is not possible. For example, because
R is a single threaded language, multi-threading cannot be applied when
orsf() needs to call R functions from C++, which occurs
when a customized R function is used to find linear combination of
variables or compute prediction accuracy.
There are some inputs in orsf() that can be adjusted to
make it run faster:
set n_retry to 0
set oobag_pred_type to 'none'
set importance to 'none'
increase split_min_events,
split_min_obs, leaf_min_events, or
leaf_min_obs to make trees stop growing sooner
increase split_min_stat to enforce more strict
requirements for growing deeper trees.
Applying these tips:
orsf(pbc_orsf,
formula = time+status~.,
n_thread = 0,
n_tree = 5,
n_retry = 0,
oobag_pred_type = 'none',
importance = 'none',
split_min_events = 20,
leaf_min_events = 10,
split_min_stat = 10)While modifying these inputs can make orsf() run faster,
they can also impact prediction accuracy.
Setting verbose_progress = TRUE doesn’t make anything
run faster, but it can help make it feel like things are
running less slow.
Instead of running a model and hoping it will be fast, you can
estimate how long a specification of that model will take by using
no_fit = TRUE in the call to orsf().
fit_spec <- orsf(pbc_orsf,
formula = time+status~. -id,
control = orsf_control_survival(method = 'net'),
n_tree = 2000,
no_fit = TRUE)
# how much time it takes to estimate training time:
system.time(
time_est <- orsf_time_to_train(fit_spec, n_tree_subset = 5)
)
#> user system elapsed
#> 0.3 0.0 0.3
# the estimated training time:
time_est
#> Time difference of 120.8236 secs