---
title: "Validation Study"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Validation Study}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4.5
)
```

## Scope

This vignette documents the shipped release-validation study for
`SelectBoost.quantile`. The goal is not to claim universal superiority, but to
show how the current prototype behaves relative to two direct baselines:

- plain quantile lasso (`lasso`)
- cross-validated quantile lasso with a 1-SE penalty rule (`lasso_tuned`)

The method under study is `selectboost_quantile()` with tau-aware screening,
stronger tuning, complementary-pairs stability selection, capped neighborhoods,
and a hybrid support score.

The included benchmark artifacts were generated with:

- scenarios from `default_quantile_benchmark_scenarios()`
- `tau = c(0.25, 0.5, 0.75)`
- 4 Monte Carlo replications per scenario
- `selectboost_quantile(..., B = 8, step_num = 0.5, screen = "auto",
  tune_lambda = "cv", lambda_rule = "one_se", lambda_inflation = 1.25,
  complementary_pairs = TRUE, max_group_size = 15, nlambda = 8)`
- stable support extracted with the hybrid summary score at `threshold = 0.55`
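
In code, a single fit with these settings would look roughly like the sketch
below; `x` and `y` are placeholders for one scenario's design matrix and
response, and the positional `x`/`y` interface is an assumption inferred from
the parameter list above.

```{r, eval = FALSE}
# Illustrative sketch only: x and y stand in for one simulated scenario's
# design matrix and response; the positional x/y interface is an assumption.
fit <- selectboost_quantile(
  x, y,
  tau = 0.5,
  B = 8,
  step_num = 0.5,
  screen = "auto",
  tune_lambda = "cv",
  lambda_rule = "one_se",
  lambda_inflation = 1.25,
  complementary_pairs = TRUE,
  max_group_size = 15,
  nlambda = 8
)
```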

```{r}
summary_path <- system.file(
  "extdata",
  "validation",
  "quantile_benchmark_release_summary.csv",
  package = "SelectBoost.quantile"
)
raw_path <- system.file(
  "extdata",
  "validation",
  "quantile_benchmark_release_raw.csv",
  package = "SelectBoost.quantile"
)

resolve_validation_path <- function(installed_path, filename) {
  # Prefer the installed copy of the artifact when the package is installed.
  if (nzchar(installed_path) && file.exists(installed_path)) {
    return(installed_path)
  }

  # Fall back to the source tree, whether the vignette is built from the
  # package root or from within the vignettes/ directory.
  candidates <- c(
    file.path("inst", "extdata", "validation", filename),
    file.path("..", "inst", "extdata", "validation", filename)
  )
  candidates <- candidates[file.exists(candidates)]
  if (!length(candidates)) {
    stop("Could not locate shipped validation artifact: ", filename, call. = FALSE)
  }
  candidates[[1]]
}

summary_path <- resolve_validation_path(summary_path, "quantile_benchmark_release_summary.csv")
raw_path <- resolve_validation_path(raw_path, "quantile_benchmark_release_raw.csv")

validation_summary <- utils::read.csv(summary_path, stringsAsFactors = FALSE)
validation_raw <- utils::read.csv(raw_path, stringsAsFactors = FALSE)

validation_summary$family <- sub("_tau_.*$", "", validation_summary$scenario)
validation_summary$is_high_dim <- grepl("^high_dim", validation_summary$scenario)
# F1 = 2 * TP / (2 * TP + FP + FN); undefined when all three counts are zero.
validation_summary$mean_f1 <- with(
  validation_summary,
  ifelse(
    (2 * mean_tp + mean_fp + mean_fn) > 0,
    2 * mean_tp / (2 * mean_tp + mean_fp + mean_fn),
    NA_real_
  )
)
```
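
The derived `mean_f1` column applies the standard identity
F1 = 2 * TP / (2 * TP + FP + FN) to the averaged confusion counts, leaving
`NA` when all three counts are zero. A quick sanity check with toy counts:

```{r}
# Toy confusion counts: 3 true positives, 1 false positive, 2 false negatives.
tp <- 3; fp <- 1; fn <- 2
2 * tp / (2 * tp + fp + fn)  # 6 / 9, about 0.667
```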

## Overall summary

The first table averages the scenario-level summaries across the full shipped
grid, including the `n < p` stress regime.

```{r}
overall <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1, failure_rate, mean_runtime_sec) ~ method,
  data = validation_summary,
  FUN = mean
)

knitr::kable(overall, digits = 3)
```

Across the full grid, tuned lasso has the highest average true-positive rate,
but it also carries the highest average false-discovery rate. The current
`selectboost_quantile()` release is markedly more conservative: it gives up
some recall, but in exchange it sharply lowers the false-discovery rate across
the shipped benchmark grid and yields the best average F1 score.

## Correlated but not high-dimensional regimes

The `high_dim` scenario is intentionally hard and changes the picture
substantially. Excluding that regime gives a cleaner view of the correlated and
misspecified-noise settings that the current prototype handles more naturally.

```{r}
stable_regimes <- subset(validation_summary, !is_high_dim)

stable_overall <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1, failure_rate, mean_runtime_sec) ~ method,
  data = stable_regimes,
  FUN = mean
)

knitr::kable(stable_overall, digits = 3)
```

On these non-high-dimensional settings, the shipped study shows a consistent
pattern:

- `lasso_tuned` has the highest mean recall
- `selectboost_quantile()` has the lowest mean false-discovery rate by a large margin
- `selectboost_quantile()` also has the highest mean F1 score on the shipped grid
- `selectboost_quantile()` remains slower than either lasso baseline, which is
  expected because it perturbs, subsamples, and refits repeatedly

The family-level breakdown is below.

```{r}
family_summary <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1) ~ family + method,
  data = stable_regimes,
  FUN = mean
)

knitr::kable(family_summary, digits = 3)
```

```{r}
# Scenario-level FDR / F1 tradeoff, colored by method.
plot_df <- stable_regimes
method_levels <- c("lasso", "lasso_tuned", "selectboost")
cols <- c("lasso" = "#4C78A8", "lasso_tuned" = "#F58518", "selectboost" = "#54A24B")
plot(
  plot_df$mean_fdr,
  plot_df$mean_f1,
  col = cols[plot_df$method],
  pch = 19,
  xlab = "Mean FDR",
  ylab = "Mean F1",
  main = "Validation Summary by Scenario"
)
legend(
  "bottomleft",
  legend = method_levels,
  col = cols[method_levels],
  pch = 19,
  bty = "n"
)
```

## High-dimensional stress regime

The `high_dim` family remains difficult, but it is no longer a failure mode in
the earlier sense of selecting almost everything. The improved SelectBoost
workflow now returns much sparser and more stable supports than either lasso
baseline.

```{r}
high_dim <- subset(validation_summary, is_high_dim)

high_dim_overall <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1, failure_rate, mean_support_size) ~ method,
  data = high_dim,
  FUN = mean
)

knitr::kable(high_dim_overall, digits = 3)
```

The main remaining tradeoff is recall: `selectboost_quantile()` is much cleaner
than the lasso baselines in `high_dim`, but it is still more conservative and
can miss weaker signals. Even so, on the shipped study it achieves the best
mean F1 score in that regime because it avoids the large false-positive burden
of the lasso baselines. This is the main reason the package is best described
as a polished `v2` prototype rather than a finished methodological endpoint.

```{r}
failure_rows <- subset(validation_summary, failure_rate > 0)
if (nrow(failure_rows) > 0) {
  knitr::kable(failure_rows[, c(
    "scenario",
    "method",
    "failure_rate",
    "mean_tpr",
    "mean_fdr",
    "mean_support_size"
  )], digits = 3)
} else {
  cat("No method failures were recorded in the shipped study.\n")
}
```

## Reproducing the study

From a source checkout, regenerate benchmark artifacts into a temporary
directory with:

```{r, eval = FALSE}
out_dir <- file.path(tempdir(), "SelectBoost.quantile-validation")
system2(
  "Rscript",
  c("inst/scripts/run_quantile_benchmark.R", out_dir, "4", "0.55")
)
```

The script loads the local package automatically when run from a source tree.
It writes raw results, aggregated summaries, and a `sessionInfo()` record to
the chosen output directory; if no output directory is supplied, it defaults
to a subdirectory of `tempdir()`. In the current source tree, the rerun uses
the tau-aware screening, stronger lambda tuning, complementary-pairs
stability, neighborhood-cap, and hybrid-support defaults defined in the
package benchmark helper.
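
Assuming the rerun writes files under the same names as the shipped artifacts,
the regenerated summary can then be inspected directly:

```{r, eval = FALSE}
# File name assumed to match the shipped artifact; adjust if the script
# writes under a different name.
rerun_summary <- utils::read.csv(
  file.path(out_dir, "quantile_benchmark_release_summary.csv"),
  stringsAsFactors = FALSE
)
head(rerun_summary)
```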
