---
title: "rtForecastEval guide"
subtitle: "EVALuating Real-Time Probabilistic Forecast"
author: |
  | Chi-Kuang Yeh (Georgia State University)
  | Gregory Rice, Joel A. Dubin (University of Waterloo)
date: "`r Sys.Date()`"
output:
  rmarkdown::html_document:
    toc: yes
    toc_float: true
    theme: readable
    highlight: tango
  rmarkdown::html_vignette:
    toc: yes
vignette: >
  %\VignetteIndexEntry{rtForecastEval guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
.old_opts <- options(digits = 3)
```

## Mind map and workflow

**rtForecastEval** compares two real-time forecasters on a common time grid. The schematic below matches the analysis order in this vignette: prepare a long tibble, run the **global delta test** (`calc_Z` → `calc_eig` → `calc_pval`), and optionally summarize the **pointwise** loss difference with `calc_L_s2` and `plot_pcb`.
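
In the notation used below (with $L(x, y) = (x - y)^2$ the Brier loss, $N_{\mathrm{game}}$ games, and an evaluation grid $t_1, \dots, t_{n_{\mathrm{samp}}}$), the global statistic aggregates the per-grid mean loss differences, exactly as in the explicit construction of the next section:

$$
\hat\Delta(t_i) = \frac{1}{N_{\mathrm{game}}} \sum_{g=1}^{N_{\mathrm{game}}} \Big\{ L\big(\hat p_{A,g}(t_i), Y_g\big) - L\big(\hat p_{B,g}(t_i), Y_g\big) \Big\},
\qquad
Z = \frac{N_{\mathrm{game}}}{n_{\mathrm{samp}}} \sum_{i=1}^{n_{\mathrm{samp}}} \hat\Delta(t_i)^2 .
$$

Under $H_0$ (no difference in expected loss between the two forecasters), $Z$ is compared with Monte Carlo draws of $\sum_{d=1}^{D} \hat\lambda_d\, \chi^2_{1,d}$, where $\hat\lambda_1, \dots, \hat\lambda_D$ are the leading eigenvalues returned by `calc_eig`.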

```{r workflow-schematic, echo=FALSE, warning = FALSE, message= FALSE, fig.width=8, fig.height=4, fig.cap="Typical workflow: global test (top) and pointwise loss plot (bottom)."}
library(ggplot2)

box <- function(x, y, w, h, label) {
  data.frame(
    xmin = x - w / 2, xmax = x + w / 2,
    ymin = y - h / 2, ymax = y + h / 2,
    label = label, x = x, y = y
  )
}

top <- rbind(
  box(1.2, 5, 1.1, 0.55, "Data\n(df_gen or yours)"),
  box(3.1, 5, 1.0, 0.55, "Long format +\ncentered diffs"),
  box(4.9, 5.45, 0.85, 0.45, "calc_Z"),
  box(4.9, 4.55, 0.85, 0.45, "calc_eig"),
  box(6.8, 5, 0.95, 0.55, "calc_pval")
)
bot <- rbind(
  box(3.1, 2.6, 1.0, 0.55, "calc_L_s2"),
  box(5.0, 2.6, 0.85, 0.55, "plot_pcb")
)
rects <- rbind(top, bot)

seg <- data.frame(
  x = c(1.75, 2.65, 4.45, 4.45, 5.35, 5.35, 3.1, 3.65),
  y = c(5, 5, 5.45, 4.55, 5.45, 4.55, 4.45, 2.6),
  xend = c(2.6, 3.6, 5.35, 5.35, 6.35, 6.35, 3.1, 4.55),
  yend = c(5, 5, 5.45, 4.55, 5, 5, 3.15, 2.6)
)

ggplot() +
  geom_rect(
    data = rects,
    aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
    fill = "grey96", color = "grey35", size = 0.35
  ) +
  geom_text(data = rects, aes(x = x, y = y, label = label), size = 3) +
  geom_segment(
    data = seg,
    aes(x = x, y = y, xend = xend, yend = yend),
    arrow = grid::arrow(length = grid::unit(0.12, "cm"), type = "closed"),
    linewidth = 0.35,
    color = "grey25"
  ) +
  annotate("text", x = 4.9, y = 5.95, label = "Global delta test (chi-square approx.)", size = 3.2, fontface = "italic") +
  annotate("text", x = 4.0, y = 1.85, label = "Pointwise mean loss + naive band", size = 3.2, fontface = "italic") +
  coord_cartesian(xlim = c(0.3, 7.5), ylim = c(1.5, 6.2), clip = "off") +
  theme_void() +
  theme(plot.margin = grid::unit(c(12, 8, 8, 8), "pt"))
```


## Figures in the paper and where they are produced

The [paper](https://doi.org/10.1080/00031305.2021.1967781) (preprint [arXiv:2010.00781](https://arxiv.org/abs/2010.00781)) has two kinds of graphics:

1. **Methodological figures** (the global delta test and pointwise loss summaries), which this vignette reproduces on simulated data with the exported functions.
2. **NBA application figures**: calibration surfaces, reliability diagrams, and model comparisons, together with the paper's full simulation-study figures.

The figures in (2) are **not** self-contained in **rtForecastEval** because they require the scraped NBA play-by-play inputs, the `load_nba_data()` / `pre_process()` pipeline, and several bespoke plotting scripts. Reproduce them from the separate replication repository (**RTPForNBA**; historically bundled with the paper’s supplementary code), not from this package alone:

| Idea in the paper | Typical script (replication repo) | Depends on |
|-------------------|-----------------------------------|------------|
| Calibration **surface** (3D: time × forecast × centered residual) | `plotting/surfacePlot.R` | `nba_FD.R` (fits logit, builds `my_df`), **plot3D**, **akima**, **lattice**, binned CIs |
| Reliability-style **plot** at one time (e.g. mid-game) | `plotting/calibrationPlot.R` (see also end of `surfacePlot.R`) | Same prepared `df_try` object |
| Skill curves / $\hat\Delta_n(t)$ for ESPN vs models (PgRS, LTW, etc.) | `plotting/1_PgRS.R`, `2_LTW.R`, `3_CF.R`, … | `utility.R` (`calc_L_s2`, `L_smoothing`), `nba_FD.R` |
| **Simulation** figures (oracle, score difference, etc.) | `simulation/PF-sim/plot_sim_*.R` | `simulation/PF-sim/simulator.R` and generated outputs |

**Suggested order in the replication repo:** prepare data under `data/`, run `nba_FD.R` (or the season scripts it sources), then `source()` the relevant file under `plotting/` with working directory set to that project root.
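
A minimal sketch of that order (not run here); the path is a placeholder and the script names are those listed in the table above:

```{r replication-order-sketch, eval = FALSE}
## Sketch only: run inside the RTPForNBA replication repo, not this package
setwd("~/RTPForNBA")              # placeholder path to the replication-repo root
source("nba_FD.R")                # fits the logit model and builds `my_df`
source("plotting/surfacePlot.R")  # calibration surface (needs plot3D, akima, lattice)
source("plotting/1_PgRS.R")       # skill-curve figure for the PgRS comparison
```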

The rest of this vignette focuses on (1).

```{r load, message = FALSE, include = FALSE}
library(dplyr)
library(tidyr)
library(gridExtra)
library(RSpectra)
library(rlist)
```

```{r, include = FALSE}
library(rtForecastEval)
```

```{r}
library(ggplot2)
library(tibble)
library(MASS)
nsamp <- 100 # number of in-game events
ngame <- 200 # number of games (README example uses the same for comparable figures)

# Parameters for the eigenvalue truncation and the Monte Carlo p-value
D <- 10 # number of eigenvalues to keep
N_MC <- 5000 # Monte Carlo draws for simulating the p-value
L <- function(x, y) {
  (x - y)^2 # squared-error (Brier) loss
}

# Data generation ---------------------------------------------------------
df_equ <- df_gen(N = nsamp, Ngame = ngame) %>%
  group_by(grid) %>%
  mutate(
    p_bar_12 = mean(phat_A - phat_B),
    diff_non_cent = phat_A - phat_B,
    diff_cent = phat_A - phat_B - p_bar_12
  ) %>% ungroup()

# Apply our test (explicit construction, as in utility.R) ----------------

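# Z = (ngame / nsamp) * sum over the grid of delta_n(t)^2, where delta_n(t) is
# the mean loss difference across games at grid point t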
Z <- df_equ %>% group_by(grid) %>%
  summarise(delta_n = mean(L(phat_A, Y) - L(phat_B, Y))) %>%
  {
    sum((.)$delta_n^2) / nsamp * ngame
  }

temp <- df_equ %>% group_split(grid, .keep = FALSE)

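# "hat" kernel: grid-by-grid cross-products of the non-centered differences
# phat_A - phat_B, averaged over games; the top-D eigenvalues (rescaled by
# 1/nsamp) weight the chi-square mixture used for the null quantiles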
eigV_hat <- lapply(1:nsamp, function(i) {
  sapply(1:nsamp, function(j) {
    as.numeric(temp[[i]]$diff_non_cent %*% temp[[j]]$diff_non_cent / ngame)
  })
}) %>% list.rbind() %>% {
  RSpectra::eigs_sym(
    A = (.),
    k = D,
    which = "LM",
    opts = list(retvec = FALSE)
  )$values
} %>%
  {
    (.)/nsamp
  }

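# "tilde" kernel: the same construction using the centered differences diff_cent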
eigV_til <- lapply(1:nsamp, function(i) {
  sapply(1:nsamp, function(j) {
    as.numeric(temp[[i]]$diff_cent %*% temp[[j]]$diff_cent / ngame)
  })
}) %>% list.rbind() %>% {
  RSpectra::eigs_sym(
    A = (.),
    k = D,
    which = "LM",
    opts = list(retvec = FALSE)
  )$values
} %>%
  {
    (.)/nsamp
  }

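# Monte Carlo draws from the weighted chi-square null limit: sum_d lambda_d * chisq_1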
MC_hat <- sapply(1:N_MC, function(x) {
  crossprod(eigV_hat, rchisq(D, df = 1))
})

q_90_hat <- quantile(MC_hat, 0.90)
q_95_hat <- quantile(MC_hat, 0.95)
q_99_hat <- quantile(MC_hat, 0.99)

MC_til <- sapply(1:N_MC, function(x) {
  crossprod(eigV_til, rchisq(D, df = 1))
})

q_90_til <- quantile(MC_til, 0.90)
q_95_til <- quantile(MC_til, 0.95)
q_99_til <- quantile(MC_til, 0.99)

p_hat <- 1 - ecdf(MC_hat)(Z) # p-value against the non-centered null reference
p_til <- 1 - ecdf(MC_til)(Z) # p-value against the centered null reference

tibble(
  type = c("non-center", "center"),
  Z = rep(Z, 2),
  pval = c(p_hat, p_til),
  "90%" = c(q_90_hat, q_90_til),
  "95%" = c(q_95_hat, q_95_til),
  "99%" = c(q_99_hat, q_99_til)
)
```

### Same analysis using package functions

```{r function wrappers, fig.width = 7, fig.height = 4.2, fig.cap = "Pointwise mean loss difference (A vs B) with naive normal band — simulation setting. This is a skill curve, not a calibration diagram."}
to_center <- FALSE

ZZ <- calc_Z(df = df_equ, pA = "phat_A", pB = "phat_B", Y = "Y", nsamp = nsamp, ngame = ngame)
eigg <- calc_eig(
  df = df_equ, n_eig = D, ngame = ngame,
  nsamp = nsamp, grid = "grid", cent = to_center
)
oh <- calc_pval(ZZ, eig = eigg, quan = c(0.90, 0.95, 0.99), n_MC = N_MC)

temp <- calc_L_s2(df = df_equ, pA = "phat_A", pB = "phat_B")

plot_pcb(df = temp)

tibble(
  type = ifelse(to_center, "center", "non-center"),
  Z = ZZ,
  pval = oh$p_val,
  "90%" = oh$quantile[1],
  "95%" = oh$quantile[2],
  "99%" = oh$quantile[3]
)
```
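
The table above uses the non-centered eigenvalues because `to_center <- FALSE`. A minimal sketch (not run) of the centered variant, reusing the same arguments as above:

```{r centered-wrapper-sketch, eval = FALSE}
# Centered eigenvalues and the corresponding quantiles / p-value (cent = TRUE)
eigg_cent <- calc_eig(
  df = df_equ, n_eig = D, ngame = ngame,
  nsamp = nsamp, grid = "grid", cent = TRUE
)
oh_cent <- calc_pval(ZZ, eig = eigg_cent, quan = c(0.90, 0.95, 0.99), n_MC = N_MC)
```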

### A simple calibration (reliability) view at one time

The previous figure tracks **skill** (which forecaster accumulates less Brier loss over time). A complementary check is **marginal calibration**: within a narrow time slice, do predicted probabilities match observed event frequencies? **Grey vertical** segments show a **95% central range** for $\bar Y$ under **$H_0$** in each bin: $n\bar Y \sim \mathrm{Binomial}(n, p)$ with $p = \overline{\hat p}$, using **exact** `qbinom` bounds (avoiding a normal approximation whose clipped bands can be distorted in small bins). The paper’s NBA figures use richer calibration **surfaces** (time × forecast × residual); here we only **bin** forecasts at a single grid point (the one closest to mid-game, $t \approx 0.5$) and plot mean outcome against mean forecast, a standard reliability diagram.
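
As a quick numeric check of how each band is built, consider a hypothetical bin with 20 games and a mean forecast of 0.6; the exact central 95% range for $\bar Y$ is then roughly 0.40 to 0.80:

```{r qbinom-band-check}
# Exact 95% central Binomial range for mean(Y); hypothetical bin with n = 20 games, mean forecast 0.6
stats::qbinom(c(0.025, 0.975), size = 20, prob = 0.6) / 20
```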

```{r calibration-reliability, fig.width = 7.5, fig.height = 4, fig.cap = "Binned reliability diagram at a fixed grid (closest to 0.5): mean outcome vs mean forecast for A and B; grey bars are exact 95% central Binomial range for mean(Y) under H0. Points near the diagonal indicate better marginal calibration at that time."}
g_mid <- df_equ %>%
  distinct(grid) %>%
  slice_min(order_by = abs(grid - 0.5), n = 1, with_ties = FALSE) %>% # single grid point closest to mid-game
  pull(grid)

slice_t <- df_equ %>%
  filter(grid == g_mid) %>%
  transmute(
    game,
    Y,
    phat_A,
    phat_B
  )

rel_long <- slice_t %>%
  tidyr::pivot_longer(c(phat_A, phat_B), names_to = "forecaster", values_to = "phat") %>%
  mutate(forecaster = ifelse(forecaster == "phat_A", "Forecaster A", "Forecaster B"))

rel_binned <- rel_long %>%
  group_by(forecaster) %>%
  mutate(bin = ntile(phat, 10)) %>%
  group_by(forecaster, bin) %>%
  summarise(
    mean_forecast = mean(phat),
    mean_outcome = mean(Y),
    n_games = dplyr::n(),
    .groups = "drop"
  ) %>%
  mutate(
    p_bin = pmin(pmax(mean_forecast, 0), 1),
    ci_lo = stats::qbinom(0.025, size = n_games, prob = p_bin) / n_games,
    ci_hi = stats::qbinom(0.975, size = n_games, prob = p_bin) / n_games
  )

ggplot(rel_binned, aes(mean_forecast, mean_outcome)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey45") +
  geom_errorbar(
    aes(x = mean_forecast, ymin = ci_lo, ymax = ci_hi),
    width = 0.02,
    color = "grey55",
    alpha = 0.85,
    linewidth = 0.35
  ) +
  geom_point(aes(size = n_games), alpha = 0.9, color = "darkred") +
  facet_wrap(~forecaster, nrow = 1) +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1), ratio = 1) +
  labs(
    title = "Binned reliability (calibration) at one grid",
    subtitle = paste0(
      "Grid = ", signif(g_mid, 3),
      " (closest to 0.5); grey: exact 95% Binomial range for mean(Y) under H0"
    ),
    x = "Mean forecast in bin",
    y = "Mean outcome (Y) in bin",
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold")
  )
```

### Reliability with shared bins (two standard diagrams)

The facet reliability figure above uses **separate** decile bins per forecaster. Here we bin the **same games** with **equal-width** cuts of the **midpoint** $m = (\hat p_A + \hat p_B)/2$, then plot **one classical reliability diagram per forecaster** (mean forecast vs mean outcome, 45° line). **Grey vertical** segments are the **exact Binomial 95%** central range for $\bar Y$ under **$H_0$** at each bin’s mean forecast. We avoid overlaying both forecasters in a single panel so that each diagram stays readable.
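
For reference, a toy call showing how `cut()` assigns these equal-width decile bins of the midpoint (the lowest bin is closed on the left):

```{r midpoint-cut-example}
# Toy illustration of the equal-width binning used below (values are made up)
cut(c(0.03, 0.48, 0.97), breaks = seq(0, 1, length.out = 11), include.lowest = TRUE)
```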

```{r reliability-joint-difference, fig.width = 8, fig.height = 4.2, fig.cap = "Shared midpoint bins: two standard reliability diagrams (A and B); grey bars = exact 95% Binomial range for mean(Y) under H0."}
cal_bins <- slice_t %>%
  mutate(mid = (phat_A + phat_B) / 2) %>%
  mutate(bin = cut(mid, breaks = seq(0, 1, length.out = 11), include.lowest = TRUE))

rel_joint <- cal_bins %>%
  group_by(bin) %>%
  summarise(
    mean_forecast_A = mean(phat_A),
    mean_forecast_B = mean(phat_B),
    mean_outcome = mean(Y),
    n_games = dplyr::n(),
    .groups = "drop"
  ) %>%
  filter(!is.na(bin)) %>%
  mutate(
    p_A = pmin(pmax(mean_forecast_A, 0), 1),
    p_B = pmin(pmax(mean_forecast_B, 0), 1),
    ci_lo_A = stats::qbinom(0.025, size = n_games, prob = p_A) / n_games,
    ci_hi_A = stats::qbinom(0.975, size = n_games, prob = p_A) / n_games,
    ci_lo_B = stats::qbinom(0.025, size = n_games, prob = p_B) / n_games,
    ci_hi_B = stats::qbinom(0.975, size = n_games, prob = p_B) / n_games
  )

rel_joint_long <- rel_joint %>%
  tidyr::pivot_longer(
    c(mean_forecast_A, mean_forecast_B),
    names_to = "which",
    names_prefix = "mean_forecast_",
    values_to = "mean_forecast"
  ) %>%
  mutate(
    forecaster = ifelse(which == "A", "Forecaster A", "Forecaster B"),
    ci_lo = ifelse(which == "A", ci_lo_A, ci_lo_B),
    ci_hi = ifelse(which == "A", ci_hi_A, ci_hi_B)
  )

ggplot(rel_joint_long, aes(mean_forecast, mean_outcome)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey45") +
  geom_errorbar(
    aes(x = mean_forecast, ymin = ci_lo, ymax = ci_hi),
    width = 0.02,
    color = "grey55",
    alpha = 0.85,
    linewidth = 0.4
  ) +
  geom_point(aes(color = forecaster, size = n_games), alpha = 0.9) +
  facet_wrap(~forecaster, nrow = 1) +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1), ratio = 1) +
  scale_color_manual(values = c("Forecaster A" = "steelblue", "Forecaster B" = "orangered")) +
  labs(
    title = "Reliability: shared midpoint bins (one classical diagram per forecaster)",
    subtitle = "Bins from cut((phat_A + phat_B) / 2); grey = exact 95% Binomial H0 range for mean(Y)",
    x = "Mean forecast in bin",
    y = "Mean outcome (Y) in bin",
    color = NULL,
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold"),
    legend.position = "right"
  ) +
  guides(color = "none")
```

### Comparing calibration between forecasters (error curves)

The reliability diagram shows **levels** on the 45° plot; here we plot **calibration errors** $\bar Y - \bar{\hat p}_k$ against the bin midpoint. The **dashed** line is the difference of the two error curves, $(\bar Y - \bar{\hat p}_B) - (\bar Y - \bar{\hat p}_A) = \bar{\hat p}_A - \bar{\hat p}_B$.

```{r calibration-compare-forecasters, fig.width = 7.5, fig.height = 4.5, fig.cap = "Mean calibration error per forecaster (solid) and the difference of the two error curves, B minus A (dashed), binned by (phat_A + phat_B) / 2 at the same grid as the reliability figure."}
cal_compare <- cal_bins %>%
  group_by(bin) %>%
  summarise(
    mid_hat = mean(mid),
    cal_err_A = mean(Y) - mean(phat_A),
    cal_err_B = mean(Y) - mean(phat_B),
    n_games = dplyr::n(),
    .groups = "drop"
  ) %>%
  filter(!is.na(bin))

cal_long <- cal_compare %>%
  tidyr::pivot_longer(
    c(cal_err_A, cal_err_B),
    names_to = "which",
    values_to = "cal_err"
  ) %>%
  mutate(
    which = ifelse(which == "cal_err_A", "A: mean(Y) - phat_A", "B: mean(Y) - phat_B")
  )

ggplot(cal_long, aes(mid_hat, cal_err, color = which)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey70") +
  geom_line(linewidth = 0.85) +
  geom_point(aes(size = n_games), alpha = 0.85) +
  geom_line(
    data = cal_compare,
    aes(mid_hat, cal_err_B - cal_err_A),
    inherit.aes = FALSE,
    color = "grey35",
    linetype = "dashed",
    linewidth = 0.65
  ) +
  labs(
    title = "Calibration comparison (same midpoint bins)",
    subtitle = "Solid: mean calibration error per forecaster; dashed: mean(phat_B) - mean(phat_A)",
    x = "Mean (phat_A + phat_B) / 2 in bin",
    y = "mean(Y) - mean(phat)",
    color = "Curve",
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold"),
    legend.position = "bottom"
  )
```

```{r vignette-options-restore, include = FALSE}
options(.old_opts)
```
