Policy-evaluation primitives from the HM Treasury Magenta Book, in R.
The Magenta Book is HM Treasury’s guidance on how to evaluate policies, programmes, and projects funded by UK central government. It is the evaluation companion to the Green Book (which covers appraisal). Together they bookend the full ROAMEF cycle: rationale, objectives, appraisal, monitoring, evaluation, feedback. The current edition is the 2020 update.
The guidance covers four core areas and is supplemented by sector-specific guidance (DESNZ, DfT, DHSC) and by the What Works Network’s confidence-rating taxonomy.
A practitioner planning or reporting a Magenta Book evaluation today mostly assembles it in Word documents and spreadsheets, with sample-size formulas hand-typed from textbooks and confidence rubrics copied from PDFs. magentabook puts the same primitives in R, so an evaluation becomes code that can be tested, reviewed, and reproduced.
No existing R or Python package implements the Magenta Book. UK evaluation practitioners hand-roll the same theory-of-change templates, sample-size formulas, and confidence rubrics on every project. The arithmetic is simple, but the framework is large and the parameters change: the Maryland SMS rubric is a five-level table, ICCs vary by domain, and the Magenta Book confidence rubric has three levels with explicit dimensions.
magentabook solves three problems:
- mb_data_versions() shows the source and last-updated date for every bundled rubric and reference table.
- It pairs with greenbook, so the same R session covers appraisal and evaluation.
- The package is pure computation: no network calls, no API keys. Bundled rubric and reference tables in inst/extdata/ are refreshed via data-raw/ scripts.
# install.packages("magentabook")   # not yet on CRAN
# Development version:
# install.packages("remotes")
remotes::install_github("charlescoverdale/magentabook")

library(magentabook)
# Theory of change for a skills programme
toc <- mb_theory_of_change(
inputs = c("GBP 50m grant", "12 FTE programme team"),
activities = c("Design training", "Deliver workshops"),
outputs = c("500 workshops delivered", "8000 attendees"),
outcomes = c("Improved skills", "Increased confidence"),
impact = "Higher employment among the target group",
assumptions = "Workshops cause skills uplift",
external_factors = "Macro labour market remains stable",
name = "Skills uplift programme"
)
mb_logframe(toc)
# Power and sample size
mb_sample_size(effect_size = 0.3, power = 0.8)
mb_mde(n_per_group = 500, type = "proportion", baseline = 0.4)
mb_cluster_design(individuals_per_cluster = 30, icc = 0.05, n_clusters = 20)
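# Hand check of the standard Kish design effect used in cluster designs,
# DEFF = 1 + (m - 1) * icc. (Whether mb_cluster_design reports exactly this
# number is an assumption, not package output.)
1 + (30 - 1) * 0.05   # 2.45: variance inflation for m = 30, icc = 0.05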
# Maryland SMS rating + confidence
mb_sms_rate(level = 4, study = "DiD on admin data",
design = "Difference-in-differences with matched comparison")
mb_confidence(
rating = "medium",
question = "Did the policy raise employment?",
evidence_strength = "One Level 4 DiD; one Level 3 matched cohort",
methodological_quality = "Adequate; parallel trends plausible",
generalisability = "Findings established in a single region",
rationale = "Effect direction consistent across two studies"
)
# Cost-effectiveness
mb_cea(cost = 1e6, effect = 250, label = "Workshop programme")
mb_icer(cost_a = 1e6, effect_a = 200,
cost_b = 1.5e6, effect_b = 300,
label_a = "Status quo", label_b = "Enhanced")
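# Hand check of the ICER arithmetic for the inputs above:
# ICER = (cost_b - cost_a) / (effect_b - effect_a)
(1.5e6 - 1e6) / (300 - 200)   # 5000 per additional unit of effect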
# Quick estimators
set.seed(1)
n <- 400
treated <- rep(c(0, 1), each = n / 2)
post <- rep(c(0, 1), times = n / 2)
y <- 0.4 * treated * post + rnorm(n)
mb_did_2x2(y, treated, post)
# Inspect bundled vintages
mb_data_versions()

| Family | Functions |
|---|---|
| Theory of change | mb_theory_of_change(), mb_logframe(), mb_assumptions() |
| Planning | mb_evaluation_plan(), mb_questions(), mb_counterfactual(), mb_stakeholders(), mb_balance_table() |
| Power and design | mb_power(), mb_mde(), mb_sample_size(), mb_cluster_design(), mb_stepped_wedge(), mb_icc_reference() |
| Maryland SMS | mb_sms_rate(), mb_sms_explain() |
| Confidence | mb_confidence(), mb_confidence_summary() |
| Estimators | mb_did_2x2(), mb_its(), mb_event_study() |
| Cost-effectiveness | mb_cea(), mb_icer(), mb_ceac(), mb_inb(), mb_qaly(), mb_daly() |
| Realist / theory-based | mb_cmo(), mb_contribution_claim() |
| Reporting | mb_evaluation_report(), mb_to_word(), mb_to_excel(), mb_to_latex() |
| Lookups | mb_data_versions(), mb_schedule_table() |
| Dataset | Source | Notes |
|---|---|---|
| Maryland SMS rubric | Sherman et al. (1997); Magenta Book (2020) | 1-5 rubric: design examples, causal inference, typical uses |
| Confidence rubric | Synthesis across What Works Centre traditions | 3-level rubric: evidence strength, methodological quality, generalisability |
| ICC reference values | Hedges & Hedberg (2007); Adams et al. (2004); Campbell et al. (2000); EEF / DfE / DWP / MHCLG / MoJ | Reference low / central / high ICCs across UK policy domains |
| Question taxonomy | Magenta Book (2020) | 19 canonical evaluation questions tagged by type and method |
All datasets are refreshed via the scripts in data-raw/.
Vintages are visible via mb_data_versions().
Decision-grade use depends on knowing what is a direct quotation and what is a researcher synthesis. magentabook is explicit about this:
| Bundled item | Status | What is verbatim | What is magentabook synthesis |
|---|---|---|---|
| Maryland SMS levels 1-5 | Verbatim numeric scale | The five-level structure is direct from Sherman et al. (1997) | Word labels (Weakest / Weak / Moderate / Strong / Strongest) follow What Works UK / EEF convention. The design-examples and typical-use columns are practitioner-oriented synthesis. |
| Magenta Book confidence rubric | Synthesis | The three-level high / medium / low structure aligns with the Magenta Book (2020) supplementary value-for-money framing | The full rubric is not a direct quotation from the Magenta Book. It is synthesised from EEF (5 padlocks), Early Intervention Foundation (Foundation Standards), College of Policing (1-5 scale), and Justice Data Lab (red / amber / green) confidence traditions. |
| ICC reference values | Mixed | Each row carries a value_source flag: "table_quote" for direct extraction with table number, "central_estimate" for researcher synthesis within the published range. | At v0.1.0 every row is central_estimate. Future versions will upgrade individual rows to table_quote as exact citations are added. Always compute domain-specific ICCs from baseline data before relying on these in a published power calculation. |
| Question taxonomy | Verbatim structure | The four types (process, impact, economic, value-for-money) and their canonical questions are from Magenta Book (2020) chapters | Sub-types (e.g. “attribution”, “fidelity”) are conventional categories used across HMG evaluation practice. |
Practitioner rule: use the structure of the bundled rubrics with confidence; substitute your project-specific content (rubric values, ICC estimates) where decision-grade reporting requires it.
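The "compute domain-specific ICCs from baseline data" advice can be sketched in a few lines of base R. This is a minimal one-way-ANOVA (plug-in) ICC estimator on simulated baseline data; the cluster counts, variance components, and variable names are illustrative, not package defaults:

```r
# Estimate a baseline ICC with the one-way ANOVA estimator:
# icc = (MSB - MSW) / (MSB + (m - 1) * MSW), assuming equal cluster sizes m.
set.seed(42)
k <- 20; m <- 30                                  # clusters, units per cluster
cluster <- factor(rep(seq_len(k), each = m))
y <- rnorm(k, sd = 0.3)[cluster] + rnorm(k * m)   # true ICC = 0.09 / 1.09, about 0.08
tab <- anova(aov(y ~ cluster))
msb <- tab["cluster", "Mean Sq"]
msw <- tab["Residuals", "Mean Sq"]
(msb - msw) / (msb + (m - 1) * msw)               # plug-in ICC estimate
```

With only 20 clusters the estimate is noisy, which is exactly why a project-specific baseline estimate beats a bundled central value for decision-grade power work.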
The arithmetic primitives are cross-validated against the canonical reference implementations on every R CMD check (when the optional packages are installed):

- pwr (pwr.t.test, pwr.2p.test): agreement within ~2-3 percentage points of power, ~5 per arm on required N. The discrepancy reflects magentabook’s normal approximation vs pwr’s noncentral t.
- mb_did_2x2 vs sandwich::vcovCL with type = "HC1": agreement to within 1e-6. The CR1 estimator and the Stata-style finite-sample correction (G/(G-1)) * (N-1)/(N-K) are implemented identically.
- Point estimates vs lm(y ~ treated * post)$coefficients: agreement to floating-point precision.
- swCRTdesign::swPwr: the closed-form Hemming approximation tracks the exact Hussey-Hughes variance to within roughly 0.5x to 2x for typical UK evaluation designs (T = 4-6, m = 20-50, rho = 0.02-0.10). For decision-grade sample-size work, prefer swCRTdesign::swPwr; magentabook’s stepped-wedge function is intended for quick comparative exploration.
- BCEA: mb_icer agrees with the BCEA::bcea point ICER to floating-point precision; mb_ceac produces identical CEAC probabilities for the same draws.
- mb_balance_table vs cobalt::bal.tab with s.d.denom = "pooled": agreement to 1e-8 on balanced samples (where the equal-weighted and df-weighted pooled-SD forms coincide).

See the tests/testthat/test-*-equivalence.R files for the full test grids.
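The normal approximation referenced in the pwr comparison can be reproduced by hand. This is the textbook two-sample formula, not magentabook’s internal code:

```r
# Required n per arm under the normal approximation:
# n = 2 * ((z_{1 - alpha/2} + z_{1 - beta}) / d)^2
n_per_arm <- function(d, power = 0.8, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  ceiling(2 * (z / d)^2)
}
n_per_arm(0.3)   # 175 per arm; pwr's noncentral-t answer is slightly larger
```

The gap between this and pwr.t.test shrinks as n grows, which is why the equivalence tests tolerate a few units per arm.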
magentabook provides framework primitives plus lightweight versions of the most common quantitative methods. For production-grade quasi-experimental estimation, use the specialist packages:

- Difference-in-differences: did, didimputation, fixest::feols(... sunab(...))
- Synthetic control: Synth, tidysynth, augsynth
- Regression discontinuity: rdrobust, rddtools
- Instrumental variables: ivreg, fixest::feols(... | ... ~ ...), ivcheck for diagnostics
- Clustered inference: sandwich, clubSandwich

The lightweight implementations of mb_did_2x2, mb_its, and mb_event_study are deliberately canonical: they are useful for sanity checks, teaching, and headline estimates, and each docstring points to the right specialist package for production work.
greenbook provides UK Green Book appraisal primitives
(STPR, NPV, optimism bias, distributional weights, METB, DESNZ carbon
values, VPF, WELLBYs). Together, greenbook +
magentabook cover the full appraisal-to-evaluation
spine.
# Appraisal: discount future net benefits to present value
greenbook::gb_npv(cashflow = c(-100, 30, 30, 30, 30, 30))
# Evaluation: did the realised effect justify the cost?
magentabook::mb_icer(cost_a = 1e6, effect_a = 200,
                     cost_b = 1.5e6, effect_b = 300)

See the vignette “Cost-effectiveness with magentabook and greenbook” for a worked end-to-end example.
HM Treasury (2020). The Magenta Book: Central Government Guidance on Evaluation. London: HMSO.
Sherman, L. W., Gottfredson, D. C., MacKenzie, D. L., Eck, J., Reuter, P., Bushway, S. (1997). Preventing Crime: What Works, What Doesn’t, What’s Promising. Report to the US Congress.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum.
Drummond, M. F., Sculpher, M. J., Claxton, K., Stoddart, G. L., Torrance, G. W. (2015). Methods for the Economic Evaluation of Health Care Programmes (4th ed.). Oxford University Press.
Hemming, K., Haines, T. P., Chilton, P. J., Girling, A. J., Lilford, R. J. (2015). The stepped wedge cluster randomised trial: rationale, design, analysis, and reporting. BMJ 350.
If you use magentabook in published work, please cite via:

citation("magentabook")

The package citation and the underlying HM Treasury Magenta Book are both returned.
Report bugs or request features at GitHub Issues.
policy-evaluation, magenta-book, hm-treasury, theory-of-change, logframe, evaluation-design, power-analysis, sample-size, minimum-detectable-effect, cluster-rct, stepped-wedge, intra-class-correlation, maryland-sms, scientific-methods-scale, what-works, confidence-rating, cost-effectiveness, icer, ceac, qaly, daly, difference-in-differences, interrupted-time-series, event-study, realist-evaluation, contribution-analysis, cabinet-office-evaluation-task-force