---
title: "Canonical replications: top 1% share, corporate ETR, tax gap trend"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Canonical replications: top 1% share, corporate ETR, tax gap trend}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE)
```

Four tax-research exercises, each in around 20 lines: the three
canonical replications named in the title, plus a short HELP debt
example.

## 1. Top 1 per cent income share (Atkinson-Leigh style)

The approach below is loosely inspired by Atkinson and Leigh (2007),
"The Distribution of Top Incomes in Australia," *Economic Record*,
83(262), 247-261 (<https://doi.org/10.1111/j.1475-4932.2007.00412.x>).
Atkinson and Leigh reconstruct top-income shares from ATO published
tabulations. The postcode-based version here is a cruder proxy that
uses the published postcode data directly, with no interpolation:
it ranks postcodes rather than individuals, so high earners living
in lower-income postcodes are missed. For rigorous top-share work,
use Pareto interpolation on the ATO top-percentile table or apply
for ALife microdata access. See also Burkhauser, Hahn and Wilkins
(2015) for caveats (<https://doi.org/10.1007/s10888-014-9281-z>).

```{r}
library(ato)
ato_snapshot("2026-04-24")

pc_panel <- ato_individuals_postcode(
  year = c("2015-16", "2016-17", "2017-18", "2018-19",
           "2019-20", "2020-21", "2021-22", "2022-23")
)
pc_panel <- ato_harmonise(pc_panel)

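# For each year, rank postcodes by mean taxable income per return,
# take the postcodes covering the top 1% of returns, and compute
# their share of total taxable income.
top1 <- function(df) {
  # Drop incomplete rows first so cumsum() and the totals agree;
  # a single NA would otherwise turn every later cumsum() value NA.
  ok <- !is.na(df$taxable_income) & !is.na(df$number_of_individuals) &
    df$number_of_individuals > 0
  df <- df[ok, ]
  df <- df[order(-df$taxable_income / df$number_of_individuals), ]
  cum_returns <- cumsum(df$number_of_individuals)
  total_returns <- sum(df$number_of_individuals)
  # The cutoff postcode is included whole, so the selected group can
  # slightly exceed 1% of returns.
  cutoff <- which(cum_returns >= 0.01 * total_returns)[1]
  sum(df$taxable_income[seq_len(cutoff)]) / sum(df$taxable_income)
}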

shares <- by(pc_panel, pc_panel$year, top1)
shares
```
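
For comparison, the Pareto interpolation mentioned above can be
sketched in a few lines. This is a minimal illustration, not the
Atkinson-Leigh procedure itself: `pareto_top_share()` and all of its
inputs are hypothetical, and the numbers are made up rather than
taken from any ATO table. It assumes incomes above a tabulated
threshold follow a Pareto distribution, estimates the shape
parameter from the ratio of mean income above the threshold to the
threshold, and scales a known top-`p` share down to a top-`q` share.

```{r}
# Illustrative sketch only: synthetic inputs, not ATO figures.
# p          share of taxpayers above `threshold` (e.g. 0.05)
# share_p    share of total income held by that group
# threshold  income threshold for the group
# mean_above mean income of taxpayers above the threshold
# q          target top fraction, with q <= p (e.g. 0.01)
pareto_top_share <- function(p, share_p, threshold, mean_above, q) {
  stopifnot(q > 0, q <= p, mean_above > threshold)
  beta  <- mean_above / threshold   # inverted Pareto coefficient
  alpha <- beta / (beta - 1)        # Pareto shape parameter
  # Under a Pareto tail, the income share of the top q scales as
  # q^(1 - 1/alpha).
  share_p * (q / p)^(1 - 1 / alpha)
}

# Hypothetical tabulation row: the top 5% earn above 150,000,
# average 300,000, and hold 20% of total income.
top1_share <- pareto_top_share(p = 0.05, share_p = 0.20,
                               threshold = 150000,
                               mean_above = 300000, q = 0.01)
top1_share
```

With these made-up inputs the implied top 1% share is about 8.9%.
The result is only as good as the Pareto assumption and the choice
of threshold, which is why the published literature checks several
thresholds.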

## 2. Corporate effective tax rate by entity type (transparency data)

```{r}
ctt <- ato_top_taxpayers(year = "2022-23")

# Effective tax rate = tax payable / taxable income. Keep only
# entities with positive taxable income: the ratio is undefined at
# zero, and loss-makers need a separate analysis.
ctt <- ctt[!is.na(ctt$taxable_income) & ctt$taxable_income > 0, ]
ctt$etr <- ctt$tax_payable / ctt$taxable_income

# Median ETR per entity_type, highest first.
by_entity <- aggregate(etr ~ entity_type, data = ctt, FUN = median)
by_entity[order(-by_entity$etr), ]
```
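
The median of entity-level ratios and the pooled ratio (total tax
over total taxable income) can differ a lot when a few large
entities dominate a group. The sketch below shows the difference on
a synthetic data frame; the rows are invented for illustration and
only the column names follow the chunk above.

```{r}
# Synthetic rows for illustration only; not ATO data.
toy <- data.frame(
  entity_type    = c("A", "A", "A", "B", "B"),
  taxable_income = c(100, 200, 1000, 50, 150),
  tax_payable    = c(30, 50, 100, 15, 45)
)
toy$etr <- toy$tax_payable / toy$taxable_income

# Median of entity-level ETRs: every entity counts equally.
median_etr <- aggregate(etr ~ entity_type, data = toy, FUN = median)

# Pooled ETR: large entities carry more weight.
sums <- aggregate(cbind(tax_payable, taxable_income) ~ entity_type,
                  data = toy, FUN = sum)
sums$pooled_etr <- sums$tax_payable / sums$taxable_income

cmp <- merge(median_etr, sums[, c("entity_type", "pooled_etr")])
cmp
```

In this toy data, group A has a median ETR of 0.25 but a pooled ETR
of about 0.14, because its largest entity pays the lowest rate.
Which measure is appropriate depends on whether the question is
about the typical entity or about the group's aggregate tax take.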

## 3. Tax gap trend and confidence context

```{r}
library(ggplot2)

tg <- ato_tax_gaps()

# group = tax_gap_type keeps one line per gap type even if `year`
# is a character or factor column.
ggplot(tg, aes(x = year, y = tax_gap_estimate,
               colour = tax_gap_type, group = tax_gap_type)) +
  geom_line() +
  labs(title = "ATO estimated tax gaps over time",
       x = NULL, y = "Estimated tax gap (AUD million)",
       colour = "Gap type",
       caption = "Source: ATO Taxation Statistics. Retrieved via ato package.") +
  theme_minimal()
```
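
If `year` comes back as a financial-year label such as `"2015-16"`
(an assumption; the column may already be numeric), a continuous
x-axis needs a numeric year. The helper below is hypothetical and
not part of the package: it maps each label to the calendar year in
which the financial year ends.

```{r}
# Hypothetical helper: "2015-16" -> 2016. Takes the first four
# digits and adds one, so century rollovers ("1999-00") also work.
fy_end_year <- function(fy) {
  as.integer(substr(fy, 1, 4)) + 1L
}

end_years <- fy_end_year(c("2015-16", "1999-00"))
end_years

# Possible use before plotting (column name assumed):
# tg$year_end <- fy_end_year(tg$year)
```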

## 4. HELP debt by age cohort

```{r}
help_data <- ato_help()

# Rows are bucketed by age range. Deflate nominal debt to 2022-23
# dollars so years are comparable in real terms.
help_data$real <- ato_deflate(help_data$total_debt,
                              year = help_data$year,
                              base = "2022-23")
head(help_data)
```
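
The internals of `ato_deflate()` are not shown in this vignette. As
a rough guide to what real-terms deflation does, the sketch below
assumes the standard index-ratio approach: multiply each nominal
value by the base-year index over the index for its own year. The
`deflate_to_base()` helper and the index values are hypothetical and
chosen only to make the arithmetic easy to follow.

```{r}
# Hypothetical price index values, for illustration only.
cpi <- c("2020-21" = 100, "2021-22" = 105, "2022-23" = 112)

# real = nominal * index[base] / index[year]
deflate_to_base <- function(nominal, year, base, index = cpi) {
  stopifnot(all(year %in% names(index)), base %in% names(index))
  nominal * unname(index[base]) / unname(index[year])
}

real <- deflate_to_base(
  nominal = c(1000, 1000, 1000),
  year    = c("2020-21", "2021-22", "2022-23"),
  base    = "2022-23"
)
real
```

A nominal 1000 from 2020-21 becomes 1120 in 2022-23 dollars under
these made-up index values, while the base-year value is unchanged.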

Each replication combines an ATO published release, an optional
harmonise, deflate or reconcile step, and a small computation. The
provenance header (snapshot pin plus SHA-256) lets a reader confirm
which inputs produced the result.