---
title: "Advanced Workflows"
author: "Maciej Nasinski"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Advanced Workflows}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE, message = FALSE, warning = FALSE,
  collapse = TRUE, comment = "#>"
)
```

This vignette collects the advanced `cat2cat` workflows in one place: 
ML weights, multi-period chaining, panel data with identifiers, aggregated data, and regression on replicated data. 
It assumes you've read [Get Started](cat2cat.html).

Use this vignette by module:

1. **ML weights** if you want feature-informed probabilities rather than only naive or frequency weights.
2. **Multi-period chaining** if your harmonisation spans 3 or more waves.
3. **Panel data with subject identifiers** if you have a rotational panel or other stable subject IDs.
4. **Aggregated data and special cases** if you only observe category totals or need hierarchical-code mappings.
5. **Regression on replicated data** if your main task is estimation and inference after harmonisation.

```{r load-data}
library(cat2cat)
library(dplyr)
library(tidyr)
library(fixest)

data(occup, package = "cat2cat")
data(occup_panel, package = "cat2cat")
data(trans, package = "cat2cat")
data(verticals, package = "cat2cat")
data(verticals2, package = "cat2cat")

occup_2006 <- occup[occup$year == 2006, ]
occup_2008 <- occup[occup$year == 2008, ]
occup_2010 <- occup[occup$year == 2010, ]
occup_2012 <- occup[occup$year == 2012, ]
```

## ML weights

Machine-learning weights are useful when the mapping table alone is too coarse and observed features help distinguish which target category is most plausible for a replicated observation.

This section shows:

- when ML weights are worth trying,
- how to specify the `ml` argument in `cat2cat()`,
- how to validate whether ML improves on simpler baselines,
- and how to handle rows where ML probabilities cannot be produced.

In practice, ML is most helpful when replication is substantial and the ambiguous categories differ systematically by observed characteristics such as age, education, experience, or salary. If ML does not improve on the frequency baseline in `cat2cat_ml_run()`, the simpler `wei_freq_c2c` weights are usually the better choice.

ML features must be numeric, logical, or factor columns. Factor columns are
one-hot encoded automatically; character columns are not, so convert character
categories to factors before listing them in `features`.

```{r ml-minimal, eval=FALSE}
ml_setup <- list(
  data = bind_rows(occup_2010, occup_2012),
  cat_var = "code",
  method = c("knn", "rf", "lda"),
  features = c("age", "sex", "edu", "exp", "parttime", "salary"),
  args = list(k = 10, ntree = 50),
  on_fail = "freq",
  fail_warn = TRUE
)

result_ml <- cat2cat(
  data = list(old = occup_2008, new = occup_2010,
              cat_var = "code", time_var = "year"),
  mappings = list(trans = trans, direction = "backward"),
  ml = ml_setup
)
```

Validate whether ML adds value before using it in production:

```{r ml-validate, eval=FALSE}
cv <- cat2cat_ml_run(
  mappings = list(trans = trans, direction = "backward"),
  ml = ml_setup
)
print(cv)
```

Baseline-only diagnostics are also available:

```{r baseline-only-validate, eval=FALSE}
ml_baseline <- list(
  data = bind_rows(occup_2010, occup_2012),
  cat_var = "code",
  method = character(0),
  features = character(0)
)

cv_baseline <- cat2cat_ml_run(
  mappings = list(trans = trans, direction = "backward"),
  ml = ml_baseline
)
print(cv_baseline)
```

Use the same `direction` in diagnostics as in the mapping workflow you want to
evaluate, because the mapping groups and base-period frequencies differ by
direction.

If ML probabilities cannot be produced for some replicated rows, use `on_fail` and `fail_warn`:

- `on_fail = "freq"` (default): replace failed ML rows with `wei_freq_c2c`
- `on_fail = "naive"`: replace failed ML rows with `wei_naive_c2c`
- `on_fail = "na"`: keep failed ML rows as `NA`
- `on_fail = "error"`: stop immediately
- `fail_warn = TRUE` (default): warn with affected rows/observations per method
- `fail_warn = FALSE`: silence the warning

## Multi-period chaining

Repeated cross-sections or longitudinal datasets often contain more than two waves. To bring all waves onto the same encoding, apply `cat2cat()` iteratively, feeding the output of one step into the next.

Use this module when you need one harmonised categorical variable across 3 or more periods. The main design choice is whether to chain backward or forward, and whether truncating hierarchical codes can reduce replication before chaining.

In the examples below, `occup` is a repeated cross-section dataset, not a panel. The chained outputs therefore combine harmonised cross-sections from multiple years.

### Optional: passing `mappings$freqs_df`

In most cases you do **not** need to pass `mappings$freqs_df`.
If it is omitted, `cat2cat()` computes base-period frequencies internally from the provided data.

Pass `mappings$freqs_df` only when you need explicit control over base frequencies.

The object should be a two-column data frame: category name in the first column and counts in the second.

### Mapping table stability and truncation

Before chaining, inspect how disruptive the transition table is:

```{r trans-stability}
max_digits <- max(nchar(as.character(trans[[1]])), nchar(as.character(trans[[2]])))

stability <- sapply(1:max_digits, function(d) {
  old_trunc <- substr(as.character(trans[[1]]), 1, d)
  new_trunc <- substr(as.character(trans[[2]]), 1, d)
  mean(old_trunc != new_trunc) * 100
})

data.frame(
  digits = 1:max_digits,
  pct_changed = round(stability, 1)
)
```

Truncating codes to fewer digits can reduce replication:

For backward mapping this is often true, but for forward mapping truncation can move in the opposite direction. Collapsing detailed codes into broader prefixes can create additional many-to-one and one-to-many ties on the old-code side, which may increase replication in the mapped period.

```{r truncation-diagnostic}
occup_2008_trunc <- occup_2008
occup_2008_trunc$code <- substr(occup_2008_trunc$code, 1, 4)
occup_2010_trunc <- occup_2010
occup_2010_trunc$code <- substr(occup_2010_trunc$code, 1, 4)
trans_trunc <- unique(data.frame(
  old = substr(trans$old, 1, 4),
  new = substr(trans$new, 1, 4)
))

back_full <- cat2cat(
  data = list(old = occup_2008, new = occup_2010,
              cat_var = "code", time_var = "year"),
  mappings = list(trans = trans, direction = "backward")
)
back_trunc <- cat2cat(
  data = list(old = occup_2008_trunc, new = occup_2010_trunc,
              cat_var = "code", time_var = "year"),
  mappings = list(trans = trans_trunc, direction = "backward")
)

fwd_full <- cat2cat(
  data = list(old = occup_2008, new = occup_2010,
              cat_var = "code", time_var = "year"),
  mappings = list(trans = trans, direction = "forward")
)
fwd_trunc <- cat2cat(
  data = list(old = occup_2008_trunc, new = occup_2010_trunc,
              cat_var = "code", time_var = "year"),
  mappings = list(trans = trans_trunc, direction = "forward")
)

data.frame(
  mapping = c("backward full (4->6)", "backward trunc (4->4)",
              "forward full (6->4)", "forward trunc (4->4)"),
  mean_rep = c(mean(back_full$old$rep_c2c), mean(back_trunc$old$rep_c2c),
               mean(fwd_full$new$rep_c2c), mean(fwd_trunc$new$rep_c2c))
)
```

### Backward chaining

```{r backward-chain}
step1 <- cat2cat(
  data = list(old = occup_2008, new = occup_2010,
              cat_var = "code", time_var = "year"),
  mappings = list(trans = trans, direction = "backward")
)

step2 <- cat2cat(
  data = list(old = occup_2006, new = step1$old,
              cat_var_old = "code", cat_var_new = "g_new_c2c",
              time_var = "year"),
  mappings = list(trans = trans, direction = "backward")
)

harmonised_back <- bind_rows(
  step2$old,
  step1$old,
  step1$new,
  dummy_c2c(occup_2012, "code")
)
```

Validation: weighted counts should match the original counts within each year.

```{r validate-backward-counts}
harmonised_back %>%
  group_by(year) %>%
  summarise(weighted_n = round(sum(wei_freq_c2c)), .groups = "drop") %>%
  left_join(count(occup, year), by = "year")
```

### Forward chaining

```{r forward-chain}
trans_fwd <- rbind(
  trans,
  data.frame(old = "no_cat",
             new = setdiff(c(occup_2010$code, occup_2012$code), trans$new))
)

fwd1 <- cat2cat(
  data = list(old = occup_2008, new = occup_2010,
              cat_var = "code", time_var = "year"),
  mappings = list(trans = trans_fwd, direction = "forward")
)

fwd2 <- cat2cat(
  data = list(old = fwd1$new, new = occup_2012,
              cat_var_old = "g_new_c2c", cat_var_new = "code",
              time_var = "year"),
  mappings = list(trans = trans_fwd, direction = "forward")
)

harmonised_fwd <- bind_rows(
  dummy_c2c(occup_2006, "code"),
  fwd1$old,
  fwd1$new,
  fwd2$new
)
```

### Adding ML to the chain

```{r ml-chain, eval=FALSE}
step1_ml <- cat2cat(
  data = list(old = occup_2008, new = occup_2010,
              cat_var = "code", time_var = "year"),
  mappings = list(trans = trans, direction = "backward"),
  ml = ml_setup
)

step2_ml <- cat2cat(
  data = list(old = occup_2006, new = step1_ml$old,
              cat_var_old = "code", cat_var_new = "g_new_c2c",
              time_var = "year"),
  mappings = list(trans = trans, direction = "backward"),
  ml = ml_setup
)
```

## Panel data with subject identifiers

If subjects have stable identifiers across waves, `id_var` can reduce unnecessary replication by directly matching returning subjects.

Use this module only when the identifier truly tracks the same subject across adjacent waves and short-run category changes are unlikely to represent genuine transitions rather than coding changes.

If you have a complete panel with every subject observed in both periods and no missing category values, you may not need probabilistic harmonisation at all. In that case, the target-period category can often be joined back to the earlier record by `id_var`, and the task is mostly a deterministic join. `cat2cat()` is more useful when the panel is incomplete, rotational, or mixed with new entrants and leavers, so some observations still need the mapping-table replication path.

```{r panel-load}
panel_old <- occup_panel[occup_panel$quarter == "2009Q4", ]
panel_new <- occup_panel[occup_panel$quarter == "2010Q1", ]
shared_ids <- intersect(panel_old$panel_id, panel_new$panel_id)
length(shared_ids)
```

```{r idvar-mapping}
result_id <- cat2cat(
  data = list(
    old = panel_old,
    new = panel_new,
    id_var = "panel_id",
    cat_var = "code",
    time_var = "quarter"
  ),
  mappings = list(trans = trans, direction = "backward")
)
```

How `id_var` works:

- direct match: workers observed in both periods receive `rep_c2c = 1` and weight 1,
- replication path: workers observed in only one period go through the standard mapping-table replication.

```{r idvar-inspect}
table(result_id$old$rep_c2c)
sum(result_id$old$wei_freq_c2c)
nrow(panel_old)
```

Compare with and without identifiers:

```{r idvar-compare}
result_no_id <- cat2cat(
  data = list(
    old = panel_old,
    new = panel_new,
    cat_var = "code",
    time_var = "quarter"
  ),
  mappings = list(trans = trans, direction = "backward")
)

cat("WITH id_var average replication:", round(mean(result_id$old$rep_c2c), 2), "\n")
cat("WITHOUT id_var average replication:", round(mean(result_no_id$old$rep_c2c), 2), "\n")
```

Use `id_var` when:

- identifiers are reliable,
- the time gap is short relative to true mobility,
- and direct matches are informative about the coding change.

## Aggregated data and special cases

When only aggregate counts are available, use `cat2cat_agg()` with mapping equations rather than micro-data replication.

Use this module when you do not have person-level data, or when the classification itself has a hierarchical code structure that can be exploited to build coarser mappings.

```{r agg-data}
agg_old <- verticals[verticals$v_date == "2020-04-01", ]
agg_new <- verticals[verticals$v_date == "2020-05-01", ]
```

```{r agg-mapping}
agg <- cat2cat_agg(
  data = list(
    old = agg_old,
    new = agg_new,
    cat_var = "vertical",
    time_var = "v_date",
    freq_var = "counts"
  ),
  Automotive %<% c(Automotive1, Automotive2),
  c(Kids1, Kids2) %>% c(Kids),
  Home %>% c(Home, Supermarket)
)
```

Inspect how categories were proportionally redistributed:

```{r agg-inspect}
agg$old[agg$old$vertical %in% c("Automotive1", "Automotive2"), ]
agg$new[agg$new$vertical %in% c("Kids1", "Kids2"), ]
```

Hierarchical codes can also be used to build coarser mapping tables when an official transition table is unavailable:

```{r hierarchical-codes}
trans_2digit <- data.frame(
  old = substr(trans$old, 1, 2),
  new = substr(trans$new, 1, 2)
)
trans_2digit <- unique(trans_2digit)

cat("2-digit mapping rows:", nrow(trans_2digit),
    "vs full mapping rows:", nrow(trans))
```

This works for classifications with stable prefix hierarchies such as ISCO, ICD, NACE, CPC, or HS codes.

## Regression on replicated data

The replication is neutral for regressions on non-mapped covariates because per-subject weights sum to one. Standard errors, however, must be corrected because replication inflates the row count.

Use this module when your end goal is estimation rather than descriptive harmonisation. The key issue is not coefficient bias for non-mapped regressors, but valid inference after replication.

### Building a 4-period harmonised repeated cross-section dataset

```{r build-panel}
cat2cat_data_back <- bind_rows(
  step2$old,
  step1$old,
  step1$new,
  dummy_c2c(occup_2012, "code")
)
```

### Neutral impact demonstration

```{r regression-neutral}
lms_orig <- lm(
  I(log(salary)) ~ age + sex + factor(edu) + parttime + exp,
  data = occup,
  weights = multiplier
)

lms_harmonised <- lm(
  I(log(salary)) ~ age + sex + factor(edu) + parttime + exp,
  data = cat2cat_data_back,
  weights = multiplier * wei_freq_c2c
)

summary_c2c(lms_harmonised, df_old = nrow(occup))
```

`summary_c2c()` scales naive standard errors by the replication factor:

$$\text{SE}_{\text{corrected}} = \text{SE}_{\text{naive}} \times \sqrt{\frac{n_{\text{rep}}}{n_{\text{orig}}}}$$

Report coefficient estimates with corrected standard errors and p-values from
`summary_c2c()`. Ordinary $R^2$ is preserved in this neutral setup because the
response and covariates do not vary across replicated copies and the weights for
each source observation sum to the original weight. Adjusted $R^2$, AIC, and BIC
depend on sample-size and degrees-of-freedom conventions, so do not report their
replicated-model values unless you recompute them on the intended
original-observation scale. If the harmonised category itself enters the model,
fit statistics are conditional on the chosen harmonisation weights.

### Fixed effects regression

```{r regression-fe}
harmonised_fe <- cat2cat_data_back %>%
  prune_c2c(method = "nonzero") %>%
  mutate(orig_obs_id = interaction(year, index_c2c, drop = TRUE, lex.order = TRUE)) %>%
  filter(!is.na(g_new_c2c), !is.na(salary), salary > 0)

fe_model_cluster <- feols(
  log(salary) ~ age + sex + factor(edu) + parttime + exp | g_new_c2c + year,
  data = harmonised_fe,
  weights = ~multiplier * wei_freq_c2c,
  cluster = ~orig_obs_id
)
summary(fe_model_cluster)
```

Do not cluster directly on `index_c2c` after binding multiple waves, because `index_c2c` is created separately inside each `cat2cat()` call and can repeat across years. 
Instead, build a wave-specific original-observation identifier such as `interaction(year, index_c2c)` and cluster on that. This treats all replications of the same source row as one cluster without incorrectly merging different people from different waves.

## Choosing the right advanced workflow

| Problem | Recommended tool |
|--------|-------------------|
| Need feature-informed weights | `cat2cat(..., ml = ...)` + `cat2cat_ml_run()` |
| Need 3+ wave harmonisation | iterative `cat2cat()` chaining |
| Stable subject identifiers across waves | `id_var` |
| Only aggregated counts available | `cat2cat_agg()` |
| Regression on replicated data | `summary_c2c()` or clustered inference |

## Next steps

- [Get Started](cat2cat.html) for core concepts and the two-period workflow
- [Choosing Weights and Validating ML](cat2cat_validation.html) for weight comparisons and robustness checks