---
title: "Variable schema reference"
format: html
vignette: >
  %\VignetteIndexEntry{Variable schema reference}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(chmsflow)
library(DT)
library(knitr)
library(kableExtra)
```

## Overview

chmsflow uses two CSV metadata files to define how raw CHMS variables are harmonized. These files are bundled with the package in `inst/extdata/` and are also available as data objects (`variables` and `variable_details`).

- **`variables.csv`** -- lists every harmonized variable with its name, label, type, and unit
- **`variable-details.csv`** -- defines the row-by-row recoding rules that `rec_with_table()` applies

This vignette is a column-by-column reference for both files. For an explanation of how these files fit into the harmonization workflow, see [Methodology](methodology.html).

## `variables.csv`

```{r, echo=FALSE}
cat(
  "There are", nrow(variables), "variables, grouped into",
  length(unique(variables$subject)), "subjects and",
  length(unique(variables$section)), "sections.\n"
)
```

```{r, echo=FALSE, warning=FALSE}
datatable(variables, filter = "top", options = list(pageLength = 5))
```

### Columns

**1. `variable`** -- the name of the harmonized variable.

**2. `label`** -- a short label for the variable.

**3. `labelLong`** -- a more detailed label for the variable.

**4. `section`** -- the broad grouping where this variable belongs (e.g., sociodemographics, health behaviour, health status).

**5. `subject`** -- the specific topic the variable pertains to (e.g., age, smoking, blood pressure).

**6. `variableType`** -- whether the harmonized variable is `Categorical` or `Continuous`.

**7. `units`** -- the units of the harmonized variable, or `N/A` if unitless.

**8. `databaseStart`** -- the CHMS cycles that contain the variable, separated by commas.

**9. `variableStart`** -- the source variable names as listed in each CHMS cycle. Uses the same formats as the `variableStart` column of `variable-details.csv` (see below).
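
Because `databaseStart` stores cycles as a comma-separated string, finding the variables available in one cycle takes a small amount of string handling. The sketch below shows one way to do it in base R. The table `toy_variables`, its variable names, and the helper `has_cycle()` are hypothetical and are not part of chmsflow.

```r
# Hypothetical two-row table with the same column names as `variables`.
toy_variables <- data.frame(
  variable = c("var_a", "var_b"),
  databaseStart = c("cycle1, cycle2", "cycle2, cycle3"),
  stringsAsFactors = FALSE
)

# TRUE for each row whose comma-separated databaseStart lists `cycle`.
has_cycle <- function(database_start, cycle) {
  vapply(
    strsplit(database_start, ",", fixed = TRUE),
    function(parts) cycle %in% trimws(parts),
    logical(1)
  )
}

in_cycle1 <- toy_variables$variable[has_cycle(toy_variables$databaseStart, "cycle1")]
in_cycle1 # "var_a"
```

Splitting and trimming before comparing avoids false matches that a plain substring search could produce when one cycle name is a prefix of another.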

## `variable-details.csv`

```{r, echo=FALSE}
cat(
  "There are", nrow(variable_details), "rows and", ncol(variable_details), "columns.\n"
)
datatable(variable_details, options = list(pageLength = 5))
```

### Row structure

Each row defines the recoding rule for one category of one variable. For example, a categorical variable with 4 categories has 7 rows: one per category, plus a not-applicable row, a missing row, and an `else` row.

Missing data rows use `haven::tagged_na()`:

- `NA::a` -- valid skip (not applicable)
- `NA::b` -- missing (don't know, refusal, not stated)

The `else` row catches values not matched by any other row.
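
As a minimal sketch of what the two tags mean in practice, the following uses `haven` directly, assuming that `NA::a` and `NA::b` correspond to the tags `"a"` and `"b"`. The vector `x` is made up for illustration.

```r
library(haven)

# Two ordinary values followed by two tagged missing values
# (assumed to mirror NA::a and NA::b).
x <- c(1, 2, tagged_na("a"), tagged_na("b"))

is.na(x)             # FALSE FALSE TRUE TRUE: base R sees plain NAs
na_tag(x)            # NA NA "a" "b": the tag records the kind of missing
is_tagged_na(x, "a") # FALSE FALSE TRUE FALSE: only the "not applicable" entry
```

Tagged missing values behave like `NA` in ordinary computations, so the distinction between a valid skip and a true missing value survives recoding without affecting summaries that drop `NA`.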

### Columns

We use `clc_sex` as a running example.

**1. `variable`** -- name of the harmonized variable.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", 1], col.names = "variable")
```

**2. `dummyVariable`** -- dummy variable name for each category (categorical variables only; `N/A` for continuous).

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:2)])
```

**3. `typeEnd`** -- variable type of the harmonized variable (`cat` or `cont`).

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:3)])
```

**4. `databaseStart`** -- CHMS cycles containing this variable, separated by commas.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:4)])
```

**5. `variableStart`** -- source variable names in each CHMS cycle. Supports several formats:

| Format | Meaning | Example |
|--------|---------|---------|
| `[variable_name]` | Same name across all cycles | `[clc_sex]` |
| `cycle1::name1, [default_name]` | Cycle-specific exception with a default | `cycle1::amsdmva1, [ammdmva1]` |
| `DerivedVar::[var1, var2, ...]` | Computed by a function from listed inputs | `DerivedVar::[lab_bcre, pgdcgt, clc_sex, clc_age]` |

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:5)])
```
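
The three `variableStart` formats in the table above can be read mechanically. The sketch below is one possible base R reading of them, not the parser that `rec_with_table()` uses. The helper name `resolve_source()` is hypothetical.

```r
# Return the source variable name(s) that apply to `cycle`.
resolve_source <- function(variable_start, cycle) {
  # Derived variables list their inputs inside DerivedVar::[...].
  if (startsWith(variable_start, "DerivedVar::")) {
    inner <- sub("^DerivedVar::\\[(.*)\\]$", "\\1", variable_start)
    return(trimws(strsplit(inner, ",", fixed = TRUE)[[1]]))
  }

  parts <- trimws(strsplit(variable_start, ",", fixed = TRUE)[[1]])

  # A cycle-specific entry looks like "cycle1::name" and takes priority.
  prefix <- paste0(cycle, "::")
  specific <- parts[startsWith(parts, prefix)]
  if (length(specific) > 0) {
    return(substring(specific[1], nchar(prefix) + 1))
  }

  # Otherwise fall back to the bracketed default, e.g. "[name]".
  default <- parts[grepl("^\\[.*\\]$", parts)]
  if (length(default) == 0) {
    return(NA_character_)
  }
  gsub("^\\[|\\]$", "", default[1])
}

resolve_source("[clc_sex]", "cycle1")                    # "clc_sex"
resolve_source("cycle1::amsdmva1, [ammdmva1]", "cycle1") # "amsdmva1"
resolve_source("cycle1::amsdmva1, [ammdmva1]", "cycle2") # "ammdmva1"
resolve_source("DerivedVar::[lab_bcre, pgdcgt, clc_sex, clc_age]", "cycle1")
```

The last call returns the four input names as a character vector, since a derived variable has no single source column.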

**6. `typeStart`** -- variable type in the source CHMS data (`cat` or `cont`).

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:6)])
```

**7. `recEnd`** -- the value to recode each category to. Special values:

- `copy` -- pass through unchanged (for continuous variables)
- `NA::a` -- not applicable
- `NA::b` -- missing
- `Func::function_name` -- derived variable computed by the named function

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:7)])
```
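
To make the special `recEnd` values concrete, the sketch below sorts a few example strings into the four kinds listed above. The helper `rec_end_kind()` and the function name `some_function` are hypothetical. chmsflow may interpret these values differently internally.

```r
# Classify each recEnd string as a literal value, a copy instruction,
# a tagged missing value, or a derived-variable function.
rec_end_kind <- function(rec_end) {
  kind <- rep("value", length(rec_end))
  kind[rec_end == "copy"] <- "copy"
  kind[startsWith(rec_end, "NA::")] <- "missing"
  kind[startsWith(rec_end, "Func::")] <- "function"
  kind
}

rec_end <- c("1", "copy", "NA::a", "NA::b", "Func::some_function")
kinds <- rec_end_kind(rec_end)
kinds # "value" "copy" "missing" "missing" "function"

# The part after the prefix carries the tag letter or the function name.
tags <- sub("^NA::", "", rec_end[kinds == "missing"])        # "a" "b"
fun_names <- sub("^Func::", "", rec_end[kinds == "function"]) # "some_function"
```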

**8. `numValidCat`** -- number of non-missing categories (categorical only; `N/A` for continuous). Not used by `rec_with_table()`.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:8)])
```

**9. `catLabel`** -- short label for the category.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:9)])
```

**10. `catLabelLong`** -- detailed label, matching CHMS documentation where possible.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:10)])
```

**11. `units`** -- units of the variable, or `N/A`. Must be consistent across all rows of the same variable.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:11)])
```

**12. `recStart`** -- the source value or range to match. Uses [interval notation](https://en.wikipedia.org/wiki/Interval_(mathematics)#Notations_for_intervals):

- `[1, 4]` -- all values from 1 to 4, endpoints included (for integer-coded categories: 1, 2, 3, and 4)
- `[1, 2.5]` -- all values from 1 to 2.5 (2.55 would not match)
- `else` -- all values not matched by other rows
- `else` with `copy` in `recEnd` -- unmatched values are copied unchanged

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:12)])
```
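
The matching rules above can be sketched in a few lines of base R. This is an illustration of closed-interval matching with an `else` fallback, not the implementation behind `rec_with_table()`. The helpers `parse_interval()`, `in_interval()`, and `recode_toy()` and the `rules` table are hypothetical, and the result is kept as the raw `recEnd` strings for readability.

```r
# Parse "[lo, hi]" into numeric bounds; NULL for anything else (e.g. "else").
parse_interval <- function(rec_start) {
  if (!grepl("^\\[.*,.*\\]$", rec_start)) {
    return(NULL)
  }
  inner <- gsub("^\\[|\\]$", "", rec_start)
  as.numeric(trimws(strsplit(inner, ",", fixed = TRUE)[[1]]))
}

# TRUE where x falls inside the closed interval.
in_interval <- function(x, rec_start) {
  bounds <- parse_interval(rec_start)
  if (is.null(bounds)) stop("not an interval: ", rec_start)
  x >= bounds[1] & x <= bounds[2]
}

in_interval(c(1, 2.5, 2.55, 4), "[1, 2.5]") # TRUE TRUE FALSE FALSE

# Hypothetical rules: three interval rows and an else row.
rules <- data.frame(
  recStart = c("[1, 1]", "[2, 2]", "[6, 6]", "else"),
  recEnd = c("1", "2", "NA::a", "NA::b"),
  stringsAsFactors = FALSE
)

# Apply interval rows in order; whatever is left goes to the else row.
recode_toy <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  matched <- rep(FALSE, length(x))
  for (i in seq_len(nrow(rules))) {
    if (rules$recStart[i] == "else") next
    hit <- !matched & !is.na(x) & in_interval(x, rules$recStart[i])
    out[hit] <- rules$recEnd[i]
    matched <- matched | hit
  }
  else_end <- rules$recEnd[rules$recStart == "else"]
  if (length(else_end) == 1) out[!matched] <- else_end
  out
}

recode_toy(c(1, 2, 6, 9), rules) # "1" "2" "NA::a" "NA::b"
```

The value 9 matches no interval, so it falls through to the `else` row.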

**13. `catStartLabel`** -- label for the source category, matching CHMS documentation. For missing rows, describes each missing code and its value.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:13)])
```

**14. `variableStartShortLabel`** -- short label for the source variable.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:14)])
```

**15. `variableStartLabel`** -- detailed label for the source variable, matching CHMS documentation.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:15)])
```

**16. `notes`** -- relevant notes about changes between CHMS cycles, missing categories, or variable type changes.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:16)])
```

### Derived variables

Derived variables use two special column values:

- **`variableStart`**: `DerivedVar::[var1, var2, var3]` -- lists the input variables
- **`recEnd`**: `Func::function_name` -- names the R function that computes the derived variable

See [Derived variables](derived_variables.html) for details on how derived variables work.
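
As a rough illustration of how the two values fit together, the sketch below extracts the input names and the function name from a metadata row and applies the function to the matching columns. Everything here is hypothetical: `toy_ratio()`, the column names, and the data are invented, and passing inputs positionally in the listed order is an assumption rather than documented chmsflow behaviour.

```r
# Hypothetical derived-variable function; not part of chmsflow.
toy_ratio <- function(numerator, denominator) numerator / denominator

# Hypothetical metadata values using the two special formats.
variable_start <- "DerivedVar::[toy_num, toy_den]"
rec_end <- "Func::toy_ratio"

# Extract the input variable names and the function name.
inner <- sub("^DerivedVar::\\[(.*)\\]$", "\\1", variable_start)
inputs <- trimws(strsplit(inner, ",", fixed = TRUE)[[1]])
fun_name <- sub("^Func::", "", rec_end)

# Invented data with one column per input.
toy_data <- data.frame(toy_num = c(10, 9), toy_den = c(4, 3))

# Look the function up by name and pass the input columns in listed order.
derived <- do.call(
  get(fun_name, mode = "function"),
  unname(as.list(toy_data[inputs]))
)
derived # 2.5 3.0
```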

## Next steps

- **See it in action** -- Follow the [Analysis walkthrough](analysis_walkthrough.html) to see how these metadata files drive a real analysis.
- **Understand the methodology** -- For the design rationale behind the rules-as-data approach, see [Methodology](methodology.html).
- **Add your own variables** -- To extend the schema with custom variables, see [How to add variables](how_to_add_variables.html).