---
title: "How to add variables to chmsflow"
format: html
vignette: >
  %\VignetteIndexEntry{How to add variables to chmsflow}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction 

This vignette explains how you can add variables to the _chmsflow_ package. There are two types of variables that can be added:

1. Existing CHMS variables to be harmonized across cycles.
2. Derived variables based on harmonized CHMS cycles.

## How to add existing CHMS variables to chmsflow

When adding variables that already exist across CHMS cycles, there are two worksheets that need to be specified:

1. `variable_details.csv`: This worksheet maps variables across CHMS cycles.
2. `variables.csv`: This worksheet lists all the variables that exist in _chmsflow_.

### Example of an existing CHMS variable: Age

This example shows how the existing CHMS age variable was developed using `variable_details.csv` and `variables.csv`. **Note:** this variable is different from the derived age variable that is also included in _chmsflow_. For this article, a sample `variable_details.csv` and `variables.csv` are loaded to demonstrate how to add variables.

```{r}
variables <- read.csv(system.file("extdata", "variables.csv", package = "chmsflow"))
variable_details <- read.csv(system.file("extdata", "variable-details.csv", package = "chmsflow"))
```
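
The tables in this vignette select rows by position (for example `275:278`), which depends on the current row order of the worksheet. As a minimal sketch, the rows for one variable can also be selected by name. The data frame below is a hypothetical stand-in for `variable_details`, and the `clc_sex` row is only a placeholder.

```{r}
# Hypothetical stand-in for variable_details with only two columns
details_sketch <- data.frame(
  variable = c("clc_age", "clc_age", "clc_sex"),
  recStart = c("[3,80]", "else", "1"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose `variable` column matches the name of interest
age_lookup_sketch <- details_sketch[details_sketch$variable == "clc_age", ]
nrow(age_lookup_sketch)
```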

### Specifying the variable on `variable_details.csv`

* For this variable, there are four rows: one for the continuous "category", one for not applicable, one for missing, and one for else. Variable categories are often coded differently between CHMS cycles. Because the overall variable structure remains intact, extra rows can be added so that all values feed into the newly transformed variable.

### Columns

1. **variable:** the most common variable name for age is `clc_age`. This should be written for each row.

```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(knitr)
library(kableExtra)
kable(variable_details[275:278, 1], col.names = "variable")
```

2. **dummyVariable:** age is a continuous variable, so it does not have dummy variables.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:2])
```

3. **typeEnd:** age was captured in the CHMS as a continuous variable. It does not make much sense to transform it into a categorical variable, so typeEnd should be `cont` in each row of age.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:3])
```

4. **databaseStart:** age was captured in all CHMS surveys between cycles 1--6, so in the first row with the continuous "category" and the else row, the CHMS identifiers are listed in this column:

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[c(275), 1:4])
```

5. **variableStart:** From cycles 1--6, the age variable is the same as the common name. Therefore for all the rows, the variableStart column will look like this:
      
```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:5])
```
    
6. **typeStart:** As mentioned previously, age was measured as a continuous variable in the CHMS, so typeStart should be `cont` in each row of age.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:6])
```

7. **recEnd:** Since this is a continuous variable, the first row (the main "category") has `copy` written. For the not applicable rows `NA::a` is written. For the missing and else rows `NA::b` is written. The `haven` package is used for tagging NA in numeric variables.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:7])
```

8. **numValidCat:** Since this is a continuous variable, there are no actual categories; so `N/A` is written in each row.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:8])
```

9. **catLabel:** For the first row, `age` is written. For the not applicable row, `not applicable` is written. For the missing row, `missing` is written. For the else row, `else` is written.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:9])
```

10. **catLabelLong:** For the first row, a longer label is written to give further detail on what age is. The other rows remain the same.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:10])
```

11. **units:** age is measured in years, so `years` is written in each row. 

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:11])
```

12. **recStart:** The CHMS data documentation for cycles 1--6 shows that the lowest age value was 3 and the highest was 80, so recStart for the first row is `[3,80]`. Not applicable was coded as 996, so recStart for that row is `[996]`. Similarly, don't know was coded as 997, refusal as 998, and not stated as 999, so recStart for the missing row is `[997,999]`. For the else row, write `else`.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:12])
```

13. **catStartLabel:** For the first row, `Years` is written, as it appears in the CHMS documentation. The other rows remain the same, and the values for each missing category are stated in the missing rows.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:13])
```

14. **variableStartShortLabel:** Writing `Age` for each row is sufficient for this variable.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:14])
```

15. **variableStartLabel:** As per CHMS documentation, the label for this variable is `Age at clinic visit`.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, 1:15])
```

16. **notes:** Notes are used to identify issues that may be relevant when transforming the variable or category. There are no known issues regarding age.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[275:278, ])
```
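
To make the recStart and recEnd columns concrete, the sketch below writes the four age rules as a small data frame and applies them by hand to a few sample values. The helper `recode_age_sketch()` and the sample values are hypothetical, and this is not how the package itself performs the recoding. It only illustrates what `copy`, `NA::a`, and `NA::b` mean, assuming the `haven` package is installed.

```{r}
# The four recoding rules for age described above (a subset of the columns)
age_rules_sketch <- data.frame(
  recStart = c("[3,80]", "[996]", "[997,999]", "else"),
  recEnd   = c("copy", "NA::a", "NA::b", "NA::b"),
  catLabel = c("age", "not applicable", "missing", "else"),
  stringsAsFactors = FALSE
)

# Hypothetical helper that applies the same four rules to a numeric vector
recode_age_sketch <- function(x) {
  # else rule: anything not matched below becomes NA::b
  out <- rep(haven::tagged_na("b"), length(x))
  # [3,80] with recEnd `copy`: keep the original value
  in_range <- !is.na(x) & x >= 3 & x <= 80
  out[in_range] <- x[in_range]
  # [996] with recEnd NA::a: not applicable
  out[!is.na(x) & x == 996] <- haven::tagged_na("a")
  # [997,999] with recEnd NA::b: don't know, refusal, not stated
  out[!is.na(x) & x >= 997 & x <= 999] <- haven::tagged_na("b")
  out
}

raw_age_sketch <- c(45, 996, 998, 150)
recoded_age_sketch <- recode_age_sketch(raw_age_sketch)
# na_tag() returns the tag letter for tagged NAs and NA for ordinary values
haven::na_tag(recoded_age_sketch)
```

Here 45 is copied as is, 996 is tagged `a`, and both 998 and the out-of-range value 150 are tagged `b`.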

### Specifying the variable on `variables.csv`

Once mapped and specified on `variable_details.csv`, the age variable can now be specified on `variables.csv`.

```{r, echo=FALSE, warning=FALSE}
library(knitr)
library(kableExtra)
kable(variables[1, ])
```

## How to create derived variables and add them to chmsflow

Derived variables require one step before they are specified on `variable_details.csv` and `variables.csv`: writing a custom function that creates the derived variable from existing CHMS variables. The function follows this general template:

```
# Template: list the arguments in the same order as the variables in variableStart
custom_function_name <- function(var_1, var_2) {
  # Placeholder: replace with code that generates a single output from the inputs
  output_var <- var_1 + var_2

  return(output_var)
}
```
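
As a minimal sketch of the template, the function below derives a flag from the harmonized age variable `clc_age`. The function name, the 65-year cut-off, and the 1/2 coding are hypothetical and are not part of _chmsflow_.

```{r}
# Hypothetical derived variable: 1 = aged 65 or older, 2 = younger than 65
derive_senior_flag_sketch <- function(clc_age) {
  # Missing ages stay missing; otherwise compare against the cut-off
  senior_flag <- ifelse(is.na(clc_age), NA_real_, ifelse(clc_age >= 65, 1, 2))
  return(senior_flag)
}

senior_flag_sketch <- derive_senior_flag_sketch(c(30, 65, 80, NA))
senior_flag_sketch
```

The function takes one argument per input variable and returns one value per respondent, which is the shape the template calls for.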

### Example of a derived variable: Smoking pack-years

Pack-years is a complex derived variable often used by researchers to quantify the amount of cigarette use over a period of time. Despite its complexity, it can be calculated from variables available in the CHMS. This derived variable incorporates numerous CHMS smoking variables, along with age.

### Step 1. Creating a derived function

With complex derived variables, the function computes the output from multiple input variables using clinical or epidemiological logic. For pack-years, the function uses smoking status (`smkdsty`), age, and several smoking history variables to calculate cumulative cigarette exposure.

The current implementation in chmsflow uses `dplyr::case_when()` for clarity:

```{r, warning=FALSE}
calculate_pack_years <- function(smkdsty, clc_age, smk_54, smk_52, smk_31, smk_41, smk_53, smk_42, smk_21, smk_11) {
  pack_years <- dplyr::case_when(
    # Age: valid skip
    clc_age == 96 ~ haven::tagged_na("a"),
    # Age: don't know, refusal, not stated
    clc_age < 0 | clc_age %in% 97:99 ~ haven::tagged_na("b"),

    # Pack-years by smoking status
    smkdsty == 1 ~ pmax(((clc_age - smk_52) * (smk_31 / 20)), 0.0137),
    smkdsty == 2 ~ pmax(((clc_age - smk_52 - (clc_age - smk_54)) * (smk_53 / 20)), 0.0137) +
      ((pmax((smk_41 * smk_42 / 30), 1) / 20) * (clc_age - smk_54)),
    smkdsty == 3 ~ (pmax((smk_41 * smk_42 / 30), 1) / 20) * (clc_age - smk_21),
    smkdsty == 4 ~ pmax(((smk_54 - smk_52) * (smk_53 / 20)), 0.0137),
    smkdsty == 5 & smk_11 == 1 ~ 0.0137,
    smkdsty == 5 & smk_11 == 2 ~ 0.007,
    smkdsty == 6 ~ 0,

    # Smoking status: valid skip
    smkdsty == 96 ~ haven::tagged_na("a"),
    # Smoking status: don't know, refusal, not stated
    smkdsty %in% 97:99 ~ haven::tagged_na("b"),
    .default = haven::tagged_na("b")
  )
  return(pack_years)
}
```

More information on what each smoking variable means can be found in the [Reference](../reference/calculate_pack_years.html) section.
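
As a worked example of a single branch, the sketch below reproduces the `smkdsty == 1` formula from the function above on two made-up respondents. Reading `smk_52` as the age at which daily smoking started and `smk_31` as cigarettes smoked per day is an assumption inferred from the formula; the reference page is the authority on variable meanings.

```{r}
# Made-up inputs for two respondents (hypothetical values)
clc_age_sketch <- c(50, 20)
smk_52_sketch  <- c(20, 20)  # assumed: age started smoking daily
smk_31_sketch  <- c(10, 10)  # assumed: cigarettes smoked per day

# Same expression as the smkdsty == 1 branch: years smoked times packs per day,
# with a floor of 0.0137 pack-years
daily_pack_years_sketch <- pmax(
  (clc_age_sketch - smk_52_sketch) * (smk_31_sketch / 20),
  0.0137
)
daily_pack_years_sketch
```

The first respondent smoked half a pack a day for 30 years, giving 15 pack-years. The second has zero years of smoking, so `pmax()` raises the result to the 0.0137 floor.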

### Steps 2 and 3. Specifying pack-years in `variable_details.csv` and `variables.csv`

This is how the `variable_details.csv` sheet would look for the derived pack-years row:

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[953, ])
```

And this is how the `variables.csv` sheet would look for the derived pack-years row:

```{r, echo=FALSE, warning=FALSE}
kable(variables[189, ])
```

### Adding labels to a derived variable

For a continuous derived variable like pack-years, the labels specified in `variables.csv` are sufficient for the variable to be properly labelled. For categorical derived variables, extra rows need to be added to `variable_details.csv` so that labels are generated for each category. The example below shows how `diab_status`, a derived categorical variable flagging respondents who have diabetes based on more inclusive factors, is specified in `variable_details.csv`.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[308:311, ])
```

As you can see, the first row for `diab_status` specifies the function for the derived variable and the base variables included. The second and third rows specify the categories of the variable, which are then labelled.

### Creating a derived variable using derived variables

It is possible to create a derived variable from other derived variables. When creating the custom function for it, use the derived variable name inside the function. Similarly, when specifying the variable in `variable_details.csv` and `variables.csv`, use the derived variable in the **variableStart** column. The example below shows how `diab_status`, which uses the derived diabetes drug variable, is specified in `variable_details.csv` and `variables.csv`.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[308:311, ])
kable(variables[80, ])
```

## Next steps

- **Understand the schema** -- For a column-by-column reference of `variables.csv` and `variable-details.csv`, see [Variable schema reference](variables_and_variable_details.html).
- **See derived variables in context** -- Learn how `Func::` and `DerivedVar::` entries are used in [Derived variables](derived_variables.html).
- **Contribute to chmsflow** -- See the [contributing guide](https://github.com/Big-Life-Lab/chmsflow/blob/dev/CONTRIBUTING.md) for how to submit your additions.