2 Introduction
- 2.1 Create a mock cdm
3 Summarise clinical tables
- 3.1 Tidy the summarised object
4 Summarise record counts

2 Introduction

In this vignette, we will explore the OmopSketch functions designed to provide an overview of the clinical tables within a CDM object (observation_period, visit_occurrence, condition_occurrence, drug_exposure, procedure_occurrence, device_exposure, measurement, observation, and death). Specifically, five functions facilitate this:

summariseClinicalRecords() and tableClinicalRecords(): Use them to create summary statistics with key information about a clinical table (e.g., number of records, number of concepts mapped, etc.)
summariseRecordCount(), plotRecordCount() and tableRecordCount(): Use them to summarise the number of records within specific time intervals.

2.1 Create a mock cdm

Let’s see an example of their functionality. To start, we will load the essential packages and create a mock cdm using the mockOmopSketch() function.

library(dplyr)
library(OmopSketch)

# Connect to mock database
cdm <- mockOmopSketch()

3 Summarise clinical tables

Let’s now use summariseClinicalRecords() from the OmopSketch package to get an overview of one of the clinical tables of the cdm (i.e., condition_occurrence).

summarisedResult <- summariseClinicalRecords(cdm, "condition_occurrence")
#> ℹ Adding variables of interest to condition_occurrence.
#> ℹ Summarising records per person in condition_occurrence.
#> ℹ Summarising condition_occurrence: `in_observation`, `standard_concept`,
#>   `source_vocabulary`, `domain_id`, and `type_concept`.

summarisedResult |> print()
#> # A tibble: 20 × 13
#>    result_id cdm_name       group_name group_level      strata_name strata_level
#>        <int> <chr>          <chr>      <chr>            <chr>       <chr>       
#>  1         1 mockOmopSketch omop_table condition_occur… overall     overall     
#>  2         1 mockOmopSketch omop_table condition_occur… overall     overall     
#>  3         1 mockOmopSketch omop_table condition_occur… overall     overall     
#>  4         1 mockOmopSketch omop_table condition_occur… overall     overall     
#>  5         1 mockOmopSketch omop_table condition_occur… overall     overall     
#>  6         1 mockOmopSketch omop_table condition_occur… overall     overall     
#>  7         1 mockOmopSketch omop_table condition_occur… overall     overall     
#>  8         1 mockOmopSketch omop_table condition_occur… overall     overall     
#>  9         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> 10         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> 11         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> 12         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> 13         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> 14         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> 15         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> 16         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> 17         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> 18         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> 19         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> 20         1 mockOmopSketch omop_table condition_occur… overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

Notice that the output is in the summarised result format.

We can use the function's arguments to specify which statistics to compute. For example, use the argument recordsPerPerson to indicate which estimates you are interested in for the number of records per person.

summarisedResult <- summariseClinicalRecords(cdm,
  "condition_occurrence",
  recordsPerPerson = c("mean", "sd", "q05", "q95")
)
#> ℹ Adding variables of interest to condition_occurrence.
#> ℹ Summarising records per person in condition_occurrence.
#> ℹ Summarising condition_occurrence: `in_observation`, `standard_concept`,
#>   `source_vocabulary`, `domain_id`, and `type_concept`.

summarisedResult |>
  filter(variable_name == "records_per_person") |>
  select(variable_name, estimate_name, estimate_value)
#> # A tibble: 4 × 3
#>   variable_name      estimate_name estimate_value
#>   <chr>              <chr>         <chr>         
#> 1 records_per_person mean          84            
#> 2 records_per_person q05           70            
#> 3 records_per_person q95           98            
#> 4 records_per_person sd            8.9736
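
To make these estimates concrete, here is a minimal base R sketch that computes the same four statistics by hand. The person_id vector is made-up toy data, and the sketch only counts people who appear in the table at least once; it is an illustration of what the estimates mean, not OmopSketch's internal implementation.

```r
# Toy stand-in for the person_id column of a clinical table (hypothetical data)
person_id <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4)

# Number of records contributed by each person
records_per_person <- as.integer(table(person_id))

# The same estimates requested through recordsPerPerson above
estimates <- c(
  mean = mean(records_per_person),
  sd   = sd(records_per_person),
  q05  = unname(quantile(records_per_person, 0.05)),
  q95  = unname(quantile(records_per_person, 0.95))
)
estimates
```

Here quantile() uses R's default interpolation; the quantile definition used by the package is not stated in this vignette, so small differences are possible.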

You can further specify whether to include the number of records in observation (inObservation = TRUE), the number of concepts mapped (standardConcept = TRUE), the source vocabularies the table contains (sourceVocabulary = TRUE), the domains of its concepts (domainId = TRUE), or the concept types (typeConcept = TRUE).

summarisedResult <- summariseClinicalRecords(cdm,
  "condition_occurrence",
  recordsPerPerson = c("mean", "sd", "q05", "q95"),
  inObservation = TRUE,
  standardConcept = TRUE,
  sourceVocabulary = TRUE,
  domainId = TRUE,
  typeConcept = TRUE
)
#> ℹ Adding variables of interest to condition_occurrence.
#> ℹ Summarising records per person in condition_occurrence.
#> ℹ Summarising condition_occurrence: `in_observation`, `standard_concept`,
#>   `source_vocabulary`, `domain_id`, and `type_concept`.

summarisedResult |>
  select(variable_name, estimate_name, estimate_value) |>
  glimpse()
#> Rows: 17
#> Columns: 3
#> $ variable_name  <chr> "Number subjects", "Number subjects", "Number records",…
#> $ estimate_name  <chr> "count", "percentage", "count", "mean", "q05", "q95", "…
#> $ estimate_value <chr> "100", "100", "8400", "84", "70", "98", "8.9736", "8400…
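
Note that estimate_value is stored as character, so it has to be converted before doing arithmetic. The sketch below uses a small hand-built data frame (not a real result object) with three values copied from the output above, and checks that the mean number of records per person equals the number of records divided by the number of subjects.

```r
# Toy rows mimicking the long layout of the output above (values copied from it)
toy <- data.frame(
  variable_name  = c("Number subjects", "Number records", "records_per_person"),
  estimate_name  = c("count", "count", "mean"),
  estimate_value = c("100", "8400", "84"),
  stringsAsFactors = FALSE
)

# estimate_value is character, so convert it before computing with it
toy$estimate_numeric <- as.numeric(toy$estimate_value)

n_subjects   <- toy$estimate_numeric[toy$variable_name == "Number subjects"]
n_records    <- toy$estimate_numeric[toy$variable_name == "Number records"]
mean_records <- toy$estimate_numeric[toy$variable_name == "records_per_person"]

# 8400 records over 100 subjects gives the reported mean of 84
n_records / n_subjects == mean_records
```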

You can also stratify the previous results by sex and age group:

summarisedResult <- summariseClinicalRecords(cdm,
  "condition_occurrence",
  recordsPerPerson = c("mean", "sd", "q05", "q95"),
  inObservation = TRUE,
  standardConcept = TRUE,
  sourceVocabulary = TRUE,
  domainId = TRUE,
  typeConcept = TRUE,
  sex = TRUE,
  ageGroup = list("<35" = c(0, 34), ">=35" = c(35, Inf))
)
#> ℹ Adding variables of interest to condition_occurrence.
#> ℹ Summarising records per person in condition_occurrence.
#> ℹ Summarising condition_occurrence: `in_observation`, `standard_concept`,
#>   `source_vocabulary`, `domain_id`, and `type_concept`.

summarisedResult |>
  select(variable_name, strata_level, estimate_name, estimate_value) |>
  glimpse()
#> Rows: 153
#> Columns: 4
#> $ variable_name  <chr> "Number subjects", "Number subjects", "Number records",…
#> $ strata_level   <chr> "overall", "overall", "overall", "overall", "overall", …
#> $ estimate_name  <chr> "count", "percentage", "count", "mean", "q05", "q95", "…
#> $ estimate_value <chr> "100", "100", "8400", "84", "70", "98.0500", "8.9736", …

Notice that, by default, the “overall” group is also included, as well as the crossed strata (for example, sex == “Female” and ageGroup == “>=35”).
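
As a rough illustration of how such strata arise, the base R sketch below assigns made-up people to the two age groups and counts them overall, by sex, by age group, and by the crossed combinations. The helpers assign_age_group() and count_by(), the inclusive bounds, and the " / " label separator are assumptions made for this example; they are not taken from the package.

```r
# Age groups as passed to ageGroup above: each entry is c(lower, upper)
age_groups <- list("<35" = c(0, 34), ">=35" = c(35, Inf))

# Hypothetical helper: label each age with the group whose bounds contain it
assign_age_group <- function(age, groups) {
  out <- rep(NA_character_, length(age))
  for (nm in names(groups)) {
    bounds <- groups[[nm]]
    out[age >= bounds[1] & age <= bounds[2]] <- nm
  }
  out
}

# Made-up people
people <- data.frame(
  sex = c("Female", "Male", "Female", "Male", "Female"),
  age = c(20, 34, 35, 60, 71),
  stringsAsFactors = FALSE
)
people$age_group <- assign_age_group(people$age, age_groups)

# Hypothetical helper: counts per level as a named integer vector
count_by <- function(x) {
  tab <- table(x)
  setNames(as.integer(tab), names(tab))
}

# Overall, single strata, and crossed strata
strata_counts <- c(
  overall = nrow(people),
  count_by(people$sex),
  count_by(people$age_group),
  count_by(paste(people$age_group, people$sex, sep = " / "))
)
strata_counts
```

With two sexes and two age groups this gives nine strata levels: one overall, two per single variable, and four crossed.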

The analysis can also be conducted for multiple OMOP tables at the same time:

summarisedResult <- summariseClinicalRecords(cdm,
  c("observation_period", "drug_exposure"),
  recordsPerPerson = c("mean", "sd"),
  inObservation = FALSE,
  standardConcept = FALSE,
  sourceVocabulary = FALSE,
  domainId = FALSE,
  typeConcept = FALSE
)
#> ℹ Adding variables of interest to observation_period.
#> ℹ Summarising records per person in observation_period.
#> ℹ Adding variables of interest to drug_exposure.
#> ℹ Summarising records per person in drug_exposure.

summarisedResult |>
  select(group_level, variable_name, estimate_name, estimate_value) |>
  glimpse()
#> Rows: 10
#> Columns: 4
#> $ group_level    <chr> "observation_period", "observation_period", "observatio…
#> $ variable_name  <chr> "Number subjects", "Number subjects", "Number records",…
#> $ estimate_name  <chr> "count", "percentage", "count", "mean", "sd", "count", …
#> $ estimate_value <chr> "100", "100", "100", "1", "0", "100", "100", "21600", "…

We can also filter the clinical table to a specific time window by setting the dateRange argument.

summarisedResult <- summariseClinicalRecords(cdm,
  "drug_exposure",
  dateRange = as.Date(c("1990-01-01", "2010-01-01"))
)
#> ℹ Adding variables of interest to drug_exposure.
#> ℹ Summarising records per person in drug_exposure.
#> ℹ Summarising drug_exposure: `in_observation`, `standard_concept`,
#>   `source_vocabulary`, `domain_id`, and `type_concept`.

summarisedResult |>
  omopgenerics::settings() |>
  glimpse()
#> Rows: 1
#> Columns: 10
#> $ result_id          <int> 1
#> $ result_type        <chr> "summarise_clinical_records"
#> $ package_name       <chr> "OmopSketch"
#> $ package_version    <chr> "0.5.1"
#> $ group              <chr> "omop_table"
#> $ strata             <chr> ""
#> $ additional         <chr> ""
#> $ min_cell_count     <chr> "0"
#> $ study_period_end   <chr> "2010-01-01"
#> $ study_period_start <chr> "1990-01-01"
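
Conceptually, a date range restricts the table to records that fall inside the window. The base R sketch below shows this on made-up drug_exposure rows. It assumes that both bounds are inclusive and that the record start date is the date being compared; neither detail is confirmed by this vignette.

```r
# Same window as in the call above
date_range <- as.Date(c("1990-01-01", "2010-01-01"))

# Made-up records with a start date each
records <- data.frame(
  drug_exposure_id = 1:5,
  drug_exposure_start_date = as.Date(c(
    "1985-06-30", "1990-01-01", "2001-03-15", "2010-01-01", "2015-09-09"
  ))
)

# Keep records whose start date lies inside the window (assumed inclusive)
in_window <- records$drug_exposure_start_date >= date_range[1] &
  records$drug_exposure_start_date <= date_range[2]
kept <- records[in_window, ]
kept$drug_exposure_id
```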

3.1 Tidy the summarised object

tableClinicalRecords() will help you to tidy the previous results and create a gt table.

summarisedResult <- summariseClinicalRecords(cdm,
  "condition_occurrence",
  recordsPerPerson = c("mean", "sd", "q05", "q95"),
  inObservation = TRUE,
  standardConcept = TRUE,
  sourceVocabulary = TRUE,
  domainId = TRUE,
  typeConcept = TRUE,
  sex = TRUE
)
#> ℹ Adding variables of interest to condition_occurrence.
#> ℹ Summarising records per person in condition_occurrence.
#> ℹ Summarising condition_occurrence: `in_observation`, `standard_concept`,
#>   `source_vocabulary`, `domain_id`, and `type_concept`.

summarisedResult |>
  tableClinicalRecords()

Variable name	Variable level	Estimate name	Database name
Variable name	Variable level	Estimate name	mockOmopSketch
condition_occurrence; overall
Number records	-	N	8,400.00
Number subjects	-	N (%)	100 (100.00%)
Records per person	-	Mean (SD)	84.00 (8.97)
		q05	70.00
		q95	98.05
In observation	Yes	N (%)	8,400 (100.00%)
Domain	Condition	N (%)	8,400 (100.00%)
Source vocabulary	No matching concept	N (%)	8,400 (100.00%)
Standard concept	S	N (%)	8,400 (100.00%)
Type concept id	Unknown type concept: 1	N (%)	8,400 (100.00%)
condition_occurrence; Female
Number records	-	N	4,424.00
Number subjects	-	N (%)	52 (100.00%)
Records per person	-	Mean (SD)	85.08 (8.33)
		q05	71.55
		q95	98.45
In observation	Yes	N (%)	4,424 (100.00%)
Domain	Condition	N (%)	4,424 (100.00%)
Source vocabulary	No matching concept	N (%)	4,424 (100.00%)
Standard concept	S	N (%)	4,424 (100.00%)
Type concept id	Unknown type concept: 1	N (%)	4,424 (100.00%)
condition_occurrence; Male
Number records	-	N	3,976.00
Number subjects	-	N (%)	48 (100.00%)
Records per person	-	Mean (SD)	82.83 (9.57)
		q05	70.00
		q95	96.65
In observation	Yes	N (%)	3,976 (100.00%)
Domain	Condition	N (%)	3,976 (100.00%)
Source vocabulary	No matching concept	N (%)	3,976 (100.00%)
Standard concept	S	N (%)	3,976 (100.00%)
Type concept id	Unknown type concept: 1	N (%)	3,976 (100.00%)

4 Summarise record counts

OmopSketch can also help you to summarise the trend of the records of an OMOP table. See the example below, where we use summariseRecordCount() to count the number of records within each year, and then, we use plotRecordCount() to create a ggplot with the trend. We can also use tableRecordCount() to display results in a table of type gt, reactable or datatable. By default it creates a gt table.

summarisedResult <- summariseRecordCount(cdm, "drug_exposure", interval = "years")

summarisedResult |> tableRecordCount(type = "gt")

	Time interval	mockOmopSketch
	Time interval	Number records
drug_exposure	1951-01-01 to 1951-12-31	11
	1952-01-01 to 1952-12-31	7
	1953-01-01 to 1953-12-31	19
	1954-01-01 to 1954-12-31	19
	1955-01-01 to 1955-12-31	50
	1956-01-01 to 1956-12-31	45
	1957-01-01 to 1957-12-31	68
	1958-01-01 to 1958-12-31	75
	1959-01-01 to 1959-12-31	91
	1960-01-01 to 1960-12-31	92
	1961-01-01 to 1961-12-31	111
	1962-01-01 to 1962-12-31	99
	1963-01-01 to 1963-12-31	92
	1964-01-01 to 1964-12-31	108
	1965-01-01 to 1965-12-31	113
	1966-01-01 to 1966-12-31	337
	1967-01-01 to 1967-12-31	317
	1968-01-01 to 1968-12-31	159
	1969-01-01 to 1969-12-31	119
	1970-01-01 to 1970-12-31	133
	1971-01-01 to 1971-12-31	163
	1972-01-01 to 1972-12-31	193
	1973-01-01 to 1973-12-31	194
	1974-01-01 to 1974-12-31	186
	1975-01-01 to 1975-12-31	150
	1976-01-01 to 1976-12-31	192
	1977-01-01 to 1977-12-31	266
	1978-01-01 to 1978-12-31	395
	1979-01-01 to 1979-12-31	229
	1980-01-01 to 1980-12-31	244
	1981-01-01 to 1981-12-31	240
	1982-01-01 to 1982-12-31	211
	1983-01-01 to 1983-12-31	176
	1984-01-01 to 1984-12-31	130
	1985-01-01 to 1985-12-31	125
	1986-01-01 to 1986-12-31	144
	1987-01-01 to 1987-12-31	359
	1988-01-01 to 1988-12-31	546
	1989-01-01 to 1989-12-31	377
	1990-01-01 to 1990-12-31	505
	1991-01-01 to 1991-12-31	829
	1992-01-01 to 1992-12-31	515
	1993-01-01 to 1993-12-31	342
	1994-01-01 to 1994-12-31	282
	1995-01-01 to 1995-12-31	282
	1996-01-01 to 1996-12-31	272
	1997-01-01 to 1997-12-31	528
	1998-01-01 to 1998-12-31	390
	1999-01-01 to 1999-12-31	611
	2000-01-01 to 2000-12-31	608
	2001-01-01 to 2001-12-31	687
	2002-01-01 to 2002-12-31	869
	2003-01-01 to 2003-12-31	601
	2004-01-01 to 2004-12-31	1011
	2005-01-01 to 2005-12-31	412
	2006-01-01 to 2006-12-31	109
	2007-01-01 to 2007-12-31	277
	2008-01-01 to 2008-12-31	710
	2009-01-01 to 2009-12-31	500
	2010-01-01 to 2010-12-31	891
	2011-01-01 to 2011-12-31	482
	2012-01-01 to 2012-12-31	224
	2013-01-01 to 2013-12-31	137
	2014-01-01 to 2014-12-31	319
	2015-01-01 to 2015-12-31	348
	2016-01-01 to 2016-12-31	347
	2017-01-01 to 2017-12-31	150
	2018-01-01 to 2018-12-31	732
	2019-01-01 to 2019-12-31	1045
	overall	21600

Note that you can adjust the time interval using the interval argument, which can be set to “years”, “quarters” or “months”. The example below shows the number of records per quarter:

summariseRecordCount(cdm, "drug_exposure", interval = "quarters") |>
  plotRecordCount()
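
To see how the choice of interval changes the grouping, the base R sketch below buckets a few made-up record dates by year, quarter, and month and counts the records in each bucket. The label formats are chosen for this example and differ from the “start to end” labels the package prints.

```r
# Made-up record dates
dates <- as.Date(c(
  "2019-01-15", "2019-02-20", "2019-04-01", "2019-11-30", "2020-03-31"
))

# One label per record for each interval type
year_label    <- format(dates, "%Y")
quarter_label <- paste(format(dates, "%Y"), quarters(dates))
month_label   <- format(dates, "%Y-%m")

# Record counts per bucket
per_year    <- table(year_label)
per_quarter <- table(quarter_label)
per_month   <- table(month_label)

per_quarter
```

Finer intervals spread the same five records over more buckets: two years, four quarters, and five months in this toy data.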

We can further stratify our counts by sex (setting argument sex = TRUE) or by age (providing an age group). Notice that in both cases, the function will automatically create a group called overall with all the sex groups and all the age groups.

summariseRecordCount(cdm, "drug_exposure",
  interval = "months",
  sex = TRUE,
  ageGroup = list(
    "<30" = c(0, 29),
    ">=30" = c(30, Inf)
  )
) |>
  plotRecordCount()

By default, plotRecordCount() does not facet or colour by any variable. This can be confusing when stratifying by several variables, as seen in the previous plot. We can use the visOmopResults package to find out which columns we can colour or facet by:

summariseRecordCount(cdm, "drug_exposure",
  interval = "months",
  sex = TRUE,
  ageGroup = list(
    "0-29" = c(0, 29),
    "30-Inf" = c(30, Inf)
  )
) |>
  visOmopResults::tidyColumns()
#> [1] "cdm_name"       "omop_table"     "age_group"      "sex"           
#> [5] "variable_name"  "variable_level" "count"          "time_interval" 
#> [9] "interval"

Then, we can pass these columns to the facet and colour arguments of plotRecordCount():

summariseRecordCount(cdm, "drug_exposure",
  interval = "months",
  sex = TRUE,
  ageGroup = list(
    "0-29" = c(0, 29),
    "30-Inf" = c(30, Inf)
  )
) |>
  plotRecordCount(facet = omop_table ~ age_group, colour = "sex")
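
For readers who want to customise such a plot further, the sketch below builds a roughly comparable figure directly with ggplot2 from a hand-made data frame. The column names follow the tidyColumns() output above, but the counts are invented, time_interval is simplified to a start date, and the choice of geom_line() is an assumption; this is not how plotRecordCount() is necessarily implemented.

```r
library(ggplot2)

# Made-up tidy counts: 3 months x 2 sexes x 2 age groups
toy <- data.frame(
  time_interval = rep(as.Date(c("2019-01-01", "2019-02-01", "2019-03-01")), times = 4),
  sex           = rep(rep(c("Female", "Male"), each = 3), times = 2),
  age_group     = rep(c("0-29", "30-Inf"), each = 6),
  omop_table    = "drug_exposure",
  stringsAsFactors = FALSE
)
toy$count <- seq_len(nrow(toy)) # invented counts, for illustration only

# Colour by sex; facet rows by table and columns by age group
p <- ggplot(toy, aes(x = time_interval, y = count, colour = sex)) +
  geom_line() +
  facet_grid(omop_table ~ age_group)
p
```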