This is practically the same code as in this blog post of
mine: https://www.brodrigues.co/blog/2018-11-14-luxairport/
with some minor updates to reflect the current state of the
{tidyverse} packages, plus logging with
{chronicler}.
Let’s first load the required packages and the avia
dataset included in the {chronicler} package:
library(chronicler)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:testthat':
#> 
#>     matches
#> The following object is masked from 'package:chronicler':
#> 
#>     pick
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
#> 
#> Attaching package: 'tidyr'
#> The following object is masked from 'package:testthat':
#> 
#>     matches
library(stringr)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
# Ensure chronicler version of `pick()` is being used
pick <- chronicler::pick
data("avia")
Now I need to define the functions needed for the analysis. To
improve logging, I pass the dim() function as the
.g argument of each function below. This makes it
possible to see how the dimensions of the data change inside the
pipeline:
# Define required functions 
# You can use `record_many()` to avoid having to write everything
r_select <- record(select, .g = dim)
r_pivot_longer <- record(pivot_longer, .g = dim)
r_filter <- record(filter, .g = dim)
r_mutate <- record(mutate, .g = dim)
r_separate <- record(separate, .g = dim)
r_group_by <- record(group_by, .g = dim)
r_summarise <- record(summarise, .g = dim)
avia_clean <- avia %>%
  r_select(1, contains("20")) %>% # select the first column and every column whose name contains "20"
  bind_record(r_pivot_longer, -starts_with("unit"), names_to = "date", values_to = "passengers") %>%
  bind_record(r_separate,
              col = 1,
              into = c("unit", "tra_meas", "air_pr\\time"),
              sep = ",")
Let’s focus on monthly data:
avia_monthly <- avia_clean %>%
  bind_record(r_filter,
              tra_meas == "PAS_BRD_ARR",
              !is.na(passengers),
              str_detect(date, "M")) %>%
  bind_record(r_mutate,
              date = paste0(date, "01"),
              date = ymd(date)) %>%
  bind_record(r_select,
              destination = "air_pr\\time", date, passengers)
avia_monthly is an object of class chronicle; in essence, it is
just a list with its own print method:
avia_monthly
#> OK! Value computed successfully:
#> ---------------
#> Just
#> # A tibble: 7,632 × 3
#>    destination     date       passengers
#>    <chr>           <date>     <chr>     
#>  1 LU_ELLX_AT_LOWW 2018-03-01 3967      
#>  2 LU_ELLX_AT_LOWW 2018-02-01 3232      
#>  3 LU_ELLX_AT_LOWW 2018-01-01 3701      
#>  4 LU_ELLX_AT_LOWW 2017-12-01 4249      
#>  5 LU_ELLX_AT_LOWW 2017-11-01 4311      
#>  6 LU_ELLX_AT_LOWW 2017-10-01 4591      
#>  7 LU_ELLX_AT_LOWW 2017-09-01 4816      
#>  8 LU_ELLX_AT_LOWW 2017-08-01 4399      
#>  9 LU_ELLX_AT_LOWW 2017-07-01 4277      
#> 10 LU_ELLX_AT_LOWW 2017-06-01 4674      
#> # … with 7,622 more rows
#> 
#> ---------------
#> This is an object of type `chronicle`.
#> Retrieve the value of this object with pick(.c, "value").
#> To read the log of this object, call read_log(.c).
Now that the data is clean, we can read the log:
read_log(avia_monthly)
#> [1] "Complete log:"                                                                                                             
#> [2] "OK! select(1,contains(\"20\")) ran successfully at 2023-02-03 14:28:16"                                                    
#> [3] "OK! pivot_longer(-starts_with(\"unit\"),date,passengers) ran successfully at 2023-02-03 14:28:16"                          
#> [4] "OK! separate(1,c(\"unit\", \"tra_meas\", \"air_pr\\\\time\"),,) ran successfully at 2023-02-03 14:28:16"                   
#> [5] "OK! filter(tra_meas == \"PAS_BRD_ARR\",!is.na(passengers),str_detect(date, \"M\")) ran successfully at 2023-02-03 14:28:18"
#> [6] "OK! mutate(paste0(date, \"01\"),ymd(date)) ran successfully at 2023-02-03 14:28:18"                                        
#> [7] "OK! select(air_pr\\time,date,passengers) ran successfully at 2023-02-03 14:28:18"                                          
#> [8] "Total running time: 1.86431932449341 secs"
This is especially useful if the object avia_monthly
gets saved using saveRDS(). Anyone who later loads this
object can read the log to see what happened and reproduce the steps
if necessary.
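As a minimal sketch of that round trip, assuming {chronicler} and {dplyr} are installed, the snippet below records one step on the built-in mtcars data (not avia), saves the resulting chronicle to a temporary file, and reads the log back from the restored object. The names r_filter_demo, demo_chronicle and rds_path are made up for this illustration.

```r
library(chronicler)
library(dplyr)

# Hypothetical names for illustration; mtcars ships with base R
r_filter_demo <- record(filter, .g = dim)
demo_chronicle <- r_filter_demo(mtcars, cyl == 4)

# Save the chronicle (value and log together), then load it again
rds_path <- tempfile(fileext = ".rds")
saveRDS(demo_chronicle, rds_path)
restored <- readRDS(rds_path)

# The log travels with the saved object
restored_log <- read_log(restored)
restored_log

# Use chronicler's pick() explicitly, since dplyr also exports a pick()
restored_value <- chronicler::pick(restored, "value")
```

Whoever loads the file needs {chronicler} attached to call read_log() and pick(), but nothing else from the original session.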
Let’s take a look at the final data set:
avia_monthly %>%
  pick("value")
#> # A tibble: 7,632 × 3
#>    destination     date       passengers
#>    <chr>           <date>     <chr>     
#>  1 LU_ELLX_AT_LOWW 2018-03-01 3967      
#>  2 LU_ELLX_AT_LOWW 2018-02-01 3232      
#>  3 LU_ELLX_AT_LOWW 2018-01-01 3701      
#>  4 LU_ELLX_AT_LOWW 2017-12-01 4249      
#>  5 LU_ELLX_AT_LOWW 2017-11-01 4311      
#>  6 LU_ELLX_AT_LOWW 2017-10-01 4591      
#>  7 LU_ELLX_AT_LOWW 2017-09-01 4816      
#>  8 LU_ELLX_AT_LOWW 2017-08-01 4399      
#>  9 LU_ELLX_AT_LOWW 2017-07-01 4277      
#> 10 LU_ELLX_AT_LOWW 2017-06-01 4674      
#> # … with 7,622 more rows
It is also possible to look at the underlying
.log_df object, which contains more details, and to see the
output of the .g argument (defined at the
beginning as the dim() function):
check_g(avia_monthly)
#>   ops_number     function         g
#> 1          1       select  509, 231
#> 2          2 pivot_longer 117070, 3
#> 3          3     separate 117070, 5
#> 4          4       filter   7632, 5
#> 5          5       mutate   7632, 5
#> 6          6       select   7632, 3
After the first select(), the data has 509 rows and 231 columns.
pivot_longer() reshapes it to 117070 rows and 3 columns, and
separate() adds two columns. After filter(),
only 7632 rows remain; mutate() does not change the
dimensions. The final select() then removes 2
columns.
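The same bookkeeping can be tried on a small scale. The sketch below assumes {chronicler} and {dplyr} are installed and uses the built-in mtcars data rather than avia. It chains two recorded functions and inspects the dimensions captured by .g after each step. All object names ending in _demo are hypothetical.

```r
library(chronicler)
library(dplyr)

# Record two dplyr verbs, capturing the output dimensions after each call
r_filter_demo <- record(filter, .g = dim)
r_select_demo <- record(select, .g = dim)

# The first recorded function takes the raw data;
# later steps are chained with bind_record()
demo_chronicle <- mtcars %>%
  r_filter_demo(cyl == 4) %>%
  bind_record(r_select_demo, mpg, cyl)

# One row per operation, with the result of dim() in the g column
g_demo <- check_g(demo_chronicle)
g_demo

# Use chronicler's pick() explicitly, since dplyr also exports a pick()
value_demo <- chronicler::pick(demo_chronicle, "value")
```

Here filter() keeps the 11 four-cylinder cars and select() keeps 2 columns, so the g column should show the row count dropping first and the column count dropping second.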