Broadly, makepipe does two things:
It automates code execution using a logic similar to GNU Make. In
particular, makepipe ensures that a given piece of code is executed if
and only if the targets associated with that piece of code are
out-of-date with respect to its dependencies. More on this
below.
It makes code self-documenting in a double sense. Firstly, it forces data scientists to make the relationships between the different parts of their code base explicit within the code base itself. Secondly, it exhibits those relationships as a directed acyclic graph (i.e. a flowchart) which can be separated from the code base and shared.
It does these things without requiring major upfront investments in the way of code functionalisation or the like. Indeed, one will not ordinarily need to modify one’s existing code at all in order to implement a makepipe pipeline.
Assuming your workflow consists of a series of R scripts –
one.R, two.R, etc. – you can construct a
makepipe Pipeline simply by sourcing them using
make_with_source().
You’ll just need to supply, along with the path to the
source script, a set of targets (i.e. paths to
files that the script is supposed to make) and optionally a set of
dependencies (i.e. paths to files that the script needs so
as to make the targets).
For example, you’ll create a pipeline.R script
containing the following:
library(makepipe)
make_with_source(
  source = "one.R",
  targets = "data/1 data.Rds",
  dependencies = "data/raw.Rds"
)
make_with_source(
  source = "two.R",
  targets = "data/2 data.Rds",
  dependencies = c("data/1 data.Rds", "lookup/concordance.csv")
)
# etc.
Then, when this code is run, each source file
will be executed if and only if its targets are out-of-date
with respect to its dependencies (and the source
file itself). This means that only those scripts which need to
be run will be. So, without requiring you to think about it, you’ll be
able to skip unnecessary computations.
Meanwhile, behind the scenes, each call to
make_with_source() will add a Segment onto the
Pipeline. These Segment objects keep track of
the relationships between targets,
dependencies and source files and they also
log metadata relating to the execution of the source file
such as whether it was executed on the last run and how long it took to
execute.
You can inspect this metadata by getting ahold of the
Pipeline. For example, you might see something like
this:
pipe <- get_pipeline()
pipe$segments
#> [[1]]
#> # makepipe segment
#>
#> ## one.R
#>
#> * Source: 'one.R'
#> * Targets: 'data/1 data.Rds'
#> * File dependencies: 'data/raw.Rds'
#> * Executed: TRUE
#> * Execution time: 22.5 secs
#> * Result: 0 object(s)
#> * Environment: 0x00000253c8573268
#>
#> [[2]]
#> # makepipe segment
#>
#> ## two.R
#>
#> * Source: 'two.R'
#> * Targets: 'data/2 data.Rds'
#> * File dependencies: 'data/1 data.Rds', 'lookup/concordance.csv'
#> * Executed: TRUE
#> * Execution time: 38.2 secs
#> * Result: 0 object(s)
#> * Environment: 0x00000253c8738660
Additionally, you can visualise the relationships between the various
files by viewing the Pipeline itself:
show_pipeline()
This will display a flow chart exhibiting the relationships between
the targets, dependencies, and code.
We used make_with_source() above since, in most cases,
that will be the simplest way to convert an existing workflow. In some
cases, however, your pipeline may include short pieces of code that
don’t need to reside in their own script. In such cases, you can use
make_with_recipe():
make_with_recipe(
  recipe = rmarkdown::render(
    "report.Rmd",
    output_file = "output/report.html"
  ),
  targets = "output/report.html",
  dependencies = c("report.Rmd", "data/2 data.Rds")
)
As with make_with_source(), when a
make_with_recipe() block is run, the code (this time the
recipe) will only be executed if the relevant
targets are out-of-date with respect to their
dependencies.
Instead of maintaining a separate pipeline script containing calls to
make_with_source(), you can add roxygen-like headers to the
.R files in your pipeline containing the @makepipe tag
along with @targets, @dependencies, and so on.
For example, at the top of script one.R you might have:
#'@title One
#'@description This is the first script in our pipeline
#'@dependencies "data/raw.Rds"
#'@targets "data/1 data.Rds"
#'@makepipe
NULL
You can then call make_with_dir(), which will construct
a pipeline from all the scripts in the provided directory
that contain the @makepipe tag.
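To make the idea of tag-based discovery concrete, here is a minimal base R sketch of how scripts carrying the @makepipe tag could be picked out of a directory. The helper find_tagged_scripts() and the demo directory are hypothetical and are not part of makepipe; the package's own parsing may well differ.

```r
# Hypothetical helper, not makepipe's parser: lists the .R files in `dir`
# whose roxygen-style header lines (starting with #') contain @makepipe.
find_tagged_scripts <- function(dir) {
  scripts <- list.files(dir, pattern = "\\.R$", full.names = TRUE)
  has_tag <- vapply(scripts, function(path) {
    header <- grep("^#'", readLines(path, warn = FALSE), value = TRUE)
    any(grepl("@makepipe", header, fixed = TRUE))
  }, logical(1))
  basename(scripts[has_tag])
}

# Demonstration in a temporary directory (paths are illustrative only)
demo_dir <- file.path(tempdir(), "makepipe-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c(
  "#'@title One",
  "#'@targets \"data/1 data.Rds\"",
  "#'@makepipe",
  "NULL"
), file.path(demo_dir, "one.R"))
# A script without the header is ignored
writeLines("x <- 1", file.path(demo_dir, "helper.R"))

tagged <- find_tagged_scripts(demo_dir)
```

In this sketch only one.R is selected, since helper.R has no tagged header.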
If you want to use a hybrid approach – keeping the documentation of
dependencies and targets close to the source code while maintaining
the flexibility of a separate pipeline script – you can use
make_with_roxy(). Thus you might have:
make_with_roxy("one.R")
# do other stuff
make_with_roxy("two.R")
# etc.
Once you’ve constructed a Pipeline by calling
make_*(), you can re-run the entire pipeline using the
build() method. As when using make_*()
directly, only code that needs to be run will be when
build() is called.
For example, if you’ve just executed the Pipeline and nothing has changed, then nothing will be re-executed and you’ll be told as much:
pipe <- get_pipeline()
pipe$build()
#> √ Targets are up to date
#> √ Targets are up to date
If you want to start from scratch and ‘rebuild’ all targets, you can
use the build() method together with the
clean() method.
pipe$clean()
pipe$build()
#> i Targets are out of date. Updating...
#> √ Finished updating
#> i Targets are out of date. Updating...
#> √ Finished updating
The clean() and build() methods are
especially useful when used with a Pipeline that has
previously been saved out. In particular, if you’ve already created your
Pipeline by stringing make_*() calls together
and you’ve saved your Pipeline object out as
pipeline.Rds, you can re-run the whole Pipeline to ensure
everything is up-to-date simply by calling:
pipe <- readRDS("pipeline.Rds")
pipe$build()
Each Segment on the Pipeline is associated
with a result. This is akin to a return value. Indeed, in
the case of make_with_recipe() it is the return
value of the recipe. For example:
res <- make_with_recipe(
  recipe = {
    saveRDS(mtcars, "data/mtcars.Rds")
    nrow(mtcars)
  },
  targets = "data/mtcars.Rds"
)
#> i Targets are out of date. Updating...
#> √ Finished updating
res$result
#> [1] 32
Note, however, that the result is captured when the
recipe is executed. If your recipe is never
executed, then there will be no result available. Thus, for
instance:
res <- make_with_recipe(
  recipe = {
    saveRDS(mtcars, "data/mtcars.Rds")
    nrow(mtcars)
  },
  targets = "data/mtcars.Rds"
)
#> √ Targets are up to date
res$result
#> NULL
Things are a little more complicated in the case of
make_with_source(), as you can imagine. Given that source
scripts do not really have return values, the result cannot
be what source returns when run. So what is it?
The result associated with a source Segment
is an environment containing objects ‘registered’ in the
source script. Objects are registered using
make_register(), which has a similar API to
base::assign(). Thus, imagine that three.R
contains the following code:
# ...
makepipe::make_register(nrow(dat), "num_rows")
# ...
Then we will have:
res <- make_with_source(
  source = "three.R",
  targets = "data/3 data.Rds",
  dependencies = "data/2 data.Rds"
)
#> i Targets are out of date. Updating...
#> √ Finished updating
res$result
#> <environment: 0x0000029f6840f610>
res$result$num_rows
#> [1] 32
As with make_with_recipe(), a result will
only be captured if the source script is executed.
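It may help to see what "an environment containing registered objects" amounts to. The base R sketch below assumes a result environment and a register() helper that mirrors the (value, name) argument order of the make_register() call shown earlier. Both names are hypothetical stand-ins for illustration and do not show how makepipe implements registration internally.

```r
# Hypothetical sketch, not makepipe's implementation: a result environment
# plus a register() helper that stores a value under a given name.
result <- new.env()

register <- function(value, name, envir = result) {
  assign(name, value, envir = envir)
  invisible(value)
}

# Stand-in for the body of a source script
dat <- mtcars
register(nrow(dat), "num_rows")
register(names(dat)[1:2], "first_cols")

# Names of everything registered so far
registered <- ls(result)
```

Anything registered this way can then be read back with the usual `$` accessor, e.g. result$num_rows, just as with res$result$num_rows above.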
So when does a source file or a recipe get
executed? The answer is: when and only when the relevant
targets are out-of-date with respect to the
dependencies. But what does that mean? Specifically, the
targets are out-of-date if and only if:
One or more of the targets do not exist, OR
One or more of the dependencies is newer
(i.e. has a more recent file modification time) than one or more of the
targets. In other words, the dependencies have
been updated since the targets were last made.
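The two conditions can be sketched in base R as follows. The function is_out_of_date() is a hypothetical helper written for illustration, not a makepipe function, and it assumes that every dependency exists on disk. For make_with_source(), the source file would also be treated like a dependency, as noted earlier.

```r
# Hypothetical helper, not part of makepipe: returns TRUE when the targets
# would count as out-of-date under the two conditions described above.
is_out_of_date <- function(targets, dependencies = character()) {
  # Condition 1: at least one target does not exist yet
  if (!all(file.exists(targets))) {
    return(TRUE)
  }
  # With no dependencies, nothing can be newer than the targets
  if (length(dependencies) == 0) {
    return(FALSE)
  }
  # Assumption of this sketch: all dependencies are present
  stopifnot(all(file.exists(dependencies)))
  # Condition 2: the newest dependency is newer than the oldest target
  max(file.mtime(dependencies)) > min(file.mtime(targets))
}

# Demonstration with temporary files and explicitly set modification times
dep <- tempfile(fileext = ".csv")
tgt <- tempfile(fileext = ".Rds")
writeLines("x", dep)

missing_target <- is_out_of_date(tgt, dep)  # target has not been made yet

saveRDS(1, tgt)
now <- Sys.time()
Sys.setFileTime(dep, now - 60)  # dependency older than the target
Sys.setFileTime(tgt, now)
fresh <- is_out_of_date(tgt, dep)

Sys.setFileTime(dep, now + 60)  # dependency modified after the target
stale <- is_out_of_date(tgt, dep)
```

Comparing the newest dependency against the oldest target is one way to express "one or more of the dependencies is newer than one or more of the targets" in a single test.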
By default the execution will take place in a fresh environment which
is a child of the calling environment. So if you’re calling
make_*() in a top-level script then the execution will take
place in a fresh environment whose parent is the global environment.
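The default described here can be illustrated with plain base R. The run_in_child() function below is a hypothetical stand-in for a make_*() call, not makepipe code: it evaluates an expression in a new environment whose parent is the caller's environment, so the code can read the caller's objects without writing into the caller's environment.

```r
# Hypothetical stand-in for a make_*() call: evaluates `expr` in a fresh
# environment whose parent is the calling environment.
run_in_child <- function(expr, parent = parent.frame()) {
  envir <- new.env(parent = parent)
  eval(substitute(expr), envir)
  invisible(envir)
}

threshold <- 10  # defined in the calling environment

child <- run_in_child({
  big <- threshold * 2  # `threshold` is found via the parent environment
})

# `big` is bound in the child environment only, not in the caller
child_has_big  <- exists("big", envir = child, inherits = FALSE)
caller_has_big <- exists("big", envir = environment(), inherits = FALSE)
```

Evaluating the same expression directly in the calling environment instead would leave `big` bound there.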
There are a number of less commonly used arguments to
make_*() which alter this behaviour. In particular:
packages can be used to supply the names of packages
which serve as dependencies for the targets. If any of
these packages have been updated since the targets were
last made, the targets will be remade. This is particularly
useful when you’re relying on a package for lookups which are liable to
change.
envir can be used to supply an environment in which
the execution of the source or recipe will
take place. Supplying envir = base::globalenv(), for
example, will mean that all execution takes place in the global
environment. If you do this, then all the objects bound in the
recipe/source will be available in the global
environment.
force can be used to ensure that the
recipe/source is executed no matter what. This
is useful, e.g., when you are pulling in some data from an external
database.