This package contains a flexible framework for extending the pipe into a loop. The basic idea is this: I often run into the problem of wanting to access an unnamed intermediate in a pipe. Why? A basic strategy for working with data frames is to focus on one aspect of the data frame, make some changes, and then reincorporate those changes into the original data frame. This workflow is best understood through illustration.
This tutorial assumes familiarity with Hadley Wickham’s dplyr and magrittr packages. If you don’t know what I’m talking about, go look them up. Your life is about to get a whole lot easier.
Import dplyr and magrittr for chaining, knitr for table output, and, of course, loopr.
library(loopr)
library(dplyr)
library(magrittr)
library(knitr)
Define our loop object.
loop = loopClass$new()
Set up an extremely simple data frame for illustration.
id = c(1, 2, 3, 4)
toFix = c(0, 0, 1, 1)
group = c(1, 1, 1, 0)
example = data_frame(id, toFix, group)
kable(example)
| id | toFix | group |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 1 | 1 |
| 4 | 1 | 0 |
loopr relies on a stack framework. Let’s initialize one.
stack = stackClass$new()
We can push data onto the stack like this. The names are optional.
stack$push(1, name = "first")
## [1] 1
stack$push(2, name = "second")
## [1] 2
stack$push(3, name = "third")
## [1] 3
We can peek at the top of the stack:
stack$peek
## [1] 3
or at the whole thing.
stack$stack %>%
  as.data.frame %>%
  kable
| bottom | first | second | third |
|---|---|---|---|
| NA | 1 | 2 | 3 |
We can find the height of the stack as well. Note that the count includes the bottom placeholder:
stack$height
## [1] 4
We can also pop off items from the stack:
stack$pop
## [1] 3
stack$pop
## [1] 2
stack$pop
## [1] 1
Now the stack is empty.
stack$stack
## $bottom
## [1] NA
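To make the push, peek, and pop behavior concrete, here is a minimal closure-based stack in base R. It is only a sketch: makeStack and everything inside it are hypothetical names, not loopr’s stackClass, and the bottom placeholder is an assumption modeled on the output above.

```r
# Hypothetical sketch of a stack; not loopr's actual stackClass.
makeStack = function() {
  # Start with a placeholder entry, mirroring the "bottom" element shown above.
  items = list(bottom = NA)

  push = function(value, name = "") {
    # Append the value as the new top of the stack.
    items <<- c(items, setNames(list(value), name))
    invisible(value)
  }
  # Look at the top item without removing it.
  peek = function() items[[length(items)]]
  # Remove the top item and return it.
  pop = function() {
    if (length(items) == 1) stop("nothing left to pop")
    top = items[[length(items)]]
    items <<- items[-length(items)]
    top
  }
  height = function() length(items)

  list(push = push, peek = peek, pop = pop, height = height)
}

sketch = makeStack()
sketch$push(1, name = "first")
sketch$push(2, name = "second")
sketch$height()  # counts the placeholder as well
sketch$pop()     # removes and returns the most recent push
sketch$peek()    # the item that is now on top
```

The important property is last in, first out: whatever was pushed most recently is what the next pop returns.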
Why is this important? A loop object inherits from stack.
The begin method is simply a copy of push. After the loop begins, you can focus on any part of your data while still having access to the original data.
"first" %>%
loop$begin()
## [1] "first"
To end the loop, you need to merge the data at the beginning of the loop with the data at the end. There are two ending methods defined in loopr: end and cross. The end method takes a function, pops the top of the loop stack to use as that function’s first argument, and passes its own first (or chained) argument as the second.
"second" %>%
loop$end(paste)
## [1] "first second"
cross is nearly identical, but the order of the arguments gets reversed.
"first" %>%
loop$begin()
## [1] "first"
"second" %>%
loop$cross(paste)
## [1] "second first"
This is much easier to explain in code than in words.
end(endData, FUN, ...) = FUN(stack$pop, endData, ...)
cross(crossData, FUN, ...) = FUN(crossData, stack$pop, ...)
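The two definitions above can be turned into a small runnable sketch in base R. The names sketchBegin, sketchPop, sketchEnd, and sketchCross are hypothetical stand-ins written for this illustration; they only mimic the argument order described here and are not loopr’s methods.

```r
# Hypothetical stand-ins for loop$begin, loop$end and loop$cross.
saved = list()

sketchBegin = function(data) {
  saved <<- c(saved, list(data))  # remember the data at the start of the loop
  data                            # and pass it along unchanged
}
sketchPop = function() {
  top = saved[[length(saved)]]
  saved <<- saved[-length(saved)]
  top
}
# end: saved data first, incoming data second
sketchEnd = function(endData, FUN, ...) FUN(sketchPop(), endData, ...)
# cross: incoming data first, saved data second
sketchCross = function(crossData, FUN, ...) FUN(crossData, sketchPop(), ...)

sketchBegin("first")
endResult = sketchEnd("second", paste)      # paste("first", "second")

sketchBegin("first")
crossResult = sketchCross("second", paste)  # paste("second", "first")
```

Which method to pick depends only on where the ending function expects the original data: as its first argument (end) or as its second (cross).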
There are two useful ending functions included in this package: insert and amend. Why are special ending functions needed? In general, traditional join functions are not well suited to the focus-modify-restore workflow. We need insert and amend to prioritize information in the modified data over information in the original data.
insert is the simpler case. Let’s use our example data again.
Create a set of data to insert.
insertData =
  example %>%
  filter(toFix == 0) %>%
  mutate(toFix = 1) %>%
  select(-group)
kable(insertData)
| id | toFix |
|---|---|
| 1 | 1 |
| 2 | 1 |
Now let’s insert it back into the original data.
insert(example, insertData, by = "id") %>%
  kable
| id | toFix | group |
|---|---|---|
| 1 | 1 | NA |
| 2 | 1 | NA |
| 3 | 1 | 1 |
| 4 | 1 | 0 |
What happened? Where the by variables matched, insert excised the rows from example and inserted the rows from insertData in their place. Because insertData has no group column, group is NA in the inserted rows. Finally, the data was sorted by the by variable. The by variable (or variables) must be included in the function call.
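The behavior just described (excise matching rows, add the new rows, sort by the key) can be approximated with plain dplyr verbs. This is a sketch under assumptions: insertSketch is a hypothetical helper written for this illustration, and loopr’s own insert may be implemented differently.

```r
library(dplyr)

# Hypothetical re-implementation for illustration; loopr's insert may differ.
insertSketch = function(original, replacement, by) {
  # Drop the rows of the original that have a match in the replacement ...
  kept = anti_join(original, replacement, by = by)
  # ... add the replacement rows (columns they lack are filled with NA) ...
  combined = bind_rows(kept, replacement)
  # ... and sort by the key columns.
  combined[do.call(order, unname(as.list(combined[by]))), ]
}

example = tibble(id = c(1, 2, 3, 4), toFix = c(0, 0, 1, 1), group = c(1, 1, 1, 0))
insertData = tibble(id = c(1, 2), toFix = c(1, 1))

inserted = insertSketch(example, insertData, by = "id")
```

In this sketch the NA values in group come from bind_rows, which fills any column the replacement rows do not have.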
Let’s take a look at the slightly more complicated ending function: amend. To understand amend, we first need to understand the underlying column update function.
amendColumns updates an old set of columns with all non-NA values from a matching new set of columns.
Build example data.
oldColumn1 = c(0, 0)
newColumn1 = c(1, NA)
oldColumn2 = c(0, 0)
newColumn2 = c(NA, 1)
columnData = data_frame(oldColumn1, newColumn1, oldColumn2, newColumn2)
kable(columnData)
| oldColumn1 | newColumn1 | oldColumn2 | newColumn2 |
|---|---|---|---|
| 0 | 1 | 0 | NA |
| 0 | NA | 0 | 1 |
Now run amendColumns.
columnData %>%
  amendColumns(
    c("oldColumn1", "oldColumn2"),
    c("newColumn1", "newColumn2")) %>%
  kable
| oldColumn1 | oldColumn2 |
|---|---|
| 1 | 0 |
| 0 | 1 |
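The same update rule can be written out by hand with dplyr::coalesce, which returns the first non-NA value at each position. This is only a sketch of the rule as described above, not loopr’s implementation of amendColumns; amendedSketch is a name made up for this example.

```r
library(dplyr)

columnData = tibble(
  oldColumn1 = c(0, 0), newColumn1 = c(1, NA),
  oldColumn2 = c(0, 0), newColumn2 = c(NA, 1))

# Take the new value where it exists, otherwise keep the old one.
amendedSketch = columnData %>%
  mutate(
    oldColumn1 = coalesce(newColumn1, oldColumn1),
    oldColumn2 = coalesce(newColumn2, oldColumn2)) %>%
  select(oldColumn1, oldColumn2)
```

The order of the arguments to coalesce is what gives the new columns priority over the old ones.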
There is also a matching function called fillColumns. Here, NAs in the new columns are filled with values from the old columns, and nothing else changes.
oldColumn = c(0, 0)
newColumn = c(1, NA)
columnData %>%
  fillColumns(c("newColumn1", "newColumn2"),
              c("oldColumn1", "oldColumn2")) %>%
  kable
| newColumn1 | newColumn2 |
|---|---|
| 1 | 0 |
| 0 | 1 |
amend is simply dplyr::full_join followed by amendColumns, which overwrites non-key columns from the original dataset with the same-named columns from the new dataset. In this case, toFix from amendData overwrites toFix from example, while group is left untouched.
amendData = insertData
example %>%
  amend(amendData, by = "id") %>%
  kable
## Amending columns: toFix
| id | toFix | group |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 1 | 0 |
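The join-then-overwrite idea can be spelled out with dplyr alone. This is a rough sketch of the description above, not loopr’s code: the "Old" and "New" suffixes and the name amendedSketch are illustrative choices, and the real amend works out which columns to overwrite by itself.

```r
library(dplyr)

example = tibble(id = c(1, 2, 3, 4), toFix = c(0, 0, 1, 1), group = c(1, 1, 1, 0))
amendData = tibble(id = c(1, 2), toFix = c(1, 1))

amendedSketch = example %>%
  # Keep every row from both tables; the suffixes are illustrative choices.
  full_join(amendData, by = "id", suffix = c("Old", "New")) %>%
  # Prefer the new value, fall back to the old one where the new one is NA.
  mutate(toFix = coalesce(toFixNew, toFixOld)) %>%
  select(id, toFix, group)
```

Unlike the insert example, group survives here, because rows are matched and updated column by column rather than replaced wholesale.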
If the by argument is omitted, it defaults to the grouping variables in the data.
example %>%
  group_by(id) %>%
  amend(amendData) %>%
  kable
## Amending columns: toFix
| id | toFix | group |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 1 | 0 |
A warning: amend internally uses the suffix "toFix". If this suffix is already used in your data, modify the suffix argument.
Now that we understand how it works, let’s use our loop!
Remind ourselves of what the example data looks like.
kable(example)
| id | toFix | group |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 1 | 1 |
| 4 | 1 | 0 |
Here, we convert toFix to 0 when group is 0.
example %>%
  ungroup %>%
  loop$begin() %>%
    filter(group == 0) %>%
    mutate(toFix = 0) %>%
  loop$end(insert, by = "id") %>%
  kable
| id | toFix | group |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |
In general, insert is best suited to filter/slice type operations.
Here, we summarize toFix in each of the two groups, reverse the results, and then reintegrate the summary into the original data.
example %>%
  group_by(group) %>%
  loop$begin() %>%
    summarize(toFix = mean(toFix)) %>%
    mutate(group = rev(group)) %>%
  loop$end(amend) %>%
  kable
## Amending columns: toFix
| group | id | toFix |
|---|---|---|
| 0 | 4 | 0.3333333 |
| 1 | 1 | 1.0000000 |
| 1 | 2 | 1.0000000 |
| 1 | 3 | 1.0000000 |
In general, amend is best suited to summarize/do type operations.
This is only the tip of the iceberg. Do not feel limited to using amend and insert as ending functions. A whole host of others could be useful: join functions, merge functions, even setNames.
Here, we will suffix the names of all the variables within the context of a chain.
example %>%
  mutate(group = group + 1) %>%
  loop$begin() %>%
    names %>%
    paste0("Suffix") %>%
  loop$end(setNames) %>%
  kable
| idSuffix | toFixSuffix | groupSuffix |
|---|---|---|
| 1 | 0 | 2 |
| 2 | 0 | 2 |
| 3 | 1 | 2 |
| 4 | 1 | 1 |
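For comparison, here is roughly what that chain does when written in base R with a named intermediate. It is a sketch that assumes loop$end(setNames) reduces to setNames(saved data, new names), as the definition of end given earlier suggests; shifted and renamed are names made up for this example.

```r
example = data.frame(id = c(1, 2, 3, 4), toFix = c(0, 0, 1, 1), group = c(1, 1, 1, 0))

# The data that loop$begin() would save.
shifted = transform(example, group = group + 1)

# Ending with setNames then amounts to setNames(saved data, new names).
renamed = setNames(shifted, paste0(names(shifted), "Suffix"))
names(renamed)
```

The loop version avoids naming shifted at all, which is the point of extending the pipe into a loop.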
Here, we will double the data.
example %>%
  mutate(replication = 1) %>%
  loop$begin() %>%
    mutate(replication = 2) %>%
  loop$end(bind_rows) %>%
  kable
| id | toFix | group | replication |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 2 | 0 | 1 | 1 |
| 3 | 1 | 1 | 1 |
| 4 | 1 | 0 | 1 |
| 1 | 0 | 1 | 2 |
| 2 | 0 | 1 | 2 |
| 3 | 1 | 1 | 2 |
| 4 | 1 | 0 | 2 |
Loops within loops are quite possible, though I would be cautious about using them. It can be exhilarating, but make sure to indent each loop carefully. It is also a good idea to name each loop, so that you can interpret loop$stack when debugging. Here is a quick example that filters the data, copies the columns under new names, and then re-merges.
example %>%
  loop$begin(name = "original") %>%
    filter(group == 1) %>%
    loop$begin(name = "filtered") %>%
      names %>%
      paste0("Extra") %>%
    loop$end(setNames) %>%
    rename(id = idExtra) %>%
  loop$end(amend, by = "id") %>%
  kable
## Joining by: "id"
| id | toFix | group | toFixExtra | groupExtra |
|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 0 | 1 |
| 3 | 1 | 1 | 1 | 1 |
| 4 | 1 | 0 | NA | NA |
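To check your reading of a nested loop, it can help to unroll it into named intermediates. The sketch below does that with dplyr only. It assumes that, since the two tables share no non-key columns, the final amend step reduces to a full_join, as the output above suggests. All variable names here are made up for this illustration.

```r
library(dplyr)

example = tibble(id = c(1, 2, 3, 4), toFix = c(0, 0, 1, 1), group = c(1, 1, 1, 0))

original = example                       # saved by the outer begin
filtered = filter(original, group == 1)  # saved by the inner begin

# The inner end pops "filtered": setNames(filtered, new names).
extraNames = paste0(names(filtered), "Extra")
extra = setNames(filtered, extraNames) %>%
  rename(id = idExtra)

# The outer end pops "original"; with no shared non-key columns,
# only the join remains (an assumption, see the lead-in).
unrolled = full_join(original, extra, by = "id")
```

Because the stack is last in, first out, the inner end always pairs with the inner begin, which is why careful indentation mirrors the actual control flow.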