The putior package helps you document and visualize workflows by extracting structured annotations from your R and Python source files. This vignette shows you how to get started with PUT annotations and workflow extraction.
The name putior stands for PUT + Input + Output + R, reflecting the package's core purpose: tracking data inputs and outputs through your analysis pipeline using special annotations.
The fastest way to see putior in action is to run the built-in example:
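```r
# Run the bundled multi-language example that ships with the package
# (the same call appears under "Next steps" below)
source(system.file("examples", "reprex.R", package = "putior"))
```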
This creates a sample multi-language workflow and demonstrates the workflow extraction capabilities of putior.
PUT annotations are special comments that describe workflow nodes. Here’s how to add them to your source files:
R script example:

```r
# data_processing.R
library(dplyr)

#put id:"load_data", label:"Load Customer Data", node_type:"input", output:"raw_data.csv"

# Your actual code
data <- read.csv("customer_data.csv")
write.csv(data, "raw_data.csv")

#put id:"clean_data", label:"Clean and Validate", node_type:"process", input:"raw_data.csv", output:"clean_data.csv"

# Data cleaning code
cleaned_data <- data %>%
  filter(!is.na(customer_id)) %>%
  mutate(purchase_date = as.Date(purchase_date))
write.csv(cleaned_data, "clean_data.csv")
```
Python script example:

```python
# analysis.py
#put id:"analyze_sales", label:"Sales Analysis", node_type:"process", input:"clean_data.csv", output:"sales_report.json"

import pandas as pd
import json

# Load cleaned data
data = pd.read_csv("clean_data.csv")

# Perform analysis (cast to plain Python types so json can serialize them)
sales_summary = {
    "total_sales": float(data["amount"].sum()),
    "avg_order": float(data["amount"].mean()),
    "customer_count": int(data["customer_id"].nunique()),
}

# Save results
with open("sales_report.json", "w") as f:
    json.dump(sales_summary, f)
```
Use the `put()` function to scan your files and extract workflow information:

```r
# Scan all R and Python files in a directory
workflow <- put("./src/")

# View the extracted workflow
print(workflow)
```
Expected output:

```
#>           file_name file_type          input              label            id
#> 1 data_processing.R         r           <NA> Load Customer Data     load_data
#> 2 data_processing.R         r   raw_data.csv Clean and Validate    clean_data
#> 3       analysis.py        py clean_data.csv     Sales Analysis analyze_sales
#>   node_type             output
#> 1     input       raw_data.csv
#> 2   process     clean_data.csv
#> 3   process sales_report.json
```
The output is a data frame where each row represents a workflow node. The columns include `file_name`, `file_type`, `id`, `label`, `node_type`, `input`, and `output`.
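Because the result is an ordinary data frame, you can inspect and filter it with base R:

```r
# Keep only the processing steps
process_nodes <- subset(workflow, node_type == "process")

# List every file the workflow produces
unique(workflow$output)
```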
The general syntax for PUT annotations is:

```r
#put property1:"value1", property2:"value2", property3:"value3"
```
PUT annotations support several formats to fit different coding styles:

```r
#put id:"my_node", label:"My Process"      # Standard format
# put id:"my_node", label:"My Process"     # Space after #
#put| id:"my_node", label:"My Process"     # Pipe separator
#put id:'my_node', label:'Single quotes'   # Single quotes
#put id:"my_node", label:'Mixed quotes'    # Mixed quote styles
```
While putior accepts any properties you define, these are commonly used:

| Property | Purpose | Example Values |
|---|---|---|
| `id` | Unique identifier | `"load_data"`, `"process_sales"` |
| `label` | Human-readable description | `"Load Customer Data"` |
| `node_type` | Operation type | `"input"`, `"process"`, `"output"` |
| `input` | Input files | `"raw_data.csv"`, `"data/*.json"` |
| `output` | Output files | `"processed_data.csv"` |
For consistency across projects, consider using these standard node types:

- `input`: Data collection, file loading, API calls
- `process`: Data transformation, analysis, computation
- `output`: Report generation, data export, visualization
- `decision`: Conditional logic, branching workflows

Add any properties you need for visualization or metadata:
#put id:"train_model", label:"Train ML Model", node_type:"process", color:"green", group:"machine_learning", duration:"45min", priority:"high"
These custom properties can be used by visualization tools or workflow management systems.
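If custom properties also appear as columns in the returned data frame — an assumption here, not confirmed by this vignette — you could filter on them directly:

```r
workflow <- put("./src/")

# Hypothetical: select nodes tagged with the custom "group" property
ml_nodes <- subset(workflow, group == "machine_learning")
```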
Beyond scanning a whole directory, `put()` gives you finer control over the scan (each option is sketched below):

- Process single files instead of entire directories.
- Include subdirectories in your scan.
- Control which files are processed.
- Include line numbers, for debugging annotation issues.
- Control annotation validation.
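A minimal sketch of these options, in the order listed above. The argument names (`recursive`, `pattern`, `include_line_numbers`, `validate`) are assumptions based on common R conventions, not confirmed by this vignette; see `?put` for the authoritative interface.

```r
# Process a single file instead of a directory
workflow <- put("./src/data_processing.R")

# Include subdirectories in the scan (assumed argument name)
workflow <- put("./src/", recursive = TRUE)

# Control which files are processed (assumed argument name)
workflow <- put("./src/", pattern = "\\.R$")

# Include line numbers for debugging annotations (assumed argument name)
workflow <- put("./src/", include_line_numbers = TRUE)

# Disable annotation validation (assumed argument name)
workflow <- put("./src/", validate = FALSE)
```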
If you omit the `id` field, putior will automatically generate a unique UUID:

```r
# Annotations without explicit IDs get auto-generated UUIDs
#put label:"Load Data", node_type:"input", output:"data.csv"
#put label:"Process Data", node_type:"process", input:"data.csv", output:"clean.csv"

# Extract workflow - IDs will be auto-generated
workflow <- put("./")
print(workflow$id)  # Will show UUIDs like "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
```
Note: If you provide an empty `id` (e.g., `id:""`), you'll get a validation warning.
If you omit the `output` field, putior automatically uses the file name as the output:

```r
# In process_data.R:
#put label:"Process Step", node_type:"process", input:"raw.csv"
# No output specified - will default to "process_data.R"

# In analyze_data.R:
#put label:"Analyze", node_type:"process", input:"process_data.R", output:"results.csv"
# This creates a connection from process_data.R to analyze_data.R
```
This feature ensures that scripts can be connected in workflows even when explicit output files aren’t specified.
When you have scripts that source other scripts, use this annotation pattern:

```r
# In main.R (sources other scripts):
#put label:"Main Analysis", input:"load_data.R,process_data.R", output:"report.pdf"
source("load_data.R")     # Reading load_data.R into main.R
source("process_data.R")  # Reading process_data.R into main.R

# In load_data.R (sourced by main.R):
#put label:"Data Loader", node_type:"input"
# output defaults to "load_data.R"

# In process_data.R (sourced by main.R, depends on load_data.R):
#put label:"Data Processor", input:"load_data.R"
# output defaults to "process_data.R"
```
This correctly shows the flow: sourced scripts are inputs to the main script.
Let’s walk through a complete data science workflow:
```python
# 01_collect_data.py
#put id:"fetch_api_data", label:"Fetch Data from API", node_type:"input", output:"raw_api_data.json"

import requests
import json

response = requests.get("https://api.example.com/sales")
data = response.json()

with open("raw_api_data.json", "w") as f:
    json.dump(data, f)
```
```r
# 02_process_data.R
#put id:"clean_api_data", label:"Clean and Structure Data", node_type:"process", input:"raw_api_data.json", output:"processed_sales.csv"

library(jsonlite)
library(dplyr)

# Load raw data
raw_data <- fromJSON("raw_api_data.json")

# Process and clean
processed <- raw_data %>%
  filter(!is.na(sale_amount)) %>%
  mutate(
    sale_date = as.Date(sale_date),
    sale_amount = as.numeric(sale_amount)
  ) %>%
  arrange(sale_date)

# Save processed data
write.csv(processed, "processed_sales.csv", row.names = FALSE)
```
```r
# 03_analyze_report.R
#put id:"sales_analysis", label:"Perform Sales Analysis", node_type:"process", input:"processed_sales.csv", output:"analysis_results.rds"
#put id:"generate_report", label:"Generate HTML Report", node_type:"output", input:"analysis_results.rds", output:"sales_report.html"

library(dplyr)

# Load processed data (restore the Date class lost in the CSV round-trip)
sales_data <- read.csv("processed_sales.csv")
sales_data$sale_date <- as.Date(sales_data$sale_date)

# Perform analysis
analysis_results <- list(
  total_sales = sum(sales_data$sale_amount),
  monthly_trends = sales_data %>%
    group_by(month = format(sale_date, "%Y-%m")) %>%
    summarise(monthly_total = sum(sale_amount)),
  top_products = sales_data %>%
    group_by(product) %>%
    summarise(product_sales = sum(sale_amount)) %>%
    arrange(desc(product_sales)) %>%
    head(10)
)

# Save analysis
saveRDS(analysis_results, "analysis_results.rds")

# Generate report
rmarkdown::render("report_template.Rmd",
                  output_file = "sales_report.html")
```
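With all three scripts annotated, you can extract the workflow and sanity-check the chain: every declared input should be produced by some upstream node. The check below uses only base R on the data frame that `put()` returns:

```r
workflow <- put("./")

# Split comma-separated input lists into individual file names
inputs <- unlist(strsplit(na.omit(workflow$input), ","))

# Any input that no node produces points to a gap in the pipeline
missing <- setdiff(trimws(inputs), workflow$output)
if (length(missing) > 0) {
  warning("Inputs with no producing node: ", paste(missing, collapse = ", "))
}
```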
Choose clear, descriptive names that explain what each step does:

```r
# Good
#put id:"load_customer_transactions", label:"Load Customer Transaction Data"
#put id:"calculate_monthly_revenue", label:"Calculate Monthly Revenue Totals"

# Less descriptive
#put id:"step1", label:"Load data"
#put id:"process", label:"Do calculations"
```
Always specify inputs and outputs for data processing steps:

```r
#put id:"merge_datasets", label:"Merge Customer and Transaction Data", input:"customers.csv,transactions.csv", output:"merged_data.csv"
```
Stick to a standard set of node types across your team:

```r
#put id:"load_raw_data", label:"Load Raw Sales Data", node_type:"input"
#put id:"clean_data", label:"Clean and Validate", node_type:"process"
#put id:"export_results", label:"Export Final Results", node_type:"output"
```
Include metadata that helps with workflow understanding:

```r
#put id:"train_model", label:"Train Random Forest Model", node_type:"process", estimated_time:"30min", requires:"tidymodels", memory_intensive:"true"
```
If `put()` returns an empty data frame:

- Check that your files contain correctly formatted `#put` annotations
- Use `is_valid_put_annotation()` to test individual annotations

If you see validation warnings:

- Make sure every annotation has a non-empty `id` property
- Stick to the standard node types (`input`, `process`, `output`)

If annotations aren't parsed correctly, check that property values containing commas or spaces are quoted:
Good example:

```r
#put id:"step1", description:"Process data, clean outliers", type:"process"
```

Problematic example (unquoted values break at the commas):

```r
#put id:"step1", description:Process data, clean outliers, type:process
```
Now that you understand the basics of putior, a good next step is to work through the bundled example:

```r
source(system.file("examples", "reprex.R", package = "putior"))
```

For more detailed information, see:

- `?put` - Complete function documentation
- Advanced usage vignette - Complex workflows and integration
- Best practices vignette - Team collaboration and style guides