---
title: "splitGraph: From Metadata to Leakage-Aware Split Design"
author: "Selçuk Korkmaz"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_document:
    toc: true
    toc_float: true
    number_sections: true
    theme: flatly
    highlight: tango
vignette: >
  %\VignetteIndexEntry{splitGraph: From Metadata to Leakage-Aware Split Design}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  eval = TRUE
)

package_root <- if (file.exists("../DESCRIPTION")) ".." else "."
if (requireNamespace("pkgload", quietly = TRUE) &&
    file.exists(file.path(package_root, "DESCRIPTION"))) {
  pkgload::load_all(package_root, export_all = FALSE, helpers = FALSE, quiet = TRUE)
} else {
  library(splitGraph)
}

or_empty <- function(x) {
  if (is.null(x)) character() else x
}
```

## Why `splitGraph` exists

Leakage in biomedical evaluation workflows often comes from dataset structure rather
than from an obvious coding mistake. Two samples may look independent in a
model matrix while still sharing the same subject, batch, study, timepoint, or
feature provenance. If those relationships stay implicit, train/test
separation can look correct while violating the scientific separation you
actually intended.

`splitGraph` exists to make those relationships explicit before evaluation. It
turns metadata into a typed dependency graph that can be:

- validated for structural and leakage-relevant problems
- queried to inspect hidden overlap and provenance
- converted into deterministic split constraints
- translated into a stable, tool-agnostic split specification through the
  `split_spec` class and `as_split_spec()` / `validate_split_spec()` API

The package is intentionally narrow. It does not fit models, run preprocessing
pipelines, or generate resamples by itself. Its job is to represent dependency
structure clearly enough that downstream evaluation can be trustworthy.

## A realistic toy dataset

The example below includes exactly the kinds of relationships that usually
matter for leakage-aware evaluation:

- repeated subjects (`P1` and `P2`)
- reused batch (`B1`)
- one subject (`P2`) appearing across studies
- explicit time ordering with one sample missing time metadata
- a feature set derived at the full-dataset scope

```{r metadata}
meta <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id = c("P1", "P1", "P2", "P3", "P4", "P2"),
  batch_id = c("B1", "B2", "B1", "B3", NA, "B1"),
  study_id = c("ST1", "ST1", "ST1", "ST2", "ST3", "ST2"),
  timepoint_id = c("T0", "T1", "T0", "T2", NA, "T1"),
  assay_id = c("RNAseq", "RNAseq", "RNAseq", "RNAseq", "Proteomics", "RNAseq"),
  featureset_id = c("FS_GLOBAL", "FS_GLOBAL", "FS_GLOBAL", "FS_GLOBAL", "FS_PROT", "FS_GLOBAL"),
  outcome_id = c("O_case", "O_case", "O_ctrl", "O_case", "O_ctrl", "O_ctrl"),
  stringsAsFactors = FALSE
)

meta
```

This is still a small example, but it already contains enough structure to
make naive random splitting risky.
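To see the risk concretely, here is a quick base-R check (not part of the `splitGraph` API) showing that a purely random split of these six samples can place the same subject on both sides:

```{r naive-split}
set.seed(1)

# Naive random split: ignore subject structure entirely.
train_ids <- sample(meta$sample_id, 4)
test_ids  <- setdiff(meta$sample_id, train_ids)

train_subjects <- meta$subject_id[meta$sample_id %in% train_ids]
test_subjects  <- meta$subject_id[meta$sample_id %in% test_ids]

# Any overlap here is subject-level leakage across the split. With two
# subjects repeated among six samples, most random 4/2 splits leak.
intersect(train_subjects, test_subjects)
```

The rest of the vignette replaces this ad hoc check with typed, queryable structure.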

## Fast path: `graph_from_metadata()`

When your metadata already uses the canonical column names (`sample_id`,
`subject_id`, `batch_id`, `study_id`, `timepoint_id`, `time_index`,
`assay_id`, `featureset_id`, `outcome_id` / `outcome_value`),
`graph_from_metadata()` does ingestion, typed node construction, canonical
edge construction, and optional `timepoint_precedes` derivation in a single
call:

```{r fast-path}
quick_graph <- graph_from_metadata(
  data.frame(
    sample_id    = c("S1", "S2", "S3", "S4", "S5", "S6"),
    subject_id   = c("P1", "P1", "P2", "P2", "P3", "P3"),
    batch_id     = c("B1", "B2", "B1", "B2", "B1", "B2"),
    timepoint_id = c("T0", "T1", "T0", "T1", "T0", "T1"),
    time_index   = c(0, 1, 0, 1, 0, 1),
    outcome_value = c(0, 1, 0, 1, 1, 0)
  ),
  graph_name = "quick_demo"
)

quick_graph
```

The rest of this vignette uses the explicit constructor path because it lets
us show node attributes (`time_index`, `visit_label`, `platform`,
`derivation_scope`) and non-canonical edges (`featureset_generated_from_*`,
`subject_has_outcome`) that `graph_from_metadata()` does not build for you.
Use `graph_from_metadata()` when the canonical columns are enough; use the
explicit path when you need custom attributes or extra relations.

## Ingest metadata and build typed nodes and edges

The first step is to standardize metadata and then turn each entity type into
canonical graph nodes. Sample-level relations become typed edges.

```{r construction}
meta <- ingest_metadata(meta, dataset_name = "VignetteDemo")

sample_nodes <- create_nodes(meta, type = "Sample", id_col = "sample_id")
subject_nodes <- create_nodes(meta, type = "Subject", id_col = "subject_id")
batch_nodes <- create_nodes(meta, type = "Batch", id_col = "batch_id")
study_nodes <- create_nodes(meta, type = "Study", id_col = "study_id")

time_nodes <- create_nodes(
  data.frame(
    timepoint_id = c("T0", "T1", "T2"),
    time_index = c(0L, 1L, 2L),
    visit_label = c("baseline", "follow_up", "late_follow_up"),
    stringsAsFactors = FALSE
  ),
  type = "Timepoint",
  id_col = "timepoint_id",
  attr_cols = c("time_index", "visit_label")
)

assay_nodes <- create_nodes(
  data.frame(
    assay_id = c("RNAseq", "Proteomics"),
    modality = c("transcriptomics", "proteomics"),
    platform = c("NovaSeq", "Orbitrap"),
    stringsAsFactors = FALSE
  ),
  type = "Assay",
  id_col = "assay_id",
  attr_cols = c("modality", "platform")
)

featureset_nodes <- create_nodes(
  data.frame(
    featureset_id = c("FS_GLOBAL", "FS_PROT"),
    featureset_name = c("global_rna_signature", "proteomics_panel"),
    derivation_scope = c("per_dataset", "external"),
    feature_count = c(500L, 80L),
    stringsAsFactors = FALSE
  ),
  type = "FeatureSet",
  id_col = "featureset_id",
  attr_cols = c("featureset_name", "derivation_scope", "feature_count")
)

outcome_nodes <- create_nodes(
  data.frame(
    outcome_id = c("O_case", "O_ctrl"),
    outcome_name = c("response", "response"),
    outcome_type = c("binary", "binary"),
    observation_level = c("subject", "subject"),
    stringsAsFactors = FALSE
  ),
  type = "Outcome",
  id_col = "outcome_id",
  attr_cols = c("outcome_name", "outcome_type", "observation_level")
)

subject_edges <- create_edges(
  meta, "sample_id", "subject_id",
  "Sample", "Subject", "sample_belongs_to_subject"
)

batch_edges <- create_edges(
  meta, "sample_id", "batch_id",
  "Sample", "Batch", "sample_processed_in_batch",
  allow_missing = TRUE
)

study_edges <- create_edges(
  meta, "sample_id", "study_id",
  "Sample", "Study", "sample_from_study"
)

time_edges <- create_edges(
  meta, "sample_id", "timepoint_id",
  "Sample", "Timepoint", "sample_collected_at_timepoint",
  allow_missing = TRUE
)

assay_edges <- create_edges(
  meta, "sample_id", "assay_id",
  "Sample", "Assay", "sample_measured_by_assay"
)

featureset_edges <- create_edges(
  meta, "sample_id", "featureset_id",
  "Sample", "FeatureSet", "sample_uses_featureset"
)

outcome_edges <- create_edges(
  data.frame(
    subject_id = c("P1", "P2", "P3", "P4"),
    outcome_id = c("O_case", "O_ctrl", "O_case", "O_ctrl"),
    stringsAsFactors = FALSE
  ),
  "subject_id", "outcome_id",
  "Subject", "Outcome", "subject_has_outcome"
)

precedence_edges <- create_edges(
  data.frame(
    from_timepoint = c("T0", "T1"),
    to_timepoint = c("T1", "T2"),
    stringsAsFactors = FALSE
  ),
  "from_timepoint", "to_timepoint",
  "Timepoint", "Timepoint", "timepoint_precedes"
)

featureset_from_study <- create_edges(
  data.frame(
    featureset_id = "FS_GLOBAL",
    study_id = "ST1",
    stringsAsFactors = FALSE
  ),
  "featureset_id", "study_id",
  "FeatureSet", "Study", "featureset_generated_from_study"
)

featureset_from_batch <- create_edges(
  data.frame(
    featureset_id = "FS_GLOBAL",
    batch_id = "B1",
    stringsAsFactors = FALSE
  ),
  "featureset_id", "batch_id",
  "FeatureSet", "Batch", "featureset_generated_from_batch"
)
```

The node and edge tables are canonical and typed. The package assigns globally
unique node IDs such as `sample:S1` and `subject:P1`, so different entity
types cannot collide accidentally.

```{r construction-output}
sample_nodes
as.data.frame(sample_nodes)[, c("node_id", "node_type", "node_key", "label")]

edge_preview <- do.call(rbind, lapply(
  list(
    subject_edges, batch_edges, study_edges, time_edges,
    assay_edges, featureset_edges, outcome_edges,
    precedence_edges, featureset_from_study, featureset_from_batch
  ),
  as.data.frame
))

edge_preview[, c("from", "to", "edge_type")]
```

The node table shows the canonical sample IDs that everything else refers to.
The edge table shows the package's central design choice: dependency structure
is explicit, typed, and inspectable.

## Assemble the dependency graph

```{r graph}
graph <- build_dependency_graph(
  nodes = list(
    sample_nodes, subject_nodes, batch_nodes, study_nodes,
    time_nodes, assay_nodes, featureset_nodes, outcome_nodes
  ),
  edges = list(
    subject_edges, batch_edges, study_edges, time_edges,
    assay_edges, featureset_edges, outcome_edges,
    precedence_edges, featureset_from_study, featureset_from_batch
  ),
  graph_name = "vignette_graph",
  dataset_name = attr(meta, "dataset_name")
)

graph
summary(graph)
```

At this point the package has a single `dependency_graph` object with both
tabular and `igraph` representations behind it. The summary is useful because
it tells you exactly which entity types and relation types are present before
you derive any split rules.

### Visualize the typed structure

`plot()` renders the graph with a typed, layered layout: `Sample` on top,
peer dependencies (`Subject`, `Batch`, `Study`, `Timepoint`) in the middle
band, `Assay`/`FeatureSet` below that, and `Outcome` at the bottom. Node
colors are keyed to type and an auto-generated legend is drawn by default.

```{r plot, fig.width = 7, fig.height = 5}
plot(graph)
```

Useful options:

```{r plot-options, eval = FALSE}
plot(graph, layout = "sugiyama")         # alternative hierarchical layout
plot(graph, show_labels = FALSE)         # hide labels on dense graphs
plot(graph, legend = FALSE)              # suppress the legend
plot(graph, legend_position = "bottomright")
plot(graph, node_colors = c(Sample = "#000000"))
```

## Validate before you split

Validation is where `splitGraph` starts paying off. The graph below is
structurally valid, but it still carries leakage-relevant warnings and
advisories.

```{r validation}
validation <- validate_graph(graph)

validation
as.data.frame(validation)[, c("level", "severity", "code", "message")]
```

That output is the core value proposition of the package in one place:

- repeated subjects are surfaced explicitly
- cross-study subject overlap is surfaced explicitly
- full-dataset feature provenance is surfaced explicitly
- heavy batch reuse is surfaced explicitly

`valid = TRUE` here means the graph has no errors. It does not mean the dataset
is free of leakage risk. Warnings and advisories still matter.
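One pragmatic pattern is to gate a pipeline on severity explicitly. A minimal sketch, assuming the severity labels follow the error/warning/advisory vocabulary used above (check your package version's actual labels before relying on them):

```{r severity-gate, eval = FALSE}
findings <- as.data.frame(validation)

# Hypothetical gate: halt on hard errors, keep warnings and advisories
# visible for review instead of discarding them.
stopifnot(!any(findings$severity == "error"))
findings[findings$severity != "error", c("severity", "code", "message")]
```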

The package is also intentionally strict about silent failure. If you ask for a
subset of samples and some of them do not resolve, it errors instead of
dropping them.

```{r strictness}
tryCatch(
  derive_split_constraints(graph, mode = "subject", samples = c("S1", "BAD")),
  error = function(e) e$message
)
```

That behavior matters in practice: silently dropping samples would change the
split problem you think you are solving.

## Query the graph to inspect hidden structure

You can inspect local provenance, trace paths, and project direct sample
dependencies.

```{r neighbors-and-paths}
neighbors_s1 <- query_neighbors(graph, node_ids = "sample:S1", direction = "out")
neighbors_s1
as.data.frame(neighbors_s1)[, c("seed_node_id", "node_id", "node_type", "edge_type")]

subject_outcome_path <- query_shortest_paths(
  graph,
  from = "sample:S1",
  to = "outcome:O_case",
  edge_types = c("sample_belongs_to_subject", "subject_has_outcome")
)

subject_outcome_path
as.data.frame(subject_outcome_path)
```

The first query shows everything the graph knows directly about `S1`. The
second shows that `S1` reaches the subject-level outcome through its subject
node, which is exactly the kind of relationship that would stay implicit in a
plain metadata table.

```{r projected-dependencies}
shared_dependencies <- detect_shared_dependencies(
  graph,
  via = c("Subject", "Batch", "FeatureSet")
)

as.data.frame(shared_dependencies)[, c(
  "sample_id_1", "sample_id_2", "shared_node_type", "shared_node_id", "edge_type"
)]

dependency_components <- detect_dependency_components(
  graph,
  via = c("Subject", "Batch")
)

as.data.frame(dependency_components)
```

These projected queries are useful because they answer the splitting question
directly. They tell you which samples should be treated as structurally linked,
not just which metadata columns happen to match.

## Derive split constraints from the graph

`splitGraph` can derive direct constraints for subject, batch, study, and time
as well as composite constraints that combine multiple dependency sources.

```{r constraints}
subject_constraint <- derive_split_constraints(graph, mode = "subject")
batch_constraint <- derive_split_constraints(graph, mode = "batch")
study_constraint <- derive_split_constraints(graph, mode = "study")
time_constraint <- derive_split_constraints(graph, mode = "time")

strict_constraint <- derive_split_constraints(
  graph,
  mode = "composite",
  strategy = "strict",
  via = c("Subject", "Batch")
)

rule_based_constraint <- derive_split_constraints(
  graph,
  mode = "composite",
  strategy = "rule_based",
  priority = c("batch", "study", "subject", "time")
)

constraint_overview <- do.call(rbind, lapply(
  list(
    subject = subject_constraint,
    batch = batch_constraint,
    study = study_constraint,
    time = time_constraint,
    composite_strict = strict_constraint,
    composite_rule = rule_based_constraint
  ),
  function(x) {
    data.frame(
      strategy = x$strategy,
      groups = length(unique(x$sample_map$group_id)),
      warnings = if (is.null(x$metadata$warnings)) 0L else length(x$metadata$warnings),
      stringsAsFactors = FALSE
    )
  }
))

constraint_overview <- cbind(constraint = row.names(constraint_overview), constraint_overview)
row.names(constraint_overview) <- NULL

constraint_overview
```

That summary already shows why the package is useful: different notions of
dependency produce different splitting units.

### Batch constraints

```{r batch-constraint}
batch_constraint
as.data.frame(batch_constraint)[, c("sample_id", "group_id", "group_label", "explanation")]
```

Batch grouping keeps all `B1` samples together and preserves `S5` as an
explicit singleton because it has no batch assignment. Missing structure is not
hidden.

### Time constraints

```{r time-constraint}
time_constraint
as.data.frame(time_constraint)[, c("sample_id", "group_id", "timepoint_id", "order_rank")]
```

Time grouping adds `order_rank`, which is the field downstream tooling actually
needs for ordered evaluation. The missing timepoint on `S5` stays visible as
`NA`, so the ordering is honestly partial rather than silently completed.
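As a sketch of how `order_rank` supports ordered evaluation, here is an illustrative forward-chaining holdout in base R (not a package feature): train up to a cutoff rank, test on strictly later samples, and keep rank-`NA` samples out of the ordered portion explicitly rather than guessing their position:

```{r ordered-holdout}
tm <- as.data.frame(time_constraint)[, c("sample_id", "order_rank")]

# Toy cutoff: the earliest observed rank forms the training window.
cutoff <- min(tm$order_rank, na.rm = TRUE)

train_ids <- tm$sample_id[!is.na(tm$order_rank) & tm$order_rank <= cutoff]
test_ids  <- tm$sample_id[!is.na(tm$order_rank) & tm$order_rank > cutoff]
unranked  <- tm$sample_id[is.na(tm$order_rank)]

list(train = train_ids, test = test_ids, unranked = unranked)
```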

### Composite constraints

```{r composite-constraints}
strict_constraint
as.data.frame(strict_constraint)[, c("sample_id", "group_id", "constraint_type")]

rule_based_constraint
as.data.frame(rule_based_constraint)[, c("sample_id", "group_id", "constraint_type", "group_label")]
```

The strict composite constraint uses transitive closure: `S1`, `S2`, `S3`, and
`S6` end up in the same group because subject and batch links connect them into
one dependency component. The rule-based composite constraint is different: it
uses the highest-priority available dependency per sample, so `S5` falls back
to study-level grouping instead of becoming a composite component.

## Time ordering can come from precedence edges alone

If explicit `time_index` metadata are unavailable, `splitGraph` can still infer
time order from `timepoint_precedes` edges.

```{r precedence-only}
precedence_meta <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  subject_id = c("P1", "P1", "P2"),
  study_id = c("ST1", "ST1", "ST2"),
  timepoint_id = c("T0", "T1", "T2"),
  stringsAsFactors = FALSE
)

precedence_graph <- build_dependency_graph(
  nodes = list(
    create_nodes(precedence_meta, type = "Sample", id_col = "sample_id"),
    create_nodes(precedence_meta, type = "Subject", id_col = "subject_id"),
    create_nodes(precedence_meta, type = "Study", id_col = "study_id"),
    create_nodes(
      data.frame(timepoint_id = c("T0", "T1", "T2"), stringsAsFactors = FALSE),
      type = "Timepoint",
      id_col = "timepoint_id"
    )
  ),
  edges = list(
    create_edges(
      precedence_meta, "sample_id", "subject_id",
      "Sample", "Subject", "sample_belongs_to_subject"
    ),
    create_edges(
      precedence_meta, "sample_id", "study_id",
      "Sample", "Study", "sample_from_study"
    ),
    create_edges(
      precedence_meta, "sample_id", "timepoint_id",
      "Sample", "Timepoint", "sample_collected_at_timepoint"
    ),
    create_edges(
      data.frame(
        from_timepoint = c("T0", "T1"),
        to_timepoint = c("T1", "T2"),
        stringsAsFactors = FALSE
      ),
      "from_timepoint", "to_timepoint",
      "Timepoint", "Timepoint", "timepoint_precedes"
    )
  ),
  graph_name = "precedence_only_graph"
)

precedence_time_constraint <- derive_split_constraints(precedence_graph, mode = "time")

precedence_time_constraint$metadata$time_order_source
as.data.frame(precedence_time_constraint)[, c("sample_id", "timepoint_id", "time_index", "order_rank")]
```

The important detail is that ordering is still derived, but the source is
`timepoint_precedes` rather than `time_index`.

## Translate the constraint into a split specification

The graph-derived constraint is not the end of the workflow. The main handoff
target is a canonical sample-level split specification — the `split_spec`
class. Downstream tools consume it through their own adapters, so
`split_spec` stays tool-agnostic.

```{r split-spec}
split_spec <- as_split_spec(strict_constraint, graph = graph)
split_spec

as.data.frame(split_spec)[, c(
  "sample_id", "group_id", "batch_group", "study_group", "timepoint_id", "order_rank"
)]

split_spec_validation <- validate_split_spec(split_spec)
split_spec_validation
as.data.frame(split_spec_validation)
```

This translation step is where the package becomes operational for downstream
evaluation workflows:

- `group_id` carries the split unit
- `batch_group` and `study_group` are available for blocking
- `order_rank` is available for ordered evaluation
- the generated object is validated before handoff

## Summarize the leakage picture in one object

The final helper combines graph validation, constraint diagnostics, and
split-spec readiness into one summary object.

```{r risk-summary}
risk_summary <- summarize_leakage_risks(
  graph,
  constraint = strict_constraint,
  split_spec = split_spec
)

risk_summary
as.data.frame(risk_summary)[, c("source", "severity", "category", "message")]
```

This is a useful stopping point before model training. It gives you one place
to review whether the graph is structurally sound, whether the chosen
constraint is overly singleton-heavy, and whether the downstream split spec is
ready to use.

## Downstream handoff

`split_spec` is the tool-agnostic handoff artifact. `splitGraph` does not
know about any particular resampling package — downstream consumers provide
their own adapters so `splitGraph` stays neutral. The typical end-to-end
flow is:

1. `graph_from_metadata()` (or the explicit constructor path) → typed
   `dependency_graph`
2. `derive_split_constraints(g, mode = ...)` → `split_constraint`
3. `as_split_spec(constraint, graph = g)` → `split_spec`
4. adapter in the downstream package → native resamples

The `sample_data` frame carried by `split_spec` exposes exactly the columns
downstream adapters consume: `sample_id` for joining against the observation
frame, `group_id` for grouped resampling, `batch_group` / `study_group` for
blocking, and `order_rank` for ordered evaluation. Adapters can be built by
any package that wants to consume a `split_spec` — for example, on top of
`rsample::group_vfold_cv()` (grouped CV keyed to `group_id`) or
`rsample::rolling_origin()` (ordered evaluation keyed to `order_rank`).
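As an illustration, a minimal adapter on top of `rsample` might look like the sketch below. This is not shipped with `splitGraph`; it assumes `rsample` is installed and relies only on the `sample_data` columns described above:

```{r rsample-adapter, eval = FALSE}
library(rsample)

spec_df <- as.data.frame(split_spec)

# Grouped CV keyed to the graph-derived split unit: samples sharing a
# group_id never straddle the analysis/assessment boundary.
folds <- group_vfold_cv(spec_df, group = group_id, v = 2)
folds
```

The adapter stays thin by design: all leakage reasoning already happened upstream, so the resampling call only has to respect `group_id`.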

## Case studies

The end-to-end workflow above shows the package surface. The case studies below
show how the same graph leads to different evaluation decisions depending on
the scientific question.

### Case study 1: repeated subjects in a longitudinal cohort

Suppose the real question is whether future observations from the same subject
should be held out from training. In this setting, subject reuse and time
ordering both matter, but they solve different problems.

```{r case-study-1}
subject_groups <- grouping_vector(subject_constraint)
time_groups <- time_constraint$sample_map[, c("sample_id", "group_id", "timepoint_id", "order_rank")]

subject_groups
time_groups
```

Interpretation:

- `S1` and `S2` share subject `P1`, so subject-grouped evaluation keeps them
  together.
- `S3` and `S6` share subject `P2`, so they also stay together under a
  subject-based split.
- time grouping adds a different axis: `T0`, `T1`, and `T2` become ordered
  units with explicit `order_rank`.

If the leakage concern is repeated measurements from the same individual, use
the subject constraint. If the evaluation question is prospective prediction,
the time constraint adds the ordering information you need.

### Case study 2: a subject reused across studies

The graph intentionally includes subject `P2` in both `ST1` and `ST2`. A
study-only split would treat those studies as separate units, but the graph
shows that subject overlap breaks the intended independence.

```{r case-study-2}
cross_study_issues <- as.data.frame(validation)[
  as.data.frame(validation)$code == "subject_cross_study_overlap",
  c("severity", "code", "message")
]

p2_shared <- detect_shared_dependencies(
  graph,
  via = "Subject",
  samples = c("S3", "S6")
)

study_only_map <- study_constraint$sample_map[, c("sample_id", "group_id", "group_label")]
strict_map <- strict_constraint$sample_map[, c("sample_id", "group_id", "constraint_type")]

cross_study_issues
as.data.frame(p2_shared)
study_only_map[study_only_map$sample_id %in% c("S3", "S6"), ]
strict_map[strict_map$sample_id %in% c("S3", "S6"), ]
```

Interpretation:

- validation surfaces the cross-study subject overlap directly
- the shared-dependency query confirms that `S3` and `S6` are linked through
  the same subject
- a study-only split would place them in different groups (`ST1` versus `ST2`)
- the strict composite constraint correctly keeps them in the same dependency
  component

This is exactly the kind of failure mode `splitGraph` is designed to expose:
metadata columns suggest a legitimate study split, but graph structure shows
that the split would still leak subject information.

### Case study 3: partially observed technical metadata

Real metadata are rarely complete. Here, `S5` has no batch assignment and no
timepoint assignment. The package does not pretend those fields exist. It keeps
the sample visible and tells you how the split logic handled it.

```{r case-study-3}
batch_missing <- batch_constraint$sample_map[
  batch_constraint$sample_map$sample_id == "S5",
  c("sample_id", "group_id", "group_label", "explanation")
]

rule_based_missing <- rule_based_constraint$sample_map[
  rule_based_constraint$sample_map$sample_id == "S5",
  c("sample_id", "group_id", "constraint_type", "group_label", "explanation")
]

split_spec_missing <- as.data.frame(split_spec)[
  as.data.frame(split_spec)$sample_id == "S5",
  c("sample_id", "group_id", "batch_group", "study_group", "timepoint_id", "order_rank")
]

batch_missing
rule_based_missing
split_spec_missing
```

Interpretation:

- batch-based splitting keeps `S5` as an explicit singleton because batch
  metadata are missing
- the rule-based composite strategy falls back to study-level grouping for `S5`
- the translated split specification preserves the missing batch and time
  fields as `NA` rather than silently inventing values

That behavior matters because incomplete metadata are common. `splitGraph`
stays strict about what is known, but still produces a usable, inspectable
split object.

### Case study 4: choosing a defensible split strategy

A typical practical question is not "what can the package compute?" but "which
constraint should I actually use?" The answer depends on which dependency
source is scientifically unacceptable to leak across train and test.

```{r case-study-4}
strategy_summary <- data.frame(
  constraint = c("subject", "batch", "study", "time", "composite_strict", "composite_rule"),
  groups = c(
    length(unique(subject_constraint$sample_map$group_id)),
    length(unique(batch_constraint$sample_map$group_id)),
    length(unique(study_constraint$sample_map$group_id)),
    length(unique(time_constraint$sample_map$group_id)),
    length(unique(strict_constraint$sample_map$group_id)),
    length(unique(rule_based_constraint$sample_map$group_id))
  ),
  warnings = c(
    length(or_empty(subject_constraint$metadata$warnings)),
    length(or_empty(batch_constraint$metadata$warnings)),
    length(or_empty(study_constraint$metadata$warnings)),
    length(or_empty(time_constraint$metadata$warnings)),
    length(or_empty(strict_constraint$metadata$warnings)),
    length(or_empty(rule_based_constraint$metadata$warnings))
  ),
  recommended_resampling = c(
    as_split_spec(subject_constraint, graph = graph)$recommended_resampling,
    as_split_spec(batch_constraint, graph = graph)$recommended_resampling,
    as_split_spec(study_constraint, graph = graph)$recommended_resampling,
    as_split_spec(time_constraint, graph = graph)$recommended_resampling,
    as_split_spec(strict_constraint, graph = graph)$recommended_resampling,
    as_split_spec(rule_based_constraint, graph = graph)$recommended_resampling
  ),
  stringsAsFactors = FALSE
)

strategy_summary
```

Interpretation:

- subject grouping is the right default when repeated individuals are the
  dominant leakage source
- batch grouping is appropriate when technical runs are the main contamination
  risk
- study grouping is useful for cross-study generalization only when no higher
  level dependency crosses study boundaries
- strict composite grouping is the safest choice when multiple dependency
  sources can connect samples transitively
- rule-based composite grouping is a pragmatic fallback when you want a single
  deterministic hierarchy over partially observed metadata

The package does not choose the scientific objective for you. It makes the
trade-off visible and auditable.

## When `splitGraph` is useful

`splitGraph` is a good fit when:

- sample relationships are scientifically meaningful and must influence
  evaluation
- metadata contain repeated subjects, shared batches, multiple studies, or
  temporal structure
- feature provenance or outcome level matters for leakage assessment
- you want deterministic, inspectable split constraints instead of ad hoc
  grouping code

## What `splitGraph` is not for

`splitGraph` is not:

- a general biological network analysis package
- a model training framework
- a resampling engine
- a substitute for downstream performance auditing

Its value is earlier in the workflow: it makes dependency structure explicit so
that the split design itself can be justified.

## Takeaway

If you already know your data have repeated subjects, reused batches, temporal
ordering, or shared feature provenance, then you already have a graph problem
whether you model it explicitly or not. `splitGraph` is useful because it turns
that hidden graph into an object you can validate, query, and convert into a
split design that downstream tooling can trust.
