---
title: "splitGraph: From Metadata to Leakage-Aware Split Design"
author: "Selçuk Korkmaz"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_document:
    toc: true
    toc_float: true
    number_sections: true
    theme: flatly
    highlight: tango
vignette: >
  %\VignetteIndexEntry{splitGraph: From Metadata to Leakage-Aware Split Design}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  eval = TRUE
)

package_root <- if (file.exists("../DESCRIPTION")) ".." else "."
if (requireNamespace("pkgload", quietly = TRUE) &&
    file.exists(file.path(package_root, "DESCRIPTION"))) {
  pkgload::load_all(package_root, export_all = FALSE, helpers = FALSE, quiet = TRUE)
} else {
  library(splitGraph)
}

or_empty <- function(x) {
  if (is.null(x)) character() else x
}
```

## Why `splitGraph` exists

Leakage in biomedical evaluation workflows often comes from dataset structure rather than from an obvious coding mistake. Two samples may look independent in a model matrix while still sharing the same subject, batch, study, timepoint, or feature provenance. If those relationships stay implicit, train/test separation can look correct while violating the scientific separation you actually intended.

`splitGraph` exists to make those relationships explicit before evaluation. It turns metadata into a typed dependency graph that can be:

- validated for structural and leakage-relevant problems
- queried to inspect hidden overlap and provenance
- converted into deterministic split constraints
- translated into a stable, tool-agnostic split specification through the `split_spec` class and `as_split_spec()` / `validate_split_spec()` API

The package is intentionally narrow. It does not fit models, run preprocessing pipelines, or generate resamples by itself. Its job is to represent dependency structure clearly enough that downstream evaluation can be trustworthy.

## A realistic toy dataset

The example below includes exactly the kinds of relationships that usually matter for leakage-aware evaluation:

- repeated subjects (`P1` and `P2`)
- a reused batch (`B1`)
- one subject (`P2`) appearing across studies
- explicit time ordering with one sample missing time metadata
- a feature set derived at the full-dataset scope

```{r metadata}
meta <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id = c("P1", "P1", "P2", "P3", "P4", "P2"),
  batch_id = c("B1", "B2", "B1", "B3", NA, "B1"),
  study_id = c("ST1", "ST1", "ST1", "ST2", "ST3", "ST2"),
  timepoint_id = c("T0", "T1", "T0", "T2", NA, "T1"),
  assay_id = c("RNAseq", "RNAseq", "RNAseq", "RNAseq", "Proteomics", "RNAseq"),
  featureset_id = c("FS_GLOBAL", "FS_GLOBAL", "FS_GLOBAL", "FS_GLOBAL", "FS_PROT", "FS_GLOBAL"),
  outcome_id = c("O_case", "O_case", "O_ctrl", "O_case", "O_ctrl", "O_ctrl"),
  stringsAsFactors = FALSE
)
meta
```

This is still a small example, but it already contains enough structure to make naive random splitting risky.

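Even before any graph is built, these overlaps can be spot-checked with plain base R against the `meta` frame above; the point of `splitGraph` is to make such checks systematic and typed rather than ad hoc. A minimal sanity check (base R only, no package functions):

```{r raw-overlap-check}
# Subjects contributing more than one sample (repeated measures)
sample_counts <- table(meta$subject_id)
sample_counts[sample_counts > 1]

# Subjects appearing in more than one study (cross-study reuse)
subject_study <- unique(meta[, c("subject_id", "study_id")])
study_counts <- table(subject_study$subject_id)
names(study_counts[study_counts > 1])
```

Checks like these scale poorly as the number of relationship types grows, which is exactly the gap the dependency graph fills.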
## Fast path: `graph_from_metadata()`

When your metadata already use the canonical column names (`sample_id`, `subject_id`, `batch_id`, `study_id`, `timepoint_id`, `time_index`, `assay_id`, `featureset_id`, `outcome_id` / `outcome_value`), `graph_from_metadata()` does ingestion, typed node construction, canonical edge construction, and optional `timepoint_precedes` derivation in a single call:

```{r fast-path}
quick_graph <- graph_from_metadata(
  data.frame(
    sample_id = c("S1", "S2", "S3", "S4", "S5", "S6"),
    subject_id = c("P1", "P1", "P2", "P2", "P3", "P3"),
    batch_id = c("B1", "B2", "B1", "B2", "B1", "B2"),
    timepoint_id = c("T0", "T1", "T0", "T1", "T0", "T1"),
    time_index = c(0, 1, 0, 1, 0, 1),
    outcome_value = c(0, 1, 0, 1, 1, 0)
  ),
  graph_name = "quick_demo"
)
quick_graph
```

The rest of this vignette uses the explicit constructor path because it lets us show node attributes (`time_index`, `visit_label`, `platform`, `derivation_scope`) and non-canonical edges (`featureset_generated_from_*`, `subject_has_outcome`) that `graph_from_metadata()` does not build for you. Use `graph_from_metadata()` when the canonical columns are enough; use the explicit path when you need custom attributes or extra relations.

## Ingest metadata and build typed nodes and edges

The first step is to standardize metadata and then turn each entity type into canonical graph nodes. Sample-level relations become typed edges.

```{r construction}
meta <- ingest_metadata(meta, dataset_name = "VignetteDemo")

sample_nodes <- create_nodes(meta, type = "Sample", id_col = "sample_id")
subject_nodes <- create_nodes(meta, type = "Subject", id_col = "subject_id")
batch_nodes <- create_nodes(meta, type = "Batch", id_col = "batch_id")
study_nodes <- create_nodes(meta, type = "Study", id_col = "study_id")

time_nodes <- create_nodes(
  data.frame(
    timepoint_id = c("T0", "T1", "T2"),
    time_index = c(0L, 1L, 2L),
    visit_label = c("baseline", "follow_up", "late_follow_up"),
    stringsAsFactors = FALSE
  ),
  type = "Timepoint",
  id_col = "timepoint_id",
  attr_cols = c("time_index", "visit_label")
)

assay_nodes <- create_nodes(
  data.frame(
    assay_id = c("RNAseq", "Proteomics"),
    modality = c("transcriptomics", "proteomics"),
    platform = c("NovaSeq", "Orbitrap"),
    stringsAsFactors = FALSE
  ),
  type = "Assay",
  id_col = "assay_id",
  attr_cols = c("modality", "platform")
)

featureset_nodes <- create_nodes(
  data.frame(
    featureset_id = c("FS_GLOBAL", "FS_PROT"),
    featureset_name = c("global_rna_signature", "proteomics_panel"),
    derivation_scope = c("per_dataset", "external"),
    feature_count = c(500L, 80L),
    stringsAsFactors = FALSE
  ),
  type = "FeatureSet",
  id_col = "featureset_id",
  attr_cols = c("featureset_name", "derivation_scope", "feature_count")
)

outcome_nodes <- create_nodes(
  data.frame(
    outcome_id = c("O_case", "O_ctrl"),
    outcome_name = c("response", "response"),
    outcome_type = c("binary", "binary"),
    observation_level = c("subject", "subject"),
    stringsAsFactors = FALSE
  ),
  type = "Outcome",
  id_col = "outcome_id",
  attr_cols = c("outcome_name", "outcome_type", "observation_level")
)

subject_edges <- create_edges(
  meta, "sample_id", "subject_id",
  "Sample", "Subject", "sample_belongs_to_subject"
)
batch_edges <- create_edges(
  meta, "sample_id", "batch_id",
  "Sample", "Batch", "sample_processed_in_batch",
  allow_missing = TRUE
)
study_edges <- create_edges(
  meta, "sample_id", "study_id",
  "Sample", "Study", "sample_from_study"
)
time_edges <- create_edges(
  meta, "sample_id", "timepoint_id",
  "Sample", "Timepoint", "sample_collected_at_timepoint",
  allow_missing = TRUE
)
assay_edges <- create_edges(
  meta, "sample_id", "assay_id",
  "Sample", "Assay", "sample_measured_by_assay"
)
featureset_edges <- create_edges(
  meta, "sample_id", "featureset_id",
  "Sample", "FeatureSet", "sample_uses_featureset"
)

outcome_edges <- create_edges(
  data.frame(
    subject_id = c("P1", "P2", "P3", "P4"),
    outcome_id = c("O_case", "O_ctrl", "O_case", "O_ctrl"),
    stringsAsFactors = FALSE
  ),
  "subject_id", "outcome_id",
  "Subject", "Outcome", "subject_has_outcome"
)

precedence_edges <- create_edges(
  data.frame(
    from_timepoint = c("T0", "T1"),
    to_timepoint = c("T1", "T2"),
    stringsAsFactors = FALSE
  ),
  "from_timepoint", "to_timepoint",
  "Timepoint", "Timepoint", "timepoint_precedes"
)

featureset_from_study <- create_edges(
  data.frame(featureset_id = "FS_GLOBAL", study_id = "ST1", stringsAsFactors = FALSE),
  "featureset_id", "study_id",
  "FeatureSet", "Study", "featureset_generated_from_study"
)

featureset_from_batch <- create_edges(
  data.frame(featureset_id = "FS_GLOBAL", batch_id = "B1", stringsAsFactors = FALSE),
  "featureset_id", "batch_id",
  "FeatureSet", "Batch", "featureset_generated_from_batch"
)
```

The node and edge tables are canonical and typed. The package assigns globally unique node IDs such as `sample:S1` and `subject:P1`, so different entity types cannot collide accidentally.

```{r construction-output}
sample_nodes
as.data.frame(sample_nodes)[, c("node_id", "node_type", "node_key", "label")]

edge_preview <- do.call(rbind, lapply(
  list(
    subject_edges, batch_edges, study_edges, time_edges, assay_edges,
    featureset_edges, outcome_edges, precedence_edges,
    featureset_from_study, featureset_from_batch
  ),
  as.data.frame
))
edge_preview[, c("from", "to", "edge_type")]
```

The node table shows the canonical sample IDs that everything else refers to.
The edge table shows the package's central design choice: dependency structure is explicit, typed, and inspectable.

## Assemble the dependency graph

```{r graph}
graph <- build_dependency_graph(
  nodes = list(
    sample_nodes, subject_nodes, batch_nodes, study_nodes,
    time_nodes, assay_nodes, featureset_nodes, outcome_nodes
  ),
  edges = list(
    subject_edges, batch_edges, study_edges, time_edges, assay_edges,
    featureset_edges, outcome_edges, precedence_edges,
    featureset_from_study, featureset_from_batch
  ),
  graph_name = "vignette_graph",
  dataset_name = attr(meta, "dataset_name")
)
graph
summary(graph)
```

At this point the package has a single `dependency_graph` object with both tabular and `igraph` representations behind it. The summary is useful because it tells you exactly which entity types and relation types are present before you derive any split rules.

### Visualize the typed structure

`plot()` renders the graph with a typed, layered layout: `Sample` on top, peer dependencies (`Subject`, `Batch`, `Study`, `Timepoint`) in the middle band, `Assay`/`FeatureSet` below that, and `Outcome` at the bottom. Node colors are keyed to type and an auto-generated legend is drawn by default.

```{r plot, fig.width = 7, fig.height = 5}
plot(graph)
```

Useful options:

```{r plot-options, eval = FALSE}
plot(graph, layout = "sugiyama")         # alternative hierarchical layout
plot(graph, show_labels = FALSE)         # hide labels on dense graphs
plot(graph, legend = FALSE)              # suppress the legend
plot(graph, legend_position = "bottomright")
plot(graph, node_colors = c(Sample = "#000000"))
```

## Validate before you split

Validation is where `splitGraph` starts paying off. The graph below is structurally valid, but it still carries leakage-relevant warnings and advisories.

```{r validation}
validation <- validate_graph(graph)
validation
as.data.frame(validation)[, c("level", "severity", "code", "message")]
```

That output is the core value proposition of the package in one place:

- repeated subjects are surfaced explicitly
- cross-study subject overlap is surfaced explicitly
- full-dataset feature provenance is surfaced explicitly
- heavy batch reuse is surfaced explicitly

`valid = TRUE` here means the graph has no errors. It does not mean the dataset is free of leakage risk. Warnings and advisories still matter.

The package is also intentionally strict about silent failure. If you ask for a subset of samples and some of them do not resolve, it errors instead of dropping them.

```{r strictness}
tryCatch(
  derive_split_constraints(graph, mode = "subject", samples = c("S1", "BAD")),
  error = function(e) e$message
)
```

That behavior is important in practice because quietly omitting samples would change the truth of the split problem.

## Query the graph to inspect hidden structure

You can inspect local provenance, trace paths, and project direct sample dependencies.

```{r neighbors-and-paths}
neighbors_s1 <- query_neighbors(graph, node_ids = "sample:S1", direction = "out")
neighbors_s1
as.data.frame(neighbors_s1)[, c("seed_node_id", "node_id", "node_type", "edge_type")]

subject_outcome_path <- query_shortest_paths(
  graph,
  from = "sample:S1",
  to = "outcome:O_case",
  edge_types = c("sample_belongs_to_subject", "subject_has_outcome")
)
subject_outcome_path
as.data.frame(subject_outcome_path)
```

The first query shows everything the graph knows directly about `S1`. The second shows that `S1` reaches the subject-level outcome through its subject node, which is exactly the kind of relationship that would stay implicit in a plain metadata table.

```{r projected-dependencies}
shared_dependencies <- detect_shared_dependencies(
  graph,
  via = c("Subject", "Batch", "FeatureSet")
)
as.data.frame(shared_dependencies)[, c(
  "sample_id_1", "sample_id_2", "shared_node_type", "shared_node_id", "edge_type"
)]

dependency_components <- detect_dependency_components(
  graph,
  via = c("Subject", "Batch")
)
as.data.frame(dependency_components)
```

These projected queries are useful because they answer the splitting question directly. They tell you which samples should be treated as structurally linked, not just which metadata columns happen to match.

## Derive split constraints from the graph

`splitGraph` can derive direct constraints for subject, batch, study, and time, as well as composite constraints that combine multiple dependency sources.

```{r constraints}
subject_constraint <- derive_split_constraints(graph, mode = "subject")
batch_constraint <- derive_split_constraints(graph, mode = "batch")
study_constraint <- derive_split_constraints(graph, mode = "study")
time_constraint <- derive_split_constraints(graph, mode = "time")

strict_constraint <- derive_split_constraints(
  graph,
  mode = "composite",
  strategy = "strict",
  via = c("Subject", "Batch")
)

rule_based_constraint <- derive_split_constraints(
  graph,
  mode = "composite",
  strategy = "rule_based",
  priority = c("batch", "study", "subject", "time")
)

constraint_overview <- do.call(rbind, lapply(
  list(
    subject = subject_constraint,
    batch = batch_constraint,
    study = study_constraint,
    time = time_constraint,
    composite_strict = strict_constraint,
    composite_rule = rule_based_constraint
  ),
  function(x) {
    data.frame(
      strategy = x$strategy,
      groups = length(unique(x$sample_map$group_id)),
      warnings = if (is.null(x$metadata$warnings)) 0L else length(x$metadata$warnings),
      stringsAsFactors = FALSE
    )
  }
))
constraint_overview <- cbind(constraint = row.names(constraint_overview), constraint_overview)
row.names(constraint_overview) <- NULL
constraint_overview
```

That summary already shows why the package is useful: different notions of dependency produce different splitting units.

### Batch constraints

```{r batch-constraint}
batch_constraint
as.data.frame(batch_constraint)[, c("sample_id", "group_id", "group_label", "explanation")]
```

Batch grouping keeps all `B1` samples together and preserves `S5` as an explicit singleton because it has no batch assignment. Missing structure is not hidden.

### Time constraints

```{r time-constraint}
time_constraint
as.data.frame(time_constraint)[, c("sample_id", "group_id", "timepoint_id", "order_rank")]
```

Time grouping adds `order_rank`, which is the field downstream tooling actually needs for ordered evaluation. The missing timepoint on `S5` stays visible as `NA`, so the ordering is partial rather than falsely complete.

### Composite constraints

```{r composite-constraints}
strict_constraint
as.data.frame(strict_constraint)[, c("sample_id", "group_id", "constraint_type")]

rule_based_constraint
as.data.frame(rule_based_constraint)[, c("sample_id", "group_id", "constraint_type", "group_label")]
```

The strict composite constraint uses transitive closure: `S1`, `S2`, `S3`, and `S6` end up in the same group because subject and batch links connect them into one dependency component. The rule-based composite constraint is different: it uses the highest-priority available dependency per sample, so `S5` falls back to study-level grouping instead of becoming a composite component.

## Time ordering can come from precedence edges alone

If explicit `time_index` metadata are unavailable, `splitGraph` can still infer time order from `timepoint_precedes` edges.

```{r precedence-only}
precedence_meta <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  subject_id = c("P1", "P1", "P2"),
  study_id = c("ST1", "ST1", "ST2"),
  timepoint_id = c("T0", "T1", "T2"),
  stringsAsFactors = FALSE
)

precedence_graph <- build_dependency_graph(
  nodes = list(
    create_nodes(precedence_meta, type = "Sample", id_col = "sample_id"),
    create_nodes(precedence_meta, type = "Subject", id_col = "subject_id"),
    create_nodes(precedence_meta, type = "Study", id_col = "study_id"),
    create_nodes(
      data.frame(timepoint_id = c("T0", "T1", "T2"), stringsAsFactors = FALSE),
      type = "Timepoint",
      id_col = "timepoint_id"
    )
  ),
  edges = list(
    create_edges(
      precedence_meta, "sample_id", "subject_id",
      "Sample", "Subject", "sample_belongs_to_subject"
    ),
    create_edges(
      precedence_meta, "sample_id", "study_id",
      "Sample", "Study", "sample_from_study"
    ),
    create_edges(
      precedence_meta, "sample_id", "timepoint_id",
      "Sample", "Timepoint", "sample_collected_at_timepoint"
    ),
    create_edges(
      data.frame(
        from_timepoint = c("T0", "T1"),
        to_timepoint = c("T1", "T2"),
        stringsAsFactors = FALSE
      ),
      "from_timepoint", "to_timepoint",
      "Timepoint", "Timepoint", "timepoint_precedes"
    )
  ),
  graph_name = "precedence_only_graph"
)

precedence_time_constraint <- derive_split_constraints(precedence_graph, mode = "time")
precedence_time_constraint$metadata$time_order_source
as.data.frame(precedence_time_constraint)[, c("sample_id", "timepoint_id", "time_index", "order_rank")]
```

The important detail is that ordering is still derived, but the source is `timepoint_precedes` rather than `time_index`.

## Translate the constraint into a split specification

The graph-derived constraint is not the end of the workflow. The main handoff target is a canonical sample-level split specification — the `split_spec` class. Downstream tools consume it through their own adapters, so `split_spec` stays tool-agnostic.

```{r split-spec}
split_spec <- as_split_spec(strict_constraint, graph = graph)
split_spec
as.data.frame(split_spec)[, c(
  "sample_id", "group_id", "batch_group", "study_group", "timepoint_id", "order_rank"
)]

split_spec_validation <- validate_split_spec(split_spec)
split_spec_validation
as.data.frame(split_spec_validation)
```

This translation step is where the package becomes operational for downstream evaluation workflows:

- `group_id` carries the split unit
- `batch_group` and `study_group` are available for blocking
- `order_rank` is available for ordered evaluation
- the generated object is validated before handoff

## Summarize the leakage picture in one object

The final helper combines graph validation, constraint diagnostics, and split-spec readiness into one summary object.

```{r risk-summary}
risk_summary <- summarize_leakage_risks(
  graph,
  constraint = strict_constraint,
  split_spec = split_spec
)
risk_summary
as.data.frame(risk_summary)[, c("source", "severity", "category", "message")]
```

This is a useful stopping point before model training. It gives you one place to review whether the graph is structurally sound, whether the chosen constraint is overly singleton-heavy, and whether the downstream split spec is ready to use.

## Downstream handoff

`split_spec` is the tool-agnostic handoff artifact. `splitGraph` does not know about any particular resampling package — downstream consumers provide their own adapters so `splitGraph` stays neutral. The typical end-to-end flow is:

1. `graph_from_metadata()` (or the explicit constructor path) → typed `dependency_graph`
2. `derive_split_constraints(g, mode = ...)` → `split_constraint`
3. `as_split_spec(constraint, graph = g)` → `split_spec`
4. adapter in the downstream package → native resamples

The `sample_data` frame carried by `split_spec` exposes exactly the columns downstream adapters consume: `sample_id` for joining against the observation frame, `group_id` for grouped resampling, `batch_group` / `study_group` for blocking, and `order_rank` for ordered evaluation. Adapters can be built by any package that wants to consume a `split_spec` — for example, on top of `rsample::group_vfold_cv()` (grouped CV keyed to `group_id`) or `rsample::rolling_origin()` (ordered evaluation keyed to `order_rank`).

## Case studies

The end-to-end workflow above shows the package surface. The case studies below show how the same graph leads to different evaluation decisions depending on the scientific question.

### Case study 1: repeated subjects in a longitudinal cohort

Suppose the real question is whether future observations from the same subject should be held out from training. In this setting, subject reuse and time ordering both matter, but they solve different problems.

```{r case-study-1}
subject_groups <- grouping_vector(subject_constraint)
time_groups <- time_constraint$sample_map[, c("sample_id", "group_id", "timepoint_id", "order_rank")]

subject_groups
time_groups
```

Interpretation:

- `S1` and `S2` share subject `P1`, so subject-grouped evaluation keeps them together.
- `S3` and `S6` share subject `P2`, so they also stay together under a subject-based split.
- time grouping adds a different axis: `T0`, `T1`, and `T2` become ordered units with explicit `order_rank`.

If the leakage concern is repeated measurements from the same individual, use the subject constraint. If the evaluation question is prospective prediction, the time constraint adds the ordering information you need.

### Case study 2: a subject reused across studies

The graph intentionally includes subject `P2` in both `ST1` and `ST2`.
A study-only split would treat those studies as separate units, but the graph shows that subject overlap breaks the intended independence.

```{r case-study-2}
cross_study_issues <- as.data.frame(validation)[
  as.data.frame(validation)$code == "subject_cross_study_overlap",
  c("severity", "code", "message")
]

p2_shared <- detect_shared_dependencies(
  graph,
  via = "Subject",
  samples = c("S3", "S6")
)

study_only_map <- study_constraint$sample_map[, c("sample_id", "group_id", "group_label")]
strict_map <- strict_constraint$sample_map[, c("sample_id", "group_id", "constraint_type")]

cross_study_issues
as.data.frame(p2_shared)
study_only_map[study_only_map$sample_id %in% c("S3", "S6"), ]
strict_map[strict_map$sample_id %in% c("S3", "S6"), ]
```

Interpretation:

- validation surfaces the cross-study subject overlap directly
- the shared-dependency query confirms that `S3` and `S6` are linked through the same subject
- a study-only split would place them in different groups (`ST1` versus `ST2`)
- the strict composite constraint correctly keeps them in the same dependency component

This is exactly the kind of failure mode `splitGraph` is designed to expose: metadata columns suggest a legitimate study split, but graph structure shows that the split would still leak subject information.

### Case study 3: partially observed technical metadata

Real metadata are rarely complete. Here, `S5` has no batch assignment and no timepoint assignment. The package does not pretend those fields exist. It keeps the sample visible and tells you how the split logic handled it.

```{r case-study-3}
batch_missing <- batch_constraint$sample_map[
  batch_constraint$sample_map$sample_id == "S5",
  c("sample_id", "group_id", "group_label", "explanation")
]

rule_based_missing <- rule_based_constraint$sample_map[
  rule_based_constraint$sample_map$sample_id == "S5",
  c("sample_id", "group_id", "constraint_type", "group_label", "explanation")
]

split_spec_missing <- as.data.frame(split_spec)[
  as.data.frame(split_spec)$sample_id == "S5",
  c("sample_id", "group_id", "batch_group", "study_group", "timepoint_id", "order_rank")
]

batch_missing
rule_based_missing
split_spec_missing
```

Interpretation:

- batch-based splitting keeps `S5` as an explicit singleton because batch metadata are missing
- the rule-based composite strategy falls back to study-level grouping for `S5`
- the translated split specification preserves the missing batch and time fields as `NA` rather than silently inventing values

That behavior matters because incomplete metadata are common. `splitGraph` stays strict about what is known, but still produces a usable, inspectable split object.

### Case study 4: choosing a defensible split strategy

A typical practical question is not "what can the package compute?" but "which constraint should I actually use?" The answer depends on which dependency source is scientifically unacceptable to leak across train and test.

```{r case-study-4}
strategy_summary <- data.frame(
  constraint = c("subject", "batch", "study", "time", "composite_strict", "composite_rule"),
  groups = c(
    length(unique(subject_constraint$sample_map$group_id)),
    length(unique(batch_constraint$sample_map$group_id)),
    length(unique(study_constraint$sample_map$group_id)),
    length(unique(time_constraint$sample_map$group_id)),
    length(unique(strict_constraint$sample_map$group_id)),
    length(unique(rule_based_constraint$sample_map$group_id))
  ),
  warnings = c(
    length(or_empty(subject_constraint$metadata$warnings)),
    length(or_empty(batch_constraint$metadata$warnings)),
    length(or_empty(study_constraint$metadata$warnings)),
    length(or_empty(time_constraint$metadata$warnings)),
    length(or_empty(strict_constraint$metadata$warnings)),
    length(or_empty(rule_based_constraint$metadata$warnings))
  ),
  recommended_resampling = c(
    as_split_spec(subject_constraint, graph = graph)$recommended_resampling,
    as_split_spec(batch_constraint, graph = graph)$recommended_resampling,
    as_split_spec(study_constraint, graph = graph)$recommended_resampling,
    as_split_spec(time_constraint, graph = graph)$recommended_resampling,
    as_split_spec(strict_constraint, graph = graph)$recommended_resampling,
    as_split_spec(rule_based_constraint, graph = graph)$recommended_resampling
  ),
  stringsAsFactors = FALSE
)
strategy_summary
```

Interpretation:

- subject grouping is the right default when repeated individuals are the dominant leakage source
- batch grouping is appropriate when technical runs are the main contamination risk
- study grouping is useful for cross-study generalization only when no higher-level dependency crosses study boundaries
- strict composite grouping is the safest choice when multiple dependency sources can connect samples transitively
- rule-based composite grouping is a pragmatic fallback when you want a single deterministic hierarchy over partially observed metadata

The package does not choose the scientific objective for you. It makes the trade-off visible and auditable.

## When `splitGraph` is useful

`splitGraph` is a good fit when:

- sample relationships are scientifically meaningful and must influence evaluation
- metadata contain repeated subjects, shared batches, multiple studies, or temporal structure
- feature provenance or outcome level matters for leakage assessment
- you want deterministic, inspectable split constraints instead of ad hoc grouping code

## What `splitGraph` is not for

`splitGraph` is not:

- a general biological network analysis package
- a model training framework
- a resampling engine
- a substitute for downstream performance auditing

Its value is earlier in the workflow: it makes dependency structure explicit so that the split design itself can be justified.

## Takeaway

If you already know your data have repeated subjects, reused batches, temporal ordering, or shared feature provenance, then you already have a graph problem whether you model it explicitly or not. `splitGraph` is useful because it turns that hidden graph into an object you can validate, query, and convert into a split design that downstream tooling can trust.
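
## Appendix: sketch of a downstream adapter

As a closing illustration of the handoff step, this is roughly what a downstream adapter could look like. It is a sketch only: `splitGraph` ships no rsample adapter, the chunk is not evaluated, and it assumes the `split_spec` object built above plus an installed `rsample`.

```{r adapter-sketch, eval = FALSE}
# Hypothetical downstream adapter (not part of splitGraph):
# grouped cross-validation keyed to the split unit in `group_id`.
sample_data <- as.data.frame(split_spec)

resamples <- rsample::group_vfold_cv(
  sample_data,
  group = group_id,  # split unit derived from the dependency graph
  v = 2              # small v because the toy data has few groups
)
resamples
```

An ordered-evaluation adapter would follow the same pattern, sorting `sample_data` by `order_rank` before handing it to `rsample::rolling_origin()`.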