---
title: "Efficient Storage of Imputed Data"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Efficient Storage of Imputed Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

When performing multiple imputation with [`{rbmi}`](https://cran.r-project.org/package=rbmi) using many imputations (e.g., 100-1000), the full imputed dataset can become very large. However, most of this data is redundant: observed values are identical across all imputations.

The `{rbmiUtils}` package provides two functions to address this:

* `reduce_imputed_data()`: Extract only the imputed values (originally missing)
* `expand_imputed_data()`: Reconstruct the full dataset when needed

This approach can reduce storage requirements by 90% or more, depending on the proportion of missing data.

# The Storage Problem

Consider a typical clinical trial dataset:

* 500 subjects
* 5 visits per subject = 2,500 rows
* 5% missing data = 125 missing values
* 1,000 imputations

**Full storage**: 2,500 rows × 1,000 imputations = **2.5 million rows**

**Reduced storage**: 125 missing values × 1,000 imputations = **125,000 rows** (5% of full size)

# Setup

```{r libraries, message = FALSE, warning = FALSE}
library(dplyr)
library(rbmi)
library(rbmiUtils)
```

# Example with Package Data

The `{rbmiUtils}` package includes example datasets we can use:

```{r load-data}
data("ADMI", package = "rbmiUtils")   # Full imputed dataset
data("ADEFF", package = "rbmiUtils")  # Original data with missing values

# Check dimensions
cat("Full imputed dataset (ADMI):", nrow(ADMI), "rows\n")
cat("Number of imputations:", length(unique(ADMI$IMPID)), "\n")
```

# Reducing Imputed Data

First, prepare the original data to match the imputed data structure:

```{r prepare-original}
original <- ADEFF |>
  mutate(
    TRT = TRT01P,
    USUBJID = as.character(USUBJID)
  )

# Count missing values
n_missing <- sum(is.na(original$CHG))
cat("Missing values in original data:", n_missing, "\n")
```

Define the variables specification:

```{r define-vars}
vars <- set_vars(
  subjid = "USUBJID",
  visit = "AVISIT",
  group = "TRT",
  outcome = "CHG"
)
```

Now reduce the imputed data:

```{r reduce}
reduced <- reduce_imputed_data(ADMI, original, vars)

cat("Full imputed rows:", nrow(ADMI), "\n")
cat("Reduced rows:", nrow(reduced), "\n")
cat("Compression ratio:", round(100 * nrow(reduced) / nrow(ADMI), 1), "%\n")
```

# What's in the Reduced Data?

The reduced dataset contains only the rows that were originally missing:

```{r examine-reduced}
# First few rows
head(reduced)

# Structure matches original imputed data
cat("\nColumns in reduced data:\n")
cat(paste(names(reduced), collapse = ", "))
```

Each row represents an imputed value for a specific subject-visit-imputation combination.
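As a rough check of the savings in memory, we can compare the in-memory size of the full and reduced datasets with base R's `object.size()`; exact figures will vary with your data and column types:

```{r object-size}
# Approximate in-memory footprint of the full vs. reduced datasets
format(object.size(ADMI), units = "MB")
format(object.size(reduced), units = "MB")
```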
# Expanding Back to Full Data

When you need to run analyses, expand the reduced data back to full form:

```{r expand}
expanded <- expand_imputed_data(reduced, original, vars)

cat("Expanded rows:", nrow(expanded), "\n")
cat("Original ADMI rows:", nrow(ADMI), "\n")
```

# Verifying Data Integrity

Let's verify that the round trip preserves data integrity:

```{r verify}
# Sort both datasets for comparison
admi_sorted <- ADMI |>
  arrange(IMPID, USUBJID, AVISIT)

expanded_sorted <- expanded |>
  arrange(IMPID, USUBJID, AVISIT)

# Compare CHG values
all_equal <- all.equal(
  admi_sorted$CHG,
  expanded_sorted$CHG,
  tolerance = 1e-10
)
cat("Data integrity check:", all_equal, "\n")
```

# Practical Workflow

Here's how to integrate efficient storage into your workflow:

## Save Reduced Data

```{r save-workflow, eval = FALSE}
# After imputation
impute_obj <- impute(
  draws_obj,
  references = c("Placebo" = "Placebo", "Drug A" = "Placebo")
)
full_imputed <- get_imputed_data(impute_obj)

# Reduce for storage
reduced <- reduce_imputed_data(full_imputed, original_data, vars)

# Save both (reduced is much smaller)
saveRDS(reduced, "imputed_reduced.rds")
saveRDS(original_data, "original_data.rds")
```

## Load and Analyse

```{r load-workflow, eval = FALSE}
# Load saved data
reduced <- readRDS("imputed_reduced.rds")
original_data <- readRDS("original_data.rds")

# Expand when needed for analysis
full_imputed <- expand_imputed_data(reduced, original_data, vars)

# Run analysis
ana_obj <- analyse_mi_data(
  data = full_imputed,
  vars = vars,
  method = method,
  fun = ancova
)
```

# Storage Comparison

Here's a comparison of storage requirements for different scenarios:

| Subjects | Visits | Missing % | Imputations | Full Rows | Reduced Rows | Savings |
|----------|--------|-----------|-------------|-----------|--------------|---------|
| 500      | 5      | 5%        | 100         | 250,000   | 12,500       | 95%     |
| 500      | 5      | 5%        | 1,000       | 2,500,000 | 125,000      | 95%     |
| 1,000    | 8      | 10%       | 500         | 4,000,000 | 400,000      | 90%     |
| 200      | 4      | 20%       | 1,000       | 800,000   | 160,000      | 80%     |

Two factors drive the savings:

* **Proportion of missing data**: a lower missing % means greater relative savings
* **Number of imputations**: more imputations give the same relative savings, but a larger absolute reduction

# When to Use This Approach

**Use reduced storage when:**

* Running many imputations (100+)
* Saving imputed data for later analysis
* Sharing data between team members
* Working with memory constraints

**Keep full data when:**

* Working interactively with few imputations
* Performing exploratory analysis
* Storage is not a concern

# Edge Cases

## No Missing Data

If the original data has no missing values, `reduce_imputed_data()` returns an empty data frame:

```{r no-missing, eval = FALSE}
# If original has no missing values
reduced <- reduce_imputed_data(full_imputed, complete_data, vars)
nrow(reduced)
#> [1] 0

# expand_imputed_data handles this correctly
expanded <- expand_imputed_data(reduced, complete_data, vars)
# Returns original data with IMPID = "1"
```

## Single Imputation

The functions work with any number of imputations, including just one.

# Summary

The `reduce_imputed_data()` and `expand_imputed_data()` functions provide an efficient way to store imputed datasets:

1. **Reduce** after imputation to store only what's necessary
2. **Expand** before analysis to reconstruct full datasets
3. **Verify** that data integrity is preserved through the round trip

This approach is particularly valuable when working with large numbers of imputations or when storage and memory are constrained.

For the complete analysis workflow using imputed data, see `vignette('pipeline')`.
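As a rough cross-check of the storage comparison table above, the row counts can be reproduced with a few lines of arithmetic. The `estimate_rows()` helper below is a minimal sketch for illustration only and is not part of `{rbmiUtils}`:

```{r estimate-rows}
# Hypothetical helper (illustration only, not part of {rbmiUtils}):
# rough full vs. reduced row counts for a given trial size
estimate_rows <- function(subjects, visits, missing_prop, imputations) {
  full    <- subjects * visits * imputations
  reduced <- subjects * visits * missing_prop * imputations
  data.frame(full = full, reduced = reduced, savings = 1 - reduced / full)
}

# First row of the comparison table:
# 500 subjects, 5 visits, 5% missing, 100 imputations
estimate_rows(500, 5, 0.05, 100)
```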