--- title: "Data Preparation and Integration" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Data Preparation and Integration} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE, purl = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE, message = FALSE, warning = FALSE ) ``` ## Overview Before fitting any model in mlumr, you need to: 1. Set up the IPD with `set_ipd()` 2. Set up the AgD with `set_agd()` 3. Combine them with `combine_data()` 4. (For ML-UMR only) Add integration points with `add_integration()` ## Setting up IPD IPD should be a data frame with one row per patient, containing: - A treatment indicator column - An outcome column (binary 0/1, continuous numeric, or non-negative integer counts depending on `family`) - Covariate columns - For Poisson family: an exposure column ```{r, eval = TRUE} library(mlumr) set.seed(2026) # Example: Trial A data (index treatment) trial_a <- data.frame( patient_id = 1:500, treatment = "Drug_A", response = rbinom(500, 1, 0.6), age_group = rbinom(500, 1, 0.4), # 0 = young, 1 = old sex = rbinom(500, 1, 0.55) # 0 = male, 1 = female ) ipd <- set_ipd( data = trial_a, treatment = "treatment", outcome = "response", covariates = c("age_group", "sex") ) ipd ``` ### Continuous outcomes For continuous outcomes, use `family = "normal"`. The IPD frame needs a numeric outcome column: ```{r, eval = TRUE} trial_a_normal <- data.frame( patient_id = 1:500, treatment = "Drug_A", score = rnorm(500, mean = 3.0, sd = 1.2), age_group = rbinom(500, 1, 0.4), sex = rbinom(500, 1, 0.55) ) ipd_normal <- set_ipd( data = trial_a_normal, treatment = "treatment", outcome = "score", covariates = c("age_group", "sex"), family = "normal" ) ``` ### Count outcomes For Poisson count data, use `family = "poisson"` and supply an `exposure` column: ```{r, eval = TRUE} trial_a_poisson <- data.frame( patient_id = 1:500, treatment = "Drug_A", events = rpois(500, lambda = 0.8), person_years = runif(500, 0.5, 2.0), age_group = rbinom(500, 1, 0.4), sex = rbinom(500, 1, 0.55) ) ipd_poisson <- set_ipd( data = trial_a_poisson, treatment = "treatment", outcome = "events", covariates = c("age_group", "sex"), family = "poisson", exposure = "person_years" ) ``` ### Validation `set_ipd()` checks that: - All specified columns exist in the data - The outcome matches the family (binary for binomial, numeric for normal, non-negative integers for Poisson) - Missing values are handled (rows dropped with a warning) ```{r, eval = TRUE, error = TRUE, purl = FALSE} # This will error: non-binary outcome with binomial family bad_data <- data.frame(trt = "A", outcome = c(0, 1, 2), x = rnorm(3)) set_ipd(bad_data, "trt", "outcome", "x") ``` ## Setting up AgD AgD should be a data frame with one row per study/subgroup, containing: - Treatment indicator - Outcome summaries (depend on `family`): - **Binomial**: sample size (`outcome_n`) and number of events (`outcome_r`) - **Normal**: mean outcome (`outcome_mean`) and standard error (`outcome_se`), optionally sample size (`outcome_n`) - **Poisson**: number of events (`outcome_r`) and total exposure (`outcome_E`) - Covariate means/proportions ```{r, eval = TRUE} # Example: Trial B data (comparator treatment) trial_b <- data.frame( study = "Trial_B", treatment = "Drug_B", n_total = 400, n_events = 160, age_group_mean = 0.35, # proportion in "old" group sex_prop = 0.50 # proportion female ) agd <- set_agd( data = trial_b, treatment = "treatment", outcome_n = "n_total", outcome_r = "n_events", cov_means = c("age_group_mean", "sex_prop"), cov_types = c("binary", "binary") ) ``` ### AgD for continuous outcomes ```{r, eval = TRUE} agd_normal <- set_agd( data = data.frame( trt = "Drug_B", y_mean = 3.2, se = 0.15, n = 400, age_group_mean = 0.35, sex_prop = 0.50 ), treatment = "trt", family = "normal", outcome_mean = "y_mean", outcome_se = "se", outcome_n = "n", cov_means = c("age_group_mean", "sex_prop"), cov_types = c("binary", "binary") ) ``` **Scale requirement for normal outcomes.** `set_agd()` assumes the AgD mean (`outcome_mean`) and SE (`outcome_se`) are on the **arithmetic (original, untransformed) scale**, regardless of whether `mlumr()` is later called with `link = "identity"` or `link = "log"`. If a publication reports only a log-scale mean and SD, or a geometric mean, back-transform before calling `set_agd()` and propagate uncertainty via the delta method; passing log-scale summaries silently biases the posterior because the Stan likelihood is `normal(E[exp(eta)], se_agd)` under the log link, expecting the AgD quantities already on the outcome scale. ### AgD for count outcomes ```{r, eval = TRUE} agd_poisson <- set_agd( data = data.frame( trt = "Drug_B", n_events = 120, person_years = 800, age_group_mean = 0.35, sex_prop = 0.50 ), treatment = "trt", family = "poisson", outcome_r = "n_events", outcome_E = "person_years", cov_means = c("age_group_mean", "sex_prop"), cov_types = c("binary", "binary") ) ``` ### Covariate naming `set_agd()` automatically strips `_mean` and `_prop` suffixes from covariate names to match the IPD covariate names: - `"age_group_mean"` -> covariate name `"age_group"` - `"sex_prop"` -> covariate name `"sex"` Make sure the resulting names match those in your IPD. ### Continuous covariates For continuous covariates, also supply standard deviations: ```{r, eval = TRUE} agd_continuous <- set_agd( data = data.frame( trt = "B", n_total = 400, n_events = 160, bmi_mean = 25.3, bmi_sd = 4.2 ), treatment = "trt", outcome_n = "n_total", outcome_r = "n_events", cov_means = "bmi_mean", cov_sds = "bmi_sd", cov_types = "continuous" ) ``` ## Combining data ```{r, eval = TRUE} dat <- combine_data(ipd, agd) print(dat) ``` ## Adding integration points Integration points are needed for ML-UMR's numerical integration over the comparator population's covariate distribution. They are generated using Sobol quasi-Monte Carlo sequences with a Gaussian copula to respect covariate correlations. ### Basic usage ```{r, eval = TRUE} dat <- add_integration( dat, n_int = 64, age_group = distr(qbern, prob = age_group_mean), sex = distr(qbern, prob = sex_mean) ) ``` Key points: - **`n_int`**: Number of integration points (powers of 2 recommended: 32, 64, 128, 256). More points = better accuracy but slower Stan fitting. - **`distr()`**: Wraps a quantile function with parameters. Parameters can reference columns in the AgD data (e.g., `age_group_mean`). - **`qbern`**: Bernoulli quantile function (provided by mlumr) for binary covariates. Use `qnorm` for continuous covariates. ### Distribution types ```{r, eval = FALSE, purl = FALSE} # Binary covariate age = distr(qbern, prob = age_mean) # Continuous covariate (normal) bmi = distr(qnorm, mean = bmi_mean, sd = bmi_sd) # Continuous covariate (gamma) biomarker = distr(qgamma, shape = 2, rate = 0.5) ``` ### Correlation handling By default, correlations between covariates are estimated from the IPD using Spearman rank correlation, then adjusted for the Gaussian copula. For binary and mixed discrete/continuous covariate pairs, this adjustment is a pragmatic approximation — exact calibration of the latent Gaussian correlation depends on marginal prevalences, which are not used in the current transform. For most practical datasets this provides adequate integration-point distributions, but users with strong prior information on covariate correlations can supply their own matrix via the `cor` argument. ```{r, eval = TRUE} # Default: Spearman correlation from IPD dat <- add_integration(dat, n_int = 64, age_group = distr(qbern, prob = age_group_mean), sex = distr(qbern, prob = sex_mean)) # Supply your own correlation matrix my_cor <- matrix(c(1, 0.3, 0.3, 1), 2, 2) dat <- add_integration(dat, n_int = 64, cor = my_cor, age_group = distr(qbern, prob = age_group_mean), sex = distr(qbern, prob = sex_mean)) # No correlation adjustment dat <- add_integration(dat, n_int = 64, cor_adjust = "none", age_group = distr(qbern, prob = age_group_mean), sex = distr(qbern, prob = sex_mean)) ``` ### Inspecting integration points ```{r, eval = TRUE} # Expand to long format int_df <- unnest_integration(dat) head(int_df) ``` Use `check_integration()` to compare the generated integration points against the requested AgD marginal distributions and correlation structure: ```{r, eval = TRUE, purl = FALSE} check_integration( dat, age_group = distr(qbern, prob = age_group_mean), sex = distr(qbern, prob = sex_mean) ) ``` This is especially important when there are several covariates, strong correlations, binary covariates with prevalences near 0 or 1, or a small number of integration points. ## Notes on STC and naive The STC and naive methods do **not** require integration points. The STC uses integration points if available (for better marginalization), but falls back to AgD covariate means if they are not. You can call `stc()` and `naive()` immediately after `combine_data()`. ```{r, eval = TRUE} # These work without add_integration() dat_no_int <- combine_data(ipd, agd) naive_result <- naive(dat_no_int) stc_result <- stc(dat_no_int) ```