Complex sampling designs require special treatment for correct
variance estimation. metasurvey manages the sampling design
automatically through the Survey object, so that users can
focus on the analysis rather than technical details.
This vignette covers how to specify sampling designs, handle multiple and replicate weights, run estimations, and validate results.

We use the Academic Performance Index (API) dataset from the
survey package. It includes stratified, cluster, and simple
random sampling versions.
The simplest design uses only probability weights, with no clusters or stratification.
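The `svy_simple` object used throughout this vignette can be sketched as follows. This is an illustrative setup, assuming the API data has been loaded from the survey package and converted to a data.table; it mirrors the `Survey$new()` calls shown later in this vignette:

```r
library(survey)
library(data.table)
data(api) # loads apistrat, apiclus1, apisrs, ...

dt_strat <- data.table(apistrat)

# Weights only: no psu, no strata
svy_simple <- Survey$new(
  data = dt_strat,
  edition = "2000",
  type = "api",
  psu = NULL,
  engine = "data.table",
  weight = add_weight(annual = "pw")
)
```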
Many national surveys use stratified multi-stage sampling. Pass strata and psu to Survey$new(); in this example only strata is set, since apistrat is a single-stage stratified sample:
dt_strat <- data.table(apistrat)
svy_strat_clus <- Survey$new(
data = dt_strat,
edition = "2000",
type = "api",
psu = NULL,
strata = "stype",
engine = "data.table",
weight = add_weight(annual = "pw")
)
cat_design(svy_strat_clus)
#> [1] "\n Design: Not initialized (lazy initialization - will be created when needed)\n"Cross-validate with the survey package:
design_strat <- svydesign(
id = ~1, strata = ~stype, weights = ~pw,
data = dt_strat
)
direct_strat <- svymean(~api00, design_strat)
wf_strat <- workflow(
list(svy_strat_clus),
survey::svymean(~api00, na.rm = TRUE),
estimation_type = "annual"
)
cat("Direct estimate:", round(coef(direct_strat), 2), "\n")
#> Direct estimate: 662.29
cat("Workflow estimate:", round(wf_strat$value, 2), "\n")
#> Workflow estimate: 662.29
cat("Match:", all.equal(
as.numeric(coef(direct_strat)),
wf_strat$value,
tolerance = 1e-6
), "\n")
#> Match: TRUE

For real-world examples of stratified cluster designs with CASEN,
PNADc, ENIGH, and DHS data, see
vignette("international-surveys").
# Check design type
cat_design_type(svy_simple, "annual")
#> [1] "None"
# View metadata
get_metadata(svy_simple)
#> Type: API
#> Edition: 2000
#> Periodicity: Annual
#> Engine: data.table
#> Design:
#> Design: Not initialized (lazy initialization - will be created when needed)
#>
#> Steps: None
#> Recipes: None

Many surveys provide different weights depending on the analysis period (for example, annual vs. monthly). metasurvey associates periodicity labels with weight columns:
set.seed(42)
dt_multi <- copy(dt_strat)
dt_multi[, pw_monthly := pw * runif(.N, 0.9, 1.1)]
svy_multi <- Survey$new(
data = dt_multi,
edition = "2000",
type = "api",
psu = NULL,
engine = "data.table",
weight = add_weight(annual = "pw", monthly = "pw_monthly")
)
# Use different weight types in workflow()
annual_est <- workflow(
list(svy_multi),
survey::svymean(~api00, na.rm = TRUE),
estimation_type = "annual"
)
monthly_est <- workflow(
list(svy_multi),
survey::svymean(~api00, na.rm = TRUE),
estimation_type = "monthly"
)
cat("Annual estimate:", round(annual_est$value, 1), "\n")
#> Annual estimate: 662.3
cat("Monthly estimate:", round(monthly_est$value, 1), "\n")
#> Monthly estimate: 662.5

For surveys that provide bootstrap replicates (such as Uruguay’s
ECH), use add_replicate() inside
add_weight():
# Requires external bootstrap replicate CSV files
svy_boot <- load_survey(
path = "data/main_survey.csv",
svy_type = "ech",
svy_edition = "2023",
svy_weight = add_weight(
annual = add_replicate(
weight_var = "pesoano",
replicate_path = "data/bootstrap_replicates.csv",
replicate_id = c("numero" = "id"),
replicate_pattern = "bsrep[0-9]+",
replicate_type = "bootstrap"
)
)
)

When replicate weights are configured, workflow()
automatically uses them for variance estimation via
survey::svrepdesign().
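Under the hood this corresponds to building a replicate-weights design directly. A rough equivalent using the survey package alone might look like the following sketch, where `merged_data` (the main survey joined with its replicate columns) and the analysis variable are hypothetical placeholders:

```r
library(survey)

# merged_data: main survey joined with its bootstrap replicate
# columns (bsrep1, bsrep2, ... matching the pattern above)
rep_design <- svrepdesign(
  weights = ~pesoano,
  repweights = "bsrep[0-9]+", # regex matching replicate columns
  type = "bootstrap",
  data = merged_data
)
svymean(~some_variable, rep_design)
```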
metasurvey uses data.table by default for fast data manipulation.
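For instance, the underlying data can be retrieved with get_data() and inspected with standard data.table syntax. A short sketch using `svy_simple` (note this is an unweighted summary, for inspection only):

```r
dt <- get_data(svy_simple)

# Standard data.table syntax applies to the retrieved data
dt[, .(n = .N, mean_api00 = mean(api00)), by = stype]
```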
By default, steps are recorded but not executed until
bake_steps() is called. This allows validations to be
performed before execution.
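For example, a sketch of the lazy workflow, using the step_compute() and bake_steps() functions shown elsewhere in this vignette:

```r
# Record a step; with lazy evaluation it is not executed yet
svy_lazy <- step_compute(svy_simple, api_diff = api00 - api99)

# Materialize all recorded steps
svy_baked <- bake_steps(svy_lazy)
```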
Standard variance estimation using the sampling design:
results <- workflow(
list(svy_simple),
survey::svymean(~api00, na.rm = TRUE),
survey::svytotal(~enroll, na.rm = TRUE),
estimation_type = "annual"
)
results
#> stat value se cv confint_lower
#> <char> <num> <num> <num> <num>
#> 1: survey::svymean: api00 662.2874 9.585429e+00 0.01447322 643.5003
#> 2: survey::svytotal: enroll 3687177.5324 1.645323e+05 0.04462283 3364700.1537
#> confint_upper
#> <num>
#> 1: 681.0745
#> 2: 4009654.9112

Estimates for subpopulations can be computed using
survey::svyby():
domain_results <- workflow(
list(svy_simple),
survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE),
estimation_type = "annual"
)
domain_results
#> stat value se cv confint_lower
#> <char> <num> <num> <num> <num>
#> 1: survey::svyby: api00 [stype=E] 674.43 12.49343 0.01852443 649.9433
#> 2: survey::svyby: api00 [stype=H] 625.82 15.34078 0.02451309 595.7526
#> 3: survey::svyby: api00 [stype=M] 636.60 16.50239 0.02592270 604.2559
#> confint_upper stype
#> <num> <fctr>
#> 1: 698.9167 E
#> 2: 655.8874 H
#> 3: 668.9441 M

Ratios of two variables can be estimated with survey::svyratio():

ratio_result <- workflow(
list(svy_simple),
survey::svyratio(~api00, ~api99),
estimation_type = "annual"
)
ratio_result
#> stat value se cv confint_lower
#> <char> <num> <num> <num> <num>
#> 1: survey::svyratio: api00/api99 1.052261 0.00379243 0.003604079 1.044828
#> confint_upper
#> <num>
#> 1: 1.059694

When building complex pipelines, it is useful to verify each step independently. You can compare metasurvey workflow() results with direct calls to the survey package:
# Method 1: Direct survey package
design <- svydesign(id = ~1, weights = ~pw, data = dt_strat)
direct_mean <- svymean(~api00, design)
# Method 2: metasurvey workflow
wf_result <- workflow(
list(svy_simple),
survey::svymean(~api00, na.rm = TRUE),
estimation_type = "annual"
)
cat("Direct estimate:", round(coef(direct_mean), 2), "\n")
#> Direct estimate: 662.29
cat("Workflow estimate:", round(wf_result$value, 2), "\n")
#> Workflow estimate: 662.29
cat("Match:", all.equal(
as.numeric(coef(direct_mean)),
wf_result$value,
tolerance = 1e-6
), "\n")
#> Match: TRUE

You can use view_graph() to visualize the dependency
graph between steps:
svy_viz <- step_compute(svy_simple,
api_diff = api00 - api99,
high_growth = ifelse(api00 - api99 > 50, 1L, 0L)
)
view_graph(svy_viz, init_step = "Load API data")

The interactive DAG is not rendered in this vignette to keep the package size small. Run the code above in your R session to explore it.
You can assess the quality of estimates using the coefficient of variation:
results_quality <- workflow(
list(svy_simple),
survey::svymean(~api00, na.rm = TRUE),
survey::svymean(~enroll, na.rm = TRUE),
estimation_type = "annual"
)
for (i in seq_len(nrow(results_quality))) {
cv_pct <- results_quality$cv[i] * 100
cat(
results_quality$stat[i], ":",
round(cv_pct, 1), "% CV -",
evaluate_cv(cv_pct), "\n"
)
}
#> survey::svymean: api00 : 1.4 % CV - Excellent
#> survey::svymean: enroll : 4.5 % CV - Excellent

You can verify that recipes and their steps are consistent:
# Create steps and recipe
svy_rt <- step_compute(svy_simple, api_diff = api00 - api99)
my_recipe <- steps_to_recipe(
name = "API Test",
user = "QA Team",
svy = svy_rt,
description = "Recipe for validation",
steps = get_steps(svy_rt)
)
# Check documentation is correct
doc <- my_recipe$doc()
cat("Input variables:", paste(doc$input_variables, collapse = ", "), "\n")
#> Input variables: api00, api99
cat("Output variables:", paste(doc$output_variables, collapse = ", "), "\n")
#> Output variables: api_diff
# Validate against the survey
my_recipe$validate(svy_rt)
#> [1] TRUE

Before putting a survey processing pipeline into production, verify that the survey has data, that the configured weight columns exist, and that estimates match direct calls to the survey package. A helper such as the following can automate the basic checks:

validate_pipeline <- function(svy) {
  data <- get_data(svy)
  # The weight spec can mix column names with replicate definitions;
  # keep only the character entries (column names)
  weight_vars <- unlist(svy$weight)
  weight_vars <- weight_vars[vapply(weight_vars, is.character, logical(1))]
  checks <- list(
    has_data = !is.null(data),
    has_rows = !is.null(data) && nrow(data) > 0,
    has_weights = all(weight_vars %in% names(data))
  )
  passed <- all(unlist(checks))
  if (passed) {
    message("All validation checks passed")
  } else {
    failed <- names(checks)[!unlist(checks)]
    warning("Failed checks: ", paste(failed, collapse = ", "))
  }
  invisible(checks)
}
validate_pipeline(svy_simple)

See the package's other vignettes for more on workflow(), Recipe, Workflow, and publishable estimates, and on the RotativePanelSurvey and PoolSurvey classes.