--- title: "Survey Designs and Validation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Survey Designs and Validation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE ) ``` ## Introduction Complex sampling designs require special treatment for correct variance estimation. metasurvey manages the sampling design automatically through the `Survey` object, so that users can focus on the analysis rather than technical details. This vignette covers the following topics: 1. Creating surveys with different design types 2. Configuring weights and data engines 3. Validating the processing pipeline 4. Cross-checking results with the `survey` package ## Initial Setup We use the Academic Performance Index (API) dataset from the `survey` package. It includes stratified, cluster, and simple random sampling versions. ```{r setup} library(metasurvey) library(survey) library(data.table) data(api, package = "survey") dt_strat <- data.table(apistrat) ``` ## Sampling Design Types ### Simple Weighted Design The simplest design uses probability weights without clusters or stratification: ```{r simple} svy_simple <- Survey$new( data = dt_strat, edition = "2000", type = "api", psu = NULL, engine = "data.table", weight = add_weight(annual = "pw") ) cat_design(svy_simple) ``` ### Stratified Cluster Design Many national surveys use stratified multi-stage sampling. 
Pass `strata` (and, for cluster designs, `psu`) to `Survey$new()`. The `apistrat` sample is stratified by school type but has no clusters, so we set `psu = NULL`; for a cluster sample such as `apiclus1`, you would pass the PSU column name instead:

```{r stratified}
svy_strat <- Survey$new(
  data = dt_strat,
  edition = "2000",
  type = "api",
  psu = NULL,
  strata = "stype",
  engine = "data.table",
  weight = add_weight(annual = "pw")
)

cat_design(svy_strat)
```

Cross-validate with the `survey` package:

```{r stratified-validate}
design_strat <- svydesign(
  id = ~1,
  strata = ~stype,
  weights = ~pw,
  data = dt_strat
)
direct_strat <- svymean(~api00, design_strat)

wf_strat <- workflow(
  list(svy_strat),
  survey::svymean(~api00, na.rm = TRUE),
  estimation_type = "annual"
)

cat("Direct estimate:", round(coef(direct_strat), 2), "\n")
cat("Workflow estimate:", round(wf_strat$value, 2), "\n")
cat("Match:", all.equal(
  as.numeric(coef(direct_strat)),
  wf_strat$value,
  tolerance = 1e-6
), "\n")
```

For real-world examples of stratified cluster designs with CASEN, PNADc, ENIGH, and DHS data, see `vignette("international-surveys")`.

### Design Inspection

```{r inspect-design}
# Check design type
cat_design_type(svy_simple, "annual")

# View metadata
get_metadata(svy_simple)
```

### Multiple Weight Types

Many surveys provide different weights depending on the analysis period (for example, annual vs. monthly). 
metasurvey associates periodicity labels with weight columns: ```{r multi-weight} set.seed(42) dt_multi <- copy(dt_strat) dt_multi[, pw_monthly := pw * runif(.N, 0.9, 1.1)] svy_multi <- Survey$new( data = dt_multi, edition = "2000", type = "api", psu = NULL, engine = "data.table", weight = add_weight(annual = "pw", monthly = "pw_monthly") ) # Use different weight types in workflow() annual_est <- workflow( list(svy_multi), survey::svymean(~api00, na.rm = TRUE), estimation_type = "annual" ) monthly_est <- workflow( list(svy_multi), survey::svymean(~api00, na.rm = TRUE), estimation_type = "monthly" ) cat("Annual estimate:", round(annual_est$value, 1), "\n") cat("Monthly estimate:", round(monthly_est$value, 1), "\n") ``` ### Bootstrap Replicate Weights For surveys that provide bootstrap replicates (such as Uruguay's ECH), use `add_replicate()` inside `add_weight()`: ```r # Requires external bootstrap replicate CSV files svy_boot <- load_survey( path = "data/main_survey.csv", svy_type = "ech", svy_edition = "2023", svy_weight = add_weight( annual = add_replicate( weight_var = "pesoano", replicate_path = "data/bootstrap_replicates.csv", replicate_id = c("numero" = "id"), replicate_pattern = "bsrep[0-9]+", replicate_type = "bootstrap" ) ) ) ``` When replicate weights are configured, `workflow()` automatically uses them for variance estimation via `survey::svrepdesign()`. ## Engine and Processing Configuration ### Data Engine metasurvey uses `data.table` by default for fast data manipulation: ```{r engine} # Current engine get_engine() # Available engines show_engines() ``` ### Lazy Processing By default, steps are recorded but not executed until `bake_steps()` is called. 
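A minimal sketch of the lazy pattern, reusing `svy_simple` from above and assuming `bake_steps()` materializes the pending steps as described here:

```r
# Recorded only -- no computation happens yet
svy_lazy <- step_compute(svy_simple, api_diff = api00 - api99)

# Execution happens here, after any checks you want to run first
svy_baked <- bake_steps(svy_lazy)
```
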
This allows validations to be performed before execution: ```{r lazy} # Check current setting lazy_default() # Change for the session (not recommended for most workflows) # set_lazy_processing(FALSE) ``` ### Copy Behavior You can control whether step operations modify the data in-place or work on copies: ```{r copy} # Current setting use_copy_default() # In-place is faster but modifies the original # set_use_copy(FALSE) ``` ## Variance Estimation ### Design-Based Variance Standard variance estimation using the sampling design: ```{r variance} results <- workflow( list(svy_simple), survey::svymean(~api00, na.rm = TRUE), survey::svytotal(~enroll, na.rm = TRUE), estimation_type = "annual" ) results ``` ### Domain Estimation Estimates for subpopulations can be computed using `survey::svyby()`: ```{r domain} domain_results <- workflow( list(svy_simple), survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE), estimation_type = "annual" ) domain_results ``` ### Ratios ```{r ratio} ratio_result <- workflow( list(svy_simple), survey::svyratio(~api00, ~api99), estimation_type = "annual" ) ratio_result ``` ## Pipeline Validation ### Step-by-Step Verification When building complex pipelines, it is useful to verify each step independently: ```{r validate-steps} # Step 1: Compute new variable svy_v <- step_compute(svy_simple, api_diff = api00 - api99, comment = "API score difference" ) # Check that the step was recorded steps <- get_steps(svy_v) cat("Pending steps:", length(steps), "\n") ``` ### Cross-Validation with the survey Package You can compare metasurvey `workflow()` results with direct calls to the `survey` package: ```{r cross-validate} # Method 1: Direct survey package design <- svydesign(id = ~1, weights = ~pw, data = dt_strat) direct_mean <- svymean(~api00, design) # Method 2: metasurvey workflow wf_result <- workflow( list(svy_simple), survey::svymean(~api00, na.rm = TRUE), estimation_type = "annual" ) cat("Direct estimate:", round(coef(direct_mean), 2), 
"\n") cat("Workflow estimate:", round(wf_result$value, 2), "\n") cat("Match:", all.equal( as.numeric(coef(direct_mean)), wf_result$value, tolerance = 1e-6 ), "\n") ``` ### Pipeline Visualization You can use `view_graph()` to visualize the dependency graph between steps: ```{r view-graph, eval = FALSE} svy_viz <- step_compute(svy_simple, api_diff = api00 - api99, high_growth = ifelse(api00 - api99 > 50, 1L, 0L) ) view_graph(svy_viz, init_step = "Load API data") ``` The interactive DAG is not rendered in this vignette to keep the package size small. Run the code above in your R session to explore it. ### Quality Assessment You can assess the quality of estimates using the coefficient of variation: ```{r cv-check} results_quality <- workflow( list(svy_simple), survey::svymean(~api00, na.rm = TRUE), survey::svymean(~enroll, na.rm = TRUE), estimation_type = "annual" ) for (i in seq_len(nrow(results_quality))) { cv_pct <- results_quality$cv[i] * 100 cat( results_quality$stat[i], ":", round(cv_pct, 1), "% CV -", evaluate_cv(cv_pct), "\n" ) } ``` ### Recipe Validation You can verify that recipes and their steps are consistent: ```{r roundtrip} # Create steps and recipe svy_rt <- step_compute(svy_simple, api_diff = api00 - api99) my_recipe <- steps_to_recipe( name = "API Test", user = "QA Team", svy = svy_rt, description = "Recipe for validation", steps = get_steps(svy_rt) ) # Check documentation is correct doc <- my_recipe$doc() cat("Input variables:", paste(doc$input_variables, collapse = ", "), "\n") cat("Output variables:", paste(doc$output_variables, collapse = ", "), "\n") # Validate against the survey my_recipe$validate(svy_rt) ``` ## Validation Checklist Before putting a survey processing pipeline into production, the following should be verified: 1. **Data integrity** -- row count, column names, and data types after each step 2. **Weight validation** -- weight columns exist and are positive 3. 
**Design verification** -- the sampling design matches the expected specification (PSU, strata, weights)
4. **Recipe reproducibility** -- save and reload recipes, verify the JSON round-trip
5. **Cross-validation** -- compare key estimates with published values or direct calls to the `survey` package
6. **CV thresholds** -- flag estimates with high coefficients of variation

```{r checklist}
validate_pipeline <- function(svy) {
  data <- get_data(svy)

  # Keep only the character entries of the weight spec: these
  # are the weight column names that must exist in the data
  weight_vars <- Filter(is.character, unlist(svy$weight, use.names = FALSE))

  checks <- list(
    has_data = !is.null(data),
    has_rows = nrow(data) > 0,
    has_weights = all(weight_vars %in% names(data))
  )

  passed <- all(unlist(checks))
  if (passed) {
    message("All validation checks passed")
  } else {
    failed <- names(checks)[!unlist(checks)]
    warning("Failed checks: ", paste(failed, collapse = ", "))
  }
  invisible(checks)
}

validate_pipeline(svy_simple)
```

## Best Practices

1. **Always use appropriate weights** -- never compute unweighted statistics from survey data
2. **Use replicate weights when available** -- they provide more robust variance estimates
3. **Check sample sizes by domain** -- combine small domains when CVs are too high
4. **Document the design** -- include the design specification, weight construction, and variance method
5. **Cross-validate key estimates** -- compare with published values or alternative methods

## Next Steps

- **[Estimation Workflows](workflows-and-estimation.html)** -- `workflow()`, `RecipeWorkflow`, and publishable estimates
- **[Rotating Panels and PoolSurvey](panel-analysis.html)** -- Longitudinal analysis with `RotativePanelSurvey` and `PoolSurvey`
- **[Getting Started](getting-started.html)** -- Review the basics of steps and Survey objects