Getting Started with metasurvey

Introduction

Working with household survey microdata involves a great deal of repetitive processing: recoding categorical variables, building indicators, joining external data, and computing weighted estimates. Each researcher writes their own version of these transformations, and the code is rarely shared or documented in a way that others can reuse.

metasurvey addresses this problem by providing a metaprogramming layer on top of the survey package (Lumley 2004). Instead of writing ad hoc scripts, you build an explicit, reusable pipeline of transformations.

The pipeline has three levels:

  1. Steps – individual transformations (compute, recode, filter, rename, remove, join, validate)
  2. Recipes – reusable collections of steps bundled with metadata
  3. Workflows – statistical estimates (svymean, svytotal, svyby) that produce the final tables

The package handles the survey design (stratification, clusters, replicate weights) automatically through the Survey object. The user focuses on the substantive analysis; metasurvey takes care of the infrastructure.

Installation

# Install from GitHub
devtools::install_github("metaSurveyR/metasurvey")
library(metasurvey)
library(data.table)

Creating a Survey object

A Survey object groups the microdata together with metadata about weights, edition, and survey type. We use a sample of real microdata from Uruguay’s Encuesta Continua de Hogares (ECH, Continuous Household Survey) 2023, published by the Instituto Nacional de Estadística (INE).

The ECH is a rotating-panel household survey. Key variables used in this vignette include e26 (sex), e27 (age), POBPCOAC (labor force status), HT11 (household income), and W_ANO (the annual expansion weight).

library(metasurvey)
library(data.table)

# Load a sample of real ECH 2023 microdata (200 households, ~500 persons)
dt <- fread(system.file("extdata", "ech_2023_sample.csv", package = "metasurvey"))

# Create Survey object
svy <- Survey$new(
  data    = dt,
  edition = "2023",
  type    = "ech",
  engine  = "data.table",
  weight  = add_weight(annual = "W_ANO")
)

The add_weight() function maps periodicity labels (e.g., “annual”, “monthly”) to the weight column names in the data. This lets the same recipe work across different survey editions.
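For instance, an edition that shipped more than one weight column could register each periodicity in a single call. This is a sketch: the monthly column W_MES is an assumed name, not present in the sample data.

```r
# Hypothetical sketch: register both annual and monthly weights.
# "W_MES" is an assumed column name for illustration only.
svy_monthly <- Survey$new(
  data    = dt,
  edition = "2023",
  type    = "ech",
  engine  = "data.table",
  weight  = add_weight(annual = "W_ANO", monthly = "W_MES")
)
```

A workflow can then select the appropriate weight through its estimation_type argument, so the same pipeline serves both periodicities.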

You can inspect the data at any time:

head(get_data(svy), 3)
#>       ID  nper  anio   mes region  dpto   nom_dpto   e26   e27   e30 e51_2
#>    <int> <int> <int> <int>  <int> <int>     <char> <int> <int> <int> <int>
#> 1: 34561     1  2023     1      1     1 Montevideo     2    26     1     6
#> 2: 34561     2  2023     1      1     1 Montevideo     2    45     7     6
#> 3: 34561     3  2023     1      1     1 Montevideo     2     7     4     1
#>    POBPCOAC SUBEMPLEO    HT11 pobre06 W_ANO
#>       <int>     <int>   <num>   <int> <int>
#> 1:        2         0 55429.6       0    57
#> 2:        4         0 55429.6       0    57
#> 3:        1         0 55429.6       0    57

Working with Steps

Steps are lazy by default: they are recorded but not executed until bake_steps() is called. This allows you to:

  1. Build a complete transformation pipeline
  2. Inspect and validate the steps before execution
  3. Reuse step sequences as recipes
  4. Ensure all dependencies exist before processing
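As a small sketch of laziness (assuming the lazy default, so new columns are materialized only at bake time):

```r
# With lazy processing (the default), a step is only recorded
svy <- step_compute(svy, ht11_thousands = HT11 / 1000)
length(get_steps(svy)) # the pending step appears in the history

# The new column should not be materialized until we bake
svy <- bake_steps(svy) # now the transformation actually runs
```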

Computing new variables

Use step_compute() to create derived variables. Each computation is recorded as a step in the pipeline, and its input variables are checked before execution.

svy <- step_compute(svy,
  # Convert income to thousands for readability
  ht11_thousands = HT11 / 1000,

  # Create employment indicator following ILO definitions
  employed = ifelse(POBPCOAC == 2, 1, 0),

  # Working age population (14+ years, ECH standard)
  working_age = ifelse(e27 >= 14, 1, 0),
  comment = "Basic labor force indicators"
)

You can group computations using the .by parameter (similar to data.table):

# Calculate mean household income per department
svy <- step_compute(svy,
  mean_income_dept = mean(HT11, na.rm = TRUE),
  .by = "dpto",
  comment = "Department-level income averages"
)

Recoding into categories

Use step_recode() to create categorical variables from conditions. Conditions are evaluated top to bottom, and the first match applies.

# Recode labor force status (POBPCOAC) into meaningful categories
svy <- step_recode(svy, labor_status,
  POBPCOAC == 2 ~ "Employed",
  POBPCOAC %in% 3:5 ~ "Unemployed",
  POBPCOAC %in% 6:10 ~ "Inactive",
  .default = "Not classified",
  comment = "Labor force status - ILO standard"
)

# Create standard age groups for labor statistics
svy <- step_recode(svy, age_group,
  e27 < 25 ~ "Youth (14-24)",
  e27 < 45 ~ "Adult (25-44)",
  e27 < 65 ~ "Mature (45-64)",
  .default = "Elderly (65+)",
  .to_factor = TRUE, # Convert to factor
  ordered = TRUE, # Ordered factor
  comment = "Age groups for labor analysis"
)

# Recode sex into descriptive labels
svy <- step_recode(svy, gender,
  e26 == 1 ~ "Male",
  e26 == 2 ~ "Female",
  .default = "Other",
  comment = "Gender classification"
)

Filtering rows

Use step_filter() to subset rows based on logical conditions. Multiple conditions are combined with AND. Like other steps, filters are lazy by default.

# Keep only working-age individuals (14+)
svy <- step_filter(svy,
  e27 >= 14,
  comment = "Working-age population only"
)
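Passing several conditions keeps only the rows that satisfy all of them. A sketch combining two conditions:

```r
# Conditions are ANDed: employed AND working-age
svy_employed <- step_filter(svy,
  e27 >= 14,
  POBPCOAC == 2,
  comment = "Employed, working-age population"
)
```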

Renaming and removing variables

Rename variables for clarity or consistency:

svy <- step_rename(svy,
  age = e27, # Rename e27 to age
  sex_code = e26 # Keep original as sex_code
)

Remove variables that are no longer needed:

# Remove intermediate calculations
svy <- step_remove(svy, working_age, mean_income_dept)

Joining external data

Use step_join() to merge in external reference data, such as geographic names, deflators, or poverty lines.

Note that real ECH microdata already includes nom_dpto and region. Here we demonstrate a join with a poverty-line lookup as an example:

# Poverty lines by region (illustrative values in UYU, 2023)
poverty_lines <- data.table(
  region = 1:3,
  poverty_line = c(19000, 12500, 11000),
  region_name = c("Montevideo", "Interior loc. >= 5000", "Interior loc. < 5000")
)

svy <- step_join(svy,
  poverty_lines,
  by = "region",
  type = "left",
  comment = "Add poverty lines by region"
)

Executing transformations

Applying steps (bake)

Call bake_steps() to execute all pending transformations:

svy <- bake_steps(svy)
head(get_data(svy), 3)
#>    region  dpto    ID  nper  anio   mes   nom_dpto sex_code   age   e30 e51_2
#>     <int> <int> <int> <int> <int> <int>     <char>    <int> <int> <int> <int>
#> 1:      1     1 34561     1  2023     1 Montevideo        2    26     1     6
#> 2:      1     1 34561     2  2023     1 Montevideo        2    45     7     6
#> 3:      1     1 34678     1  2023     1 Montevideo        1    48     1     6
#>    POBPCOAC SUBEMPLEO     HT11 pobre06 W_ANO ht11_thousands employed
#>       <int>     <int>    <num>   <int> <int>          <num>    <num>
#> 1:        2         0  55429.6       0    57        55.4296        1
#> 2:        4         0  55429.6       0    57        55.4296        0
#> 3:        2         0 152958.8       0    54       152.9588        1
#>    labor_status      age_group gender poverty_line region_name poverty_line.y
#>          <char>         <fctr> <char>        <num>      <char>          <num>
#> 1:     Employed Mature (45-64) Female        19000  Montevideo          19000
#> 2:   Unemployed Mature (45-64) Female        19000  Montevideo          19000
#> 3:     Employed Mature (45-64)   Male        19000  Montevideo          19000
#>    region_name.y
#>           <char>
#> 1:    Montevideo
#> 2:    Montevideo
#> 3:    Montevideo

The step history is preserved for documentation and reproducibility:

steps <- get_steps(svy)
length(steps) # Number of transformation steps
#> [1] 9

# View step details
cat("Step 1:", steps[[1]]$name, "\n")
#> Step 1: step_1 Compute: ht11_thousands, employed, working_age
cat("Comment:", steps[[1]]$comment, "\n")
#> Comment: Basic labor force indicators

Visualizing the pipeline

You can visualize the transformation pipeline as a directed graph:

view_graph(svy, init_step = "Load ECH 2023")

The interactive DAG is not rendered in this vignette to keep the package size small. Run view_graph() in your R session to get an interactive graph of the steps, their dependencies, and the order in which they are applied.

Statistical estimates

Once the data is prepared, use workflow() to compute survey estimates. The function wraps estimators from the survey package (Lumley 2004) and returns tidy results with standard errors and coefficients of variation.

Important: The survey object must be passed inside a list().

Basic estimates

# Estimate mean household income
result <- workflow(
  list(svy),
  survey::svymean(~HT11, na.rm = TRUE),
  estimation_type = "annual"
)

result
#>                     stat    value       se         cv confint_lower
#>                   <char>    <num>    <num>      <num>         <num>
#> 1: survey::svymean: HT11 110389.3 4076.154 0.03692528      102400.1
#>    confint_upper
#>            <num>
#> 1:      118378.4

The output includes the point estimate (value), its standard error (se), the coefficient of variation (cv), and the confidence interval bounds (confint_lower, confint_upper).

Multiple estimates

You can compute several statistics in a single call:

results <- workflow(
  list(svy),
  survey::svymean(~HT11, na.rm = TRUE),
  survey::svytotal(~employed, na.rm = TRUE),
  survey::svymean(~labor_status, na.rm = TRUE),
  estimation_type = "annual"
)

results
#>                                       stat        value           se         cv
#>                                     <char>        <num>        <num>      <num>
#> 1:                   survey::svymean: HT11 1.103893e+05 4.076154e+03 0.03692528
#> 2:              survey::svytotal: employed 1.426200e+04 6.756206e+02 0.04737208
#> 3:   survey::svymean: labor_statusEmployed 5.551576e-01 2.610113e-02 0.04701571
#> 4:   survey::svymean: labor_statusInactive 3.860646e-01 2.551750e-02 0.06609646
#> 5: survey::svymean: labor_statusUnemployed 5.877773e-02 1.309115e-02 0.22272291
#>    confint_lower confint_upper
#>            <num>         <num>
#> 1:  1.024001e+05  1.183784e+05
#> 2:  1.293781e+04  1.558619e+04
#> 3:  5.040004e-01  6.063149e-01
#> 4:  3.360512e-01  4.360780e-01
#> 5:  3.311956e-02  8.443591e-02

Domain estimation

Compute estimates for subpopulations using survey::svyby():

# Mean income by gender
income_by_gender <- workflow(
  list(svy),
  survey::svyby(~HT11, ~gender, survey::svymean, na.rm = TRUE),
  estimation_type = "annual"
)

income_by_gender
#>                                   stat    value       se         cv
#>                                 <char>    <num>    <num>      <num>
#> 1: survey::svyby: HT11 [gender=Female] 111107.9 5955.676 0.05360263
#> 2:   survey::svyby: HT11 [gender=Male] 109608.9 5519.291 0.05035439
#>    confint_lower confint_upper gender
#>            <num>         <num> <char>
#> 1:      99434.99      122780.8 Female
#> 2:      98791.31      120426.5   Male

Quality assessment

The coefficient of variation (CV) measures the reliability of estimates. A lower CV indicates more precise estimates. Following the INE Uruguay guidelines (Instituto Nacional de Estadística (INE) 2023):

CV range   Quality category   Recommendation
< 5%       Excellent          Use without restrictions
5%–10%     Very good          Use with confidence
10%–15%    Good               Use for most purposes
15%–25%    Acceptable         Use with caution, noting limitations
25%–35%    Poor               Use only for general trends
>= 35%     Unreliable         Do not publish

Use evaluate_cv() to classify estimate quality:

# Check quality of mean income estimate
cv_percentage <- results$cv[1] * 100
quality <- evaluate_cv(cv_percentage)

cat("CV:", round(cv_percentage, 2), "%\n")
#> CV: 3.69 %
cat("Quality:", quality, "\n")
#> Quality: Excellent

For official statistics, always report:

  1. Point estimate
  2. Standard error or confidence interval
  3. Coefficient of variation
  4. Quality classification
  5. Sample size
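The thresholds in the quality table can be mirrored in a few lines of base R. This classify_cv() is a hypothetical standalone re-implementation, shown only to make the cutoffs explicit; evaluate_cv() is the package's actual interface.

```r
# Hypothetical standalone version of the INE CV classification
# (evaluate_cv() in the package is the real interface)
classify_cv <- function(cv_pct) {
  cut(cv_pct,
    breaks = c(-Inf, 5, 10, 15, 25, 35, Inf),
    labels = c(
      "Excellent", "Very good", "Good",
      "Acceptable", "Poor", "Unreliable"
    ),
    right = FALSE # intervals are [lower, upper), so exactly 5% is "Very good"
  )
}

as.character(classify_cv(c(3.69, 22.3, 40)))
#> [1] "Excellent"  "Acceptable" "Unreliable"
```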

Working with Recipes

Recipes bundle transformation steps for reproducibility and sharing. Once you have developed a working pipeline, you can convert it into a recipe that can be shared with colleagues, applied to new survey editions, and published to the recipe ecosystem.

Creating a Recipe

Create a recipe from the steps you have developed:

# Convert current steps to a recipe
labor_recipe <- steps_to_recipe(
  name = "ECH Labor Force Indicators",
  user = "National Statistics Office",
  svy = svy,
  description = paste(
    "Standard labor force indicators following ILO definitions.",
    "Creates employment status, age groups, and gender classifications."
  ),
  steps = get_steps(svy),
  topic = "labor_statistics"
)

class(labor_recipe)
#> [1] "Recipe" "R6"
labor_recipe$name
#> [1] "ECH Labor Force Indicators"

Or you can define a recipe from scratch:

minimal_recipe <- recipe(
  name = "Basic Demographics - ECH",
  user = "analyst",
  svy = survey_empty(type = "ech", edition = "2023"),
  description = "Basic demographic recoding for ECH microdata",
  topic = "demographics",

  # Define steps inline
  step_recode(
    gender,
    e26 == 1 ~ "Male",
    e26 == 2 ~ "Female",
    .default = "Other"
  ),
  step_recode(
    age_group,
    e27 < 14 ~ "Child",
    e27 < 65 ~ "Adult",
    .default = "Senior"
  )
)
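A recipe defined inline this way can be attached to a Survey object and executed with the same add_recipe()/bake_recipes() pair used for published recipes. A sketch:

```r
# Attach the locally defined recipe and execute its steps
svy_demo <- add_recipe(svy, minimal_recipe)
svy_demo <- bake_recipes(svy_demo)
```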

Applying Recipes to new data

Once published, anyone can retrieve a recipe by ID and apply it to their data:

# Get a recipe by ID from the API (requires server)
r <- api_get_recipe("ech_labor_001")

# Apply to a new survey
new_svy <- add_recipe(new_svy, r)
processed <- bake_recipes(new_svy)

Recipe documentation

Recipes automatically document their transformations:

doc <- labor_recipe$doc()
names(doc)
#> [1] "meta"             "input_variables"  "output_variables" "pipeline"

# Input variables required
doc$input_variables
#> [1] "HT11"     "POBPCOAC" "e27"      "e26"      "region"

# Output variables created
doc$output_variables
#> [1] "ht11_thousands"   "employed"         "working_age"      "mean_income_dept"
#> [5] "labor_status"     "age_group"        "gender"           "mapping"

Package configuration

metasurvey provides global settings that can be adjusted to suit your workflow:

# Check current lazy-processing setting
lazy_default() # TRUE = steps recorded but not executed immediately
#> [1] TRUE

# Check data-copy behavior
use_copy_default() # TRUE = operate on copies (safer but slower)
#> [1] TRUE

# View available computation engines
show_engines() # "data.table", "dplyr", etc.
#> [1] "data.table" "tidyverse"  "dplyr"

You can modify settings for the current session:

# Disable lazy evaluation (execute steps immediately)
set_lazy_processing(FALSE)

# Modify inplace (faster, but modifies original data)
set_use_copy(FALSE)

# Reset to defaults
set_lazy_processing(TRUE)
set_use_copy(TRUE)

Sharing your work: the recipe ecosystem

One of the goals of metasurvey is to reduce duplicated effort among the community of researchers working with surveys. If you have built a useful processing pipeline, you can publish it so that others can find and reuse it. The package connects to a public API where recipes and workflows are stored:

# Browse existing recipes (requires API server)
ech_recipes <- api_list_recipes(survey_type = "ech")
length(ech_recipes)

# Search for something specific
labor <- api_list_recipes(search = "labor")

# To publish your own, create an account first
api_register("Your Name", "you@example.com", "password")

# Then publish
api_publish_recipe(labor_recipe)

The ecosystem supports three certification levels (community, reviewed, official) and three account types (individual, institutional member, institution). Institutional accounts require administrator approval, ensuring that certifications carry real backing.

For more details, see Creating and Sharing Recipes.

Data and acknowledgments

The sample data used in this vignette comes from the Encuesta Continua de Hogares (ECH) 2023, published by Uruguay’s Instituto Nacional de Estadística (INE). The full microdata is available from the INE website.

The ech package by Gabriela Mathieu and Richard Detomasi was an important inspiration for metasurvey. While ech provides ready-to-use functions for computing socioeconomic indicators, metasurvey takes a different approach: it provides a metaprogramming layer that lets users define, share, and reproduce their own processing pipelines.

Next steps

Now that you understand the basics, you can explore the other vignettes, such as Creating and Sharing Recipes.

References

Instituto Nacional de Estadística (INE). 2023. Encuesta Continua de Hogares: Metodología y Documentación. Instituto Nacional de Estadística (INE), Uruguay. https://www.gub.uy/instituto-nacional-estadistica/.
Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” Journal of Statistical Software 9 (1): 1–19. https://doi.org/10.18637/jss.v009.i08.