---
title: "Estimation Workflows"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Estimation Workflows}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

## What is a Workflow?

After transforming survey data with steps and recipes, the next task is **estimation**: computing means, totals, ratios, and their standard errors while accounting for the complex survey design.

The `workflow()` function wraps the estimators from the `survey` package (`svymean`, `svytotal`, `svyratio`, `svyby`) and returns tidy results as a `data.table` that includes:

- Point estimates and standard errors
- Coefficients of variation (CV)
- Confidence intervals
- Metadata for reproducibility

## Initial Setup

We use the Academic Performance Index (API) dataset from the `survey` package, which contains a stratified sample of California schools.

```{r setup}
library(metasurvey)
library(survey)
library(data.table)

data(api, package = "survey")
dt <- data.table(apistrat)

svy <- Survey$new(
  data = dt,
  edition = "2000",
  type = "api",
  psu = NULL,
  engine = "data.table",
  weight = add_weight(annual = "pw")
)
```

## Basic Estimation

### Mean

We estimate the population mean of the API score for the year 2000:

```{r mean}
result <- workflow(
  list(svy),
  survey::svymean(~api00, na.rm = TRUE),
  estimation_type = "annual"
)
result
```

### Total

We estimate total enrollment across all schools:

```{r total}
result_total <- workflow(
  list(svy),
  survey::svytotal(~enroll, na.rm = TRUE),
  estimation_type = "annual"
)
result_total
```

### Multiple Estimates at Once

You can pass multiple estimation calls to `workflow()` to compute them in a single step:

```{r multiple}
results <- workflow(
  list(svy),
  survey::svymean(~api00, na.rm = TRUE),
  survey::svytotal(~enroll, na.rm = TRUE),
  estimation_type = "annual"
)
results
```

## Domain Estimation

We use `survey::svyby()` to
compute estimates by subpopulations (domains):

```{r domain}
# Mean API score by school type
api_by_type <- workflow(
  list(svy),
  survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE),
  estimation_type = "annual"
)
api_by_type
```

```{r domain-award}
# Mean enrollment by awards status
enroll_by_award <- workflow(
  list(svy),
  survey::svyby(~enroll, ~awards, survey::svymean, na.rm = TRUE),
  estimation_type = "annual"
)
enroll_by_award
```

## Quality Assessment

The **coefficient of variation (CV)** -- the standard error divided by the point estimate, usually expressed as a percentage -- measures the relative precision of an estimate. You can use `evaluate_cv()` to classify quality following standard guidelines:

| CV Range | Quality    | Recommendation           |
|----------|------------|--------------------------|
| < 5%     | Excellent  | Use without restrictions |
| 5-10%    | Very good  | Use with confidence      |
| 10-15%   | Good       | Use for most purposes    |
| 15-25%   | Acceptable | Use with caution         |
| 25-35%   | Poor       | Only for general trends  |
| >= 35%   | Unreliable | Do not publish           |

```{r cv}
# Evaluate quality of the API score estimate
cv_pct <- results$cv[1] * 100
quality <- evaluate_cv(cv_pct)
cat("CV:", round(cv_pct, 2), "%\n")
cat("Quality:", quality, "\n")
```

## RecipeWorkflow: Publishable Estimates

A `RecipeWorkflow` bundles estimation calls with metadata, making the analysis reproducible and shareable.
It records:

- Which recipes were used for data preparation
- Which estimation calls were performed
- Authorship and versioning information

### Creating a RecipeWorkflow

```{r create-wf}
wf <- RecipeWorkflow$new(
  name = "API Score Analysis 2000",
  description = "Mean API score estimation by school type",
  user = "Research Team",
  survey_type = "api",
  edition = "2000",
  estimation_type = "annual",
  recipe_ids = character(0),
  calls = list(
    "survey::svymean(~api00, na.rm = TRUE)",
    "survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE)"
  )
)
wf
```

### Publishing to the Registry

We publish the workflow so that others can discover and reuse it:

```{r wf-registry}
# Configure a local backend
wf_path <- tempfile(fileext = ".json")
set_workflow_backend("local", path = wf_path)

# Publish
publish_workflow(wf)

# Discover workflows
all_wf <- list_workflows()
length(all_wf)

# Search by text
found <- search_workflows("income")
length(found)

# Filter by survey type
ech_wf <- filter_workflows(survey_type = "ech")
length(ech_wf)
```

### Finding Workflows Associated with a Recipe

If you have a recipe and want to know which estimates have been published for it, you can use `find_workflows_for_recipe()`:

```{r find-for-recipe}
# Create a workflow that references a recipe
wf2 <- RecipeWorkflow$new(
  name = "Labor Market Estimates",
  user = "Team",
  survey_type = "ech",
  edition = "2023",
  estimation_type = "annual",
  recipe_ids = c("labor_force_recipe_001"),
  calls = list("survey::svymean(~employed, na.rm = TRUE)")
)
publish_workflow(wf2)

# Find all workflows that use this recipe
related <- find_workflows_for_recipe("labor_force_recipe_001")
length(related)
if (length(related) > 0) cat("Found:", related[[1]]$name, "\n")
```

## Sharing via the Remote API

For broader dissemination, you can publish workflows to the metasurvey API:

```r
# Requires authentication
api_login("you@example.com", "password")

# Publish
api_publish_workflow(wf)

# Browse
all <- api_list_workflows(survey_type = "ech")
specific <- api_get_workflow("workflow_id_here")
```

## Full Pipeline

Below is a complete pipeline from raw data to publishable estimation, using the API dataset:

```{r full-pipeline}
# 1. Create survey from real data
dt_full <- data.table(apistrat)

svy_full <- Survey$new(
  data = dt_full,
  edition = "2000",
  type = "api",
  psu = NULL,
  engine = "data.table",
  weight = add_weight(annual = "pw")
)

# 2. Apply steps: compute derived variables
svy_full <- step_compute(svy_full,
  api_growth = api00 - api99,
  high_growth = ifelse(api00 - api99 > 50, 1L, 0L),
  comment = "API score growth indicators"
)

svy_full <- step_recode(svy_full,
  school_level,
  stype == "E" ~ "Elementary",
  stype == "M" ~ "Middle",
  stype == "H" ~ "High",
  .default = "Other",
  comment = "School level classification"
)

# 3. Estimate means
estimates <- workflow(
  list(svy_full),
  survey::svymean(~api_growth, na.rm = TRUE),
  survey::svymean(~high_growth, na.rm = TRUE),
  estimation_type = "annual"
)
estimates
```

```{r full-pipeline-domain}
# 4. Domain estimation (by school type)
by_school <- workflow(
  list(svy_full),
  survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE),
  estimation_type = "annual"
)
by_school
```

```{r full-pipeline-cv}
# 5. Assess quality
for (i in seq_len(nrow(estimates))) {
  cv_val <- estimates$cv[i] * 100
  cat(
    estimates$stat[i], ":", round(cv_val, 1), "% CV -",
    evaluate_cv(cv_val), "\n"
  )
}
```

## Provenance: Data Lineage

Every `Survey` object records **provenance** metadata: where the data came from, which steps were applied, how many rows survived each step, and which versions of R and metasurvey were used. This makes it possible to trace any estimate back to the raw data.
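As a rough mental model, a provenance record can be pictured as a nested list. The sketch below is illustrative only -- the exact field names and structure are internal to metasurvey (only `steps` and `environment$metasurvey_version` appear in the accessors used in this vignette, and the row counts are simply those of `apistrat`):

```r
# Hypothetical sketch of the information a provenance record carries;
# not the actual metasurvey data structure.
prov_sketch <- list(
  source = "apistrat (survey package, edition 2000)",
  steps = list(
    list(step = "step_compute", n_rows_after = 200L),
    list(step = "step_recode",  n_rows_after = 200L)
  ),
  environment = list(
    r_version = R.version.string,
    metasurvey_version = "x.y.z" # placeholder
  )
)
length(prov_sketch$steps)
```

The point is simply that every transformation leaves a traceable entry, so an estimate can be audited step by step.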
```{r provenance}
# Provenance is populated automatically after bake_steps()
prov <- provenance(svy_full)
prov
```

Provenance is also attached to `workflow()` results, so you can always inspect the full lineage of an estimate:

```{r provenance-workflow}
prov_wf <- provenance(estimates)
cat("metasurvey version:", prov_wf$environment$metasurvey_version, "\n")
cat("Steps applied:", length(prov_wf$steps), "\n")
```

For audit trails, export provenance to JSON:

```{r provenance-json, eval = FALSE}
provenance_to_json(prov, "audit_trail.json")
```

To compare two runs (e.g., different editions), use `provenance_diff()`:

```{r provenance-diff, eval = FALSE}
diff <- provenance_diff(prov_2022, prov_2023)
diff$steps_changed
diff$n_final_changed
```

## Publication-Quality Tables

`workflow_table()` formats estimation results as publication-ready tables using the `gt` package. It adds confidence intervals, CV quality classification with color coding, and provenance-based source notes.

```{r workflow-table, eval = requireNamespace("gt", quietly = TRUE)}
workflow_table(estimates)
```

You can customize the output:

```{r workflow-table-opts, eval = requireNamespace("gt", quietly = TRUE)}
# Spanish locale, hide SE, custom title
workflow_table(
  estimates,
  locale = "es",
  show_se = FALSE,
  title = "API Growth Indicators",
  subtitle = "California Schools, 2000"
)
```

For domain estimates, the table detects group columns automatically:

```{r workflow-table-domain, eval = requireNamespace("gt", quietly = TRUE)}
workflow_table(by_school)
```

Export to any format supported by `gt::gtsave()`:

```{r workflow-table-export, eval = FALSE}
tbl <- workflow_table(estimates)
gt::gtsave(tbl, "estimates.html")
gt::gtsave(tbl, "estimates.docx")
gt::gtsave(tbl, "estimates.png")
```

## Next Steps

- **[Creating and Publishing Recipes](recipes.html)** -- Build reproducible transformation pipelines
- **[Survey Designs and Validation](complex-designs.html)** -- Stratification, clustering, replicate weights
- **[Case Study: ECH](ech-case-study.html)** -- Complete labor market analysis with estimation