---
title: "surveycore vs. survey and srvyr"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
bibliography: references.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{surveycore vs. survey and srvyr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
library(surveycore)
knitr::opts_chunk$set(comment = "#>")

has_survey <- requireNamespace("survey", quietly = TRUE)
has_srvyr <- requireNamespace("srvyr", quietly = TRUE)

if (has_survey) {
  library(survey)
  data(api) # loads apisrs, apistrat, apiclus1
}
if (has_srvyr) suppressMessages(library(srvyr))
```

If you're coming from `survey` or `srvyr`, this vignette is a side-by-side reference showing how surveycore maps to the workflows you already know. Every section shows the same task three ways: `survey`, `srvyr`, and `surveycore`.

**Two things to know upfront:**

- surveycore is **not** a wrapper around `survey`. Its variance code is vendored from `survey`, so every estimate surveycore produces matches `survey` output numerically, but `survey` is not a runtime dependency.
- `srvyr` added tidyverse syntax on top of `survey`. surveycore rethinks the interface further: tidy-select constructors, dedicated analysis functions, automatic label handling from haven-imported data, and richer tibble output.

**Constructor comparisons** use the `api` dataset from the `survey` package, the same reference dataset as the [srvyr comparison vignette](https://CRAN.R-project.org/package=srvyr), so cross-referencing is easy.

**Analysis comparisons** use `ns_wave1` (Nationscape Wave 1, Democracy Fund + UCLA) from surveycore's bundled data.

---

## 1. Creating Survey Design Objects

### 1.1 Simple Random Sample

`apisrs` is a simple random sample of California schools.

**survey**

```{r srs-survey, eval=has_survey}
srs_sv <- svydesign(ids = ~1, fpc = ~fpc, weights = ~pw, data = apisrs)
srs_sv
```

**srvyr**

```{r srs-srvyr, eval=has_survey && has_srvyr}
srs_srvyr <- apisrs |>
  as_survey_design(ids = 1, fpc = fpc, weights = pw)
srs_srvyr
```

**surveycore**

```{r srs-sc, eval=has_survey}
srs_sc <- surveycore::as_survey(apisrs, weights = pw, fpc = fpc)
srs_sc
```

`ids = ~1` is `survey`'s idiom for "no clusters", which is not immediately obvious to new users. `as_survey()` without `ids` or `strata` creates an SRS design directly, making the design type clear from context.

### 1.2 Stratified Design

`apistrat` is stratified by school type (`stype`: E = elementary, M = middle, H = high school).

**survey**

```{r strat-survey, eval=has_survey}
strat_sv <- svydesign(
  ids = ~1,
  strata = ~stype,
  weights = ~pw,
  fpc = ~fpc,
  data = apistrat
)
strat_sv
```

**srvyr**

```{r strat-srvyr, eval=has_survey && has_srvyr}
strat_srvyr <- apistrat |>
  as_survey_design(strata = stype, weights = pw, fpc = fpc)
strat_srvyr
```

**surveycore**

```{r strat-sc, eval=has_survey}
strat_sc <- surveycore::as_survey(apistrat, strata = stype, weights = pw, fpc = fpc)
strat_sc
```

### 1.3 Cluster Design

`apiclus1` is a one-stage cluster sample with school districts (`dnum`) as the primary sampling units.

**survey**

```{r clus-survey, eval=has_survey}
clus_sv <- svydesign(ids = ~dnum, fpc = ~fpc, weights = ~pw, data = apiclus1)
clus_sv
```

**srvyr**

```{r clus-srvyr, eval=has_survey && has_srvyr}
clus_srvyr <- apiclus1 |>
  as_survey_design(ids = dnum, fpc = fpc, weights = pw)
clus_srvyr
```

**surveycore**

```{r clus-sc, eval=has_survey}
clus_sc <- surveycore::as_survey(apiclus1, ids = dnum, fpc = fpc, weights = pw)
clus_sc
```

### 1.4 Replicate Weights

Replicate weights are common in government surveys like the ACS PUMS (80 successive-difference replicates) and Pew's Jewish Americans Study (100 JK1 replicates). Both datasets are bundled with surveycore.
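
For readers new to replicate weights, the underlying idea can be sketched in base R: re-estimate the statistic once per replicate weight column, then scale the squared deviations from the full-sample estimate. The toy data, the column names (`w`, `rw1` to `rw3`), and the helper `rep_variance()` are hypothetical and not part of any of the three packages. The scale factors are the conventional ones: (R − 1)/R for JK1 and 4/R for ACS-style successive-difference replicates.

```r
# Toy data: 6 respondents in 3 jackknife groups of 2 (hypothetical values).
# Each replicate zeroes out one group and scales the rest by 3/2.
toy <- data.frame(
  y   = c(10, 12, 9, 15, 11, 13),
  w   = rep(1, 6),
  rw1 = c(0, 0, 1.5, 1.5, 1.5, 1.5),
  rw2 = c(1.5, 1.5, 0, 0, 1.5, 1.5),
  rw3 = c(1.5, 1.5, 1.5, 1.5, 0, 0)
)

# Replicate variance of a weighted mean: scale * sum((theta_r - theta)^2)
rep_variance <- function(data, y, weight, rep_cols, scale) {
  theta_full <- weighted.mean(data[[y]], data[[weight]])
  theta_reps <- vapply(
    rep_cols,
    function(col) weighted.mean(data[[y]], data[[col]]),
    numeric(1)
  )
  scale * sum((theta_reps - theta_full)^2)
}

# Select replicate columns by pattern, as the constructors in this section do
rep_cols <- grep("^rw[0-9]+$", names(toy), value = TRUE)
n_reps <- length(rep_cols)

v_jk1 <- rep_variance(toy, "y", "w", rep_cols, scale = (n_reps - 1) / n_reps)
se_jk1 <- sqrt(v_jk1)
se_jk1
```

With ACS-style replicates, the same helper would be called with `scale = 4 / n_reps` instead.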

The key interface difference: `survey` selects replicate columns with a raw regex string; surveycore (like `srvyr`) uses tidyselect, the same composable selection language used throughout the tidyverse.

**ACS PUMS Wyoming: successive-difference replicates**

```{r repwt-acs-survey, eval=has_survey}
acs_sv <- svrepdesign(
  data = acs_pums_wy,
  weights = ~pwgtp,
  repweights = "pwgtp[0-9]+", # regex string
  type = "successive-difference",
  combined.weights = TRUE
)
acs_sv
```

```{r repwt-acs-srvyr, eval=has_survey && has_srvyr}
acs_srvyr <- acs_pums_wy |>
  as_survey_rep(
    weights = pwgtp,
    repweights = matches("^pwgtp[0-9]+$"), # tidyselect
    type = "successive-difference",
    combined_weights = TRUE
  )
acs_srvyr
```

```{r repwt-acs-sc}
acs_sc <- as_survey_replicate(
  acs_pums_wy,
  weights = pwgtp,
  repweights = tidyselect::matches("^pwgtp[0-9]+$"), # tidyselect
  type = "successive-difference"
)
acs_sc
```

**Pew Jewish Americans 2020: JK1 jackknife replicates**

```{r repwt-pew-sc}
pew_sc <- as_survey_replicate(
  pew_jewish_2020,
  weights = extweight,
  repweights = extweight1:extweight100,
  type = "JK1"
)
pew_sc
```

### 1.5 Calibrated / Non-Probability Samples

`ns_wave1` is the Nationscape Wave 1 survey, a non-probability quota panel with raking weights calibrated to ACS demographics and 2016 vote.

`survey` and `srvyr` have no dedicated constructor for calibrated or non-probability designs. The design intent is lost in the code:

```{r calib-survey, eval=has_survey}
# No way to signal this is calibrated or non-probability
ns_sv <- svydesign(ids = ~1, weights = ~weight, data = ns_wave1)
```

```{r calib-srvyr, eval=has_survey && has_srvyr}
ns_srvyr <- ns_wave1 |>
  as_survey_design(weights = weight)
```

```{r calib-sc}
# as_survey_nonprob() makes the design type explicit
ns_sc <- as_survey_nonprob(ns_wave1, weights = weight)
ns_sc
```

`as_survey_nonprob()` preserves the distinction in code, output, and documentation.
Standard errors are approximate: they assume the calibration weights produce approximately correct variance estimates [@elliott2017].

### 1.6 Two-Phase Designs

Two-phase designs are uncommon. surveycore's `as_survey_twophase()` matches `survey::twophase()` for the Breslow-Cain variance estimator [@breslow1988]. For a full worked example using `survival::nwtco`, see `vignette("creating-survey-objects")`.

### 1.7 Constructor Summary

| Design | survey | srvyr | surveycore |
|--------|--------|-------|------------|
| SRS | `svydesign(ids=~1, ...)` | `as_survey_design(ids=1, ...)` | `as_survey(...)` (no `ids`/`strata`) |
| Stratified | `svydesign(strata=~s, ...)` | `as_survey_design(strata=s, ...)` | `as_survey(..., strata=s)` |
| Cluster | `svydesign(ids=~d, ...)` | `as_survey_design(ids=d, ...)` | `as_survey(..., ids=d)` |
| Replicate wts | `svrepdesign(repweights="regex")` | `as_survey_rep(repweights=matches(...))` | `as_survey_replicate(repweights=matches(...))` |
| Calibrated/NPS | `svydesign(ids=~1, weights=~w)` ⚠ | `as_survey_design(weights=w)` ⚠ | `as_survey_nonprob(...)` |
| Two-phase | `twophase(...)` | `as_survey_twophase(...)` | `as_survey_twophase(...)` |

⚠ No dedicated non-probability constructor; design intent is not preserved.

---

## 2. Summary Statistics

The sections below use `ns_sc` (already created above) alongside the equivalent `survey` and `srvyr` designs. The **label contrast** (raw integer codes in `survey`/`srvyr` vs. human-readable labels in surveycore) is the recurring theme. `ns_wave1` was imported with `haven` labels intact; surveycore resolves them automatically.

### 2.1 Weighted Means (Grouped)

Estimated discrimination experienced by Black Americans, broken out by party identification (`pid3`).

**survey**: group values appear as raw codes (1, 2, 3, 4)

```{r means-survey, eval=has_survey}
svyby(~discrimination_blacks, ~pid3, ns_sv, svymean, na.rm = TRUE)
```

**srvyr**: also raw codes unless `pid3` is manually factored first

```{r means-srvyr, eval=has_survey && has_srvyr}
ns_srvyr |>
  group_by(pid3) |>
  summarise(m = survey_mean(discrimination_blacks, vartype = "ci", na.rm = TRUE))
```

**surveycore**: "Democrat", "Republican", "Independent", "Something else" from the haven labels, automatically

```{r means-sc}
get_means(ns_sc, discrimination_blacks, group = pid3)
```

### 2.2 Proportions / Frequency Tables

Distribution of willingness to consider voting for Trump (`consider_trump`).

**survey**: `svymean()` on a factor produces row names like `factor(consider_trump)1`, `factor(consider_trump)2`, `factor(consider_trump)999`

```{r freqs-survey, eval=has_survey}
svymean(~factor(consider_trump), ns_sv, na.rm = TRUE)
```

**srvyr**

```{r freqs-srvyr, eval=has_survey && has_srvyr}
ns_srvyr |>
  group_by(consider_trump) |>
  summarise(pct = survey_mean(na.rm = TRUE))
```

**surveycore**: `consider_trump` column shows "Yes", "No", "Don't know"

```{r freqs-sc}
get_freqs(ns_sc, consider_trump)
```

### 2.3 Population Totals

`ns_wave1` uses calibration weights scaled to the sample size (weights sum to 6,422, the number of respondents).
`get_totals()` with no variable argument returns the estimated population size; here, it confirms the calibration:

**survey**: `svytotal(~1, design)` is not supported; the sum of weights gives the estimated N, and `svytotal()` requires a real variable

```{r totals-survey, eval=has_survey}
sum(weights(ns_sv))                 # estimated population N
svytotal(~age, ns_sv, na.rm = TRUE) # total of a continuous variable
```

**srvyr**: `survey_total(1)` computes estimated N

```{r totals-srvyr, eval=has_survey && has_srvyr}
ns_srvyr |> summarise(n_pop = survey_total(1)) # estimated N
ns_srvyr |> summarise(age_total = survey_total(age, na.rm = TRUE))
```

**surveycore**

```{r totals-sc}
get_totals(ns_sc)      # estimated N (no x argument)
get_totals(ns_sc, age) # total of a continuous variable
```

For a design with probability weights that sum to the actual population (like the Pew Jewish Americans study), `get_totals()` returns the estimated population count in millions:

```{r totals-pew}
get_totals(pew_sc)
```

### 2.4 Quantiles

Weighted age distribution of Nationscape respondents.

**survey**

```{r quantiles-survey, eval=has_survey}
svyquantile(~age, ns_sv, quantiles = c(0.25, 0.5, 0.75), na.rm = TRUE)
```

**srvyr**

```{r quantiles-srvyr, eval=has_survey && has_srvyr}
ns_srvyr |>
  summarise(q = survey_quantile(age, c(0.25, 0.5, 0.75), na.rm = TRUE))
```

**surveycore**: Woodruff (1952) confidence intervals, guaranteed to respect the data range

```{r quantiles-sc}
get_quantiles(ns_sc, age)
```

### 2.5 Ratios

`api00` / `api99` is a natural ratio: Academic Performance Index in 2000 relative to 1999. We use `apisrs` here because it provides a clear probability design where the ratio estimator is unambiguous.

**survey**: positional argument order requires knowing which formula is numerator vs.
denominator

```{r ratios-survey, eval=has_survey}
svyratio(~api00, ~api99, srs_sv)
```

**srvyr**

```{r ratios-srvyr, eval=has_survey && has_srvyr}
srs_srvyr |>
  summarise(ratio = survey_ratio(api00, api99))
```

**surveycore**: named arguments make direction self-documenting

```{r ratios-sc, eval=has_survey}
get_ratios(srs_sc, numerator = api00, denominator = api99)
```

`numerator =` / `denominator =` remove the ambiguity present in `svyratio(~y, ~x, design)`.

### 2.6 Correlations

Pearson correlation between Trump and Biden favorability (`cand_favorability_*` is a 1–4 scale; 999 codes respondents who haven't heard enough, and those rows are filtered out below).

```{r corr-setup}
# Pre-filter non-substantive responses before creating the design
ns_corr <- ns_wave1[
  !is.na(ns_wave1$cand_favorability_trump) &
    ns_wave1$cand_favorability_trump != 999 &
    !is.na(ns_wave1$cand_favorability_biden) &
    ns_wave1$cand_favorability_biden != 999,
]
ns_corr_sc <- as_survey_nonprob(ns_corr, weights = weight)
```

**survey** (via `jtools::svycor()`): matrix output, no confidence intervals

```{r corr-survey, eval=has_survey && requireNamespace("jtools", quietly = TRUE)}
ns_corr_sv <- svydesign(ids = ~1, weights = ~weight, data = ns_corr)
jtools::svycor(~cand_favorability_trump + cand_favorability_biden, ns_corr_sv)
```

**srvyr**: no dedicated `survey_corr()` verb; users must fall back to `survey`

**surveycore**: long tibble with Fisher-Z confidence intervals (bounds guaranteed in [−1, 1])

```{r corr-sc}
get_corr(ns_corr_sc, c(cand_favorability_trump, cand_favorability_biden))
```

`svycor()` returns a matrix with no CIs. `get_corr()` returns a tidy tibble with Fisher-Z confidence intervals. srvyr has no `survey_corr()` verb at all, so users fall back to `survey` directly.

---

## 3. Controlling Uncertainty Output

All surveycore analysis functions share a `variance` argument that controls which uncertainty columns appear. In `survey`, you call a separate function per metric. In `srvyr`, the comparison below repeats the `survey_mean()` call inside `summarise()` for each type.
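
As background for the comparison that follows, most of the uncertainty columns in question (variance, coefficient of variation, margin of error, confidence interval) are simple functions of a point estimate and its standard error. A minimal base R sketch, assuming a normal approximation for the interval (the packages may use t-based intervals with design degrees of freedom instead); the input numbers and the helper `uncertainty_from_se()` are hypothetical:

```r
# Derive common uncertainty metrics from an estimate and its standard error.
# Assumes a normal approximation; this is an illustration, not package code.
uncertainty_from_se <- function(estimate, se, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2) # two-sided critical value
  moe <- z * se                        # margin of error
  data.frame(
    estimate = estimate,
    se       = se,
    var      = se^2,          # variance is SE squared
    cv       = se / estimate, # coefficient of variation
    moe      = moe,
    ci_low   = estimate - moe,
    ci_high  = estimate + moe
  )
}

out <- uncertainty_from_se(estimate = 45.2, se = 0.8)
out
```

The design effect is the exception: it also needs the variance under simple random sampling, so it cannot be derived from the standard error alone.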

**survey**: separate call per uncertainty type

```{r uncertainty-survey, eval=has_survey}
m <- svymean(~age, ns_sv, na.rm = TRUE)
m          # SE only in the estimate
confint(m) # CI: separate call
cv(m)      # CV: separate call
svymean(~age, ns_sv, deff = TRUE, na.rm = TRUE) # DEFF: different return structure
```

**srvyr**: one `survey_mean()` call per type; the variable is estimated multiple times

```{r uncertainty-srvyr, eval=has_survey && has_srvyr}
ns_srvyr |>
  summarise(
    m_se = survey_mean(age, vartype = "se", na.rm = TRUE),
    m_ci = survey_mean(age, vartype = "ci", na.rm = TRUE),
    m_cv = survey_mean(age, vartype = "cv", na.rm = TRUE),
    m_deff = survey_mean(age, deff = TRUE, na.rm = TRUE)
  )
```

**surveycore**: one call, any combination of metrics

```{r uncertainty-sc}
get_means(ns_sc, age, variance = c("se", "ci", "cv", "deff"))
```

Set `variance = NULL` to return point estimates and sample counts only:

```{r uncertainty-null}
get_means(ns_sc, age, variance = NULL)
```

Available `variance` codes:

| Code | What it returns |
|------|-----------------|
| `"se"` | Standard error |
| `"ci"` | Confidence interval: `ci_low`, `ci_high` |
| `"var"` | Variance (SE²) |
| `"cv"` | Coefficient of variation (SE / estimate) |
| `"moe"` | Margin of error at `conf_level` |
| `"deff"` | Design effect (complex / SRS variance) |

The `conf_level` argument controls the level for `"ci"` and `"moe"`. Default is `0.95`; for a 90% interval: `get_means(ns_sc, age, conf_level = 0.9)`.

---

## 4. Features With No survey / srvyr Equivalent

### 4.1 Automatic Value Labels

`ns_wave1` was imported with `haven` labels intact. surveycore resolves them automatically, with no manual recoding required.
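
For context, the manual recoding this saves can be sketched in base R. `haven` stores value labels in a `labels` attribute, a named vector whose names are the label text and whose values are the codes. The toy vector and the helper `apply_value_labels()` below are hypothetical; the label text mirrors the `pid3` labels mentioned in this vignette, but the code-to-label mapping is an assumption, and this is not surveycore's internal implementation.

```r
# A stand-in for a haven-labelled column (hypothetical codes and mapping)
pid3_toy <- structure(
  c(1, 2, 2, 4, 3, 1),
  labels = c(
    Democrat = 1, Republican = 2, Independent = 3, `Something else` = 4
  )
)

# Convert codes to a factor using the `labels` attribute, if present
apply_value_labels <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  if (is.null(labs)) {
    return(x) # no value labels to resolve
  }
  factor(as.vector(x), levels = unname(labs), labels = names(labs))
}

pid3_labelled <- apply_value_labels(pid3_toy)
table(pid3_labelled)
```

With `survey` or `srvyr`, a step like this would typically be applied to the data frame before the design object is created.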

**survey / srvyr**: group column values are raw integer codes

```{r labels-survey, eval=has_survey}
# pid3 values: 1, 2, 3, 4 (the reader must consult the codebook)
svyby(~discrimination_blacks, ~pid3, ns_sv, svymean, na.rm = TRUE)
```

**surveycore**: "Democrat", "Republican", "Independent", "Something else"

```{r labels-sc}
get_means(ns_sc, discrimination_blacks, group = pid3)
```

Opt out with `label_values = FALSE` to see raw codes:

```{r labels-optout}
get_means(ns_sc, discrimination_blacks, group = pid3, label_values = FALSE)
```

### 4.2 Multiple Variables in One Call

`ns_wave1` includes a battery of 13 news source items (`news_sources_facebook`, `news_sources_cnn`, …, `news_sources_other`). Analyzing all at once requires a loop in `survey` and `srvyr`; surveycore stacks them in a single call.

**survey / srvyr**: must loop; output is a list that the user binds manually

```{r multi-survey, eval=has_survey}
news_vars <- c(
  "news_sources_facebook", "news_sources_cnn", "news_sources_fox",
  "news_sources_npr", "news_sources_new_york_times"
)
results_sv <- lapply(news_vars, function(v) {
  f <- as.formula(paste0("~", v))
  svymean(f, ns_sv, na.rm = TRUE)
})
# Results are a list; the user must bind rows and add a name column manually
do.call(rbind, lapply(seq_along(results_sv), function(i) {
  data.frame(name = news_vars[[i]], mean = unname(coef(results_sv[[i]])))
}))
```

**surveycore**: one call; a `name` column identifies each item; variable labels are applied automatically

```{r multi-sc}
get_freqs(
  ns_sc,
  c(news_sources_facebook:news_sources_other)
)
```

### 4.3 Minimum Cell Size Warnings

`survey` and `srvyr` return estimates for tiny cells silently, so the user may not notice that a group has only 8 respondents. surveycore warns when any unweighted cell count falls below `min_cell_n` (default: 30).
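
The check itself is easy to picture. A minimal base R sketch, assuming the rule is "flag any group whose unweighted count is below a threshold"; the helper `small_cells()` and the toy vector are hypothetical, not surveycore's implementation:

```r
# Return the unweighted counts of groups that fall below the threshold
small_cells <- function(group, min_cell_n = 30L) {
  counts <- table(group, useNA = "no")
  counts[counts < min_cell_n]
}

# Toy grouping variable: 8, 15, and 200 respondents per group
toy_group <- rep(c("A", "B", "C"), times = c(8, 15, 200))

flagged <- small_cells(toy_group)
if (length(flagged) > 0) {
  message(
    "Small cells (n < 30): ",
    paste0(names(flagged), " (n = ", as.integer(flagged), ")", collapse = ", ")
  )
}
```

Users of `survey` or `srvyr` would need to run a check like this by hand before trusting grouped estimates.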

```{r min-cell}
# Construct a design with deliberately small cells
set.seed(42) # make the simulated values reproducible
small_df <- data.frame(
  group = rep(c("A", "B", "C"), c(8, 15, 200)),
  x = rnorm(223),
  w = 1
)
small_svy <- surveycore::as_survey(small_df, weights = w)
get_means(small_svy, x, group = group)
```

Suppress the warning when small cells are expected:

```{r min-cell-suppress, eval=FALSE}
get_means(small_svy, x, group = group, min_cell_n = 0L)
```

### 4.4 Weighted Sample Size

In `survey` and `srvyr`, getting both the unweighted and estimated population count for each cell requires a separate `svytotal()` call. surveycore adds it with one argument:

**survey**: extra call for weighted N

```{r n-weighted-survey, eval=has_survey}
# Proportions by group (unweighted n not shown in output)
svyby(~factor(consider_trump), ~pid3, ns_sv, svymean, na.rm = TRUE)
# Estimated weighted N per group requires a separate call
svyby(~as.numeric(!is.na(consider_trump)), ~pid3, ns_sv, svytotal, na.rm = TRUE)
```

**surveycore**: one argument

```{r n-weighted-sc}
get_freqs(ns_sc, consider_trump, group = pid3, n_weighted = TRUE)
```

The `n_weighted` column is the sum of weights within each cell, that is, the estimated population size that cell represents.

### 4.5 Metadata-Rich Results (`.meta`)

surveycore attaches a `.meta` attribute to every result tibble. It contains the variable label, value labels, and question preface for each focal and grouping variable: everything needed to build a publication-ready table without consulting the codebook separately.

```{r meta}
result <- get_means(ns_sc, discrimination_blacks, group = pid3)

# Variable label for the focal variable
attr(result, ".meta")$x$discrimination_blacks$variable_label

# Value labels for the grouping variable
attr(result, ".meta")$group$pid3$value_labels
```

In `survey` and `srvyr`, metadata is not attached to results, so label information is lost after estimation.

---

## 5. Notable Differences

| | survey | srvyr | surveycore |
|--|--------|-------|------------|
| **Output format** | S3 `svystat` / matrix | Tibble with `_se`/`_low`/`_upp` suffix columns | S3 tibble subclass with CI columns by default |
| **Interface** | `~formula` throughout | Mixed: tidy constructor, `survey_*()` calls inside `summarise()` | Bare names throughout (tidy-select) |
| **Value labels** | Not applied | Not applied | Applied automatically from `haven` attributes |
| **Multiple variables** | Loop required | Loop required | `c(x, y, z)` in one call |
| **Min-cell warning** | None | None | Default `min_cell_n = 30L` |
| **Weighted N** | Separate call | Separate call | `n_weighted = TRUE` |
| **Correlation CIs** | None (`svycor()`) | No verb | Fisher-Z CIs via `get_corr()` |
| **Non-probability design** | No dedicated constructor | No dedicated constructor | `as_survey_nonprob()` |
| **Manipulation** | Pre/post construction | Bundled via pipe | `surveytidy` (companion package) |
| **Runtime `survey` dep.** | Is `survey` | Wraps `survey` | Vendored; `survey` not required |

---

## 6. Function Reference Table

| Task | survey | srvyr | surveycore |
|------|--------|-------|------------|
| SRS design | `svydesign(ids=~1, ...)` | `as_survey_design(ids=1, ...)` | `as_survey(...)` (no `ids`/`strata`) |
| Stratified design | `svydesign(strata=~s, ...)` | `as_survey_design(strata=s, ...)` | `as_survey(..., strata=s)` |
| Cluster design | `svydesign(ids=~d, ...)` | `as_survey_design(ids=d, ...)` | `as_survey(..., ids=d)` |
| Replicate weights | `svrepdesign(repweights="regex")` | `as_survey_rep(repweights=matches(...))` | `as_survey_replicate(repweights=matches(...))` |
| Calibrated/NPS | `svydesign(ids=~1, weights=~w)` ⚠ | `as_survey_design(weights=w)` ⚠ | `as_survey_nonprob(...)` |
| Two-phase | `twophase(...)` | `as_survey_twophase(...)` | `as_survey_twophase(...)` |
| Weighted mean | `svymean(~x, d)` | `summarise(survey_mean(x))` | `get_means(d, x)` |
| Grouped mean | `svyby(~x, ~g, d, svymean)` | `group_by(g) \|> summarise(...)` | `get_means(d, x, group=g)` |
| Proportions | `svymean(~factor(x), d)` | `group_by(x) \|> summarise(survey_mean())` | `get_freqs(d, x)` |
| Total | `svytotal(~x, d)` | `summarise(survey_total(x))` | `get_totals(d, x)` |
| Population N | `sum(weights(d))` ⚠ | `summarise(survey_total(1))` | `get_totals(d)` |
| Quantiles | `svyquantile(~x, d, q)` | `summarise(survey_quantile(x, q))` | `get_quantiles(d, x, probs=q)` |
| Ratio | `svyratio(~y, ~x, d)` | `summarise(survey_ratio(y, x))` | `get_ratios(d, numerator=y, denominator=x)` |
| Correlation | `jtools::svycor(~x+y, d)` ⚠ no CI | ✗ no verb | `get_corr(d, c(x, y))` with CI |
| Multiple variables | Loop + bind | Loop + bind | `get_means(d, c(x, y, z))` |
| Value labels | Manual recode | Manual recode | `label_values = TRUE` (default) |
| Min-cell warning | ✗ | ✗ | `min_cell_n = 30L` (default) |
| Weighted N | Separate call | Separate call | `n_weighted = TRUE` |
| Domain filter | `subset(d, cond)` | `filter(cond)` | `filter(cond)` (`surveytidy`) |
| Mutate | Modify df, recreate | `mutate(...)` | `mutate(...)` (`surveytidy`) |
| Group by | `svyby(...)` | `group_by(...)` | `group_by(...)` (`surveytidy`) or `group=` arg |

⚠ = partial / workaround; ✗ = no equivalent

---

## 7. Learning More

- `vignette("getting-started")`: full surveycore overview with worked examples
- `vignette("creating-survey-objects")`: all five constructors, including two-phase designs and the `nest` argument
- [srvyr comparison vignette](https://CRAN.R-project.org/package=srvyr): the original side-by-side that this vignette is modeled on
- @lumley2010: the definitive reference on complex survey analysis in R