---
title: "Choosing Weights and Validating ML"
author: "Maciej Nasinski"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Choosing Weights and Validating ML}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```

This vignette is a decision guide for choosing and checking weights in `cat2cat()`. Read it when you want to answer one of these questions:

- Are naive and frequency weights telling the same story?
- Is ML worth trying at all?
- If ML is used, does it improve on the frequency baseline?
- What should I do when different weight methods disagree?
- How should I handle failed ML predictions?

If you only need the basic two-period workflow, go back to [Get Started](cat2cat.html). If you need multi-period, panel, aggregated, or regression workflows, continue to [Advanced Workflows](cat2cat_advanced.html).

```{r load-data}
library(cat2cat)
library(dplyr)
library(tidyr)
library(e1071)
library(randomForest)

data(occup, package = "cat2cat")
data(trans, package = "cat2cat")

occup_2008 <- occup[occup$year == 2008, ]
occup_2010 <- occup[occup$year == 2010, ]
occup_2012 <- occup[occup$year == 2012, ]
```

## Step 1: Understand the competing weight assumptions

`cat2cat` offers several ways to assign probability weights to replicated observations. Each method encodes a different **distributional assumption** about how ambiguous observations split across candidate categories. When a downstream estimand depends on the mapped category, this is the identifying assumption for that estimand - so always check sensitivity.

**Naive weights** (`wei_naive_c2c`) are always computed. Each replicated observation gets uniform probability $1/k$ where $k$ is the number of candidate categories.

- *Assumption*: All candidates equally likely (maximum entropy / uninformative prior)
- *Requires*: Only the mapping table - no data from either period
- *Use when*: No information favoring any candidate, or as a robustness lower bound

**Frequency-based weights** (`wei_freq_c2c`) are the default. They use category counts from the base period.

- *Assumption*: Ambiguous observations distribute like the base period population
- *Requires*: Observed counts in base period (falls back to naive if all zero)
- *Use when*: Base period is large and representative; ambiguous cases resemble the general population

**ML weights** (`wei_knn_c2c`, `wei_lda_c2c`, `wei_rf_c2c`, `wei_nb_c2c`) use individual features to predict category membership.

- *Assumption*: Features (age, education, etc.) predict true category: $P(j \mid X, g)$
- *Requires*: Training data with both category labels and predictive features
- *Use when*: Features are informative - verify with `cat2cat_ml_run()`

Available ML methods:

- **knn**: k-Nearest Neighbours. A non-parametric method that handles non-linear boundaries. Sensitive to the choice of `k`.
- **lda**: Linear Discriminant Analysis. Fast, interpretable. Assumes multivariate normality and equal covariance.
- **rf**: Random Forest. Handles interactions well. Slower, needs `ntree` tuning.
- **nb**: Naive Bayes via `e1071`. Fast, useful after numeric/logical/factor preprocessing. Assumes conditional independence of features.

ML features must be numeric, logical, or factor columns. Factor columns are one-hot encoded automatically using levels observed in the training data and the target period. Character columns are not encoded automatically; convert them to factors first if they represent categories.
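Since character columns are not encoded automatically, one way to prepare them is to convert every character feature to a factor before building the `ml` list. The snippet below is a minimal base-R sketch on a made-up data frame; the column names (`age`, `region`) are illustrative assumptions and are not part of the `occup` data.

```r
# Toy data frame; `region` is a hypothetical character feature.
df <- data.frame(
  age = c(25, 40, 33),
  region = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

# Find character columns and convert each one to a factor.
chr_cols <- names(df)[vapply(df, is.character, logical(1))]
df[chr_cols] <- lapply(df[chr_cols], factor)

# Every column is now numeric, logical, or factor.
vapply(df, class, character(1))
```

Apply the same conversion to both the training data and the target-period data so that a categorical feature has the same type in each.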

You can run multiple methods at once and compare or combine them:

```{r mixed-ml-weights}
occup_2_mix <- cat2cat(
  data = list(
    old = occup_2008, new = occup_2010,
    cat_var = "code", time_var = "year"
  ),
  mappings = list(trans = trans, direction = "backward"),
  ml = list(
    data = occup_2010,
    cat_var = "code",
    method = c("knn", "rf", "lda", "nb"),
    features = c("age", "sex", "edu", "exp", "parttime", "salary"),
    args = list(k = 10, ntree = 50),
    on_fail = "na"
  )
)
```

Correlations between weight methods:

```{r weight-correlations}
occup_2_mix$old %>%
  select(wei_knn_c2c, wei_rf_c2c, wei_lda_c2c, wei_nb_c2c, wei_freq_c2c, wei_naive_c2c) %>%
  cor(use = "pairwise.complete.obs")
```

### If ML fails on some rows: `on_fail` and `fail_warn`

Sometimes ML probabilities cannot be produced for a subset of replicated rows (for example, because of incomplete target features or method-specific prediction failures). `cat2cat()` exposes explicit policy controls in `ml`:

- `on_fail = "freq"` (default): failed ML rows are filled with `wei_freq_c2c`.
- `on_fail = "naive"`: failed ML rows are filled with `wei_naive_c2c`.
- `on_fail = "na"`: failed ML rows are kept as `NA`.
- `on_fail = "error"`: stop immediately when failed rows are detected.
- `fail_warn = TRUE` (default): warn with affected rows/observations per method.
- `fail_warn = FALSE`: suppress these warnings.

Important: this failure accounting is specific to `cat2cat()` and the weight columns it constructs (`wei_*_c2c`). It is different from the "SKIPPED GROUPS" line of `cat2cat_ml_run()`, which reports mapping groups that were not evaluated in holdout diagnostics (single category, too few observations, or a method fit/predict error for that group).
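To make the row-level policies concrete, the following toy sketch mimics by hand what the `"na"`, `"freq"`, and `"naive"` policies describe. It is an illustration under assumed data, not the package's internal code: the data frame and its values are made up, and only the column names follow the `wei_*_c2c` / `index_c2c` convention of `cat2cat()` output.

```r
# Toy replicated rows: two original observations, two candidates each.
# The kNN weights for the second observation are missing (a "failed" prediction).
toy <- data.frame(
  index_c2c = c(1, 1, 2, 2),
  wei_freq_c2c = c(0.7, 0.3, 0.4, 0.6),
  wei_naive_c2c = c(0.5, 0.5, 0.5, 0.5),
  wei_knn_c2c = c(0.9, 0.1, NA, NA)
)

# Rows that an on_fail = "na" policy would leave as NA.
failed <- is.na(toy$wei_knn_c2c)
sum(failed)

# Hand-rolled equivalents of the "freq" and "naive" fallbacks.
knn_freq_fill <- ifelse(failed, toy$wei_freq_c2c, toy$wei_knn_c2c)
knn_naive_fill <- ifelse(failed, toy$wei_naive_c2c, toy$wei_knn_c2c)

# After filling, weights still sum to one within each original observation.
tapply(knn_freq_fill, toy$index_c2c, sum)
```

Counting `NA` values per weight column in this way is a quick check of how many rows a method failed on when running with `on_fail = "na"`.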
```{r ml-failure-policy, eval=FALSE}
ml_setup <- list(
  data = bind_rows(occup_2010, occup_2012),
  cat_var = "code",
  method = c("knn", "rf", "lda"),
  features = c("age", "sex", "edu", "exp", "parttime", "salary"),
  args = list(k = 10, ntree = 50),
  on_fail = "freq",  # default policy
  fail_warn = TRUE   # default reporting
)

# strict mode for QA pipelines
ml_strict <- ml_setup
ml_strict$on_fail <- "error"

# diagnostic mode to inspect failures directly
ml_diag <- ml_setup
ml_diag$on_fail <- "na"
ml_diag$fail_warn <- FALSE
```

Ensemble weights with `cross_c2c()` and pruning with `prune_c2c()`:

```{r ensemble-prune}
occup_old_mix <- occup_2_mix$old %>%
  cross_c2c(.) %>%
  prune_c2c(., column = "wei_cross_c2c", method = "nonzero")
```

## Step 2: Check whether conclusions are sensitive to the weight choice

Different weight methods affect regression coefficients when you filter to a specific occupation group and combine both periods. This is the proper sensitivity analysis: subjects from the base period (new, no replication) plus subjects from the target period (old, weighted by probability of belonging to this group).

### Compare weight methods on the same mapped data

Run backward mapping with all ML methods:

```{r sensitivity-result}
result_all <- cat2cat(
  data = list(old = occup_2008, new = occup_2010, cat_var = "code", time_var = "year"),
  mappings = list(trans = trans, direction = "backward"),
  ml = list(
    data = occup_2010,
    cat_var = "code",
    method = c("knn", "rf", "lda", "nb"),
    features = c("age", "sex", "edu", "exp", "parttime", "salary"),
    args = list(k = 10, ntree = 50)
  )
)
```

**Weighted counts per group** - compare how weight methods redistribute observations:

```{r sensitivity-counts}
weight_cols <- c(
  "wei_naive_c2c", "wei_freq_c2c", "wei_knn_c2c",
  "wei_rf_c2c", "wei_lda_c2c", "wei_nb_c2c"
)

# Pick groups with high replication
top_groups <- result_all$old %>%
  filter(rep_c2c > 1) %>%
  count(g_new_c2c, sort = TRUE) %>%
  head(6) %>%
  pull(g_new_c2c)

# Weighted counts from OLD period (replicated)
old_counts <- lapply(weight_cols, function(wcol) {
  result_all$old %>%
    filter(g_new_c2c %in% top_groups) %>%
    group_by(g_new_c2c) %>%
    summarise(n = sum(.data[[wcol]]), .groups = "drop")
}) %>%
  setNames(gsub("wei_|_c2c", "", weight_cols)) %>%
  bind_rows(.id = "method") %>%
  tidyr::pivot_wider(names_from = method, values_from = n)

# Counts from NEW period (no replication, exact)
new_counts <- result_all$new %>%
  filter(code %in% top_groups) %>%
  count(code, name = "new_period") %>%
  rename(g_new_c2c = code)

# Combine for comparison
left_join(old_counts, new_counts, by = "g_new_c2c")
```

The `new_period` column shows the actual counts in 2010. The other columns show how the 2008 observations are redistributed under each weight method. `naive` assigns uniform probability ($1/k$ for $k$ candidate categories), `freq` uses base period frequencies, and the ML methods (`knn`, `rf`, `lda`, `nb`) use predicted probabilities.
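As a toy illustration of the first two columns, the sketch below computes naive and frequency weights by hand for a single old category that maps to three candidates. The counts and category labels (`A`, `B`, `C`) are made up, and this is a simplified reading of the definitions in Step 1 rather than the package's internal code.

```r
# Hypothetical base-period counts for three candidate categories.
base_counts <- c(A = 60, B = 30, C = 10)
k <- length(base_counts)

# Naive: uniform 1/k over the candidates.
wei_naive <- setNames(rep(1 / k, k), names(base_counts))

# Frequency: proportional to base-period counts, falling back to naive
# when all counts are zero.
wei_freq <- if (sum(base_counts) > 0) base_counts / sum(base_counts) else wei_naive

# Redistribute 50 ambiguous old-period observations under each scheme.
n_old <- 50
weighted_counts <- rbind(naive = n_old * wei_naive, freq = n_old * wei_freq)
round(weighted_counts, 1)
```

Both rows sum to the same total; only the split across candidates changes, which is why the choice of weights matters only for estimands that depend on the mapped category.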

Pick a specific group for regression analysis:

```{r sensitivity-weights}
# New-period counts per category (no replication, so plain tally)
new_counts_all <- result_all$new %>%
  count(code, name = "n_new") %>%
  rename(g_new_c2c = code)

# Old-period weighted counts, joined to new-period counts
group_sizes <- result_all$old %>%
  group_by(g_new_c2c) %>%
  summarise(n_old = sum(wei_freq_c2c), .groups = "drop") %>%
  left_join(new_counts_all, by = "g_new_c2c") %>%
  filter(n_old >= 10, n_new >= 10) %>%
  arrange(desc(n_old))

# Pick a group for regression analysis
target_group <- group_sizes$g_new_c2c[1]
cat("Analysing occupation group:", target_group, "\n")
```

**Regression within a single occupation group** - combine both periods and compare coefficients:

```{r sensitivity-group-reg}
# Subset old period to target group (with weights)
old_subset <- result_all$old %>%
  filter(g_new_c2c == target_group)

# Subset new period to target group (no replication, weight = 1)
new_subset <- result_all$new %>%
  filter(code == target_group) %>%
  mutate(
    wei_naive_c2c = 1, wei_freq_c2c = 1, wei_knn_c2c = 1,
    wei_rf_c2c = 1, wei_lda_c2c = 1, wei_nb_c2c = 1
  )

# Combine both periods
d <- bind_rows(old_subset, new_subset)

# Compare all regression coefficients across weight methods
f <- I(log(salary)) ~ age + sex + factor(edu) + exp + parttime
coefs <- sapply(weight_cols, function(wcol) {
  d$w <- d$multiplier * d[[wcol]]
  coef(lm(f, data = d, weights = w))
})
colnames(coefs) <- gsub("wei_|_c2c", "", weight_cols)
round(coefs, 4)
```

All coefficients can vary because weight methods change which old-period subjects contribute to this occupation group.

### Compare pruning strategies only after comparing full weights

> **Note**: Pruning discards probability information and should be used only after analysis with full weights. Prefer `prune_c2c(method = "nonzero")` to remove impossible candidates while preserving the probability distribution.
More aggressive pruning (`highest1`) is appropriate only for descriptive tables or when you need exactly one category per observation.

```{r sensitivity-pruning}
# Compare regression coefficients under different pruning strategies
prune_methods <- c("nonzero", "highest", "highest1")

prune_coefs <- sapply(prune_methods, function(pm) {
  old_pruned <- result_all$old %>%
    prune_c2c(method = pm) %>%
    filter(g_new_c2c == target_group)
  d <- bind_rows(old_pruned, new_subset)
  d$w <- d$multiplier * d$wei_freq_c2c
  coef(lm(f, data = d, weights = w))
})
round(prune_coefs, 4)
```

### Compare ensemble compositions when no single method dominates

`cross_c2c()` creates a weighted average of multiple weight columns. Vary the mix:

```{r sensitivity-ensemble}
configs <- list(
  equal = c(1, 1) / 2,
  freq_heavy = c(3, 1) / 4,
  ml_heavy = c(1, 3) / 4
)

ens_coefs <- sapply(names(configs), function(nm) {
  old_ens <- result_all$old %>%
    cross_c2c(c("wei_freq_c2c", "wei_knn_c2c"), configs[[nm]]) %>%
    filter(g_new_c2c == target_group)
  new_ens <- new_subset %>% mutate(wei_cross_c2c = 1)
  d <- bind_rows(old_ens, new_ens)
  d$w <- d$multiplier * d$wei_cross_c2c
  coef(lm(f, data = d, weights = w))
})
round(ens_coefs, 4)
```

When regression coefficients are stable across weight methods, pruning strategies, and ensemble compositions, report with confidence. When they diverge, the mapping introduces uncertainty - report the range or investigate the source.

## Step 3: Validate whether ML actually improves on simpler baselines

The `ml` argument in `cat2cat()` adds ML-based probability weights, but ML is not guaranteed to improve over simpler baselines. `cat2cat_ml_run()` provides per-group holdout (single train/test split) diagnostics to answer this question *before* committing to a method.

### What `cat2cat_ml_run()` is doing

For each mapping group (set of candidate categories linked by the transition table) `cat2cat_ml_run()`:

1. Collects all observations from `ml$data` whose category belongs to the group.
2. Randomly splits them into training (`1 - test_prop`) and test (`test_prop`) sets.
3. Computes two baselines on the test set:
    - **naive** - accuracy of a random guess ($1 / k$ where $k$ is the number of candidate categories).
    - **freq** - accuracy of always predicting the most frequent training-set category.
4. Trains each specified ML method on the training set and records test-set model performance.

Groups with fewer than 5 observations or only one candidate category are skipped.

Also note that `cat2cat_ml_run()` does not use `on_fail`; it is a diagnostic tool and reports skipped groups instead of applying row-level fallback weights.

### Minimal validation workflow

```{r cv-basic}
cv_knn <- cat2cat_ml_run(
  mappings = list(trans = trans, direction = "backward"),
  ml = list(
    data = bind_rows(occup_2010, occup_2012),
    cat_var = "code",
    method = "knn",
    features = c("age", "sex", "edu", "exp", "parttime", "salary"),
    args = list(k = 10)
  )
)
print(cv_knn)
```

The `print()` summary reports:

- **ACCURACY** - average held-out classification accuracy across non-skipped groups. `naive (1/k)` is the random-guess baseline, `freq` is the majority-class baseline, and each ML line reports top-class accuracy for that method.
- **BRIER SCORE** - average full-vector probability error across non-skipped groups. Lower is better. This matters because `cat2cat` ultimately uses probability weights, not just hard classifications.
- **MEAN P(TRUE CLASS)** - average probability assigned to the true category. Higher is better. This is often the most directly relevant metric for `cat2cat`, because it measures the quality of the probability weights themselves.
- **ACCURACY: ML vs BASELINES** - the share of groups in which the ML method beats `naive` or beats `freq` on accuracy. This is a win-rate summary, not an average accuracy gap.
- **SKIPPED GROUPS** - the percentage of mapping groups for which that ML method has no reported result because the group had only one candidate category, fewer than 5 observations, or the model could not be fit for that group.

So for output like:

- `knn > naive: 87.7%`
- `knn > freq: 18.0%`
- `knn: accuracy = 0.5108` vs `freq (most common): 0.5366`

the right reading is: kNN clearly beats the naive baseline, but it does **not** beat the frequency baseline on top-class accuracy overall. In that case, `wei_freq_c2c` remains the default choice if your only goal is classification accuracy.

At the same time, if kNN has a slightly lower Brier score and a higher mean P(true class) than `freq`, then it may still be producing better-calibrated probability weights even though its top prediction is less often correct. That distinction matters in `cat2cat`, because the mapped weights are probabilities distributed across candidate categories rather than single-class assignments.

### Compare multiple ML methods in one run

```{r cv-multiple}
cv_all <- cat2cat_ml_run(
  mappings = list(trans = trans, direction = "backward"),
  ml = list(
    data = bind_rows(occup_2010, occup_2012),
    cat_var = "code",
    method = c("knn", "lda", "rf", "nb"),
    features = c("age", "sex", "edu", "exp", "parttime", "salary"),
    args = list(k = 10, ntree = 50)
  )
)
print(cv_all)
```

Interpretation tip for mixed outputs:

- It is possible for a method to have 0 failed rows in `cat2cat()` but a non-zero skipped-group rate in `cat2cat_ml_run()`.
- This is not a contradiction: the first is row-level weight construction, the second is group-level holdout evaluation coverage.

### Inspect per-group diagnostics when methods disagree

The returned object is a named list.
Each element corresponds to one mapping group:

```{r cv-inspect}
# Pick a group with multiple candidates
group_names <- names(cv_all)
example_group <- group_names[
  which(vapply(cv_all, function(g) !is.na(g$freq) && g$naive < 1, logical(1)))[1]
]
cv_all[[example_group]]
```

Each group entry contains the group-level diagnostics behind the printed summary:

- `$naive` - $1/k$ random-guess accuracy for that group.
- `$freq` - majority-class accuracy for that group.
- `$acc` - named numeric vector with ML accuracy by method.
- `$naive_brier` and `$freq_brier` - baseline Brier scores.
- `$brier` - named numeric vector with ML Brier scores by method.
- `$naive_mean_prob` and `$freq_mean_prob` - baseline mean P(true class).
- `$mean_prob` - named numeric vector with ML mean P(true class) by method.

### Decision rules for interpreting the output

**Understanding model performance in context**: This is **multi-class classification** - each mapping group can have 3-10+ candidate categories. A naive random guess yields only ~18% accuracy ($1/k$ where $k$ is the number of candidates). Achieving 50%+ is a substantial improvement over random - do not compare these numbers to binary classification benchmarks where 80%+ is typical. The key question is whether ML beats the *frequency* baseline, not whether it reaches some absolute threshold.

| Scenario | Recommendation |
|----------|---------------|
| ML model performance >> freq across most groups | ML weights add genuine signal; use them |
| ML model performance $\approx$ freq | ML is no better than frequency; prefer `wei_freq_c2c` (simpler, faster) |
| ML model performance < freq for many groups | ML is adding noise; do **not** use ML weights |
| High skipped-group rate (>20%) | Features may have too many missing values, groups are too small, or method fitting is unstable |

Because the train/test split is random, results vary between runs. For more stable estimates, pool more data into `ml$data` (e.g.
multiple survey waves) or run `cat2cat_ml_run()` several times and average the summaries.

> **Caveat**: high `cat2cat_ml_run()` model performance means the model discriminates well *within* mapping groups. It does not validate the mapping table itself. A perfect model with a wrong transition table will still produce wrong results.
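The advice about repeating the holdout and averaging can be sketched as follows. This is a hedged illustration: `summarise_run()` and `mock_run()` are hypothetical helpers, and the mock lists with made-up numbers stand in for repeated `cat2cat_ml_run()` results. The sketch assumes only that each group entry exposes `$freq` and a named `$acc` vector, as described in the per-group diagnostics.

```r
# Hypothetical helper: average group-level accuracies within one run.
# Groups without a result for `method` contribute NA and are dropped.
summarise_run <- function(run, method) {
  freq <- vapply(run, function(g) as.numeric(g$freq), numeric(1))
  acc <- vapply(run, function(g) {
    if (is.null(g$acc)) NA_real_ else as.numeric(g$acc[method])
  }, numeric(1))
  c(freq = mean(freq, na.rm = TRUE), ml = mean(acc, na.rm = TRUE))
}

# Mock runs with two groups each (made-up numbers, not real output).
mock_run <- function(freq, acc) {
  list(
    g1 = list(freq = freq[1], acc = c(knn = acc[1])),
    g2 = list(freq = freq[2], acc = c(knn = acc[2]))
  )
}
runs <- list(
  mock_run(c(0.50, 0.60), c(0.55, 0.58)),
  mock_run(c(0.52, 0.58), c(0.51, 0.62))
)

# One column per run, then average across runs.
per_run <- vapply(runs, summarise_run, numeric(2), method = "knn")
avg <- rowMeans(per_run)
avg
```

With real data, `runs` would instead be built by calling `cat2cat_ml_run()` several times with the same `mappings` and `ml` arguments; the spread of the per-run columns then gives a rough sense of how much the random split moves the summary.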