Matching observations based on generalized propensity scores involves
tuning multiple hyperparameters. Running separate workflows with
different parameter combinations can be tedious, and the effects of some
parameters are not always predictable. To streamline this process,
vecmatch provides an automated optimization workflow using a random
search algorithm. The function optimize_gps() is
implemented with multithreading to leverage computational resources
efficiently.
Step 1: Define the Formula, Data, and Optimization Space
In this example, we use the built-in cancer dataset and
focus on two predictors: the categorical variable sex and
the continuous variable age. We first specify the model
formula. Note that data and formula are fixed
inputs; if you want to compare different formulas, you must run the
workflow separately for each.
Next, we define the search space for the hyperparameters. The helper
function make_opt_args() validates your inputs and computes
the Cartesian product of all specified values, reporting the total
number of parameter combinations.
opt_args <- make_opt_args(
data = cancer,
formula = formula_cancer,
reference = c("control", "adenoma", "crc_benign", "crc_malignant"),
gps_method = c("m1", "m7", "m8"),
matching_method = c("fullopt", "nnm"),
caliper = seq(0.01, 5, 0.01),
cluster = 1:3,
ratio = 1:3,
min_controls = 1:3,
max_controls = 1:3
)
opt_args
#> Optimization Argument Set (class: opt_args)
#> ----------------------------------------
#> gps_method : m1, m7, m8
#> reference : control, adenoma, crc_benign, crc_malignant
#> matching_method : fullopt, nnm
#> caliper : [500 values]
#> order : desc, asc, original, random
#> cluster : 1, 2, 3
#> replace : TRUE, FALSE
#> ties : TRUE, FALSE
#> ratio : 1, 2, 3
#> min_controls : 1, 2, 3
#> max_controls : 1, 2, 3
#> ----------------------------------------
#> Total combinations: 1512000The print() method for make_opt_args()
provides a clear summary of the search space, including each
hyperparameter’s values and the total number of combinations.
With the search space defined, we can launch the optimization. The
optimize_gps() function performs a random search across the
parameter grid and returns a summary table containing key quality
metrics for each tested combination. You control the number of
iterations via the n_iter argument.
The function uses multithreading (via the future
package) to parallelize work. As a guideline, aim for at least 1000–1500
iterations per core for reliable search coverage. Monitor your system’s
memory usage, since the parallel backend can consume substantial amounts
of RAM.
The function uses already registered backends.
By default, optimize_gps() preserves the global random
seed. For reproducibility, set a seed before calling the optimizer.
library(future)
library(doFuture)
## 1. Register future as the foreach backend
doFuture::registerDoFuture()
## 2. Choose parallel strategy
old_plan <- future::plan()
on.exit(future::plan(old_plan), add = TRUE)
future::plan(
future::multisession,
workers = 4
)
## 3. Seeding before calling optimize_gps()
set.seed(167894)
seed_before <- .Random.seed
opt_results <- optimize_gps(
data = cancer,
formula = formula_cancer,
opt_args = opt_args,
n_iter = 6000
)
summary(opt_results)
#> Best Optimization Results by SMD Group
#> ======================================
#>
#> -------------------------------------------------------
#> | smd_group | unique_configs | smd | perc_matched |
#> -------------------------------------------------------
#> |0-0.05 | 2| 0.025| 79.73|
#> |0.05-0.10 | 1| 0.098| 80.73|
#> |0.10-0.15 | 7| 0.145| 91.2|
#> |0.15-0.20 | 3| 0.191| 96.1|
#> |0.20-0.25 | 1| 0.203| 93.52|
#> |0.25-0.30 | 2| 0.266| 79.01|
#> |0.25-0.30 | 2| 0.286| 79.01|
#> |>0.30 | 1| 0.343| 80.73|
#> -------------------------------------------------------
#>
#> Optimization Summary
#> --------------------
#> Total combinations tested : 6000
#> Total optimization time [s]: 123.97We ran the optimization on a single core with
n_iter = 1500; on our test machine this required 123.97
seconds. Given the size of the parameter grid, increasing
n_iter would improve the search’s coverage, but here we
limited iterations to keep the vignette’s build time reasonable.
When you print opt_results, it summarizes the entire
search by grouping parameter sets into bins defined by their maximum
standardized mean difference (SMD), and within each bin it highlights
the combination that achieves the highest proportion of matched
observations.
After optimization, select_opt() helps you choose
parameter combinations that meet your specific objectives. For example,
you may wish to maximize the number of matched samples for certain
treatment groups while minimizing imbalance on key covariates.
In this example, we aim to: * Retain the largest possible number of
observations in the adenoma and crc_malignant
groups. * Minimize the standardized mean difference (SMD) for the
age variable.
We can achieve this by specifying arguments in
select_opt():
select_results <- select_opt(opt_results,
perc_matched = c("adenoma", "crc_malignant"),
smd_variables = "age",
smd_type = "max"
)
summary(select_results)
#>
#> Optimization Selection Summary
#> ====================
#>
#> -------------------------------------------------------
#> | smd_group | unique_configs | smd | perc_matched |
#> -------------------------------------------------------
#> |0-0.05 | 4| 0.047| 83.155|
#> |0.05-0.10 | 2| 0.094| 90.304|
#> |0.10-0.15 | 7| 0.145| 94.737|
#> |0.15-0.20 | 3| 0.191| 97.976|
#> |0.20-0.25 | 1| 0.203| 95.763|
#> |0.25-0.30 | 1| 0.266| 80.137|
#> |0.30-0.35 | 1| 0.343| 81.764|
#> |0.35-0.40 | 1| 0.371| 71.67|
#> |0.40-0.45 | 1| 0.445| 76.449|
#> |0.45-0.50 | 1| 0.465| 77.39|
#> |>0.50 | 1| 0.541| 80.082|
#> -------------------------------------------------------The output shows the SMD bins and highlights the combination within
each bin that best meets our criteria. Suppose the configuration in the
0.10–0.15 SMD bin offers a desirable balance of matched
samples; we can extract its parameters for manual refitting.
To inspect the matched dataset and detailed balance summaries, we
rerun the standard vecmatch workflow using the selected
parameters.
1.Select the subset with highest percentage of matched samples for
smd_group of interest and rematch the original data using
the standard vecmatch workflow. At the end you get an
output object of class Matched, just like in the standard
workflow:
# estimating gps
final_match <- run_selected_matching(select_results,
cancer,
formula_cancer,
smd_group = "0.05-0.10"
)
summary(final_match)
#>
#> Treatment | N before | N after | Retained (%)
#> --------------------------------------------------------------------------------
#> adenoma | 374 | 306 | 81.8%
#> control | 315 | 261 | 82.9%
#> crc_benign | 273 | 241 | 88.3%
#> crc_malignant | 248 | 245 | 98.8%balqual on the output object to evaluate
balance and verify reproducibility at then end:balqual(final_match,
formula = formula_cancer,
type = "smd",
statistic = "max",
round = 3,
cutoffs = 0.2
)
#>
#> Matching Quality Evaluation
#> ================================================================================
#>
#> Count table for the treatment variable:
#> --------------------------------------------------
#> Treatment | Before | After
#> --------------------------------------------------
#> adenoma | 374 | 306
#> control | 315 | 261
#> crc_benign | 273 | 241
#> crc_malignant | 248 | 245
#> --------------------------------------------------
#>
#>
#> Matching summary statistics:
#> ----------------------------------------
#> Total n before matching: 1210
#> Total n after matching: 1053
#> % of matched observations: 87.02 %
#> Total maximal SMD value: 0.13
#>
#>
#> Maximal values :
#> --------------------------------------------------------------------------------
#> Variable | Coef | Before | After | Quality
#> --------------------------------------------------------------------------------
#> age | SMD | 0.240 | 0.094 | Balanced
#> sexF | SMD | 0.138 | 0.116 | Balanced
#> sexM | SMD | 0.138 | 0.116 | Balanced
#> age:sexF | SMD | 0.143 | 0.130 | Balanced
#> age:sexM | SMD | 0.156 | 0.126 | Balanced
#> --------------------------------------------------------------------------------
all.equal(seed_before, .Random.seed)
#> [1] TRUE