---
title: "Migrating from rMIDAS to rMIDAS2"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Migrating from rMIDAS to rMIDAS2}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

This vignette accompanies the deprecation of **rMIDAS**. Existing projects can keep using **rMIDAS**, but new development should move to [**rMIDAS2**](https://CRAN.R-project.org/package=rMIDAS2). The source repository for the successor package is .

## Why rMIDAS2?

**rMIDAS2** is the successor to rMIDAS. It re-implements the MIDAS multiple imputation algorithm with several improvements:

| | rMIDAS | rMIDAS2 |
|---|---|---|
| **Backend** | TensorFlow (Python, via `reticulate`) | PyTorch (Python, via local HTTP API) |
| **Runtime R dependency on `reticulate`** | Yes | No |
| **Preprocessing** | Manual (`convert()`) | Automatic |
| **Python versions** | 3.6--3.10 | 3.9+ |
| **TensorFlow required** | Yes (< 2.12) | No |

The API is deliberately simpler: most pipelines that required four function calls in rMIDAS need just one or two in rMIDAS2.

## Installation

```{r}
# Remove rMIDAS (optional -- it can coexist)
# remove.packages("rMIDAS")

# Install rMIDAS2
install.packages("rMIDAS2")

# One-time Python backend setup
library(rMIDAS2)
install_backend()
```

## Side-by-side comparison

### 1. Setup

**rMIDAS** required configuring a `reticulate` Python environment with TensorFlow:

```{r}
# --- rMIDAS ---
library(rMIDAS)
# Python environment configured automatically on first load,
# or manually via set_python_env()
```

**rMIDAS2** uses a standalone Python server -- no reticulate needed at runtime:

```{r}
# --- rMIDAS2 ---
library(rMIDAS2)
install_backend()  # one-time setup
# The server starts automatically when you call any imputation function
```

### 2. Data preparation

**rMIDAS** required explicit preprocessing with `convert()`, where you had to specify which columns were binary and which were categorical:

```{r}
# --- rMIDAS ---
data(adult)
adult_conv <- convert(adult,
                      bin_cols = c("income"),
                      cat_cols = c("workclass", "marital_status"),
                      minmax_scale = TRUE)
```

**rMIDAS2** detects column types automatically -- just pass your data frame directly:

```{r}
# --- rMIDAS2 ---
# No convert() step needed. Pass raw data to midas() or midas_fit().
```

### 3. Training

**rMIDAS** used `train()`:

```{r}
# --- rMIDAS ---
mid <- train(adult_conv,
             training_epochs = 20L,
             layer_structure = c(256, 256, 256),
             input_drop = 0.8,
             learn_rate = 0.0004,
             seed = 89L)
```

**rMIDAS2** uses `midas_fit()` (or the all-in-one `midas()`):

```{r}
# --- rMIDAS2 ---
fit <- midas_fit(adult,
                 epochs = 20L,
                 hidden_layers = c(256L, 128L, 64L),
                 corrupt_rate = 0.8,
                 lr = 0.001,
                 seed = 89L)
```

**Parameter name changes:**

| rMIDAS (`train()`) | rMIDAS2 (`midas_fit()`) | Notes |
|---|---|---|
| `training_epochs` | `epochs` | |
| `layer_structure` | `hidden_layers` | Default changed from 256-256-256 to 256-128-64 |
| `input_drop` | `corrupt_rate` | |
| `learn_rate` | `lr` | Default changed from 0.0004 to 0.001 |
| `dropout_level` | `dropout_prob` | |
| `train_batch` | `batch_size` | Default changed from 16 to 64 |
| `cont_adj` | `num_adj` | |
| `softmax_adj` | `cat_adj` | |
| `binary_adj` | `bin_adj` | |

### 4. Generating imputations

**rMIDAS** used `complete()`:

```{r}
# --- rMIDAS ---
imps <- complete(mid, m = 10)
# Returns a list of 10 data.frames
head(imps[[1]])
```

**rMIDAS2** uses `midas_transform()`:

```{r}
# --- rMIDAS2 ---
imps <- midas_transform(fit, m = 10)
# Returns a list of 10 data.frames
head(imps[[1]])
```

Or skip `midas_fit()` + `midas_transform()` entirely and use the all-in-one `midas()`:

```{r}
# --- rMIDAS2 (all-in-one) ---
result <- midas(adult, m = 10, epochs = 20)
head(result$imputations[[1]])
```

### 5. Rubin's rules regression

The `combine()` interface has changed:

**rMIDAS** took a formula and a list of completed data frames:

```{r}
# --- rMIDAS ---
combine("income ~ age + hours_per_week", imps)
```

**rMIDAS2** takes a model ID and an outcome variable name. Independent variables default to all other columns:

```{r}
# --- rMIDAS2 ---
combine(fit, y = "income")

# Specify predictors explicitly:
combine(fit, y = "income", ind_vars = c("age", "hours_per_week"))
```

The output format is the same: a data frame with columns `term`, `estimate`, `std.error`, `statistic`, `df`, and `p.value`.

### 6. Overimputation diagnostic

**rMIDAS** required re-specifying the data and column types:

```{r}
# --- rMIDAS ---
overimpute(adult,
           binary_columns = c("income"),
           softmax_columns = c("workclass", "marital_status"),
           training_epochs = 20L,
           spikein = 0.3)
```

**rMIDAS2** runs overimputation on an already-fitted model:

```{r}
# --- rMIDAS2 ---
diag <- overimpute(fit, mask_frac = 0.1)
diag$mean_rmse
diag$rmse  # per-column RMSE
```

### 7. Mean imputation (new in rMIDAS2)

rMIDAS2 adds `imp_mean()`, which computes the element-wise mean across all imputations -- useful as a quick single point estimate:

```{r}
# --- rMIDAS2 only ---
mean_df <- imp_mean(fit)
head(mean_df)
```

### 8. Cleanup

**rMIDAS2** runs a background Python server that should be stopped when you are done:

```{r}
# --- rMIDAS2 ---
stop_server()
```

## Complete migration example

Below is a full rMIDAS pipeline and its rMIDAS2 equivalent.

### rMIDAS (old)

```{r}
library(rMIDAS)
data(adult)
adult <- adult[1:1000, ]

# 1. Preprocess
adult_conv <- convert(adult,
                      bin_cols = c("income"),
                      cat_cols = c("workclass", "marital_status"),
                      minmax_scale = TRUE)

# 2. Train
mid <- train(adult_conv, training_epochs = 20L, seed = 89L)

# 3. Generate imputations
imps <- complete(mid, m = 5)

# 4. Analyse
combine("income ~ age + hours_per_week", imps)
```

### rMIDAS2 (new)

```{r}
library(rMIDAS2)
data(adult)
adult <- adult[1:1000, ]

# 1. Fit and impute (no preprocessing needed)
result <- midas(adult, m = 5, epochs = 20, seed = 89L)

# 2. Analyse
combine(result, y = "income", ind_vars = c("age", "hours_per_week"))

# 3. Clean up
stop_server()
```

## Quick-reference cheat sheet

| Task | rMIDAS | rMIDAS2 |
|---|---|---|
| Install Python env | Automatic / `set_python_env()` | `install_backend()` |
| Preprocess data | `convert(data, bin_cols, cat_cols)` | *Not needed* |
| Train model | `train(data, training_epochs, ...)` | `midas_fit(data, epochs, ...)` |
| Generate imputations | `complete(model, m)` | `midas_transform(model, m)` |
| Train + impute (one step) | *Not available* | `midas(data, m, epochs, ...)` |
| Mean imputation | *Not available* | `imp_mean(model)` |
| Rubin's rules | `combine(formula, df_list)` | `combine(model, y, ind_vars)` |
| Overimputation | `overimpute(data, ...)` | `overimpute(model, mask_frac)` |
| Shutdown | *Not needed* | `stop_server()` |
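
If a project stores its `train()` settings in a named list, the `train()` to `midas_fit()` renames listed in this vignette can be applied mechanically. The following is a minimal sketch in base R, not part of either package: `train_to_fit` and `translate_train_args()` are hypothetical names introduced here for illustration, and the mapping simply restates the parameter name changes documented above.

```{r}
# Mapping restated from the parameter name changes in this vignette:
# names are rMIDAS train() arguments, values are rMIDAS2 midas_fit() arguments.
train_to_fit <- c(
  training_epochs = "epochs",
  layer_structure = "hidden_layers",
  input_drop      = "corrupt_rate",
  learn_rate      = "lr",
  dropout_level   = "dropout_prob",
  train_batch     = "batch_size",
  cont_adj        = "num_adj",
  softmax_adj     = "cat_adj",
  binary_adj      = "bin_adj"
)

# Hypothetical helper (an assumption, not an rMIDAS or rMIDAS2 function):
# rename the elements of a named argument list, leaving any name without a
# known mapping (for example `seed`) untouched.
translate_train_args <- function(args) {
  old <- names(args)
  if (is.null(old)) {
    return(args)  # nothing to rename in an unnamed list
  }
  known <- old %in% names(train_to_fit)
  names(args)[known] <- unname(train_to_fit[old[known]])
  args
}

# Settings as they might have been passed to train() in an rMIDAS project
old_args <- list(training_epochs = 20L, input_drop = 0.8,
                 learn_rate = 0.0004, seed = 89L)
new_args <- translate_train_args(old_args)
names(new_args)
```

The helper only renames arguments; it does not touch their values. Because several defaults differ between the two packages (layer sizes, learning rate, batch size), any argument that was previously left at its rMIDAS default is worth reviewing before the translated list is passed on, for instance via `do.call(midas_fit, c(list(adult), new_args))`.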