---
title: "Using the missForest Package"
author: "Daniel J. Stekhoven"
date: "`r format(Sys.Date(), '%Y-%m-%d')`"
output:
  pdf_document:
    number_sections: true
    toc: true
fontsize: 11pt
geometry: "margin=2.5cm, top=3cm, bottom=2.5cm"
lang: en
bibliography: myBib.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{missForest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 5.33,
  fig.height = 3,
  fig.align = "center"
)
options(width = 70, prompt = "> ", continue = "+ ")
```

## Introduction

### What is this document? (And what it isn't!)

This *package vignette* is a practical, application-focused user guide for the R package `missForest`. We’ll walk through the workflow on real datasets, discuss argument choices with a keen eye on **feasibility** and **accuracy**, and keep an occasional smile. Don’t be alarmed by the length — most of it is friendly R output for illustration.

This document is **not** a theoretical primer on the foundations of the algorithm, nor is it a comparative study. For the theory and evaluations, see @stekhoven11.

### The `missForest` algorithm (with ranger by default)

`missForest` is a nonparametric imputation method for basically any kind of tabular data. It handles **mixed types** (numeric and categorical), **nonlinear relations**, **interactions**, and even **high dimensionality** ($p \gg n$). For each variable with missingness, it fits a random forest on the observed part and predicts the missing part, iterating until a stopping rule is met (or `maxiter` says “enough”).

* By default, `missForest()` now uses the **ranger** backend for speed and multithreading.
* For legacy/compatibility, you can select the classic **randomForest** backend via `backend = "randomForest"`.
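To make the iterative scheme concrete, here is a deliberately tiny base-R sketch of one such pass on a single numeric variable, with an ordinary linear model standing in for the random forest. It illustrates the loop structure only (initialize, fit on observed, predict missing, stop when nothing changes) and is *not* the package internals:

```{r}
# Toy sketch of the iterative scheme: a linear model stands in for the
# random forest, and only one numeric variable carries missing values.
set.seed(1)
x <- iris[, 1:4]
x[sample(nrow(x), 15), "Sepal.Length"] <- NA

x.imp <- x
mis <- is.na(x$Sepal.Length)
x.imp$Sepal.Length[mis] <- mean(x$Sepal.Length, na.rm = TRUE)  # initial guess

repeat {
  old <- x.imp$Sepal.Length
  fit <- lm(Sepal.Length ~ ., data = x.imp[!mis, ])      # fit on observed part
  x.imp$Sepal.Length[mis] <- predict(fit, x.imp[mis, ])  # predict missing part
  if (sum((old - x.imp$Sepal.Length)^2) < 1e-8) break    # stopping rule
}
anyNA(x.imp)  # FALSE
```

The real algorithm does the same dance with random forests over *all* incomplete variables, sorted by amount of missingness, and with the type-wise stopping rule described below.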
The out-of-bag (OOB) error from the backend is transformed into an **imputation error estimate** — one for numeric variables (NRMSE) and one for factors (PFC). This estimate has been shown to be a good proxy for the true error [@stekhoven11].

### Installation

From CRAN:

```{r, eval=FALSE}
install.packages("missForest", dependencies = TRUE)
```

The default backend (`ranger`) is used automatically when installed; the classic `randomForest` backend remains available on request.

---

## Missing value imputation with `missForest`

We’ll start with a small walk-through on `iris`, sprinkle in performance hints, and then get fancy with parallelization for big jobs.

### Data for illustrations

* **Iris**: Five variables, one categorical with three levels; base R: `data(iris)` [@anderson35].
* **Esoph**: Oesophageal cancer case-control study; base R: `data(esoph)` [@breslow80].
* **Musk**: A larger, high-dimensional example used only for *timing* demonstrations. Because CRAN machines shouldn’t fetch the internet, the chunks here are `eval = FALSE`. See @UCI10 for details.

### `missForest` in a nutshell

Load the package:

```{r}
library(missForest)
```

Create 10% missing values completely at random (not a lifestyle choice we endorse, but very educational):

```{r}
set.seed(81)
data(iris)
iris.mis <- prodNA(iris, noNA = 0.1)
summary(iris.mis)
```

Impute:

```{r}
set.seed(81)
iris.imp <- missForest(iris.mis)  # default backend = "ranger"
```

The result is a list with:

* `iris.imp$ximp` – the imputed data matrix,
* `iris.imp$OOBerror` – estimated imputation error(s).

A common gotcha (we’ve all done it): use **`iris.imp$ximp`** (not `iris.imp`) in subsequent analyses.

```{r}
iris.imp$OOBerror
```

Because `iris` has both numeric and categorical variables, you see two numbers: **NRMSE** (numeric) and **PFC** (factors). Both are better when closer to **0**.
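For the curious, both measures are easy to compute by hand from their definitions in @stekhoven11: NRMSE is the root mean squared difference over the missing entries, normalized by the variance of the true values there, and PFC is the proportion of falsely classified entries. A minimal sketch — the package’s `mixError()` does this for you, so treat these helpers as purely illustrative:

```{r}
# Illustrative re-implementations of the two error measures;
# use mixError() from the package in practice.
nrmse <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  sqrt(mean((ximp[mis] - xtrue[mis])^2) / var(xtrue[mis]))
}
pfc <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  mean(ximp[mis] != xtrue[mis])
}

# A perfect imputation scores 0; one wrong class out of five gives PFC 0.2
sl.true <- iris$Sepal.Length
sl.mis  <- replace(sl.true, c(3, 7, 11), NA)
nrmse(sl.true, sl.mis, sl.true)  # 0

sp.true <- iris$Species
sp.mis  <- replace(sp.true, 1:5, NA)
sp.imp  <- replace(sp.true, 1, "virginica")
pfc(sp.imp, sp.mis, sp.true)     # 0.2
```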
If you prefer per-variable diagnostics (for that post-imputation feature selection debate), use `variablewise = TRUE`:

```{r}
imp_var <- missForest(iris.mis, variablewise = TRUE)
imp_var$OOBerror
```

### Additional iteration output with `verbose = TRUE`

Want to watch it think? Switch on diagnostics:

```{r}
set.seed(81)
imp_verbose <- missForest(iris.mis, verbose = TRUE)
imp_verbose$OOBerror
```

You’ll see **estimated error(s)**, **difference(s)** between iterations, and **time** per iteration. When the differences increase (by variable type), the algorithm returns the **previous** iteration’s imputation.

### Changing the number of iterations: `maxiter`

Sometimes the stopping rule is slow to trigger (data are complicated; it happens). You can guard time with `maxiter`, or deliberately pick an earlier iteration.

```{r}
set.seed(96)
data(esoph)
esoph.mis <- prodNA(esoph, noNA = 0.05)
esoph.imp <- missForest(esoph.mis, verbose = TRUE, maxiter = 6)
esoph.imp$OOBerror
```

### Speed/accuracy trade-off: `ntree` and `mtry`

* Computation time scales **linearly** with `ntree`, which defaults to 100; values in the tens often work well.
* `mtry = floor(sqrt(p))` is a robust default, but tuning can pay off on complex data.

**Demonstration on a bigger matrix (timings only):**

```{r, eval=FALSE}
# musk <- ...  # (not fetched during CRAN build)
# musk.mis <- prodNA(musk, 0.05)
# missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 100)
# missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 20)
```

As you might guess, fewer trees → fewer minutes, at a modest cost in error.

### Subsampling instead of bootstrapping: `replace = FALSE`

If you set `replace = FALSE`, the sampler draws about `0.632 * n` observations without replacement (otherwise there would be no OOB observations left).
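Where does 0.632 come from? With bootstrapping, each observation is left out of a given sample with probability $(1 - 1/n)^n \approx e^{-1} \approx 0.368$, so roughly 63.2% of the distinct observations end up in-bag; drawing `0.632 * n` without replacement mimics that. A quick sanity check in base R:

```{r}
# Expected fraction of distinct observations in one bootstrap sample
set.seed(81)
n <- 1e4
in.bag <- length(unique(sample(n, n, replace = TRUE))) / n
c(simulated = in.bag, theoretical = 1 - exp(-1))
```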
Sometimes it helps, sometimes not:

```{r}
set.seed(81)
imp_sub <- missForest(iris.mis, replace = FALSE, verbose = TRUE)
imp_sub$OOBerror
```

### Imbalanced data & sampling controls: `classwt`, `cutoff`, `strata`, `sampsize`

These arguments let you focus the classifier for factor variables (and the sampling for both types). Each is a **list with one entry per variable** (use `NULL`/`1` where not applicable).

*A quick note on backends:* For this cutoff example we explicitly use the legacy `randomForest` backend. The default `ranger` backend handles cutoffs by fitting a probability forest and then post-thresholding, but its `predict()` method requires passing the training data for non-quantile prediction, and a faithful OOB probability-based estimate is more involved to reproduce in a vignette. The `randomForest` backend natively supports per-class cutoffs and gives a clean, portable example the reader can run without extra plumbing.

```{r}
# Per-variable sample sizes: numeric variables take single integers;
# factors need a vector with one entry per class
iris.sampsize <- list(12, 12, 12, 12, c(10, 15, 10))
imp_ss <- missForest(iris.mis, sampsize = iris.sampsize)

# Per-class cutoffs (factors only). With the ranger backend, cutoffs are
# emulated via probability forests.
iris.cutoff <- list(1, 1, 1, 1, c(0.3, 0.6, 0.1))
imp_co <- missForest(iris.mis, cutoff = iris.cutoff, backend = "randomForest")

# Class weights (factors only)
iris.classwt <- list(NULL, NULL, NULL, NULL, c(10, 30, 20))
imp_cw <- missForest(iris.mis, classwt = iris.classwt)
```

### Tree shape controls: `nodesize` and `maxnodes`

* `nodesize` is a length-2 vector: the **first** entry is for numeric variables, the **second** for factors. Our package defaults are `c(5, 1)` (yes: numeric = 5, factor = 1).
* With `backend = "ranger"`, `nodesize` maps to `min.bucket`; `maxnodes` is ignored (consider ranger’s `max.depth` if needed).
* With `backend = "randomForest"`, both behave as in `randomForest`.
```{r}
imp_nodes <- missForest(iris.mis, nodesize = c(5, 1))
```

### Benchmarking with a complete matrix: `xtrue` and `mixError`

If you have a ground truth (or simulate one), supply `xtrue` to log the **true** error per iteration. The return value then includes `$error`.

```{r}
set.seed(81)
imp_bench <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
imp_bench$error

# Or compute it later:
err_manual <- mixError(imp_bench$ximp, iris.mis, iris)
err_manual
```

### Parallelization: `parallelize` and `num.threads`

We offer two modes:

1. `parallelize = "variables"`
   Different variables are imputed in parallel using a registered `foreach` backend. To avoid nested oversubscription, the per-variable **ranger** calls use `num.threads = 1` internally.
2. `parallelize = "forests"`
   A single variable’s forest is built with **ranger** multithreading (set `num.threads`) or, with **randomForest**, by combining sub-forests via `foreach`.

Register a backend first (example with `doParallel`):

```{r, eval=FALSE}
library(doParallel)
registerDoParallel(2)

# Variables mode
imp_vars <- missForest(iris.mis, parallelize = "variables", verbose = TRUE)

# Forests mode (ranger threading)
imp_fors <- missForest(iris.mis, parallelize = "forests", verbose = TRUE,
                       num.threads = 2)
```

> Which one is faster? It depends on your data and machine. Try both when in doubt (and while the coffee is brewing).

---

## Concluding remarks

Imputation with `missForest` is straightforward, and the OOB error estimates help you judge quality at a glance. Do remember: imputation does **not** add information; it helps retain partially observed rows for downstream analyses that prefer complete cases. For broader perspectives, see @schafer97 and @little87.

We thank Steve Weston for his contributions regarding parallel computation ideas and tools in the R ecosystem.

## References