---
title: "What is Mutation Testing and Why Does it Matter?"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{What is Mutation Testing and Why Does it Matter?}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Code coverage is not test quality

If you use `covr`, you know that 80% coverage means 80% of your lines ran during tests. What it does not mean is that those tests would catch a bug.

Here is a concrete example. This function has a subtle operator bug:

```r
# R/stats.R
above_threshold <- function(x, threshold) {
  x > threshold # should this be >= ?
}
```

And this test achieves 100% line coverage:

```r
test_that("above_threshold works", {
  result <- above_threshold(c(1, 5, 10), 3)
  expect_true(is.logical(result))
  expect_length(result, 3)
})
```

The function runs. The test passes. Coverage is 100%. But `>` could be replaced with `>=`, `<`, or `==` and this test would still pass — because it never checks the actual values, only the type and length.

**Coverage measures execution. Mutation testing measures detection.**

## What a mutant is

A mutant is a copy of your source code with one small, deliberate change — an operator swap, a flipped condition, a replaced constant. The idea is to simulate the kind of mistake a developer might actually make.

For the function above, `muttest` could generate mutants like:

```r
# mutant 1: > → >=
above_threshold <- function(x, threshold) {
  x >= threshold
}

# mutant 2: > → <
above_threshold <- function(x, threshold) {
  x < threshold
}
```

Your test suite runs against each mutant. If the tests fail, the mutant is **killed** — your tests noticed the change. If the tests pass, the mutant **survived** — your tests are blind to that kind of bug.

## Kill vs survive

| Outcome  | Meaning                                                    |
| -------- | ---------------------------------------------------------- |
| Killed   | At least one test failed. Your tests caught this mutation. |
| Survived | All tests passed. Your tests did not detect this change.   |
| Error    | The mutated code caused an unexpected runtime error.       |

Survivors are the interesting ones. Each surviving mutant points to a specific gap: a mutation your tests cannot distinguish from the original code. That is a candidate for a stronger test.

## The mutation score

```
Mutation Score = (Killed Mutants / Total Mutants) × 100%
```

- **0%** — Your tests pass regardless of what the code does. Assertions are missing or trivial.
- **100%** — Every mutant was killed. Your tests pin down the code's behavior precisely.

No project needs a perfect score on every file. The goal is to use the score directionally: find the files where survivors cluster, and strengthen those tests first.

## The LLM-generated tests problem

Many R programmers reach for LLMs (ChatGPT, Claude, Copilot) to write tests. This can be a useful shortcut — LLMs write syntactically correct tests quickly, and for boilerplate cases they can work well.

LLMs might produce assertions that are easy to satisfy — tests that pass but don't deeply verify correctness:

```r
# Typical LLM output for above_threshold():
test_that("above_threshold returns logical vector", {
  expect_true(is.logical(above_threshold(c(1, 5), 3)))
})

test_that("above_threshold handles length", {
  expect_equal(length(above_threshold(1:5, 2)), 5)
})
```

Both tests pass. Both would pass against every mutant of `above_threshold`. These tests document the shape of the output but say nothing about its correctness — a pattern that can appear in LLM-generated tests.
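For contrast, here is a sketch of a stronger test for the same function. It is an illustration, not `muttest` output: it asserts the exact expected values and probes the boundary input `x == threshold`, the one case that separates `>` from `>=`:

```r
test_that("above_threshold compares strictly", {
  # Exact values: kills the < and == mutants, which flip these results
  expect_equal(above_threshold(c(1, 5, 10), 3), c(FALSE, TRUE, TRUE))
  # Boundary case: the only input where > and >= disagree
  expect_false(above_threshold(3, 3))
})
```

Each mutant shown earlier now changes at least one asserted value, so the suite fails and the mutant is killed.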
This is not a criticism of LLMs. But it means mutation testing is a useful way to check how strong generated tests actually are:

> **LLM-generated tests need external validation just as much as human-written tests do.**

Mutation testing provides that validation. Run `muttest` on any file where the tests were AI-generated. A low score does not mean the LLM did a bad job — it means you now know exactly where to add better assertions.

## When mutation testing pays off most

Mutation testing is most valuable when:

- **The logic is complex** — branching conditions, arithmetic formulas, comparison chains. These produce many mutants, and survivors are easy to fix with targeted test cases.
- **The code is critical** — financial calculations, data validation, model thresholds. A bug here has real consequences; extra confidence is worth the investment.
- **Tests were generated automatically** — by an LLM, a code generator, or a template. These tests are the most likely to have weak assertions.
- **Coverage is already high but bugs still slip through** — a common symptom of assertion-light test suites.

## When it is less useful

- **Simple functions** — functions that read a file and return its contents, or simply delegate to another function. There is little logic to mutate.
- **Snapshot testing** — snapshot tests tend to fail on any change to the code's output, so nearly every mutant is killed and the score carries little signal.
- **Very slow test suites** — mutation testing multiplies your test runtime by the number of mutants. Start with fast unit tests before applying it to slower suites.

## How it relates to covr

These tools answer different questions and complement each other:

| Tool      | Question answered                        |
| --------- | ---------------------------------------- |
| `covr`    | Which lines does my test suite execute?  |
| `muttest` | Which bugs would my test suite detect?   |

A practical workflow: use `covr` to find untested code, then use `muttest` on the covered code to find weakly tested logic (a short `covr` sketch closes this vignette). High coverage + high mutation score = genuinely robust tests.

## Next steps

- [Getting Started](getting-started.html) — run your first mutation test in minutes
- [Mutator Reference](mutators.html) — which mutations to apply and when
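As a closing sketch of the workflow above, the `covr` half can be run with standard `covr` exports; the matching `muttest` call is omitted here, since its interface is covered in [Getting Started](getting-started.html):

```r
# Step one of the workflow: find code the tests never execute.
# Run from the package root; all three functions are covr exports.
cov <- covr::package_coverage()
covr::percent_coverage(cov) # overall coverage percentage
covr::zero_coverage(cov)    # data frame of source lines with zero coverage
```

Once the lines reported by `zero_coverage()` are tested at all, mutation testing on that now-covered code tells you whether the tests would actually catch a bug.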