---
title: "Get started with BinaryReplicates"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Get started with BinaryReplicates}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(BinaryReplicates)
library(tidyverse)
set.seed(42)
```

The `BinaryReplicates` package provides tools for the analysis of binary replicates.
The aim of this vignette is to provide a quick overview of the main functions of the package.

# The statistical model

For each individual $i$ in the dataset, we observe $n_i$ replicates, each taking the value $0$ or $1$. These replicates are noisy measurements of the true latent state $T_i$, which is also binary. We denote $S_i$ the number of replicates equal to $1$ for individual $i$.
The probability of observing a $1$ is $1-q$ if $T_i = 1$ and $p$ if $T_i = 0$. Thus,

- $S_i$ is the number of replicates equal to $1$ for individual $i$,
- $p$ is the false positive rate,
- $q$ is the false negative rate.

We assume that the individuals are independent and identically distributed. The prevalence of state $1$ in the population is $\theta_T = P(T_i = 1)$.

The **statistical model** is thus
$$
[T_i|\theta_T]\sim \text{Bernoulli}(\theta_T),\quad [S_i|T_i, p, q]\sim \text{Binomial}\big(n_i, T_i(1-q)+(1-T_i)p\big).
$$

In the Bayesian paradigm, we need to set a prior on the fixed parameters: $\theta_T\in(0,1)$, $p\in(0,1/2)$ and $q\in(0,1/2)$. Under the **prior distribution**, the three parameters are independent:

- $\theta_T \sim \text{Beta}(a_T, b_T)$,
- $p \sim \text{Beta}(a_{FP}, b_{FP})$ constrained on $(0, 1/2)$ and
- $q \sim \text{Beta}(a_{FN}, b_{FN})$ constrained on $(0, 1/2)$.

Hyperparameters $a_T, b_T, a_{FP}, b_{FP}, a_{FN}, b_{FN}$ are set by the user to include external information in the analysis. In absence of prior information, we recommend
$$
a_T = b_T = 1/2;\quad a_{FP} = b_{FP} = a_{FN} = b_{FN} = 2.
$$

We rely on [Stan](https://mc-stan.org/) to sample from the posterior distribution of the fixed parameters and the latent variables. 

# Examples on how to use the package

## Generate synthetic data

We start by generating a synthetic dataset according to the model:

```{r}
theta <- .4
p <- q <- .22
n <- 50
ni <- sample(2:6, n, replace = TRUE)
ti <- rbinom(n, 1, theta)
si <- rbinom(n, ni, ti*(1-q) + (1-ti)*p)
synth_data <- data.frame(ni = ni, si = si, ti=ti)
```

We will use this dataset to illustrate the main functions of the package.

## Parameter estimation with the EM algorithm

Before scoring individuals, we can estimate the model parameters $\theta_T$, $p$, and $q$ using the EM algorithm. The `EMFit` function computes the Maximum-A-Posteriori (MAP) estimates:

```{r}
fit_em <- EMFit(ni, si)
fit_em$parameters_hat
```

These estimates can then be used with `MAP_scoring` to compute likelihood-based scores without knowing the true parameter values.

## Scoring the individuals

We provide various ways to score individuals in the dataset: each score is a number between $0$ and $1$. Higher values of the score indicate a higher probability of being in state $1$. The scoring methods are:

- the average or the median of the replicates of the individual,
- a likelihood-based score that requires estimates of $p$, $q$, and $\theta_T$, and
- the posterior probability of being in state $1$.

Before computing the scores, we need to sample from the posterior distribution. With the default prior, this is done as follows:
```{r}
fit <- BayesianFit(ni, si, chains = 4, iter = 5000, refresh = 0)
print(fit, pars = c("theta", "p", "q"))
```

We can now add the scores to the dataset:
```{r}
synth_data <- synth_data %>%
  mutate(Y_A = average_scoring(ni, si),
         Y_M = median_scoring(ni, si),
         Y_L = likelihood_scoring(ni, si, list(theta = theta, p = p, q = q)),
         Y_MAP = MAP_scoring(ni, si, fit_em),
         Y_B = bayesian_scoring(ni, si, fit))
```

Note that we have plugged in the true values of the parameters in the likelihood scoring function (`Y_L`). In practice, the true parameters are unknown and we use the MAP estimates from `EMFit` instead, as shown with `Y_MAP`.

## Credible intervals for model parameters

The Bayesian fit provides credible intervals for the model parameters. By default, `credint` returns 90% credible intervals:

```{r}
credint(fit)
```

You can also request different confidence levels:

```{r}
credint(fit, level = 0.95)
```

## Prevalence estimation

The prevalence $\theta_T$ can be estimated from the scores using `prevalence_estimate`, or directly from the Bayesian fit using `bayesian_prevalence_estimate`:

```{r}
theta_A <- prevalence_estimate(synth_data$Y_A)
theta_M <- prevalence_estimate(synth_data$Y_M)
theta_MAP <- prevalence_estimate(synth_data$Y_MAP)
theta_B <- bayesian_prevalence_estimate(fit)

cat("True prevalence:    ", theta, "\n")
cat("Average-based:      ", round(theta_A, 3), "\n")
cat("Median-based:       ", round(theta_M, 3), "\n")
cat("MAP-based:          ", round(theta_MAP, 3), "\n")
cat("Bayesian:           ", round(theta_B, 3), "\n")
```

## Predicting the latent states

The latent state of an individual is either $0$ or $1$. But since binary replicates are noisy measurements, we allow three different values for the prediction:

- $0$ if the score is lower than a threshold $v_L$
- $1$ if the score is higher than a threshold $v_U$
- $1/2$ otherwise, which means that we are uncertain about the prediction. This is an inconclusive decision that calls for more data to conclude.

We can compute the predictions for the synthetic dataset as follows:
```{r}
v_L <- .35
v_U <- 1 - v_L
synth_data <- synth_data %>%
  mutate(T_A = classify_with_scores(Y_A, v_L, v_U),
         T_M = classify_with_scores(Y_M, v_L, v_U),
         T_L = classify_with_scores(Y_L, v_L, v_U),
         T_MAP = classify_with_scores(Y_MAP, v_L, v_U),
         T_B = classify_with_scores(Y_B, v_L, v_U))
```

We can now count how many times we decide in favor of $0$, $1$ or $1/2$ (inconclusive) given the true state and the method to score the individuals as follows:
```{r}
confusion <- synth_data %>%
  mutate(
    Status = ifelse(ti == 1, "T=1", "T=0"),
    Average = T_A, Median = T_M, Likelihood = T_L, MAP = T_MAP, Bayesian = T_B) %>%
  pivot_longer(cols = c(Average, Median, Likelihood, MAP, Bayesian),
               names_to = "Method", values_to = "Decision") %>%
  group_by(Status, Method, Decision) %>%
  summarise(count = n(), .groups = "keep") %>%
  ungroup() %>%
  mutate(count = as.integer(count)) %>%
  pivot_wider(names_from = Decision, values_from = count, values_fill = 0)
confusion
```

## Predicting scores on new data and classify them

Based on the Bayesian fitting, we can also compute the Bayesian scores and the predictions on new data. Let us say we are interested in predicting the scores and the latent state of individuals with $n_i = 8$ replicates and $s_i \in \{0, 1, 2, 3, 4, 5, 6, 7, 8\}$. The new data are thus:

```{r}
new_data <- data.frame(ni = rep(8, 9), si = 0:8)
```

The scores and the predictions can be computed as follows:
```{r}
new_data <- new_data %>% 
  mutate(Y_B = predict_scores(ni, si, fit),
         T_B = classify_with_scores(Y_B, v_L, v_U))
new_data
```

In the above situation, the predictions are conclusive for all the individuals except the one where we observed as many $0$'s as $1$'s in the replicates.