---
title: "Simulating interval Beta data"
shorttitle: "Simulating interval Beta data"
author: 
- name: "Frédéric Bertrand"
  affiliation: 
  - Cedric, Cnam, Paris
  email: frederic.bertrand@lecnam.net
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Simulating interval Beta data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE")

knitr::opts_chunk$set(purl = LOCAL)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
suppressPackageStartupMessages(library(SelectBoost.beta))
set.seed(1)
```

## Overview

The `simulation_DATA.beta()` helper produces beta-regression design matrices
paired with either fully observed responses or interval-censored outcomes. This
vignette illustrates a typical workflow for drawing a single data set with
structured correlation and custom missingness behaviour that mimics practical
survey settings.

We simulate 300 observations with 10 candidate predictors. Four predictors are
truly associated with the response through coefficients specified in
`beta_size`. Correlation among predictors follows an AR(1) structure governed by
`rho`, which conveniently induces near-multicollinearity while remaining
positive-definite.

```{r, cache=TRUE, eval=LOCAL}
sim <- simulation_DATA.beta(
  n = 300, p = 10, s = 4, beta_size = c(1.0, -0.8, 0.6, -0.5),
  corr = "ar1", rho = 0.25,
  mechanism = "mixed", mix_prob = 0.5,
  delta = function(mu, X) 0.03 + 0.02 * abs(mu - 0.5),
  alpha = function(mu, X) 0.1 + 0.05 * (mu < 0.3),
  na_rate = 0.1, na_side = "random"
)
```

### Interval-generation parameters

The `delta` and `alpha` callbacks control how often the simulated outcome is
converted to an interval and how wide that interval is:

* `delta(mu, X)` encodes the expected half-width of the interval around the
  latent mean response `mu`. Here we allow wider intervals when the mean is far
  from 0.5, highlighting heteroskedastic behaviour.
* `alpha(mu, X)` represents an observation-specific inflation probability. When
  the latent mean is below 0.3, the function returns larger values, creating
  more lower-bound censoring for small `mu`.

With `mechanism = "mixed"` and `mix_prob = 0.5`, half of the affected
observations receive two-sided intervals, whereas the remainder experience
one-sided censoring driven by `na_side = "random"`.

### Inspecting the simulated outcomes

The output contains the design matrix `X`, the fully observed latent response
`Y`, and the interval bounds `Y_low`/`Y_high`. The following summaries check the
distribution of the latent response and the frequency of interval-censoring.

```{r, cache=TRUE, eval=LOCAL}
summary(sim$Y)
mean(is.na(sim$Y_low) | is.na(sim$Y_high))
```

To better understand the resulting intervals we can look at a small excerpt of
the censored rows. Observations with `NA` on one side correspond to
one-sided censoring events.

```{r, cache=TRUE, eval=LOCAL}
head(sim$Y_low, 10)
head(sim$Y_high, 10)
```

### Visualising interval widths

The difference between `Y_high` and `Y_low` conveys how much uncertainty each
interval carries. When an observation is fully observed the bounds coincide with
`Y`, leading to a zero width. The histogram below demonstrates that, even with a
modest base width of 0.03, the adaptive component in `delta()` creates a long
right tail as the mean moves away from the centre of the unit interval.

```{r, cache=TRUE, eval=LOCAL}
interval_width <- sim$Y_high - sim$Y_low
hist(interval_width, breaks = 30, col = "#0A6AA6", border = "white",
     main = "Distribution of simulated interval widths",
     xlab = "Y_high - Y_low")
```

These simulated objects can be passed directly to the modelling routines in
`SelectBoost.beta`. The following sections outline how to turn the generated
intervals into pseudo-observations for classical selectors, how to obtain stable frequencies with the interval-aware fastboost routine, and how to visualise the results side by side.

## Point-response selectors on pseudo-observations

When only interval bounds are observed, a quick way to obtain a point response is to impute either the midpoint or the available bound in case of one-sided censoring. The helper below keeps the midpoint when both bounds are present and falls back to the observed edge otherwise.

```{r, cache=TRUE, eval=LOCAL}
pseudo_y <- ifelse(
  is.na(sim$Y_low) | is.na(sim$Y_high),
  ifelse(is.na(sim$Y_low), sim$Y_high, sim$Y_low),
  0.5 * (sim$Y_low + sim$Y_high)
)
```

With this pseudo-response in hand, we can deploy the full suite of selectors
shipped with the package. The `compare_selectors_single()` wrapper runs the
stepwise AIC/BIC/AICc procedures, the GAMLSS-based LASSO and the IRLS/`glmnet` approach in one call. Setting `include_enet = FALSE` ensures the example remains lightweight even when the optional `gamlss.lasso` extension is not installed.

```{r, cache=TRUE, eval=LOCAL}
single <- compare_selectors_single(sim$X, pseudo_y, include_enet = FALSE)
head(single$table)
```

The helper returns a list with two elements. `single$coefs` stores one
coefficient vector per selector (intercept followed by the original column
names), while `single$table` provides a tidy summary with the `selector`,
`variable`, `coef`, and `selected` columns. Column names are briefly shortened
internally to guarantee syntactic validity for all selectors, then mapped back to
their original labels in the returned objects.

The bootstrap helper `compare_selectors_bootstrap()` repeats this exercise over resampled datasets, providing empirical selection frequencies for each method.

```{r, cache=TRUE, eval=LOCAL}
freq <- compare_selectors_bootstrap(
  sim$X, pseudo_y, B = 15, include_enet = FALSE, seed = 321
)
head(freq)
```

The `freq` column reports the share of bootstrap replicates in which the
variable was selected. Higher values indicate stronger evidence across resamples
and help prioritise predictors for follow-up modelling.

Merging both outputs with `compare_table()` gives a compact summary containing the per-run coefficients and the associated bootstrap frequencies.

```{r, cache=TRUE, eval=LOCAL}
summary_tab <- compare_table(single$table, freq)
head(summary_tab)
```

## Visual comparisons

The package offers quick visualisation helpers that default to base graphics
but automatically use `ggplot2` when it is available. The coefficient heatmap
highlights which variables are selected and their estimated effect sizes across selectors.

```{r, cache=TRUE, eval=LOCAL}
plot_compare_coeff(single$table)
```

Selection frequencies from the bootstrap stage can be plotted in the same
manner, providing a stability-oriented counterpart to the coefficient map.

```{r, cache=TRUE, eval=LOCAL}
plot_compare_freq(freq)
```

Combining both outputs with `compare_table()` retains the bootstrap frequencies
next to the single-run coefficients for easy inspection.

## Interval stability selection with `fastboost_interval`

Instead of reducing intervals to single values, the `fastboost_interval()`
routine repeatedly samples pseudo-responses inside each interval before running the chosen selector. The resulting selection frequencies account for the uncertainty carried by the censored observations.

```{r, cache=TRUE, eval=LOCAL}
fb <- fastboost_interval(
  sim$X, sim$Y_low, sim$Y_high,
  func = function(X, y) betareg_glmnet(X, y, choose = "bic", prestandardize = TRUE),
  B = 30, seed = 99
)
sort(fb$freq, decreasing = TRUE)[1:5]
```

These additions demonstrate how the simulation engine, the collection of
selectors, and the fast interval booster interact in a cohesive workflow for
interval-valued beta regression problems.

## Interval SelectBoost with `sb_beta_interval`

When you want to run the full SelectBoost workflow on interval data, the
`sb_beta_interval()` wrapper calls `sb_beta()` with the appropriate interval
sampling mode under the hood. It keeps the same output structure as other
`sb_beta()` runs, including the stability matrix and diagnostic attributes.

```{r, cache=TRUE, eval=LOCAL}
comp_id <- complete.cases(sim$Y_low) & complete.cases(sim$Y_high)

sb_interval <- sb_beta_interval(
  sim$X[comp_id, ],
  Y_low = sim$Y_low[comp_id],
  Y_high = sim$Y_high[comp_id],
  B = 30,
  step.num = 0.4,
  sample = "uniform"
)
attr(sb_interval, "interval")
head(sb_interval)
```

Pair this output with the plotting methods (`autoplot.sb_beta()`, `summary()`) to inspect
the stability matrix produced under interval resampling.

```{r, cache=TRUE, eval=LOCAL}
summary(sb_interval)
autoplot.sb_beta(sb_interval)
```