---
title: "Creating Baseline Characteristics Tables"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Creating Baseline Characteristics Tables}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  warning = FALSE,
  message = FALSE,
  error = TRUE
)
```

```{r setup}
library(clinpubr)
library(dplyr)
library(survival)
```

## Introduction

Baseline characteristics tables (Table 1) summarize patient demographics and clinical features at study entry. The `clinpubr` package automates the key decisions: variable type classification, normality assessment, statistical test selection, and missing data reporting.

## Loading and Preparing Data

We'll use the NCCTG Lung Cancer dataset from the `survival` package:

```{r load-data}
data(cancer, package = "survival")
str(cancer)
knitr::kable(head(cancer), caption = "Raw Data Preview")
```

Create derived variables for demonstration:

```{r prepare-data}
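# cut() closes intervals on the right by default, so the bins are
# (0, 50], (50, 60], (60, 70], (70, 100]; the labels reflect that
cancer$age_group <- cut(cancer$age,
  breaks = c(0, 50, 60, 70, 100),
  labels = c("<=50", "51-60", "61-70", ">70")
)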

# Combine sparse ECOG categories
cancer$ph.ecog_cat <- factor(cancer$ph.ecog,
  levels = c(0:3),
  labels = c("0", "1", ">=2", ">=2")
)

# Add missing values for demonstration
set.seed(123)
cancer$meal.cal[sample(seq_len(nrow(cancer)), 30)] <- NA
cancer$wt.loss[sample(seq_len(nrow(cancer)), 20)] <- NA

knitr::kable(head(cancer), caption = "Data After Preparation")
```
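
The recoding steps above rely on two base R behaviours that are easy to misread: `factor()` merges levels that share a label, and `cut()` closes its intervals on the right by default. A minimal check on made-up values (`ecog_demo` and `age_demo` are illustrative vectors, not study data):

```{r check-derived}
# Toy ECOG scores covering every level, plus one missing value
ecog_demo <- c(0, 1, 2, 3, 2, NA)
ecog_cat_demo <- factor(ecog_demo,
  levels = 0:3,
  labels = c("0", "1", ">=2", ">=2")
)
levels(ecog_cat_demo) # the two ">=2" labels collapse into a single level
table(ecog_cat_demo, useNA = "ifany")

# With right-closed intervals, an age of exactly 50 falls in the first bin
age_demo <- c(50, 51, 70, 71)
age_bins_demo <- cut(age_demo, breaks = c(0, 50, 60, 70, 100))
data.frame(age = age_demo, bin = age_bins_demo)
```

Tabulating a derived factor against its source variable in the same way is a quick guard against mislabelled bins.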

## Automatic Variable Type Determination

Before creating a baseline table, use `get_var_types()` to sort the variables into the following categories:

- **Factor variables**: Categorical (factors or numeric with few unique values)
- **Non-normal variables**: Continuous variables failing normality tests
- **Exact test variables**: Categorical variables with small cell counts (Fisher's exact test)
- **Omitted variables**: Variables with too many levels

```{r get-var-types}
var_types <- get_var_types(cancer, strata = "sex")

var_types
```
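
To build intuition for these categories, the sketch below runs comparable checks by hand in base R. It only illustrates the ideas. It is not `clinpubr`'s internal code, and the cut-offs used here (five unique values, expected counts below 5) are assumptions chosen for the example.

```{r classification-sketch}
# Illustrative checks only; thresholds are assumptions, not package defaults
lung_demo <- survival::lung

# 1. Numeric variables with few distinct values behave like categories
n_unique <- vapply(
  lung_demo,
  function(x) length(unique(stats::na.omit(x))),
  integer(1)
)
names(n_unique)[n_unique <= 5]

# 2. Shapiro-Wilk as one possible normality check for a continuous variable
stats::shapiro.test(lung_demo$meal.cal)$p.value

# 3. Small expected cell counts favour Fisher's exact test over chi-square
ecog_by_sex <- table(lung_demo$ph.ecog, lung_demo$sex)
expected <- suppressWarnings(stats::chisq.test(ecog_by_sex))$expected
any(expected < 5)
```

Comparing these hand-run checks with the output of `get_var_types()` is one way to decide whether a manual override is warranted.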

### Customizing Classification

Adjust thresholds for automatic classification:

```{r customize-var-types}
var_types_custom <- get_var_types(
  cancer,
  strata = "sex",
  num_to_factor = 10, # Numeric vars with <=10 unique values treated as factor
  omit_factor_above = 15, # Omit factors with >15 levels
  norm_test_by_group = TRUE # Test normality within each stratum
)

var_types_custom
```

Save QQ plots for manual review of normality tests (optional):

```{r save-qqplots, eval = FALSE}
var_types_with_plots <- get_var_types(
  cancer,
  strata = "sex",
  save_qqplots = TRUE, folder_name = "qqplots_review"
)
```

## Creating Baseline Tables

### Basic Baseline Table

`baseline_table()` uses the classification in `var_types` to select summary statistics (mean/SD vs median/IQR) and statistical tests (t-test, Mann-Whitney U test, chi-square test, or Fisher's exact test):

```{r basic-baseline}
baseline_result <- baseline_table(
  cancer,
  var_types = var_types,
  save_table = FALSE
)

knitr::kable(baseline_result$baseline, caption = "Baseline Characteristics by Sex")
```
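
For reference, the tests named above can be run by hand with base R. This sketch uses `survival::lung` directly and shows the parametric and non-parametric alternatives side by side. It is an illustration of the choice being automated, not a reproduction of how `baseline_table()` computes its p-values.

```{r manual-tests}
lung_demo <- survival::lung

# Continuous variable, two groups: t-test vs Mann-Whitney (Wilcoxon rank-sum)
t_p <- stats::t.test(age ~ sex, data = lung_demo)$p.value
w_p <- stats::wilcox.test(age ~ sex, data = lung_demo, exact = FALSE)$p.value

# Categorical variable: chi-square vs Fisher's exact test
ecog_by_sex <- table(lung_demo$ph.ecog, lung_demo$sex)
chi_p <- suppressWarnings(stats::chisq.test(ecog_by_sex))$p.value
fisher_p <- stats::fisher.test(ecog_by_sex)$p.value

round(c(t_test = t_p, wilcoxon = w_p, chi_square = chi_p, fisher = fisher_p), 3)
```

When the paired tests disagree noticeably, the variable's classification deserves a closer look.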

### Multi-Group Comparisons

With more than two groups, pairwise comparisons are generated automatically, with optional multiple-testing correction:

```{r multi-group}
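# Reload the original data; this discards the derived variables and the
# artificial missing values added earlier
data(cancer, package = "survival")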
cancer$ph.ecog_cat <- factor(cancer$ph.ecog,
  levels = c(0:3),
  labels = c("0", "1", ">=2", ">=2")
)

var_types_ecog <- get_var_types(cancer, strata = "ph.ecog_cat")

baseline_multi <- baseline_table(
  cancer,
  var_types = var_types_ecog,
  save_table = FALSE,
  multiple_comparison_test = TRUE,
  p_adjust_method = "BH"
)

knitr::kable(baseline_multi$baseline, caption = "Baseline Characteristics by ECOG Status")
knitr::kable(baseline_multi$pairwise, caption = "Pairwise Comparison P-values")
```
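
The Benjamini-Hochberg ("BH") adjustment requested above can be illustrated with `stats::p.adjust()`. The three p-values below are made up for the example and do not come from the table.

```{r bh-sketch}
# Hypothetical raw p-values from three pairwise comparisons
raw_p <- c(a_vs_b = 0.01, a_vs_c = 0.04, b_vs_c = 0.03)

# BH scales each sorted p-value by n / rank, then enforces monotonicity
adj_p <- stats::p.adjust(raw_p, method = "BH")
adj_p
#> a_vs_b a_vs_c b_vs_c
#>   0.03   0.04   0.04
```

The smallest p-value is multiplied by the number of comparisons (3), while the larger ones are capped by the largest adjusted value.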

### Customizing the Table

Select specific variables, add standardized mean differences (SMD), and handle missing strata:

```{r customize-baseline}
baseline_custom <- baseline_table(
  cancer,
  var_types = var_types,
  vars = c("age", "wt.loss", "meal.cal", "ph.ecog"),
  smd = TRUE,
  omit_missing_strata = TRUE,
  seed = 123
)

knitr::kable(baseline_custom$baseline, caption = "Customized Baseline Table")
```
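
As a point of reference for the SMD column, the sketch below computes a standardized mean difference for one continuous variable by hand, using a common definition: the absolute difference in means divided by the square root of the average of the two group variances. The value reported by `baseline_table()` may be computed differently, so treat this as an illustration of the concept only.

```{r smd-sketch}
lung_demo <- survival::lung
age_male <- lung_demo$age[lung_demo$sex == 1]
age_female <- lung_demo$age[lung_demo$sex == 2]

# Pooled SD: square root of the mean of the two group variances
pooled_sd <- sqrt((stats::var(age_male) + stats::var(age_female)) / 2)
smd_age <- abs(mean(age_male) - mean(age_female)) / pooled_sd
round(smd_age, 3)
```

Unlike a p-value, this quantity does not shrink or grow with sample size alone, which is why it is often preferred for judging balance.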

### Missing Data Summary

```{r missing-table}
knitr::kable(baseline_result$missing, caption = "Missing Data Summary")
```
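
A comparable per-variable count can be produced in base R, which is handy for cross-checking. This sketch uses the unmodified `survival::lung` data, so its counts will not match a table built after artificial missing values were added. The column names of `missing_summary` are chosen for the example.

```{r missing-sketch}
lung_demo <- survival::lung

# Count and percentage of missing values per column
n_missing <- colSums(is.na(lung_demo))
missing_summary <- data.frame(
  variable = names(n_missing),
  n_missing = as.integer(n_missing),
  pct_missing = round(100 * as.numeric(n_missing) / nrow(lung_demo), 1)
)

# Show only the variables that have at least one missing value
missing_summary[missing_summary$n_missing > 0, ]
```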

## Manual Override

Override automatic classification based on clinical knowledge or manual review:

```{r manual-override}
baseline_manual <- baseline_table(
  cancer,
  strata = "sex",
  factor_vars = c("ph.ecog", "pat.karno"),
  nonnormal_vars = c("age"),
  exact_vars = c("ph.ecog")
)

knitr::kable(baseline_manual$baseline, caption = "Baseline Table with Manual Overrides")
```

## Saving Results

Save all tables to CSV files:

```{r save-results, eval = FALSE}
baseline_saved <- baseline_table(
  cancer,
  var_types = var_types,
  save_table = TRUE, filename = "baseline_characteristics.csv"
)
```

## Complete Workflow

A streamlined 5-step workflow from data preparation to final table:

```{r complete-workflow}
# Step 1: Prepare data
data(cancer, package = "survival")
cancer_clean <- cancer %>%
  mutate(
    # Right-closed bins: (0, 50], (50, 60], (60, 70], (70, 100]
    age_group = cut(age,
      breaks = c(0, 50, 60, 70, 100),
      labels = c("<=50", "51-60", "61-70", ">70")
    ),
    ph.ecog_cat = factor(ph.ecog,
      levels = c(0:3),
      labels = c("0", "1", ">=2", ">=2")
    ),
    sex = factor(sex, labels = c("Male", "Female"))
  )

# Step 2: Determine variable types
var_types <- get_var_types(cancer_clean, strata = "sex", num_to_factor = 5)

# Step 3: Review classification
knitr::kable(data.frame(
  Variable_Type = c("Factor", "Non-normal", "Exact"),
  Variables = c(
    paste(var_types$factor_vars, collapse = ", "),
    paste(var_types$nonnormal_vars, collapse = ", "),
    paste(var_types$exact_vars, collapse = ", ")
  )
), caption = "Variable Type Review")

# Step 4: Create baseline table
baseline_final <- baseline_table(
  cancer_clean,
  var_types = var_types,
  smd = TRUE
)

# Step 5: Review results
knitr::kable(baseline_final$baseline, caption = "Final Baseline Characteristics Table")
knitr::kable(baseline_final$missing, caption = "Final Missing Data Summary")
```

## Summary

### Key Functions

- **`get_var_types()`**: Automatic variable type determination with customizable thresholds
- **`baseline_table()`**: Create comprehensive baseline tables with automatic test selection

### Best Practices

1. **Review automatic classifications**: clinical knowledge should override statistical defaults when appropriate
2. **Include SMD for observational studies**: standardized mean differences help assess group balance
3. **Handle missing data transparently**: report missing patterns in your tables
4. **Use BH correction for multi-group pairwise comparisons**: adjusted p-values control the false discovery rate across the pairwise tests