---
title: "Data Cleaning and Preparation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Cleaning and Preparation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  warning = FALSE,
  message = FALSE,
  error = TRUE
)
```

```{r setup}
library(clinpubr)
library(dplyr)
```

## Introduction

Clinical data from Electronic Health Records (EHR) often contains inconsistencies, missing values, outliers, and formatting issues that must be addressed before analysis. This vignette demonstrates a structured cleaning workflow using `clinpubr`:

1. **Data Overview** --- Assess data quality
2. **Format Cleaning** --- Extract numbers, convert dates, check non-numeric values
3. **Unit Standardization** --- Harmonize measurement units
4. **Outlier Detection** --- Identify and manage outliers (after standardization)
5. **Missing Value Handling** --- Filter or handle missing data
6. **Data Transformation** --- Create derived variables

> **Why this order?** Format cleaning and unit standardization must precede outlier detection --- mixed units (e.g., mg/dL vs mmol/L) can cause false outlier flags.

## Creating Realistic Messy Lab Data

Let's create a sample laboratory dataset mimicking real-world EHR data with common quality issues. This dataset will be used throughout the vignette.

```{r create-messy-data}
set.seed(123)
n <- 100

# Patient IDs
patient_ids <- paste0("P", sprintf("%03d", 1:n))

# Glucose: mix of mg/dL and mmol/L (1 mmol/L = 18 mg/dL)
# Also include some outliers and missing values
glucose_vals <- c(
  rnorm(35, mean = 100, sd = 15), # mg/dL normal range
  rnorm(30, mean = 5.5, sd = 0.8), # mmol/L normal range
  rnorm(10, mean = 200, sd = 20), # mg/dL high (outliers)
  rnorm(10, mean = 15, sd = 2), # mmol/L high (outliers)
  rep(NA, 10), # missing values
  999, 888, 0, -50, 500 # erroneous values
)
glucose_vals <- round(sample(glucose_vals, n), 1)

glucose_units <- sample(c("mg/dL", "mmol/L", NA), n, replace = TRUE, prob = c(0.5, 0.4, 0.1))

# Creatinine: mix of mg/dL and umol/L (1 mg/dL = 88.4 umol/L)
creatinine_vals <- c(
  rnorm(35, mean = 1.0, sd = 0.2), # mg/dL normal range
  rnorm(30, mean = 88, sd = 15), # umol/L normal range
  rnorm(10, mean = 3.0, sd = 0.5), # mg/dL high (outliers)
  rnorm(10, mean = 265, sd = 44), # umol/L high (outliers)
  rep(NA, 10), # missing values
  999, 0, -10, 1000, 0.001 # erroneous values
)
creatinine_vals <- round(sample(creatinine_vals, n), 2)

creatinine_units <- sample(c("mg/dL", "umol/L", NA), n, replace = TRUE, prob = c(0.5, 0.4, 0.1))

# Cholesterol: mix of mg/dL and mmol/L (1 mmol/L = 38.67 mg/dL)
cholesterol_vals <- c(
  rnorm(35, mean = 180, sd = 30), # mg/dL normal range
  rnorm(30, mean = 4.5, sd = 0.8), # mmol/L normal range
  rnorm(10, mean = 350, sd = 40), # mg/dL high (outliers)
  rnorm(10, mean = 9, sd = 1), # mmol/L high (outliers)
  rep(NA, 10), # missing values
  888, 999, 50, 600, 0 # erroneous values
)
cholesterol_vals <- round(sample(cholesterol_vals, n), 2)

cholesterol_units <- sample(c("mg/dL", "mmol/L", NA), n, replace = TRUE, prob = c(0.5, 0.4, 0.1))

# Test dates with various formats and impossible dates
test_date_vals <- sample(c(
  format(Sys.Date() - sample(1:365, 60, replace = TRUE), "%Y-%m-%d"), # ISO format
  format(Sys.Date() - sample(1:365, 30, replace = TRUE), "%Y/%m/%d"), # Slash format
  "1900-01-01", "2030-12-31", "N/A", "", "unknown", "pending", # Invalid dates
  rep(NA, 10)
), n * 3, replace = TRUE)

# Create long-format lab data with messy values
lab_data <- data.frame(
  patient_id = rep(patient_ids, 3),
  test = rep(c("Glucose", "Creatinine", "Cholesterol"), each = n),
  value = c(glucose_vals, creatinine_vals, cholesterol_vals),
  unit = c(glucose_units, creatinine_units, cholesterol_units),
  test_date = test_date_vals
)

# Add messy string values that need cleaning
# Glucose with text annotations and European decimal commas
messy_glucose_idx <- sample(which(lab_data$test == "Glucose"), 15)
lab_data$value[messy_glucose_idx] <- sample(c(
  "<40", ">500", "5.2", "6.8", "120 mg/dL", "89 mmol/L",
  "5.2 (fasting)", "180 (post-meal)", "N/A", "pending",
  "6..5", "7..2", "8.5.1", "normal range", "see comment"
), length(messy_glucose_idx), replace = TRUE)

# Creatinine with text annotations
messy_creatinine_idx <- sample(which(lab_data$test == "Creatinine"), 12)
lab_data$value[messy_creatinine_idx] <- sample(c(
  "<0.5", ">5.0", "1.1", "0.9", "1.2 mg/dL", "110 umol/L",
  "1.5 (dialysis)", "2.8 (critical)", "N/A", " hemolyzed",
  "1.3 ", " 1.4", "(1.2)", "1.0*", "see note"
), length(messy_creatinine_idx), replace = TRUE)

# Cholesterol with text annotations
messy_cholesterol_idx <- sample(which(lab_data$test == "Cholesterol"), 10)
lab_data$value[messy_cholesterol_idx] <- sample(c(
  "<100", ">400", "4.5", "5.2", "200 mg/dL", "5.5 mmol/L",
  "180 (fasting)", "250 (borderline)", "N/A", "lipemic"
), length(messy_cholesterol_idx), replace = TRUE)

knitr::kable(head(lab_data[sample(nrow(lab_data)), ], 10),
  caption = "Original Messy Lab Data (10 random rows)"
)
```

## Step 1: Data Overview

**Data used**: `lab_data` (the messy lab data created above)

Use `data_overview()` to get a comprehensive diagnostic report of data quality issues:

```{r data-overview}
overview <- data_overview(lab_data)

print(overview$variable_types)
print(overview$summary_stats)
print(overview$quality_issues$missing_values)
```

Check for unit conflicts across different test types:

```{r check-units}
knitr::kable(unit_view(lab_data, subject_col = "test", value_col = "value", unit_col = "unit"),
  caption = "Unit Conflicts by Test Type"
)
```

## Step 2: Format Cleaning

**Data used**: `lab_data`

Format cleaning is essential before any numerical analysis. Lab data often contains mixed formats, text annotations, and inconsistent date representations.

### 2.1 Checking for Non-numeric Values

First, identify which entries cannot be converted to numeric values:

```{r check-nonnum}
nonnum_df <- df_view_nonnum(lab_data)
knitr::kable(head(nonnum_df, 15), caption = "Non-numeric Entries by Variable")
```

### 2.2 Extracting Numeric Values

Use `extract_num()` to clean messy numeric strings. This function handles:
- Text annotations (e.g., "120 mg/dL" → 120)
- Range indicators (e.g., "<40", ">500")
- Extra decimal points (e.g., "5..2" → 5.2)
- Extra whitespace and punctuation

```{r extract-numeric}
# Create a copy for cleaning
lab_data_cleaned <- lab_data

# Extract numeric values from the messy value column
lab_data_cleaned$value_numeric <- extract_num(lab_data$value)

# Show before/after comparison
comparison <- data.frame(
  test = lab_data$test,
  original = lab_data$value,
  cleaned = lab_data_cleaned$value_numeric,
  unit = lab_data$unit
) %>%
  dplyr::filter(original != cleaned | is.na(cleaned)) %>%
  head(15)

knitr::kable(comparison, caption = "Value Cleaning: Before vs After")
```

### 2.3 Date Conversion

Lab data often contains test dates in various formats. Use `to_date()` to standardize them:

```{r date-conversion}
# Convert various date formats to standard Date objects
lab_data_cleaned$test_date_clean <- to_date(lab_data$test_date)

# Show date conversion results
date_comparison <- data.frame(
  original = lab_data$test_date,
  cleaned = lab_data_cleaned$test_date_clean
) %>%
  dplyr::filter(!is.na(original)) %>%
  head(15)

knitr::kable(date_comparison, caption = "Date Conversion: Before vs After")
```

Check for invalid dates that couldn't be parsed:

```{r check-invalid-dates}
invalid_dates <- lab_data_cleaned %>%
  dplyr::filter(!is.na(test_date) & is.na(test_date_clean)) %>%
  dplyr::select(patient_id, test, test_date) %>%
  head(10)

knitr::kable(invalid_dates, caption = "Invalid Date Entries (Could Not Be Parsed)")
```

### 2.4 Update Data for Next Steps

Replace the original messy columns with cleaned versions:

```{r update-cleaned-data}
lab_data <- lab_data_cleaned %>%
  dplyr::mutate(
    value = value_numeric,
    test_date = test_date_clean
  ) %>%
  dplyr::select(-value_numeric, -test_date_clean)
```

## Step 3: Unit Standardization

**Data used**: `lab_data`

Unit standardization must be performed **before** outlier detection. This is critical when merging data from different laboratories. We will standardize all values to conventional units:

- **Glucose**: mg/dL (mmol/L × 18 = mg/dL)
- **Creatinine**: mg/dL (umol/L ÷ 88.4 = mg/dL)
- **Cholesterol**: mg/dL (mmol/L × 38.67 = mg/dL)

```{r unit-standardize-apply}
change_rules <- list(
  list(subject = "Glucose", target_unit = "mg/dL", units2change = "mmol/L", coeffs = 18),
  list(subject = "Creatinine", target_unit = "mg/dL", units2change = "umol/L", coeffs = 1 / 88.4),
  list(subject = "Cholesterol", target_unit = "mg/dL", units2change = "mmol/L", coeffs = 38.67)
)

lab_data_std <- unit_standardize(
  lab_data,
  subject_col = "test", value_col = "value",
  unit_col = "unit", change_rules = change_rules
)

knitr::kable(head(lab_data_std, 15), caption = "Lab Data After Unit Standardization")
```

Verify that units are now standardized:

```{r verify-standardization}
knitr::kable(unit_view(lab_data_std, subject_col = "test", value_col = "value", unit_col = "unit"),
  caption = "Units After Standardization"
)
```

## Step 4: Outlier Detection

**Data used**: `lab_data_std` (unit-standardized lab data)

> **Important**: Outlier detection is performed **after** unit standardization to ensure consistent scales across all measurements.

The package provides three methods, each with different strengths:

- **MAD** (Median Absolute Deviation): Robust to outliers, good for small samples
- **IQR** (Interquartile Range): Most commonly used in clinical research
- **Z-score**: Assumes normal distribution, use with caution for skewed data

Detect outliers for each test type using the IQR method:

```{r outlier-detection}
# Split data by test type for outlier detection
lab_data_clean <- lab_data_std

# Detect outliers for each test
for (test_name in c("Glucose", "Creatinine", "Cholesterol")) {
  test_data <- lab_data_std$value[lab_data_std$test == test_name]

  outlier_res <- detect_outliers(test_data, method = "iqr")

  cat("\n", test_name, ":\n")
  cat("  Total values:", length(test_data), "\n")
  cat("  Outliers detected:", sum(outlier_res$outlier_mask, na.rm = TRUE), "\n")
  cat("  Missing values:", sum(is.na(test_data)), "\n")
}
```

Compare outlier detection methods for Glucose:

```{r compare-methods}
glucose_values <- lab_data_std$value[lab_data_std$test == "Glucose"]

knitr::kable(data.frame(
  Method = c("MAD", "IQR", "Z-score"),
  Outlier_Count = c(
    sum(mad_outlier(glucose_values, threshold = 3), na.rm = TRUE),
    sum(iqr_outlier(glucose_values, threshold = 1.5), na.rm = TRUE),
    sum(zscore_outlier(glucose_values, threshold = 3), na.rm = TRUE)
  )
), caption = "Outlier Detection Results by Method (Glucose)")
```

### Handling Outliers

For this demonstration, we will set outliers to NA. **In practice, outliers should be manually verified** --- they may represent true extreme values (e.g., severe hyperglycemia) rather than data errors.

```{r handle-outliers}
# Create a copy for outlier handling
lab_data_final <- lab_data_std

# Set outliers to NA for each test type
for (test_name in c("Glucose", "Creatinine", "Cholesterol")) {
  test_idx <- lab_data_std$test == test_name
  test_values <- lab_data_std$value[test_idx]

  outlier_res <- detect_outliers(test_values, method = "iqr")

  # Set outliers to NA
  lab_data_final$value[test_idx][outlier_res$outlier_mask] <- NA
}

# Compare before and after outlier handling
comparison_df <- data.frame(
  Test = rep(c("Glucose", "Creatinine", "Cholesterol"), each = 2),
  Stage = rep(c("Before", "After"), 3),
  N = c(
    sum(!is.na(lab_data_std$value[lab_data_std$test == "Glucose"])),
    sum(!is.na(lab_data_final$value[lab_data_final$test == "Glucose"])),
    sum(!is.na(lab_data_std$value[lab_data_std$test == "Creatinine"])),
    sum(!is.na(lab_data_final$value[lab_data_final$test == "Creatinine"])),
    sum(!is.na(lab_data_std$value[lab_data_std$test == "Cholesterol"])),
    sum(!is.na(lab_data_final$value[lab_data_final$test == "Cholesterol"]))
  ),
  Mean = c(
    mean(lab_data_std$value[lab_data_std$test == "Glucose"], na.rm = TRUE),
    mean(lab_data_final$value[lab_data_final$test == "Glucose"], na.rm = TRUE),
    mean(lab_data_std$value[lab_data_std$test == "Creatinine"], na.rm = TRUE),
    mean(lab_data_final$value[lab_data_final$test == "Creatinine"], na.rm = TRUE),
    mean(lab_data_std$value[lab_data_std$test == "Cholesterol"], na.rm = TRUE),
    mean(lab_data_final$value[lab_data_final$test == "Cholesterol"], na.rm = TRUE)
  )
)

knitr::kable(comparison_df, caption = "Before vs After Outlier Handling", digits = 2)
```

> **Clinical Note**: Always verify outliers before removal. Some "outliers" may be clinically significant (e.g., glucose > 400 mg/dL in diabetic ketoacidosis). Consider creating an outlier indicator variable for sensitivity analyses rather than deleting values.

## Step 5: Missing Value Handling

**Data used**: `lab_data_final` (after standardization and outlier handling)

For missing data filtering by row/column missing rate, use `get_valid_subset()` to obtain a complete-enough subset for analysis. First, let's convert to wide format to create a longitudinal cohort dataset where each row represents a patient-test date combination:

```{r wide-format}
# Convert to wide format preserving all test dates
# This creates a longitudinal cohort dataset
lab_wide <- lab_data_final %>%
  filter(!is.na(test_date)) %>% # Remove records without valid dates
  select(patient_id, test_date, test, value) %>%
  tidyr::pivot_wider(
    names_from = test,
    values_from = value,
    names_sort = TRUE
  )

# Sort by patient and date for longitudinal view
lab_wide <- lab_wide %>%
  arrange(patient_id, test_date)

knitr::kable(head(lab_wide, 10), caption = "Longitudinal Lab Data (First 10 records)")
```

View the cohort structure - number of tests per patient:

```{r cohort-structure}
cohort_summary <- lab_wide %>%
  group_by(patient_id) %>%
  summarise(
    n_visits = n(),
    first_date = min(test_date),
    last_date = max(test_date)
  ) %>%
  head(10)

knitr::kable(cohort_summary, caption = "Cohort Structure: Visits per Patient")
```

Assess missing data patterns:

```{r missing-patterns}
missing_summary <- data.frame(
  Variable = names(lab_wide)[-1],
  Missing_Count = sapply(lab_wide[-1], function(x) sum(is.na(x))),
  Missing_Percent = round(sapply(lab_wide[-1], function(x) sum(is.na(x)) / length(x) * 100, 2))
)

knitr::kable(missing_summary, caption = "Missing Data Summary by Test")
```

Filter records with too much missing data:

```{r filter-missing}
lab_wide_clean <- get_valid_subset(
  lab_wide,
  row_na_ratio = 0.5, # Allow up to 50% missing per record
  col_na_ratio = 0.3 # Allow up to 30% missing per test type
)

cat(
  "Dimensions:", nrow(lab_wide), "records",
  "->", nrow(lab_wide_clean), "records\n"
)
cat(
  "Unique patients:", length(unique(lab_wide$patient_id)),
  "->", length(unique(lab_wide_clean$patient_id)), "\n"
)
```

## Step 6: Data Transformation

**Data used**: `lab_wide_clean` (cleaned longitudinal cohort data)

### Creating Clinical Categories

Create categorical variables from continuous lab values using clinical cut-points. In longitudinal data, each visit can have different classifications:

```{r clinical-categories}
# Glucose categories
lab_wide_clean$glucose_category <- cut_by(
  lab_wide_clean$Glucose,
  breaks = c(100, 126),
  labels = c("Normal", "Prediabetes", "Diabetes"),
  label_with_range = FALSE
)

# Creatinine categories (eGFR approximation)
lab_wide_clean$renal_function <- cut_by(
  lab_wide_clean$Creatinine,
  breaks = c(1.2, 2.0, 4.0),
  labels = c("Normal", "Mild", "Moderate", "Severe"),
  label_with_range = FALSE
)

# Cholesterol categories
lab_wide_clean$cholesterol_category <- cut_by(
  lab_wide_clean$Cholesterol,
  breaks = c(200, 240),
  labels = c("Desirable", "Borderline", "High"),
  label_with_range = FALSE
)

# Summary of categories across all visits
cat_summary <- data.frame(
  Category = c(
    rep("Glucose", length(table(lab_wide_clean$glucose_category))),
    rep("Renal Function", length(table(lab_wide_clean$renal_function))),
    rep("Cholesterol", length(table(lab_wide_clean$cholesterol_category)))
  ),
  Level = c(
    names(table(lab_wide_clean$glucose_category)),
    names(table(lab_wide_clean$renal_function)),
    names(table(lab_wide_clean$cholesterol_category))
  ),
  Count = c(
    as.vector(table(lab_wide_clean$glucose_category)),
    as.vector(table(lab_wide_clean$renal_function)),
    as.vector(table(lab_wide_clean$cholesterol_category))
  )
)

knitr::kable(cat_summary, caption = "Clinical Category Distributions (All Visits)")
```

### Longitudinal View Example

View a single patient's trajectory over time:

```{r longitudinal-view}
# Select one patient with multiple visits for demonstration
patient_trajectory <- lab_wide_clean %>%
  filter(patient_id == unique(patient_id)[1]) %>%
  select(
    patient_id, test_date, Glucose, Creatinine, Cholesterol,
    glucose_category, renal_function
  ) %>%
  arrange(test_date)

knitr::kable(patient_trajectory, caption = "Example: Single Patient's Lab Trajectory")
```

## Complete Cleaning Pipeline

Here's a complete workflow from raw lab data to analysis-ready data:

```{r complete-pipeline}
# Step 1: Data Overview
overview <- data_overview(lab_data)

# Step 2: Format Cleaning - Extract numeric values and convert dates
clean <- lab_data
clean$value <- extract_num(lab_data$value)
clean$test_date <- to_date(lab_data$test_date)

# Step 3: Unit Standardization
change_rules <- list(
  list(subject = "Glucose", target_unit = "mg/dL", units2change = "mmol/L", coeffs = 18),
  list(subject = "Creatinine", target_unit = "mg/dL", units2change = "umol/L", coeffs = 1 / 88.4),
  list(subject = "Cholesterol", target_unit = "mg/dL", units2change = "mmol/L", coeffs = 38.67)
)

clean <- unit_standardize(
  clean,
  subject_col = "test", value_col = "value",
  unit_col = "unit", change_rules = change_rules
)

# Step 4: Outlier Detection and Handling
for (test_name in c("Glucose", "Creatinine", "Cholesterol")) {
  test_idx <- clean$test == test_name
  test_values <- clean$value[test_idx]
  outlier_res <- detect_outliers(test_values, method = "iqr")
  clean$value[test_idx][outlier_res$outlier_mask] <- NA
}

# Step 5: Convert to wide format (longitudinal cohort data)
clean_wide <- clean %>%
  filter(!is.na(test_date)) %>%
  select(patient_id, test_date, test, value) %>%
  tidyr::pivot_wider(names_from = test, values_from = value) %>%
  arrange(patient_id, test_date)

# Step 6: Missing value filtering
clean_wide <- get_valid_subset(clean_wide, row_na_ratio = 0.5, col_na_ratio = 0.3)

# Step 7: Create clinical categories
clean_wide$glucose_category <- cut_by(
  clean_wide$Glucose,
  breaks = c(100, 126),
  labels = c("Normal", "Prediabetes", "Diabetes")
)

# Step 8: Final check
final_overview <- data_overview(clean_wide)
knitr::kable(final_overview$summary_stats, caption = "Final Data Quality Overview")

cat(
  "\nOriginal records:", nrow(lab_data),
  "| Final records:", nrow(clean_wide),
  "| Removed:", nrow(lab_data) - nrow(clean_wide), "\n"
)
cat(
  "Unique patients:", length(unique(lab_data$patient_id)),
  "->", length(unique(clean_wide$patient_id)), "\n"
)
```

## Summary

| Step | Function | Purpose | Data Used |
|------|----------|---------|-----------|
| 1. Data Overview | `data_overview()` | Comprehensive data quality assessment | Raw lab data |
| 2. Format Cleaning | `extract_num()`, `to_date()`, `df_view_nonnum()` | Clean messy strings, convert dates | Raw lab data |
| 3. Unit Standardization | `unit_standardize()` | Standardize measurement units | Format-cleaned data |
| 4. Outlier Detection | `detect_outliers()`, `iqr_outlier()` | Detect and handle outliers | Standardized data |
| 5. Missing Values | `get_valid_subset()` | Filter by missing rate | Post-outlier data |
| 6. Data Transformation | `cut_by()` | Create derived variables | Cleaned data |

### Key Principles

1. **Standardize before outlier detection**: Mixed units cause false outlier flags
2. **Verify outliers clinically**: Some extreme values are clinically meaningful
3. **Track data loss**: Document how many observations are removed at each step
4. **Preserve raw data**: Never overwrite original data; create cleaned copies