---
title: "DIVINE"
output:
  rmarkdown::html_vignette:
    toc: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{DIVINE}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

```{r setup, include = FALSE}
rm(list = ls())
library("DIVINE")
```

# Introduction

The **DIVINE** package offers a collection of curated datasets along with convenient data management functions. These sample datasets (e.g., patient demographics, symptoms, treatments, and outcomes) are included for demonstration purposes, enabling users to practice and test various data manipulation workflows. Using DIVINE functions, you can inspect, clean, merge, summarize, visualize, and export data efficiently. In this vignette, we illustrate common tasks and functions provided by DIVINE to help streamline your data analysis process.

# Installation and Loading

To install DIVINE, use the standard CRAN or GitHub approach:

- CRAN:

```{r eval=FALSE}
install.packages("DIVINE")
```

- GitHub (development version):

```{r eval=FALSE}
# install.packages("devtools")
devtools::install_github("bruigtp/DIVINE")
pak::pak("bruigtp/DIVINE") # Alternative
```

Once installed, load the package:

```{r eval=FALSE}
library(DIVINE)
```

This makes all datasets and functions available in your R session.

# Available Datasets

Use the `data()` function to list the available sample datasets:

```{r eval=FALSE}
data(package = "DIVINE")
```

The DIVINE package includes the following datasets (each is a data frame):

- `analytics`
- `comorbidities`
- `complications`
- `concomitant_medication`
- `demographic`
- `end_followup`
- `icu`
- `inhosp_antibiotics`
- `inhosp_antivirals`
- `inhosp_other_treatments`
- `scores`
- `symptoms`
- `vital_signs`
- `vaccine`

To read the documentation for a dataset:

```{r eval=FALSE}
?demographic
```

Load any dataset into your R environment using `data("dataset_name")`. For example, to load and preview the `demographic` dataset:

```{r}
data("demographic")
head(demographic)
```

# Workflow Examples

The following examples demonstrate a typical data management workflow using DIVINE functions. Each section shows how to use a function step-by-step with example output.

## 1. Inspecting Data with `data_overview()`

Start by understanding your dataset’s shape, variable types, and missingness with `data_overview()`. The function returns a named list containing the dataset dimensions, variable types, missing-value counts, and a small preview of the data (by default the first 6 rows).

```{r}
# Overview of your data frame
ov <- data_overview(demographic)

# Print the entire overview
ov
```

You can also access each component individually:

```{r}
# Each of the elements
ov$dimensions      # number of rows and columns
ov$variable_types  # data types of each variable
ov$missing_values  # count of missing values per column
ov$preview         # a small preview of the data
```

This helps you quickly assess the dataset before any processing.

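
For intuition about what such an overview contains, the same four pieces of information can be assembled by hand in base R. This is a minimal sketch on a toy data frame, not DIVINE's implementation: `toy` and `manual_overview` are hypothetical names, and only the element names mirror those shown above.

```r
# Toy data frame standing in for a real dataset (hypothetical values)
toy <- data.frame(
  id  = 1:4,
  age = c(34, NA, 58, 71),
  sex = c("F", "M", NA, "F"),
  stringsAsFactors = FALSE
)

# Assemble an overview-like list by hand
manual_overview <- list(
  dimensions     = dim(toy),                                           # rows, columns
  variable_types = vapply(toy, function(col) class(col)[1], character(1)),
  missing_values = colSums(is.na(toy)),                                # NA count per column
  preview        = head(toy)                                           # first rows (up to 6)
)

manual_overview$missing_values
```

Comparing `missing_values` against the number of rows in `dimensions` is a quick way to decide which columns need imputation before any further processing.
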
## 2. Handling Missing Values with `impute_missing()`

The `impute_missing()` function lets you replace missing values using a specific strategy. You provide a list of formulas (`<columns> ~ <strategy>`) through the `method` argument, where `<columns>` can be any tidyselect expression (e.g. a column name, `starts_with()`, or `where(is.numeric)`) and `<strategy>` is one of the following:

- "mean" or "median" (numeric columns only)
- "mode" (character/factor columns only)
- a numeric constant (for numeric columns)
- a character constant (for character/factor columns)

You can also drop rows that are entirely `NA` by setting `drop_all_na = TRUE`.

```{r}
# 1) Default: replace all numeric NAs with column means
cleaned_default <- impute_missing(icu)

# 2) Single column strategies:
#    - Mean for vent_mec_start_days
#    - Zero for icu_enter_days
cleaned_mix <- impute_missing(
  icu,
  method = list(
    vent_mec_start_days ~ "mean",
    icu_enter_days ~ 0
  )
)

# 3) Multiple columns at once:
#    - Medians for any column ending in "_days"
cleaned_days_median <- impute_missing(
  icu,
  method = list(ends_with("_days") ~ "median")
)

# 4) Factor/character imputation:
#    - Fill covid_wave with its most common value
#    - Fill icu with "Unknown"
cleaned_char <- impute_missing(
  icu,
  method = list(
    covid_wave ~ "mode",
    icu ~ "Unknown"
  )
)

# 5) Drop all-NA rows first, then impute numeric means
cleaned_no_empty <- impute_missing(
  icu,
  method = list(where(is.numeric) ~ "mean"),
  drop_all_na = TRUE
)
# ▶ message: Removed X rows where all values were NA
```

After running `impute_missing()`, the returned dataset has missing values replaced in the columns you specified, according to your chosen strategies. The overall dataset structure (column names and types) is preserved where possible; the number of rows changes only if you set `drop_all_na = TRUE`.

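
To make the "mean" and "mode" strategies concrete, here is a minimal base R sketch of what each one computes. It is an illustration under assumptions, not the package's code: the toy data and the helpers `impute_mean()` and `impute_mode()` are hypothetical, and how DIVINE breaks ties for the mode is not documented here.

```r
# Toy data with missing values (hypothetical)
toy <- data.frame(
  stay_days = c(2, NA, 6, 4),
  wait_days = c(NA, 1, 3, 5),
  ward      = c("A", "B", NA, "A"),
  stringsAsFactors = FALSE
)

# Replace NAs in a numeric vector with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Replace NAs with the most frequent observed value
# (on ties, the first value in table() order wins; an assumption of this sketch)
impute_mode <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Select columns by suffix, the base R analogue of a tidyselect suffix match
days_cols <- names(toy)[endsWith(names(toy), "_days")]
toy[days_cols] <- lapply(toy[days_cols], impute_mean)
toy$ward <- impute_mode(toy$ward)

toy
```

Note that the suffix match uses a literal string, not a regular expression, which is also how tidyselect's `starts_with()` and `ends_with()` treat their argument.
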
## 3. Merging Multiple Tables with `multi_join()`

When working with related tables, `multi_join()` combines several data frames into a single table by a common key. Choose the join behavior with `join_type = "left"`, `"inner"`, `"right"`, or `"full"` to control which rows are kept. For example, suppose you have demographic, vital signs, and scores tables that all share the package's default key variables (`record_id`, `covid_wave`, `center`):

```{r}
data("vital_signs")
data("scores")

joined <- multi_join(
  list(demographic, vital_signs, scores),
  key = c("record_id", "covid_wave", "center"),
  join_type = "left"
)
```

This creates one combined data frame `joined` that includes all rows from `demographic` plus the matching information from `vital_signs` and `scores`, matched on the three key columns. Use `join_type` to decide whether unmatched rows from one or more tables should be retained.

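
The effect of the join type is easiest to see on a small example. The sketch below chains base R `merge()` calls with `Reduce()` over three toy tables; it is an assumption-laden illustration of sequential joining on shared keys, not a description of how `multi_join()` is implemented, and all table names and values are hypothetical.

```r
# Three toy tables sharing two key columns (hypothetical data)
demo <- data.frame(
  record_id = c(1, 2, 3),
  center    = c("A", "A", "B"),
  age       = c(40, 55, 63)
)
vitals <- data.frame(
  record_id = c(1, 3),
  center    = c("A", "B"),
  temp      = c(37.2, 38.1)
)
score_tbl <- data.frame(
  record_id = c(1, 2),
  center    = c("A", "A"),
  score     = c(3, 5)
)

keys <- c("record_id", "center")

# Left join: keep every row of the first table, fill unmatched cells with NA
left_joined <- Reduce(
  function(x, y) merge(x, y, by = keys, all.x = TRUE),
  list(demo, vitals, score_tbl)
)

# Inner join: keep only records present in all three tables
inner_joined <- Reduce(
  function(x, y) merge(x, y, by = keys),
  list(demo, vitals, score_tbl)
)

left_joined
inner_joined
```

Here the left join keeps all three records, with `NA` where a table has no match, while the inner join keeps only record 1, the single record present in every table.
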
## 4. Creating Summary Tables with `stats_table()`

Use `stats_table()` to generate summary tables (leveraging the `gtsummary` package) for one or more variables, optionally stratified by a grouping variable. Use the `statistic_type` argument to select the summary:

- "mean_sd" (show `mean (SD)` for numeric variables)
- "median_iqr" (show `median [Q1; Q3]`)
- "both" (include both `mean (SD)` and `median [Q1; Q3]` where applicable)

```{r}
# Mean (SD) by sex, with p-values for the group comparison
tbl1 <- stats_table(
  demographic,
  vars = c("age", "smoker", "alcohol"),
  by = "sex",
  statistic_type = "mean_sd",
  pvalue = TRUE
)

# Median [Q1; Q3] for all observations (no grouping)
tbl2 <- stats_table(
  demographic,
  statistic_type = "median_iqr"
)

# Both mean (SD) and median [Q1; Q3] combined
tbl3 <- stats_table(
  demographic,
  statistic_type = "both"
)
```

Each `tbl` object is a **gtsummary**-style table that you can print, refine, or export to reports. Set `pvalue = TRUE` to add p-values for group comparisons, and consult the function documentation for options to format labels and missing-value displays.

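
As a reference point for reading these tables, the two numeric summaries can be computed by hand in base R. This is a sketch on hypothetical toy data; the helper names `mean_sd()` and `median_iqr()` are illustrative, and the quartiles use R's default `quantile()` method, which may or may not match the one used inside `stats_table()`.

```r
# Toy data (hypothetical)
toy <- data.frame(
  sex = c("F", "F", "M", "M"),
  age = c(30, 50, 40, 60)
)

# Format a numeric vector as "mean (SD)"
mean_sd <- function(x) {
  sprintf("%.1f (%.1f)", mean(x), sd(x))
}

# Format a numeric vector as "median [Q1; Q3]"
median_iqr <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  sprintf("%.1f [%.1f; %.1f]", median(x), q[1], q[2])
}

# Apply each summary within the groups defined by sex
ages_by_sex <- split(toy$age, toy$sex)
mean_sd_by_sex    <- vapply(ages_by_sex, mean_sd, character(1))
median_iqr_by_sex <- vapply(ages_by_sex, median_iqr, character(1))

mean_sd_by_sex
median_iqr_by_sex
```

Mean (SD) suits roughly symmetric variables, while median [Q1; Q3] is more robust for skewed ones, which is the usual reason to choose between the two displays.
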
## 5. Visualizing Data with `multi_plot()`

Create common plot types quickly with `multi_plot()`. It supports histograms, density plots, boxplots, barplots, and spider (radar) charts. For instance:

```{r}
#| fig.alt: >
#|   Different types of plots

# Histogram of age
multi_plot(
  demographic,
  x_var = "age",
  plot_type = "histogram",
  fill_color = "skyblue",
  title = "Distribution of Age"
)

# Boxplot of age by sex
multi_plot(
  demographic,
  x_var = "sex",
  y_var = "age",
  plot_type = "boxplot",
  group = "sex",
  title = "Age by Sex"
)

# Spider plot of several comorbidity indicators, using "Yes" as the reference level
multi_plot(
  comorbidities,
  x_var = "hypertension",
  y_var = "dyslipidemia",
  plot_type = "spider",
  z_var = c("depression", "mild_kidney_disease", "ceiling_dico"),
  radar_vlabels = stringr::str_to_sentence(c("hypertension", "dyslipidemia", "depression", "mild_kidney_disease", "ceiling_dico")),
  radar_color = "blue",
  radar_ref_lev = "Yes"
)
```

Each call generates a plot (using **ggplot2** under the hood). Customize titles, colors, and variables as needed for your data.

## 6. Exporting Data with `export_data()`

Finally, use `export_data()` to save your processed data to disk in various formats. Supported formats include CSV, XLSX, RDS, SPSS, Stata, and SAS:

```{r eval=FALSE}
# Export the imputed ICU data to CSV
export_data(cleaned_default, format = "csv", path = "cleaned_icu.csv")

# Export joined data to Excel
export_data(joined, format = "xlsx", path = "joined_data.xlsx")
```

Specify the `path` including filename and extension; the function will write the file accordingly. You can also use `format = "rds"`, `"sav"` (SPSS), `"dta"` (Stata), or `"sas7bdat"` (SAS) as needed.
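
The choice of format matters mostly for what survives a round trip. The base R sketch below writes a toy data frame to CSV and RDS in temporary files and reads both back. It uses only base R functions (`write.csv()`, `saveRDS()`) on hypothetical data, as an illustration of the trade-off rather than of what `export_data()` does internally.

```r
# Toy data frame with a missing value (hypothetical)
toy <- data.frame(id = 1:3, value = c(1.5, NA, 3.0))

# Temporary files, so nothing is written to the working directory
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(toy, csv_path, row.names = FALSE)  # plain text, readable by any tool
saveRDS(toy, rds_path)                       # R-specific, stores the object as-is

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

# Clean up the temporary files
unlink(c(csv_path, rds_path))

str(from_csv)
str(from_rds)
```

CSV is the most portable choice but stores only text, so column types such as factors or dates have to be re-established on import. RDS returns the R object unchanged, which makes it a better fit for intermediate results that stay within R.
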

# Further Resources

- **Package Documentation:** View the reference manual for detailed function descriptions (e.g., via `help(package = "DIVINE")` or the CRAN/GitHub repository documentation).
- **In-R Help:** Use `?DIVINE`, `?data_overview`, `?impute_missing`, etc., to access function-specific help pages.