DIVINE


1 Introduction

The DIVINE package offers a collection of curated datasets along with convenient data management functions. These sample datasets (e.g., patient demographics, symptoms, treatments, and outcomes) are included for demonstration purposes, enabling users to practice and test various data manipulation workflows. Using DIVINE functions, you can inspect, clean, merge, summarize, visualize, and export data efficiently. In this vignette, we illustrate common tasks and functions provided by DIVINE to help streamline your data analysis process.



2 Installation and Loading

To install DIVINE, use the standard CRAN or GitHub approach:

install.packages("DIVINE")
# install.packages("devtools")
devtools::install_github("bruigtp/DIVINE")

pak::pak("bruigtp/DIVINE") # Alternative

Once installed, load the package:

library(DIVINE)

This makes all datasets and functions available in your R session.



3 Available Datasets

Use the data() function to list available sample datasets. For example:

data(package = "DIVINE")

The DIVINE package includes the following datasets (each represents a data frame):

To obtain more information for a dataset:

?demographic

Load any dataset into your R environment using data("dataset_name"). For example, to load and preview the demographic dataset:

data("demographic")
head(demographic)
#> # A tibble: 6 × 8
#>   record_id covid_wave center     sex      age smoker   alcohol residence_center
#>       <int> <fct>      <fct>      <fct>  <dbl> <fct>    <fct>   <fct>           
#> 1         1 Wave 2     Hospital C Female    84 No       No      No              
#> 2         2 Wave 2     Hospital A Female    33 No       No      No              
#> 3         3 Wave 3     Hospital A Male      64 Ex-smok… Yes     No              
#> 4         4 Wave 3     Hospital A Female    65 Ex-smok… No      No              
#> 5         5 Wave 2     Hospital C Male      49 No       No      No              
#> 6         6 Wave 1     Hospital C Female    40 No       No      No



4 Workflow Examples

The following examples demonstrate a typical data management workflow using DIVINE functions. Each section shows how to use a function step-by-step with example output.


4.1 1. Inspecting Data with data_overview()

Start by understanding your dataset’s shape, variable types, and missingness with data_overview(). The function returns a named list containing the dataset dimensions, variable types, missing-value counts, and a small preview of the data (by default the first 6 rows).

# Overview of your data frame
ov <- data_overview(demographic)

# Print the entire overview
ov
#> $dimensions
#> [1] 5813    8
#> 
#> $variable_types
#>        record_id       covid_wave           center              sex 
#>        "integer"         "factor"         "factor"         "factor" 
#>              age           smoker          alcohol residence_center 
#>        "numeric"         "factor"         "factor"         "factor" 
#> 
#> $missing_values
#>        record_id       covid_wave           center              sex 
#>                0                0                0                0 
#>              age           smoker          alcohol residence_center 
#>                0                0                0                0 
#> 
#> $preview
#> # A tibble: 6 × 8
#>   record_id covid_wave center     sex      age smoker   alcohol residence_center
#>       <int> <fct>      <fct>      <fct>  <dbl> <fct>    <fct>   <fct>           
#> 1         1 Wave 2     Hospital C Female    84 No       No      No              
#> 2         2 Wave 2     Hospital A Female    33 No       No      No              
#> 3         3 Wave 3     Hospital A Male      64 Ex-smok… Yes     No              
#> 4         4 Wave 3     Hospital A Female    65 Ex-smok… No      No              
#> 5         5 Wave 2     Hospital C Male      49 No       No      No              
#> 6         6 Wave 1     Hospital C Female    40 No       No      No

You can also access each component individually:

# Each of the elements
ov$dimensions      # number of rows and columns
#> [1] 5813    8
ov$variable_types  # data types of each variable
#>        record_id       covid_wave           center              sex 
#>        "integer"         "factor"         "factor"         "factor" 
#>              age           smoker          alcohol residence_center 
#>        "numeric"         "factor"         "factor"         "factor"
ov$missing_values  # count of missing values per column
#>        record_id       covid_wave           center              sex 
#>                0                0                0                0 
#>              age           smoker          alcohol residence_center 
#>                0                0                0                0
ov$preview         # a small preview of the data
#> # A tibble: 6 × 8
#>   record_id covid_wave center     sex      age smoker   alcohol residence_center
#>       <int> <fct>      <fct>      <fct>  <dbl> <fct>    <fct>   <fct>           
#> 1         1 Wave 2     Hospital C Female    84 No       No      No              
#> 2         2 Wave 2     Hospital A Female    33 No       No      No              
#> 3         3 Wave 3     Hospital A Male      64 Ex-smok… Yes     No              
#> 4         4 Wave 3     Hospital A Female    65 Ex-smok… No      No              
#> 5         5 Wave 2     Hospital C Male      49 No       No      No              
#> 6         6 Wave 1     Hospital C Female    40 No       No      No

This helps you quickly assess the dataset before any processing.


4.2 2. Handling Missing Values with impute_missing()

The impute_missing() function lets you replace missing values using a specific strategy. You provide a named list of formulas (<selector> ~ <strategy>), where <selector> can be any tidyselect expression (e.g. a column name, starts_with(), or where(is.numeric)) and <strategy> one of the following strategies:

You can also drop rows that are entirely NA by setting all_na_rm = TRUE.

# 1) Default: replace all numeric NAs with column means
cleaned_default <- impute_missing(icu)

# 2) Single column strategies:
#    - Mean for vent_mec_start_days
#    - Zero for icu_enter_days
cleaned_mix <- impute_missing(
  icu,
  method = list(
    vent_mec_start_days ~ "mean",
    icu_enter_days ~ 0
  )
)

# 3) Multiple columns at once:
#    - Medians for any column ending in "_days"
cleaned_days_median <- impute_missing(
  icu,
  method = list(starts_with(".*_days$") ~ "median")
)

# 4) Factor/character imputation:
#    - Fill gender with its most common level
#    - Fill status with "Unknown"
cleaned_char <- impute_missing(
  icu,
  method = list(
    covid_wave ~ "mode",
    icu ~ "Unknown"
  )
)

# 5) Drop all-NA rows first, then impute numeric means
cleaned_no_empty <- impute_missing(
  icu,
  method    = list(where(is.numeric) ~ "mean"),
  drop_all_na = TRUE
)
# ▶ message: Removed X rows where all values were NA

After running impute_missing(), the returned dataset will have missing values replaced in the columns you specified according to your chosen strategies. The overall dataset structure (column names and types) is preserved where possible; only the number of rows will change if you set drop_all_na = TRUE.


4.3 3. Merging Multiple Tables with multi_join()

When working with related tables, multi_join() combines several data frames into a single table by a common key. Choose the join behavior with join_type = "left", "inner", "right", or "full" to control which rows are kept. For example, suppose you have demographic, vital signs, and scores tables all sharing the default key variables of the package (record_id, covid_wave, center):

data("vital_signs")
data("scores")

joined <- multi_join(
  list(demographic, vital_signs, scores),
  key = c("record_id", "covid_wave", "center"),
  join_type = "left"
)

This creates one combined data frame joined that includes all rows from demographic and matches information from vital_signs and scores by record_id. Use join_type to decide whether unmatched rows from one or more tables should be retained.


4.4 4. Creating Summary Tables with stats_table()

Use stats_table() to generate summary tables (leveraging the gtsummary package) for one or more variables, optionally stratified by a grouping variable. Use the statistic_type argument to select the summary:

# Mean (SD) by group (e.g., by gender or cohort)
tbl1 <- stats_table(
  demographic,
  vars = c("age", "smoker", "alcohol"),
  by = "sex",
  statistic_type = "mean_sd",
  pvalue = TRUE
)

# Median [Q1; Q3] for all observations (no grouping)
tbl2 <- stats_table(
  demographic,
  statistic_type = "median_iqr"
)

# Both mean (SD) and median [IQR] combined
tbl3 <- stats_table(
  demographic,
  statistic_type = "both"
)

Each tbl object is a gtsummary-style table that you can print, refine or export to reports. Set pvalue = TRUE to add p-values for group comparisons, and consult the function documentation for options to format labels and missing-value displays.


4.5 5. Visualizing Data with multi_plot()

Create common plot types quickly with multi_plot(). It supports histograms, density plots, boxplots, barplots, and spider (radar) charts. For instance:

# Histogram of age
multi_plot(
  demographic,
  x_var = "age",
  plot_type = "histogram",
  fill_color = "skyblue",
  title = "Distribution of Age"
)

Different types of plots


# Boxplot of age by sex
multi_plot(
  demographic,
  x_var = "sex",
  y_var = "age",
  plot_type = "boxplot",
  group = "sex",
  title = "Age by Sex"
)

Different types of plots


# Spider plot of numeric variables (e.g., compare age, weight, height distributions)
multi_plot(
  comorbidities,
  x_var = "hypertension",
  y_var = "dyslipidemia",
  plot_type = "spider",
  z_var = c("depression", "mild_kidney_disease", "ceiling_dico"),
  radar_vlabels = stringr::str_to_sentence(c("hypertension", "dyslipidemia", "depression", "mild_kidney_disease", "ceiling_dico")),
  radar_color = "blue",
  radar_ref_lev = "Yes"
)

Different types of plots

Each call generates a plot (using ggplot2 under the hood). Customize titles, colors, and variables as needed for your data.


4.6 6. Exporting Data with export_data()

Finally, use export_data() to save your processed data to disk in various formats. Supported formats include CSV, XLSX, RDS, SPSS, Stata, and SAS:

# Export cleaned data to CSV
export_data(cleaned_default, format = "csv", path = "cleaned_demographic.csv")

# Export joined data to Excel
export_data(joined, format = "xlsx", path = "joined_data.xlsx")

Specify the path including filename and extension; the function will write the file accordingly. You can also use format = "rds", "sav" (SPSS), "dta" (Stata), or "sas7bdat" (SAS) as needed.



5 Further Resources