---
title: "NBDCtools"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{NBDCtools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
Sys.setenv("_R_CHECK_CRAN_INCOMING_" = "true") # always build vignette like on CRAN
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
## Background
The `NBDCtools` R package makes use of the regular structure of NBDC datasets, especially
standardized metadata (data dictionary and levels table; see, e.g.,
[here](https://docs.abcdstudy.org/latest/documentation/curation/metadata.html))
and the organization of tabulated data as one file per table in the BIDS
`rawdata/phenotype/` directory
(see [here](https://docs.abcdstudy.org/latest/documentation/curation/structure.html#rawdata)
for information about the structure of the ABCD file-based data, and
[here](https://docs.hbcdstudy.org/latest/datacuration/phenotypes/)
for the HBCD study).
The package assumes that users have downloaded the complete tabulated
dataset as file-based data and saved the files in a local directory. Using
functions from the package, users can then create custom datasets by specifying
the study name and any set of variable names and/or table names in its data
dictionary.
By making use of the study’s metadata, the functions automatically
retrieve the needed columns from different files on disk, and join them to a
data frame in memory. This provides a fast, storage- and memory-efficient, and
highly reproducible way to work with data from the NBDC Data Hub that can be
used as an alternative to creating and downloading different datasets (and
creating on-disk representations for each of them) through the
[Data Exploration & Analysis Portal (DEAP)](https://nbdc.deapscience.com) or
the [NBDC Data Access Platform](https://nbdc-datashare.lassoinformatics.com).
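The idea of looking up variables in a data dictionary, reading only the
relevant files, and joining the selected columns in memory can be sketched in
base R. This is a toy illustration and not the package's implementation: the
file names, the key column `id`, the TSV format, and the helper `load_vars()`
are all assumptions made for the example.

```r
# Toy stand-in for a phenotype directory: one file per table (hypothetical names).
dir_toy <- file.path(tempdir(), "phenotype_toy")
dir.create(dir_toy, showWarnings = FALSE)

write.table(
  data.frame(id = c("a", "b", "c"), age = c(9, 10, 11), grade = c(3, 4, 5)),
  file.path(dir_toy, "table_one.tsv"),
  sep = "\t", row.names = FALSE, quote = FALSE
)
write.table(
  data.frame(id = c("a", "b"), score = c(0.5, 0.7), notes = c("x", "y")),
  file.path(dir_toy, "table_two.tsv"),
  sep = "\t", row.names = FALSE, quote = FALSE
)

# Toy "data dictionary": maps each variable to the table (file) that holds it.
dictionary <- data.frame(
  name = c("age", "grade", "score", "notes"),
  table = c("table_one", "table_one", "table_two", "table_two"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: find the files needed for the requested variables, keep
# only the requested columns (plus the key), and full-join the pieces.
load_vars <- function(vars, dir, dictionary, key = "id") {
  needed <- dictionary[dictionary$name %in% vars, ]
  pieces <- lapply(unique(needed$table), function(tbl) {
    cols <- needed$name[needed$table == tbl]
    # For simplicity the whole file is read and then subset; a column-aware
    # reader could skip the unneeded columns entirely.
    data <- read.delim(file.path(dir, paste0(tbl, ".tsv")),
                       stringsAsFactors = FALSE)
    data[, c(key, cols), drop = FALSE]
  })
  Reduce(function(x, y) merge(x, y, by = key, all = TRUE), pieces)
}

toy <- load_vars(c("age", "score"), dir_toy, dictionary)
toy
```

Only the two requested variables and the key end up in `toy`, even though the
files on disk contain more columns.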
### Download data using DEAP
To download data from the NBDC Data Hub in the format that is required by the
`NBDCtools` package, follow these steps:
1. Log in to the [DEAP](https://nbdc.deapscience.com) application and select
the `My datasets` tab.
1. On the bottom of the page, click on `Pre-assembled datasets`.
1. In the pop-up window, select the `All tables` option.
1. Click on the `Download tables` button to download the data files.
1. Unzip the downloaded file to a local directory and remember the path to this
directory, as you will need it to load the data using the `NBDCtools`
package.
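
Once the files are unzipped, it can help to store the path in a variable and
check that R can see the files before calling any `NBDCtools` function. A
minimal sketch; the path below is a placeholder, not a real location:

```r
# Hypothetical location; replace with the directory you unzipped the download into.
dir_data <- file.path("path", "to", "unzipped", "phenotype")

if (dir.exists(dir_data)) {
  # Show the first few table files found in the directory.
  print(head(list.files(dir_data)))
} else {
  message("Directory not found: ", dir_data)
}
```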
## Getting started
The most important and most frequently used function in the `NBDCtools`
package is `create_dataset()`. This omnibus function loads selected variables
from files and creates an analysis-ready data frame in one step, incorporating
various transformation and cleaning options.
In this vignette, we will demonstrate the use of the `create_dataset()` function
with simulated ABCD data files. We will illustrate how to join variables,
perform various transformations, and explore some advanced options.
## Setup
> **IMPORTANT:** Please ensure that both the `NBDCtools` and `NBDCtoolsData`
packages are installed. When `NBDCtools` is loaded, it will automatically load
the required objects from the `NBDCtoolsData` package, so you don't need to
load it separately.
To load `NBDCtools`, use the following command:
```{r setup}
library(NBDCtools)
```
Alternatively, you can call functions directly without loading the package by
using `::`, e.g., `NBDCtools::name_of_function(...)`. You can also access
`NBDCtoolsData` objects directly using the colon-colon syntax.
## Load and join data
We can use the following command to inspect the simulated data files:
```{r}
dir_abcd <- system.file("extdata", "phenotype", package = "NBDCtools")
list.files(dir_abcd)
```
Next, we will use the `create_dataset()` function to load data from the files in
`dir_abcd` with selected variables of interest.
```{r}
vars <- c(
  "ab_g_dyn__visit_type",
  "ab_g_dyn__cohort_grade",
  "ab_g_dyn__visit__day1_dt",
  "ab_g_stc__gen_pc__01",
  "ab_g_dyn__visit_age",
  "ab_g_dyn__visit_days",
  "ab_g_dyn__visit_dtt",
  "mr_y_qc__raw__dmri__r01__series_t"
)
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars
)
```
> **NOTE:** The simulated data contains only a few variables and rows. In a
real-world scenario, the dataset will contain many more tables, each with many
more variables and rows.
Users can select which variables to join using the following four arguments:
- `vars`: Individual variables of interest
- `tables`: Full tables of interest
- `vars_add`: Additional individual variables
- `tables_add`: Additional full tables
Columns of interest specified by the `vars` and `tables` arguments are
full-joined, meaning the resulting data frame retains all rows with at least one
non-missing value in the selected variables/tables. Additional columns specified
by the `vars_add` and `tables_add` arguments are left-joined to the data frame
containing the columns of interest, retaining all rows and adding columns from
the additional variables/tables. The `create_dataset()` function utilizes the
low-level function `join_tabulated()` for data joining. For more information
about the `join_tabulated()` function, refer to the [Join data](https://software.nbdc-datahub.org/NBDCtools/articles/join.html)
vignette. For a diagram detailing the joining strategy for main and additional
variables/tables, see [this
page](https://docs.deapscience.com/create_edit/create.html#joining) (the
`NBDCtools` package uses the same approach as the
[DEAP](https://nbdc.deapscience.com) application).
For example, if we only specify the `mr_y_qc__raw__dmri__r01__series_t`
variable in `vars` and move the others to `vars_add`, the resulting data frame
will have a different number of rows:
```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = c(
    "mr_y_qc__raw__dmri__r01__series_t"
  ),
  vars_add = c(
    "ab_g_dyn__visit_type",
    "ab_g_dyn__cohort_grade",
    "ab_g_dyn__visit__day1_dt",
    "ab_g_stc__gen_pc__01",
    "ab_g_dyn__visit_age",
    "ab_g_dyn__visit_days",
    "ab_g_dyn__visit_dtt"
  )
)
```
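The row counts depend on which columns are full-joined and which are
left-joined. The following sketch reproduces that behavior with toy data
frames in base R. The column names and the key columns `id` and `session` are
made up for illustration, and `merge()` only stands in for whatever
`join_tabulated()` does internally.

```r
# Toy tables keyed by participant and session (hypothetical column names).
main_a <- data.frame(id = c("p1", "p2"), session = "s1", x = c(1, 2))
main_b <- data.frame(id = c("p2", "p3"), session = "s1", y = c(10, 20))
extra  <- data.frame(id = c("p1", "p3", "p4"), session = "s1",
                     z = c("a", "b", "c"))

keys <- c("id", "session")

# Columns of interest: a full join keeps every row present in either table.
main <- merge(main_a, main_b, by = keys, all = TRUE)

# Additional columns: a left join adds `z` without adding rows, so `p4`,
# which only occurs in `extra`, does not appear in the result.
result <- merge(main, extra, by = keys, all.x = TRUE)

nrow(main)
nrow(result)
```

Both `main` and `result` have three rows (`p1`, `p2`, `p3`). Had only `main_b`
been treated as the table of interest, the result would be limited to `p2` and
`p3`.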
## Process data
After loading and joining the data, the `create_dataset()` function performs
several transformation steps. Each step is reported with an `ℹ` message in the
console, allowing users to see which actions are being taken. For example, the
output above indicates that the function has executed the following steps:
```
#> ℹ Converting categorical variables to factors.
#> ℹ Adding variable and value labels.
```
These steps utilize lower-level functions that can be used independently. The
[Transform data](https://software.nbdc-datahub.org/NBDCtools/articles/transformation.html)
vignette describes how to do so.
### Default transformations
By default, `create_dataset()` performs the following two transformation steps
(users can choose to not execute them by setting the respective arguments to
`FALSE`):
- `categ_to_factor`: Converts categorical columns to factors using
the lower-level function `transf_factor()`.
- `add_labels`: Adds variable and value labels using the lower-level
function `transf_label()`.
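
Conceptually, these two steps amount to something like the following base R
sketch. The variable, codes, and labels are invented, and storing labels in
the `label` and `labels` attributes is an assumption for illustration;
`transf_factor()` and `transf_label()` may work differently.

```r
# Toy data and levels table (hypothetical names, codes, and labels).
data <- data.frame(visit_type = c("1", "2", "1", "3"), stringsAsFactors = FALSE)
levels_table <- data.frame(
  name  = "visit_type",
  value = c("1", "2", "3"),
  label = c("On-site", "Remote", "Hybrid"),
  stringsAsFactors = FALSE
)

# Step 1: convert the coded column to a factor with the codes as levels.
lv <- levels_table[levels_table$name == "visit_type", ]
data$visit_type <- factor(data$visit_type, levels = lv$value)

# Step 2: attach a variable label and value labels as attributes.
attr(data$visit_type, "label") <- "Visit type"
attr(data$visit_type, "labels") <- setNames(lv$value, lv$label)

levels(data$visit_type)
attr(data$visit_type, "label")
```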
### Additional transformations
Users can also apply additional transformations to the data by setting the
respective arguments to `TRUE`. The following transformations are available:
- `value_to_label`: Converts categorical columns' numeric values to labels
using the lower-level function `transf_value_to_label()`.
- `value_to_na`: Converts categorical missingness/non-response codes to `NA`
using the lower-level function `transf_value_to_na()`.
- `time_to_hms`: Converts time variables to `hms` class using the lower-level
function `transf_time_to_hms()`.
Here is an example of adding these additional transformations to the
`create_dataset()` function:
```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  value_to_label = TRUE,
  value_to_na = TRUE,
  time_to_hms = TRUE
)
```
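To make the effect of `value_to_label` and `value_to_na` more concrete, here
is a base R sketch on a toy vector. The codes, labels, and the choice of
`"777"` and `"999"` as missingness codes are assumptions for the example, and
the code does not call the `transf_*()` functions. The time conversion is left
out because it depends on the `hms` class.

```r
# Toy coded column and its levels (hypothetical codes and labels).
codes  <- c("1", "2", "999", "1", "777")
values <- c("1", "2", "777", "999")
labels <- c("Yes", "No", "Refused", "Don't know")

# Value-to-label: replace the codes with their labels.
x_lab <- factor(codes, levels = values, labels = labels)

# Value-to-NA: set the missingness codes (assumed here) to NA and drop them
# from the set of levels.
missing_codes <- c("777", "999")
x_chr <- codes
x_chr[x_chr %in% missing_codes] <- NA
x_na <- factor(x_chr, levels = setdiff(values, missing_codes))

as.character(x_lab)
x_na
```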
### Shadow matrices
The `create_dataset()` function also includes the option to process shadow
matrices. Shadow matrices are tables with the same dimensions as the original
data and provide information about why a given cell is missing in the original
data. Using the `bind_shadow = TRUE` argument, users can append the shadow
matrix as additional columns to the end of the data frame.
```{r eval=FALSE}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  bind_shadow = TRUE
)
```
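For readers unfamiliar with the concept, the following base R sketch builds a
simple shadow matrix from toy data and appends it to the data frame. It only
records whether a cell is missing, not why. The `_NA` suffix and the
`!NA`/`NA` levels are modelled on the convention used by `naniar` and should
be read as assumptions, not as a description of what `create_dataset()`
returns.

```r
# Toy data with missing values.
data <- data.frame(age = c(9, NA, 11), score = c(NA, 0.7, 0.9))

# Shadow matrix: same dimensions as `data`, one indicator column per variable.
shadow <- as.data.frame(lapply(data, function(col) {
  factor(ifelse(is.na(col), "NA", "!NA"), levels = c("!NA", "NA"))
}))
names(shadow) <- paste0(names(data), "_NA")

# Append the shadow columns to the end of the data frame.
data_shadow <- cbind(data, shadow)
dim(data_shadow)
names(data_shadow)
```

The result has the original two columns followed by `age_NA` and `score_NA`.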
Please note that shadow matrices are processed differently for ABCD and HBCD
study datasets:
- **ABCD:** Currently, no raw shadow matrix data is being released. As such,
`create_dataset()` will create a shadow matrix from the data using
`naniar::as_shadow()` if `bind_shadow` is set to `TRUE`.
- **HBCD:** The shadow matrix is provided as a separate file in the
`rawdata/phenotype/` directory. The `create_dataset()` function will read it
from the file and append it to the data frame by default if `bind_shadow` is
set to `TRUE`. Users can set the additional argument `naniar_shadow = TRUE` if
they prefer the shadow matrix to be created from the data using
`naniar::as_shadow()` instead:
```{r eval=FALSE}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  bind_shadow = TRUE,
  naniar_shadow = TRUE
)
```
> **IMPORTANT:** The `naniar::as_shadow()` function requires the `naniar`
package to be installed, which is not a dependency of `NBDCtools`. If you want
to use this option, please install the `naniar` package first using
`install.packages("naniar")`.
For more information about shadow matrices, please refer to the
[Work with shadow matrices](https://software.nbdc-datahub.org/NBDCtools/articles/shadow.html)
vignette.
## Advanced options
The `create_dataset()` function calls several other low-level functions to
process the data. Some of these low-level functions have additional arguments
that can be used to customize the processing. To use these arguments, users
can pass them to the `create_dataset()` function using the `...` argument.
For example, if we set `value_to_na = TRUE`, the function will
call the lower-level `transf_value_to_na()` function, which converts
factor levels that represent missingness/non-response codes to `NA`.
This is useful when the data contains specific codes that indicate missingness,
as in the ABCD study, where `"222"`, `"333"`, `"444"`, etc. are used
consistently (see
[here](https://docs.abcdstudy.org/latest/documentation/curation/standards.html#non-responsemissingness-codes)
for more details).
The codes that are converted to `NA` can be changed by passing the
`missing_codes` argument to `create_dataset()`. For example, to convert the
levels `1` and `2` to `NA` (this is typically not advisable in a real-world
scenario), we can call the function as follows:
```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  value_to_na = TRUE,
  missing_codes = c("1", "2")
)
```
First, `create_dataset()` prints a message indicating which additional
arguments are passed to the low-level functions:
```r
#> ℹ Argument `missing_codes` is passed to `transf_value_to_na()`.
```
In the results, we can see that in column `ab_g_dyn__visit_type`, the levels `1`
and `2` are converted to `NA`, while the other values are kept as is.
Arguments that are misspelled or that do not exist in any of the lower-level
functions are ignored. For example, if we pass an additional argument `my_arg`
to the `create_dataset()` function, it will be ignored and the returned data
will be the same as if we had not passed this argument at all:
```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  value_to_na = TRUE,
  my_arg = "some_value" # this argument will be ignored
)
```
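The general pattern of forwarding matching `...` arguments and dropping
unknown ones can be sketched in base R as follows. Both `to_na()` and
`wrapper()` are hypothetical functions written for this illustration, and the
default codes are invented. The sketch shows one way such forwarding could
work, not how `create_dataset()` is implemented.

```r
# Hypothetical low-level step with an optional argument.
to_na <- function(x, missing_codes = c("777", "999")) {
  x[x %in% missing_codes] <- NA
  x
}

# Hypothetical wrapper: forwards only those `...` arguments that the low-level
# function accepts and silently drops the rest.
wrapper <- function(x, ...) {
  dots <- list(...)
  accepted <- dots[names(dots) %in% names(formals(to_na))]
  for (arg in names(accepted)) {
    message("Argument `", arg, "` is passed to `to_na()`.")
  }
  do.call(to_na, c(list(x = x), accepted))
}

x <- c("1", "2", "777")
res_custom  <- wrapper(x, missing_codes = c("1", "2")) # "1" and "2" become NA
res_ignored <- wrapper(x, my_arg = "some_value")       # `my_arg` is dropped
res_default <- wrapper(x)                              # defaults are used
```

Here `res_ignored` and `res_default` are identical because `my_arg` never
reaches `to_na()`.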
For more information about the available arguments and their usage, please
refer to the documentation of the lower-level functions on the
[Reference](https://software.nbdc-datahub.org/NBDCtools/reference/index.html)
page.