---
title: "Getting started"
vignette: >
  %\VignetteIndexEntry{Getting started}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup}
#| include: false
show_tree <- function(dir) {
  withr::with_dir(fs::path_dir(dir), {
    fs::dir_tree(fs::path_file(dir))
  })
}
```

fastreg aims to make working with Danish registers simpler and faster by providing functionality to convert the SAS register files (`.sas7bdat`) into [Parquet](https://parquet.apache.org/) and to read the resulting Parquet files. A *register* in this context refers to a collection of related data files that belong to the same dataset, typically with yearly snapshots (e.g., `bef2020.sas7bdat`, `bef2021.sas7bdat`).

## Why Parquet?

[Parquet](https://parquet.apache.org/) is a columnar storage file format optimised for analytical workloads. Compared to [SAS](https://fileinfo.com/extension/sas7bdat) files (and row-based formats like CSV), Parquet offers:

- **Smaller file size**: Efficient compression significantly reduces disk space, especially for large datasets.
- **Faster queries**: The columnar layout speeds up analytical queries that only need a subset of columns.
- **Wide tool support**: Parquet is supported across data processing frameworks in [R](https://www.r-project.org/), [Python](https://www.python.org/), and beyond, making it easy to integrate into modern workflows.

## Setup

For the examples below, we've simulated SAS register data for two registers, `bef` and `lmdb`:

```{r prepare}
#| code-fold: true
#| code-summary: "Show setup code"
library(fastreg)

sas_dir <- fs::path_temp("sas-dir")
fs::dir_create(sas_dir)

bef_list <- simulate_register(
  "bef",
  c("", "1999", "1999_1", "2020"),
  n = 1000
)
lmdb_list <- simulate_register(
  "lmdb",
  c("2020", "2021"),
  n = 1000
)

save_as_sas(
  c(bef_list, lmdb_list),
  sas_dir
)
```

```{r setup-tree}
#| echo: false
show_tree(sas_dir)
```

## Converting a single file

Converting one file from SAS to Parquet in fastreg isn't a simple change of file extension.
We make use of Parquet's Hive partitioning to organise the output by year, which makes querying and management easier: each output Parquet file is written to a subdirectory named after the year extracted from the file name.

Use the `convert_file()` function to convert a single SAS file to a year-partitioned Parquet format:

```{r convert-file}
sas_file <- fs::path(sas_dir, "bef2020.sas7bdat")
output_file_dir <- fs::path_temp("output-file-dir")

convert_file(
  path = sas_file,
  output_dir = output_file_dir
)
```

`convert_file()` reads files in chunks (to handle larger-than-memory data), with a default chunk size of 1 million rows, extracts 4-digit years from file names for partitioning, and lowercases column names. See `?convert_file` for more details.

::: callout-note
When a SAS file contains more rows than the `chunk_size`, multiple Parquet files will be created from it. This doesn't affect how the data is loaded with `read_register()` (see [Reading a Parquet register](#reading-a-parquet-register) below); it only means you may see more Parquet files in the output than input SAS files.
:::

Even though this only converts a single file, the output is partitioned by the year extracted from the file name, as seen below:

```{r output-tree-file}
#| echo: false
show_tree(output_file_dir)
```

## Converting a register

Use `list_sas_files()` to find SAS files in a directory (and any subdirectories), then use `convert_register()` to convert them. `convert_register()` expects the files to be from the **same register**, based on their file names.

```{r locate}
bef_sas_files <- list_sas_files(sas_dir) |>
  stringr::str_subset("bef")
bef_sas_files
```

```{r convert-register}
output_register_dir <- fs::path_temp("output-register-dir")

convert_register(
  path = bef_sas_files,
  output_dir = output_register_dir
)
```

`convert_register()` uses `convert_file()` internally, so the same chunking and partitioning behaviour applies. See `?convert_file` and `?convert_register` for more details.
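To make the year-based partitioning concrete, here's a minimal sketch of how a 4-digit year could be pulled from a file name. This is illustrative base R only, not fastreg's internal implementation, which may differ:

```r
# Illustrative sketch: extract the first 4-digit run from a file name.
# fastreg's actual extraction logic may differ.
extract_year <- function(file) {
  m <- regmatches(file, regexpr("\\d{4}", file))
  if (length(m) == 0) NA_character_ else m
}

extract_year("bef2020.sas7bdat")    # "2020"
extract_year("bef1999_1.sas7bdat")  # "1999"
extract_year("bef.sas7bdat")        # NA
```

A file name with no 4-digit year yields `NA`, which is why such files end up in the default partition shown below.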
As a result, the output from `convert_register()` is also partitioned by year, extracted from the file names:

```{r output-tree}
#| echo: false
show_tree(output_register_dir)
```

The output is organised into a "bef" folder (the register name extracted from the file names) with year-based subdirectories:

- The data from the two SAS files with "1999" in their file names are located in the subfolder "year=1999".
- The data from the SAS file from 2020 are located in the subfolder "year=2020".
- One SAS file, `bef.sas7bdat`, didn't have a year in its file name. The data from this file is placed in the "year=**HIVE_DEFAULT_PARTITION**" folder, the default for files without a year in their name.

## Converting multiple registers in parallel

For many or large files, fastreg provides a [targets](https://docs.ropensci.org/targets/) pipeline template that parallelises conversion across CPU cores. By default the pipeline uses 10 workers; you can lower this in the `_targets.R` file to avoid consuming too many cores on a shared server.

To create the pipeline file, use the `use_targets_template()` function. In this example, we're outputting it to a temporary directory.

```{r use-targets}
pipeline_dir <- fs::path_temp("pipeline-dir")
fs::dir_create(pipeline_dir)
use_targets_template(path = pipeline_dir)
```

Once the `_targets.R` file is created, open it and edit the `config` section:

```{r config}
config <- list(
  input_dir = fs::path_temp("sas-dir"),
  output_dir = fs::path(pipeline_dir, "parquet-registers")
)
```

The `input_dir` is the directory that contains the SAS files (searched recursively). Unlike with `convert_register()`, this directory can contain several different registers. The `output_dir` is the directory where the Parquet files will be written.
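As a sketch of how the worker count might be lowered, the snippet below assumes a crew-based controller, which recent versions of targets commonly use for parallelism; check the generated `_targets.R` for the exact setting your template uses:

```r
# In _targets.R (assumed crew-based setup; the generated template
# may configure workers differently):
targets::tar_option_set(
  controller = crew::crew_controller_local(workers = 4)  # lowered from 10
)
```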
After you've updated the `config` section, you can run the pipeline:

```{r tar-make}
#| eval: false
targets::tar_make()
```

```{r edit-and-run-pipeline}
#| echo: false
#| eval: !expr rlang::is_installed("targets")
template_content <- readLines(fs::path(pipeline_dir, "_targets.R"))
modified_content <- template_content |>
  stringr::str_replace("/path/to/register/sas/files/directory", config$input_dir) |>
  stringr::str_replace("/path/to/output/directory", config$output_dir)

withr::with_dir(pipeline_dir, {
  writeLines(modified_content, "_targets.R")
  targets::tar_make(callr_function = NULL, reporter = "silent")
})
```

The pipeline will find all SAS files in `input_dir` and convert each file into a Parquet file, all in parallel. Re-running `tar_make()` only re-converts registers whose source files have changed (or whose targets were invalidated by edits to the pipeline itself).

Below, you can see the output of running the pipeline with the example data:

```{r output-pipeline}
#| eval: !expr rlang::is_installed("targets")
#| echo: false
show_tree(config$output_dir)
```

## Reading a Parquet register

The final function reads the converted Parquet register data into R. `read_register()` reads the data into a [DuckDB](https://duckdb.org/) table, which is a powerful way to query and process large data.

```{r read-register}
register <- read_register(output_register_dir)
register
```

You can pass a directory to read the full partitioned register, or a file path to read a single `.parquet` file. The data is read lazily, so it won't be loaded into memory until collected with e.g. `dplyr::collect()`.
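For instance, a query can be composed lazily and only materialised at the end. This sketch assumes dplyr is attached and relies on the `year` column that the Hive partitioning adds:

```r
library(dplyr)

# Filtering and counting are pushed down to DuckDB; only `collect()`
# brings the (small) result into R as a data frame.
register |>
  filter(year == 2020) |>
  count() |>
  collect()
```

Until `collect()` is called, each step only builds up a query, so even very large registers can be filtered and summarised without loading them into memory.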