---
title: "Download and convert mobility datasets"
vignette: >
  %\VignetteIndexEntry{Download and convert mobility datasets}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
number-sections: true
format:
  html:
    toc: true
    toc-depth: 2
    code-overflow: wrap
execute: 
  eval: false
---
## Introduction {#intro}
**TL;DR (too long, didn't read): For analysing more than 1 week of data, use `spod_convert()` to convert the data into `DuckDB` and `spod_connect()` to connect to it for analysis using `{dplyr}`. Skip to the [section about it](#duckdb).**
The main focus of this vignette is to show how to get long periods of origin-destination data for analysis. First, we describe and compare the two ways to get the mobility data using origin-destination data as an example. The package functions and overall approaches are the same for working with other types of data available through the package, such as the number of trips, overnight stays and any other data. Then we show how to get a few days of origin-destination data with `spod_get()`. Finally, we show how to download and convert multiple weeks, months or even years of origin-destination data into analysis-ready formats. See description of datasets in the [Codebook and cookbook for v1 (2020-2021) Spanish mobility data](v1-2020-2021-mitma-data-codebook.html) and in the [Codebook and cookbook for v2 (2022 onwards) Spanish mobility data](v2-2022-onwards-mitma-data-codebook.html).
## Two ways to get the data
There are two main ways to import the datasets:
1. as an in-memory object with `spod_get()`;
2. as a connection to DuckDB or Parquet files on disk with `spod_convert()` + `spod_connect()`. The latter is recommended for large datasets (more than 1 week), as it is much faster and more memory efficient, as we demonstarte below.
`spod_get()` returns objects that are only appropriate for small datasets representing a few days of the national origin-destination flows. We recommend converting the data into analysis-ready formats (`DuckDB` or `Parquet`) using `spod_convert()` + `spod_connect()`. This will allow you to work with much longer time periods (months and years) on a consumer laptop (with 8-16 GB of memory). See the section below for more details.
## Analysing large datasets {#analysing-large-datasets}
The mobility datasets available through `{spanishiddata}` are very large. Particularly the origin-destination data, which contains millions of rows. These data sets may not fit into the memory of your computer, especially if you plan to run the analysis over multiple days, weeks, months, or even years.
To work with these datasets, we highly recommend using `DuckDB` and `Parquet`. These are systems for efficiently processing larger-than-memory datasets, while being user-firendly by presenting the data in a familiar `data.frame`/`tibble` object (almost). For a great intoroduction to both, we recommend materials by Danielle Navarro, Jonathan Keane, and Stephanie Hazlitt: [website](https://arrow-user2022.netlify.app/){target="_blank"}, [slides](https://arrow-user2022.netlify.app/slides){target="_blank"}, and [the video tutorial](https://www.youtube.com/watch?v=YZMuFavEgA4){target="_blank"}. You can also find examples of aggregating origin-destination data for flows analysis and visualisation in our vignettes on [static](https://rOpenSpain.github.io/spanishoddata/articles/flowmaps-static.html) and [interactive](https://rOpenSpain.github.io/spanishoddata/articles/flowmaps-interactive.html) flows visualisation.
Learning to use `DuckDB` and `Parquet` is easy for anyone who have ever worked with `{dplyr}` functions such as `select()`, `filter()`, `mutate()`, `group_by()`, `summarise()`, etc. However, since there is some learning curve to master these new tools, we provide some helper functions for novices to get started and easily open the datasets from `DuckDB` and `Parquet`. Please read the relevant sections below, where we first show how to convert the data, and then how to use it.
### How to choose between DuckDB, Parquet, and CSV {#duckdb-vs-parquet-csv}
The main considerations to make when choosing between `DuckDB` and `Parquet` (that you can get with `spod_convert()` + `spod_connect()`), as well as `CSV.gz` (that you can get with `spod_get()`) are analysis speed, convenience of data analysis, and the specific approach you prefer when getting the data. We discuss all three below.
#### Analysis Speed {#speed-comparison}
The data format you choose may dramatically impact the speed of analysis (e.g. filtering by dates, calculating number of trips per hour, per week, per month, per origin-destination pair, and any other data aggregation or manipulation).
In our tests (see @fig-csv-duckdb-parquet-speed), we found that conducting an analysis using `DuckDB` database provided a significant speed advantage over using `Parquet` and, more importantly, raw `CSV.gz` files. Specifically, when comparing a query to determine the mean hourly trips over 18 months for each zone pair, we observed that using `DuckDB` database was **up to 5 times** faster than using `Parquet` files and **up to 8 times** faster than using `CSV.gz` files.
{#fig-csv-duckdb-parquet-speed}
You can see the query we used for measuring the speed here
For reference, here is a simple query we used for speed comparison in @fig-csv-duckdb-parquet-speed:
```r
# data represents either CSV files acquired from `spod_get()`, a `DuckDB` database or a folder of Parquet files connected with `spod_connect()`
data |>
  group_by(id_origin, id_destination, hour) |> 
  summarise(mean_hourly_trips = mean(n_trips, na.rm = TRUE),
    .groups = "drop")
```