With tibblify() you can rectangle deeply nested lists
into a tidy tibble. These lists might come from an API in the form of
JSON or from scraping XML.
Let’s start with gh_users, which is a list from the
{repurrrsive} package containing information about six GitHub users.
We’ll select a subset of columns to keep the example relatively
simple.
library(tibblify)

# Keep only a handful of fields per user
gh_users_small <- purrr::map(
  repurrrsive::gh_users,
  ~ .x[c(
    "followers",
    "login",
    "url",
    "name",
    "location",
    "email",
    "public_gists"
  )]
)
names(gh_users_small[[1]])
#> [1] "followers" "login" "url" "name" "location"
#> [6] "email" "public_gists"
We can quickly rectangle gh_users_small with
tibblify().
tibblify(gh_users_small)
#> # A tibble: 6 × 7
#> followers login url name location email public_gists
#> <int> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 303 gaborcsardi https://api.github.co… Gábo… Chippen… csar… 6
#> 2 780 jennybc https://api.github.co… Jenn… Vancouv… <NA> 54
#> 3 3958 jtleek https://api.github.co… Jeff… Baltimo… <NA> 12
#> 4 115 juliasilge https://api.github.co… Juli… Salt La… <NA> 4
#> 5 213 leeper https://api.github.co… Thom… London,… <NA> 46
#> 6 34 masalmon https://api.github.co… Maël… Barcelo… <NA> 0
We can now look at the specification tibblify() used for
rectangling.
guess_tspec(gh_users_small)
#> tspec_df(
#>   tib_int("followers"),
#>   tib_chr("login"),
#>   tib_chr("url"),
#>   tib_chr("name"),
#>   tib_chr("location"),
#>   tib_chr("email"),
#>   tib_int("public_gists"),
#> )
If we are only interested in some of the fields we can easily adapt the specification.
spec <- tspec_df(
  login_name = tib_chr("login"),
  tib_chr("name"),
  tib_int("public_gists")
)
tibblify(gh_users_small, spec)
#> # A tibble: 6 × 3
#> login_name name public_gists
#> <chr> <chr> <int>
#> 1 gaborcsardi Gábor Csárdi 6
#> 2 jennybc Jennifer (Jenny) Bryan 54
#> 3 jtleek Jeff L. 12
#> 4 juliasilge Julia Silge 4
#> 5 leeper Thomas J. Leeper 46
#> 6 masalmon Maëlle Salmon 0
We refer to lists like gh_users_small as
collections, and each element of that list is an
object. Objects and collections are the typical input for
tibblify().
Basically, an object is something that can be converted to a one-row tibble. This boils down to conditions on the names of the object:
the object must have names (the names attribute must not be NULL),
every element must be named (no name can be NA or ""),
and the names must be unique.
In other words, the names must fulfill
vctrs::vec_as_names(names, repair = "check_unique"). The
name-value pairs of an object are the fields.
For example list(x = 1, y = "a") is an object with the
fields (x, 1) and (y, "a") but
list(1, z = 3) is not an object because it is not fully
named.
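The naming conditions above can be checked with a small helper. This is only a sketch: is_object_like() is a hypothetical name introduced here for illustration, not a tibblify function, and it tests nothing beyond the name conditions by calling vctrs::vec_as_names() and catching the error.

```r
# Hypothetical helper (not part of tibblify): TRUE if `x` is a list whose
# names pass vctrs::vec_as_names(repair = "check_unique").
is_object_like <- function(x) {
  nms <- names(x)
  if (!is.list(x) || is.null(nms)) {
    return(FALSE)
  }
  tryCatch(
    {
      vctrs::vec_as_names(nms, repair = "check_unique")
      TRUE
    },
    error = function(e) FALSE
  )
}

is_object_like(list(x = 1, y = "a")) # TRUE: fully and uniquely named
is_object_like(list(1, z = 3))       # FALSE: the first name is ""
is_object_like(list(x = 1, x = 2))   # FALSE: duplicated name
is_object_like(list(1, 2))           # FALSE: names attribute is NULL
```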
A collection is a list of similar objects so that the fields can become the columns in a tibble.
Providing an explicit specification has several advantages:
To create a specification for a collection, use
tspec_df(). Describe the columns of the output tibble with
the tib_*() functions. The tib_*() functions
describe the path to the field to extract and the output type of the
field. Throughout these functions, we use the concept of
ptype (prototype) as used in the {vctrs} package (see
vignette("type-size", package = "vctrs")).
There are five main types of tib_*() functions:
tib_scalar(ptype): A length-one vector with type ptype. The result is a column of that type.
tib_vector(ptype): A vector of arbitrary length with type ptype. The result is a list column, with each row containing a vector of that type.
tib_variant(): A vector of arbitrary length and type. The result is a list column, with each row containing a list. You should rarely need this function.
tib_row(...): An object with the fields .... The result is a tibble column, with each row containing the specified one-row tibble.
tib_df(...): A collection where the objects have the fields .... The result is a list column, with each row containing a tibble.
There are also two other tib_*() functions for special
cases:
tib_recursive(): A collection where objects can themselves be collections of the same structure, such as a directory tree. The result is a tibble column, with each row containing a tibble of the same structure (until NULL is reached, terminating recursion).
tib_unspecified(): An unspecified field where the type and shape are not known in advance. The unspecified argument of tibblify() controls how such fields are handled.
For convenience there are shortcuts for tib_scalar() and
tib_vector() for the most common prototypes:
logical(): tib_lgl() and tib_lgl_vec()
integer(): tib_int() and tib_int_vec()
double(): tib_dbl() and tib_dbl_vec()
character(): tib_chr() and tib_chr_vec()
Date: tib_date() and tib_date_vec()
Date encoded as character: tib_chr_date() and tib_chr_date_vec()
Scalar elements are the most common case and result in a normal vector column.
tibblify(
  list(
    list(id = 1, name = "Peter"),
    list(id = 2, name = "Lilly")
  ),
  tspec_df(
    tib_int("id"),
    tib_chr("name")
  )
)
#> # A tibble: 2 × 2
#> id name
#> <int> <chr>
#> 1 1 Peter
#> 2 2 Lilly
With tib_scalar(), you can also provide your own
prototype for column types not included in our shortcuts. For example,
let’s say you have a list with durations (objects with class
“difftime”).
x <- list(
  list(id = 1, duration = vctrs::new_duration(100)),
  list(id = 2, duration = vctrs::new_duration(200))
)
x
#> [[1]]
#> [[1]]$id
#> [1] 1
#>
#> [[1]]$duration
#> Time difference of 100 secs
#>
#>
#> [[2]]
#> [[2]]$id
#> [1] 2
#>
#> [[2]]$duration
#> Time difference of 200 secs
Use tib_scalar() with vctrs::new_duration()
to specify the duration ptype.
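A minimal sketch of such a specification follows. The prototype is passed positionally as the second argument of tib_scalar(), because the name of that argument may differ between tibblify versions; the list x is repeated so the snippet runs on its own.

```r
library(tibblify)

x <- list(
  list(id = 1, duration = vctrs::new_duration(100)),
  list(id = 2, duration = vctrs::new_duration(200))
)

# An empty duration vector serves as the prototype for the column
spec <- tspec_df(
  tib_int("id"),
  tib_scalar("duration", vctrs::new_duration())
)

out <- tibblify(x, spec)
out
class(out$duration) # the column keeps the "difftime" class
```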
If an element does not always have size one then it is a vector
element. If it still always has the same type ptype then it
produces a list column with elements of ptype.
x <- list(
  list(id = 1, children = c("Peter", "Lilly")),
  list(id = 2, children = "James"),
  list(id = 3, children = c("Emma", "Noah", "Charlotte"))
)
tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_chr_vec("children")
  )
)
#> # A tibble: 3 × 2
#> id children
#> <int> <list<chr>>
#> 1 1 [2]
#> 2 2 [1]
#> 3 3 [3]
You can use tidyr::unnest()
or tidyr::unnest_longer()
to flatten these columns to regular columns.
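As a short sketch of that flattening step, the example below rebuilds the children tibble and passes it to tidyr::unnest_longer(), which yields one row per child. The variable names out and long are chosen here for illustration.

```r
library(tibblify)

x <- list(
  list(id = 1, children = c("Peter", "Lilly")),
  list(id = 2, children = "James"),
  list(id = 3, children = c("Emma", "Noah", "Charlotte"))
)

out <- tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_chr_vec("children")
  )
)

# One row per element of the list column; `id` is repeated as needed
long <- tidyr::unnest_longer(out, children)
long
```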
In gh_repos_small, the field owner is an
object itself.
gh_repos_small <- purrr::map(
  repurrrsive::gh_repos[[1]],
  ~ .x[c("id", "name", "owner")]
)
gh_repos_small <- purrr::map(
  gh_repos_small,
  function(repo) {
    repo$owner <- repo$owner[c("login", "id", "url")]
    repo
  }
)
gh_repos_small[[1]]
#> $id
#> [1] 61160198
#>
#> $name
#> [1] "after"
#>
#> $owner
#> $owner$login
#> [1] "gaborcsardi"
#>
#> $owner$id
#> [1] 660288
#>
#> $owner$url
#> [1] "https://api.github.com/users/gaborcsardi"
The specification to extract it uses tib_row().
spec <- guess_tspec(gh_repos_small)
spec
#> tspec_df(
#>   tib_int("id"),
#>   tib_chr("name"),
#>   tib_row(
#>     "owner",
#>     tib_chr("login"),
#>     tib_int("id"),
#>     tib_chr("url"),
#>   ),
#> )
This specification results in a tibble column.
tibblify(gh_repos_small, spec)
#> # A tibble: 30 × 3
#> id name owner$login $id $url
#> <int> <chr> <chr> <int> <chr>
#> 1 61160198 after gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 2 40500181 argufy gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 3 36442442 ask gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 4 34924886 baseimports gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 5 61620661 citest gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 6 33907457 clisymbols gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 7 37236467 cmaker gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 8 67959624 cmark gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 9 63152619 conditions gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 10 24343686 crayon gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> # ℹ 20 more rows
If you don’t like the tibble column you can unpack it with
tidyr::unpack(). Alternatively, if you only want to extract
some of the fields in owner you can use a character vector
path.
spec2 <- tspec_df(
  id = tib_int("id"),
  name = tib_chr("name"),
  owner_id = tib_int(c("owner", "id")), # "id" in "owner"
  owner_login = tib_chr(c("owner", "login")) # "login" in "owner"
)
spec2
#> tspec_df(
#>   tib_int("id"),
#>   tib_chr("name"),
#>   owner_id = tib_int(c("owner", "id")),
#>   owner_login = tib_chr(c("owner", "login")),
#> )
tibblify(gh_repos_small, spec2)
#> # A tibble: 30 × 4
#> id name owner_id owner_login
#> <int> <chr> <int> <chr>
#> 1 61160198 after 660288 gaborcsardi
#> 2 40500181 argufy 660288 gaborcsardi
#> 3 36442442 ask 660288 gaborcsardi
#> 4 34924886 baseimports 660288 gaborcsardi
#> 5 61620661 citest 660288 gaborcsardi
#> 6 33907457 clisymbols 660288 gaborcsardi
#> 7 37236467 cmaker 660288 gaborcsardi
#> 8 67959624 cmark 660288 gaborcsardi
#> 9 63152619 conditions 660288 gaborcsardi
#> 10 24343686 crayon 660288 gaborcsardi
#> # ℹ 20 more rows
Objects usually have some fields that always exist and some that are
optional. By default tib_*() requires that a field
exists.
x <- list(
  list(x = 1, y = "a"),
  list(x = 2)
)
spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y")
)
tibblify(x, spec)
#> Error in `tibblify()`:
#> ! Field y is required but does not exist in `x[[2]]`.
#> ℹ Use `required = FALSE` if the field is optional.
You can mark a field as optional with the argument
.required = FALSE.
spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", .required = FALSE)
)
tibblify(x, spec)
#> # A tibble: 2 × 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 <NA>
By default a missing optional field is filled with NA. You can specify
a different value to use with the .fill argument.
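A minimal sketch of a fill value is shown below. It assumes the dotted argument spellings used in this document (.required and .fill); some tibblify versions may spell these arguments without the leading dot. The fill value "missing" is an arbitrary placeholder chosen for illustration.

```r
library(tibblify)

x <- list(
  list(x = 1, y = "a"),
  list(x = 2)
)

# Assumption: `.required` and `.fill` are spelled as in this document
spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", .required = FALSE, .fill = "missing")
)

out <- tibblify(x, spec)
out$y # the second object has no `y`, so the fill value is used
```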
To rectangle a single object you have two options:
tspec_object() which produces a list, or
tspec_row() which produces a tibble with one row.
It often makes more sense to convert such objects to a list. For example a typical API response might be something like this.
api_output <- list(
  status = "success",
  requested_at = "2021-10-26 09:17:12",
  data = list(
    list(x = 1),
    list(x = 2)
  )
)
Use tspec_row() to convert to a one-row tibble.
row_spec <- tspec_row(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)
api_output_df <- tibblify(api_output, row_spec)
api_output_df
#> # A tibble: 1 × 2
#> status data
#> <chr> <list<tibble[,1]>>
#> 1 success [2 × 1]
To access data one has to use
api_output_df$data[[1]], which is not very nice. We can use
tspec_object() to simplify this situation.
object_spec <- tspec_object(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)
api_output_list <- tibblify(api_output, object_spec)
api_output_list
#> $status
#> [1] "success"
#>
#> $data
#> # A tibble: 2 × 1
#> x
#> <int>
#> 1 1
#> 2 2
Now accessing data does not require an extra subsetting
step.
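As a brief sketch of that difference, the snippet below rebuilds the object specification from this section and reads the nested tibble directly with `$`. It repeats api_output and object_spec so that it runs on its own.

```r
library(tibblify)

api_output <- list(
  status = "success",
  requested_at = "2021-10-26 09:17:12",
  data = list(
    list(x = 1),
    list(x = 2)
  )
)

object_spec <- tspec_object(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)

api_output_list <- tibblify(api_output, object_spec)

# `data` is already a tibble, so no [[1]] is needed
api_output_list$data
api_output_list$data$x
```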