---
title: "Getting Started with vectra"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with vectra}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

vectra is a columnar query engine for R. It stores data in a custom binary format (`.vtr`), reads it lazily, and processes it batch-by-batch through a C11 execution engine. The user-facing API looks like dplyr: you chain `filter()`, `mutate()`, `group_by()`, `summarise()`, and the rest. But nothing actually runs until you call `collect()`. That call pulls row groups through a tree of plan nodes, each consuming and producing fixed-size batches. Memory use stays roughly constant regardless of how large the file is.

The execution model is pull-based. When `collect()` fires, the root node calls `next_batch()` on its child, which calls its own child, and so on down to the scan node that reads from disk. Each node transforms one batch and passes the result up to its parent. Filtering uses selection vectors rather than copying rows, so a filter over a billion-row file touches only the matching indices. Sorting spills to temporary files and merges via a k-way min-heap when data exceeds 1 GB. Aggregation uses hash tables that grow proportionally to the number of groups, not the number of input rows. The engine knows four column types: 64-bit integers, doubles, booleans, and variable-length strings, each with a separate validity bitmap for NA tracking.

The question of when to reach for vectra over base R, dplyr, data.table, or arrow comes down to working set size and workflow. If your data fits comfortably in RAM and you prefer in-memory semantics, dplyr and data.table are mature, fast, and well-supported.
Arrow's `open_dataset()` handles partitioned Parquet and multi-file scans well and interoperates with the broader Arrow ecosystem (Spark, DuckDB, Flight). vectra fills a different niche: dplyr syntax on single-file storage with memory-bounded execution and no heavy compiled dependencies. The entire engine compiles from C11 source with no external library dependencies — compression support (Zstandard and deflate) is vendored in the package itself. There is no C++ compiler requirement, no Boost, no protobuf. A 20 GB `.vtr` file processes on a laptop with 8 GB of RAM because the scan node reads one row group at a time and the downstream pipeline never holds more than two batches in memory simultaneously.

The on-disk format uses dictionary encoding for low-cardinality string columns, delta encoding for monotonic integer sequences, and Zstandard compression at the byte level. File sizes are competitive with Parquet for typical tabular workloads. Zone-map statistics (per-row-group min/max) are stored in the header, enabling the scan node to skip irrelevant row groups before decompressing any column data. Optional hash indexes stored as `.vtri` sidecar files accelerate equality predicates to O(1) row group lookup.

This vignette is a taste test. Each section introduces a feature area with runnable examples, then points to the relevant deep-dive vignette for full coverage. All code uses tempfiles and built-in datasets, so you can run every chunk on any machine with vectra installed.

## Writing and reading data

vectra uses a binary columnar format with the `.vtr` extension. A `.vtr` file contains a header (format version, column schema, and row group count), a row group index for fast seeking, and then the actual column data per row group. Each column within a row group is stored as a typed array with a validity bitmap for NA values. The current format version (v4) applies a two-stage encoding stack per column per row group.
The first stage is a logical encoding: DICTIONARY for string columns where fewer than 50% of values are unique (the dictionary itself is stored once, and index runs are RLE-compressed), DELTA for int64 columns with monotonically increasing values (storing the first value plus deltas), or PLAIN for everything else. The second stage is byte-level compression: Zstandard when the compressed output is smaller than the input and the raw data exceeds 64 bytes, or NONE otherwise. The file reader handles v1 through v4 transparently; the writer always produces v4.

Write any data.frame to `.vtr` with `write_vtr()`. Open it lazily with `tbl()`, which returns a `vectra_node` object. No data is read at this point. The `tbl()` call reads only the file header to learn the schema and row group layout.

```{r write-read}
library(vectra)

f <- tempfile(fileext = ".vtr")
write_vtr(mtcars, f)

node <- tbl(f)
node
```

The print method shows the schema and file path but does not scan any data. Call `collect()` to pull all batches through the plan and materialize the result as a data.frame.

```{r collect}
tbl(f) |> collect() |> head()
```

For large datasets, `write_vtr()` accepts a `batch_size` parameter that controls how many rows go into each row group. Smaller row groups give finer-grained zone-map pruning (the scan node skips row groups whose min/max statistics exclude your filter predicate) but add per-group overhead. The default, one row group per data.frame, works well for most in-memory writes. For streaming writes from a pipeline, the upstream batch size is preserved.

```{r write-batch-size}
f_batched <- tempfile(fileext = ".vtr")
write_vtr(mtcars, f_batched, batch_size = 10)
tbl(f_batched) |> collect() |> nrow()
```

The row count is the same. The difference is internal layout: 4 row groups of at most 10 rows each instead of 1 row group of 32 rows. With smaller row groups, zone-map pruning becomes more selective.
If `mpg` ranges from 10.4 to 33.9 across the full dataset but only from 10.4 to 15.2 within one particular row group, a `filter(mpg > 20)` can skip that row group entirely without decompressing any column data. The tradeoff is overhead: each row group carries its own encoding metadata and compression frame, so very small row groups (under a few hundred rows) waste space on headers. For most analytical workloads, row groups between 10,000 and 100,000 rows strike a good balance.

## Filtering and selecting

`filter()` keeps rows matching one or more conditions. Comma-separated conditions inside a single `filter()` call are combined with AND. Use `|` for OR within a single expression. You can also chain multiple `filter()` calls, which stacks FilterNodes in the plan. The engine evaluates them sequentially, one batch at a time, so the second filter only sees rows that passed the first.

```{r filter-and}
tbl(f) |>
  filter(cyl == 6, mpg > 19) |>
  select(mpg, cyl, hp, wt) |>
  collect()
```

The C engine evaluates the predicate batch-by-batch and sets a selection vector on each `VecBatch`. The selection vector is an integer array of indices that passed the predicate. Downstream nodes (project, sort, aggregation) read only those indices, so no column data is copied or compacted at the filter stage. This zero-copy approach means filtering a 50 GB file with a 1% hit rate allocates memory proportional to 1% of one batch, not 1% of the entire file.

NA values in comparisons follow SQL semantics: `NA == 6` evaluates to NA, not FALSE. Rows where the predicate evaluates to NA are excluded from the selection vector, the same as rows that evaluate to FALSE. To explicitly test for NA, use `is.na()` in the predicate.

OR conditions and parenthesized expressions work as expected.

```{r filter-or}
tbl(f) |>
  filter(cyl == 4 | cyl == 8) |>
  select(mpg, cyl) |>
  collect() |>
  head()
```

For set membership, `%in%` is accelerated with a hash set when the right-hand side has more than a handful of values.
On a column with 10 million rows and a set of 500 target values, the hash-based `%in%` is roughly 10x faster than repeated OR comparisons.

```{r filter-in}
tbl(f) |>
  filter(cyl %in% c(4, 6)) |>
  select(mpg, cyl) |>
  collect() |>
  head()
```

`select()` picks columns by name or by tidyselect helper. The full tidyselect vocabulary works: `starts_with()`, `ends_with()`, `contains()`, `matches()` (regex), `where()` (type predicate), `everything()`, `last_col()`, and negation via `-` or `!`. Column renaming inside `select()` also works (`select(miles = mpg)` renames on the fly). Under the hood, `select()` inserts a ProjectNode that prunes columns from the batch before they reach the next verb, so columns you drop are never decompressed from disk in subsequent row groups.

```{r select-helpers}
tbl(f) |>
  select(starts_with("d"), mpg) |>
  collect() |>
  head()
```

Negation drops columns.

```{r select-negate}
tbl(f) |>
  select(-am, -vs, -gear, -carb) |>
  collect() |>
  head()
```

`explain()` prints the query plan tree. We can see how `filter()` and `select()` map to plan nodes, and where predicate pushdown applies.

```{r explain-filter}
tbl(f) |>
  filter(cyl > 4) |>
  select(mpg, cyl, hp) |>
  explain()
```

The ProjectNode prunes columns before the data reaches R. The FilterNode sits below it, and the ScanNode reads from disk. When a hash index exists on the filter column, the scan node probes the `.vtri` sidecar index and builds a bitmap of row groups that could contain matching rows. Row groups not in the bitmap are skipped entirely. This composes with zone-map pruning: even without an explicit index, the scan node compares the filter predicate against per-row-group min/max statistics and skips row groups that cannot match. Between these two mechanisms, a point query on an indexed column over a 10 GB file with 1000 row groups might read only 1 row group.

## Transforming columns

`mutate()` adds or replaces columns.
All expression evaluation happens in the C engine: the R layer captures the expression as an abstract syntax tree (via `substitute()`), serializes it into a nested list structure, and passes it across the `.Call()` bridge. The C side walks the expression tree and evaluates it element-wise over the batch. Supported operators include arithmetic (`+`, `-`, `*`, `/`, `%%`), comparison (`==`, `!=`, `<`, `<=`, `>`, `>=`), boolean (`&`, `|`, `!`), and unary negation. There is no R-level evaluation per row, which is why `mutate()` on 100 million rows runs at C speed regardless of expression complexity.

```{r mutate-arith}
tbl(f) |>
  mutate(kpl = mpg * 0.425144, hp_per_wt = hp / wt) |>
  select(mpg, kpl, hp, wt, hp_per_wt) |>
  collect() |>
  head()
```

Math functions are evaluated natively in C: `abs()`, `sqrt()`, `log()`, `log2()`, `log10()`, `exp()`, `floor()`, `ceiling()`, `round()`, `sign()`, `trunc()`. For element-wise binary min/max, `pmin()` and `pmax()` take two arguments (columns or literals) and return the smaller or larger of each pair. These compose with arithmetic to build clamping expressions.

```{r mutate-math}
tbl(f) |>
  mutate(
    log_hp = log(hp),
    hp_floor = floor(hp / 10) * 10,
    bounded = pmin(pmax(mpg, 15), 25)
  ) |>
  select(hp, log_hp, hp_floor, mpg, bounded) |>
  collect() |>
  head()
```

`transmute()` works like `mutate()` but drops all columns except the new ones.

```{r transmute}
tbl(f) |>
  transmute(
    efficiency = mpg / wt,
    power_ratio = hp / disp
  ) |>
  collect() |>
  head()
```

Type casting converts between vectra's four internal column types. `as.numeric()` and `as.double()` cast to double, `as.integer()` casts to int64, `as.character()` casts to string, and `as.logical()` casts to bool. Casting follows R semantics: `as.integer()` on a double truncates toward zero, `as.numeric()` on a string parses the number or returns NA, and `as.character()` on a numeric formats it as a decimal string.
These casts happen per-element in C and respect the validity bitmap, so NA values stay NA after casting.

```{r mutate-cast}
tbl(f) |>
  mutate(cyl_str = as.character(cyl)) |>
  select(cyl, cyl_str) |>
  collect() |>
  head(3)
```

`is.na()` tests for missing values by checking the validity bitmap directly, without touching the data buffer. `if_else()` and `case_when()` handle conditional logic. Unlike base R's `ifelse()`, `if_else()` is type-strict: both branches must have the same type, and the result preserves NA from the condition (matching dplyr semantics). `case_when()` evaluates formulas top to bottom and assigns the value from the first matching condition. `coalesce()` returns the first non-NA value from an arbitrary number of columns or literals. `between()` is shorthand for `x >= left & x <= right`, compiled to a single AND expression internally.

```{r mutate-control}
tbl(f) |>
  mutate(
    size = case_when(
      cyl == 4 ~ "small",
      cyl == 6 ~ "medium",
      cyl == 8 ~ "large"
    ),
    mpg_class = if_else(mpg > 20, "high", "low"),
    in_range = between(hp, 100, 200)
  ) |>
  select(cyl, size, mpg, mpg_class, hp, in_range) |>
  collect() |>
  head()
```

`case_when()` also accepts a `.default` argument for the fallback value when no condition matches.

`coalesce()` works with both column references and literal values.

```{r mutate-coalesce}
df_na <- data.frame(
  a = c(NA, 2, NA, 4),
  b = c(10, NA, NA, 40),
  stringsAsFactors = FALSE
)
f_na <- tempfile(fileext = ".vtr")
write_vtr(df_na, f_na)

tbl(f_na) |>
  mutate(filled = coalesce(a, b, 0)) |>
  collect()
```

## String operations

vectra evaluates string functions in C using a two-pass strategy. The first pass walks the input strings and computes the total output byte count. The second pass allocates a single contiguous buffer of that size and fills it.
This avoids the per-row `malloc`/`realloc` overhead that would otherwise dominate runtime on columns with millions of short strings. For regex operations (`grepl()` with `fixed = FALSE`, `gsub()`, `sub()`, `str_extract()`), the engine uses POSIX regex compiled once per batch (or once per thread under OpenMP), supporting full backreference syntax (`\\1`, `\\2`, etc.) and capture groups.

We will use a small dataset with character columns for the examples. The patterns here are representative of what you would do on larger string columns; the C engine processes them identically regardless of batch size.

```{r string-data}
people <- data.frame(
  name = c(" Alice ", "Bob", "Charlie Brown", "Diana"),
  city = c("Amsterdam", "Berlin", "Chicago", "Dublin"),
  email = c("alice@example.com", "bob@test.org",
            "charlie.b@work.net", "diana@example.com"),
  stringsAsFactors = FALSE
)
fs <- tempfile(fileext = ".vtr")
write_vtr(people, fs)
```

`nchar()` returns the number of characters in each string (not bytes; the engine is UTF-8 aware for multi-byte characters). `substr()` extracts a substring by 1-indexed start and stop positions, matching base R semantics. `trimws()` strips leading and trailing whitespace (spaces, tabs, newlines).

```{r string-basic}
tbl(fs) |>
  mutate(
    name_trimmed = trimws(name),
    name_len = nchar(trimws(name)),
    city_prefix = substr(city, 1, 3)
  ) |>
  select(name_trimmed, name_len, city_prefix) |>
  collect()
```

`toupper()` and `tolower()` convert case. `startsWith()` and `endsWith()` test prefixes and suffixes, returning a boolean column.

```{r string-case}
tbl(fs) |>
  mutate(
    city_upper = toupper(city),
    is_example = endsWith(email, "example.com"),
    starts_a = startsWith(city, "A")
  ) |>
  select(city_upper, email, is_example, starts_a) |>
  collect()
```

`grepl()` tests for pattern matches. The default is fixed matching (literal substring search), which is the fastest path and uses a simple `strstr()` internally.
Pass `fixed = FALSE` for POSIX regex, which compiles the pattern once per batch and applies it element-wise. On columns with more than 1000 rows, regex `grepl()` is OpenMP-parallelized with per-thread regex compilation for thread safety.

```{r string-grepl}
tbl(fs) |>
  mutate(has_at = grepl("@example", email)) |>
  select(email, has_at) |>
  collect()
```

`gsub()` and `sub()` perform string replacement. `gsub()` replaces all occurrences of the pattern; `sub()` replaces only the first. Like `grepl()`, the default is fixed matching. Pass `fixed = FALSE` for regex with full backreference support. In regex mode, the replacement string can reference capture groups with `\\1`, `\\2`, and so on, following the POSIX extended regex specification.

```{r string-gsub}
tbl(fs) |>
  mutate(domain = gsub(".*@", "", email, fixed = FALSE)) |>
  select(email, domain) |>
  collect()
```

`str_extract()` pulls out the first match of a regex pattern and returns it as a string. When there is no match, the result is NA. If the pattern contains a capture group (parentheses), `str_extract()` returns the captured portion rather than the full match, which makes it possible to extract structured substrings like domains from URLs or species epithets from binomial names.

```{r string-extract}
tbl(fs) |>
  mutate(user = str_extract(email, "^[^@]+")) |>
  select(email, user) |>
  collect()
```

`paste()` and `paste0()` concatenate columns and literals. Both are N-ary: they accept any number of arguments, mixing column references and string literals freely. `paste()` takes a `sep` argument (default `" "`) that is inserted between each pair of arguments; `paste0()` concatenates with no separator. Internally, these use the two-pass allocation strategy: the first pass sums the output lengths across all arguments, allocates once, and the second pass writes the concatenated strings into the pre-sized buffer.
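
In miniature, the two-pass strategy looks like this. The C sketch below is purely illustrative (`concat_all` is an invented name, not an engine symbol): pass one measures the output, pass two fills a single allocation.

```c
#include <stdlib.h>
#include <string.h>

/* Two-pass concatenation of n strings with a separator, one malloc total.
 * Pass 1 sums the output bytes; pass 2 writes into the pre-sized buffer. */
char *concat_all(const char **vals, size_t n, const char *sep) {
    size_t seplen = strlen(sep), total = 1;          /* 1 for the NUL */
    for (size_t i = 0; i < n; i++) {
        total += strlen(vals[i]);
        if (i + 1 < n) total += seplen;
    }
    char *out = malloc(total);
    char *p = out;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(vals[i]);
        memcpy(p, vals[i], len);
        p += len;
        if (i + 1 < n) { memcpy(p, sep, seplen); p += seplen; }
    }
    *p = '\0';
    return out;
}
```

One allocation per output keeps the allocator out of the hot loop, which is the whole point of the measure-then-fill design.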

```{r string-paste}
tbl(fs) |>
  mutate(
    greeting = paste0("Hello, ", trimws(name), "!"),
    label = paste(trimws(name), city, sep = " - ")
  ) |>
  select(greeting, label) |>
  collect()
```

For a complete reference of string operations, regex patterns, and performance characteristics on large columns, see `vignette("string-ops")`.

## Aggregation

`group_by()` marks columns as grouping keys. It is metadata-only: no plan node is created, no data is scanned, and the grouping information simply attaches to the `vectra_node` as an R-level attribute. The C engine only learns about the groups when `summarise()` (or equivalently `summarize()`) is called, at which point a GroupAggNode is created in the plan. This node builds a hash table keyed on the group columns and accumulates aggregation results in a single pass through the input batches. Each batch's rows are hashed, looked up in the hash table, and their values are folded into the running aggregation state. Once all batches are consumed, the hash table entries are flushed as the output.

vectra supports the following aggregation functions, all evaluated in C: `n()`, `sum()`, `mean()`, `min()`, `max()`, `sd()`, `var()`, `first()`, `last()`, `any()`, `all()`, `median()`, and `n_distinct()`. All except `n()` take a column name as their first argument. All except `n()` and `n_distinct()` accept `na.rm = TRUE` to skip NA values. Without `na.rm`, any NA in a group poisons the result: `sum()` returns NA, `mean()` returns NA, and so on. This matches base R semantics. With `na.rm = TRUE`, the behavior also matches R: `sum()` on all-NA returns 0, `mean()` on all-NA returns NaN, and `min()`/`max()` on all-NA return Inf/-Inf with a warning.

```{r summarise-basic}
tbl(f) |>
  group_by(cyl) |>
  summarise(
    count = n(),
    avg_mpg = mean(mpg),
    total_hp = sum(hp),
    best_mpg = max(mpg)
  ) |>
  collect()
```

The result has one row per group. Groups are not guaranteed to be in any particular order unless you add an `arrange()` afterward.
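
A toy version of that accumulate step, assuming a fixed-size open-addressing table and integer group keys (all names and sizes are invented for illustration, not engine code):

```c
#define NSLOTS 64   /* toy fixed-size table; the real one grows dynamically */

typedef struct {
    int used;
    long key;      /* group key, e.g. a cyl value */
    double sum;    /* running state for sum()/mean() */
    long count;    /* running state for n() */
} Slot;

/* Fold one batch of (key, value) pairs into the running per-group state. */
void accumulate(Slot *tab, const long *keys, const double *vals, int n) {
    for (int i = 0; i < n; i++) {
        unsigned h = (unsigned)(keys[i] * 2654435761u) % NSLOTS;
        while (tab[h].used && tab[h].key != keys[i])
            h = (h + 1) % NSLOTS;              /* linear probing */
        tab[h].used = 1;
        tab[h].key = keys[i];
        tab[h].sum += vals[i];
        tab[h].count++;
    }
}
```

After the last batch, flushing the used slots yields one output row per group, with `mean` derived as `sum / count`.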
`sd()` and `var()` compute sample standard deviation and variance using the Welford one-pass algorithm (denominator `n - 1`, matching R's `sd()` and `var()`). `first()` and `last()` return the first and last non-NA values encountered in scan order within each group. `any()` and `all()` reduce boolean (or boolean-coercible) columns, returning TRUE/FALSE per group. These are useful for flagging groups that contain at least one positive observation or checking whether all rows in a group satisfy a condition.

```{r summarise-advanced}
tbl(f) |>
  group_by(cyl) |>
  summarise(
    mpg_sd = sd(mpg),
    mpg_var = var(mpg),
    first_hp = first(hp),
    last_hp = last(hp)
  ) |>
  collect()
```

`median()` uses a per-group dynamic array and quicksort internally. `n_distinct()` uses a per-group open-addressing hash set with FNV-1a hashing at a 70% load factor.

```{r summarise-median}
tbl(f) |>
  group_by(cyl) |>
  summarise(
    med_mpg = median(mpg),
    unique_gears = n_distinct(gear)
  ) |>
  collect()
```

For quick counts, `count()` and `tally()` are convenient shorthands. `count()` takes grouping columns inline; `tally()` uses existing groups.

```{r count}
tbl(f) |> count(cyl, sort = TRUE) |> collect()
```

```{r tally}
tbl(f) |> group_by(gear) |> tally() |> collect()
```

`across()` applies the same function or set of functions to multiple columns at once. It works inside both `summarise()` and `mutate()`, accepting tidyselect column specifications (including helpers like `where(is.numeric)`, `starts_with()`, etc.) and functions as either a single function, a formula (`~ .x + 1`), or a named list of functions. The `.names` argument controls output column naming with `{.col}` and `{.fn}` glue-style placeholders, so you get predictable names like `mpg_avg` and `hp_total` without manual repetition. Under the hood, `across()` is expanded at the R level into individual aggregation expressions before the plan reaches C, so there is no runtime overhead compared to writing each expression by hand.

```{r across-summarise}
tbl(f) |>
  group_by(cyl) |>
  summarise(across(c(mpg, hp, wt), mean)) |>
  collect()
```

With multiple functions and a naming pattern:

```{r across-multi}
tbl(f) |>
  group_by(cyl) |>
  summarise(across(
    c(mpg, hp),
    list(avg = mean, total = sum),
    .names = "{.col}_{.fn}"
  )) |>
  collect()
```

`ungroup()` removes grouping metadata from a node. This is occasionally needed when you want to apply a second `summarise()` with different groups, or when you want window functions to operate on the full dataset rather than per-group.

```{r ungroup}
tbl(f) |>
  group_by(cyl, gear) |>
  summarise(n = n(), .groups = "keep") |>
  ungroup() |>
  arrange(desc(n)) |>
  collect()
```

The `.groups` argument to `summarise()` controls what happens to grouping after aggregation. `"drop_last"` (the default) peels off the last grouping column, `"drop"` removes all groups, and `"keep"` preserves them.

## Sorting and slicing

`arrange()` sorts rows. Wrap column names in `desc()` for descending order. Sort is stable: rows with equal sort keys preserve their original order.

```{r arrange}
tbl(f) |>
  select(mpg, cyl, hp) |>
  arrange(cyl, desc(mpg)) |>
  collect() |>
  head(8)
```

Under the hood, `arrange()` uses an external merge sort with a 1 GB memory budget. When data exceeds that limit, sorted runs spill to temporary `.vtr` files and merge via a k-way min-heap. NAs sort last in ascending order and first in descending order.

`slice_head()` and `slice_tail()` return the first or last `n` rows. `slice_head()` is streaming (it creates a LimitNode that stops pulling after `n` rows). `slice_tail()` must materialize all rows to know the total count.

```{r slice-head}
tbl(f) |> slice_head(n = 5) |> collect()
```

`slice_min()` and `slice_max()` use an optimized top-N algorithm (a bounded heap) that avoids a full sort. This matters when you want the 10 smallest values from a 100-million-row dataset: the heap holds at most 10 entries at any point.
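
The bounded-heap idea can be sketched in a few lines of C. This is illustrative only (`top_n_smallest` is an invented name, and the real node also tracks row indices and handles ties): to keep the k smallest values, maintain a max-heap of size k so one comparison against the root decides whether an incoming value can be discarded.

```c
/* Restore the max-heap property downward from index i. */
void sift_down(double *heap, int k, int i) {
    for (;;) {
        int l = 2 * i + 1, r = l + 1, big = i;
        if (l < k && heap[l] > heap[big]) big = l;
        if (r < k && heap[r] > heap[big]) big = r;
        if (big == i) return;
        double tmp = heap[i]; heap[i] = heap[big]; heap[big] = tmp;
        i = big;
    }
}

/* Scan n values, keeping the k smallest in heap[]. Returns how many
 * were kept (min of k and n). heap[0] ends up as the largest kept value. */
int top_n_smallest(const double *x, int n, double *heap, int k) {
    int size = 0;
    for (int i = 0; i < n; i++) {
        if (size < k) {
            heap[size++] = x[i];
            if (size == k)                     /* heapify once full */
                for (int j = k / 2 - 1; j >= 0; j--) sift_down(heap, k, j);
        } else if (x[i] < heap[0]) {
            heap[0] = x[i];                    /* evict the current max */
            sift_down(heap, k, 0);
        }
    }
    return size;
}
```

Memory stays O(k) no matter how many rows stream past, which is why `slice_min(n = 10)` on 100 million rows never materializes the input.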

```{r slice-min}
tbl(f) |>
  select(mpg, cyl, hp) |>
  slice_min(order_by = mpg, n = 3) |>
  collect()
```

By default, `slice_min()` and `slice_max()` include ties. The result may contain more than `n` rows if multiple rows share the boundary value. Use `with_ties = FALSE` for exactly `n` rows.

```{r slice-no-ties}
tbl(f) |>
  select(mpg, cyl) |>
  slice_min(order_by = cyl, n = 3, with_ties = FALSE) |>
  collect()
```

```{r slice-max}
tbl(f) |>
  select(mpg, cyl, hp) |>
  slice_max(order_by = hp, n = 4, with_ties = FALSE) |>
  collect()
```

## Joins

vectra implements hash joins using a build-right, probe-left strategy. The entire right-side table is consumed and stored in a hash table keyed on the join columns. Then left-side batches stream through one at a time, probing the hash table for matches. Memory cost is proportional to the number of rows in the right-side table, so the standard practice is to put the smaller table on the right. For a left table with 500 million rows and a right table with 50,000 rows, the hash table holds 50,000 entries and the left side streams through in constant memory.

Seven join types are available. `left_join()` keeps all left rows and fills unmatched right columns with NA. `inner_join()` keeps only rows with matches on both sides. `right_join()` keeps all right rows. `full_join()` keeps everything from both sides, filling NA where there is no match. `semi_join()` keeps left rows that have at least one match on the right, without adding any right-side columns. `anti_join()` keeps left rows that have no match. `cross_join()` produces the Cartesian product (every left row paired with every right row). NA keys never match in any join type, following SQL NULL semantics.

```{r join-setup}
cyl_info <- data.frame(
  cyl = c(4, 6, 8),
  engine_type = c("inline", "v-type", "v-type"),
  stringsAsFactors = FALSE
)
f_cyl <- tempfile(fileext = ".vtr")
write_vtr(cyl_info, f_cyl)
```

`left_join()` keeps all rows from the left table.
Unmatched right-side columns are filled with NA.

```{r left-join}
tbl(f) |>
  select(mpg, cyl, hp) |>
  left_join(tbl(f_cyl), by = "cyl") |>
  collect() |>
  head()
```

`inner_join()` keeps only rows with matching keys on both sides. `semi_join()` keeps left-side rows that have a match on the right but does not add any right-side columns. `anti_join()` keeps left-side rows that have no match.

```{r semi-anti}
tbl(f) |>
  select(mpg, cyl) |>
  anti_join(
    tbl(f_cyl) |> filter(engine_type == "v-type"),
    by = "cyl"
  ) |>
  collect() |>
  head()
```

That returns only the 4-cylinder cars, since both the 6- and 8-cylinder entries are in the "v-type" subset.

For joins on differently named columns, use a named vector in `by`.

```{r join-named}
ratings <- data.frame(
  cylinders = c(4, 6, 8),
  rating = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
f_rat <- tempfile(fileext = ".vtr")
write_vtr(ratings, f_rat)

tbl(f) |>
  select(mpg, cyl) |>
  inner_join(tbl(f_rat), by = c("cyl" = "cylinders")) |>
  collect() |>
  head()
```

The `suffix` argument (default `c(".x", ".y")`) controls how non-key columns with the same name on both sides are disambiguated. For example, if both tables have a column called `value`, the output will contain `value.x` from the left and `value.y` from the right. Multi-column keys work by passing a character vector to `by`, either unnamed (same column names on both sides) or named (mapping left names to right names).

Key types are auto-coerced following the `bool < int64 < double` promotion hierarchy. If the left key is int64 and the right key is double, the int64 values are promoted to double before hashing. Joining a string column against a numeric column is an error and fails immediately with a descriptive message.

vectra also supports `fuzzy_join()` for approximate string matching between tables. It computes string distances between key columns and keeps pairs within a normalized distance threshold.
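
Before turning to the fuzzy variant, the exact-match build/probe cycle described at the start of this section can be sketched in C. This is a toy with a fixed-size table and unique integer keys (names are invented; the engine's table sizes dynamically and handles duplicate keys and all four column types):

```c
#define TABSZ 16   /* toy capacity; a real build side sizes to the input */

typedef struct { int used; long key; int row; } Entry;

/* Build phase: insert every right-side row, keyed on the join column. */
void build(Entry *tab, const long *rkeys, int rn) {
    for (int i = 0; i < rn; i++) {
        unsigned h = (unsigned)rkeys[i] % TABSZ;
        while (tab[h].used) h = (h + 1) % TABSZ;   /* linear probing */
        tab[h].used = 1;
        tab[h].key = rkeys[i];
        tab[h].row = i;
    }
}

/* Probe phase: return the matching right row for a left key, or -1.
 * A left join keeps the unmatched row and fills right columns with NA;
 * an inner join drops it. */
int probe(const Entry *tab, long key) {
    unsigned h = (unsigned)key % TABSZ;
    while (tab[h].used) {
        if (tab[h].key == key) return tab[h].row;
        h = (h + 1) % TABSZ;
    }
    return -1;
}
```

The build side is fully materialized once; each left batch then streams through `probe()` without growing memory, which is why the smaller table belongs on the right.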

```{r fuzzy-join}
ref_species <- data.frame(
  canonical = c("Quercus robur", "Quercus petraea", "Fagus sylvatica"),
  code = c("QR", "QP", "FS"),
  stringsAsFactors = FALSE
)
query_species <- data.frame(
  name = c("Quercus robur", "Qurecus petraea", "Fagus sylvatca"),
  stringsAsFactors = FALSE
)
f_ref <- tempfile(fileext = ".vtr")
f_query <- tempfile(fileext = ".vtr")
write_vtr(ref_species, f_ref)
write_vtr(query_species, f_query)

tbl(f_query) |>
  fuzzy_join(
    tbl(f_ref),
    by = c("name" = "canonical"),
    method = "dl",
    max_dist = 0.15
  ) |>
  collect()
```

The `method` argument accepts `"dl"` (Damerau-Levenshtein), `"levenshtein"`, or `"jw"` (Jaro-Winkler). The `block_by` parameter restricts comparisons to rows sharing an exact-match value on a second column, which can reduce runtime by orders of magnitude on large reference tables. For full documentation, see `vignette("joins")`.

## Window functions

Window functions compute values row-by-row in the context of the full dataset or a per-group partition. Unlike aggregation, which collapses groups to single rows, window functions preserve the original row count. They are detected inside `mutate()` by function name and automatically routed to a dedicated WindowNode in the execution plan. The WindowNode is a materializing operation: it consumes all input rows (per partition) before producing output, because functions like `rank()` need to see all values before assigning ranks.

vectra supports 12 window functions organized in three families. Ranking functions: `row_number()` (no argument, sequential 1..n), `rank()` and `dense_rank()` (take a column argument, with `rank()` leaving gaps on ties and `dense_rank()` not), `ntile(n)` (divides rows into `n` roughly equal buckets), `percent_rank()` and `cume_dist()` (take a column argument, return normalized rank as a double between 0 and 1). Offset functions: `lag()` and `lead()`. Cumulative functions: `cumsum()`, `cummean()`, `cummin()`, `cummax()`.
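
As a reference for the tie semantics, here is a deliberately naive O(n²) C sketch (illustrative only, not the WindowNode's implementation) of how `rank()` and `dense_rank()` assign values within one partition:

```c
/* rank  = 1 + number of values strictly smaller (ties share a rank,
 *         and a gap follows the tied run);
 * dense = 1 + number of *distinct* values strictly smaller (no gaps). */
void rank_all(const double *x, int n, int *rank, int *dense) {
    for (int i = 0; i < n; i++) {
        int smaller = 0, distinct_smaller = 0;
        for (int j = 0; j < n; j++) {
            if (x[j] < x[i]) {
                smaller++;
                int seen = 0;              /* count each distinct value once */
                for (int k = 0; k < j; k++)
                    if (x[k] == x[j]) { seen = 1; break; }
                if (!seen) distinct_smaller++;
            }
        }
        rank[i] = 1 + smaller;
        dense[i] = 1 + distinct_smaller;
    }
}
```

For `{21.0, 21.0, 22.8, 18.7}` this gives ranks `{2, 2, 4, 1}` (a gap after the tie) and dense ranks `{2, 2, 3, 1}` (no gap). A production version sorts once instead of comparing all pairs.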

```{r window-rank}
tbl(f) |>
  select(mpg, cyl, hp) |>
  slice_head(n = 8) |>
  mutate(
    rn = row_number(),
    mpg_rank = rank(mpg),
    mpg_dense = dense_rank(mpg)
  ) |>
  collect()
```

`lag()` and `lead()` shift a column forward or backward by `n` positions (default 1). `lag(mpg)` gives each row the `mpg` value from the previous row; `lead(mpg)` gives the value from the next row. The `default` argument specifies the fill value when the offset falls outside the partition boundary (the first row for `lag`, the last row for `lead`). Without `default`, these positions are NA.

```{r window-lag-lead}
tbl(f) |>
  select(mpg, hp) |>
  slice_head(n = 6) |>
  mutate(
    prev_mpg = lag(mpg),
    next_mpg = lead(mpg),
    prev2_hp = lag(hp, n = 2, default = 0)
  ) |>
  collect()
```

Cumulative functions: `cumsum()`, `cummean()`, `cummin()`, `cummax()`.

```{r window-cum}
tbl(f) |>
  select(mpg, hp) |>
  slice_head(n = 6) |>
  mutate(
    running_hp = cumsum(hp),
    running_avg = cummean(mpg),
    running_min = cummin(mpg)
  ) |>
  collect()
```

Window functions respect grouping. When a node is grouped, partitions are computed independently.

```{r window-grouped}
tbl(f) |>
  select(mpg, cyl) |>
  group_by(cyl) |>
  mutate(rn = row_number(), pct = percent_rank(mpg)) |>
  slice_head(n = 10) |>
  collect()
```

Each `cyl` group gets its own `row_number()` sequence starting from 1 and its own `percent_rank()` distribution. The grouped window node partitions the data by the group keys, materializes each partition independently, computes the window functions within the partition, and then concatenates the results.

If a `mutate()` call mixes window functions with ordinary column expressions (e.g., `mutate(rn = row_number(), double_mpg = mpg * 2)`), the R layer splits them: window functions go to a WindowNode, ordinary expressions go to a ProjectNode stacked on top. The split is invisible to the caller.

## Dates and times

Date and POSIXct columns roundtrip through vectra.
Internally, dates are stored as doubles: days since the Unix epoch (1970-01-01) for Date objects, seconds since the epoch for POSIXct. The R class attribute is preserved in the `.vtr` schema metadata, so when you `collect()`, the output columns have the correct R class restored.

Component extraction functions work in `mutate()` and `filter()`: `year()`, `month()`, `day()` for date parts, and `hour()`, `minute()`, `second()` for time parts on POSIXct columns. These are evaluated in C by computing the Gregorian calendar decomposition from the epoch offset, so there is no R-level overhead per row.

```{r date-data}
events <- data.frame(
  event_date = as.Date(c("2020-03-15", "2020-07-01",
                         "2021-01-15", "2021-06-30")),
  event_time = as.POSIXct(c("2020-03-15 09:30:00", "2020-07-01 14:00:00",
                            "2021-01-15 08:15:00", "2021-06-30 17:45:00"),
                          tz = "UTC"),
  value = c(10, 20, 30, 40)
)
fd <- tempfile(fileext = ".vtr")
write_vtr(events, fd)
```

Extract date components and aggregate by year.

```{r date-extract}
tbl(fd) |>
  mutate(
    yr = year(event_date),
    mo = month(event_date),
    dy = day(event_date)
  ) |>
  group_by(yr) |>
  summarise(total = sum(value)) |>
  collect()
```

Extract time components from POSIXct.

```{r time-extract}
tbl(fd) |>
  mutate(
    hr = hour(event_time),
    mn = minute(event_time)
  ) |>
  select(event_time, hr, mn) |>
  collect()
```

Filter by date using `as.Date()` to convert a string literal to a date value at parse time. The C engine receives the numeric date value and compares directly.

```{r date-filter}
tbl(fd) |>
  filter(event_date >= as.Date("2021-01-01")) |>
  collect()
```

Date arithmetic works through the numeric representation. Adding 30 to a Date column advances it by 30 days. Subtracting two date columns gives a difference in days.
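
The Gregorian decomposition can be done with pure integer arithmetic. The sketch below follows Howard Hinnant's well-known civil-from-days algorithm; it is illustrative of the technique, not necessarily the engine's exact code:

```c
/* Days since 1970-01-01 -> (year, month, day). One shift to an
 * era-aligned count, then closed-form arithmetic; no loops, no tables. */
void civil_from_days(long days, int *y, int *m, int *d) {
    long z = days + 719468;               /* rebase so day 0 is 0000-03-01 */
    long era = (z >= 0 ? z : z - 146096) / 146097;   /* 400-year cycles */
    long doe = z - era * 146097;                          /* [0, 146096] */
    long yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
    long doy = doe - (365 * yoe + yoe / 4 - yoe / 100);   /* [0, 365]   */
    long mp  = (5 * doy + 2) / 153;       /* month, counted from March   */
    *d = (int)(doy - (153 * mp + 2) / 5 + 1);             /* [1, 31]    */
    *m = (int)(mp < 10 ? mp + 3 : mp - 9);                /* [1, 12]    */
    *y = (int)(yoe + era * 400 + (*m <= 2));
}
```

Because the whole computation is a handful of divisions, extracting `year()`, `month()`, and `day()` stays branch-light and vectorizes well over a batch.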
You can combine date extraction with aggregation: `group_by(yr = year(event_date)) |> summarise(total = sum(value))` works as expected because `mutate()` expressions are allowed inline inside `group_by()` through the standard dplyr convention of named expressions. ```{r date-arith} tbl(fd) |> mutate(plus_30 = event_date + 30) |> select(event_date, plus_30) |> collect() ``` The `plus_30` column is numeric (days since epoch). To interpret it as a Date in R after collecting, wrap it with `as.Date(x, origin = "1970-01-01")`. ## String similarity vectra computes edit distances directly in the C engine, with OpenMP parallelization for columns exceeding 1000 rows. Three distance algorithms are available. `levenshtein()` counts insertions, deletions, and substitutions. `dl_dist()` (Damerau-Levenshtein) additionally counts transpositions of adjacent characters as a single edit, which better captures common typos like "hte" for "the". `jaro_winkler()` returns a similarity score from 0 to 1, where 1 means identical. It weights early-character matches more heavily, making it well-suited for short strings like personal names. ```{r similarity-data} species <- data.frame( name = c("Quercus robur", "Quercus rubra", "Fagus sylvatica", "Acer platanoides", "Quercus petraea"), stringsAsFactors = FALSE ) fs2 <- tempfile(fileext = ".vtr") write_vtr(species, fs2) ``` Compute all three metrics against a reference string and filter by edit distance. ```{r similarity-metrics} tbl(fs2) |> mutate( lev = levenshtein(name, "Quercus robur"), dl = dl_dist(name, "Quercus robur"), jw = jaro_winkler(name, "Quercus robur") ) |> filter(lev <= 5) |> arrange(lev) |> collect() ``` Normalized variants `levenshtein_norm()` and `dl_dist_norm()` divide the raw distance by the length of the longer string, returning a value between 0 and 1. This makes thresholds transferable across strings of different lengths. 
```{r similarity-norm}
tbl(fs2) |>
  mutate(
    lev_norm = levenshtein_norm(name, "Quercus robur"),
    dl_norm = dl_dist_norm(name, "Quercus robur")
  ) |>
  collect()
```

Damerau-Levenshtein catches transposition errors that pure Levenshtein scores as two edits.

```{r dl-transposition}
tbl(fs2) |>
  mutate(
    lev = levenshtein(name, "Qurecus robur"),
    dl = dl_dist(name, "Qurecus robur")
  ) |>
  collect()
```

The "er" / "re" swap in "Qurecus" counts as 2 edits for Levenshtein (two substitutions) but 1 for Damerau-Levenshtein (a single transposition).

For full coverage of distance algorithms, threshold tuning, and blocking strategies, see `vignette("string-ops")`.

## Tree traversal

`resolve()` and `propagate()` are scalar self-join functions designed for hierarchical (parent-child) data stored in a single table. They operate within a batch by matching a foreign key column to a primary key column in the same batch, which means the entire hierarchy must fit within one row group (or the data must be collected first). These functions are particularly useful for taxonomic backbones, organizational charts, and bill-of-materials structures where each row has a `parent_id` pointing to another row's `id`.

`resolve(fk, pk, value)` returns the `value` column from the row where `pk == fk`. It is a single-level lookup, equivalent to a self left-join on `(fk, pk)`, but without the overhead of materializing a hash table. The lookup is O(n) per batch (linear scan), so it works well for hierarchies with up to a few million rows.

```{r resolve}
taxa <- data.frame(
  id = c(1L, 2L, 3L, 4L),
  name = c("Fagaceae", "Quercus", "Q. robur", "Q. petraea"),
  parent_id = c(NA, 1L, 2L, 2L),
  stringsAsFactors = FALSE
)
ft <- tempfile(fileext = ".vtr")
write_vtr(taxa, ft)

tbl(ft) |>
  mutate(parent_name = resolve(parent_id, id, name)) |>
  collect()
```

Row 1 (Fagaceae) has `parent_id = NA`, so `parent_name` is NA. Row 3 (Q. robur) has `parent_id = 2`, which matches `id = 2` (Quercus), so `parent_name` is "Quercus".
`propagate(parent_fk, pk, seed)` walks the tree upward from each row and fills every node with the nearest non-NA ancestor value of `seed`. Unlike `resolve()`, which does a single-level lookup, `propagate()` traverses the full ancestor chain until it finds a row where the seed expression is non-NA. This denormalizes a hierarchy in a single pass, which would otherwise require a recursive CTE in SQL or repeated self-joins in dplyr. ```{r propagate} tbl(ft) |> mutate(family = propagate( parent_id, id, if_else(is.na(parent_id), name, NA_character_) )) |> collect() ``` The seed expression assigns "Fagaceae" to the root (the row with no parent) and NA everywhere else. `propagate()` then walks each row's ancestry until it finds "Fagaceae" and fills it in. All four rows now carry the family name. ## Format backends vectra reads and writes several file formats beyond `.vtr`. Each backend produces a `vectra_node` that plugs into the same verb pipeline, so you can `filter()`, `mutate()`, `group_by()`, and `summarise()` on CSV files, SQLite databases, Excel workbooks, and GeoTIFF rasters using identical syntax. The key advantage is streaming: format conversion pipelines (e.g., CSV to `.vtr`) process data batch-by-batch without loading the entire file into memory. `tbl_csv()` scans a CSV file lazily. Column types (int64, double, bool, string) are inferred from the first 1000 rows by probing each field for numeric, boolean, and date patterns. Gzip-compressed files (`.csv.gz`) are detected automatically by file extension and decompressed on the fly. The `batch_size` parameter controls how many rows are read per batch (default 65,536). ```{r csv-roundtrip} csv_in <- tempfile(fileext = ".csv") write.csv(mtcars, csv_in, row.names = FALSE) tbl_csv(csv_in) |> filter(cyl == 6) |> select(mpg, cyl, hp) |> collect() ``` `tbl_sqlite()` scans a SQLite table. Column types are derived from the CREATE TABLE schema. All filtering and aggregation happens in vectra's C engine, not in SQLite. 
```{r sqlite-roundtrip} db <- tempfile(fileext = ".sqlite") f_src <- tempfile(fileext = ".vtr") write_vtr(mtcars, f_src) tbl(f_src) |> write_sqlite(db, "cars") tbl_sqlite(db, "cars") |> filter(mpg > 25) |> collect() ``` `tbl_xlsx()` reads an Excel sheet via the openxlsx2 package (a Suggests dependency, not required for other backends). The sheet is read into memory as a data.frame, then converted to a vectra node. This is not a streaming operation (the full sheet is loaded), but it makes it possible to chain vectra verbs on Excel data without writing intermediate files. The `sheet` argument accepts a name or 1-based integer index. `tbl_tiff()` reads GeoTIFF rasters. Each pixel becomes a row with `x`, `y`, and `band1`, `band2`, ... columns. Coordinates are pixel centers derived from the affine geotransform stored in the TIFF metadata. NoData values become NA. This enables spatial queries like `filter(band1 > 25, x >= xmin, x <= xmax)` on rasters that are too large to load as matrices. Results can be converted back to a raster with `terra::rast(df, type = "xyz")`. The corresponding `write_tiff()` function reconstructs the grid from x/y columns and writes a GeoTIFF with optional DEFLATE compression. Streaming format conversion is a common pattern. The entire pipeline runs batch-by-batch, so converting a 20 GB CSV file to `.vtr` stays within a few hundred MB of memory. Write functions (`write_vtr()`, `write_csv()`, `write_sqlite()`, `write_tiff()`) all accept `vectra_node` inputs and consume batches from the upstream pipeline, writing each batch to the output format as it arrives. 
```{r format-conversion}
csv_file <- tempfile(fileext = ".csv")
vtr_file <- tempfile(fileext = ".vtr")
csv_out <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_file, row.names = FALSE)

tbl_csv(csv_file) |> write_vtr(vtr_file)
tbl(vtr_file) |> filter(cyl == 6) |> write_csv(csv_out)

read.csv(csv_out) |> head()
```

For format-specific options, batch sizes, and performance considerations, see `vignette("formats")`.

## Indexes

Hash indexes accelerate equality predicates on `.vtr` files. An index is stored as a `.vtri` sidecar file alongside the `.vtr` file (named `<file>.<column>.vtri` for single-column indexes). It maps key hashes to row group indices using FNV-1a hashing, enabling O(1) row group identification for `filter(col == value)` and `filter(col %in% values)`.

When `tbl()` opens a `.vtr` file, it automatically detects and loads any `.vtri` sidecar files present in the same directory. The scan node then consults the index before reading data, skipping row groups that cannot contain matching keys. This composes with zone-map pruning: the index narrows down candidate row groups, and zone-map statistics further eliminate row groups where the column's min/max range excludes the filter value.

Create an index with `create_index()`. Check for an existing index with `has_index()`. Indexes persist on disk and survive R sessions.

```{r index-create}
f_idx <- tempfile(fileext = ".vtr")
write_vtr(
  data.frame(id = letters, val = 1:26, stringsAsFactors = FALSE),
  f_idx,
  batch_size = 5
)
has_index(f_idx, "id") # FALSE
create_index(f_idx, "id")
has_index(f_idx, "id") # TRUE
```

With the 26 rows split into six row groups at `batch_size = 5` (five full groups plus one holding a single row), a filter on `id` now skips row groups that cannot contain the target value.

```{r index-query}
tbl(f_idx) |> filter(id == "m") |> collect()
```

Composite indexes cover AND-combined equality predicates on multiple columns. Pass a character vector to `create_index()`.
```{r index-composite} f_comp <- tempfile(fileext = ".vtr") write_vtr( data.frame( region = rep(c("north", "south"), each = 13), id = letters, val = 1:26, stringsAsFactors = FALSE ), f_comp, batch_size = 5 ) create_index(f_comp, c("region", "id")) tbl(f_comp) |> filter(region == "north", id == "c") |> collect() ``` The `ci` argument to `create_index()` builds a case-insensitive index by lowercasing all key values before hashing. This is useful for string columns with inconsistent casing (e.g., species names entered by different data providers). When a case-insensitive index is present, `filter(col == "Quercus")` will match "quercus", "QUERCUS", and any other casing variant. Index pushdown composes with binary search on sorted columns and zone-map pruning. A well-indexed `.vtr` file with many small row groups can answer point queries over hundreds of millions of rows by reading a single row group. `explain()` shows whether an index was used and how many row groups were pruned. For details on index internals, composite index hash combining, and tuning batch size for optimal index selectivity, see `vignette("indexing")`. ## Incremental operations Three functions support incremental workflows on `.vtr` files without full rewrites. These are designed for append-heavy pipelines where new data arrives in batches (daily sensor readings, transaction logs, biodiversity occurrence records) and rewriting the entire file on each update would be prohibitively slow. `append_vtr()` adds new rows as additional row groups at the end of an existing file. The schema of the new data must match the existing file exactly: same column names, same types, same order. Existing row groups are untouched; the function writes the new row groups after the last existing one and patches the header to update the row group count. This is not fully atomic: if the process is interrupted between writing the row groups and patching the header, the file may be corrupted. 
For write-once workloads where atomicity matters, use `write_vtr()` (which writes to a temp file and renames).

```{r append}
fa <- tempfile(fileext = ".vtr")
write_vtr(mtcars[1:16, ], fa)
append_vtr(mtcars[17:32, ], fa)
tbl(fa) |> collect() |> nrow()
```

`delete_vtr()` marks rows as deleted by writing a tombstone sidecar file (`.del`). The `.vtr` file itself is never modified. The next `tbl()` call automatically excludes tombstoned rows. Tombstones are cumulative across multiple `delete_vtr()` calls.

```{r delete}
delete_vtr(fa, c(0, 1, 2)) # 0-based row indices
tbl(fa) |> collect() |> nrow()
unlink(c(fa, paste0(fa, ".del")))
```

`diff_vtr()` computes a key-based logical diff between two snapshots of the same dataset. It returns a list with `$deleted` (a vector of key values whose rows are present in the old file but missing or changed in the new) and `$added` (a lazy `vectra_node` of rows present in the new file but absent or changed in the old). This is a logical diff over rows keyed on the key column, not a binary file diff. Rows with the same key but changed values appear in both `$deleted` and `$added`, which makes it possible to detect updates by intersecting the two sets.

The implementation streams both files: pass 1 builds a hash set over the old file, pass 2 streams all columns from the new file and writes added rows to a temporary `.vtr` file. The temporary file is cleaned up automatically when the returned node is garbage-collected.

```{r diff}
fd1 <- tempfile(fileext = ".vtr")
fd2 <- tempfile(fileext = ".vtr")
old <- data.frame(id = 1:5, val = letters[1:5], stringsAsFactors = FALSE)
new <- data.frame(id = c(3L, 4L, 5L, 6L, 7L),
                  val = c("C", "d", "e", "f", "g"),
                  stringsAsFactors = FALSE)
write_vtr(old, fd1)
write_vtr(new, fd2)

d <- diff_vtr(fd1, fd2, "id")
d$deleted
collect(d$added)
unlink(c(fd1, fd2))
```

Keys 1 and 2 were deleted. Keys 6 and 7 were added. Key 3 appears with a changed value ("c" to "C"), which shows up in both deleted and added.
For incremental ETL patterns and large-data workflows, see `vignette("large-data")`. ## Materialized blocks A `vectra_node` is single-use: once `collect()` or `write_vtr()` consumes it, the C-level external pointer is invalidated and the plan cannot be reused. This is a consequence of the pull-based execution model, where each node's `next_batch()` advances internal state that cannot be rewound. For repeated lookups against the same dataset (e.g., matching a stream of queries against a reference table), `materialize()` pulls all batches and stores the result as a persistent columnar block in memory. The block lives until it is garbage-collected or the R session ends. ```{r block-materialize} blk_data <- data.frame( taxonID = c("T1", "T2", "T3", "T4", "T5"), name = c("Quercus robur", "Pinus sylvestris", "Fagus sylvatica", "Acer campestre", "Betula pendula"), stringsAsFactors = FALSE ) f_blk <- tempfile(fileext = ".vtr") write_vtr(blk_data, f_blk) blk <- materialize(tbl(f_blk)) blk ``` `block_lookup()` performs a hash-based exact lookup on a string column of the materialized block. The hash index is built lazily on the first call to `block_lookup()` for a given column and cached inside the block for all subsequent calls. This means the first lookup pays a one-time O(n) indexing cost and all subsequent lookups are O(k) where k is the number of query keys. The `ci` argument enables case-insensitive matching by lowercasing keys before hashing. ```{r block-lookup} block_lookup(blk, "name", c("Quercus robur", "Betula pendula")) ``` `block_fuzzy_lookup()` computes string distances between query keys and a column in the block, returning all pairs within the `max_dist` threshold. It supports Damerau-Levenshtein (`"dl"`), Levenshtein (`"levenshtein"`), and Jaro-Winkler (`"jw"`). 
The `block_col` and `block_keys` arguments enable exact-match blocking: only rows where the blocking column matches the corresponding blocking key are compared, which can reduce the number of distance computations by orders of magnitude. For example, blocking on genus before fuzzy-matching species names restricts comparisons to rows within the same genus. The distance computation is OpenMP-parallelized via the `n_threads` argument. ```{r block-fuzzy} block_fuzzy_lookup( blk, "name", c("Qurecus robur", "Pinus silvestris"), method = "dl", max_dist = 0.2 ) ``` The `query_idx` column (1-based) maps each result row back to the input query vector. The `fuzzy_dist` column gives the normalized distance for each match. ## Inspecting the plan Every vectra pipeline builds a lazy execution plan as a tree of nodes. `explain()` prints this tree before any data moves, which makes it the primary debugging tool for understanding what the engine will do when you call `collect()`. The output shows node types (ScanNode, FilterNode, ProjectNode, SortNode, GroupAggNode, WindowNode, JoinNode, etc.), their parent-child relationships, and the output column schema at each level. ```{r explain-full} tbl(f) |> filter(cyl > 4) |> select(mpg, cyl, hp) |> arrange(desc(mpg)) |> explain() ``` Several things to look for in the plan: **Streaming vs. materializing.** FilterNode and ProjectNode are streaming: they process one batch at a time in constant memory. SortNode, GroupAggNode, and WindowNode are materializing: they consume all input before producing output. A plan with only streaming nodes above the scan will use memory proportional to one batch, regardless of file size. **Column pruning.** ProjectNode appears when `select()` or `mutate()` is used. It lists only the columns that pass through. Columns not listed are never decoded from disk in the scan node, saving I/O and decompression work. 
**Predicate pushdown.** When a hash index exists on the filter column, the ScanNode consults the `.vtri` index before reading row groups. `explain()` shows the index file path when pushdown is active. Zone-map pruning (based on per-row-group min/max statistics) applies to all columns automatically, with or without an explicit index. Combining both mechanisms gives the fastest possible scan: the index identifies candidate row groups by key hash, and zone-map statistics further narrow the set. **Node order.** The plan is printed bottom-up (leaf to root). The ScanNode is always at the bottom. Reading the plan from bottom to top traces the path data takes through the pipeline: scan from disk, filter rows, project columns, sort, limit. `glimpse()` provides a quick preview of column types and a sample of values without collecting the full result. It reads enough data to show one line per column. ```{r glimpse} tbl(f) |> glimpse() ``` ## Cleanup ```{r cleanup} unlink(c(f, f_batched, f_na, fs, fs2, f_cyl, f_rat, f_ref, f_query, fd, ft, csv_in, csv_out, csv_file, vtr_file, db, f_src, f_idx, paste0(f_idx, ".id.vtri"), f_comp, paste0(f_comp, ".region_id.vtri"), f_blk)) ```