ecodive for Large Datasets

ecodive is engineered for high performance, using
parallelized C code to deliver results quickly and with minimal memory
usage. For most use cases, you can pass your data in any standard R
format (like a matrix or data.frame) and get
fast results.
However, when working with very large datasets—such as those with
thousands of samples or hundreds of thousands of features—you can
achieve noticeably better performance by paying attention to the format
of your input data. These optimizations minimize the internal data
conversion steps that ecodive has to perform.
ecodive is highly efficient at reformatting data
internally for one-off calculations. You do not need to manually format
your data for single function calls.
However, if you plan to run multiple ecodive
functions on the same large dataset, you will see a performance
benefit by converting your data into the optimal format first. This
format is a column-compressed sparse matrix
(dgCMatrix) from the Matrix package,
with samples arranged in columns.
By doing the conversion once, you prevent ecodive from
having to reformat the data for each subsequent function call.
library(ecodive)
library(Matrix)
# Assume 'my_counts' is a standard matrix with samples in rows
my_counts <- as.matrix(ex_counts)
# Convert once to the optimal format
optimal_counts <- as(t(my_counts), "dgCMatrix")
# Now, run multiple analyses on the pre-formatted object.
# Remember to specify margin = 2L since samples are in columns.
shannon_vals <- shannon(optimal_counts, margin = 2L)
simpson_vals <- simpson(optimal_counts, margin = 2L)
observed_vals <- observed(optimal_counts, margin = 2L)

Ecological datasets, particularly from amplicon sequencing (16S, ITS) or shotgun metagenomics, are typically “sparse.” This means that for any given sample, the vast majority of species in the full dataset have a count of zero.
A standard R matrix stores every single value, including
all the zeros. For a large dataset, this can consume a massive amount of
memory. A sparse matrix, on the other hand, only stores the non-zero
values and their locations. This dramatically reduces the memory
footprint.
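
The difference is easy to check on simulated data. The sketch below is only an illustration: the table dimensions, the 5% non-zero rate, and the variable names are assumptions chosen for the example, not properties of any real dataset. It uses base R and the Matrix package only.

```r
library(Matrix)

set.seed(42)

# Hypothetical count table: 2000 features (rows) x 200 samples (columns)
n_features <- 2000
n_samples  <- 200
dense <- matrix(0, nrow = n_features, ncol = n_samples)

# Fill roughly 5% of the cells with positive counts; the rest stay zero
nonzero <- sample(length(dense), size = length(dense) %/% 20)
dense[nonzero] <- rpois(length(nonzero), lambda = 20) + 1

# Column-compressed sparse copy of the same data
sparse <- as(dense, "CsparseMatrix")

# Compare the memory used by each representation
dense_bytes  <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))
cat("dense: ", dense_bytes, "bytes\n")
cat("sparse:", sparse_bytes, "bytes\n")
```

The dense matrix pays 8 bytes for every cell, while the sparse copy pays only for the non-zero cells plus their row indices and column pointers, so the saving grows as the table gets sparser.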
By providing a dgCMatrix with samples in columns, you
are feeding ecodive data in its preferred native format,
allowing it to skip any internal conversion and proceed directly to
calculations.
rbiom Users

If you use the rbiom package to manage your data, you’re
already taking advantage of this! rbiom objects store their
count tables internally as a dgCMatrix with samples in
columns. When you pass an rbiom object to an
ecodive function, it is passed in this optimal format
automatically.
Many diversity metrics operate on transformed data, most commonly
relative abundances (norm = 'percent'). When you specify a
transformation like norm = 'percent', ecodive
performs this conversion internally. This adds a small amount of
computational overhead.
If you plan to run many different analyses on the same transformed
data, it is more efficient to perform the transformation once yourself
and then pass the pre-transformed data to ecodive,
specifying norm = 'none'.
# Inefficient: Transforming the data twice
shannon_vals <- shannon(optimal_counts, norm = 'percent', margin = 2L)
simpson_vals <- simpson(optimal_counts, norm = 'percent', margin = 2L)
# More Efficient: Transform once
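# Divide each sample (column) by its total; the result stays sparse with samples in columns
col_totals <- unname(colSums(optimal_counts))
rel_abund <- optimal_counts
rel_abund@x <- rel_abund@x / rep(col_totals, diff(rel_abund@p))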
shannon_vals <- shannon(rel_abund, norm = 'none', margin = 2L)
simpson_vals <- simpson(rel_abund, norm = 'none', margin = 2L)

The one exception to this rule is the centered log-ratio
transformation (norm = 'clr'), used for calculating
Aitchison distance. A standard CLR transformation turns a sparse matrix
into a “dense” matrix (where all zeros become non-zero values), which
dramatically increases memory consumption.
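
To see why, consider a textbook CLR computed by hand on a tiny table. This is a minimal sketch, not ecodive's implementation: the toy counts are made up, and the pseudocount of 1 used to avoid log(0) is an assumption for illustration that may differ from how ecodive treats zeros.

```r
library(Matrix)

# Hypothetical toy table: 4 features (rows) x 3 samples (columns)
counts <- Matrix(c(5, 0, 0, 3,
                   0, 2, 0, 0,
                   1, 0, 6, 0), nrow = 4, sparse = TRUE)

# Textbook CLR per sample: log of each value minus the sample's mean log value.
# The pseudocount of 1 is an assumption made for this example.
log_vals <- log(as.matrix(counts) + 1)
clr <- sweep(log_vals, 2, colMeans(log_vals))

# Count zero entries before and after the transformation
zeros_before <- sum(as.matrix(counts) == 0)
zeros_after  <- sum(clr == 0)
cat("zeros before CLR:", zeros_before, "\n")
cat("zeros after CLR: ", zeros_after, "\n")
```

Every cell that was zero in the count table now holds a non-zero value, so there is nothing left for a sparse format to skip.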
ecodive’s internal CLR implementation uses special
techniques to calculate the result while preserving the sparse matrix
structure as much as possible. Therefore, for CLR-based metrics, you
will achieve better performance and lower memory usage by letting
ecodive handle the transformation.
Do this:
# Best for CLR: Let ecodive handle the transformation
aitchison_dist <- aitchison(optimal_counts, margin = 2L)

Not this: computing the CLR transformation yourself and passing the resulting dense matrix to ecodive.
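For the best performance with very large datasets:
- Convert your data once to a dgCMatrix with samples in columns.
- Specify margin = 2L in ecodive function calls.
- Pre-transform data that you will reuse (such as relative abundances) and pass norm = 'none'.
- Let ecodive handle CLR transformations by using norm='clr'.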