--- title: "Working with lme4 and nlme in seqwrap" author: - name: Chidimma Echebiri affiliation: - name: University of Inland Norway department: Department of Biotechnology, PhD programme in applied ecology and biotechnology - name: Daniel Hammarström email: daniel.hammarstrom@inn.no affiliation: - name: University of Inland Norway department: Department of Public Health and Sport Sciences vignette: > %\VignetteIndexEntry{Working with lme4 and nlme in seqwrap} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} --- ```{r} #| label: setup #| include: false required_pkgs <- c("glmmTMB", "lme4", "nlme") have_all <- all(vapply( required_pkgs, requireNamespace, logical(1), quietly = TRUE )) knitr::opts_chunk$set(eval = have_all) ``` The `seqwrap` function lets the user specify which modeling algorithm to use when modeling data provided as a data frame and metadata describing each sample. Target-specific models are passed to the results together with optional user-specified summary and model-evaluation functions. The function aims not to limit the user to the preferences of specific models. This allows for comparisons between modeling frameworks and flexibility for many types of high-dimensional data. In this vignette, three popular packages are applied to the same underlying data with varying model definitions. ### Fitting count data to a negative binomial model using `glmmTBM` We will start by simulating data from a simple experiment where count data has been gathered from 25 clusters in two conditions (e.g. treatment and control). ```{r} #| label: simulated-data library(seqwrap) library(glmmTMB) # Simulate data dat <- seqwrap::simcounts(n_genes = 1000, n_samples = 50, beta_0 = 2, sigma_0 = 0.5, beta_1 = 0.5,, sigma_1 = 2, b_0 = 1, clusters = 25, overdispersion_min = c(1, 10)) # Calculate library sizes dat$metadata$libsize <- log(colSums(dat$data[, -1]) ) ``` Using `seqwrap_compose` we can populate a `swcontainer` object with all needed information to fit models. The `swcontainer` objects is basically a list with data, meta data and arguments that gets passed the model algorithm. The `swcontainer` object also contain reference to the desired model function. ```{r} swobject <- seqwrap_compose( data = dat$data, # A data frame with the first column as the target identifier metadata = dat$metadata, # Meta data bout samples, containing predictors samplename = "sample", # The name sample id column modelfun = glmmTMB::glmmTMB, # The model fitting function # A list of argumnets supplied to the modelfunction arguments = list( formula = y ~ x + (1 | cluster), family = glmmTMB::nbinom1 ), summary_fun = NULL, eval_fun = NULL, additional_vars = NULL ) ``` The `seqwrap` function will iterate over all targets in the data and fit the model and apply the summary and evaluation function to each model. Notice that we can use `return_models = TRUE` to save model in the output. Below we also use `subset = 1:10` to only do the first 10 targets. This is useful when we want to check if teh summary and evaluation functions work as intended. The `swobject` created above is updated when we supply ne w arguments to `seqwrap`. ```{r} models <- seqwrap(swobject, return_models = TRUE, subset = 1:10) ``` The resulting object can be summarized using `seqwrap_summarise` ```{r} modelsum <- seqwrap_summarise(models) ``` We have left the arguments `summary_fun` and `eval_fun` as NULL. This means that `seqwrap` will try to combine model parameter estimates and performance using `broom.mixed::`/`broom::tidy` and `DHARMa`. These evaluations are thought of as minimal summaries of the model and might not completly satisfy the needs of the user. From our example above, the fitted parameters from the models are returned in `modelsum$summaries` and model evaluations are returned in `modelsum$evaluations`. The functions are exported from the seqwrap package and we can inspect them: ```{r} seqwrap::generic_summary ``` ```{r} seqwrap::generic_evaluation ``` We encourage users to define their own custom functions. ### Fitting transformed count data to a Gaussian model using `lme4` Instead of counts, let's say we are interested in analyzing counts per million, expressed as $\text{CPM} = \frac{\text{Count}}{\text{Library size}} \times 10^6$. We will convert our simulated count data to CPM. ```{r} lib_sizes <- colSums(dat$data[,-1]) / 1e6 cpm <- sweep(dat$data[,-1], MARGIN = 2, STATS = lib_sizes, FUN = "/") dat$data <- data.frame(cbind(dat$data[,1], cpm)) ``` We will create a new `swcontainer` using `lme4::lmer` as our fitting algorithm. ```{r} swobject <- seqwrap_compose( data = dat$data, # A data frame with the first column as the target identifier metadata = dat$metadata, # Meta data bout samples, containing predictors samplename = "sample", # The name sample id column modelfun = lme4::lmer, # The model fitting function # A list of argumnets supplied to the modelfunction arguments = list( formula = y ~ x + (1 | cluster) ), summary_fun = NULL, eval_fun = NULL, additional_vars = NULL ) ``` Next we can run `seqwrap` to get estimates for our genes, to save some time we will start by using a subset of the simulated genes. ```{r} models <- seqwrap(swobject, return_models = TRUE, subset = 1:10) ``` The generic summary function will provide fitted parameter statistics. ```{r} modelsum <- seqwrap_summarise(models) modelsum$summaries ``` ### Using `nlme::lme` and `nlme::gls`, need for `additional_vars` We will now try to fit the same model as previously fitted in `lme4::lmer` using `nlme::lme`. We can use the same `swcontainer` object and update the relevant lines. Notice that `nlme::lme` requires a different syntax for defining the model. Unsurprisingly, the results are the same @fig-complme. ```{r} models_lme <- seqwrap(swobject, modelfun = nlme::lme, arguments = alist(fixed = y ~ x, random = ~ 1|cluster), return_models = TRUE, eval_fun = NULL, subset = 1:10) modelsum_lme <- seqwrap_summarise(models_lme) ``` ```{r} lme_est <- data.frame(modelsum_lme$summaries[, c(1,4,5, 6,9)]) lmer_est <- data.frame(modelsum$summaries[, c(1,4,5, 6)]) ``` ```{r} #| echo: false #| fig-cap: "Similar estimates between `lme4::lmer` and `nlme::lme` models" #| label: fig-complme plot(lme_est[lme_est$term == "x","estimate"], lmer_est[lmer_est$term == "x","estimate"], xlab = "nlme::lme estimate", ylab = "lme4::lmer estimate") abline(a = 0, b = 1) ``` `seqwrap` will attempt to extract relevant metadata for use in modeling. This is done by extracting variable names from model formulations. In some cases, this can be difficult because we use additional functionality in some packages. Models constructed in `nlme::gls` are an example. We can construct a similar model to `nlme::lme` using `nlme::CorCompSymm`, resulting in a model where the correlations between observations within clusters are uniform (see `?nlme::CorCompSymm`). To help `seqwrap` understand that we want to include `cluster` from the metadata in all fitting objects, we need to include it in `additional_vars`. ```{r} models_gls <- seqwrap(swobject, modelfun = nlme::gls, arguments = alist( model = y ~ x, correlation = nlme::corCompSymm(form = ~ 1|cluster) ), additional_vars = "cluster", return_models = TRUE, eval_fun = NULL, subset = 1:10) modelsum_gls <- seqwrap_summarise(models_gls) ``` As evident from @fig-compgls we can see that the two models will produce slightly different p-values. ```{r} gls_est <- data.frame(modelsum_gls$summaries[, c(1,2,3,4,6)]) ``` ```{r} #| echo: false #| fig-cap: "Similar estimates between `lme4::lmer` and `nlme::lme` models" #| label: fig-compgls plot(-log10(lme_est[lme_est$term == "x","p.value"]), -log10(gls_est[gls_est$term == "x","p.value"]), xlab = "nlme::lme p-value", ylab = "nlme::gls p-value") abline(a = 0, b = 1) ```