---
title: "Working with lme4 and nlme in seqwrap"
author:
  - name: Chidimma Echebiri
    affiliation:
      - name: University of Inland Norway
        department: Department of Biotechnology, PhD programme in applied ecology and biotechnology
  - name: Daniel Hammarström
    email: daniel.hammarstrom@inn.no
    affiliation: 
      - name: University of Inland Norway
        department: Department of Public Health and Sport Sciences 
vignette: >
  %\VignetteIndexEntry{Working with lme4 and nlme in seqwrap}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| label: setup
#| include: false

required_pkgs <- c("glmmTMB", "lme4", "nlme")
have_all <- all(vapply(
  required_pkgs,
  requireNamespace,
  logical(1),
  quietly = TRUE
))
knitr::opts_chunk$set(eval = have_all)
```

The `seqwrap` function lets the user specify which modeling algorithm to use when modeling data provided as a data frame and metadata describing each sample. Target-specific models are passed to the results together with optional user-specified summary and model-evaluation functions. The function aims not to limit the user to the preferences of specific models. This allows for comparisons between modeling frameworks and flexibility for many types of high-dimensional data. In this vignette, three popular packages are applied to the same underlying data with varying model definitions.


### Fitting count data to a negative binomial model using `glmmTBM`

We will start by simulating data from a simple experiment where count data has been gathered from 25 clusters in two conditions (e.g. treatment and control). 

```{r}
#| label: simulated-data

library(seqwrap)
library(glmmTMB)


# Simulate data
dat <- seqwrap::simcounts(n_genes = 1000, 
                    n_samples = 50, 
                    beta_0 = 2, 
                    sigma_0 = 0.5, 
                    beta_1 = 0.5,, 
                    sigma_1 = 2, 
                    b_0 = 1,
                    clusters = 25,
                    overdispersion_min = c(1, 10))

# Calculate library sizes
dat$metadata$libsize <- log(colSums(dat$data[, -1]) )


```

Using `seqwrap_compose` we can populate a `swcontainer` object with all needed information to fit models. The `swcontainer` objects is basically a list with data, meta data and arguments that gets passed the model algorithm. The `swcontainer` object also contain reference to the desired model function. 

```{r}

swobject <- seqwrap_compose(
  data = dat$data, # A data frame with the first column as the target identifier
  metadata = dat$metadata, # Meta data bout samples, containing predictors
  samplename = "sample", # The name sample id column 
  modelfun = glmmTMB::glmmTMB, # The model fitting function
  # A list of argumnets supplied to the modelfunction
  arguments = list(
    formula = y ~ x + (1 | cluster),
    family = glmmTMB::nbinom1
  ),
  summary_fun = NULL,
  eval_fun = NULL,
  additional_vars = NULL
)

```


The `seqwrap` function will iterate over all targets in the data and fit the model and apply the summary and evaluation function to each model. Notice that we can use `return_models = TRUE` to save model in the output. Below we also use `subset = 1:10` to only do the first 10 targets. This is useful when we want to check if teh summary and evaluation functions work as intended. The `swobject` created above is updated when we supply ne w arguments to `seqwrap`.


```{r}

models <- seqwrap(swobject, 
                  return_models = TRUE,
                  subset = 1:10) 

```

The resulting object can be summarized using `seqwrap_summarise`

```{r}

modelsum <- seqwrap_summarise(models)

```

We have left the arguments `summary_fun` and `eval_fun` as NULL. This means that `seqwrap` will try to combine model parameter estimates and performance using `broom.mixed::`/`broom::tidy` and `DHARMa`. These evaluations are thought of as minimal summaries of the model and might not completly satisfy the needs of the user. From our example above, the fitted parameters from the models are returned in `modelsum$summaries` and model evaluations are returned in `modelsum$evaluations`.

The functions are exported from the seqwrap package and we can inspect them:

```{r}
seqwrap::generic_summary
```



```{r}
seqwrap::generic_evaluation
```

We encourage users to define their own custom functions. 

### Fitting transformed count data to a Gaussian model using `lme4`

Instead of counts, let's say we are interested in analyzing counts per million, expressed as $\text{CPM} = \frac{\text{Count}}{\text{Library size}} \times 10^6$. We will convert our simulated count data to CPM.


```{r}

lib_sizes <- colSums(dat$data[,-1]) / 1e6

cpm <- sweep(dat$data[,-1], MARGIN = 2, 
             STATS = lib_sizes, 
             FUN = "/")

dat$data <- data.frame(cbind(dat$data[,1], cpm))

```

We will create a new `swcontainer` using `lme4::lmer` as our fitting algorithm. 

```{r}

swobject <- seqwrap_compose(
  data = dat$data, # A data frame with the first column as the target identifier
  metadata = dat$metadata, # Meta data bout samples, containing predictors
  samplename = "sample", # The name sample id column 
  modelfun = lme4::lmer, # The model fitting function
  # A list of argumnets supplied to the modelfunction
  arguments = list(
    formula = y ~ x + (1 | cluster)
  ),
  summary_fun = NULL,
  eval_fun = NULL,
  additional_vars = NULL
)

```

Next we can run `seqwrap` to get estimates for our genes, to save some time we will start by using a subset of the simulated genes.

```{r}

models <- seqwrap(swobject, 
                  return_models = TRUE,
                  subset = 1:10) 

```

The generic summary function will provide fitted parameter statistics.

```{r}

modelsum <- seqwrap_summarise(models)

modelsum$summaries

```

### Using `nlme::lme` and `nlme::gls`, need for `additional_vars`

We will now try to fit the same model as previously fitted in `lme4::lmer` using `nlme::lme`. We can use the same `swcontainer` object and update the relevant lines. Notice that `nlme::lme` requires a different syntax for defining the model. Unsurprisingly, the results are the same @fig-complme. 

```{r}

models_lme <- seqwrap(swobject, 
                  modelfun = nlme::lme,
                  
                  arguments = alist(fixed = y ~ x, 
                                    random = ~ 1|cluster), 
                  return_models = TRUE,
                  eval_fun = NULL,
                  subset = 1:10)

modelsum_lme <- seqwrap_summarise(models_lme)


```

```{r}
lme_est <- data.frame(modelsum_lme$summaries[, c(1,4,5, 6,9)])
lmer_est <- data.frame(modelsum$summaries[, c(1,4,5, 6)])
```


```{r}
#| echo: false
#| fig-cap: "Similar estimates between `lme4::lmer` and `nlme::lme` models"
#| label: fig-complme
plot(lme_est[lme_est$term == "x","estimate"], 
     lmer_est[lmer_est$term == "x","estimate"], 
     xlab = "nlme::lme estimate", 
     ylab = "lme4::lmer estimate")
abline(a = 0, b = 1)

```

`seqwrap` will attempt to extract relevant metadata for use in modeling. This is done by extracting variable names from model formulations. In some cases, this can be difficult because we use additional functionality in some packages. Models constructed in `nlme::gls` are an example. 

We can construct a similar model to `nlme::lme` using `nlme::CorCompSymm`, resulting in a model where the correlations between observations within clusters are uniform (see `?nlme::CorCompSymm`). To help `seqwrap` understand that we want to include `cluster` from the metadata in all fitting objects, we need to include it in `additional_vars`.

```{r}

models_gls <- seqwrap(swobject, 
                  modelfun = nlme::gls,
                  arguments = alist(
                  model = y ~ x,
                  correlation = nlme::corCompSymm(form = ~ 1|cluster) ),
                  additional_vars = "cluster",
                  return_models = TRUE,
                  eval_fun = NULL,
                  subset = 1:10)

modelsum_gls <- seqwrap_summarise(models_gls)

```
As evident from @fig-compgls we can see that the two models will produce slightly different p-values.

```{r}


gls_est <- data.frame(modelsum_gls$summaries[, c(1,2,3,4,6)])


```

```{r}
#| echo: false
#| fig-cap: "Similar estimates between `lme4::lmer` and `nlme::lme` models"
#| label: fig-compgls
plot(-log10(lme_est[lme_est$term == "x","p.value"]), 
     -log10(gls_est[gls_est$term == "x","p.value"]), 
     xlab = "nlme::lme p-value", 
     ylab = "nlme::gls p-value")
abline(a = 0, b = 1)

```