---
title: "Imputing quantitative proteomics data"
author: "Laurent Gatto"
package: MsCoreUtils
abstract: >
  This vignette describes the multiple ways to perform imputation
  using the `impute()` methods and the underlying support function
  from the *MsCoreUtils* package. This vignette is distributed under
  a CC BY-SA license.
output:
  BiocStyle::html_document:
    toc_float: true
bibliography: QFeatures.bib
vignette: >
  %\VignetteIndexEntry{Imputation}
  %\VignetteEngine{knitr::rmarkdown}
  %%\VignetteKeywords{Mass Spectrometry, Proteomics, Metabolomics, Quantitative, imputation}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

# Introduction

This vignette provides a technical description of the imputation functionality available in the *R for Mass Spectrometry* packages, in particular `r BiocStyle::Biocpkg("MsCoreUtils")` for the implementation and `r BiocStyle::Biocpkg("QFeatures")` for the high-level application. These packages depend on other packages for the specific imputation implementations.

```{r env, message = FALSE}
library(MsCoreUtils)
library(QFeatures)
```

This vignette focuses on the technical aspects of imputation, without delving too much into the scientific motivations; see [@Webb-Robertson:2015; @Lazar:2016; @Bramer:2021] for the necessary background. We will simply introduce important concepts when needed and refer to some relevant papers for further reading.

An important concept, described among others in [@Lazar:2016], is that data can be missing *at random* (MAR) or missing *not at random* (MNAR). A MNAR feature is assumed to be missing in the data because it was effectively absent or below the limit of detection in the biological sample. MAR features, on the other hand, have not been detected or identified due to technological limitations such as poor ionisation, competition among precursors (in data dependent acquisition), or absence of identification or mis-identification.
Given the different underlying causes of the missingness, they should be imputed using different approaches. Typically, MNAR features can be imputed using left-censored methods, which impute using a *small* value, reflective of the absence of the feature, while MAR features should be imputed using *hot deck* approaches, i.e. methods that impute using *similar* or *matching* values.

We would like to caution users on the risks of imputation, in particular when a high proportion of values are missing. Given the different types of missingness, wrongly imputing values can substantially distort downstream analyses and their validity. In such situations, it might be safer to avoid imputation altogether, and maintain missing values. It can also be helpful to filter out features with *too many* missing values; the `filterNA()` function can be used to that effect.

The imputation methods available in the `r BiocStyle::Biocpkg("MsCoreUtils")` package can be listed programmatically with the `imputeMethods()` function and are documented in the `?impute_matrix` [documentation page](https://rformassspectrometry.github.io/MsCoreUtils/reference/imputation.html).

```{r imputeMethods}
imputeMethods()
```

Note that a mass spectrometer cannot technically record 0s, so these should never be observed in a dataset. If present, they are the result of a prior zero-imputation by the pre-processing software, which erroneously suggests that the feature was of the MNAR type and effectively absent from the sample. We advise starting your processing by replacing these misleading 0s with missing values. This can be achieved with the `zeroIsNA()` method if your data is formatted as a `SummarizedExperiment` object [@SE].

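To make the zero-to-`NA` replacement concrete, the sketch below builds a tiny `SummarizedExperiment` containing two zeros and converts them with `zeroIsNA()`. The object names `m0` and `se0` and the values they hold are made up for this illustration only, and the call assumes that `zeroIsNA()` can be applied to a `SummarizedExperiment` without further arguments, as suggested above.

```{r zeroIsNASketch, message = FALSE}
library(SummarizedExperiment)
library(QFeatures)

## Hypothetical toy data: 3 features, 2 samples, with two zeros
## standing in for values zero-imputed by upstream software.
m0 <- matrix(c(0, 2.5, 3.1,
               4.2, 0, 6.3),
             nrow = 3,
             dimnames = list(paste0("F", 1:3), paste0("S", 1:2)))
se0 <- SummarizedExperiment(assays = list(m0))

## Replace the zeros with NA so that they are treated as missing.
se0 <- zeroIsNA(se0)
assay(se0)
```

After this step, the two former zeros (F1 in S1 and F2 in S2) should appear as `NA` and can be handled by any of the imputation methods described in this vignette.
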
## Example data

```{r mkdata}
## 10 features (rows) by 5 samples (columns)
m <- matrix(1:50, nrow = 10)
## Missing values along the diagonal of the top half ...
diag(m) <- NA
## ... and, shifted down by 5 rows, along the bottom half
m[which(is.na(m)) + 5] <- NA
dimnames(m) <- list(paste0("F", 1:10), paste0("S", 1:5))
## Feature-level annotation: TRUE for F1 to F5, FALSE for F6 to F10
randna <- rep(c(TRUE, FALSE), each = 5)
se <- SummarizedExperiment(assays = m, rowData = data.frame(randna))
se
```

We are going to use this small `SummarizedExperiment` to illustrate the different imputation approaches and their parametrisation. It is composed of 10 features and 5 samples, and contains 10 missing values: 5 aligned diagonally in the top half of the data matrix and 5 in the bottom half.

```{r se}
assay(se)
```

In the following sections, we will use the [`impute()` method](https://rformassspectrometry.github.io/QFeatures/reference/impute.html) (see `?QFeatures::impute`) and apply it to the `SummarizedExperiment` instance `se` above. The method is also applicable to `QFeatures` objects. The individual `impute_matrix()` and other `impute_*` functions from the `MsCoreUtils` package can be applied directly to `matrix` objects.

# Simple imputation

We refer to *simple* (or *single*) imputation when a single imputation method is used across the whole dataset. For example, if we want to replace all missing values by 0, we can use the `impute()` method as shown below.

```{r simpleImputation}
impute(se, method = "zero") |>
    assay()
```

Setting the `method` to `"zero"` will apply the `MsCoreUtils::impute_zero()` function on the object's assay.

## Passing parameters to the imputation function

If we instead want to impute all missing values with a specific value such as 0.5, we can use the `"with"` method to apply the `MsCoreUtils::impute_with()` function. This function requires an additional argument, `val`, that defines the specific value that should be used to replace missing values.

```{r imputeWith}
impute(se, method = "with", val = 0.5) |>
    assay()
```

The details of each underlying function are documented in the `?impute_zero`, `?impute_with`, ... manual pages, which all lead to the main `?impute_matrix` documentation.

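As noted when introducing the example data, the `impute_*` functions from *MsCoreUtils* can also be applied directly to `matrix` objects. The sketch below does so on a small matrix `mx`, a hypothetical object created only for this illustration, once through `impute_matrix()` and once by calling `impute_with()` directly; both calls are expected to give the same result.

```{r imputeMatrixSketch}
library(MsCoreUtils)

## Hypothetical toy matrix with two missing values.
mx <- matrix(c(NA, 2, 3, 4, NA, 6), nrow = 3,
             dimnames = list(paste0("F", 1:3), paste0("S", 1:2)))

## Dispatch on the method name, passing `val` on to impute_with() ...
res1 <- impute_matrix(mx, method = "with", val = 0.5)
## ... or call the underlying function directly.
res2 <- impute_with(mx, val = 0.5)

res1
all.equal(res1, res2)
```

Working at the matrix level can be convenient for quick checks, while `impute()` keeps the result inside the `SummarizedExperiment` or `QFeatures` object.
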
# The MARGIN argument

In the two simple examples above, there is no sense of direction when imputing, as every missing value is replaced by a single, pre-defined value. For many other methods, however, the direction along which the imputation is computed matters. To illustrate this, let's use the `"MinDet"` method (see `?impute_MinDet`).

*MinDet* performs the imputation of left-censored missing data using a deterministic minimal value approach. Considering an expression dataset with _n_ samples and _p_ features, for each *sample*, the missing entries are replaced with a minimal value observed in that sample. The minimal value observed is estimated as being the q-th quantile (default `q = 0.01`) of the observed values in that sample.

Below, we are going to set `q = 0` to impute with the minimal value within each sample.

```{r imputeMinDet}
impute(se, method = "MinDet", q = 0) |>
    assay()
```

As can be seen, the missing values in sample S1, namely F1 and F6, have been imputed by the smallest observed value in S1, namely 2. The same holds for the four other samples.

In the definition above, it is explicitly stated that the imputation is done for each sample, i.e. along the columns of the quantitative matrix, also called the second margin. We can repeat the same imputation by explicitly setting `MARGIN = 2`.

```{r imputeMinDetMargin2}
impute(se, method = "MinDet", q = 0, MARGIN = 2) |>
    assay()
```

And indeed, the default margin for the `"MinDet"` method is 2:

```{r imputeMinDetDefaultMargin}
getImputeMargin("impute_MinDet")
```

The imputation margin is not always 2. The *nearest neighbour* imputation method chooses a certain number of similar features. By similar features, we explicitly refer to other rows, i.e. the first margin:

```{r imputeKnnDetDefaultMargin}
getImputeMargin("impute_knn")
```

It is possible to change the margin from its default value. Below, we now use `"MinDet"` and choose the smallest value within each feature/row.

```{r imputeMinDetMargin1}
impute(se, method = "MinDet", q = 0, MARGIN = 1) |>
    assay()
```

Now, we see that the missing F1 value in S1 has been imputed by the smallest observed value along the first row, namely 11.

We can extract the default margin values for all `MsCoreUtils::impute_*` functions as shown below.

```{r defaultMargins}
getImputeMargin()
```

A missing margin means that, as for `"with"` or `"zero"` above, there is no margin along which the imputation is performed. Mixed imputation is a special case that has two margins, which we will describe in the next section.

The relevance of the imputation margin can also depend on downstream analyses. In [@Vanderaa:2023], the authors illustrate that imputation along the first margin increases the correlation between features, while imputation along the second margin increases the correlation between samples. These artificially improved correlations can then in turn impact any analyses that rely on the identification of sample or protein clusters.

# Mixed imputation

As we have seen above, different underlying processes can lead to different types of missing values, namely MAR and MNAR. One view of this is to define these processes at the feature level ^[Nowadays, I believe that this feature-level representation of MAR or MNAR is not entirely correct. It can be useful in some simple cases though.]. In such cases, one might want to impute different sets of features in a *mixed* way: MAR features with a MAR-appropriate method, and MNAR features with a MNAR-appropriate method. This is possible with the `"mixed"` method.

To be able to apply mixed imputation, we need to define which features are MAR and which features are MNAR. This is done using a logical vector whose length is equal to the number of features.

```{r randna}
rowData(se)$randna
```

The `TRUE` values define the MAR features (F1 to F5 in our case) and the `FALSE` values define the MNAR features (F6 to F10).

To use mixed imputation, we need to specify the MAR and MNAR features and two imputation methods: one for MAR features, and another one for MNAR features.

```{r imputeMixedSimple}
impute(se, method = "mixed",
       randna = rowData(se)$randna,
       mar = "MinDet",
       mnar = "zero") |>
    assay()
```

We can see that the bottom half of the matrix, corresponding to the MNAR features, has been imputed with zeros, while the top half has been imputed by `"MinDet"`. We also see that the margin used for `"MinDet"` was 1 (along the rows). Indeed, the default margins are 1 for both MAR and MNAR features; in the vector returned below, the first element is for MAR and the second one for MNAR.

```{r defaultMixedMargins}
getImputeMargin("impute_mixed")
```

## Different margins

It is of course possible to change the margins when performing mixed imputation:

```{r imputeMixedWithMargins}
impute(se, method = "mixed",
       randna = rowData(se)$randna,
       mar = "MinDet",
       mnar = "zero",
       MARGIN = c(2, NA)) |>
    assay()
```

We set `NA` for zero-imputation (it could also have been 1, as it is irrelevant anyway) and 2 for MinDet-imputation. We can confirm that, this time, the MAR features have been imputed using the smallest value in each sample/column.

## Passing parameters to the imputation functions

It is possible to pass arguments to the respective MAR and MNAR functions using the `marArgs` and `mnarArgs` arguments, as named lists. Below, we are going to use *MinDet* in both cases, with different parameters.

```{r mixedWithArgs}
impute(se, method = "mixed",
       randna = rowData(se)$randna,
       mar = "MinDet",
       mnar = "MinDet",
       marArgs = list(q = 0),
       mnarArgs = list(q = 1),
       MARGIN = c(1, 1)) |>
    assay()
```

In both cases, we impute along the rows. For the MAR features (top half of the matrix), we impute using the minimal value of that row (using `q = 0`), while for the MNAR features (bottom half of the matrix), we impute using the maximal value of that row (using `q = 1`). As anticipated, the value of F1 in S1 gets 11, and F6 in S1 gets 46.

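The two values quoted for column S1 can be checked by hand with base R, without calling `impute()`. The sketch below rebuilds the example matrix under a new name (`mchk`, used only here) and computes the quantiles of the observed values in rows F1 and F6, the two features that are missing in S1. It is a plain re-computation under the assumption that *MinDet* with `q = 0` and `q = 1` reduces to the row minimum and maximum along the first margin, not the package's implementation.

```{r checkRowQuantiles}
## Rebuild the example matrix (same construction as in `mkdata`).
mchk <- matrix(1:50, nrow = 10)
diag(mchk) <- NA
mchk[which(is.na(mchk)) + 5] <- NA
dimnames(mchk) <- list(paste0("F", 1:10), paste0("S", 1:5))

## q = 0 gives the row minimum, q = 1 the row maximum.
f1_min <- unname(quantile(mchk["F1", ], probs = 0, na.rm = TRUE))
f6_max <- unname(quantile(mchk["F6", ], probs = 1, na.rm = TRUE))
c(F1 = f1_min, F6 = f6_max)
```

Row F1 holds the observed values 11, 21, 31 and 41, and row F6 holds 16, 26, 36 and 46, which gives 11 and 46 respectively.
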
## Using the whole matrix to compute imputed values

When doing mixed imputation, the respective MAR and MNAR sub-matrices are split and imputed separately. It is also possible to use the whole data matrix to compute the MAR and MNAR imputed values. This is controlled by the `split` argument that, by default, is set to `TRUE`.

Below, we are going to repeat a mixed imputation, imputing the MAR values (the top half of the matrix) with the highest value of the *whole* columns using `MARGIN = 2` and `split = FALSE`. The MNAR values (the bottom half of the matrix) are imputed using the smallest value along the rows using `MARGIN = 1`, and are hence not impacted by the `split` value.

```{r mixedNoSplit}
impute(se, method = "mixed",
       randna = rowData(se)$randna,
       mar = "MinDet",
       mnar = "MinDet",
       marArgs = list(q = 1),
       mnarArgs = list(q = 0),
       MARGIN = c(2, 1),
       split = FALSE) |>
    assay()
```

We see that the value of F1 in S1 gets 10, the highest value in S1, observed for F10. If we had kept the default `split = TRUE`, it would have gotten 5 from F5, the highest value among the MAR features in S1. The MNAR imputation isn't affected by the split and gets the smallest value in each row.

# Session information {-}

```{r sessioninfo, echo=FALSE}
sessionInfo()
```

# License {-}

This vignette is distributed under a [CC BY-SA license](https://creativecommons.org/licenses/by-sa/2.0/).

# References {-}