library(sfi)
The sfi package provides features extraction/alignment tools for single file injections(sfi) mode Gas/liquid chromatography-mass spectrometry (GC/LC-MS) data. The SFI technique enables the analysis of numerous samples (e.g., up to 1000 per day) by acquiring data from multiple injections within a single analytical run, significantly reducing the time compared to traditional serial LC-MS injections.
The input of sfi package could be a single mzML file, which contains the interleaved data from all injected samples. Feature table from mzML file by other feature extraction tools such as xcms can also be used as input and the output will be a feature table compatible with regular metabolomics/non-target analysis. Such a feature table can be used for downstream analysis, such as statistical analysis and machine learning. The core function of sfi package is designed to handle sfi feature table and recover peaks back to individual samples. Currently, no bioconductor package is available to analysis such high throughput data.
We also provide a simple local maxim peak picking algorithm for sfi raw data analysis as regular feature extraction tools such as xcms package were designed for single injection files and need to perform feature alignment across multiple files. This is not suitable for sfi data, where all samples are injected in a single file. However, xcms package could still serve as a feature extraction tool for sfi data, but it is still need sfi package to handle the feature table and recover peaks back to individual samples. We depend mzR package to read single mzML file and data.table package to handle large data.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("sfi")
The SFI method (see demo figure) operates by:
Injecting a pooled sample multiple times initially to establish reference chromatographic profiles.
Subsequently injecting individual samples repeatedly at fixed, short time intervals under a continuous isocratic elution. Each sample undergoes a fixed chromatographic separation.
This process generates a single, continuous data file containing the interleaved data from all injected samples.
knitr::include_graphics("https://github.com/yufree/presentation/blob/gh-pages/figure/SFI.png?raw=true")
Figure 1: Schematic representation of the Single File Injection (SFI) workflow
(Top) A pooled Quality Control (QC) sample is injected multiple times at the beginning of the run to establish a robust reference for peak alignment. Individual study samples are then injected sequentially at short, fixed time intervals (e.g., every 1 min) into a continuous isocratic flow. This results in a single, long mzML file containing interleaved chromatograms from all samples, which sfi separates and aligns. (Bottom) Demonstration of peak recovery algorithms for SFI data using fixed time intervals
You need to convert the raw data into mzML file. You might try ThermoFlask for Thermo data or ProteoWizard for other vendor file.
path <- 'sfi.mzML'
peak <- getmzml(path)
The find_2d_peaks() function will perform peak picking on the mzML file. The input is the mz, rt and intensity of the peaks. The output is a data frame containing the peak information.
# load demo data
data(sfi)
# perform peaks picking
peaklist <- find_2d_peaks(mz=sfi$mz,rt=sfi$rt,intensity=sfi$intensity)
The getsfm() function will extract sample features from one injections file.
# injection interval
idelta <- 92
# time windows for a full separation
windows <- 632
# sample numbers in the files
n <- 100
# retention time shift in seconds
deltart <- 10
# min peak number in pooled qc samples
minn <- 6
# align peaks from sfi. Note: peaklist is an sfi_peaks object, so we can pass it directly.
se <- getsfm(peaklist, idelta=idelta, windows=windows, n=100, deltart=10, minn=6)
#> 9 peaks found in all blank samples
#> 9 peaks found in at least 6 QC samples
#> 29825 sample peaks found
#> 188 aligned QC peaks found
#> Recover 0.00630343671416597 peaks
#> 218 aligned matrix peaks found
#> Recover 0.00643755238893546 peaks
#> 9 peaks found in samples
#> 9 matrix peaks found in samples
#> 0 isomer peaks found in samples
#> 26 isomer matrix peaks found in samples
# The output is a SummarizedExperiment object
se
#> class: SummarizedExperiment
#> dim: 9 100
#> metadata(0):
#> assays(1): intensities
#> rownames(9): 195.0876@74 196.8651@74 ... 199.1004@74 199.1749@73
#> rowData names(2): mz rt
#> colnames(100): S1 S2 ... S99 S100
#> colData names(1): sampleidx
Save feature list as csv file.
# extract feature matrix with row metadata
library(SummarizedExperiment)
feature_table <- cbind(as.data.frame(rowData(se)), as.data.frame(assay(se)))
# save feature as csv file
library(data.table)
fwrite(feature_table, 'featurelist.csv')
The primary output of the sfi workflow is a feature table (matrix) where rows represent mass spectral features (unique m/z and retention time pairs) and columns represent individual samples. The values in the cells are the integrated intensities of these features.
This feature table allows you to answer biological or chemical questions about your samples: - Metabolomics/Lipidomics Profiling: Identify which compounds change in abundance between different experimental groups (e.g., disease vs. control). - Quality Control: Assess the stability of the instrument over the run by checking the intensity of internal standards or features in QC samples. - Sample Classification: Use the feature profiles to cluster samples or build predictive models.
The “Downstream Analysis” section below demonstrates how to format this data for use with other R/Bioconductor tools that specialize in these statistical tasks.
For integration with other Bioconductor packages, the sfi workflow already provides results as a SummarizedExperiment object. This class is the standard container for rectangular data in Bioconductor and facilitates interoperability with tools like DESeq2, limma, or SummarizedExperiment-based metabolomics workflows.
Once the data is in a SummarizedExperiment object, we can leverage the rich Bioconductor ecosystem for visualization and quality control.
if (requireNamespace("SummarizedExperiment", quietly = TRUE) &&
requireNamespace("ggplot2", quietly = TRUE)) {
library(SummarizedExperiment)
library(ggplot2)
# Basic data access
intensities <- assay(se, "intensities")
# Visualization: Distribution of intensities across samples
# Reshape for ggplot2
plot_df <- data.frame(
Intensity = as.vector(intensities),
Sample = rep(colnames(intensities), each = nrow(intensities))
)
# Filter out zeros for log transformation
plot_df <- plot_df[plot_df$Intensity > 0, ]
ggplot(plot_df, aes(x = Sample, y = Intensity, fill = Sample)) +
geom_boxplot() +
scale_y_log10() +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Intensity Distribution Across Samples",
y = "Log10 Intensity",
x = "Sample Index") +
guides(fill = "none")
}
Typical metabolomics workflows require data normalization and transformation. We can use MsCoreUtils for these tasks while maintaining the metadata within our object.
if (requireNamespace("SummarizedExperiment", quietly = TRUE) &&
requireNamespace("MsCoreUtils", quietly = TRUE)) {
library(SummarizedExperiment)
library(MsCoreUtils)
# Log2 transformation
# We can create a new assay in the SummarizedExperiment
assay(se, "log2") <- log2(assay(se, "intensities") + 1)
# Quantile normalization
# assay(se, "norm") <- MsCoreUtils::normalizeMethods()[["quantile"]](assay(se, "log2"))
# Example: Summarize feature metadata
cat("Total number of features:", nrow(se), "\n")
cat("M/Z range:", paste(range(rowData(se)$mz), collapse = " - "), "\n")
# Accessing sample metadata for group-wise analysis
colData(se)$Group <- rep(c("Control", "Treatment"), length.out = ncol(se))
# You can now pass this object to other tools like limma or DESeq2
# print(se)
}
#> Total number of features: 9
#> M/Z range: 195.087615966797 - 199.174865722656
In real data, you might find the input window for separation and injection interval are not accurate due to the lag in sample injection process. It’s suggested to filter high intensity peaks and use get_sfi_params to find the accurate value for accurate instrumental window for separation and injection interval.
# get windows and delta time
sfmsub <- peaklist[peaklist$intensity>1e4,]
class(sfmsub) <- c("sfi_peaks", "data.frame") # Preserve class for method dispatch
# get windows and delta time
sfi_params <- get_sfi_params(sfmsub, n=158, deltart = 5)
#> Window is 632.11
#> No retention time information!
#> Delta retention time is 92.213125
Then you can use those parameters for all data.
windows <- sfi_params['window']
idelta <- sfi_params['idelta']
se <- getsfm(peaklist,
idelta = idelta, windows = windows,
n = 158, deltart = 10, minn = 6)
#> 9 peaks found in all blank samples
#> 9 peaks found in at least 6 QC samples
#> 29819 sample peaks found
#> 366 aligned QC peaks found
#> Recover 0.0122740534558503 peaks
#> 371 aligned matrix peaks found
#> Recover 0.011569804487072 peaks
#> 9 peaks found in samples
#> 9 matrix peaks found in samples
#> 0 isomer peaks found in samples
#> 26 isomer matrix peaks found in samples
The sfi package streamlines the complex task of processing Single File Injection data, converting a continuous stream of interleaved mass spectral data into a structured format ready for analysis. By correctly identifying and aligning features from rapid, sequential injections, it unlocks the throughput potential of SFI-MS for large-scale metabolomics and screening studies.
sessionInfo()
#> R Under development (unstable) (2026-03-05 r89546)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] MsCoreUtils_1.23.6 ggplot2_4.0.2
#> [3] SummarizedExperiment_1.41.1 Biobase_2.71.0
#> [5] GenomicRanges_1.63.1 Seqinfo_1.1.0
#> [7] IRanges_2.45.0 S4Vectors_0.49.0
#> [9] BiocGenerics_0.57.0 generics_0.1.4
#> [11] MatrixGenerics_1.23.0 matrixStats_1.5.0
#> [13] sfi_0.99.5 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.10 SparseArray_1.11.11 enviGCMS_0.8.0
#> [4] lattice_0.22-9 digest_0.6.39 magrittr_2.0.4
#> [7] evaluate_1.0.5 grid_4.6.0 RColorBrewer_1.1-3
#> [10] bookdown_0.46 fastmap_1.2.0 jsonlite_2.0.0
#> [13] Matrix_1.7-5 tinytex_0.58 BiocManager_1.30.27
#> [16] scales_1.4.0 jquerylib_0.1.4 abind_1.4-8
#> [19] cli_3.6.5 rlang_1.1.7 XVector_0.51.0
#> [22] withr_3.0.2 cachem_1.1.0 DelayedArray_0.37.0
#> [25] yaml_2.3.12 otel_0.2.0 S4Arrays_1.11.1
#> [28] tools_4.6.0 dplyr_1.2.0 vctrs_0.7.1
#> [31] R6_2.6.1 magick_2.9.1 lifecycle_1.0.5
#> [34] clue_0.3-67 MASS_7.3-65 cluster_2.1.8.2
#> [37] pkgconfig_2.0.3 pillar_1.11.1 bslib_0.10.0
#> [40] gtable_0.3.6 data.table_1.18.2.1 glue_1.8.0
#> [43] Rcpp_1.1.1 tidyselect_1.2.1 tibble_3.3.1
#> [46] xfun_0.57 knitr_1.51 dichromat_2.0-0.1
#> [49] farver_2.1.2 htmltools_0.5.9 igraph_2.2.2
#> [52] rmarkdown_2.30 compiler_4.6.0 S7_0.2.1