sfi workflow

library(sfi)

0.1 Introduction

The sfi package provides features extraction/alignment tools for single file injections(sfi) mode Gas/liquid chromatography-mass spectrometry (GC/LC-MS) data. The SFI technique enables the analysis of numerous samples (e.g., up to 1000 per day) by acquiring data from multiple injections within a single analytical run, significantly reducing the time compared to traditional serial LC-MS injections.

The input of sfi package could be a single mzML file, which contains the interleaved data from all injected samples. Feature table from mzML file by other feature extraction tools such as xcms can also be used as input and the output will be a feature table compatible with regular metabolomics/non-target analysis. Such a feature table can be used for downstream analysis, such as statistical analysis and machine learning. The core function of sfi package is designed to handle sfi feature table and recover peaks back to individual samples. Currently, no bioconductor package is available to analysis such high throughput data.

We also provide a simple local maxim peak picking algorithm for sfi raw data analysis as regular feature extraction tools such as xcms package were designed for single injection files and need to perform feature alignment across multiple files. This is not suitable for sfi data, where all samples are injected in a single file. However, xcms package could still serve as a feature extraction tool for sfi data, but it is still need sfi package to handle the feature table and recover peaks back to individual samples. We depend mzR package to read single mzML file and data.table package to handle large data.

0.2 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("sfi")

0.3 Experimental Desgin

The SFI method (see demo figure) operates by:

Injecting a pooled sample multiple times initially to establish reference chromatographic profiles.
Subsequently injecting individual samples repeatedly at fixed, short time intervals under a continuous isocratic elution. Each sample undergoes a fixed chromatographic separation.
This process generates a single, continuous data file containing the interleaved data from all injected samples.

knitr::include_graphics("https://github.com/yufree/presentation/blob/gh-pages/figure/SFI.png?raw=true")

Schematic representation of the Single File Injection (SFI) workflow. (Top) A pooled Quality Control (QC) sample is injected multiple times at the beginning of the run to establish a robust reference for peak alignment. Individual study samples are then injected sequentially at short, fixed time intervals (e.g., every 1 min) into a continuous isocratic flow. This results in a single, long mzML file containing interleaved chromatograms from all samples, which `sfi` separates and aligns. (Bottom) Demonstration of peak recovery algorithms for SFI data using fixed time intervals

Figure 1: Schematic representation of the Single File Injection (SFI) workflow
(Top) A pooled Quality Control (QC) sample is injected multiple times at the beginning of the run to establish a robust reference for peak alignment. Individual study samples are then injected sequentially at short, fixed time intervals (e.g., every 1 min) into a continuous isocratic flow. This results in a single, long mzML file containing interleaved chromatograms from all samples, which sfi separates and aligns. (Bottom) Demonstration of peak recovery algorithms for SFI data using fixed time intervals

0.4 Basic Usage

0.4.1 Loading raw data

You need to convert the raw data into mzML file. You might try ThermoFlask for Thermo data or ProteoWizard for other vendor file.

path <- 'sfi.mzML'
peak <- getmzml(path)

0.4.2 Feature extraction

The find_2d_peaks() function will perform peak picking on the mzML file. The input is the mz, rt and intensity of the peaks. The output is a data frame containing the peak information.

# load demo data
data(sfi)
# perform peaks picking
peaklist <-  find_2d_peaks(mz=sfi$mz,rt=sfi$rt,intensity=sfi$intensity)

0.4.3 Feature alignment

The getsfm() function will extract sample features from one injections file.

# injection interval
idelta <- 92
# time windows for a full separation
windows <- 632
# sample numbers in the files
n <- 100
# retention time shift in seconds
deltart <- 10
# min peak number in pooled qc samples 
minn <- 6
# align peaks from sfi. Note: peaklist is an sfi_peaks object, so we can pass it directly.
se <- getsfm(peaklist, idelta=idelta, windows=windows, n=100, deltart=10, minn=6)
#> 9 peaks found in all blank samples
#> 9 peaks found in at least 6 QC samples
#> 29825 sample peaks found
#> 188 aligned QC peaks found
#> Recover 0.00630343671416597 peaks
#> 218 aligned matrix peaks found
#> Recover 0.00643755238893546 peaks
#> 9 peaks found in samples
#> 9 matrix peaks found in samples
#> 0 isomer peaks found in samples
#> 26 isomer matrix peaks found in samples
# The output is a SummarizedExperiment object
se
#> class: SummarizedExperiment 
#> dim: 9 100 
#> metadata(0):
#> assays(1): intensities
#> rownames(9): 195.0876@74 196.8651@74 ... 199.1004@74 199.1749@73
#> rowData names(2): mz rt
#> colnames(100): S1 S2 ... S99 S100
#> colData names(1): sampleidx

0.4.4 Save feature list

Save feature list as csv file.

# extract feature matrix with row metadata
library(SummarizedExperiment)
feature_table <- cbind(as.data.frame(rowData(se)), as.data.frame(assay(se)))
# save feature as csv file
library(data.table)
fwrite(feature_table, 'featurelist.csv')

0.5 Using the Results

The primary output of the sfi workflow is a feature table (matrix) where rows represent mass spectral features (unique m/z and retention time pairs) and columns represent individual samples. The values in the cells are the integrated intensities of these features.

This feature table allows you to answer biological or chemical questions about your samples: - Metabolomics/Lipidomics Profiling: Identify which compounds change in abundance between different experimental groups (e.g., disease vs. control). - Quality Control: Assess the stability of the instrument over the run by checking the intensity of internal standards or features in QC samples. - Sample Classification: Use the feature profiles to cluster samples or build predictive models.

The “Downstream Analysis” section below demonstrates how to format this data for use with other R/Bioconductor tools that specialize in these statistical tasks.

0.6 Downstream Analysis

For integration with other Bioconductor packages, the sfi workflow already provides results as a SummarizedExperiment object. This class is the standard container for rectangular data in Bioconductor and facilitates interoperability with tools like DESeq2, limma, or SummarizedExperiment-based metabolomics workflows.

0.6.1 Interoperability and Visualization

Once the data is in a SummarizedExperiment object, we can leverage the rich Bioconductor ecosystem for visualization and quality control.

if (requireNamespace("SummarizedExperiment", quietly = TRUE) && 
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(SummarizedExperiment)
  library(ggplot2)
  
  # Basic data access
  intensities <- assay(se, "intensities")
  
  # Visualization: Distribution of intensities across samples
  # Reshape for ggplot2
  plot_df <- data.frame(
    Intensity = as.vector(intensities),
    Sample = rep(colnames(intensities), each = nrow(intensities))
  )
  # Filter out zeros for log transformation
  plot_df <- plot_df[plot_df$Intensity > 0, ]
  
  ggplot(plot_df, aes(x = Sample, y = Intensity, fill = Sample)) +
    geom_boxplot() +
    scale_y_log10() +
    theme_minimal() +
    theme(axis.text.x = element_blank()) +
    labs(title = "Intensity Distribution Across Samples",
         y = "Log10 Intensity",
         x = "Sample Index") +
    guides(fill = "none")
}

0.6.2 Normalization and Transformation

Typical metabolomics workflows require data normalization and transformation. We can use MsCoreUtils for these tasks while maintaining the metadata within our object.

if (requireNamespace("SummarizedExperiment", quietly = TRUE) && 
    requireNamespace("MsCoreUtils", quietly = TRUE)) {
  library(SummarizedExperiment)
  library(MsCoreUtils)
  
  # Log2 transformation
  # We can create a new assay in the SummarizedExperiment
  assay(se, "log2") <- log2(assay(se, "intensities") + 1)
  
  # Quantile normalization
  # assay(se, "norm") <- MsCoreUtils::normalizeMethods()[["quantile"]](assay(se, "log2"))
  
  # Example: Summarize feature metadata
  cat("Total number of features:", nrow(se), "\n")
  cat("M/Z range:", paste(range(rowData(se)$mz), collapse = " - "), "\n")
  
  # Accessing sample metadata for group-wise analysis
  colData(se)$Group <- rep(c("Control", "Treatment"), length.out = ncol(se))
  
  # You can now pass this object to other tools like limma or DESeq2
  # print(se)
}
#> Total number of features: 9 
#> M/Z range: 195.087615966797 - 199.174865722656

0.7 Advanced Usage

0.7.1 Advanced Usage

In real data, you might find the input window for separation and injection interval are not accurate due to the lag in sample injection process. It’s suggested to filter high intensity peaks and use get_sfi_params to find the accurate value for accurate instrumental window for separation and injection interval.

# get windows and delta time
sfmsub <- peaklist[peaklist$intensity>1e4,]
class(sfmsub) <- c("sfi_peaks", "data.frame") # Preserve class for method dispatch
# get windows and delta time
sfi_params <- get_sfi_params(sfmsub, n=158, deltart = 5)
#> Window is 632.11
#> No retention time information!
#> Delta retention time is 92.213125

Then you can use those parameters for all data.

windows <- sfi_params['window']
idelta <- sfi_params['idelta']
se <- getsfm(peaklist,
              idelta = idelta, windows = windows,
              n = 158, deltart = 10, minn = 6)
#> 9 peaks found in all blank samples
#> 9 peaks found in at least 6 QC samples
#> 29819 sample peaks found
#> 366 aligned QC peaks found
#> Recover 0.0122740534558503 peaks
#> 371 aligned matrix peaks found
#> Recover 0.011569804487072 peaks
#> 9 peaks found in samples
#> 9 matrix peaks found in samples
#> 0 isomer peaks found in samples
#> 26 isomer matrix peaks found in samples

0.8 Conclusion

The sfi package streamlines the complex task of processing Single File Injection data, converting a continuous stream of interleaved mass spectral data into a structured format ready for analysis. By correctly identifying and aligning features from rapid, sequential injections, it unlocks the throughput potential of SFI-MS for large-scale metabolomics and screening studies.

sessionInfo()
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] MsCoreUtils_1.24.0          ggplot2_4.0.3              
#>  [3] SummarizedExperiment_1.42.0 Biobase_2.72.0             
#>  [5] GenomicRanges_1.64.0        Seqinfo_1.2.0              
#>  [7] IRanges_2.46.0              S4Vectors_0.50.0           
#>  [9] BiocGenerics_0.58.0         generics_0.1.4             
#> [11] MatrixGenerics_1.24.0       matrixStats_1.5.0          
#> [13] sfi_1.0.0                   BiocStyle_2.40.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.10         SparseArray_1.12.0  enviGCMS_0.8.0     
#>  [4] lattice_0.22-9      digest_0.6.39       magrittr_2.0.5     
#>  [7] evaluate_1.0.5      grid_4.6.0          RColorBrewer_1.1-3 
#> [10] bookdown_0.46       fastmap_1.2.0       jsonlite_2.0.0     
#> [13] Matrix_1.7-5        tinytex_0.59        BiocManager_1.30.27
#> [16] scales_1.4.0        jquerylib_0.1.4     abind_1.4-8        
#> [19] cli_3.6.6           rlang_1.2.0         XVector_0.52.0     
#> [22] withr_3.0.2         cachem_1.1.0        DelayedArray_0.38.0
#> [25] yaml_2.3.12         otel_0.2.0          S4Arrays_1.12.0    
#> [28] tools_4.6.0         dplyr_1.2.1         vctrs_0.7.3        
#> [31] R6_2.6.1            magick_2.9.1        lifecycle_1.0.5    
#> [34] clue_0.3-68         MASS_7.3-65         cluster_2.1.8.2    
#> [37] pkgconfig_2.0.3     pillar_1.11.1       bslib_0.10.0       
#> [40] gtable_0.3.6        data.table_1.18.2.1 glue_1.8.1         
#> [43] Rcpp_1.1.1-1.1      tidyselect_1.2.1    tibble_3.3.1       
#> [46] xfun_0.57           knitr_1.51          dichromat_2.0-0.1  
#> [49] farver_2.1.2        htmltools_0.5.9     igraph_2.3.0       
#> [52] rmarkdown_2.31      compiler_4.6.0      S7_0.2.2