xcms 3.22.0
Package: xcms
Authors: Johannes Rainer
Modified: 2023-04-25 14:01:32.005268
Compiled: Tue Apr 25 19:15:42 2023
This documents describes data import, exploration, preprocessing and analysis of
LCMS experiments with xcms version >= 3. The examples and basic workflow was
adapted from the original LC/MS Preprocessing and Analysis with xcms vignette
from Colin A. Smith.
The new user interface and methods use the XCMSnExp object (instead of the
old xcmsSet object) as a container for the pre-processing results. To
support packages and pipelines relying on the xcmsSet object, it is however
possible to convert an XCMSnExp into a xcmsSet object using the as method
(i.e.xset <- as(x, "xcmsSet"), with x being an XCMSnExp object.
xcms supports analysis of LC/MS data from files in (AIA/ANDI) NetCDF,
mzXML and mzML format. For the actual data import Bioconductor’s
mzR is used. For demonstration purpose we will analyze a
subset of the data from [1] in which the metabolic consequences of
knocking out the fatty acid amide hydrolase (FAAH) gene in mice was
investigated. The raw data files (in NetCDF format) are provided with the
faahKO data package. The data set consists of samples from the spinal cords of
6 knock-out and 6 wild-type mice. Each file contains data in centroid mode
acquired in positive ion mode form 200-600 m/z and 2500-4500 seconds. To speed
up processing of this vignette we will restrict the analysis to only 8 files and
to the retention time range from 2500 to 3500 seconds.
Below we load all required packages, locate the raw CDF files within the
faahKO package and build a phenodata data frame describing the experimental
setup. Note that for real experiments it is suggested to define a file (table)
that contains the file names of the raw data files along with descriptions of
the samples for each file as additional columns. Such a file could then be
imported with e.g. read.table as variable pd (instead of being defined
within R as in the example below) and the file names could be passed along to
the readMSData function below with e.g.
files = paste0(MZML_PATH, "/", pd$mzML_file) where MZML_PATH would be the
path to directory in which the files are located and "mzML_file" the name of
the column in the phenodata file that contains the file names.
library(xcms)
library(faahKO)
library(RColorBrewer)
library(pander)
library(magrittr)
library(pheatmap)
library(SummarizedExperiment)
## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
            recursive = TRUE)[c(1, 2, 5, 6, 7, 8, 11, 12)]
## Create a phenodata data.frame
pd <- data.frame(sample_name = sub(basename(cdfs), pattern = ".CDF",
                                   replacement = "", fixed = TRUE),
                 sample_group = c(rep("KO", 4), rep("WT", 4)),
                 stringsAsFactors = FALSE)Subsequently we load the raw data as an OnDiskMSnExp object using the
readMSData method from the MSnbase package. The MSnbase
provides based structures and infrastructure for the processing of mass
spectrometry data. Also, MSnbase can be used to centroid profile-mode MS
data (see the corresponding vignette in the MSnbase package).
raw_data <- readMSData(files = cdfs, pdata = new("NAnnotatedDataFrame", pd),
                       mode = "onDisk")We next restrict the data set to the retention time range from 2500 to 3500 seconds. This is merely to reduce the processing time of this vignette.
raw_data <- filterRt(raw_data, c(2500, 3500))The resulting OnDiskMSnExp object contains general information about the
number of spectra, retention times, the measured total ion current etc, but does
not contain the full raw data (i.e. the m/z and intensity values from each
measured spectrum). Its memory footprint is thus rather small making it an ideal
object to represent large metabolomics experiments while allowing to perform
simple quality controls, data inspection and exploration as well as data
sub-setting operations. The m/z and intensity values are imported from the raw
data files on demand, hence the location of the raw data files should not be
changed after initial data import.
The OnDiskMSnExp organizes the MS data by spectrum and provides the methods
intensity, mz and rtime to access the raw data from the files (the measured
intensity values, the corresponding m/z and retention time values). In addition,
the spectra method could be used to return all data encapsulated in Spectrum
objects. Below we extract the retention time values from the object.
head(rtime(raw_data))## F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
## 2501.378 2502.943 2504.508 2506.073 2507.638 2509.203All data is returned as one-dimensional vectors (a numeric vector for rtime
and a list of numeric vectors for mz and intensity, each containing the
values from one spectrum), even if the experiment consists of multiple
files/samples. The fromFile function returns an integer vector providing
the mapping of the values to the originating file. Below we use the fromFile
indices to organize the mz values by file.
mzs <- mz(raw_data)
## Split the list by file
mzs_by_file <- split(mzs, f = fromFile(raw_data))
length(mzs_by_file)## [1] 8As a first evaluation of the data we plot below the base peak chromatogram (BPC)
for each file in our experiment. We use the chromatogram method and set the
aggregationFun to "max" to return for each spectrum the maximal intensity
and hence create the BPC from the raw data. To create a total ion chromatogram
we could set aggregationFun to sum.
## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(raw_data, aggregationFun = "max")
## Define colors for the two groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:2], "60")
names(group_colors) <- c("KO", "WT")
## Plot all chromatograms.
plot(bpis, col = group_colors[raw_data$sample_group])The chromatogram method returned a MChromatograms object that organizes
individual Chromatogram objects (which in fact contain the chromatographic
data) in a two-dimensional array: columns represent samples and rows
(optionally) m/z and/or retention time ranges. Below we extract the chromatogram
of the first sample and access its retention time and intensity values.
bpi_1 <- bpis[1, 1]
head(rtime(bpi_1))## F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
## 2501.378 2502.943 2504.508 2506.073 2507.638 2509.203head(intensity(bpi_1))## F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
##    43888    43960    43392    42632    42200    42288The chromatogram method supports also extraction of chromatographic data from
a m/z-rt slice of the MS data. In the next section we will use this method to
create an extracted ion chromatogram (EIC) for a selected peak.
Note that chromatogram reads the raw data from each file to calculate the
chromatogram. The bpi and tic methods on the other hand do not read any data
from the raw files but use the respective information that was provided in the
header definition of the input files (which might be different from the actual
data).
Below we create boxplots representing the distribution of total ion currents per file. Such plots can be very useful to spot problematic or failing MS runs.
## Get the total ion current by file
tc <- split(tic(raw_data), f = fromFile(raw_data))
boxplot(tc, col = group_colors[raw_data$sample_group],
        ylab = "intensity", main = "Total ion current")
Figure 1: Distribution of total ion currents per file
Also, we can cluster the samples based on similarity of their base peak
chromatogram. This can also be helpful to spot potentially problematic samples
in an experiment or generally get an initial overview of the sample grouping in
the experiment. Since the retention times between samples are not exactly
identical, we use the bin function to group intensities in fixed time ranges
(bins) along the retention time axis. In the present example we use a bin size
of 1 second, the default is 0.5 seconds. The clustering is performed using
complete linkage hierarchical clustering on the pairwise correlations of the
binned base peak chromatograms.
## Bin the BPC
bpis_bin <- MSnbase::bin(bpis, binSize = 2)
## Calculate correlation on the log2 transformed base peak intensities
cormat <- cor(log2(do.call(cbind, lapply(bpis_bin, intensity))))
colnames(cormat) <- rownames(cormat) <- raw_data$sample_name
## Define which phenodata columns should be highlighted in the plot
ann <- data.frame(group = raw_data$sample_group)
rownames(ann) <- raw_data$sample_name
## Perform the cluster analysis
pheatmap(cormat, annotation = ann,
         annotation_color = list(group = group_colors))
Figure 2: Grouping of samples based on similarity of their base peak chromatogram
The samples cluster in a pairwise manner, the KO and WT samples for the sample index having the most similar BPC.
Next we perform the chromatographic peak detection using the centWave
algorithm [2]. Before running the peak detection it is however
strongly suggested to visually inspect e.g. the extracted ion chromatogram of
internal standards or known compounds to evaluate and adapt the peak detection
settings since the default settings will not be appropriate for most LCMS
experiments. The two most critical parameters for centWave are the peakwidth
(expected range of chromatographic peak widths) and ppm (maximum expected
deviation of m/z values of centroids corresponding to one chromatographic peak;
this is usually much larger than the ppm specified by the manufacturer)
parameters. To evaluate the typical chromatographic peak width we plot the EIC
for one peak.
## Define the rt and m/z range of the peak area
rtr <- c(2700, 2900)
mzr <- c(334.9, 335.1)
## extract the chromatogram
chr_raw <- chromatogram(raw_data, mz = mzr, rt = rtr)
plot(chr_raw, col = group_colors[chr_raw$sample_group])
Figure 3: Extracted ion chromatogram for one peak
Note that Chromatogram objects extracted by the chromatogram method contain
an NA value if in a certain scan (i.e. for a specific retention time) no
signal was measured in the respective mz range. This is reflected by the lines
not being drawn as continuous lines in the plot above.
The peak above has a width of about 50 seconds. The peakwidth parameter should
be set to accommodate the expected widths of peak in the data set. We set it to
20,80 for the present example data set.
For the ppm parameter we extract the full MS data (intensity, retention time
and m/z values) corresponding to the above peak. To this end we first filter the
raw object by retention time, then by m/z and finally plot the object with type = "XIC" to produce the plot below. We use the pipe (%>%) command better
illustrate the corresponding workflow. Note also that in this type of plot
identified chromatographic peaks would be indicated by default if present.
raw_data %>%
    filterRt(rt = rtr) %>%
    filterMz(mz = mzr) %>%
    plot(type = "XIC")
Figure 4: Visualization of the raw MS data for one peak
For each plot: upper panel: chromatogram plotting the intensity values against the retention time, lower panel m/z against retention time plot. The individual data points are colored according to the intensity.
In the present data there is actually no variation in the m/z values. Usually
one would see the m/z values (lower panel) scatter around the real m/z value
of the compound. The first step of the centWave algorithm defines so called
regions of interest (ROI) based on the difference of m/z values from consecutive
scans. In detail, m/z values from consecutive scans are included into a ROI if
the difference between the m/z and the mean m/z of the ROI is smaller than the
user defined ppm parameter. A reasonable choice for the ppm could thus be
the maximal m/z difference of data points from neighboring scans/spectra that
are part of the chromatographic peak. It is suggested to inspect the ranges of
m/z values for many compounds (either internal standards or compounds known to
be present in the sample) and define the ppm parameter for centWave
according to these.
Note that we can also perform the peak detection on the extracted ion
chromatogram. This can help to evaluate different peak detection settings. Only
be aware that peak detection on an extracted ion chromatogram will not consider
the ppm parameter and that the estimation of the background signal is
different to the peak detection on the full data set; values for the snthresh
will hence have different consequences. Below we perform the peak detection with
the findChromPeaks function on the extracted ion chromatogram. The submitted
parameter object defines which algorithm will be used and allows to define the
settings for this algorithm. We use the centWave algorithm with default
settings, except for snthresh.
xchr <- findChromPeaks(chr_raw, param = CentWaveParam(snthresh = 2))We can access the identified chromatographic peaks with the chromPeaks
function.
head(chromPeaks(xchr))##            rt    rtmin    rtmax       into       intb  maxo  sn row column
## [1,] 2781.505 2761.160 2809.674  412134.25  355516.37 16856  13   1      1
## [2,] 2786.199 2764.290 2812.803 1496244.21 1391821.33 58736  20   1      2
## [3,] 2734.556 2714.211 2765.855   21579.37   18449.43   899   4   1      3
## [4,] 2797.154 2775.245 2815.933  159058.78  150289.31  6295  12   1      3
## [5,] 2784.635 2761.160 2808.109   54947.54   37923.53  2715   2   1      4
## [6,] 2859.752 2845.668 2878.532   13895.21   13874.87   905 904   1      4Parallel to the chromPeaks matrix there is also a data frame chromPeakData
that allows to add arbitrary annotations to each chromatographic peak. Below we
extract this data frame that by default contains only the MS level in which the
peak was identified.
chromPeakData(xchr)## DataFrame with 12 rows and 4 columns
##      ms_level is_filled       row    column
##     <integer> <logical> <integer> <integer>
## 1           1     FALSE         1         1
## 2           1     FALSE         1         2
## 3           1     FALSE         1         3
## 4           1     FALSE         1         3
## 5           1     FALSE         1         4
## ...       ...       ...       ...       ...
## 8           1     FALSE         1         4
## 9           1     FALSE         1         5
## 10          1     FALSE         1         6
## 11          1     FALSE         1         7
## 12          1     FALSE         1         8Next we plot the identified chromatographic peaks in the extracted ion chromatogram. We use the col
parameter to color the individual chromatogram lines. Colors can also be
specified for the identified peaks, peakCol for the foreground/border color,
peakBg for the background/fill color. One color has to be provided for each
chromatographic peak listed by chromPeaks. Below we define a color to indicate
the sample group from which the sample is and use the sample information in the
peaks’ "sample" column to assign the correct color to each chromatographic
peak. More peak highlighting options are described further below.
sample_colors <- group_colors[xchr$sample_group]
plot(xchr, col = sample_colors,
     peakBg = sample_colors[chromPeaks(xchr)[, "column"]])
Figure 5: Signal for an example peak
Red and blue colors represent KO and wild type samples, respectively. Peak area of identified chromatographic peaks are highlighted in the sample group color.
Finally we perform the chromatographic peak detection on the full data set. Note
that we set the argument prefilter to c(6, 5000) and noise to 5000 to
reduce the run time of this vignette. With this setting we consider only signals
with a value larger than 5000 in the peak detection step.
cwp <- CentWaveParam(peakwidth = c(20, 80), noise = 5000,
                     prefilter = c(6, 5000))
xdata <- findChromPeaks(raw_data, param = cwp)The results are returned as an XCMSnExp object which extends the
OnDiskMSnExp object by storing also LC/GC-MS preprocessing results. This means
also that all methods to sub-set and filter the data or to access the (raw) data
are inherited from the OnDiskMSnExp object and can thus be re-used. Note also
that it is possible to perform additional rounds of peak detection (e.g. on MS
level > 1 data) on the xdata object by calling findChromPeaks with the
parameter add = TRUE.
The results from the chromatographic peak detection can be accessed with the
chromPeaks method.
head(chromPeaks(xdata))##           mz mzmin mzmax       rt    rtmin    rtmax      into      intb  maxo
## CP0001 453.2 453.2 453.2 2509.203 2501.378 2527.982 1007409.0 1007380.8 38152
## CP0002 236.1 236.1 236.1 2518.593 2501.378 2537.372  253501.0  226896.3 12957
## CP0003 594.0 594.0 594.0 2601.535 2581.191 2637.529  161042.2  149297.3  7850
## CP0004 577.0 577.0 577.0 2604.665 2581.191 2626.574  136105.2  129195.5  6215
## CP0005 369.2 369.2 369.2 2587.451 2556.151 2631.269  483852.3  483777.1  7215
## CP0006 369.2 369.2 369.2 2568.671 2557.716 2578.061  144624.8  144602.9  7033
##           sn sample
## CP0001 38151      1
## CP0002    11      1
## CP0003    13      1
## CP0004    13      1
## CP0005  7214      1
## CP0006  7032      1The returned matrix provides the m/z and retention time range for each
identified chromatographic peak as well as the integrated signal intensity
(“into”) and the maximal peak intensitity (“maxo”). Columns “sample” contains
the index of the sample in the object/experiment in which the peak was
identified.
Annotations for each individual peak can be extracted with the chromPeakData
function. This data frame could also be used to add/store arbitrary annotations
for each detected peak.
chromPeakData(xdata)## DataFrame with 1707 rows and 2 columns
##         ms_level is_filled
##        <integer> <logical>
## CP0001         1     FALSE
## CP0002         1     FALSE
## CP0003         1     FALSE
## CP0004         1     FALSE
## CP0005         1     FALSE
## ...          ...       ...
## CP1703         1     FALSE
## CP1704         1     FALSE
## CP1705         1     FALSE
## CP1706         1     FALSE
## CP1707         1     FALSEPeak detection will not always work perfectly leading to peak detection
artifacts, such as overlapping peaks or artificially split peaks. The
refineChromPeaks function allows to refine peak detection results by either
removing identified peaks not passing a certain criteria or by merging
artificially split chromatographic peaks. With parameter objects
CleanPeaksParam and FilterIntensityParam it is possible to remove peaks with
a retention time range or intensities below a threshold, respectively (see their
respective help pages for more details and examples). With
MergeNeighboringPeaksParam it is possible to merge chromatographic
peaks. Below we post-process the peak detection results merging peaks
overlapping in a 4 second window per file if the signal between in between them
is lower than 75% of the smaller peak’s maximal intensity. See the
MergeNeighboringPeaksParam help page for a detailed description of the
settings and the approach.
mpp <- MergeNeighboringPeaksParam(expandRt = 4)
xdata_pp <- refineChromPeaks(xdata, mpp)An example for a merged peak is given below.
mzr_1 <- 305.1 + c(-0.01, 0.01)
chr_1 <- chromatogram(filterFile(xdata, 1), mz = mzr_1)
chr_2 <- chromatogram(filterFile(xdata_pp, 1), mz = mzr_1)
par(mfrow = c(1, 2))
plot(chr_1)
plot(chr_2)
Figure 6: Result from the peak refinement step
Left: data before processing, right: after refinement. The splitted peak was merged into one.
For the first trace in the chromatogram above centWave detected 3 peaks (1 for
the full area and two smaller ones, see left panel in the plot above). The peak
refinement with MergeNeighboringPeaksParam reduced them to a single peak
(right panel in the figure above). Note that this refinement does not merge
neighboring peaks for which the signal in between them is lower than a certain
proportion (see figure below).
mzr_1 <- 496.2 + c(-0.01, 0.01)
chr_1 <- chromatogram(filterFile(xdata, 1), mz = mzr_1)
chr_2 <- chromatogram(filterFile(xdata_pp, 1), mz = mzr_1)
par(mfrow = c(1, 2))
plot(chr_1)
plot(chr_2)
Figure 7: Result from the peak refinement step
Left: data before processing, right: after refinement. The peaks were not merged.
Note also that it is possible to perform the peak refinement on extracted ion
chromatograms. This could e.g. be used to fine-tune the settings for the
parameter. To illustrate this we perform below a peak refinement on the
extracted ion chromatogram chr_1 reducing the minProp parameter to force
joining the two peaks.
res <- refineChromPeaks(chr_1, MergeNeighboringPeaksParam(minProp = 0.05))
chromPeaks(res)##         mz mzmin mzmax       rt    rtmin    rtmax     into intb    maxo   sn
## CPM1 496.2 496.2 496.2 3384.012 3294.809 3412.181 45940118   NA 1128960 1255
##      sample row column
## CPM1      1   1      1plot(res)Before proceeding we replace the xdata object with the results from the peak
refinement.
xdata <- xdata_ppBelow we use the data from the chromPeaks matrix to calculate some per-file
summaries.
summary_fun <- function(z)
    c(peak_count = nrow(z), rt = quantile(z[, "rtmax"] - z[, "rtmin"]))
T <- lapply(split.data.frame(
    chromPeaks(xdata), f = chromPeaks(xdata)[, "sample"]),
    FUN = summary_fun)
T <- do.call(rbind, T)
rownames(T) <- basename(fileNames(xdata))
pandoc.table(
    T,
    caption = paste0("Summary statistics on identified chromatographic",
                     " peaks. Shown are number of identified peaks per",
                     " sample and widths/duration of chromatographic ",
                     "peaks."))| peak_count | rt.0% | rt.25% | rt.50% | rt.75% | rt.100% | |
|---|---|---|---|---|---|---|
| ko15.CDF | 224 | 10.95 | 29.73 | 40.69 | 51.64 | 319.2 | 
| ko16.CDF | 280 | 9.39 | 32.86 | 44.6 | 54.77 | 150.2 | 
| ko21.CDF | 132 | 10.95 | 42.25 | 50.08 | 64.55 | 164.3 | 
| ko22.CDF | 149 | 10.95 | 28.17 | 45.38 | 57.9 | 118.9 | 
| wt15.CDF | 260 | 10.95 | 27.78 | 42.25 | 51.64 | 161.2 | 
| wt16.CDF | 180 | 10.95 | 34.43 | 45.38 | 54.77 | 142.4 | 
| wt21.CDF | 147 | 9.389 | 32.08 | 46.95 | 63.38 | 133 | 
| wt22.CDF | 204 | 10.95 | 34.43 | 43.82 | 56.34 | 137.7 | 
We can also plot the location of the identified chromatographic peaks in the
m/z - retention time space for one file using the plotChromPeaks
function. Below we plot the chromatographic peaks for the 3rd sample.
plotChromPeaks(xdata, file = 3)
Figure 8: Identified chromatographic peaks in the m/z by retention time space for one sample
To get a global overview of the peak detection we can plot the frequency of identified peaks per file along the retention time axis. This allows to identify time periods along the MS run in which a higher number of peaks was identified and evaluate whether this is consistent across files.
plotChromPeakImage(xdata)
Figure 9: Frequency of identified chromatographic peaks along the retention time axis
The frequency is color coded with higher frequency being represented by yellow-white. Each line shows the peak frequency for one file.
Next we highlight the identified chromatographic peaks for the example peak from
before. Evaluating such plots on a list of peaks corresponding to known peaks or
internal standards helps to ensure that peak detection settings were appropriate
and correctly identified the expected peaks. We extract the ion chromatogram
from the peak detection result object, which contains then also the identified
chromatographic peaks for that ion that we can extract with the chromPeaks
function.
chr_ex <- chromatogram(xdata, mz = mzr, rt = rtr)
chromPeaks(chr_ex)##         mz mzmin mzmax       rt    rtmin    rtmax      into      intb  maxo sn
## CP0045 335   335   335 2781.505 2761.160 2809.674  412134.3  383167.4 16856 23
## CP0309 335   335   335 2786.199 2764.290 2812.803 1496244.2 1461187.3 58736 72
## CP0587 335   335   335 2797.154 2775.245 2815.933  159058.8  149229.6  6295 13
## CP1194 335   335   335 2786.199 2764.290 2812.803  932645.2  915333.8 35856 66
## CP1378 335   335   335 2792.461 2768.987 2823.760  876585.5  848569.1 27200 36
##        sample row column
## CP0045      1   1      1
## CP0309      2   1      2
## CP0587      3   1      3
## CP1194      6   1      6
## CP1378      7   1      7We can also plot the extracted ion chromatogram. Identified chromatographic peaks will be automatically highlighted in the plot. Below we highlight chromatographic peaks with a rectangle from the peak’s minimal to maximal rt and from an intensity of 0 to the maximal signal of the peak.
sample_colors <- group_colors[chr_ex$sample_group]
plot(chr_ex, col = sample_colors, peakType = "rectangle",
     peakCol = sample_colors[chromPeaks(chr_ex)[, "sample"]],
     peakBg = NA)
Figure 10: Signal for an example peak
Red and blue colors represent KO and wild type samples, respectively. The rectangles indicate the identified chromatographic peaks per sample.
Alternatively to the rectangle visualization above, it is possible to represent
the apex position of each peak with a single point (passing argument
type = "point" to the function), or draw the actually identified peak by
specifying type = "polygon". To completely omit highlighting the identified
peaks (e.g. to plot base peak chromatograms or similar) type = "none" can be
used. Below we use type = "polygon" to fill the peak area
for each identified chromatographic peak in each sample. Whether individual
peaks can be still identified in such a plot depends however on the number of
samples from which peaks are drawn.
plot(chr_ex, col = group_colors[chr_raw$sample_group], lwd = 2,
     peakBg = sample_colors[chromPeaks(chr_ex)[, "sample"]])
Figure 11: Signal for an example peak
Red and blue colors represent KO and wild type samples, respectively. The signal area of identified chromatographic peaks are filled with a color.
Note that we can also specifically extract identified chromatographic peaks for
a selected region by providing the respective m/z and retention time ranges with
the mz and rt arguments in the chromPeaks method.
pander(chromPeaks(xdata, mz = mzr, rt = rtr),
       caption = paste("Identified chromatographic peaks in a selected ",
                       "m/z and retention time range."))| mz | mzmin | mzmax | rt | rtmin | rtmax | into | intb | |
|---|---|---|---|---|---|---|---|---|
| CP0045 | 335 | 335 | 335 | 2782 | 2761 | 2810 | 412134 | 383167 | 
| CP0309 | 335 | 335 | 335 | 2786 | 2764 | 2813 | 1496244 | 1461187 | 
| CP0587 | 335 | 335 | 335 | 2797 | 2775 | 2816 | 159059 | 149230 | 
| CP1194 | 335 | 335 | 335 | 2786 | 2764 | 2813 | 932645 | 915334 | 
| CP1378 | 335 | 335 | 335 | 2792 | 2769 | 2824 | 876586 | 848569 | 
| maxo | sn | sample | |
|---|---|---|---|
| CP0045 | 16856 | 23 | 1 | 
| CP0309 | 58736 | 72 | 2 | 
| CP0587 | 6295 | 13 | 3 | 
| CP1194 | 35856 | 66 | 6 | 
| CP1378 | 27200 | 36 | 7 | 
Finally we plot also the distribution of peak intensity per sample. This allows to investigate whether systematic differences in peak signals between samples are present.
## Extract a list of per-sample peak intensities (in log2 scale)
ints <- split(log2(chromPeaks(xdata)[, "into"]),
              f = chromPeaks(xdata)[, "sample"])
boxplot(ints, varwidth = TRUE, col = group_colors[xdata$sample_group],
        ylab = expression(log[2]~intensity), main = "Peak intensities")
grid(nx = NA, ny = NULL)
Figure 12: Peak intensity distribution per sample
Note that in addition to the above described identification of chromatographic
peaks, it is also possible to manually define and add chromatographic peaks
with the manualChromPeaks function (see ?manualChromPeaks help page for more
information).
The time at which analytes elute in the chromatography can vary between samples (and even compounds). Such a difference was already observable in the extracted ion chromatogram plot shown as an example in the previous section. The alignment step, also referred to as retention time correction, aims at adjusting this by shifting signals along the retention time axis to align the signals between different samples within an experiment.
A plethora of alignment algorithms exist (see [3]), with some of
them being implemented also in xcms. The method to perform the
alignment/retention time correction in xcms is adjustRtime which uses
different alignment algorithms depending on the provided parameter class.
In the example below we use the obiwarp method [4] to align the
samples. We use a binSize = 0.6 which creates warping functions in mz bins of
0.6. Also here it is advisable to modify the settings for each experiment and
evaluate if retention time correction did align internal controls or known
compounds properly.
xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 0.6))adjustRtime, besides calculating adjusted retention times for each spectrum,
does also adjust the reported retention times of the identified chromatographic
peaks.
To extract the adjusted retention times we can use the adjustedRtime method,
or simply the rtime method that, if present, returns by default adjusted
retention times from an XCMSnExp object.
## Extract adjusted retention times
head(adjustedRtime(xdata))## F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
## 2501.378 2502.958 2504.538 2506.118 2507.699 2509.280## Or simply use the rtime method
head(rtime(xdata))## F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
## 2501.378 2502.958 2504.538 2506.118 2507.699 2509.280Raw retention times can be extracted from an XCMSnExp containing
aligned data with rtime(xdata, adjusted = FALSE).
To evaluate the impact of the alignment we plot the BPC on the adjusted data. In
addition we plot the differences of the adjusted- to the raw retention times per
sample using the plotAdjustedRtime function. For a base peak chromatogram it
makes no sense to also extract identified chromatographic peaks from the result
object. We thus use parameter include = "none" in the chromatogram call to
not include chromatographic peaks in the returned object. Note that
alternatively it would also be possible to simply avoid plotting them by setting
peakType = "none" in the plot call.
## Get the base peak chromatograms.
bpis_adj <- chromatogram(xdata, aggregationFun = "max", include = "none")
par(mfrow = c(2, 1), mar = c(4.5, 4.2, 1, 0.5))
plot(bpis_adj, col = group_colors[bpis_adj$sample_group])
## Plot also the difference of adjusted to raw retention time.
plotAdjustedRtime(xdata, col = group_colors[xdata$sample_group])
Figure 13: Obiwarp aligned data
Base peak chromatogram after alignment (top) and difference between adjusted and raw retention times along the retention time axis (bottom).
Too large differences between adjusted and raw retention times could indicate poorly performing samples or alignment.
Note: XCMSnExp objects hold the raw along with the adjusted retention
times and subsetting will in most cases drop the adjusted retention
times. Sometimes it might thus be useful to replace the raw retention times
with the adjusted retention times. This can be done with the
applyAdjustedRtime.
At last we evaluate the impact of the alignment on the test peak.
par(mfrow = c(2, 1))
## Plot the raw data
plot(chr_raw, col = group_colors[chr_raw$sample_group])
## Extract the chromatogram from the adjusted object
chr_adj <- chromatogram(xdata, rt = rtr, mz = mzr)
plot(chr_adj, col = group_colors[chr_raw$sample_group], peakType = "none")
Figure 14: Example extracted ion chromatogram before (top) and after alignment (bottom)
In some experiments it might be helpful to perform the alignment based on only a
subset of the samples, e.g. if QC samples were injected at regular intervals or
if the experiment contains blanks. Alignment method in xcms allow to
estimate retention time drifts on a subset of samples (either all samples
excluding blanks or QC samples injected at regular intervals during a
measurement run) and use these to adjust the full data set.
Parameters subset (of the PeakGroupsParam or ObiwarpParam object) can be
used to define the subset of samples on which the alignment of the full data set
will be based (e.g. subset being the index of QC samples), and parameter
subsetAdjust allows to specify the method by which the left-out samples will
be adjusted. There are currently two options available:
subsetAdjust = "previous": adjust the retention times of a non-subset
sample based on the alignment results of the previous subset sample (e.g. a
QC sample). If samples are e.g. in the order A1, B1, B2, A2, B3,
B4 with A representing QC samples and B study samples, using
subset = c(1, 4) and subsetAdjust = "previous" would result in all A
samples to be aligned with each other and non-subset samples B1 and B2
being adjusted based on the alignment result of subset samples A1 and B3
and B4 on those of A2.
subsetAdjust = "average": adjust retention times of non-subset samples based
on an interpolation of the alignment results of the previous and subsequent
subset sample. In the example above, B1 would be adjusted based on the
average of adjusted retention times between subset (QC) samples A1 and
A2. Since there is no subset sample after non-subset samples B3 and B4
these will be adjusted based on the alignment results of A2 alone. Note
that a weighted average is used to calculate the adjusted retention time
averages, which uses the inverse of the difference of the index of the
non-subset sample to the subset samples as weights. Thus, if we have a
setup like A1, B1, B2, A2 the adjusted retention times of A1
would get a larger weight than those of A2 in the adjustment of
non-subset sample B1 causing it’s adjusted retention times to be closer
to those of A1 than to A2. See below for examples.
Both cases require a meaningful/correct ordering of the samples within the object (e.g. ordering by injection index).
The examples below aim to illustrate the effect of these alignment options. We
assume that samples 1, 4 and 7 in the faahKO data set are QC samples (sample
pools). We thus want to perform the alignment based on these samples and
subsequently adjust the retention times of the left-out samples (2, 3, 5, 6 and
8) based on interpolation of the results from the neighboring subset (QC)
samples. After initial peak grouping we perform below the alignment with the
peak groups method passing the indices of the samples on which we want the
alignment to be based on with the subset argument and specify subsetAdjust = "average" to adjust the study samples based on interpolation of the alignment
results from neighboring subset/QC samples.
Note that for any subset-alignment all parameters such as minFraction are
relative to the subset, not the full experiment!
To re-perform an alignment we can first remove previous alignment results with
the dropAdjustedRtime function.
xdata <- dropAdjustedRtime(xdata)
## Define the experimental layout
xdata$sample_type <- "study"
xdata$sample_type[c(1, 4, 7)] <- "QC"We next have to perform an initial correspondence analysis because the peak
groups alignment method adjusts the retention time by aligning previously
identified hook peaks (chromatographic peaks present in most/all samples;
details about the algorithm used are presented in the next section). We use here
the default settings, but it is strongly advised to adapt the parameters for
each data set. The definition of the sample groups (i.e. assignment of
individual samples to the sample groups in the experiment) is mandatory for the
PeakDensityParam. If there are no sample groups in the experiment
sampleGroups should be set to a single value for each file (e.g. rep(1, length(fileNames(xdata))).
## Initial peak grouping. Use sample_type as grouping variable
pdp_subs <- PeakDensityParam(sampleGroups = xdata$sample_type,
                             minFraction = 0.9)
xdata <- groupChromPeaks(xdata, param = pdp_subs)
## Define subset-alignment options and perform the alignment
pgp_subs <- PeakGroupsParam(minFraction = 0.85,
                            subset = which(xdata$sample_type == "QC"),
                            subsetAdjust = "average", span = 0.4)
xdata <- adjustRtime(xdata, param = pgp_subs)Below we plot the results of the alignment labeling the samples being part of
the subset in green and the others in grey. This nicely shows how the
interpolation of the subsetAdjust = "average" works: retention times of sample
2 are adjusted based on those from subset sample 1 and 4, giving however more
weight to the closer subset sample 1 which results in the adjusted retention
times of 2 being more similar to those of sample 1. Sample 3 on the other hand
gets adjusted giving more weight to the second subset sample (4).
clrs <- rep("#00000040", 8)
clrs[xdata$sample_type == "QC"] <- c("#00ce0080")
par(mfrow = c(2, 1), mar = c(4, 4.5, 1, 0.5))
plot(chromatogram(xdata, aggregationFun = "sum"),
     col = clrs, peakType = "none")
plotAdjustedRtime(xdata, col = clrs, peakGroupsPch = 1,
                  peakGroupsCol = "#00ce0040")
Figure 15: Subset-alignment results with option average
Difference between adjusted and raw retention times along the retention time axis. Samples on which the alignment models were estimated are shown in green, study samples in grey.
Option subsetAdjust = "previous" adjusts the retention times of a non-subset
sample based on a single subset sample (the previous), which results in most
cases in the adjusted retention times of the non-subset sample being highly
similar to those of the subset sample which was used for adjustment.
The final step in the metabolomics preprocessing is the correspondence that
matches detected chromatographic peaks between samples (and depending on the
settings, also within samples if they are adjacent). The method to perform the
correspondence in xcms is groupChromPeaks. We will use the peak density
method [5] to group chromatographic peaks. The algorithm combines
chromatographic peaks depending on the density of peaks along the retention time
axis within small slices along the mz dimension. To illustrate this we plot
below the chromatogram for an mz slice with multiple chromatographic peaks
within each sample. We use below a value of 0.4 for the minFraction parameter
hence only chromatographic peaks present in at least 40% of the samples per
sample group are grouped into a feature. The sample group assignment is
specified with the sampleGroups argument.
## Define the mz slice.
mzr <- c(305.05, 305.15)
## Extract and plot the chromatograms
chr_mzr <- chromatogram(xdata, mz = mzr)
## Define the parameters for the peak density method
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
                        minFraction = 0.4, bw = 30)
plotChromPeakDensity(chr_mzr, col = sample_colors, param = pdp,
                     peakBg = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
                     peakCol = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
                     peakPch = 16)
Figure 16: Example for peak density correspondence
Upper panel: chromatogram for an mz slice with multiple chromatographic peaks. lower panel: identified chromatographic peaks at their retention time (x-axis) and index within samples of the experiments (y-axis) for different values of the bw parameter. The black line represents the peak density estimate. Grouping of peaks (based on the provided settings) is indicated by grey rectangles.
The upper panel in the plot above shows the extracted ion chromatogram for each sample with the detected peaks highlighted. The middle and lower plot shows the retention time for each detected peak within the different samples. The black solid line represents the density distribution of detected peaks along the retention times. Peaks combined into features (peak groups) are indicated with grey rectangles. This type of visualization is ideal to test correspondence settings on example m/z slices before applying them to the full data set.
Below we perform the correspondence analysis with the defined settings on the full data set.
## Perform the correspondence
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
                        minFraction = 0.4, bw = 30)
xdata <- groupChromPeaks(xdata, param = pdp)Results from the xcms-based preprocessing can be summarized into a
SummarizedExperiment object from the SummarizedExperiment
package with the quantify method. This object will contain the feature
abundances as the assay matrix, the feature definition (their m/z, retention
time and other metadata) as rowData (i.e. row annotations) and the
sample/phenotype information as colData (i.e. column annotations). All the
processing history will be put into the object’s metadata. This object can
then be used for any further (xcms-independent) processing and analysis.
Below we use quantify to generate the result object for the present
analysis. The parameters value and any other additional parameters are passed
along to the featureValues method that is used internally to create the
feature abundance matrix.
res <- quantify(xdata, value = "into")Sample annotations can be accessed with the colData method.
colData(res)## DataFrame with 8 rows and 3 columns
##          sample_name sample_group sample_type
##          <character>  <character> <character>
## ko15.CDF        ko15           KO          QC
## ko16.CDF        ko16           KO       study
## ko21.CDF        ko21           KO       study
## ko22.CDF        ko22           KO          QC
## wt15.CDF        wt15           WT       study
## wt16.CDF        wt16           WT       study
## wt21.CDF        wt21           WT          QC
## wt22.CDF        wt22           WT       studyFeature annotations with rowData:
rowData(res)## DataFrame with 225 rows and 11 columns
##           mzmed     mzmin     mzmax     rtmed     rtmin     rtmax    npeaks
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FT001     200.1     200.1     200.1   2901.63   2880.73   2922.53         2
## FT002     205.0     205.0     205.0   2789.39   2782.30   2795.36         8
## FT003     206.0     206.0     206.0   2788.73   2780.73   2792.86         7
## FT004     207.1     207.1     207.1   2718.12   2713.21   2726.70         7
## FT005     219.1     219.1     219.1   2518.82   2517.40   2520.81         3
## ...         ...       ...       ...       ...       ...       ...       ...
## FT221    591.30     591.3     591.3   3005.03   2992.87   3006.05         5
## FT222    592.15     592.1     592.3   3022.11   2981.91   3107.59         6
## FT223    594.20     594.2     594.2   3418.16   3359.10   3427.90         3
## FT224    595.25     595.2     595.3   3010.15   2992.87   3013.77         6
## FT225    596.20     596.2     596.2   2997.91   2992.87   3002.95         2
##              KO        WT         peakidx  ms_level
##       <numeric> <numeric>          <list> <integer>
## FT001         2         0         287,679         1
## FT002         4         4  47,272,542,...         1
## FT003         3         4  32,259,663,...         1
## FT004         4         3  19,249,525,...         1
## FT005         1         2   639, 788,1376         1
## ...         ...       ...             ...       ...
## FT221         2         3 349,684,880,...         1
## FT222         1         3  86,861,862,...         1
## FT223         1         2   604, 985,1543         1
## FT224         2         3  67,353,876,...         1
## FT225         0         2        866,1447         1The feature abundances can be accessed with the assay method. Note also that a
SummarizedExperiment supports multiple such assay matrices.
head(assay(res))##        ko15.CDF  ko16.CDF  ko21.CDF  ko22.CDF  wt15.CDF  wt16.CDF  wt21.CDF
## FT001        NA  506848.9        NA  169955.6        NA        NA        NA
## FT002 1924712.0 1757151.0 1383416.7 1180288.2 2129885.1 1634342.0 1623589.2
## FT003  213659.3  289500.7        NA  178285.7  253825.6  241844.4  240606.0
## FT004  349011.5  451863.7  343897.8  208002.8  364609.8  360908.9        NA
## FT005        NA        NA        NA  107348.5  223951.8        NA        NA
## FT006  286221.4        NA  164009.0  149097.6  255697.7  311296.8  366441.5
##         wt22.CDF
## FT001         NA
## FT002 1354004.93
## FT003  185399.47
## FT004  221937.53
## FT005   84772.92
## FT006  271128.02In addition it is possible to extract the results from the correspondence
analysis individually using the featureDefinitions and featureValues
methods, the former returning a DataFrame with the definition of the features
(i.e. the mz and rt ranges and, in column peakidx, the index of the
chromatographic peaks in the chromPeaks matrix for each feature), the latter
the feature abundances.
## Extract the feature definitions
featureDefinitions(xdata)## DataFrame with 225 rows and 11 columns
##           mzmed     mzmin     mzmax     rtmed     rtmin     rtmax    npeaks
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FT001     200.1     200.1     200.1   2901.63   2880.73   2922.53         2
## FT002     205.0     205.0     205.0   2789.39   2782.30   2795.36         8
## FT003     206.0     206.0     206.0   2788.73   2780.73   2792.86         7
## FT004     207.1     207.1     207.1   2718.12   2713.21   2726.70         7
## FT005     219.1     219.1     219.1   2518.82   2517.40   2520.81         3
## ...         ...       ...       ...       ...       ...       ...       ...
## FT221    591.30     591.3     591.3   3005.03   2992.87   3006.05         5
## FT222    592.15     592.1     592.3   3022.11   2981.91   3107.59         6
## FT223    594.20     594.2     594.2   3418.16   3359.10   3427.90         3
## FT224    595.25     595.2     595.3   3010.15   2992.87   3013.77         6
## FT225    596.20     596.2     596.2   2997.91   2992.87   3002.95         2
##              KO        WT         peakidx  ms_level
##       <numeric> <numeric>          <list> <integer>
## FT001         2         0         287,679         1
## FT002         4         4  47,272,542,...         1
## FT003         3         4  32,259,663,...         1
## FT004         4         3  19,249,525,...         1
## FT005         1         2   639, 788,1376         1
## ...         ...       ...             ...       ...
## FT221         2         3 349,684,880,...         1
## FT222         1         3  86,861,862,...         1
## FT223         1         2   604, 985,1543         1
## FT224         2         3  67,353,876,...         1
## FT225         0         2        866,1447         1The featureValues method returns a matrix with rows being features and
columns samples. The content of this matrix can be defined using the value
argument. The default value = "into" returns a matrix with the integrated
signal of the peaks corresponding to a feature in a sample. Any column name of
the chromPeaks matrix can be passed to the argument value. Below we extract
the integrated peak intensity per feature/sample.
## Extract the into column for each feature.
head(featureValues(xdata, value = "into"))##        ko15.CDF  ko16.CDF  ko21.CDF  ko22.CDF  wt15.CDF  wt16.CDF  wt21.CDF
## FT001        NA  506848.9        NA  169955.6        NA        NA        NA
## FT002 1924712.0 1757151.0 1383416.7 1180288.2 2129885.1 1634342.0 1623589.2
## FT003  213659.3  289500.7        NA  178285.7  253825.6  241844.4  240606.0
## FT004  349011.5  451863.7  343897.8  208002.8  364609.8  360908.9        NA
## FT005        NA        NA        NA  107348.5  223951.8        NA        NA
## FT006  286221.4        NA  164009.0  149097.6  255697.7  311296.8  366441.5
##         wt22.CDF
## FT001         NA
## FT002 1354004.93
## FT003  185399.47
## FT004  221937.53
## FT005   84772.92
## FT006  271128.02This feature matrix contains NA for samples in which no chromatographic peak
was detected in the feature’s m/z-rt region. While in many cases there might
indeed be no peak signal in the respective region, it might also be that there
is signal, but the peak detection algorithm failed to detect a chromatographic
peak (e.g. because the signal was too low or too noisy). xcms provides
the fillChromPeaks method to fill in intensity data for such missing values
from the original files. The filled in peaks are added to the chromPeaks
matrix and indicated with a value TRUE in the "is_filled" column of
the chromPeakData data frame. Below we perform such a gap filling.
xdata <- fillChromPeaks(xdata, param = ChromPeakAreaParam())
head(featureValues(xdata))##        ko15.CDF   ko16.CDF   ko21.CDF  ko22.CDF  wt15.CDF  wt16.CDF  wt21.CDF
## FT001  159738.1  506848.88  113441.08  169955.6  216096.6  145509.7  230477.9
## FT002 1924712.0 1757150.96 1383416.72 1180288.2 2129885.1 1634342.0 1623589.2
## FT003  213659.3  289500.67  162897.19  178285.7  253825.6  241844.4  240606.0
## FT004  349011.5  451863.66  343897.76  208002.8  364609.8  360908.9  223322.5
## FT005  135978.5   25524.79   71530.84  107348.5  223951.8  134398.9  190203.8
## FT006  286221.4  289908.23  164008.97  149097.6  255697.7  311296.8  366441.5
##         wt22.CDF
## FT001  140551.30
## FT002 1354004.93
## FT003  185399.47
## FT004  221937.53
## FT005   84772.92
## FT006  271128.02For features without detected peaks in a sample, the method extracts all
intensities in the mz-rt region of the feature, integrates the signal and adds a
filled-in peak to the chromPeaks matrix. No peak is added if no signal is
measured/available for the mz-rt region of the feature. For these, even after
filling in missing peak data, a NA is reported in the featureValues matrix.
Different options to define the mz-rt region of the features are
available. With the ChromPeakAreaParam() parameter object used above, the
feature area is defined using the m/z and rt ranges of all of its (detected)
chromatographic peaks: the lower m/z value of the area is defined as the lower
quartile (25% quantile) of the "mzmin" values of all peaks of the feature,
the upper m/z value as the upper quartile (75% quantile) of the "mzmax"
values, the lower rt value as the lower quartile (25% quantile) of the "rtmin"
and the upper rt value as the upper quartile (75% quantile) of the "rtmax"
values. This ensures that the signal is integrated from a feature-specific area.
Alternatively, it is possible to use the FillChromPeaksParam parameter object
in the fillChromPeaks call, which resembles the approach of the original (old)
xcms implementation.
Below we compare the number of missing values before and after filling in
missing values. We can use the parameter filled of the featureValues method
to define whether or not filled-in peak values should be returned too.
## Missing values before filling in peaks
apply(featureValues(xdata, filled = FALSE), MARGIN = 2,
      FUN = function(z) sum(is.na(z)))## ko15.CDF ko16.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF wt21.CDF wt22.CDF 
##       71       69      114      102       54       89      104       75## Missing values after filling in peaks
apply(featureValues(xdata), MARGIN = 2,
      FUN = function(z) sum(is.na(z)))## ko15.CDF ko16.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF wt21.CDF wt22.CDF 
##        4        4        6        6        2        6        6        3Next we use the featureSummary function to get a general per-feature summary
that includes the number of samples in which a peak was found or the number of
samples in which more than one peak was assigned to the feature. Specifying also
sample groups breaks down these summary statistics for each individual sample
group.
head(featureSummary(xdata, group = xdata$sample_group))##       count  perc multi_count multi_perc       rsd KO_count KO_perc
## FT001     2  25.0           0          0 0.7039537        2      50
## FT002     8 100.0           0          0 0.1936518        4     100
## FT003     7  87.5           0          0 0.1717662        3      75
## FT004     7  87.5           0          0 0.2609145        4     100
## FT005     3  37.5           0          0 0.5385767        1      25
## FT006     7  87.5           0          0 0.3016973        3      75
##       KO_multi_count KO_multi_perc    KO_rsd WT_count WT_perc WT_multi_count
## FT001              0             0 0.7039537        0       0              0
## FT002              0             0 0.2178920        4     100              0
## FT003              0             0 0.2501505        4     100              0
## FT004              0             0 0.2957873        3      75              0
## FT005              0             0        NA        2      50              0
## FT006              0             0 0.3765933        4     100              0
##       WT_multi_perc    WT_rsd
## FT001             0        NA
## FT002             0 0.1918936
## FT003             0 0.1327983
## FT004             0 0.2575039
## FT005             0 0.6375539
## FT006             0 0.1641781We can add the feature value matrix with the filled-in data for missing peaks
also to our SummarizedExperiment object res as an additional assay:
assays(res)$raw_filled <- featureValues(xdata, filled = TRUE)We have now two matrices (assays) available, the matrix with the detected and the matrix with the detected and filled-in values, each can be accessed by their name.
assayNames(res)## [1] "raw"        "raw_filled"head(assay(res, "raw"))##        ko15.CDF  ko16.CDF  ko21.CDF  ko22.CDF  wt15.CDF  wt16.CDF  wt21.CDF
## FT001        NA  506848.9        NA  169955.6        NA        NA        NA
## FT002 1924712.0 1757151.0 1383416.7 1180288.2 2129885.1 1634342.0 1623589.2
## FT003  213659.3  289500.7        NA  178285.7  253825.6  241844.4  240606.0
## FT004  349011.5  451863.7  343897.8  208002.8  364609.8  360908.9        NA
## FT005        NA        NA        NA  107348.5  223951.8        NA        NA
## FT006  286221.4        NA  164009.0  149097.6  255697.7  311296.8  366441.5
##         wt22.CDF
## FT001         NA
## FT002 1354004.93
## FT003  185399.47
## FT004  221937.53
## FT005   84772.92
## FT006  271128.02head(assay(res, "raw_filled"))##        ko15.CDF   ko16.CDF   ko21.CDF  ko22.CDF  wt15.CDF  wt16.CDF  wt21.CDF
## FT001  159738.1  506848.88  113441.08  169955.6  216096.6  145509.7  230477.9
## FT002 1924712.0 1757150.96 1383416.72 1180288.2 2129885.1 1634342.0 1623589.2
## FT003  213659.3  289500.67  162897.19  178285.7  253825.6  241844.4  240606.0
## FT004  349011.5  451863.66  343897.76  208002.8  364609.8  360908.9  223322.5
## FT005  135978.5   25524.79   71530.84  107348.5  223951.8  134398.9  190203.8
## FT006  286221.4  289908.23  164008.97  149097.6  255697.7  311296.8  366441.5
##         wt22.CDF
## FT001  140551.30
## FT002 1354004.93
## FT003  185399.47
## FT004  221937.53
## FT005   84772.92
## FT006  271128.02The performance of peak detection, alignment and correspondence should always be
evaluated by inspecting extracted ion chromatograms e.g. of known compounds,
internal standards or identified features in general. The featureChromatograms
function allows to extract chromatograms for each feature present in
featureDefinitions. The returned MChromatograms object contains an ion
chromatogram for each feature (each row containing the data for one feature) and
sample (each column representing containing data for one sample). Below we
extract the chromatograms for the first 4 features.
feature_chroms <- featureChromatograms(xdata, features = 1:4)
feature_chroms## XChromatograms with 4 rows and 8 columns
##                    1               2               3               4
##      <XChromatogram> <XChromatogram> <XChromatogram> <XChromatogram>
## [1,]        peaks: 0        peaks: 1        peaks: 0        peaks: 1
## [2,]        peaks: 1        peaks: 1        peaks: 1        peaks: 1
## [3,]        peaks: 1        peaks: 1        peaks: 0        peaks: 1
## [4,]        peaks: 1        peaks: 1        peaks: 1        peaks: 1
##                    5               6               7               8
##      <XChromatogram> <XChromatogram> <XChromatogram> <XChromatogram>
## [1,]        peaks: 0        peaks: 0        peaks: 0        peaks: 0
## [2,]        peaks: 1        peaks: 1        peaks: 1        peaks: 1
## [3,]        peaks: 1        peaks: 1        peaks: 1        peaks: 1
## [4,]        peaks: 1        peaks: 1        peaks: 0        peaks: 1
## phenoData with 3 variables
## featureData with 5 variables
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
##  method: centWave 
## Correspondence:
##  method: chromatographic peak density 
##  4 feature(s) identified.And plot the extracted ion chromatograms. We again use the group color for each identified peak to fill the area.
plot(feature_chroms, col = sample_colors,
     peakBg = sample_colors[chromPeaks(feature_chroms)[, "sample"]])
Figure 17: Extracted ion chromatograms for features 1 to 4
To access the EICs of the second feature we can simply subset the
feature_chroms object.
eic_2 <- feature_chroms[2, ]
chromPeaks(eic_2)##         mz mzmin mzmax       rt    rtmin    rtmax    into    intb  maxo sn
## CP0055 205   205   205 2790.427 2770.441 2813.596 1924712 1850331 84280 64
## CP0310 205   205   205 2794.406 2772.731 2819.327 1757151 1711473 68384 69
## CP0595 205   205   205 2795.358 2773.524 2820.417 1383417 1334570 47384 54
## CP0736 205   205   205 2788.495 2768.076 2812.080 1180288 1126958 48336 32
## CP0921 205   205   205 2782.296 2761.887 2805.849 2129885 2054677 93312 44
## CP1197 205   205   205 2787.083 2766.688 2812.188 1634342 1566379 67984 53
## CP1379 205   205   205 2790.294 2763.639 2821.635 1623589 1531573 49208 28
## CP1542 205   205   205 2787.159 2766.777 2812.235 1354005 1299188 55712 35
##        sample row column
## CP0055      1   1      1
## CP0310      2   1      2
## CP0595      3   1      3
## CP0736      4   1      4
## CP0921      5   1      5
## CP1197      6   1      6
## CP1379      7   1      7
## CP1542      8   1      8At last we perform a principal component analysis to evaluate the grouping of the samples in this experiment. Note that we did not perform any data normalization hence the grouping might (and will) also be influenced by technical biases.
## Extract the features and log2 transform them
ft_ints <- log2(assay(res, "raw_filled"))
## Perform the PCA omitting all features with an NA in any of the
## samples. Also, the intensities are mean centered.
pc <- prcomp(t(na.omit(ft_ints)), center = TRUE)
## Plot the PCA
cols <- group_colors[xdata$sample_group]
pcSummary <- summary(pc)
plot(pc$x[, 1], pc$x[,2], pch = 21, main = "",
     xlab = paste0("PC1: ", format(pcSummary$importance[2, 1] * 100,
                                   digits = 3), " % variance"),
     ylab = paste0("PC2: ", format(pcSummary$importance[2, 2] * 100,
                                   digits = 3), " % variance"),
     col = "darkgrey", bg = cols, cex = 2)
grid()
text(pc$x[, 1], pc$x[,2], labels = xdata$sample_name, col = "darkgrey",
     pos = 3, cex = 2)
Figure 18: PCA for the faahKO data set, un-normalized intensities
We can see the expected separation between the KO and WT samples on PC2. On PC1 samples separate based on their ID, samples with an ID <= 18 from samples with an ID > 18. This separation might be caused by a technical bias (e.g. measurements performed on different days/weeks) or due to biological properties of the mice analyzed (sex, age, litter mates etc).
Normalizing features’ signal intensities is required, but at present not (yet)
supported in xcms (some methods might be added in near future). It is advised
to use the SummarizedExperiment returned by the quantify method for any
further data processing, as this type of object stores feature definitions,
sample annotations as well as feature abundances in the same object. For the
identification of e.g. features with significant different
intensities/abundances it is suggested to use functionality provided in other R
packages, such as Bioconductor’s excellent limma package. To enable support
also for other packages that rely on the old xcmsSet result object, it is
possible to coerce the new XCMSnExp object to an xcmsSet object using xset <- as(x, "xcmsSet"), with x being an XCMSnExp object.
For a detailed description of the new data objects and changes/improvements compared to the original user interface see the new_functionality vignette.
XCMSnExp objects allow to capture all performed pre-processing steps along
with the used parameter class within the @processHistory slot. Storing also
the parameter class ensures the highest possible degree of analysis
documentation and in future might enable to replay analyses or parts of it.
The list of all performed preprocessings can be extracted using the
processHistory method.
processHistory(xdata)## [[1]]
## Object of class "XProcessHistory"
##  type: Peak detection 
##  date: Tue Apr 25 19:15:59 2023 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8 
##  Parameter class: CentWaveParam 
##  MS level(s) 1 
## 
## [[2]]
## Object of class "XProcessHistory"
##  type: Peak refinement 
##  date: Tue Apr 25 19:16:19 2023 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8 
##  Parameter class: MergeNeighboringPeaksParam 
##  MS level(s) 1 
## 
## [[3]]
## Object of class "XProcessHistory"
##  type: Peak grouping 
##  date: Tue Apr 25 19:16:36 2023 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8 
##  Parameter class: PeakDensityParam 
##  MS level(s) 1 
## 
## [[4]]
## Object of class "XProcessHistory"
##  type: Retention time correction 
##  date: Tue Apr 25 19:16:37 2023 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8 
##  Parameter class: PeakGroupsParam 
##  MS level(s) 1 
## 
## [[5]]
## Object of class "XProcessHistory"
##  type: Peak grouping 
##  date: Tue Apr 25 19:16:44 2023 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8 
##  Parameter class: PeakDensityParam 
##  MS level(s) 1 
## 
## [[6]]
## Object of class "XProcessHistory"
##  type: Missing peak filling 
##  date: Tue Apr 25 19:16:45 2023 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8 
##  Parameter class: ChromPeakAreaParam 
##  MS level(s) 1It is also possible to extract specific processing steps by specifying its
type. Available types can be listed with the processHistoryTypes
function. Below we extract the parameter class for the alignment/retention time
adjustment step.
ph <- processHistory(xdata, type = "Retention time correction")
ph## [[1]]
## Object of class "XProcessHistory"
##  type: Retention time correction 
##  date: Tue Apr 25 19:16:37 2023 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8 
##  Parameter class: PeakGroupsParam 
##  MS level(s) 1And we can also extract the parameter class used in this preprocessing step.
## Access the parameter
processParam(ph[[1]])## Object of class:  PeakGroupsParam 
##  Parameters:
##  - minFraction: [1] 0.85
##  - extraPeaks: [1] 1
##  - smooth: [1] "loess"
##  - span: [1] 0.4
##  - family: [1] "gaussian"
##  - peakGroupsMatrix:      ko15.CDF ko22.CDF wt21.CDF
## FT12 2617.185 2623.444       NA
## FT26 2679.783 2686.043       NA
## FT25 2679.783 2686.043 2690.739
## FT30 2676.653 2687.608 2695.434
## FT31 2679.783 2687.608 2693.869
## FT32 2679.783 2689.172 2692.304
## FT42 2681.348 2689.172 2696.999
## FT43 2678.218 2689.172 2700.129
## FT45 2678.218 2689.172 2692.304
## FT03 2711.082 2723.601       NA
## FT07 2712.647 2720.472 2723.603
## FT20 2784.635       NA       NA
## FT13 2784.635 2790.894       NA
## FT06 2784.635 2794.024       NA
## FT01 2784.635 2789.330 2795.591
## FT02 2783.070 2789.330 2794.026
## FT18 2783.070 2790.894 2792.461
## FT79 2784.635 2790.894 2792.461
## FT09 2792.459 2801.849 2809.676
## FT75 2903.571 2914.526 2922.352
## FT14 2923.916 2923.916 2928.612
## FT16 2923.916 2922.351 2928.612
## FT15 2923.916 2927.045 2933.307
## FT17 2995.903 2995.903 3002.165
## FT80 2992.773 3008.423 3011.555
## FT22 3019.378 3020.943 3031.899
## FT04 3024.073 3014.683 3033.464
## FT41 3005.293 3027.202 3028.769
## FT40 3053.807 3069.456 3063.198
## FT46 3064.761 3085.106 3089.802
## FT51 3128.924 3146.139 3153.966
## FT35 3194.652 3158.659 3163.355
## FT60 3155.529 3171.178 3177.440
## FT69 3166.483 3179.003 3204.044
## FT70 3168.048       NA 3208.739
## FT73 3185.263 3224.387 3225.953
## FT74 3185.263 3227.516 3227.518
## FT76 3189.958 3230.646 3233.778
## FT77 3191.523 3230.646 3233.778
## FT36 3258.816 3260.381       NA
## FT08 3255.686 3260.381 3269.772
## FT19 3276.030 3297.939 3304.201
## FT47 3276.030 3297.939 3304.201
## FT53 3318.284 3316.719 3333.935
## FT52 3302.634 3324.544 3326.110
## FT54 3297.939 3324.544 3330.805
## FT39 3324.544 3327.674 3337.065
## FT23 3299.504 3333.933 3332.370
## FT71 3341.758 3357.408 3362.104
## FT72 3341.758 3358.973 3362.104
## FT37       NA 3355.843 3362.104
## FT38 3341.758 3360.538 3362.104
## FT10 3379.317 3398.096 3412.183
## FT63       NA 3402.791 3410.618
## FT21 3415.311 3421.571 3426.267
## FT78 3374.622 3427.831 3427.832
## FT55 3416.876 3429.395 3432.527
## FT64 3377.752 3446.610 3446.612
## FT65 3377.752 3448.175 3449.742
## FT58 3384.012 3449.740 3449.742
## FT49 3454.435 3470.084 3474.781
## FT27 3490.429 3498.253 3498.255
##  - subset: [1] 1 4 7
##  - subsetAdjust: [1] "average"XCMSnEx objects can be subsetted/filtered using the [ method, or one of the
many filter* methods. All these methods aim to ensure that the data in the
returned object is consistent. This means for example that if the object is
subsetted by selecting specific spectra (by using the [ method) all identified
chromatographic peaks are removed. Correspondence results (i.e. identified
features) are removed if the object is subsetted to contain only data from
selected files (using the filterFile method). This is because the
correspondence results depend on the files on which the analysis was performed -
running a correspondence on a subset of the files would lead to different
results. Note that with keepFeatures = TRUE it would be possible to overwrite
this and keep also correspondence results for the specified files.
As an exception, it is possible to force keeping adjusted retention times in the
subsetted object setting the keepAdjustedRtime argument to TRUE in any of
the subsetting methods.
Below we subset our results object the data for the files 2 and 4.
subs <- filterFile(xdata, file = c(2, 4))
## Do we have identified chromatographic peaks?
hasChromPeaks(subs)## [1] TRUEPeak detection is performed separately on each file, thus the subsetted object contains all identified chromatographic peaks from the two files. However, we used a retention time adjustment (alignment) that was based on available features. All features have however been removed and also the adjusted retention times (since the alignment based on features that were identified on chromatographic peaks on all files).
## Do we still have features?
hasFeatures(subs)## [1] FALSE## Do we still have adjusted retention times?
hasAdjustedRtime(subs)## [1] TRUEWe can however use the keepAdjustedRtime argument to force keeping the
adjusted retention times, keepFeatures would even keep correspondence results.
subs <- filterFile(xdata, keepAdjustedRtime = TRUE)
hasAdjustedRtime(subs)## [1] TRUEThe filterRt method can be used to subset the object to spectra within a
certain retention time range.
subs <- filterRt(xdata, rt = c(3000, 3500))
range(rtime(subs))## [1] 3000.186 3499.956Filtering by retention time does not change/affect adjusted retention times (also, if adjusted retention times are present, the filtering is performed on the adjusted retention times).
hasAdjustedRtime(subs)## [1] TRUEAlso, we have all identified chromatographic peaks within the specified retention time range:
hasChromPeaks(subs)## [1] TRUErange(chromPeaks(subs)[, "rt"])## [1] 3000.712 3499.399The most natural way to subset any object in R is with [. Using [ on an
XCMSnExp object subsets it keeping only the selected spectra. The index i
used in [ has thus to be an integer between 1 and the total number of spectra
(across all files). Below we subset xdata using both [ and filterFile to
keep all spectra from one file.
## Extract all data from the 3rd file.
one_file <- filterFile(xdata, file = 3)
one_file_2 <- xdata[fromFile(xdata) == 3]
## Is the content the same?
all.equal(one_file[[2]], one_file_2[[2]])## [1] "Attributes: < Component \"rt\": Mean relative difference: 0.0007411294 >"While the spectra-content is the same in both objects, one_file contains also
the identified chromatographic peaks while one_file_2 does not. Thus, in most
situations subsetting using one of the filter functions is preferred over the
use of [.
## Subsetting with filterFile preserves chromatographic peaks
head(chromPeaks(one_file))##         mz mzmin mzmax       rt    rtmin    rtmax      into      intb   maxo sn
## CP0555 231   231   231 2509.494 2503.234 2515.754  184167.0  165087.9  15818 11
## CP0556 334   334   334 2514.189 2504.799 2520.449  107281.7  104197.1   7976 14
## CP0557 337   337   337 2515.754 2503.234 2553.313 4381724.8 4040088.7 123448 21
## CP0558 333   333   333 2515.754 2503.234 2545.488  934386.2  877441.9  27944 15
## CP0559 316   316   316 2520.449 2503.234 2556.443  863947.7  832855.2  24496 29
## CP0560 332   332   332 2518.884 2503.234 2553.313 4835730.4 4644120.0 131520 41
##        sample
## CP0555      1
## CP0556      1
## CP0557      1
## CP0558      1
## CP0559      1
## CP0560      1## Subsetting with [ not
head(chromPeaks(one_file_2))## NULLNote however that also [ does support the keepAdjustedRtime argument. Below
we subset the object to spectra 20:30.
subs <- xdata[20:30, keepAdjustedRtime = TRUE]
hasAdjustedRtime(subs)## [1] TRUE## Access adjusted retention times:
rtime(subs)## F1.S0020 F1.S0021 F1.S0022 F1.S0023 F1.S0024 F1.S0025 F1.S0026 F1.S0027 
## 2539.356 2540.921 2542.486 2544.051 2545.616 2547.181 2548.746 2550.311 
## F1.S0028 F1.S0029 F1.S0030 
## 2551.876 2553.441 2555.006## Access raw retention times:
rtime(subs, adjusted = FALSE)## F1.S0020 F1.S0021 F1.S0022 F1.S0023 F1.S0024 F1.S0025 F1.S0026 F1.S0027 
## 2531.112 2532.677 2534.242 2535.807 2537.372 2538.937 2540.502 2542.067 
## F1.S0028 F1.S0029 F1.S0030 
## 2543.632 2545.197 2546.762As with MSnExp and OnDiskMSnExp objects, [[ can be used to extract a
single spectrum object from an XCMSnExp object. The retention time of the
spectrum corresponds to the adjusted retention time if present.
## Extract a single spectrum
xdata[[14]]## Object of class "Spectrum1"
##  Retention time: 42:10 
##  MSn level: 1 
##  Total ion count: 445 
##  Polarity: -1At last we can also use the split method that allows to split an XCMSnExp
based on a provided factor f. Below we split xdata per file. Using
keepAdjustedRtime = TRUE ensures that adjusted retention times are not
removed.
x_list <- split(xdata, f = fromFile(xdata), keepAdjustedRtime = TRUE)
lengths(x_list)##   1   2   3   4   5   6   7   8 
## 639 639 639 639 639 639 639 639lapply(x_list, hasAdjustedRtime)## $`1`
## [1] TRUE
## 
## $`2`
## [1] TRUE
## 
## $`3`
## [1] TRUE
## 
## $`4`
## [1] TRUE
## 
## $`5`
## [1] TRUE
## 
## $`6`
## [1] TRUE
## 
## $`7`
## [1] TRUE
## 
## $`8`
## [1] TRUEMost methods in xcms support parallel processing. Parallel processing is
handled and configured by the BiocParallel Bioconductor package and can be
globally defined for an R session.
Unix-based systems (Linux, macOS) support multicore-based parallel
processing. To configure it globally we register the parameter class. Note also
that bpstart is used below to initialize the parallel processes.
register(bpstart(MulticoreParam(2)))Windows supports only socket-based parallel processing:
register(bpstart(SnowParam(2)))Note that multicore-based parallel processing might be buggy or failing on
macOS. If so, the DoparParam could be used instead (requiring the foreach
package).
For other options and details see the vignettes from the BiocParallel package.
1. Saghatelian A, Trauger SA, Want EJ, Hawkins EG, Siuzdak G, Cravatt BF: Assignment of endogenous substrates to enzymes by global metabolite profiling. Biochemistry 2004, 43:14332–9.
2. Tautenhahn R, Böttcher C, Neumann S: Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics 2008, 9:504.
3. Smith R, Ventura D, Prince JT: LC-MS alignment in theory and practice: a comprehensive algorithmic review. Briefings in bioinformatics 2013, 16:bbt080–117.
4. Prince JT, Marcotte EM: Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Analytical chemistry 2006, 78:6140–6152.
5. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G: XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical chemistry 2006, 78:779–787.