This tutorial shows how to use the CluMSID
package to
help annotate MS2 spectra from untargeted LC-MS/MS data.
CluMSID
works with MS2 data generated by
data-dependent acquisition and requires an mzXML file (like in this
example) or any other file that can be parsed by mzR
, like
mzML, mzTab or netCDF, as input. It can be used both stand-alone and
together with the XCMS suite of preprocessing tools.
CluMSID
extracts and merges MS2 spectra and
generates neutral loss patterns for each feature. Additionally, it can
make use of information from the CAMERA
package to generate
pseudospectra from MS1 level data. The tool uses cosine
similarity to generate distance matrices from MS2 spectra,
neutral loss patterns and pseudospectra.
These distance matrices are the basis for multivariate statistics
methods such as multidimensional scaling, density-based clustering,
hierarchical clustering and correlation networks. The
CluMSID
package provides functions for these methods
including (interactive) visualisation but the distance/similarity data
can also be analysed with other R
functions.
For the demonstrations in this tutorial, we will mainly use data from pooled Pseudomonas aeruginosa cell extracts, measured in ESI-(+) mode with auto-MS/MS on a Bruker maxisHD qTOF after reversed phase separation by UPLC. For details, please refer to the Depke et al. 2017 publication (doi: 10.1016/j.jchromb.2017.06.002.).
To be able to access the example data, we also need the related
package CluMSIDdata
. The packages can be loaded as
follows:
MS2spectrum
and pseudospectrum
classesCluMSID
uses a custom S4 class named
MS2spectrum
to store spectral information in the following
slots:
id
: a character string similar to the ID used by
XCMSonline or the ID given in a predefined peak listannotation
: a character string containing a
user-defined annotation, defaults to emptyprecursor
: (median) m/z of the spectrum’s
precursor ionrt
: (median) retention time of the spectrum’s precursor
ionpolarity
: the polarity with which the spectrum was
recorded, either positive
or negative
spectrum
: the actual MS2 spectrum as
two-column matrix (column 1 is (median) m/z, column 2 is
(median) intensity of the product ions)neutral_losses
: a neutral loss pattern generated by
subtracting the product ion mass-to-charge ratios from the precursor
m/z in a matrix format analogous to the spectrum
slotThe pseudospectrum
class is very similar but it contains
no information on precursor m/z and therefore no neutral loss
pattern, either. By default, the id
slot contains the
“pcgroup” number assigned by CAMERA
.
The individual slots of MS2spectrum
and
pseudospectrum
objects can be accessed via the standard S4
way using object@slot
, e.g. object@annotation
or by using an accessor function. These exist for all slots and are
called accessFoo()
, where Foo
is the slot name
(not exactly, though, because Bioconductor
does not allow
to mix snake_case
and camelCase
in function
names):
accessID(object)
accessAnnotation(object)
accessPrecursor(object)
accessRT(object)
accessPolarity(object)
accessSpectrum(object)
accessNeutralLosses(object)
.The first step in the CluMSID
workflow is to extract
MS2 spectra from the raw data file (in mzXML format). This is
done by the extractMS2spectra
function which internally
uses several functions from the mzR
package. The function
offers the possibility to filter spectra that contain less a defined
number of peaks and/or do not fall in a defined retention time window.
Setting the recalibrate_precursor
argument to
TRUE
activates a correction process for uncalibrated
precursor m/z data that existed in older version of Bruker’s
Compass Xport (cf. Depke et al. 2017). It is not
necessary to use it with files generated by other software but does not
corrupt the data, either.
Please be aware that mzR
often throws warnings
concerning the Rcpp
version that can usually be
ignored.
ms2list <- extractMS2spectra(system.file("extdata",
"PoolA_R_SE.mzXML",
package = "CluMSIDdata"),
min_peaks = 2,
recalibrate_precursor = TRUE,
RTlims = c(0,25))
This operation has now extracted all the MS2 spectra from
the raw data file and stored them in a list. Each list entry is an
object of class MS2spectrum
. The list is quite long because
it still contains a lot of spectra that derive from the same
chromatographic peak.
In our example, the first two spectra in the list derive from the same peak and thus have the same precursor ion and almost the same retention time.
head(ms2list, 4)
#> [[1]]
#> An object of class "MS2spectrum"
#> id:
#> annotation:
#> precursor: 146.1652
#> retention time: 56.266
#> polarity: positive
#> MS2 spectrum with 2 fragment peaks
#> neutral loss pattern with 0 neutral losses
#> [[2]]
#> An object of class "MS2spectrum"
#> id:
#> annotation:
#> precursor: 146.1653
#> retention time: 57.292
#> polarity: positive
#> MS2 spectrum with 3 fragment peaks
#> neutral loss pattern with 0 neutral losses
#> [[3]]
#> An object of class "MS2spectrum"
#> id:
#> annotation:
#> precursor: 129.1387
#> retention time: 57.545
#> polarity: positive
#> MS2 spectrum with 2 fragment peaks
#> neutral loss pattern with 0 neutral losses
#> [[4]]
#> An object of class "MS2spectrum"
#> id:
#> annotation:
#> precursor: 112.1119
#> retention time: 57.797
#> polarity: positive
#> MS2 spectrum with 2 fragment peaks
#> neutral loss pattern with 0 neutral losses
From the output above, you also see that the MS2spectrum
class has a show()
generic that summarises the
MS2 spectrum and neutral loss pattern data. To show the
default output, use showDefault()
. Be aware that neutral
loss patterns have not been calculated in this step.
showDefault(ms2list[[2]])
#> An object of class "MS2spectrum"
#> Slot "id":
#> character(0)
#>
#> Slot "annotation":
#> character(0)
#>
#> Slot "precursor":
#> [1] 146.1653
#>
#> Slot "rt":
#> [1] 57.292
#>
#> Slot "polarity":
#> [1] "positive"
#>
#> Slot "spectrum":
#> mz intensity
#> [1,] 72.08064 2448
#> [2,] 84.08077 328
#> [3,] 112.11228 843
#>
#> Slot "neutral_losses":
#> <0 x 0 matrix>
To reduce the amount of redundant MS2 spectra, the
mergeMS2spectra()
function is used to generate consensus
spectra from the MS2 spectra that derive from the same
precursor. CluMSID
offers two possibilities to do so:
This possibility is the standard method for stand-alone use of
CluMSID
and is equivalent to what has been described in
Depke et al. 2017. It does not need additional input and
summarises consecutive spectra that have the same precursor m/z
if their retention time fall within a defined threshold
(rt_tolerance
, defaults to 30s). A retention time
difference between consecutive spectra larger than
rt_tolerance
is interpreted as chromatographic separation
and respective spectra will be assigned to a new feature. The
mz_tolerance
argument should be set according to your
instruments m/z precision, the default is 1 * 10-5
(10ppm, equivalent to ±5ppm instrument precision). The
peaktable
and exclude_unmatched
arguments are
not used in this method and are to be left at their default.
head(featlist, 4)
#> [[1]]
#> An object of class "MS2spectrum"
#> id: M146.17T59.35
#> annotation:
#> precursor: 146.1653
#> retention time: 59.35
#> polarity: positive
#> MS2 spectrum with 8 fragment peaks
#> neutral loss pattern with 7 neutral losses
#> [[2]]
#> An object of class "MS2spectrum"
#> id: M129.14T58.57
#> annotation:
#> precursor: 129.1387
#> retention time: 58.57
#> polarity: positive
#> MS2 spectrum with 4 fragment peaks
#> neutral loss pattern with 3 neutral losses
#> [[3]]
#> An object of class "MS2spectrum"
#> id: M112.11T57.8
#> annotation:
#> precursor: 112.1119
#> retention time: 57.8
#> polarity: positive
#> MS2 spectrum with 2 fragment peaks
#> neutral loss pattern with 1 neutral losses
#> [[4]]
#> An object of class "MS2spectrum"
#> id: M251.16T60.64
#> annotation:
#> precursor: 251.1603
#> retention time: 60.64
#> polarity: positive
#> MS2 spectrum with 9 fragment peaks
#> neutral loss pattern with 8 neutral losses
The total amount of spectra was reduced from 2290 to 518 and as many other, the redundant spectra #1 and #2 in the raw list are now merged to one consensus spectrum (#1 in the merged list).
In this step, neutral loss patterns have been generated that look like this:
The second possibility is to supply a peaktable, i.e. a list of picked peaks with their mass-to-charge ratios and retention times. This is particularly useful if you want to annotate a complete metabolomics data set. In our example, we have a metabolomics dataset called “TD035” in which we have measured a range of samples in MS1 mode for relative quantification. Additionally, we have measured a pooled QC sample in MS2 mode for annotation. The MS1 data were analysed using XCMSonline and we want to group the MS2 spectra so that they match the XCMSonline peak picking.
The spectra are extracted as shown above:
ms2list2 <- extractMS2spectra(system.file("extdata",
"TD035-PoolMSMS2.mzXML",
package = "CluMSIDdata"),
min_peaks = 2,
recalibrate_precursor = TRUE,
RTlims = c(0,25))
The peaklist is imported from the XCMSonline output. The list has to contain at least 3 columns:
Shown below is an easy way of getting from an XCMSonline annotated diffreport to a suitable peaktable using tidyverse functions. Of course, you can achieve the same goal with base R functions or even in Excel. Depending on the retention time format in your *.mzXML file, you might have to convert from minutes to seconds or vice versa. Here, we have minutes in the XCMSonline output but seconds in the MS2 file, so we multiply by 60.
require(magrittr)
ptable <-
readr::read_delim(file = system.file("extdata",
"TD035_XCMS.annotated.diffreport.tsv",
package = "CluMSIDdata"),
delim = "\t") %>%
dplyr::select(c(name, mzmed, rtmed)) %>%
dplyr::mutate(rtmed = rtmed * 60)
head(ptable)
#> # A tibble: 6 × 3
#> name mzmed rtmed
#> <chr> <dbl> <dbl>
#> 1 M245T2 245. 100.
#> 2 M440T2_1 440. 107.
#> 3 M578T2 578. 104.
#> 4 M85T1 85.0 60.8
#> 5 M126T1_1 126. 61.0
#> 6 M688T24 688. 1468.
We can now use this peaktable as an argument for
mergeMS2spectra()
. You can choose whether you want to keep
or exclude MS2 spectra that do not match any peak in the
peaktable. These can occur in regions of the chromatogramm where there
are no clear peaks but the auto-MS/MS still fragments the most abundant
ions. These unmatched spectra are merged following the same rules as
described above (method without peaktable). In this example, we keep the
unmatched spectra. We use the default values for m/z and
retention time tolerance and thus do not need to specify them.
featlist2 <- mergeMS2spectra(ms2list2,
peaktable = ptable,
exclude_unmatched = FALSE)
head(featlist2, 4)
#> [[1]]
#> An object of class "MS2spectrum"
#> id: M213T0
#> annotation:
#> precursor: 213.1462
#> retention time: 6.04
#> polarity: positive
#> MS2 spectrum with 5 fragment peaks
#> neutral loss pattern with 3 neutral losses
#> [[2]]
#> An object of class "MS2spectrum"
#> id: xM158T31.17
#> annotation:
#> precursor: 158.0027
#> retention time: 31.17
#> polarity: positive
#> MS2 spectrum with 3 fragment peaks
#> neutral loss pattern with 3 neutral losses
#> [[3]]
#> An object of class "MS2spectrum"
#> id: M146T1_3
#> annotation:
#> precursor: 146.1650
#> retention time: 61.15
#> polarity: positive
#> MS2 spectrum with 7 fragment peaks
#> neutral loss pattern with 6 neutral losses
#> [[4]]
#> An object of class "MS2spectrum"
#> id: M129T1_4
#> annotation:
#> precursor: 129.1384
#> retention time: 60.74
#> polarity: positive
#> MS2 spectrum with 2 fragment peaks
#> neutral loss pattern with 2 neutral losses
Note that the 2nd entry in featlist2
is
marked with an ‘x’ which means that it could not be assigned to a
feature in the peaktable.
For the sake of simplicity, only the data generated from the
stand-alone procedure will be used for the following examples. Be
assured that all of them would also work with the data generated with
the help of an external peaktable (featlist2
).
The next step is to add (external) annotations to the list of
features, e.g. from a spectral library that you curate in-house or one
that has been supplied by your instrument manufacturer. If you do not
(want to) annotate your features at all, this step can be skipped
completely, leaving the annotation
slot of the
MS2spectrum
objects empty.
CluMSID
offers several possibilities to add annotations
to your feature list. The most basic one first generates a list of
features and saves it as *.csv file. For that you use the
writeFeaturelist()
function and only have to specify your
list of spectra and a file name for the output file (here:
pre_anno.csv
). You can then manually fill in your
annotations in a new column in the table, save it (in this example under
the name post_anno.csv
) and reload it to
R
:
annotatedSpeclist <- addAnnotations(featlist, system.file("extdata",
"post_anno.csv",
package = "CluMSIDdata"))
annotatedSpeclist
will then be equivalent to
featlist
with annotations added to the
annotation
slot of the list entries.
You can add annotations without leaving the R
environment, too. addAnnotations()
also accepts objects of
class data.frame
as annolist
argument. Be
aware that addAnnotations()
assigns the annotation based on
the position in the feature list. I.e., if the order of the features in
your list of features (featlist
) and your list of
annotations (annolist
) is different, you will get nonsense
results.
The savest ways to addAnnotations()
with a
data.frame
is to use featureList()
to generate
a data.frame
that is formatted in the same way as the file
output from writeFeaturelist()
and then match your
identifications against this data.frame
and use the result
as argument for addAnnotations()
.
Say you have an object called annos
that contains
feature IDs (the same as in featlist
) and annotations in a
two-column data.frame
with “id” and “annotation” as column
names. It could look like this:
str(annos)
#> 'data.frame': 154 obs. of 2 variables:
#> $ id : chr "M146.17T59.35" "M129.14T58.57" "M112.11T57.8" "M148.06T69.65" ...
#> $ annotation: chr "spermidine" "spermidine (fragment)" "spermidine (fragment)" "glutamate" ...
head(annos)
#> id annotation
#> 1 M146.17T59.35 spermidine
#> 2 M129.14T58.57 spermidine (fragment)
#> 3 M112.11T57.8 spermidine (fragment)
#> 4 M148.06T69.65 glutamate
#> 5 M130.05T69.64 glutamate (fragment)
#> 6 M179.06T71.32 gluconolactone
addAnnotations(featlist, annos, annotationColumn = 2)
will throw an error because featlist
and annos
are of different length. Instead, you need to do the following:
Now, you can annotate your list of spectra using
addAnnotations(featlist, fl_annos, annotationColumn = 4)
.
An analogous procedure works if you have your annotations stored in a
peaktable that you have used for mergeMSspectra()
. As the
order of spectra in the list will not be same as the order of features
in your peaktable, you need to do a matching with the output of
featureList()
as well.
Once we have a list of MS2spectrum
objects containing
all the required information with or without annotation, we can generate
distance matrices from (product ion) MS2 spectra as well as
from neutral loss patterns. These distance matrices serve as the basis
for further analysis of the data. Both for MS2 spectra and
neutral loss patterns, cosine similarity is used as similarity
metric:
\[ cos(\theta) = \frac{\sum_{i}a_i \cdot b_i}{\sqrt{\sum_{i}{a_{i}}^2 \cdot \sum_{i}{b_{i}}^2}} \]
For most applications, analysing the similarity of product ion MS2 spectra will be most useful. The generation of the distance matrix is done by just one simple command but it can take some time to calculate.
Common neutral losses and neutral loss patterns can convey
information about structural similarity, as well, e.g. with nucleotides
or glykosylated secondary metabolites. CluMSID
offers the
possibility to study neutral loss patterns independently from product
ion spectra. The generation of a distance matrix is analogous, you just
need to set the ‘type’ argument to “neutral_losses”:
One rather simple possibility to visually analyse the spectral
similarity data is multidimensional scaling, a dimension reduction
method that simplifies distances in n-dimensional space to
those in two-dimensional space (n in this case being the number
of consensus spectra or neutral loss patterns that were used to generate
the distance matrix in the previous step). CluMSID
offers a
simple function to produce an MDS plot from the distance matrix with the
option to highlight annotated metabolites and the possibility to
generate an interactive plot using plotly
.
Standard MDS plots are generated as follows:
For MS2 spectra:
For neutral loss patterns:
Interactive plots are zoomable and show feature names upon
mouse-over. They are generated like normal MDS plots and can be viewed
within RStudio or—after saving as html file using
htmlwidgets
—displayed in a normal web browser.
my_mds <- MDSplot(distmat, interactive = TRUE, highlight_annotated = TRUE)
htmlwidgets::saveWidget(my_mds, "mds.html")
This is how it looks like if you open the html file in Firefox and mouse over a feature:
For density-based clustering with CluMSID
, the ‘OPTICS’
algorithm and its implementation in the dbscan
package is
used. Density-based clustering is a useful clustering method that often
yields different results than hierarchical clustering and can thus
provide additional insight into the data. CluMSID
has two
functions to perform density-based clustering, one for the reachability
plot which is the most useful visualisation of OPTICS results and one
that outputs a data.frame
containing the cluster
assignations for every feature.
Both functions require as arguments a distance matrix as well as
three parameters for the underlying functions
dbscan::optics
and dbscan::extractDBSCAN
:
eps
, minPts
and eps_cl
. Lowering
the eps
parameter (default is 10000) limits the size of the
epsilon neighbourhood which from experience has very little effect on
the results. minPts
defaults to 3 in CluMSID
.
It defines how many points are considered for reachability distance
calculation between clusters. The dbscan::optics
default
for minPts
is 5. Users are encourage to experiment with
this parameter. eps_cl
is the reachability threshold to
identify clusters and can be varied based on your data. Lowering
eps_cl
leads to a larger number of smaller clusters and
vice versa for raising the value. In general, it is advisable to chose a
higher eps_cl
for MS2 spectra than for neutral
loss patterns, since the latter tend to show less similarity to each
other. For details, please refer to the dbscan
help for the
dbscan::optics
and dbscan::extractDBSCAN
functions.
If the default parameters are used, the generation of an OPTICS reachability plots is very simple, shown here for MS2 spectra and neutral loss patterns:
In the reachability plots, every line represents a feature and the
height of the line is the reachability distance to the next feature in
the OPTICS order. Thus, valleys represent groups of similar spectra or
neutral loss patterns. The order and the cluster assignment can be
studied using the OPTICStbl
function that outputs a
three-column data.frame
with feature id, cluster assignment
and OPTICS order. The order of features in the data.frame
corresponds to the original order in the input distance matrix. Features
that were not assigned to a cluster are black in the reachability plot
and have the cluster ID 0. OPTICStbl
takes the same
arguments as OPTICSplot
. The two functions have to be run
with exactly the same parameters to assure compatibility of results.
In Depke et al. 2017, hierarchical clustering proved the
most useful method to unveil structural similarities between features.
analogous to density-based clustering, CluMSID
offers two
functions, one for plots and one for a data.frame
with
cluster assignments, both taking a distance matrix as the only
compulsory argument. The other two parameters are h
(defaults to 0.95
), the height where the tree should be cut
(see stats::cutree
for details) and type
that
determines the type visualisation:
heatmap
: a heatmap displaying pairwise
similarities/distances along with cluster dendrogramsdendrogram
(default): a circular dendrogram with colour
code for cluster assignmentHeatmaps of our example data for MS2 and neutral loss
pattern similarity are created as follows (with reduced label font size
by changing cexRow
and cexCol
as well as
margins
of the underlying heatmap.2
function):
Obviously, it makes sense to export the plots to larger pdf or png
files (e.g. 2000 \(\times\) 2000
pixels) to examine them closely. If exported to pdf, the feature names
remain searchable (Ctrl+F
in Windows).
With the dendrogram, too, it is advisable to export is to pdf in a large format, e.g. as follows:
The plot from our example data looks like this:
The clusters are colour-coded and if exported to pdf, the tip labels containing feature ID and annotation are searchable.The height of the dendrogram’s branching points serves as another piece of information when interpreting the clustered data as it signifies similarity of features.
For a detailed example of how to interpret, please refer to Depke
et al. 2017, where CluMSID
helped to identify new
members of several classes of secondary metabolites in Pseudomonas
aeruginosa.
Like with density-based clustering, it is also possible to generate a
list of features with respective cluster assignments using
HCtbl
. As mentioned above for
CluMSID_OPTISplot
and OPTICStbl
, it is crucial
to run HCplot
and HCtbl
using the same
parameters.
As a new functionality, CluMSID
offers the possibility
to analyse the similarity data using weighted correlation networks.
These networks offer some advantages with respect to standard clustering
methods, most notably that they do not strictly assign every feature to
a distinct cluster but also represent similarities between features that
would fall into different clusters in hierarchical or density-based
clustering. Thus, correlation networks potentially contain more useful
information for data interpretation. On the downside, the interpretation
is also complicated by this lack of concrete cluster assignments. E.g.,
we cannot simply look up which features belong to the same cluster in
order to examine their spectra closely but we have to go back to the
correlation network visualisation and search for connected features
manually.
networkplot
requires some arguments:
distmat
: matrix; a distance matrix like for
all other functions described aboveinteractive
: logical; Similar to
MDSplotplot
, correlation network can be generate as
interactive plots that are zoomable and display feature IDs on
mouse-over. If that is desired, set interactive
to
TRUE
(default is FALSE
).show_labels
: logical; whether to display
feature IDs in the (non-interactive) plot (default is
FALSE
, ignored if interacive = TRUE
)label_size
: numeric; font size of feature ID
labels (default is 1.5
, which is way smaller than the
default in GGally::ggnet2
, 4.5
)highlight_annotated
: logical; whether to plot
dots for features with annotation in a different colour (same as in
MDSplotplot
, default is FALSE
)min_similarity
: numeric; the minimum
similarity (1 – distance) threshold (similarities below this threshold
will be ignored, default is 0.1
)exclude_singletons
: logical; whether to
exclude features from the plot that do not have connections to other
features, particularly useful with data sets containing very dissimilar
spectra, e.g. neutral loss patterns or MS1 pseudospectra
(default is FALSE
)A standard non-interactive correlation network for the MS2 example data can be plotted like this:
As you can guess from this plot, it makes sense to use the
interactive visualisation. Just like with MDSplotplot
, you
can view the interactive plot within RStudio or save it as html and view
it in web browser.
my_net <- networkplot(distmat, interactive = TRUE,
highlight_annotated = TRUE)
htmlwidgets::saveWidget(my_net, "net.html")
This is how it looks like if you open the html file in Firefox, zoom in on a cluster and mouse over a feature:
Please be aware that the spatial arrangement of the data points in the plot has a random component, i.e. while the relative position of the points (the distance to each other) is always the same, the absolute position varies and will not be the same even if the same command is executed twice.
The pairwise similarity of spectra or neutral loss patterns of
features expressed by the cosine score is signified by the width of the
line connecting the two features. All pairwise similarities greater than
min_similarity
result in a connecting line in the plot. The
spatial proximity in which the features are mapped onto the plot is
determined by the multivariate method underlying the network
generation.
As we have already noticed after inspection of the heatmaps on
p.13–14, the neutral loss patterns show much less similarity to each
other than the MS2 spectra data. Thus, we expect quite a few
neutral loss patterns that do not show any similarity to another neutral
loss pattern. This expectation justifies the exclusion of these
‘singletons’ from the correlation network analysis. To do so, just set
exclude_singletons
to TRUE
:
Multidimensional scaling, density-based clustering, hierarchical
clustering and correlation network analysis are the main
CluMSID
tools to analyse MS2 spectra or neutral
loss pattern similarity data, however, the package contains some
additional functionalities that may facilitate data analysis in some
cases and can also be used in other contexts with or without the
above-mentioned unsupervised methods.
Accessing S4 objects within lists is not trivial. Therefore,
CluMSID
offers a function to access individual or several
MS2spectrum
objects by their slot entries.
getSpectrum()
requires the following arguments:
featlist
: a list
that contains only
objects of class MS2spectrum
slot
: the slot to be searched (invalid
slot
arguments will produce errors):
id
annotation
precursor
(m/z of precursor ion)rt
(retention time of precursor)what
: the search term or number, must be
character for id
and annotation
and
numeric for precursor
and rt
mz.tol
: the tolerance used for precursor ion
m/z searches, defaults to 1E-05 (10ppm)rt.tol
: the tolerance used for precursor ion retention
time searches, defaults to 30s; high values can be used to specify
retention time ranges (see example)Some examples will demonstrate the use of
getSpectrum()
:
1. Accessing a spectrum by its ID. For this, the exact feature ID must be known:
getSpectrum(annotatedSpeclist, "id", "M244.17T796.4")
#> An object of class "MS2spectrum"
#> id: M244.17T796.4
#> annotation: HHQ
#> precursor: 244.1700
#> retention time: 796.4
#> polarity: positive
#> MS2 spectrum with 98 fragment peaks
#> neutral loss pattern with 81 neutral losses
2. Accessing a spectrum by its annotation. For this, the exact annotation has to be known as well, other annotations will produce a message:
getSpectrum(annotatedSpeclist, "annotation", "HHQ")
#> An object of class "MS2spectrum"
#> id: M244.17T796.4
#> annotation: HHQ
#> precursor: 244.1700
#> retention time: 796.4
#> polarity: positive
#> MS2 spectrum with 98 fragment peaks
#> neutral loss pattern with 81 neutral losses
3. Accessing spectra by their precursor ion
m/z. If the list contains more than one spectrum with
a precursor ion m/z within the tolerance, the output is again a
list of MS2spectrum
objects that meet the specified
criterion:
getSpectrum(annotatedSpeclist, "precursor", 286.18, mz.tol = 1E-03)
#> [[1]]
#> An object of class "MS2spectrum"
#> id: M286.18T728.73
#> annotation: C9:1-QNO
#> precursor: 286.1799
#> retention time: 728.73
#> polarity: positive
#> MS2 spectrum with 4 fragment peaks
#> neutral loss pattern with 2 neutral losses
#> [[2]]
#> An object of class "MS2spectrum"
#> id: M286.18T808.85
#> annotation: C9:1-QNO
#> precursor: 286.1804
#> retention time: 808.85
#> polarity: positive
#> MS2 spectrum with 7 fragment peaks
#> neutral loss pattern with 5 neutral losses
#> [[3]]
#> An object of class "MS2spectrum"
#> id: M286.18T864.04
#> annotation: C9:1-QNO
#> precursor: 286.1808
#> retention time: 864.04
#> polarity: positive
#> MS2 spectrum with 183 fragment peaks
#> neutral loss pattern with 167 neutral losses
#> [[4]]
#> An object of class "MS2spectrum"
#> id: M286.18T921.6
#> annotation: C9:1-PQS
#> precursor: 286.1808
#> retention time: 921.6
#> polarity: positive
#> MS2 spectrum with 3 fragment peaks
#> neutral loss pattern with 1 neutral losses
4. Accessing spectra by their precursor retention
time. Here, too, we can extract several
MS2spectrum
objects by setting a larger retention time
tolerance. If we want to extract the spectra of all compounds that elute
from 6min (360s) to 8min (480s), we proceed as follows:
Another pair of accessory functions is findFragment()
and findNL()
which are used to find spectra that contain a
specific fragment ion or neutral loss. Analogous to
getSpectrum()
, they need as arguments a list of
MS2spectrum
objects, the m/z of the fragment or
neutral loss of interest and the respective m/z tolerance in
ppm (default is 10ppm).
The two functions can be useful in many situation, e.g. when working with lipid data where head groups and fatty acids often give characteristic fragments or neutral losses. In the world of P. aeruginosa secondary metabolites, alkylquinolones (AQs) play an important role and most of the AQ MS2 spectra contain a signature fragment with an m/z of 159.068. Based on this fragment m/z, we can create a list of putative AQs:
putativeAQs <- findFragment(annotatedSpeclist, 159.068)
#> 70 spectra were found that contain a fragment of m/z 159.068 +/- 10 ppm.
An example for common neutral losses are nucleoside monophospates
that all loose ribose-5’-monophosphate, resulting in a neutral loss of
212.009 in ESI-(+). Using findNL()
we find CMP, UMP, AMP
and GMP.
findNL(annotatedSpeclist, 212.009)
#> 4 neutral loss patterns were found that contain a neutral loss of m/z 212.009 +/- 10 ppm.
#> [[1]]
#> An object of class "MS2spectrum"
#> id: M324.06T75.32
#> annotation: CMP
#> precursor: 324.0591
#> retention time: 75.32
#> polarity: positive
#> MS2 spectrum with 8 fragment peaks
#> neutral loss pattern with 8 neutral losses
#> [[2]]
#> An object of class "MS2spectrum"
#> id: M325.04T78.94
#> annotation: UMP
#> precursor: 325.0429
#> retention time: 78.94
#> polarity: positive
#> MS2 spectrum with 5 fragment peaks
#> neutral loss pattern with 5 neutral losses
#> [[3]]
#> An object of class "MS2spectrum"
#> id: M348.07T90.34
#> annotation: AMP
#> precursor: 348.0707
#> retention time: 90.34
#> polarity: positive
#> MS2 spectrum with 21 fragment peaks
#> neutral loss pattern with 19 neutral losses
#> [[4]]
#> An object of class "MS2spectrum"
#> id: M364.07T97.19
#> annotation: GMP
#> precursor: 364.0659
#> retention time: 97.19
#> polarity: positive
#> MS2 spectrum with 6 fragment peaks
#> neutral loss pattern with 6 neutral losses
If you are mainly interested in one or a few number of spectra or
neutral loss patterns, it may be sufficient to match one feature at a
time against a larger set of spectra. This set of spectra can be all
spectra contained in one mzXML file like in all the examples in this
tutorial or they could be a spectral library, as long as its format in
R
is a list of MS2spectrum
objects.
The getSimilarities()
function requires several
arguments:
spec
: The spectrum to be compared to other spectra. Can
be either an object of class MS2spectrum
or a two-column
numerical matrix that contains fragment mass-to-charge ratios in the
first and intensities in the second column.speclist
: The set of spectra to which spec
is to be compared. Must be a list where every entry is an object of
class MS2spectrum
. Can be generated from an mzXML file as
shown above or constructed using new("MS2spectrum", ...)
for every list entry (see example).type
: Specifies whether MS2 spectra or
neutral loss patterns are to be compared. Must be either ‘spectrum’
(default) or ‘neutral_losses’.hits_only
: Logical that indicates whether the result
should contain only similarities greater than zero (see example).In the first example, we want to find all MS2 spectra in
our example data set that are similar to the spectrum of pyocyanin, an
important secondary metabolite from Pseudomonas aeruginosa and
therefore match the pyocyanin spectrum against our
annotatedSpeclist
. Because we have already identified
pyocyanin in the data set, we can use getSpectrum
to
extract the MS2spectrum
object from
annotatedSpeclist
. We do not want to search all 518
elements of the result vector, so we set hits_only
to
TRUE
to exclude spectra that have 0 similarity to the
pyocyanin spectrum.
pyo <- getSpectrum(annotatedSpeclist, "annotation", "pyocyanin")
sim_pyo <- getSimilarities(pyo, annotatedSpeclist, hits_only = TRUE)
sim_pyo
#> M110.06T100.45 M123.06T103.31 M332.56T107.48 M332.08T113.21 M182.08T125.93
#> 0.0235166588 0.0071763662 0.0032575891 0.0035153018 0.0414005385
#> M166.09T233.22 M120.08T233.48 M103.05T235.3 M174.06T277.59 M220.12T336.88
#> 0.0394723541 0.0492390806 0.0826780036 0.0391004892 0.0205482303
#> M525.18T352.51 M243.08T362.4 M188.07T371.25 M205.1T370.99 M211.09T382.17
#> 0.0060019991 0.0145904545 0.0176900909 0.0179895663 1.0000000000
#> M187.12T391.55 M188.12T399.72 M254.09T400.89 M160.08T433.66 M291.15T444.85
#> 0.0210280136 0.0105392131 0.2071528536 0.0489638040 0.0106479317
#> M120.04T450.56 M138.06T451.33 M176.07T465.7 M491.29T496.41 M255.08T482.73
#> 0.0287432023 0.0202198052 0.0275059908 0.0610208210 0.6451546287
#> M245.59T495.11 M145.08T508.11 M163.09T512.64 M188.11T535.78 M321.1T537.6
#> 0.2583432230 0.0473127795 0.0034167239 0.0057005179 0.0293635312
#> M243.09T558.67 M136.08T584.09 M118.06T585.64 M215.12T626.24 M224.08T640.69
#> 0.0116275804 0.0132716679 0.0203921366 0.3252546561 0.0325490977
#> M213.07T652.92 M216.14T670.01 M227.08T670.26 M264.18T675.46 M233.13T676.23
#> 0.0083842257 0.0009299928 0.0034818309 0.0172023762 0.0143332573
#> M225.07T698.07 M207.06T699.1 M257.06T704.3 M226.18T703.76 M325.07T739.09
#> 0.0253940205 0.0230298767 0.0028192053 0.0255571995 0.0010572974
#> M181.08T724.14 M330.19T724.4 M255.08T740.91 M446.19T745.32 M321.1T746.09
#> 0.1283755709 0.5019236568 0.2030839014 0.0750793676 0.1017149711
#> M891.36T747.39 M231.1T763.28 M185.1T763.53 M288.2T765.88 M258.15T768.47
#> 0.0714108632 0.0070294838 0.0094225379 0.0018840838 0.0168976240
#> M328.14T772.12 M242.15T789.6 M304.19T786.75 M260.16T796.66 M244.17T796.4
#> 0.0009052790 0.0071606643 0.0066966285 0.0141834395 0.0106077162
#> M159.07T795.61 M314.21T825.99 M326.18T833.85 M286.18T864.04 M270.19T873.15
#> 0.0124577125 0.0006167122 0.0023434969 0.0205024121 0.0089744139
#> M268.17T875.49 M178.05T867.16 M198.09T874.45 M170.1T885.9 M288.2T902.1
#> 0.0011687551 0.0413446455 0.0064665757 0.0062571323 0.0100270721
#> M184.08T908.6 M272.2T910.42 M312.2T929.44 M314.21T944.87 M296.2T958.11
#> 0.0036517997 0.0086043375 0.0085388744 0.0128215188 0.0054678320
#> M298.22T984.33 M500.22T974.72 M304.19T977.83 M303.19T979.65 M340.23T1007.62
#> 0.0065812089 0.0396920510 0.0059045590 0.0049045093 0.0052002486
#> M324.23T1025.79 M314.21T1034.15 M326.25T1043.73 M679.43T1051.39
#> 0.0005826366 0.0030626495 0.0005424581 0.0008290325
We get 84 spectra that have a non-zero similarity to the pyocyanin
spectrum, including pyocyanin itself with a similarity of
1
. Of course, we can further filter the data by subsetting
the result vector in order to exclude spectra that have only minimal
similarity, e.g. M679.43T1051.39
with a cosine similarity
of only 0.0008
(the last element in the vector).
In the second example, we generate a new speclist
,
e.g. from a spectral library. We look at the unknown feature that has
most similarity to pyocyanin. As pyocyanin is contained in
annotatedSpeclist
itself, we have to look at the second
highest similarity. Again, we use getSpectrum()
to extract
the object from annotatedSpeclist
:
highest_sim <- sort(sim_pyo, decreasing = TRUE)[2]
sim_spec <- getSpectrum(annotatedSpeclist, "id", names(highest_sim))
sim_spec
#> An object of class "MS2spectrum"
#> id: M255.08T482.73
#> annotation:
#> precursor: 255.0761
#> retention time: 482.73
#> polarity: positive
#> MS2 spectrum with 5 fragment peaks
#> neutral loss pattern with 3 neutral losses
We see that the feature is not annotated. We are interested whether
this feature also shows similarity to other members of the phenazine
family of P. aeruginosa secondary metabolites. Some phenazines
are contained in annotatedSpeclist
but some are not, so we
make a new speclist
called phenazines
and add
the missing spectra manually from an in-house library:
phenazines <- list()
phenazines[[1]] <- getSpectrum(annotatedSpeclist,
"annotation", "pyocyanin")
phenazines[[2]] <- getSpectrum(annotatedSpeclist,
"annotation", "phenazine-1-carboxamide")
phenazines[[3]] <- getSpectrum(annotatedSpeclist,
"annotation", "phenazine-1-carboxylic acid")
phenazines[[4]] <- getSpectrum(annotatedSpeclist,
"annotation", "phenazine-1,6-dicarboxylic acid")
phenazines[[5]] <- new("MS2spectrum", id = "lib_entry_1",
annotation = "1-hydroxyphenazine",
spectrum = matrix(c(168.0632, 14,
169.0711, 288,
170.0743, 33,
179.0551, 62,
197.0653, 999),
byrow = TRUE,
ncol = 2))
phenazines[[6]] <- new("MS2spectrum", id = "lib_entry_2",
annotation = "2-hydroxy-phenazine-1-carboxylic acid",
spectrum = matrix(c(167.0621, 43,
179.0619, 93,
180.0650, 12,
195.0564, 40,
223.0509, 999,
224.0541, 142,
241.0611, 60),
byrow = TRUE,
ncol = 2))
phenazines[[7]] <- new("MS2spectrum", id = "lib_entry_3",
annotation = "pyocyanin (library spectrum)",
spectrum = matrix(c(168.0690, 58,
183.0927, 152,
184.0958, 19,
196.0640, 118,
197.0674, 15,
211.0873, 999,
212.0905, 145),
byrow = TRUE,
ncol = 2))
getSimilarities(sim_spec, phenazines, hits_only = FALSE)
#> M211.09T382.17 M224.08T640.69 M225.07T698.07 M269.06T708.74 lib_entry_1
#> 0.6451546 0.0000000 0.0000000 0.0000000 0.0000000
#> lib_entry_2 lib_entry_3
#> 0.0000000 0.6375061
As a result, we get the interesting information that the MS2 spectra similarity of our unknown feature seems to be specific to pyocyanin (both the experimental and the library spectrum).
MSnbase
objects to class
MS2spectrum
The MSnbase
package—which is commonly used for
proteomics applications and is also associated with XCMS3—has two
classes for (MS2) spectra, Spectrum
and
Spectrum2
which contain spectra along with metainformation.
These metainformation differ from those contained in
MS2spectrum
objects and are not very well suited for
metabolomics applications. Still, it is possible to use
CluMSID
functions with objects of those two classes by
converting them to MS2spectrum
objects using
as.MS2spectrum()
:
As polarity-switching and similar methords are gaining importance in LC-MS/MS metabolomics, CluMSID offers the possibility to process LC-MS/MS data containing spectra of different polarities. As spectra from positive and negative ionisation show different fragmentation mechanisms and patterns, it does not appear to be useful to compare spectra of different polarity to each other. Therefore, CluMSID provides a function to separate positive and negative spectra from each other. This has to be done in the very beginning of the analysis to not interfere with spectral merging. Positive and negative spectra can than be processed independently from each other as shown above.
A schematic workflow would like like this:
raw_list_mixedpolarities <- extractMS2spectra("raw_file_mixedpolarities.mzXML")
raw_list_positive <- splitPolarities(raw_list_mixedpolarities, "positive")
raw_list_negative <- splitPolarities(raw_list_mixedpolarities, "negative")
speclist_positive <- mergeMS2spectra(raw_list_positive)
speclist_negative <- mergeMS2spectra(raw_list_negative)
… and so on as described in this tutorial.
MS1 pseudospectra are groups of peaks/ions that derive or
are assumed to derive from the same compound. They consist of peaks for
in-source fragment, adducts etc. Pseudospectra can contain structural
information about analytes, e.g. about moieties that easily fragment
even in MS1 mode without CID. Thus, it might sometimes be
useful to study similarities between pseudospectra analogously to those
between MS2 spectra. CluMSID
makes use of the
CAMERA
package to assign peaks to pseudospectra. A custom
S4 class named pseudospectrum
is used which is very similar
to the MS2spectrum
class. For obvious reasons, it does not
contain a precursor ion m/z slot and thus no neutral loss
pattern, either. The pcgroup
defined by CAMERA
is used as ID, an annotation can be added if desired.
To extract pseudospectra, you first have to process your data using
the CAMERA
package, either in R or via XCMSonline, where
this is done automatically. There are two possibilities to use the
extractPseudospectra()
function in CluMSID
:
either with an xsAnnotate
object which you generate with
CAMERA
in R or with a data.frame
that contains
data on m/z, retention time, intensity and
pcgroup
, e.g. the results table from XCMSonline.
The latter is demonstrated with the XCMSonline results table already
used to generate a peak table. If the column names are not changed, the
data.frame
can be supplied as-is and
intensity_columns
does not have to be specified. We want to
exclude pseudospectra that have only one peak, so we set
min_peaks = 2
.
pstable <-
readr::read_delim(file = system.file("extdata",
"TD035_XCMS.annotated.diffreport.tsv",
package = "CluMSIDdata"),
delim = "\t")
pseudospeclist <- extractPseudospectra(pstable, min_peaks = 2)
As a result, we get a list with 198 pseudospectra that we can now process further.
The creation of a distance matrix is analogous to the procedure for MS2 spectra:
The distance matrix can now be used for MDS, clustering and correlation networks just like described above. For demonstration, we generate a correlation network:
With the exclusion of singletons, we get a much less busy plot than for MS2 data but we still find quite a few connections that may prove informative.
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] magrittr_2.0.3 metaMSdata_1.41.0 metaMS_1.43.0
#> [4] CAMERA_1.63.0 xcms_4.5.0 BiocParallel_1.41.0
#> [7] Biobase_2.67.0 BiocGenerics_0.53.0 CluMSIDdata_1.21.0
#> [10] CluMSID_1.23.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 rstudioapi_0.17.1
#> [3] jsonlite_1.8.9 MultiAssayExperiment_1.33.0
#> [5] farver_2.1.2 MALDIquant_1.22.3
#> [7] rmarkdown_2.28 fs_1.6.4
#> [9] zlibbioc_1.53.0 vctrs_0.6.5
#> [11] base64enc_0.1-3 htmltools_0.5.8.1
#> [13] S4Arrays_1.7.0 progress_1.2.3
#> [15] Formula_1.2-5 SparseArray_1.7.0
#> [17] mzID_1.45.0 sass_0.4.9
#> [19] KernSmooth_2.23-24 bslib_0.8.0
#> [21] htmlwidgets_1.6.4 plyr_1.8.9
#> [23] impute_1.81.0 plotly_4.10.4
#> [25] cachem_1.1.0 igraph_2.1.1
#> [27] lifecycle_1.0.4 iterators_1.0.14
#> [29] pkgconfig_2.0.3 Matrix_1.7-1
#> [31] R6_2.5.1 fastmap_1.2.0
#> [33] GenomeInfoDbData_1.2.13 MatrixGenerics_1.19.0
#> [35] clue_0.3-65 digest_0.6.37
#> [37] pcaMethods_1.99.0 colorspace_2.1-1
#> [39] GGally_2.2.1 S4Vectors_0.45.0
#> [41] Hmisc_5.2-0 GenomicRanges_1.59.0
#> [43] labeling_0.4.3 Spectra_1.17.0
#> [45] fansi_1.0.6 httr_1.4.7
#> [47] abind_1.4-8 compiler_4.5.0
#> [49] bit64_4.5.2 withr_3.0.2
#> [51] doParallel_1.0.17 backports_1.5.0
#> [53] htmlTable_2.4.3 DBI_1.2.3
#> [55] ggstats_0.7.0 highr_0.11
#> [57] gplots_3.2.0 MASS_7.3-61
#> [59] MsExperiment_1.9.0 DelayedArray_0.33.0
#> [61] gtools_3.9.5 caTools_1.18.3
#> [63] mzR_2.41.0 tools_4.5.0
#> [65] foreign_0.8-87 PSMatch_1.11.0
#> [67] ape_5.8 nnet_7.3-19
#> [69] glue_1.8.0 dbscan_1.2-0
#> [71] nlme_3.1-166 QFeatures_1.17.0
#> [73] grid_4.5.0 checkmate_2.3.2
#> [75] cluster_2.1.6 reshape2_1.4.4
#> [77] generics_0.1.3 gtable_0.3.6
#> [79] tzdb_0.4.0 preprocessCore_1.69.0
#> [81] tidyr_1.3.1 sna_2.8
#> [83] data.table_1.16.2 hms_1.1.3
#> [85] MetaboCoreUtils_1.15.0 utf8_1.2.4
#> [87] XVector_0.47.0 foreach_1.5.2
#> [89] pillar_1.9.0 stringr_1.5.1
#> [91] vroom_1.6.5 limma_3.63.0
#> [93] robustbase_0.99-4-1 dplyr_1.1.4
#> [95] lattice_0.22-6 bit_4.5.0
#> [97] RBGL_1.83.0 tidyselect_1.2.1
#> [99] knitr_1.48 gridExtra_2.3
#> [101] IRanges_2.41.0 ProtGenerics_1.39.0
#> [103] SummarizedExperiment_1.37.0 stats4_4.5.0
#> [105] xfun_0.48 statmod_1.5.0
#> [107] MSnbase_2.33.0 matrixStats_1.4.1
#> [109] DEoptimR_1.1-3 stringi_1.8.4
#> [111] UCSC.utils_1.3.0 statnet.common_4.10.0
#> [113] lazyeval_0.2.2 yaml_2.3.10
#> [115] evaluate_1.0.1 codetools_0.2-20
#> [117] archive_1.1.9 MsCoreUtils_1.19.0
#> [119] tibble_3.2.1 graph_1.85.0
#> [121] BiocManager_1.30.25 cli_3.6.3
#> [123] affyio_1.77.0 rpart_4.1.23
#> [125] munsell_0.5.1 jquerylib_0.1.4
#> [127] network_1.18.2 Rcpp_1.0.13
#> [129] GenomeInfoDb_1.43.0 MassSpecWavelet_1.73.0
#> [131] coda_0.19-4.1 XML_3.99-0.17
#> [133] parallel_4.5.0 readr_2.1.5
#> [135] ggplot2_3.5.1 prettyunits_1.2.0
#> [137] AnnotationFilter_1.31.0 bitops_1.0-9
#> [139] viridisLite_0.4.2 MsFeatures_1.15.0
#> [141] scales_1.3.0 affy_1.85.0
#> [143] ncdf4_1.23 purrr_1.0.2
#> [145] crayon_1.5.3 rlang_1.1.4
#> [147] vsn_3.75.0