This vignette describes how to analyse mass-spectrometry based differential localisation experiments using the BANDLE method (Crook et al. 2021). Data should be stored as lists of MSnSets. There is also features for quality control and visualisation of results. Please see other vignettes for convergence and other methodology.
bandle 1.8.0
In this vignette we use a real-life biological use-case to demonstrate how to analyse mass-spectrometry based proteomics data using the Bayesian ANalysis of Differential Localisation Experiments (BANDLE) method.
As mentioned in “Vignette 1: Getting Started with BANDLE” data from mass
spectrometry based proteomics methods most commonly yield a matrix of
measurements where we have proteins/peptides/peptide spectrum matches
(PSMs) along the rows, and samples/fractions along the columns. To use bandle
the data must be stored as a MSnSet, as implemented in the Bioconductor
MSnbase package. Please see the relevant vignettes in
MSnbase for constructing these data containers.
The data used in this vignette has been published in Mulvey et al. (2021) and is currently
stored as MSnSet instances in the the pRolocdata package. We
will load it in the next section.
In this workflow we analyse the data produced by Mulvey et al. (2021). In this experiment triplicate hyperLOPIT experiments (Mulvey et al. 2017) were conducted on THP-1 human leukaemia cells where the samples were analysed and collected (1) when cells were unstimulated and then (2) following 12 hours stimulation with LPS (12h-LPS).
In the following code chunk we load 4 of the datasets from the study: 2 replicates of the unstimulated and 2 replicates of the 12h-LPS stimulated samples. Please note to adhere to Bioconductor vignette build times we only load 2 of the 3 replicates for each condition to demonstrate the BANDLE workflow.
library("pRolocdata")
data("thpLOPIT_unstimulated_rep1_mulvey2021")
data("thpLOPIT_unstimulated_rep3_mulvey2021")
data("thpLOPIT_lps_rep1_mulvey2021")
data("thpLOPIT_lps_rep3_mulvey2021")By typing the names of the datasets we get a MSnSet data summary. For
example,
thpLOPIT_unstimulated_rep1_mulvey2021## MSnSet (storageMode: lockedEnvironment)
## assayData: 5107 features, 20 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: unstim_rep1_set1_126_cyto unstim_rep1_set1_127N_F1.4 ...
##     unstim_rep1_set2_131_F24 (20 total)
##   varLabels: Tag Treatment ... Fraction (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: A0AVT1 A0FGR8-2 ... Q9Y6Y8 (5107 total)
##   fvarLabels: Checked_unst.r1.s1 Confidence_unst.r1.s1 ... markers (107
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:48 2021. 
## Normalised to sum of intensities. 
##  MSnbase version: 2.14.2thpLOPIT_lps_rep1_mulvey2021## MSnSet (storageMode: lockedEnvironment)
## assayData: 4879 features, 20 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: lps_rep1_set1_126_cyto lps_rep1_set1_127N_F1.4 ...
##     lps_rep1_set2_131_F25 (20 total)
##   varLabels: Tag Treatment ... Fraction (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: A0A0B4J2F0 A0AVT1 ... Q9Y6Y8 (4879 total)
##   fvarLabels: Checked_lps.r1.s1 Confidence_lps.r1.s1 ... markers (107
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:57 2021. 
## Normalised to sum of intensities. 
##  MSnbase version: 2.14.2We see that the datasets thpLOPIT_unstimulated_rep1_mulvey2021 and
thpLOPIT_lps_rep1_mulvey2021 contain 5107 and 4879 proteins respectively,
across 20 TMT channels. The data is accessed through different slots of the
MSnSet (see str(thpLOPIT_unstimulated_rep1_mulvey2021) for all available
slots). The 3 main slots which are used most frequently are those that contain
the quantitation data, the features i.e. PSM/peptide/protein information and the
sample information, and these can be accessed using the functions exprs,
fData, and pData, respectively.
First, let us load the bandle package along with some other R packages needed
for visualisation and data manipulation,
library("bandle")
library("pheatmap")
library("viridis")
library("dplyr")
library("ggplot2")To run bandle there are a few minimal requirements that the data must fulfill.
list of MSnSet instancesIf we use the dim function we see that the datasets we have loaded have the
same number of channels but a different number of proteins per experiment.
dim(thpLOPIT_unstimulated_rep1_mulvey2021)## [1] 5107   20dim(thpLOPIT_unstimulated_rep3_mulvey2021)## [1] 5733   20dim(thpLOPIT_lps_rep1_mulvey2021)## [1] 4879   20dim(thpLOPIT_lps_rep3_mulvey2021)## [1] 5848   20We use the function commonFeatureNames to extract proteins that are common
across all replicates. This function has a nice side effect which is that it
also wraps the data into a list, ready for input into bandle.
thplopit <- commonFeatureNames(c(thpLOPIT_unstimulated_rep1_mulvey2021,  ## unstimulated rep
                                 thpLOPIT_unstimulated_rep3_mulvey2021,  ## unstimulated rep
                                 thpLOPIT_lps_rep1_mulvey2021,           ## 12h-LPS rep
                                 thpLOPIT_lps_rep3_mulvey2021))          ## 12h-LPS rep## 3727 features in commonWe now have our list of MSnSets ready for bandle with 3727 proteins common
across all 4 replicates/conditions.
thplopit## Instance of class 'MSnSetList' containig 4 objects.We can visualise the data using the plot2D function from pRoloc
## create a character vector of title names for the plots
plot_id <- c("Unstimulated 1st rep", "Unstimulated 2nd rep",
             "12h-LPS 1st rep", "12h-LPS 2nd rep")
## Let's set the stock colours of the classes to plot to be transparent
setStockcol(NULL)
setStockcol(paste0(getStockcol(), "90"))
## plot the data
par(mfrow = c(2,2))
for (i in seq(thplopit))
    plot2D(thplopit[[i]], main = plot_id[i])
addLegend(thplopit[[4]], where = "topleft", cex = .75)By default the plot2D uses principal components analysis (PCA)
for the data transformation. Other options such as t-SNE, kernal
PCA etc. are also available, see ?plot2D and the method argument.
PCA sometimes will randomly flip the axis, because the eigenvectors
only need to satisfy \(||v|| = 1\), which allows a sign flip. You will
notice this is the case for the 3rd plot. If desired you can flip
the axis/change the sign of the PCs by specifying any of the arguments
mirrorX, mirrorY, axsSwitch to TRUE when you call plot2D.
bandle: fitting GPs and setting the priorsAs mentioned in the first vignette, bandle uses a complex model to analyse the
data. Markov-Chain Monte-Carlo (MCMC) is used to sample the posterior
distribution of parameters and latent variables from which statistics of
interest can be computed. Again, here we only run a few iterations for brevity
but typically one needs to run thousands of iterations to ensure convergence, as
well as multiple parallel chains.
First, we need to fit non-parametric regression functions to the markers
profiles. We use the fitGPmaternPC function using the default penalised
complexity priors (see ?fitGP), which work well.
gpParams <- lapply(thplopit, function(x) fitGPmaternPC(x))We apply the fitGPmaternPC function on to each dataset by using lapply over
the thplopit list of data. The posterior predictive means, standard deviations
and MAP hyperparamters for the GP are returned. If desired we can visualise the
predictives overlaid onto the marker profiles of the data by using the plotGPmatern
function.
The prior needs to form a K*3 matrix (where K is the number of subcellular
classes in the data),
(mrkCl <- getMarkerClasses(thplopit[[1]], fcol = "markers"))##  [1] "40S/60S Ribosome"      "Chromatin"             "Cytosol"              
##  [4] "Endoplasmic Reticulum" "Golgi Apparatus"       "Lysosome"             
##  [7] "Mitochondria"          "Nucleolus"             "Nucleus"              
## [10] "Peroxisome"            "Plasma Membrane"So for this data we require a 11*3 matrix. Three columns are needed which
represent the hyperparameters length-scale, amplitude, variance. We have found
that the matrix(c(10, 60, 250), nrow = 1) worked well for the smaller datasets
with a few hundred proteins, as tested in Crook et al. (2021). Here, we found that
matrix(c(1, 60, 100) worked well. This is a bigger dataset with several
thousand proteins and many more subcellular classes. This was visually assessed
by passing these values and visualising the GP fit using the plotGPmatern
function. Generally, (1) increasing the lengthscale parameter (the first column
of the hyppar matrix) increases the spread of the covariance i.e. the similarity
between points, (2) increasing the amplitude parameter (the second column of the
hyppar matrix) increases the maximum value of the covariance and lastly (3)
decreasing the variance (third column of the hyppar matrix) reduces the
smoothness of the function to allow for local variations. We strongly recommend
users start with the recommended parameters and change and assess them as
necessary for their dataset by visually evaluating the fit of the GPs using the
plotGPmatern function.
K <- length(mrkCl)
pc_prior <- matrix(NA, ncol = 3, K)
pc_prior[seq.int(1:K), ] <- matrix(rep(c(1, 60, 100),
                                       each = K), ncol = 3)
head(pc_prior)##      [,1] [,2] [,3]
## [1,]    1   60  100
## [2,]    1   60  100
## [3,]    1   60  100
## [4,]    1   60  100
## [5,]    1   60  100
## [6,]    1   60  100Now we have generated these complexity priors we can pass them as an
argument to the fitGPmaternPC function. For example,
gpParams <- lapply(thplopit,
                   function(x) fitGPmaternPC(x, hyppar = pc_prior))By plotting the predictives using the plotGPmatern function we see that
the distributions and fit looks sensible for each class so we will proceed with
setting the prior on the weights.
par(mfrow = c(4, 3))
plotGPmatern(thplopit[[1]], gpParams[[1]])For the interest of keeping the vignette size small, in the above chunk we
plot only the first dataset and its respective predictive. To plot the
second dataset we would execute plotGPmatern(thplopit[[i]], gpParams[[i]])
where i = 2, and similarly for the third i = 3 and so on.
The next step is to set up the matrix Dirichlet prior on the mixing weights.
If dirPrior = NULL a default Dirichlet prior is computed see ?bandle. We
strongly advise you to set your own prior. In “Vignette 1: Getting Started with
BANDLE” we give some suggestions on how to set this and in the below code we try
a few different priors and assess the expectations.
As per Vignette 1, let’s try a dirPrior as follows,
set.seed(1)
dirPrior = diag(rep(1, K)) + matrix(0.001, nrow = K, ncol = K)
predDirPrior <- prior_pred_dir(object = thplopit[[1]],
                               dirPrior = dirPrior,
                               q = 15)The mean number of relocalisations is
predDirPrior$meannotAlloc## [1] 0.421633The prior probability that more than q differential localisations are
expected is
predDirPrior$tailnotAlloc## [1] 0.0016hist(predDirPrior$priornotAlloc, col = getStockcol()[1])We see that the prior probability that proteins are allocated to different
components between datasets concentrates around 0. This is what we expect, we
expect subtle changes between conditions for this data. We may perhaps wish to
be a little stricter with the number of differential localisations output by
bandle and in this case we could make the off-diagonal elements of the
dirPrior smaller. In the below code chunk we test 0.0005 instead of 0.001,
which reduces the number of re-localisations.
set.seed(1)
dirPrior = diag(rep(1, K)) + matrix(0.0005, nrow = K, ncol = K)
predDirPrior <- prior_pred_dir(object = thplopit[[1]],
                               dirPrior = dirPrior,
                               q = 15)
predDirPrior$meannotAlloc## [1] 0.2308647predDirPrior$tailnotAlloc## [1] 6e-04hist(predDirPrior$priornotAlloc, col = getStockcol()[1])Again, we see that the prior probability that proteins are allocated to different components between datasets concentrates around 0.
Now we have computed our gpParams and pcPriors we can run the main bandle
function.
Here for convenience of building the vignette we only run 2 of the triplicates
for each condition and run the bandle function for a small number of
iterations to minimise the vignette build-time. Typically we’d recommend you run
the number of iterations (numIter) in the \(1000\)s.
We first subset our data into two objects called control and treatment
which we subsequently pass to bandle along with our priors.
control <- list(thplopit[[1]], thplopit[[2]])
treatment <- list(thplopit[[3]], thplopit[[4]])
bandleres <- bandle(objectCond1 = control,
                    objectCond2 = treatment,
                    numIter = 50,       # usually 10,000
                    burnin = 5L,        # usually 5,000
                    thin = 1L,          # usually 20
                    gpParams = gpParams,
                    pcPrior = pc_prior,
                    numChains = 1,     # usually >=4
                    dirPrior = dirPrior,
                    seed = 1)A bandleParams object is produced
bandleres## Object of class "bandleParams"
## Method: bandle 
## Number of chains: 1bandle resultsFollowing Vignette 1 we populate the bandleres object by calling the
bandleProcess function. This may take a few seconds to process.
bandleres <- bandleProcess(bandleres)These slots have now been populated
summary(summaries(bandleres))##      Length Class         Mode
## [1,] 1      bandleSummary S4  
## [2,] 1      bandleSummary S4The posteriorEstimates slot gives posterior quantities of interest for
different proteins. The object is of length 2,
- 1 slot for control
- 1 slot for treatment
length(summaries(bandleres))## [1] 2We explicitly extract the posterior estimates and protein allocation predictions as follows
pe1 <- posteriorEstimates(summaries(bandleres)[[1]])
pe2 <- posteriorEstimates(summaries(bandleres)[[2]])
head(pe1)## DataFrame with 6 rows and 7 columns
##          bandle.allocation bandle.probability bandle.outlier
##                <character>          <numeric>      <numeric>
## A0AVT1             Cytosol           1.000000              0
## A0FGR8-2     Golgi Appa...           0.882133              0
## A0JNW5             Nucleus           1.000000              1
## A0MZ66-3           Cytosol           1.000000              0
## A0PJW6          Peroxisome           1.000000              0
## A1L0T0       Endoplasmi...           0.999860              0
##          bandle.probability.lowerquantile bandle.probability.upperquantile
##                                 <numeric>                        <numeric>
## A0AVT1                           1.000000                         1.000000
## A0FGR8-2                         0.589440                         0.998249
## A0JNW5                           1.000000                         1.000000
## A0MZ66-3                         1.000000                         1.000000
## A0PJW6                           1.000000                         1.000000
## A1L0T0                           0.998759                         1.000000
##          bandle.mean.shannon bandle.differential.localisation
##                    <numeric>                        <numeric>
## A0AVT1           0.00000e+00                         0.000000
## A0FGR8-2         0.00000e+00                         0.911111
## A0JNW5           2.65618e-10                         0.000000
## A0MZ66-3         0.00000e+00                         0.000000
## A0PJW6           0.00000e+00                         0.866667
## A1L0T0           0.00000e+00                         0.000000The full joint probability distribution can be found in the bandle.joint slot
e.g. for the control in slot 1 this would be
bandleJoint(summaries(bandleres)[[1]]) and the treatment in slot 2 this would
be bandleJoint(summaries(bandleres)[[2]]).
Let’s look at the posterior estimates and allocation predictions found in pe1
and pe2. Each object is a data.frame containing the protein allocations and
associated localisation probabilities for each condition. The 7 columns are
bandle.allocation which contains the the localisation predictions to one of the
subcellular classes that appear in the training data.bandle.probability is the allocation probability, corresponding to the mean
of the distribution probability.bandle.outlier is the probability of being an outlier. A high value
indicates that the protein is unlikely to belong to any annotated class (and is
hence considered an outlier).bandle.probability.lowerquantile and bandle.probability.upperquantile are
the upper and lower quantiles of the allocation probability distribution.bandle.mean.shannon is the Shannon entropy, measuring the uncertainty in the
allocations (a high value representing high uncertainty; the highest value is
the natural logarithm of the number of classes).bandle.differential.localisation is the differential localisation probability.We plot the distribution of protein allocations by bandle
par(mfrow = c(1, 2), oma = c(6, 2, 2, 2))
barplot(table(pe1$bandle.allocation), col = getStockcol()[2],
        las = 2, main = "Control: Protein allocation",
        ylab = "Number of proteins")
barplot(table(pe2$bandle.allocation), col = getStockcol()[2],
        las = 2, main = "Treatment: Protein allocation")The bar plot above tells us for this data bandle has allocated the majority of
unlabelled proteins to the nucleus. The allocation result for each condition
(found in bandle.allocation) is determined by bandle by looking at which
subcellular niche was given the highest probability from the full distribution
e.g. from bandle.joint. If we plot the bandle.probability (corresponding to
the mean of the distribution) against the protein allocation results we can see
that not all protein allocations are confident, this is why it is important to
threshold when deducing a protein’s location.
par(mfrow = c(1, 2), oma = c(6, 2, 2, 2))
boxplot(pe1$bandle.probability ~ pe1$ bandle.allocation, 
        col = getStockcol()[2], xlab = "",
        ylab = "BANDLE probability (mean)",
        las = 2, main = "Control: Probability distribution\n by allocation class")
boxplot(pe2$bandle.probability ~ pe1$ bandle.allocation, 
        col = getStockcol()[2], xlab = "", ylab = "",
        las = 2, main = "Treatment: Probability distribution\n by allocation class")As mentioned in Vignette 1, it is common to threshold allocation results based
on the posterior probability. Proteins that do not meet the threshold are not
assigned to a subcellular location and left unlabelled (here we use the
terminology “unknown” for consistency with the pRoloc package). It is
important not to force proteins to allocate to one of the niches defined here in
the training data, if they have low probability to reside there. We wish to
allow for greater subcellular diversity and to have multiple location, this is
captured essentially in leaving a protein “unlabelled” or “unknown”.
We use the bandlePredict function to append our results to the original
MSnSet datasets.
## Add the bandle results to a MSnSet
xx <- bandlePredict(control, 
                    treatment, 
                    params = bandleres, 
                    fcol = "markers")
res_0h <- xx[[1]]
res_12h <- xx[[2]]The BANDLE model combines replicate information within each condition to obtain the localisation of proteins for each single experimental condition.
The results for each condition are appended to the first dataset in the list
of MSnSets (for each condition). It is important to familiarise yourself with
the MSnSet data structure. To further highlight this in the below code chunk
we look at the fvarLabels of each datasets, this shows the column header names
of the fData feature data. We see that the first replicate at 0h e.g.
res_0h[[1]] has 7 columns with the output of bandle e.g.
bandle.probability, bandle.allocation, bandle.outlier etc. (as described
above) appended to the feature data (fData(res_0h[[1]])). The second dataset
at 0h i.e. res_0h[[2]] does not have this information appended to the feature
data. This is the same for the second condition at 12h post LPS stimulation.
fvarLabels(res_0h[[1]])
fvarLabels(res_0h[[2]])
fvarLabels(res_12h[[1]])
fvarLabels(res_12h[[2]])To obtain classification results we threshold using a 1% FDR based on the
bandle.probability and append the results to the data using the
getPredictions function from MSnbase.
## threshold results using 1% FDR
res_0h[[1]] <- getPredictions(res_0h[[1]], 
                              fcol = "bandle.allocation",  
                              scol = "bandle.probability",    
                              mcol = "markers", 
                              t = .99)## ans
##      40S/60S Ribosome             Chromatin               Cytosol 
##                   267                   210                   508 
## Endoplasmic Reticulum       Golgi Apparatus              Lysosome 
##                   154                   130                   285 
##          Mitochondria             Nucleolus               Nucleus 
##                   388                   112                   671 
##            Peroxisome       Plasma Membrane               unknown 
##                   162                   253                   587res_12h[[1]] <- getPredictions(res_12h[[1]], 
                               fcol = "bandle.allocation",
                               scol = "bandle.probability", 
                               mcol = "markers",      
                               t = .99)## ans
##      40S/60S Ribosome             Chromatin               Cytosol 
##                   160                   223                   465 
## Endoplasmic Reticulum       Golgi Apparatus              Lysosome 
##                   289                   299                   202 
##          Mitochondria             Nucleolus               Nucleus 
##                   362                   117                   750 
##            Peroxisome       Plasma Membrane               unknown 
##                   204                   352                   304A table of predictions is printed as a side effect when running getPredictions function.
In addition to thresholding on the bandle.probability we can threshold
based on the bandle.outlier i.e. the probability of being an outlier. A high value
indicates that the protein is unlikely to belong to any annotated class (and is
hence considered an outlier). We wish to assign proteins to a subcellular niche
if they have a high bandle.probability and also a low bandle.outlier probability.
This is a nice way to ensure we keep the most high confidence localisations.
In the below code chunk we use first create a new column called bandle.outlier.t
in the feature data which is 1 - outlier probability. This allows us then to use
getPredictions once again and keep only proteins which meet both the 0.99
threshold on the bandle.probability and the bandle.outlier.
Note, that running getPredictions appends the results to a new feature data
column called fcol.pred, please see ?getPredictions for the documentation.
As we have run this function twice, our column of classification results
are found in bandle.allocation.pred.pred.
## add outlier probability
fData(res_0h[[1]])$bandle.outlier.t <- 1 -  fData(res_0h[[1]])$bandle.outlier
fData(res_12h[[1]])$bandle.outlier.t <- 1 -  fData(res_12h[[1]])$bandle.outlier
## threshold again, now on the outlier probability
res_0h[[1]] <- getPredictions(res_0h[[1]], 
                              fcol = "bandle.allocation.pred",  
                              scol = "bandle.outlier.t",    
                              mcol = "markers", 
                              t = .99)## ans
##      40S/60S Ribosome             Chromatin               Cytosol 
##                    90                   144                   326 
## Endoplasmic Reticulum       Golgi Apparatus              Lysosome 
##                   152                    63                   219 
##          Mitochondria             Nucleolus               Nucleus 
##                   350                    67                    98 
##            Peroxisome       Plasma Membrane               unknown 
##                   133                   225                  1860res_12h[[1]] <- getPredictions(res_12h[[1]], 
                               fcol = "bandle.allocation.pred",
                               scol = "bandle.outlier.t", 
                               mcol = "markers",      
                               t = .99)## ans
##      40S/60S Ribosome             Chromatin               Cytosol 
##                   116                   173                   287 
## Endoplasmic Reticulum       Golgi Apparatus              Lysosome 
##                   255                   225                   170 
##          Mitochondria             Nucleolus               Nucleus 
##                   346                    94                   129 
##            Peroxisome       Plasma Membrane               unknown 
##                   187                   303                  1442Let’s append the results to the second replicate (by default they are appended
to the first only, as already mentioned above). This allows us to plot each
dataset and the results using plot2D.
## Add results to second replicate at 0h
res_alloc_0hr <- fData(res_0h[[1]])$bandle.allocation.pred.pred
fData(res_0h[[2]])$bandle.allocation.pred.pred <- res_alloc_0hr
## Add results to second replicate at 12h
res_alloc_12hr <- fData(res_12h[[1]])$bandle.allocation.pred.pred
fData(res_12h[[2]])$bandle.allocation.pred.pred <- res_alloc_12hrWe can plot these results on a PCA plot and compare to the original subcellular markers.
par(mfrow = c(5, 2))
plot2D(res_0h[[1]], main = "Unstimulated - replicate 1 \n subcellular markers", 
       fcol = "markers")
plot2D(res_0h[[1]], main = "Unstimulated - replicate 1 \nprotein allocations (1% FDR)", 
       fcol = "bandle.allocation.pred.pred")
plot2D(res_0h[[2]], main = "Unstimulated - replicate 2 \nsubcellular markers", 
       fcol = "markers")
plot2D(res_0h[[2]], main = "Unstimulated - replicate 2 \nprotein allocations (1% FDR)", 
       fcol = "bandle.allocation.pred.pred")
plot2D(res_0h[[1]], main = "12h LPS - replicate 1 \nsubcellular markers", 
       fcol = "markers")
plot2D(res_0h[[1]], main = "12h LPS - replicate 1 \nprotein allocations (1% FDR)", 
       fcol = "bandle.allocation.pred.pred")
plot2D(res_0h[[2]], main = "12h LPS - replicate 2 \nsubcellular markers", 
       fcol = "markers")
plot2D(res_0h[[2]], main = "12h LPS - replicate 2 \nprotein allocations (1% FDR)", 
       fcol = "bandle.allocation.pred.pred")
plot(NULL, xaxt='n',yaxt='n',bty='n',ylab='',xlab='', xlim=0:1, ylim=0:1)
addLegend(res_0h[[1]], where = "topleft", cex = .8)The differential localisation probability tells us which proteins are most
likely to differentially localise, that exhibit a change in their steady-state
subcellular location. Quantifying changes in protein subcellular location
between experimental conditions is challenging and Crook et al (Crook et al. 2021) have
used a Bayesian approach to compute the probability that a protein
differentially localises upon cellular perturbation, as well quantifying the
uncertainty in these estimates. The differential localisation probability is
found in the bandle.differential.localisation column of the bandleParams
output.
diffloc_probs <- pe1$bandle.differential.localisationIf we take a 5% FDR and examine how many proteins get a differential probability greater than 0.95 we find there are 619 proteins above this threshold.
length(which(diffloc_probs[order(diffloc_probs, decreasing = TRUE)] > 0.95))## [1] 687On a rank plot we can see the distribution of differential probabilities.
plot(diffloc_probs[order(diffloc_probs, decreasing = TRUE)],
     col = getStockcol()[2], pch = 19, ylab = "Probability",
     xlab = "Rank", main = "Differential localisation rank plot")This indicated that most proteins are not differentially localised and there are a few hundred confident differentially localised proteins of interest.
One advantage of using Bayesian methods over classic machine learning is the
ability to quantify the uncertainty in our estimates. There are many ways to do
this, as discussed in “Vignette 1: Getting Started with BANDLE”. In the below
code chunk we use the binomialDiffLocProb function to obtain credible
intervals from the binomial distribution and then extract a probability
estimate for the differential localisation.
Please note, that in interest of time and for the purpose of demonstration we
set nsample = 500 and thus only return 500 samples of the binomial distribution.
In practice the minimum recommended number of samples is 5000.
set.seed(1)
bin_t <- binomialDiffLocProb(params = bandleres, top = 500,
                             nsample = 500, decreasing = TRUE)As we have a large number of proteins as candidates we have chosen to threshold on the interval to reduce the number of differential localisations.
qt <- apply(bin_t, 1, function(x) quantile(x, .025))This leaves us with 147 proteins to investigate.
candidates <- names(which(qt > 0.95))
head(candidates)## [1] "A1L170-2" "B7ZBB8"   "O43169"   "O43464"   "O43493"   "O43719"Let’s add the results to each replicate in the MSnSets.
The reason for doing this is so that later on when we wish
to visulalise the data we have the information readily
accessible to make use of the functions in the pRoloc
package.
Let’s double check all datasets have the same proteins,
all(featureNames(res_0h[[1]]) == featureNames(res_0h[[2]]))## [1] TRUEall(featureNames(res_0h[[1]]) == featureNames(res_12h[[1]]))## [1] TRUEall(featureNames(res_12h[[1]]) == featureNames(res_12h[[2]]))## [1] TRUENow let’s add the differential location estimates,
dl.estimate <- qt[candidates]
fn <- featureNames(control[[1]])
cmn <- fn %in% names(dl.estimate)
## Add results to the 0h time-point (control)
for (i in seq(res_0h)) {
    ## create column called "dl.estimate" in the data
    mcol <- "dl.estimate"
    fData(res_0h[[i]])[, mcol] <- NA
    fData(res_0h[[i]])[cmn, mcol] <- dl.estimate
    ## create column called "dl.candidate" in the data
    mcol <- "dl.candidate"
    fData(res_0h[[i]])[, mcol] <- "unknown"
    fData(res_0h[[i]])[cmn, mcol] <- "DL candidate"
}
## Add results to the 12h time-point (treatment)
for (i in seq(res_12h)) {
    ## create column called "dl.estimate" in the data
    mcol <- "dl.estimate"
    fData(res_12h[[i]])[, mcol] <- NA
    fData(res_12h[[i]])[cmn, mcol] <- dl.estimate
    ## create column called "dl.candidate" in the data
    mcol <- "dl.candidate"
    fData(res_12h[[i]])[, mcol] <- "unknown"
    fData(res_12h[[i]])[cmn, mcol] <- "DL candidate"
}In the next section we can visualise these results.
There are several different ways we can visualise the output of bandle. Now we
have our set of candidates we can subset MSnSet datasets and plot the the
results.
To subset the data,
msnset_cands <- list(res_0h[[1]][candidates, ], 
                     res_12h[[1]][candidates, ])We can visualise this as a data.frame too for ease,
# construct data.frame
df_cands <- data.frame(
    fData(msnset_cands[[1]])[, c("bandle.differential.localisation", 
                                 "dl.estimate",
                                 "bandle.allocation.pred.pred")],
    fData(msnset_cands[[2]])[, "bandle.allocation.pred.pred"])
colnames(df_cands) <- c("bandle.differential.localisation", "dl.estimate", 
                        "0hr_location", "12h_location")
# order by highest differential localisation estimate
df_cands <- df_cands %>% arrange(desc(bandle.differential.localisation))
head(df_cands)##          bandle.differential.localisation dl.estimate 0hr_location
## A1L170-2                                1   0.9506301      unknown
## B7ZBB8                                  1   0.9525446      unknown
## O43169                                  1   0.9559090      unknown
## O43464                                  1   0.9505633   Peroxisome
## O43493                                  1   0.9528166      unknown
## O43719                                  1   0.9530232      unknown
##             12h_location
## A1L170-2         unknown
## B7ZBB8           unknown
## O43169   Golgi Apparatus
## O43464      Mitochondria
## O43493           unknown
## O43719           unknownWe can now plot this on an alluvial plot to view the changes in subcellular
location. The class label is taken from the column called
"bandle.allocation.pred.pred" which was deduced above by thresholding on the
posterior and outlier probabilities before assigning BANDLE’s allocation
prediction.
## set colours for organelles and unknown
cols <- c(getStockcol()[seq(mrkCl)], "grey")
names(cols) <- c(mrkCl, "unknown")
## plot
alluvial <- plotTranslocations(msnset_cands, 
                               fcol = "bandle.allocation.pred.pred", 
                               col = cols)## 147 features in common## ------------------------------------------------
## If length(fcol) == 1 it is assumed that the
## same fcol is to be used for both datasets
## setting fcol = c(bandle.allocation.pred.pred,bandle.allocation.pred.pred)
## ----------------------------------------------alluvial + ggtitle("Differential localisations following 12h-LPS stimulation")To view a table of the translocations, we can call the function plotTable,
(tbl <- plotTable(msnset_cands, fcol = "bandle.allocation.pred.pred"))## 147 features in common## ------------------------------------------------
## If length(fcol) == 1 it is assumed that the
## same fcol is to be used for both datasets
## setting fcol = c(bandle.allocation.pred.pred, bandle.allocation.pred.pred)
## ----------------------------------------------##               Condition1            Condition2 value
## 8              Chromatin               unknown     1
## 18               Cytosol               unknown     3
## 23 Endoplasmic Reticulum       Golgi Apparatus     3
## 28 Endoplasmic Reticulum               unknown     1
## 33       Golgi Apparatus Endoplasmic Reticulum     1
## 44              Lysosome       Golgi Apparatus     2
## 47              Lysosome       Plasma Membrane    12
## 48              Lysosome               unknown     4
## 53          Mitochondria Endoplasmic Reticulum     1
## 56          Mitochondria            Peroxisome     2
## 58          Mitochondria               unknown     3
## 63            Peroxisome Endoplasmic Reticulum     1
## 66            Peroxisome          Mitochondria     9
## 78       Plasma Membrane               unknown     4
## 81               unknown             Chromatin     2
## 82               unknown               Cytosol     1
## 83               unknown Endoplasmic Reticulum     7
## 84               unknown       Golgi Apparatus     3
## 85               unknown              Lysosome     1
## 87               unknown            Peroxisome     7
## 88               unknown       Plasma Membrane     8
## 89               unknown      40S/60S Ribosome     1
## 90               unknown             Nucleolus     2Although this example analysis is limited compared to that of Mulvey et al. (2021), we do see similar trends inline with the results seen in the paper. For examples, we see a large number of proteins translocating between organelles that are involved in the secretory pathway. We can further examine these cases by subsetting the datasets once again and only plotting proteins that involve localisation with these organelles. Several organelles are known to be involved in this pathway, the main ones, the ER, Golgi (and plasma membrane).
Let’s subset for these proteins,
secretory_prots <- unlist(lapply(msnset_cands, function(z) 
    c(which(fData(z)$bandle.allocation.pred.pred == "Golgi Apparatus"),
      which(fData(z)$bandle.allocation.pred.pred == "Endoplasmic Reticulum"),
      which(fData(z)$bandle.allocation.pred.pred == "Plasma Membrane"),
      which(fData(z)$bandle.allocation.pred.pred == "Lysosome"))))
secretory_prots <- unique(secretory_prots)
msnset_secret <- list(msnset_cands[[1]][secretory_prots, ], 
                    msnset_cands[[2]][secretory_prots, ])
secretory_alluvial <- plotTranslocations(msnset_secret, 
                                         fcol = "bandle.allocation.pred.pred", 
                                         col = cols)## 48 features in common## ------------------------------------------------
## If length(fcol) == 1 it is assumed that the
## same fcol is to be used for both datasets
## setting fcol = c(bandle.allocation.pred.pred,bandle.allocation.pred.pred)
## ----------------------------------------------secretory_alluvial + ggtitle("Movements of secretory proteins")In the next section we see how to plot proteins of interest.Our differential
localisation candidates can be found in df_cands,
head(df_cands)##          bandle.differential.localisation dl.estimate 0hr_location
## A1L170-2                                1   0.9506301      unknown
## B7ZBB8                                  1   0.9525446      unknown
## O43169                                  1   0.9559090      unknown
## O43464                                  1   0.9505633   Peroxisome
## O43493                                  1   0.9528166      unknown
## O43719                                  1   0.9530232      unknown
##             12h_location
## A1L170-2         unknown
## B7ZBB8           unknown
## O43169   Golgi Apparatus
## O43464      Mitochondria
## O43493           unknown
## O43719           unknownLet’s take the first protein as an example; protein with accession B2RUZ4. It has a high differential localisation score and it’s steady state localisation in the control is predicted to be lysosomal and in the treatment condition at 12 hours-LPS it is predicted to localise to the plasma membrane. This fits with the information we see on Uniprot which tells us it is Small integral membrane protein 1 (SMIM1).
In the below code chunk we plot the protein profiles of all proteins that were identified as lysosomal from BANDLE in the control and then overlay SMIM1. We do the same at 12hrs post LPS with all plasma membrane proteins.
par(mfrow = c(2, 1))
## plot lysosomal profiles
lyso <- which(fData(res_0h[[1]])$bandle.allocation.pred.pred == "Lysosome")
plotDist(res_0h[[1]][lyso], pcol = cols["Lysosome"], alpha = 0.04)
matlines(exprs(res_0h[[1]])["B2RUZ4", ], col = cols["Lysosome"], lwd = 3)
title("Protein SMIM1 (B2RUZ4) at 0hr (control)")
## plot plasma membrane profiles
pm <- which(fData(res_12h[[1]])$bandle.allocation.pred.pred == "Plasma Membrane")
plotDist(res_12h[[1]][pm], pcol = cols["Plasma Membrane"], alpha = 0.04)
matlines(exprs(res_12h[[1]])["B2RUZ4", ], col = cols["Plasma Membrane"], lwd = 3)
title("Protein SMIM1 (B2RUZ4) at 12hr-LPS (treatment)")We can also visualise there on a PCA or t-SNE plot.
par(mfrow = c(1, 2))
plot2D(res_0h[[1]], fcol = "bandle.allocation.pred.pred",
       main = "Unstimulated - replicate 1 \n predicted location")
highlightOnPlot(res_0h[[1]], foi = "B2RUZ4")
plot2D(res_12h[[1]], fcol = "bandle.allocation.pred.pred",
       main = "12h-LPS - replicate 1 \n predicted location")
highlightOnPlot(res_12h[[1]], foi = "B2RUZ4")All software and respective versions used to produce this document are listed below.
sessionInfo()## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggplot2_3.5.1        dplyr_1.1.4          pRolocdata_1.41.0   
##  [4] viridis_0.6.5        viridisLite_0.4.2    pheatmap_1.0.12     
##  [7] bandle_1.8.0         pRoloc_1.44.0        BiocParallel_1.38.0 
## [10] MLInterfaces_1.84.0  cluster_2.1.6        annotate_1.82.0     
## [13] XML_3.99-0.16.1      AnnotationDbi_1.66.0 IRanges_2.38.0      
## [16] MSnbase_2.30.1       ProtGenerics_1.36.0  mzR_2.38.0          
## [19] Rcpp_1.0.12          Biobase_2.64.0       S4Vectors_0.42.0    
## [22] BiocGenerics_0.50.0  BiocStyle_2.32.0    
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.4.0               filelock_1.0.3             
##   [3] tibble_3.2.1                hardhat_1.3.1              
##   [5] preprocessCore_1.66.0       pROC_1.18.5                
##   [7] rpart_4.1.23                lifecycle_1.0.4            
##   [9] httr2_1.0.1                 doParallel_1.0.17          
##  [11] globals_0.16.3              lattice_0.22-6             
##  [13] MASS_7.3-60.2               MultiAssayExperiment_1.30.0
##  [15] dendextend_1.17.1           magrittr_2.0.3             
##  [17] limma_3.60.0                plotly_4.10.4              
##  [19] sass_0.4.9                  rmarkdown_2.26             
##  [21] jquerylib_0.1.4             yaml_2.3.8                 
##  [23] MsCoreUtils_1.16.0          DBI_1.2.2                  
##  [25] RColorBrewer_1.1-3          lubridate_1.9.3            
##  [27] abind_1.4-5                 zlibbioc_1.50.0            
##  [29] GenomicRanges_1.56.0        purrr_1.0.2                
##  [31] mixtools_2.0.0              AnnotationFilter_1.28.0    
##  [33] nnet_7.3-19                 rappdirs_0.3.3             
##  [35] ipred_0.9-14                circlize_0.4.16            
##  [37] lava_1.8.0                  GenomeInfoDbData_1.2.12    
##  [39] ggrepel_0.9.5               listenv_0.9.1              
##  [41] gdata_3.0.0                 parallelly_1.37.1          
##  [43] ncdf4_1.22                  codetools_0.2-20           
##  [45] DelayedArray_0.30.0         xml2_1.3.6                 
##  [47] shape_1.4.6.1               tidyselect_1.2.1           
##  [49] farver_2.1.1                UCSC.utils_1.0.0           
##  [51] matrixStats_1.3.0           BiocFileCache_2.12.0       
##  [53] jsonlite_1.8.8              caret_6.0-94               
##  [55] e1071_1.7-14                ggalluvial_0.12.5          
##  [57] survival_3.6-4              iterators_1.0.14           
##  [59] foreach_1.5.2               segmented_2.0-4            
##  [61] tools_4.4.0                 progress_1.2.3             
##  [63] lbfgs_1.2.1.2               glue_1.7.0                 
##  [65] prodlim_2023.08.28          gridExtra_2.3              
##  [67] SparseArray_1.4.0           xfun_0.43                  
##  [69] MatrixGenerics_1.16.0       GenomeInfoDb_1.40.0        
##  [71] withr_3.0.0                 BiocManager_1.30.22        
##  [73] fastmap_1.1.1               fansi_1.0.6                
##  [75] digest_0.6.35               timechange_0.3.0           
##  [77] R6_2.5.1                    colorspace_2.1-0           
##  [79] gtools_3.9.5                lpSolve_5.6.20             
##  [81] biomaRt_2.60.0              RSQLite_2.3.6              
##  [83] utf8_1.2.4                  tidyr_1.3.1                
##  [85] generics_0.1.3              hexbin_1.28.3              
##  [87] data.table_1.15.4           recipes_1.0.10             
##  [89] FNN_1.1.4                   class_7.3-22               
##  [91] prettyunits_1.2.0           PSMatch_1.8.0              
##  [93] httr_1.4.7                  htmlwidgets_1.6.4          
##  [95] S4Arrays_1.4.0              ModelMetrics_1.2.2.2       
##  [97] pkgconfig_2.0.3             gtable_0.3.5               
##  [99] timeDate_4032.109           blob_1.2.4                 
## [101] impute_1.78.0               XVector_0.44.0             
## [103] htmltools_0.5.8.1           bookdown_0.39              
## [105] MALDIquant_1.22.2           clue_0.3-65                
## [107] scales_1.3.0                png_0.1-8                  
## [109] gower_1.0.1                 knitr_1.46                 
## [111] reshape2_1.4.4              coda_0.19-4.1              
## [113] nlme_3.1-164                curl_5.2.1                 
## [115] GlobalOptions_0.1.2         proxy_0.4-27               
## [117] cachem_1.0.8                stringr_1.5.1              
## [119] parallel_4.4.0              mzID_1.42.0                
## [121] vsn_3.72.0                  pillar_1.9.0               
## [123] grid_4.4.0                  vctrs_0.6.5                
## [125] pcaMethods_1.96.0           randomForest_4.7-1.1       
## [127] dbplyr_2.5.0                xtable_1.8-4               
## [129] evaluate_0.23               magick_2.8.3               
## [131] tinytex_0.50                mvtnorm_1.2-4              
## [133] cli_3.6.2                   compiler_4.4.0             
## [135] rlang_1.1.3                 crayon_1.5.2               
## [137] future.apply_1.11.2         LaplacesDemon_16.1.6       
## [139] mclust_6.1.1                QFeatures_1.14.0           
## [141] affy_1.82.0                 plyr_1.8.9                 
## [143] stringi_1.8.3               munsell_0.5.1              
## [145] Biostrings_2.72.0           lazyeval_0.2.2             
## [147] Matrix_1.7-0                hms_1.1.3                  
## [149] bit64_4.0.5                 future_1.33.2              
## [151] KEGGREST_1.44.0             statmod_1.5.0              
## [153] highr_0.10                  SummarizedExperiment_1.34.0
## [155] kernlab_0.9-32              igraph_2.0.3               
## [157] memoise_2.0.1               affyio_1.74.0              
## [159] bslib_0.7.0                 sampling_2.10              
## [161] bit_4.0.5Crook, Oliver M, Colin TR Davies, Laurent Gatto, Paul DW Kirk, and Kathryn S Lilley. 2021. “Inferring Differential Subcellular Localisation in Comparative Spatial Proteomics Using Bandle.” bioRxiv. doi: https://doi.org/10.1101/2021.01.04.425239.
Mulvey, Claire M., Lisa M. Breckels, Oliver M. Crook, David J. Sanders, Andre L. R. Ribeiro, Aikaterini Geladaki, Andy Christoforou, et al. 2021. “Spatiotemporal Proteomic Profiling of the Pro-Inflammatory Response to Lipopolysaccharide in the THP-1 Human Leukaemia Cell Line.” Nature Communications 12 (1). https://doi.org/10.1038/s41467-021-26000-9.
Mulvey, Claire M, Lisa M Breckels, Aikaterini Geladaki, Nina Kočevar Britovšek, Daniel J H Nightingale, Andy Christoforou, Mohamed Elzek, Michael J Deery, Laurent Gatto, and Kathryn S Lilley. 2017. “Using hyperLOPIT to Perform High-Resolution Mapping of the Spatial Proteome.” Nature Protocols 12 (6): 1110–35. https://doi.org/10.1038/nprot.2017.026.