1 Introduction

Here we present the R package RUCova, a novel method designed to address confounding factors such as heterogeneous cell size and staining efficiency in mass cytometry data. RUCova removes unwanted covariance using multivariate linear regression based on Surrogates of Unwanted Covariance (SUCs), and Principal Component Analysis (PCA).

RUCova comprises two major steps:

First, it fits a multivariate model for each measured marker ( $m$ ) across cells ( $i$ ) from samples ( $j_i$ ) with respect to the surrogates of sources of unwanted covariance (SUC, $\vec{x_i}$ ). Samples $j_i$ can be either different cell lines, perturbations, conditions, metacells (clusters), or even batches.
Second, it eliminates such dependency by assigning the residuals $\epsilon$ of the model as the new modified expression of the marker. The fit can be expressed as: $y^{m}_i(\vec{x}_{i},j_i) = \underbrace{O^m(j_i) + S^m(\vec{x}_{i},j_i) }_{M(\vec{x}_{i},j_i)} + \epsilon^m_i\$ , where $S^m(\vec{x}_{i},j_i)$ describes the slope of the fit and $O^m(j_i)$ the intercept or offset.

The predictors are the surrogates of sources unwanted covariance (SUC, $\vec{x}$ ), which can be either specific markers as proxies of the confounding factors or the principal components (PCs) derived from a Principal Component Analysis (PCA) performed on such markers.

RUCova offers 3 different uni- or multi-variate linear models to describe the relationship between marker expression and SUC: (1) $M_1(\vec{x}_{i})$ : simple, (2) $M_2(\vec{x}_i,j_i)$ : offset, and (3) $M_3(\vec{x}_i,j_i)$ : interaction. In this vignette you can find a detailed description of these methods and their differences.

The RUCova method eliminates the dependency of each measured marker on the SUCs by computing the model’s residuals and the intercept as the revised expression for each marker ( $y^{m*}_i$ ).:

$y^{*m}_i(j_i) = O_1^m(j_i) + \epsilon^m_i$ , where $\epsilon^m_i$ are the residuals of the model.

In this vignette, we guide you step-by-step through it so you can make good use this package and reliably analyse single-cell mass cytometry data.

1.1 Citation

If you use RUCova in published research, please cite:

RUCova: Removal of Unwanted Covariance in mass cytometry data Rosario Astaburuaga-García, Thomas Sell, Samet Mutlu, Anja Sieber, Kirsten Lauber, Nils Blüthgen. Bioinformatics 2024; doi: https://doi.org/10.1101/2024.05.24.595717

2 Installation

To install the stable release of RUCova from Bioconductor (recommended), run:


if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("RUCova")

To install the latest development version directly from GitHub, use:


remotes::install_github("molsysbio/RUCova@devel", force = TRUE)

library(RUCova)
library(ggplot2)
library(tidyr)
library(tibble)
library(dplyr)
library(SingleCellExperiment)
theme_set(theme_classic())

3 Input data

Let’s first go through each input of RUCova.

3.1 sce

Is a SingleCellExperiment object containing an assay (e.g.: “counts”) with mass cytometry data of single-cell marker signals [rows = markers, columns = cells] in linear scale. This data should be clean, meaning without calibration beads, debris, doublets, and dead cells, and single-cells are demultiplexed (important if you want to adapt the linear fits to the samples). Metadata of the cells (eg.: samples, treatment, etc) are stored in colData.

In this example we offer a mass cytometry data set in form of a tibble consisting of 8 Head-and-Neck Squamous Cell Carcinoma (HNSCC) lines in irradiated (10 Gy) and control (0 Gy) conditions. It corresponds to the data in Figure 2 and Figure 3 of the manuscript, but subsampled to 500 cells per line and condition.

data <- RUCova::HNSCC_data
data <- as.data.frame(data)
colnames(data) <- make.names(colnames(data)) 
head(data)
#>     bc    Time Event_length  Pd102Di  Pd104Di  Pd105Di  Pd106Di  Pd108Di
#> 1 BC08 4242820           13 546.7474 22.72933 19.88457 451.7216 417.2545
#> 2 BC08 1789821           14 222.9223 18.07157 11.54708 265.7439 221.4544
#> 3 BC08 3652033           12 297.9668 15.28636 12.25072 245.5512 282.9252
#> 4 BC08 3377584           20 508.8753 22.95869 31.89641 516.1778 464.6930
#> 5 BC08 2382658           13 556.1230 29.67961 43.14814 530.4541 529.1018
#> 6 BC08 4431651           14 330.9426 11.54808 11.52863 244.8138 296.4420
#>     Pd110Di        IdU  Ce140Di     pChk2   cCasp3      cPARP  pan_Akt      pH3
#> 1  3.657092   1.513940 320.8808 65.922653 93.28447 29.5808659 62.74430 8.942398
#> 2  7.112436   0.000000 153.0259 30.766462 34.50168 13.3431931 55.06167 4.529729
#> 3 10.820088   0.000000 118.8508  6.252229 19.16000  6.4659705 23.14730 0.000000
#> 4  9.890303 585.112671 432.2830 42.141312 75.54626  9.6980858 87.53520 4.364116
#> 5 19.013130   2.273655 276.7180 22.336010 59.24590  0.7331275 41.64369 9.424998
#> 6 11.214272   9.763472 214.9876 24.970592 65.60184  7.6563377 66.69602 8.903942
#>   Cyclin_D1    pH2A.X   pSmad1.8      YAP  pMEK1.2      pAkt  pSmad2.3
#> 1 22.690607 2.1654239  1.7064563 62.78539 3.014425 12.870280 40.063335
#> 2 15.481157 0.0000000  7.4238014 52.48014 6.863272  1.023845 33.989986
#> 3  1.559301 0.0000000  0.7789557 35.73836 1.096706  2.010562  6.333865
#> 4  5.714023 5.2749600  4.0601649 33.65604 1.314413  8.461620 72.930046
#> 5 22.867458 2.0731292  0.2730380 60.78325 7.590259  2.415889 42.437962
#> 6 11.319405 0.5927898 11.4713850 53.22401 4.346868 14.388021 44.998573
#>       pStat1      pNFkB      p.p38   pStat3    pCDC25c  Cyclin_B1     Ki.67
#> 1  7.2032270 13.5217361 10.6009207 0.000000  8.0731153  0.5236204  7.889261
#> 2  0.0000000  2.7251220  0.1972288 1.825840  1.7840031  0.0000000 14.613651
#> 3  0.0000000  0.0000000  9.0423908 1.481277  0.7777136  0.1674132  4.776067
#> 4  0.5150139 14.4650135 18.9907207 0.000000 17.6455097  0.9212017 17.326904
#> 5  1.8294960  0.2578354 27.2833290 0.000000  5.2420778 46.5049477 25.254673
#> 6 10.3977222  7.7564101 11.0821972 6.946990  8.1020355  6.2739663 88.163605
#>         pRb     IkBa        CXCL1  pAkt_T308    GDF15   p4e.BP1  pERK1.2
#> 1  73.95050 40.49585 0.0000000000 0.00000000 0.000000 102.91698 0.000000
#> 2  24.58367 51.41679 0.0001835762 0.06737211 0.000000  99.53587 0.000000
#> 3  57.21548 35.68195 0.0000000000 0.00000000 0.000000  60.02824 0.000000
#> 4 278.09930 54.17620 3.2442445755 1.75861299 0.000000 147.40877 2.847847
#> 5 120.08437 28.66582 0.0000000000 0.00000000 0.000000 112.68913 1.945096
#> 6  86.16685 55.65934 2.7437324524 2.92606282 4.006303 133.04027 2.471024
#>       p.p53       NICD total_ERK         pS6 Lamin_B1 DNA_191Ir DNA_193Ir
#> 1 0.0000000 0.01819583 11.213743   1.9784247 190.2875  110.0675  182.4585
#> 2 0.5755522 0.00000000 16.922426   0.5345835 451.9402  135.5184  204.8300
#> 3 0.0000000 2.00119233  4.781286   2.3451004 118.6992  163.1378  283.1072
#> 4 0.1994688 3.60878825 10.808772   0.0000000 425.7264  364.0222  652.8927
#> 5 3.6124716 2.11165023  3.111277 448.8321228 315.2308  356.6400  545.4122
#> 6 1.4175761 5.52166224 20.698357   0.1589808 245.5370  168.6429  281.4619
#>   Dead_cells_194Pt Dead_cells_195Pt Dead_cells_196Pt Dead_cells_198Pt   line
#> 1        222.91628        3.6735010        0.8512599         5.417649 UDSCC2
#> 2         94.29201        2.9783978        0.7097119         4.697655 UDSCC2
#> 3        218.48654        1.4468752        0.0000000         9.654497 UDSCC2
#> 4        575.00250       11.7251320        0.8311821         8.197575 UDSCC2
#> 5        488.76108        0.9324946        0.7700021         5.099325 UDSCC2
#> 6        219.33310        5.6865716        3.8901842         0.686854 UDSCC2
#>   irradiated replicate Pt194Di_norm beadsOut singlets lowPt dose
#> 1      FALSE         1    0.5552820     TRUE     TRUE  TRUE  0Gy
#> 2      FALSE         1    0.2348803     TRUE     TRUE  TRUE  0Gy
#> 3      FALSE         1    0.5442475     TRUE     TRUE  TRUE  0Gy
#> 4      FALSE         1    1.4323248     TRUE     TRUE  TRUE  0Gy
#> 5      FALSE         1    1.2174984     TRUE     TRUE  TRUE  0Gy
#> 6      FALSE         1    0.5463563     TRUE     TRUE  TRUE  0Gy
rownames(data) <- 1:dim(data)[1]
data <- data |> mutate(cell_id = 1:n())

We convert our data into a SingleCellExperiment to ensure interoperability across Bioconductor packages. We also add an id per cell to be able to do a cell-wise signals value modification.

sce <- SingleCellExperiment(
  assays = list(counts = t(data[,4:48])),    # Assay data
  rowData = DataFrame(measured_channels = colnames(data[,4:48])),# Metadata for rows
  colData = DataFrame(cell_id = data$cell_id,
                      line = data$line,
                      dose = data$dose), # Metadata for columns
)

### RUCova::sce stores the SingleCellExperiment object above

3.2 name_assay_before

A string specifying the name of the assay containing the data before applying RUCova. In this example, name_assay_before = "counts".

3.3 markers

A vector of markers $m$ ,for which we want to regress out the unwanted covariance. In this example:

m <- c("pH3","IdU","Cyclin_D1","Cyclin_B1", "Ki.67","pRb","pH2A.X","p.p53","p.p38","pChk2","pCDC25c","cCasp3","cPARP","pAkt","pAkt_T308","pMEK1.2","pERK1.2","pS6","p4e.BP1","pSmad1.8","pSmad2.3","pNFkB","IkBa", "CXCL1","Lamin_B1", "pStat1","pStat3", "YAP","NICD")

The character variables in the vectors x (surrogates) and m (markers) must be findable as column names in the tibble data.

3.4 SUCs

Since cell volume and labeling efficiency cannot be directly measured with mass cytometry, we use four urrogates of sources of nwanted ovariance (SUCs):

$\vec{x}_i = [\rm{mean~DNA_i},\rm{mean~highest~BC}_i,\rm{total~ERK_i},\rm{pan~Akt_i}]$

3.4.1 Mean DNA: Mean value of normalised iridium channels

$\rm{mean~DNA}_i = \frac{Ir191_{norm,i} + Ir193_{norm,i}}{2}$ , where $\rm Ir191_{norm}$ and $\rm Ir193_{norm}$ are the percentile-normalised intensities (e.g., to the 95th percentile) of the iridium intercalators serving as DNA stains for each cell $i$ .

The RUCova function called RUCova::calc_mean_DNA applies the asinh() function and adjusts the transformed distributions of the iridium isotopes (Ir191 and Ir193) by matching a specific percentile q to take then the mean value of these two signals per cell. The function returns a vector with the mean DNA signals in linear scale, as it applies the inverse transformation sinh().

# Identify the relevant measured channels (DNA_191Ir and DNA_193Ir)
dna_channels <- c("DNA_191Ir", "DNA_193Ir")
# Calculate and add the `mean_DNA` column to the colData of the SingleCellExperiment
sce <- RUCova::calc_mean_DNA(sce, name_assay = "counts", dna_channels, q = 0.95)

3.4.2 Mean BC: Mean value of the highest (used) barcoding isotopes per cell:

$\rm{mean~highest~BC}_i = \frac{1}{N_{BC}}\sum_{k=1}^{N_{BC}}{BC_{i,k,norm}}$ , where $\rm BC_{norm}$ are the percentile-normalised intensities of the barcoding isotopes across all cells ( to the 95th percentile), $\rm N_{BC}$ is the number of barcoding isotopes used per cell (e.g.: $\rm N_{BC}$ = 3 for the Fluidigm kit of Palladium isotopes), and $\rm BC_k$ is the barcoding isotope with the k-th highest signal in cell $i$ , meaning the isotope was used in that cell to barcode it.

The RUCova function called RUCova::calc_mean_BC works in two steps. First, it applies the asinh() function and adjusts the transformed distributions of the barcoding isotopes by matching a specific percentile q. Then, it looks at the signals of each isotope for each cell and picks the top n_bc signals used for barcoding. The function returns a vector with the mean BC signals in linear scale, as it applies the inverse transformation sinh().

In the following example the Cell-ID 20-Plex Pd Barcoding Kit was used, in addition to Platinum 194 and 198, where 4 out of 8 isotopes are mixed to form one barcode:


# Identify the barcoding channels
bc_channels <- c("Pd102Di", "Pd104Di", "Pd105Di", "Pd106Di", "Pd108Di", "Pd110Di", 
                 "Dead_cells_194Pt", "Dead_cells_198Pt")
# Calculate and add the `mean_BC` column to the colData of the SingleCellExperiment
sce <- RUCova::calc_mean_BC(sce, name_assay = "counts", bc_channels, n_bc = 4, q = 0.95)

3.4.3 pan Akt and total ERK

The surrogates pan Akt and total ERK are markers included in our mass cytometry panel.

The selection of SUCs should be strategically aligned with the biological system, experimental framework, and specific markers involved. In this investigation, our mass cytometry panel is primarily focused on the MAPK pathway and its interactions. Consequently, selecting total ERK and pan Akt as SUCs is deemed suitable. Nonetheless, we are aware that these markers may not always be optimal in other scenarios due to possible variations. Alternatively, markers such as metabolic enzymes like GAPDH present a viable option. Although GAPDH was not tested in our study, we recognize its importance and frequent use as a control in western blots.

Now we define our vector of SUCs as:

x = c("total_ERK", "pan_Akt", "mean_DNA", "mean_BC")

3.4.4 PCA

SUCs can also be linearly transformed into a new coordinate system by applying Principal Component Analysis. By doing this, users can decide the extent of unwanted correlation to be removed by using eg.: only PC1 as a predictive variable in the univariate model, or PC1 to PC2, or all PCs equivalent to taking all surrogates as predictive variables (more examples below).

pca <- t(assay(sce,"counts")) |> 
  as_tibble() |> 
  select(all_of(x)) |> 
  mutate_all(asinh) |> 
  mutate_all(scale) |> 
  prcomp()

This PCA can be stored in the SingleCellExperiment object as follows:

reducedDim(sce, "PCA") <- pca$x

3.5 name_reduced_dim

Name of the dimensionality reduction data stored under reducedDim(). In this example is "PCA. This is important if the regression is desired to be fitted as a function of principal components or other reduced dimensions.

3.6 apply_asinh_SUCs

Apply (TRUE) or not (FALSE) asinh transformation to the SUCs. TRUE if SUCs are the measured surrogates ( $x$ ), FALSE if SUCs are PCs.

3.7 model

A character: simple, offset or interaction defining the linear model to be used. Here we offer a quick description of each model and their rationale.

3.7.1 M1: Simple

Consists of one fit per measured marker $y^m_i$ across the entire data set.

$M_1(\vec{x}_{i}) = \underbrace{\beta^m}_{=O_1^m} + \underbrace{\sum_{p = 1}^{N_{SUC}} \alpha^m_{p} \cdot x_{i,p}}_{=S^m_1(\vec{x}_{i})}\$ , where $\beta_0^m$ is the intercept and $\alpha_p^m$ is the slope coefficient for each predictor or SUC $p$ .

This is suitable when the data set includes one cell line/organoid/biological system with multiple perturbations, and the goal is to evaluate the treatment effect across these perturbations. We recommend this model when the relationship between marker abundance and the source of confounding factors (slope) is theoretically thought to be consistent across the data set.

3.7.2 M2: Offset

Consists of one fit per measured marker $y^m_i$ and sample $j_i$ . The fits for the samples share the same slope, while differing in the intercept (offset term $O^m_2(j_i)$ ). Samples $j_i$ can be either different cell lines, perturbations, conditions, metacells (clusters), or even batches.

$M_2(\vec{x}_{i},j_i) = \underbrace{\beta^m_{j_i}}_{=O_2^m(j_i)} + \underbrace{\sum_{p = 1}^{N_{SUC}} \alpha^m_{p} \cdot x_{i,p}}_{=S_2^m(\vec{x}_{i})}$

The offset - intercept - term $O^m$ can either represent desired variation between samples (keep_offset = TRUE) or unwanted variation (keep_offset = FALSE). If the offset is considered wanted, the output from the offset model will closely resemble that of the simple model. However, the offset model becomes particularly important when the variation between samples is unwanted, such as when different samples represent batches and the goal is to remove batch effects.

3.7.3 M3: Interaction

Consists of one fit per measured marker $y_i^m$ and sample $j_i$ . The fits for the samples can have different slopes (interaction term $S_3^m(\vec{x_i},j_i)$ ) and intercepts (offset term $O_3^m(j_i)$ ).

$M_3(\vec{x}_{i},j_i) = \underbrace{\beta^m_{j_i}}_{=O_3^m(j_i)} + \underbrace{\sum_{p = 1}^{N_{SUC}} \alpha^m_{j_i,p} \cdot x_{i,p}}_{=S_3^m(\vec{x}_{i},j_i)}\$

This model is ideal for datasets with different cell types (e.g., immune cells) or when comparing treatment effects across different cell lines with varying relationships between marker abundance and the source of unwanted covariance (e.g., cell size). Similarly, when removing batch effects is desired, fitting an interaction model per batch and removing the offset between fits can be effective (keep_offset = FALSE).

3.8 col_name_sample

A character indicating the column name in data defining each sample (cell lines, batches, treatments, etc). Clarifying examples are below.

3.9 center_SUCs

A character across_samples or per_sample defining how to zero-center the SUCs.

3.9.1 across samples

SUCs are zero-centered across all samples before applying the linear model. This is recommended when the relationship (e.g.: logFC) between samples is an unwanted covariance and it is thought to be affected by cell size or staining efficiency.

Figure 1: RUCova models centering SUCs across samples
A,B) Simple model. C,D) Offset model. E,F) Interaction model. A,C,E) Before RUCova. B,D,F) After RUCova.

3.9.2 per sample

SUCs are zero-centered per sample applying the linear model. This is recommended when the relationship (e.g.: logFC) between samples is a desired covariance and is not affected by cell size or staining efficiency (more conservative)

Figure 2: RUCova models centering SUCs per sample
A,B) Simple model. C,D) Offset model. E,F) Interaction model. A,C,E) Before RUCova. B,D,F) After RUCova.

3.10 keep_offset

As written above, the RUCova method eliminates the dependency of each measured marker on the SUCs by computing the model’s residuals and the intercept as the revised expression for each marker ( $y^{m*}_i$ ). If the offset between samples $O^m(j_i$ is a desired covariance (e.g.: cell lines), then keep_offset = TRUE, and the modified value of the marker is described by: $y^{*m}_i(j_i) = O_1^m(j_i) + \epsilon^m_i$ , where $\epsilon^m_i$ are the residuals of the model. If the offset between samples is an unwanted covariance (e.g.: batches), then keep_offset = FALSE, and the modified value of the marker is: $y^{*m}_i(j_i) = \epsilon^m_i$ ## name_assay_after

A string specifying the name of the assay containing the data after applying RUCova.

4 Zero values

RUCova fits the chosen linear model including the zero values. By removing the correlation with the SUCs, zeros will get a value accordingly. Hence, RUCova addresses the issue of zero-values in mass cytometry data by assigning non-zero values while simultaneously removing such covariance. Examples below will further clarify this point.

5 Output data

The output of the function rucova is a SingleCellExperiment containing the same information as the input SingleCellExperiment, plus:

Additional assay named name_assay_after containing the modified marker intensity values in linear scale.

A metadata list named model_ concatenated with name_assay_after accesible via metadata(sce). This list contains:

All the input variables: name_assay_before, markers, SUCs, name_reduced_dim, apply_asinh_SUCs, model, col_name_sample, center_SUCs, keep_offset, `` (to remind you how you obtained this result).

model_formula: the lm model formula applied to each marker to regress-out surrogates of sources of unwanted covariance.

model_coefficients: intercept $O^m(j_i)$ and slope $\alpha^m_{p}$ of the fit for surrogate $p$ and marker $m$ (and sample $j$ if applicable). This values are for dummy variables. For meaningful coefficients, refer to effective_coefficients.

model_residuals: the residuals of the fit $\epsilon^m_i$ .

adj_r2: adjusted R-squared or coefficient of determination to evaluate the goodness of fit for each marker. It quantify the proportion of the variance in the dependent variable (marker) that is explained by the independent variables (surrogates) in the model. The adjusted R-squared is adjusted by the number of independent variables used to predict the target variable. This is done to account for the automatic increase of $R^2$ values when extra explanatory variables are added to the model. By analysing the $\rm{R^2_{adj}}$ ,we can determine whether adding new variables increases the model fit. $R_{adj}^2 = \Big(1 - \underbrace{(1-\frac{SS_{res}}{SS_{tot}}}_{R^2})\Big)\cdot\frac{n-1}{n-q-1}$

where $SS_{res} = \sum_i \epsilon_i^2$ is the residual sum of squares, $\epsilon_i$ are the residuals of the model, $SS_{tot} = \sum_i(y_i - \overline{y})$ is the total sum of squares, $\overline{y}$ is the mean value of the marker, $n$ is the sample size (total number of cells $i$ ) and $q$ is the number of explanatory variables in the model.

stand_slopes: We standardised the slope coefficients in order to make them comparable. In the case of the interaction model, where the slopes $\alpha^m$ depend on samples $j_i$ and the SUC $p$ , we standardised the slope by multiplying it by the standard deviation of the corresponding SUC in each sample $j$ ( $\sigma_{x_{j_i,p}}$ ) and dividing it by the standard deviation of the marker $m$ in sample $j$ ( $\sigma_{y^m_{j_i}}$ ) : $\alpha^{m*}_{j_i,x_{i,p}} = \alpha^m_{j_i,x_{i,p}}\cdot \frac{\sigma_{x_{j_i,p}}}{\sigma_{y^m_{j_i}}}$

6 Examples

6.1 M1: Simple model

As we have multiple cancer cell lines in the current data set, we run the simple model in only one cell line and across control and treated condition.

sce_Cal33 <-  sce[,sce$line == "Cal33"]

Let’s give a look at the pearson correlation coefficients between markers before applying RUCova (symmetric matrix). RUCova offers the option of returning the correlation coefficients on a matrix or directly the heatmap, which is automatically plotted and can also be stored:

matrix_corr <- RUCova::compare_corr(sce_Cal33[c(m,x)]) 
heatmap_corr <- RUCova::heatmap_compare_corr(sce_Cal33[c(m,x)]) #same in lower and upper triangle


RUCova::heatmap_compare_corr(sce_Cal33[c(m,x)]) #same in lower and upper triangle

6.1.1.1.1 maintaing the logFC between treated and control
6.1.1.1.2 changing logFC between samples accordingly

Let’s say you’re not convinced yet by the package, so you want to be conservative and remove correlations to all SUCs but not the logFC between conditions. In this case, center_SUCs = "per_sample".

Now we can compare the new Pearson correlation coefficients calculated from the assay counts_simple_persample (after RUCova, upper triangle) to the original coefficients from the assay ``counts``` (lower triangle).


heatmap_compare_corr(sce_Cal33, name_assay_before = "counts", name_assay_after = "counts_simple_persample")

Log fold-changes between irradiated and control condition are kept (positive means higher in irradiated).


FC_before <-  t(assay(sce_Cal33,"counts")) |>  
  as.tibble() |> 
  cbind(colData(sce_Cal33)) |> 
  mutate_at(vars(x,m), asinh) |> 
  pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |> 
  group_by(marker) |> 
  summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |> 
  mutate(data = "before RUCova") 

FC_after <- t(assay(sce_Cal33,"counts_simple_persample")) |>  
  as.tibble() |> cbind(colData(sce_Cal33))  |> 
  mutate_at(vars(x,m), asinh) |> 
  pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |> 
  group_by(marker) |> 
  summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |> 
  ungroup() |> 
  mutate(data = "simple all, per sample")

rbind(FC_before,FC_after) |> 
  ggplot(aes(x = logFC, y = marker, fill = data)) + 
  geom_col(position = "dodge")

As radiation changes the cell volume, we think differences in protein intensities between treated and control are confounded. Hence, we want to remove any difference that correlates with the SUCs (center_SUCs = "across_sample").


heatmap_compare_corr(sce_Cal33, name_assay_before = "counts", name_assay_after = "counts_simple_acrosssamples")

Log fold-changes between irradiated and control condition are modified accordingly (positive means higher in irradiated).

FC_before <-  t(assay(sce_Cal33,"counts")) |>  
  as.tibble() |> 
  cbind(colData(sce_Cal33)) |> 
  mutate_at(vars(x,m), asinh) |> 
  pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |> 
  group_by(marker) |> 
  summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |> 
  mutate(data = "before RUCova") 

FC_after <- t(assay(sce_Cal33,"counts_simple_acrosssamples")) |>  
  as.tibble() |> cbind(colData(sce_Cal33))  |> 
  mutate_at(vars(x,m), asinh) |> 
  pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |> 
  group_by(marker) |> 
  summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |> 
  ungroup() |> 
  mutate(data = "simple all, per sample")

rbind(FC_before,FC_after) |> 
  ggplot(aes(x = logFC, y = marker, fill = data)) + 
  geom_col(position = "dodge")

Let’s imagine you want to be conservative and only remove correlations between markers and PC1 (of SUCs).

pca_cal33 <- t(assay(sce_Cal33,"counts")) |>
  as.tibble() |> 
  cbind(colData(sce_Cal33)) |> 
  select(x) |>
  mutate_all(asinh) |>
  mutate_all(scale) |>
  prcomp()

Calculate and plot the variance explained by each PC:

tibble(perc = as.numeric(pca_cal33$sdev^2/sum(pca_cal33$sdev^2))*100,
       PC = 1:length(pca_cal33$sdev)) |>
  ggplot(aes(x = PC, y = perc, label = round(perc,1))) +
  geom_col() +
  geom_label()

Check the loadings of each PC:

as.data.frame(pca_cal33$rotation) |>
  rownames_to_column("x") |>
  pivot_longer(names_to = "PC", values_to = "loadings", -x) |>
  ggplot(aes(x = loadings, y = x)) +
  geom_col() +
  facet_wrap(~PC, nrow = 1)

In this example, PC1 has positive loadings. Meaning PC1 will positively correlate with the markers, which is intuitive if we think of it as the cell size. In case for your data set, PC1 has negative loadings, you can just the direction for a more intuitive analysis:

pca_cal33$x |> as.data.frame() |> mutate(PC1 = -PC1) #variable not saved as not necessary here

Add the PCA to the sce object under the name “PCA”:

name_reduced_dim = "PCA"
reducedDim(sce_Cal33, name_reduced_dim) <- pca_cal33$x

Then, SUCs= "PC1" and apply_asinh_SUCs = FALSE, as asinh transformation is not necessary on PCs (it was applied on SUCs before PCA). This applies to all models.

If we regress-out any PCs and want to check the correlation coefficient, it is important we specify now the name for the heatmap function to include it: ``name_reduced_dim = “PCA”```.


heatmap_compare_corr(sce_Cal33, name_assay_before = "counts", name_assay_after = "counts_simple_PC1", name_reduced_dim = "PCA")

Log fold-changes between irradiated and control condition are modified accordingly (positive means higher in irradiated).

FC_before <-  t(assay(sce_Cal33,"counts")) |>  
  as.tibble() |> 
  cbind(colData(sce_Cal33)) |> 
  mutate_at(vars(x,m), asinh) |> 
  pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |> 
  group_by(marker) |> 
  summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |> 
  mutate(data = "before RUCova") 

FC_after <- t(assay(sce_Cal33,"counts_simple_PC1")) |>  
  as.tibble() |> cbind(colData(sce_Cal33))  |> 
  mutate_at(vars(x,m), asinh) |> 
  pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |> 
  group_by(marker) |> 
  summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |> 
  ungroup() |> 
  mutate(data = "simple all, per sample")

rbind(FC_before,FC_after) |> 
  ggplot(aes(x = logFC, y = marker, fill = data)) + 
  geom_col(position = "dodge")

6.1.2 Adjusted R-squared

Let’s compare the adjusted $R²$ per marker (goodness of fit, see complete definition above).


r2_a <- metadata(sce_Cal33)$model_counts_simple_acrosssamples$adjr2 |> mutate(model = "simple all, across samples")
r2_b <- metadata(sce_Cal33)$model_counts_simple_persample$adjr2|> mutate(model = "simple all, per sample")
r2_c <- metadata(sce_Cal33)$model_counts_simple_PC1$adjr2 |> mutate(model = "simple PC1, across sample")

rbind(rbind(r2_a,r2_b),r2_c) |> 
  ggplot(aes(x = adj_r_squared, y = marker, fill = model)) + 
  geom_col(position = "dodge")

In general, across all markers, the adjusted $R^2$ is higher when using all SUCs as predictive variables (instead of just PC1) and when SUCs are center across samples (residuals are smaller).

rbind(rbind(r2_a,r2_b),r2_c) |> 
  ggplot(aes(y = adj_r_squared, x = model, color = model)) + 
  geom_boxplot()

The decision of which model parameters to use should be based on: (1) the extent of variance to remove (all SUCs or some PCs) and (2) whether to keep or not the FC between samples.

6.1.3 Standardised slopes

We can also compare the slopes for each marker in each model.

slope_a <- metadata(sce_Cal33)$model_counts_simple_acrosssamples$stand_slopes |> mutate(model = "simple all, across samples") 
slope_b <- metadata(sce_Cal33)$model_counts_simple_persample$stand_slopes |> mutate(model = "simple all, per sample")
slope_c <- metadata(sce_Cal33)$model_counts_simple_PC1$stand_slopes |> mutate(model = "simple PC1, across sample")

rbind(slope_a, slope_b, slope_c) |> 
  mutate(coef_key = factor(coef_key, levels = c("mean_DNA", "mean_BC","pan_Akt", "total_ERK", "PC1"))) |> 
  ggplot(aes(x = stand_value, y = marker, fill = model)) + 
  geom_col(position = "dodge") + 
  facet_wrap(~coef_key, nrow = 1)

6.2 M2: Offset model

Here we just show one example: Let’s say different samples (dose) within Cal33 are different batches, and we would like to adjust for it in addition to removing unwanted correlations with all SUCs. Then keep_offset = FALSE.


sce_Cal33 <- RUCova::rucova(sce = sce_Cal33, 
                            name_assay_before = "counts",
                            markers = m,
                            SUCs = x, 
                            apply_asinh_SUCs = TRUE, 
                            keep_offset = FALSE,
                            model = "offset",
                            center_SUCs = "per_sample",
                            col_name_sample = "dose",     
                            name_assay_after = "counts_offset_persample")

Fold-change between conditions (“dose”) becomes zero:


FC_before <- t(assay(sce_Cal33,"counts")) |> 
  as.tibble() |>
  cbind(colData(sce_Cal33))  |> 
  mutate_at(vars(x,m), asinh) |> 
  pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |> 
  group_by(marker) |> 
  summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |> 
  mutate(data = "before RUCova") |> 
  ungroup() 

FC_after <- t(assay(sce_Cal33,"counts_offset_persample")) |> 
  as.tibble() |>
  cbind(colData(sce_Cal33))  |> 
  mutate_at(vars(x,m), asinh) |> 
  pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |> 
  group_by(marker) |> 
  summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |> 
  ungroup() |> 
  mutate(data = "offset all, per sample") 


rbind(FC_before, FC_after) |> 
  ggplot(aes(x = logFC, y = marker, fill = data)) + 
  geom_col(position = "dodge")

Different intercepts between samples:

metadata(sce_Cal33)$model_counts_offset_persample$eff_coefficients |> 
  filter(surrogate == FALSE) |> 
  ggplot(aes(x = eff_value, y = marker, fill = sample)) + 
  geom_col(position = "dodge") +
  xlab("eff_value (intercept)")

6.3 M3: Interaction model

Here we just show one example, the least conservative: we eliminate differences between cell lines that correlate with the SUCs (center_SUCs = "across_samples") and apply one fit per cell line (model = "interaction"). Then:

sce <- RUCova::rucova(sce = sce, 
                      name_assay_before = "counts",
                      markers = m,
                      SUCs = x, 
                      apply_asinh_SUCs = TRUE, 
                      model = "interaction",
                      center_SUCs = "across_samples",
                      col_name_sample = "line",     
                      name_assay_after = "counts_interaction_persample")

Different slopes for different cell lines:

metadata(sce)$model_counts_interaction_persample$eff_coefficients |> 
  filter(surrogate != FALSE) |> 
  ggplot(aes(x = eff_value, y = marker, fill = sample)) + 
  geom_col(position = "dodge") +
  facet_wrap(~surrogate) + 
  xlab("eff_value (slope)")

6.4 Session information

sessionInfo()
#> R version 4.5.0 RC (2025-04-04 r88126)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SingleCellExperiment_1.30.0 SummarizedExperiment_1.38.0
#>  [3] Biobase_2.68.0              GenomicRanges_1.60.0       
#>  [5] GenomeInfoDb_1.44.0         IRanges_2.42.0             
#>  [7] S4Vectors_0.46.0            BiocGenerics_0.54.0        
#>  [9] generics_0.1.3              MatrixGenerics_1.20.0      
#> [11] matrixStats_1.5.0           dplyr_1.1.4                
#> [13] tibble_3.2.1                tidyr_1.3.1                
#> [15] ggplot2_3.5.2               RUCova_1.0.0               
#> [17] BiocStyle_2.36.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6            circlize_0.4.16         shape_1.4.6.1          
#>  [4] rjson_0.2.23            xfun_0.52               bslib_0.9.0            
#>  [7] GlobalOptions_0.1.2     lattice_0.22-7          Cairo_1.6-2            
#> [10] vctrs_0.6.5             tools_4.5.0             parallel_4.5.0         
#> [13] cluster_2.1.8.1         pkgconfig_2.0.3         Matrix_1.7-3           
#> [16] data.table_1.17.0       tidyverse_2.0.0         RColorBrewer_1.1-3     
#> [19] lifecycle_1.0.4         GenomeInfoDbData_1.2.14 farver_2.1.2           
#> [22] compiler_4.5.0          stringr_1.5.1           tinytex_0.57           
#> [25] munsell_0.5.1           codetools_0.2-20        ComplexHeatmap_2.24.0  
#> [28] clue_0.3-66             htmltools_0.5.8.1       sass_0.4.10            
#> [31] yaml_2.3.10             pillar_1.10.2           crayon_1.5.3           
#> [34] jquerylib_0.1.4         cachem_1.1.0            DelayedArray_0.34.0    
#> [37] magick_2.8.6            iterators_1.0.14        abind_1.4-8            
#> [40] foreach_1.5.2           tidyselect_1.2.1        digest_0.6.37          
#> [43] stringi_1.8.7           purrr_1.0.4             bookdown_0.43          
#> [46] labeling_0.4.3          fastmap_1.2.0           grid_4.5.0             
#> [49] colorspace_2.1-1        cli_3.6.4               SparseArray_1.8.0      
#> [52] magrittr_2.0.3          fastDummies_1.7.5       S4Arrays_1.8.0         
#> [55] withr_3.0.2             scales_1.3.0            UCSC.utils_1.4.0       
#> [58] rmarkdown_2.29          XVector_0.48.0          httr_1.4.7             
#> [61] png_0.1-8               GetoptLong_1.0.5        evaluate_1.0.3         
#> [64] knitr_1.50              doParallel_1.0.17       rlang_1.1.6            
#> [67] Rcpp_1.0.14             glue_1.8.0              BiocManager_1.30.25    
#> [70] jsonlite_2.0.0          R6_2.6.1

Removing Unwanted Covariance in mass cytometry data with RUCova

15 April 2025

Contents