SuperCellCyto 0.99.2
This vignette describes the steps to generate supercells for cytometry data using the SuperCellCyto R package.
Briefly, supercells are “mini” clusters of cells that are similar in their marker expressions. The motivation behind supercells is that instead of analysing millions of individual cells, you can analyse thousands of supercells, making downstream analysis much faster while maintaining biological interpretability.
See other vignettes for how to:
You can install the stable version of SuperCellCyto from Bioconductor using:
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("SuperCellCyto")
For the latest development version, you can install it from GitHub using pak:
if (!requireNamespace("pak", quietly = TRUE))
  install.packages("pak")
pak::pak("phipsonlab/SuperCellCyto")
The function that creates supercells is called runSuperCellCyto, and it operates on a data.table object, an enhanced version of R’s native data.frame.
In addition to needing the data stored in a data.table object, it also requires a column that uniquely identifies each cell and a column that denotes the sample each cell comes from.
Note that runSuperCellCyto does not perform any data transformation or scaling.
If you are not sure how to import CSV or FCS files into a data.table object, and/or how to subsequently prepare the object for SuperCellCyto, please consult this vignette.
In that vignette, we also explain why we need the cell ID and sample columns.
For this vignette, we will simulate some toy data using the simCytoData function.
Specifically, we will simulate 15 markers and 3 samples, with each sample containing 10,000 cells.
Hence, in total, we will have a toy dataset containing 15 markers and 30,000 cells.
# Load the package that provides simCytoData and runSuperCellCyto
library(SuperCellCyto)

n_markers <- 15
n_samples <- 3
dat <- simCytoData(nmarkers = n_markers, ncells = rep(10000, n_samples))
head(dat)
#> Marker_1 Marker_2 Marker_3 Marker_4 Marker_5 Marker_6 Marker_7 Marker_8
#> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 11.917580 9.938748 12.49596 18.00871 13.59679 7.735635 10.60091 4.917327
#> 2: 9.710698 7.999078 12.99032 18.12173 11.77093 8.176242 10.45278 5.279087
#> 3: 11.670935 10.246662 12.22982 18.82432 15.05277 8.291313 11.13275 4.748578
#> 4: 10.112629 7.693873 10.47931 18.08441 12.13923 8.822299 9.18755 4.156962
#> 5: 12.316708 8.110632 13.11245 17.89170 12.89112 8.284525 8.85180 6.717492
#> 6: 11.510465 9.264141 12.51132 16.75203 13.10943 8.367071 10.22723 6.902138
#> Marker_9 Marker_10 Marker_11 Marker_12 Marker_13 Marker_14 Marker_15
#> <num> <num> <num> <num> <num> <num> <num>
#> 1: 6.764583 13.48395 11.96863 18.63712 14.98998 17.17756 10.37141
#> 2: 8.388537 14.05831 13.03141 17.89301 13.84939 16.97657 11.15416
#> 3: 8.554747 15.23281 12.76964 17.97769 13.90550 15.30430 12.34240
#> 4: 9.201780 14.49434 12.83594 17.22928 14.45801 15.75491 11.87348
#> 5: 9.560643 14.92334 12.68093 18.81947 15.88161 17.21752 11.47131
#> 6: 9.063776 15.48591 13.71127 16.73957 14.81726 17.23603 13.13819
#> Sample Cell_Id
#> <char> <char>
#> 1: Sample_1 Cell_1
#> 2: Sample_1 Cell_2
#> 3: Sample_1 Cell_3
#> 4: Sample_1 Cell_4
#> 5: Sample_1 Cell_5
#> 6: Sample_1 Cell_6
For our toy dataset, we will transform our data using the arcsinh transformation.
We will use the base R asinh function to do this:
# Specify which columns are the markers to transform
marker_cols <- paste0("Marker_", seq_len(n_markers))
# The co-factor for arc-sinh
cofactor <- 5
# Do the transformation
dat_asinh <- asinh(dat[, marker_cols, with = FALSE] / cofactor)
# Rename the new columns
marker_cols_asinh <- paste0(marker_cols, "_asinh")
names(dat_asinh) <- marker_cols_asinh
# Add them to our previously loaded data
dat <- cbind(dat, dat_asinh)
head(dat[, marker_cols_asinh, with = FALSE])
#> Marker_1_asinh Marker_2_asinh Marker_3_asinh Marker_4_asinh Marker_5_asinh
#> <num> <num> <num> <num> <num>
#> 1: 1.603079 1.438143 1.646931 1.993302 1.725754
#> 2: 1.417456 1.248886 1.683048 1.999331 1.591673
#> 3: 1.583825 1.465482 1.626973 2.036047 1.821779
#> 4: 1.453664 1.216081 1.485702 1.997344 1.620096
#> 5: 1.633530 1.260652 1.691787 1.987022 1.675897
#> 6: 1.571113 1.375812 1.648072 1.923791 1.691571
#> Marker_6_asinh Marker_7_asinh Marker_8_asinh Marker_9_asinh Marker_10_asinh
#> <num> <num> <num> <num> <num>
#> 1: 1.220623 1.496126 0.8696334 1.110307 1.717936
#> 2: 1.267518 1.483415 0.9202970 1.289462 1.757144
#> 3: 1.279464 1.540592 0.8453665 1.306359 1.833069
#> 4: 1.333050 1.368513 0.7569943 1.369873 1.785971
#> 5: 1.278762 1.335955 1.1046965 1.403633 1.813587
#> 6: 1.287262 1.463777 1.1265529 1.356618 1.848738
#> Marker_11_asinh Marker_12_asinh Marker_13_asinh Marker_14_asinh
#> <num> <num> <num> <num>
#> 1: 1.607022 2.026391 1.817813 1.947852
#> 2: 1.685996 1.987093 1.743049 1.936556
#> 3: 1.667075 1.991641 1.746853 1.837518
#> 4: 1.671899 1.950739 1.783599 1.865140
#> 5: 1.660587 2.035799 1.872777 1.950083
#> 6: 1.733627 1.923078 1.806825 1.951115
#> Marker_15_asinh
#> <num>
#> 1: 1.476370
#> 2: 1.542345
#> 3: 1.635461
#> 4: 1.599662
#> 5: 1.567988
#> 6: 1.693619
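To make explicit what the transformation computes, here is a minimal sketch on a handful of made-up intensities (the values in x are illustrative and not taken from the toy data). It checks that asinh(x / cofactor) matches the closed form log(z + sqrt(z^2 + 1)) with z = x / cofactor, and that sinh undoes the transformation.

```r
# Illustrative raw intensities (made up for this sketch)
x <- c(0, 1, 5, 50, 500)
cofactor <- 5

# arcsinh transformation, as applied to the marker columns above
x_asinh <- asinh(x / cofactor)

# Equivalent closed form: asinh(z) = log(z + sqrt(z^2 + 1))
z <- x / cofactor
x_manual <- log(z + sqrt(z^2 + 1))
all.equal(x_asinh, x_manual)

# sinh() inverts the transformation and recovers the raw values
x_back <- sinh(x_asinh) * cofactor
all.equal(x_back, x)
```

A smaller co-factor stretches the low-intensity range more strongly, so the choice of co-factor depends on the cytometry technology and should be treated as a tunable setting rather than a fixed constant.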
We will also create a column Cell_id_dummy which uniquely identifies each cell.
It will have values such as Cell_1, Cell_2, all the way until Cell_x where x is the number of cells in the dataset.
dat$Cell_id_dummy <- paste0("Cell_", seq_len(nrow(dat)))
head(dat$Cell_id_dummy, n = 10)
#> [1] "Cell_1" "Cell_2" "Cell_3" "Cell_4" "Cell_5" "Cell_6" "Cell_7"
#> [8] "Cell_8" "Cell_9" "Cell_10"
By default, the simCytoData function will generate cells for multiple samples, and the resulting data.table object will already have a column called Sample that denotes the sample the cells come from.
unique(dat$Sample)
#> [1] "Sample_1" "Sample_2" "Sample_3"
Let’s take note of the sample and cell ID columns for later.
sample_col <- "Sample"
cell_id_col <- "Cell_id_dummy"
Now that we have our data, let’s create some supercells.
To do this, we will use the runSuperCellCyto function and pass the markers, sample, and cell ID columns as parameters.
We need to specify the markers because the function will create supercells based only on the expression of those markers.
We highly recommend creating supercells using all markers in your data, be they cell type or cell state markers.
However, if for any reason you want to use only a subset of the markers in your data, make sure you specify them in a vector that you later pass to the runSuperCellCyto function.
For this tutorial, we will use all the arcsinh transformed markers in the toy data.
supercells <- runSuperCellCyto(
  dt = dat,
  markers = marker_cols_asinh,
  sample_colname = sample_col,
  cell_id_colname = cell_id_col
)
Let’s dig deeper into the object it created:
class(supercells)
#> [1] "list"
It is a list containing 3 elements:
names(supercells)
#> [1] "supercell_expression_matrix" "supercell_cell_map"
#> [3] "supercell_object"
The supercell_object contains the metadata used to create the supercells.
It is a list, and each element contains the metadata used to create the supercells for a sample.
This will come in handy if we need to either regenerate the supercells using different gamma values (so we get more or fewer supercells) or do some debugging later down the line.
See the Controlling supercells granularity section below for more on regenerating supercells.
The supercell_expression_matrix contains the marker expression of each supercell.
These are calculated by taking the average of the marker expression of all the cells contained within a supercell.
head(supercells$supercell_expression_matrix)
#> Marker_1_asinh Marker_2_asinh Marker_3_asinh Marker_4_asinh Marker_5_asinh
#> <num> <num> <num> <num> <num>
#> 1: 1.545198 1.425163 1.610168 2.027007 1.756782
#> 2: 1.574298 1.305234 1.653355 2.009477 1.770085
#> 3: 1.549332 1.473761 1.557750 2.036047 1.759024
#> 4: 1.613277 1.278351 1.575097 2.035458 1.724837
#> 5: 1.501669 1.244653 1.571350 2.010944 1.756883
#> 6: 1.607818 1.331621 1.622374 2.034332 1.765747
#> Marker_6_asinh Marker_7_asinh Marker_8_asinh Marker_9_asinh Marker_10_asinh
#> <num> <num> <num> <num> <num>
#> 1: 1.167019 1.438076 0.8227829 1.304818 1.788458
#> 2: 1.118226 1.216379 0.9405080 1.256179 1.804837
#> 3: 1.329879 1.389120 1.0335337 1.419759 1.824809
#> 4: 1.051915 1.443926 1.0480630 1.243548 1.807572
#> 5: 1.184294 1.270616 0.8711111 1.331822 1.795831
#> 6: 1.215730 1.478182 0.9100977 1.291433 1.824620
#> Marker_11_asinh Marker_12_asinh Marker_13_asinh Marker_14_asinh
#> <num> <num> <num> <num>
#> 1: 1.736881 1.939551 1.820370 1.944246
#> 2: 1.631336 1.937891 1.838019 1.951925
#> 3: 1.655072 1.923090 1.826439 1.931838
#> 4: 1.717496 1.927918 1.805303 1.941336
#> 5: 1.713807 1.919457 1.827244 1.931491
#> 6: 1.688726 1.950513 1.850783 1.954683
#> Marker_15_asinh Sample SuperCellId
#> <num> <char> <char>
#> 1: 1.592103 Sample_1 SuperCell_1_Sample_Sample_1
#> 2: 1.538787 Sample_1 SuperCell_2_Sample_Sample_1
#> 3: 1.589022 Sample_1 SuperCell_3_Sample_Sample_1
#> 4: 1.619444 Sample_1 SuperCell_4_Sample_Sample_1
#> 5: 1.642410 Sample_1 SuperCell_5_Sample_Sample_1
#> 6: 1.638045 Sample_1 SuperCell_6_Sample_Sample_1
Therein, we will have the following columns:
- The markers we specified when creating the supercells (stored in the marker_cols_asinh variable). In this example, they are the arcsinh transformed markers in our toy data.
- A column (Sample in this case) denoting which sample a supercell belongs to (note the column name is the same as what is stored in the sample_col variable).
- The SuperCellId column denoting the unique ID of the supercell.
Let’s have a look at SuperCellId:
head(unique(supercells$supercell_expression_matrix$SuperCellId))
#> [1] "SuperCell_1_Sample_Sample_1" "SuperCell_2_Sample_Sample_1"
#> [3] "SuperCell_3_Sample_Sample_1" "SuperCell_4_Sample_Sample_1"
#> [5] "SuperCell_5_Sample_Sample_1" "SuperCell_6_Sample_Sample_1"
Let’s break down one of them, SuperCell_1_Sample_Sample_1.
SuperCell_1 is a numbering (1 to however many supercells there are in a sample) used to uniquely identify each supercell in a sample.
Notably, you may encounter this (SuperCell_1, SuperCell_2) being repeated across different samples, e.g.,
supercell_ids <- unique(supercells$supercell_expression_matrix$SuperCellId)
supercell_ids[grep("SuperCell_1_", supercell_ids)]
#> [1] "SuperCell_1_Sample_Sample_1" "SuperCell_1_Sample_Sample_2"
#> [3] "SuperCell_1_Sample_Sample_3"
Although the IDs of these 3 supercells are all prefixed with SuperCell_1, this does not make them equal to one another!
SuperCell_1_Sample_Sample_1 will only contain cells from Sample_1 while SuperCell_1_Sample_Sample_2 will only contain cells from Sample_2.
By now, you may have noticed that we appended the sample name to each supercell ID. This aids in differentiating the supercells in different samples.
supercell_cell_map maps each cell in our dataset to the supercell it belongs to.
head(supercells$supercell_cell_map)
#> SuperCellID CellId Sample
#> <char> <char> <char>
#> 1: SuperCell_426_Sample_Sample_1 Cell_1 Sample_1
#> 2: SuperCell_190_Sample_Sample_1 Cell_2 Sample_1
#> 3: SuperCell_84_Sample_Sample_1 Cell_3 Sample_1
#> 4: SuperCell_159_Sample_Sample_1 Cell_4 Sample_1
#> 5: SuperCell_41_Sample_Sample_1 Cell_5 Sample_1
#> 6: SuperCell_244_Sample_Sample_1 Cell_6 Sample_1
This map is very useful if we later need to expand the supercells out. Additionally, this is also the reason why we need to have a column in the dataset which uniquely identifies each cell.
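As a rough illustration of how such a map can be used, the sketch below builds a tiny hand-made table of cells and a matching map (both hypothetical, mirroring only the column names shown above), recomputes the supercell averages from the member cells, and then expands a supercell-level annotation back to the individual cells. It uses base R data.frame objects so it runs without extra packages; the cluster labels are invented for the example.

```r
# Hypothetical cell-level data: 6 cells, 1 transformed marker
cells <- data.frame(
  CellId = paste0("Cell_", 1:6),
  Marker_1_asinh = c(1.0, 1.2, 1.1, 2.0, 2.2, 2.1)
)

# Hypothetical map with the same columns as supercell_cell_map
supercell_ids <- c("SuperCell_1_Sample_Sample_1", "SuperCell_2_Sample_Sample_1")
cell_map <- data.frame(
  SuperCellID = rep(supercell_ids, each = 3),
  CellId = paste0("Cell_", 1:6),
  Sample = "Sample_1"
)

# Attach the supercell membership to every cell
cells_mapped <- merge(cells, cell_map, by = "CellId")

# A supercell's expression is the average over its member cells
supercell_means <- aggregate(
  Marker_1_asinh ~ SuperCellID,
  data = cells_mapped, FUN = mean
)

# Expand a supercell-level annotation (invented labels) back to the cells
supercell_annot <- data.frame(
  SuperCellID = supercell_ids,
  cluster = c("A", "B")
)
cells_annotated <- merge(cells_mapped, supercell_annot, by = "SuperCellID")
table(cells_annotated$cluster)
```

With real output, the same join would be done between supercell_cell_map and whatever table holds the per-supercell results, using the supercell ID column as the key.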
runSuperCellCyto in parallel
By default, runSuperCellCyto will process each sample one after the other.
As each sample is processed independently of the others, we can process all of them in parallel.
To do this, we need to:
- Create a BiocParallelParam object from the BiocParallel package. This object can either be of type MulticoreParam or SnowParam. We highly recommend consulting their vignette for more information.
- Set the number of tasks for the BiocParallelParam object to the number of samples we have in the dataset.
- Set the load_balancing parameter of the runSuperCellCyto function to TRUE. This is to ensure even distribution of the supercell creation jobs. As each sample will be processed by a parallel job, we don’t want a job that processes a large sample to also be assigned other smaller samples if possible. If you want to know more about how this feature works, please refer to our manuscript.
# MulticoreParam is provided by the BiocParallel package
library(BiocParallel)
supercell_par <- runSuperCellCyto(
  dt = dat,
  markers = marker_cols_asinh,
  sample_colname = sample_col,
  cell_id_colname = cell_id_col,
  BPPARAM = MulticoreParam(tasks = n_samples),
  load_balancing = TRUE
)
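The choice between MulticoreParam and SnowParam usually depends on the operating system. The following is a minimal sketch, assuming that forked processes (MulticoreParam) are not available on Windows and that a socket-based SnowParam is an acceptable substitute there. The variable name bp is ours, and the number of workers is left at the BiocParallel default.

```r
library(BiocParallel)

n_samples <- 3

# Assumption: forking is unavailable on Windows, so use SnowParam there
if (.Platform$OS.type == "windows") {
  bp <- SnowParam(tasks = n_samples)
} else {
  bp <- MulticoreParam(tasks = n_samples)
}

# Check that the number of tasks matches the number of samples
bptasks(bp)
```

The resulting object can then be supplied as the BPPARAM argument of runSuperCellCyto in place of the inline MulticoreParam(tasks = n_samples) call.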
How to control the granularity of supercells is described in the runSuperCellCyto function’s documentation, but let’s briefly go through it here.
The runSuperCellCyto function is equipped with various parameters which can be customised to alter the composition of the supercells.
The one that is likely to be used the most is the gamma parameter, denoted as gam in the function.
By default, the value for gam is set to 20, which we found works well for most cases.
The gamma parameter controls how many supercells to generate, and indirectly, how many cells are captured within each supercell.
This parameter is defined by the following formula:
gamma = n_cells / n_supercells
where n_cells denotes the number of cells and n_supercells denotes the number of supercells.
In general, the larger the gamma parameter, the fewer supercells we will get. Say for instance we have 10,000 cells. If gamma is set to 10, we will end up with about 1,000 supercells, whereas if gamma is set to 50, we will end up with about 200 supercells.
You may have noticed, after reading the sections above, that runSuperCellCyto is run on each sample independently of the others, and that we can only set 1 value as the gamma parameter.
Indeed, for now, the same gamma value will be used across all samples, and depending on how many cells we have in each sample, we will end up with a different number of supercells for each sample.
For instance, say we have 10,000 cells for sample 1, and 100,000 cells for sample 2.
If gamma is set to 10, for sample 1, we will get 1,000 supercells (10,000/10) while for sample 2, we will get 10,000 supercells (100,000/10).
Do note: whatever gamma value you choose, you should not expect each supercell to contain exactly the same number of cells. This behaviour is intentional to ensure rare cell types are not intermixed with non-rare cell types in a supercell.
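The arithmetic behind these numbers can be sketched in a few lines of base R. The cell counts below are the hypothetical ones from the example above, and the use of round() is our assumption about how a non-integer result would be handled, so treat the output as the approximate number of supercells per sample.

```r
# Hypothetical number of cells per sample
n_cells <- c(Sample_1 = 10000, Sample_2 = 100000)

# gamma = n_cells / n_supercells, hence n_supercells is roughly n_cells / gamma
gam <- 10
n_supercells <- round(n_cells / gam)
n_supercells

# Effect of different gamma values on a single sample of 10,000 cells
gam_values <- c(10, 20, 50)
n_supercells_by_gam <- round(10000 / gam_values)
names(n_supercells_by_gam) <- paste0("gamma_", gam_values)
n_supercells_by_gam
```

A quick calculation like this before running runSuperCellCyto helps judge whether the chosen gamma will leave enough supercells in the smallest sample for downstream analysis.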
If you have run runSuperCellCyto once and have not discarded the SuperCell object it generated (no, seriously, please don’t!), you can use the object to quickly regenerate supercells using different gamma values.
As an example, using the SuperCell object we have generated for our toy dataset, we will regenerate the supercells using gamma of 10 and 50.
The function to do this is recomputeSupercells.
We will store the output in a list, one element per gamma value.
addt_gamma_vals <- c(10, 50)
supercells_addt_gamma <- lapply(addt_gamma_vals, function(gam) {
  recomputeSupercells(
    dt = dat,
    sc_objects = supercells$supercell_object,
    markers = marker_cols_asinh,
    sample_colname = sample_col,
    cell_id_colname = cell_id_col,
    gam = gam
  )
})
We should end up with a list containing 2 elements. The 1st element contains supercells generated using gamma = 10, and the 2nd contains supercells generated using gamma = 50.
supercells_addt_gamma[[1]]
#> $supercell_expression_matrix
#> Marker_1_asinh Marker_2_asinh Marker_3_asinh Marker_4_asinh
#> <num> <num> <num> <num>
#> 1: 1.365522 1.330306 1.603267 2.004976
#> 2: 1.573328 1.271869 1.563290 2.031285
#> 3: 1.585367 1.158819 1.544811 2.008930
#> 4: 1.493953 1.387302 1.659269 2.029209
#> 5: 1.511918 1.393743 1.625597 2.012778
#> ---
#> 2996: 1.814442 1.755414 1.309774 1.821050
#> 2997: 1.698609 1.689043 1.412053 1.812329
#> 2998: 1.565285 1.660066 1.434231 1.847811
#> 2999: 1.698557 1.717204 1.314340 1.800945
#> 3000: 1.814806 1.761465 1.324014 1.830987
#> Marker_5_asinh Marker_6_asinh Marker_7_asinh Marker_8_asinh
#> <num> <num> <num> <num>
#> 1: 1.743522 1.304722 1.344450 0.8498903
#> 2: 1.759176 1.290860 1.109373 0.9354421
#> 3: 1.770757 1.117507 1.488140 0.8247560
#> 4: 1.749797 1.232881 1.397749 0.7887713
#> 5: 1.739865 1.227833 1.544994 0.9869867
#> ---
#> 2996: 1.502856 1.222373 1.990972 1.0860809
#> 2997: 1.583628 1.482696 1.893302 1.1274391
#> 2998: 1.521408 1.394495 1.903209 1.1748749
#> 2999: 1.528352 1.108055 1.875250 0.8588515
#> 3000: 1.377356 1.245392 1.981251 1.1122633
#> Marker_9_asinh Marker_10_asinh Marker_11_asinh Marker_12_asinh
#> <num> <num> <num> <num>
#> 1: 1.289884 1.792930 1.670542 1.914128
#> 2: 1.202328 1.821607 1.678908 1.952280
#> 3: 1.336041 1.771297 1.731432 1.927394
#> 4: 1.095742 1.805152 1.646922 1.911037
#> 5: 1.459162 1.812026 1.640164 1.946678
#> ---
#> 2996: 2.055739 1.139228 1.918828 1.936411
#> 2997: 2.019481 1.165835 1.949322 1.988700
#> 2998: 2.009509 1.331811 1.904941 1.901110
#> 2999: 2.023246 1.266022 1.885414 1.870819
#> 3000: 2.029319 1.104736 1.960144 1.923194
#> Marker_13_asinh Marker_14_asinh Marker_15_asinh Sample
#> <num> <num> <num> <char>
#> 1: 1.809042 1.919517 1.600132 Sample_1
#> 2: 1.814841 1.950542 1.573013 Sample_1
#> 3: 1.794229 1.942131 1.606390 Sample_1
#> 4: 1.834365 1.944649 1.528265 Sample_1
#> 5: 1.829610 1.951921 1.620612 Sample_1
#> ---
#> 2996: 1.706271 1.085527 2.020115 Sample_3
#> 2997: 1.554352 1.007425 2.093563 Sample_3
#> 2998: 1.582714 1.047489 2.044579 Sample_3
#> 2999: 1.610309 1.078441 2.058036 Sample_3
#> 3000: 1.514850 1.103272 2.057525 Sample_3
#> SuperCellId
#> <char>
#> 1: SuperCell_1_Sample_Sample_1
#> 2: SuperCell_2_Sample_Sample_1
#> 3: SuperCell_3_Sample_Sample_1
#> 4: SuperCell_4_Sample_Sample_1
#> 5: SuperCell_5_Sample_Sample_1
#> ---
#> 2996: SuperCell_996_Sample_Sample_3
#> 2997: SuperCell_997_Sample_Sample_3
#> 2998: SuperCell_998_Sample_Sample_3
#> 2999: SuperCell_999_Sample_Sample_3
#> 3000: SuperCell_1000_Sample_Sample_3
#>
#> $supercell_cell_map
#> SuperCellID CellId Sample
#> <char> <char> <char>
#> 1: SuperCell_711_Sample_Sample_1 Cell_1 Sample_1
#> 2: SuperCell_823_Sample_Sample_1 Cell_2 Sample_1
#> 3: SuperCell_97_Sample_Sample_1 Cell_3 Sample_1
#> 4: SuperCell_160_Sample_Sample_1 Cell_4 Sample_1
#> 5: SuperCell_75_Sample_Sample_1 Cell_5 Sample_1
#> ---
#> 29996: SuperCell_48_Sample_Sample_3 Cell_29996 Sample_3
#> 29997: SuperCell_299_Sample_Sample_3 Cell_29997 Sample_3
#> 29998: SuperCell_49_Sample_Sample_3 Cell_29998 Sample_3
#> 29999: SuperCell_668_Sample_Sample_3 Cell_29999 Sample_3
#> 30000: SuperCell_802_Sample_Sample_3 Cell_30000 Sample_3
The output generated by recomputeSupercells is essentially a list:
- supercell_expression_matrix: A data.table object that contains the marker expression for each supercell.
- supercell_cell_map: A data.table that maps each cell to its corresponding supercell.
As mentioned before, gamma dictates the granularity of supercells. Compared to the previous run where gamma was set to 20, we should get more supercells for gamma = 10, and fewer for gamma = 50. Let’s see if that’s the case.
n_supercells_gamma20 <- nrow(supercells$supercell_expression_matrix)
n_supercells_gamma10 <- nrow(
  supercells_addt_gamma[[1]]$supercell_expression_matrix
)
n_supercells_gamma50 <- nrow(
  supercells_addt_gamma[[2]]$supercell_expression_matrix
)
n_supercells_gamma10 > n_supercells_gamma20
#> [1] TRUE
n_supercells_gamma50 < n_supercells_gamma20
#> [1] TRUE
In the future, we may add the ability to specify a different gam value for each sample.
For now, if we want to do this, we will need to break down our data into multiple data.table objects, each containing data from 1 sample, and run the runSuperCellCyto function on each of them with a different gam parameter value.
Something like the following:
n_markers <- 10
dat <- simCytoData(nmarkers = n_markers)
markers_col <- paste0("Marker_", seq_len(n_markers))
sample_col <- "Sample"
cell_id_col <- "Cell_Id"
samples <- unique(dat[[sample_col]])
# One gamma value per sample, in the order given by samples;
# any values beyond the number of samples are not used
gam_values <- c(10, 20, 10)
supercells_diff_gam <- lapply(seq_along(samples), function(i) {
  sample <- samples[i]
  gam <- gam_values[i]
  # Keep only the cells from the current sample
  dat_samp <- dat[dat[[sample_col]] == sample, ]
  supercell_samp <- runSuperCellCyto(
    dt = dat_samp,
    markers = markers_col,
    sample_colname = sample_col,
    cell_id_colname = cell_id_col,
    gam = gam
  )
  return(supercell_samp)
})
Subsequently, to extract and combine the supercell_expression_matrix and supercell_cell_map, we will need to use rbind:
supercell_expression_matrix <- do.call(
  "rbind", lapply(
    supercells_diff_gam, function(x) x[["supercell_expression_matrix"]]
  )
)
supercell_cell_map <- do.call(
  "rbind", lapply(
    supercells_diff_gam, function(x) x[["supercell_cell_map"]]
  )
)
rbind(
  head(supercell_expression_matrix, n = 3),
  tail(supercell_expression_matrix, n = 3)
)
#> Marker_1 Marker_2 Marker_3 Marker_4 Marker_5 Marker_6 Marker_7 Marker_8
#> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 16.14744 8.604378 7.762897 6.090112 5.889048 19.65768 6.530894 19.67721
#> 2: 15.75461 8.515063 8.214616 4.883079 4.797482 18.81266 7.000315 18.98093
#> 3: 17.71163 10.169033 7.735079 3.054726 5.701306 17.43081 6.980343 18.34254
#> 4: 14.19250 8.043898 16.204262 8.197661 8.670738 18.38199 10.727111 9.76753
#> 5: 14.55217 8.923347 15.347301 8.791877 7.261080 18.41835 11.662246 10.28580
#> 6: 10.12469 8.877643 15.112540 7.025214 9.877729 19.31746 11.721420 12.52236
#> Marker_9 Marker_10 Sample SuperCellId
#> <num> <num> <char> <char>
#> 1: 10.21883 11.986301 Sample_1 SuperCell_1_Sample_Sample_1
#> 2: 11.55477 10.244540 Sample_1 SuperCell_2_Sample_Sample_1
#> 3: 11.09386 11.401419 Sample_1 SuperCell_3_Sample_Sample_1
#> 4: 10.95150 6.435362 Sample_2 SuperCell_498_Sample_Sample_2
#> 5: 12.01795 7.878121 Sample_2 SuperCell_499_Sample_Sample_2
#> 6: 10.89465 7.049960 Sample_2 SuperCell_500_Sample_Sample_2
rbind(head(supercell_cell_map, n = 3), tail(supercell_cell_map, n = 3))
#> SuperCellID CellId Sample
#> <char> <char> <char>
#> 1: SuperCell_44_Sample_Sample_1 Cell_1 Sample_1
#> 2: SuperCell_156_Sample_Sample_1 Cell_2 Sample_1
#> 3: SuperCell_988_Sample_Sample_1 Cell_3 Sample_1
#> 4: SuperCell_156_Sample_Sample_2 Cell_19998 Sample_2
#> 5: SuperCell_161_Sample_Sample_2 Cell_19999 Sample_2
#> 6: SuperCell_364_Sample_Sample_2 Cell_20000 Sample_2
If for whatever reason you don’t mind (or perhaps more to the point, want) each supercell to contain cells from different biological samples, you still need to have the sample column in your data.table.
However, what you need to do is set every value in that column to the same single value.
That way, SuperCellCyto will treat all cells as coming from one sample.
Just note, the parallel processing feature in SuperCellCyto won’t work for this, as you will essentially have only 1 sample and nothing for SuperCellCyto to parallelise.
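A minimal sketch of that preparation step is shown below on a small hand-made table. The column Sample_original and the value "all_samples" are names we made up for illustration; a base R data.frame is used so the snippet runs without extra packages, and the same assignments also work on a data.table.

```r
# Hypothetical toy table with two biological samples
dat_toy <- data.frame(
  Marker_1 = c(1.2, 3.4, 2.2, 5.1),
  Sample = c("Sample_1", "Sample_1", "Sample_2", "Sample_2"),
  Cell_Id = paste0("Cell_", 1:4)
)

# Keep the true sample labels in a separate column for later use
dat_toy$Sample_original <- dat_toy$Sample

# Overwrite the sample column with one single value
dat_toy$Sample <- "all_samples"

unique(dat_toy$Sample)
```

Keeping the original labels in a separate column means they can still be joined back to individual cells after the supercells have been created.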
Is your dataset so huge that you are constantly running out of RAM when generating supercells? This happens, and we have a solution for it.
Since supercells are generated for each sample independently of the others, you can easily break up the process. For example:
- Load a subset of the samples.
- Generate supercells for those samples.
- Save the output using the qs package.
- Extract the supercell_expression_matrix and supercell_cell_map, and export them as CSV files using data.table’s fwrite function.
- Load the next subset of samples and repeat the steps above.
Once you have processed all the samples, you can then load all the supercell_expression_matrix and supercell_cell_map CSV files and analyse them.
If you want to regenerate the supercells using different gamma values, load the relevant output saved using the qs package and the relevant data (remember to note which output belongs to which set of samples!), and run the recomputeSupercells function.
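The overall shape of such a batched workflow can be sketched with base R alone. In the snippet below, process_batch_stub is a hypothetical stand-in for the real work (loading a batch of samples and calling runSuperCellCyto on it), and write.csv stands in for data.table’s fwrite; the sample names, batch size, and output directory are likewise made up for illustration.

```r
# Hypothetical sample names, processed in batches of 2
samples <- paste0("Sample_", 1:5)
batches <- split(samples, ceiling(seq_along(samples) / 2))

# Stand-in (hypothetical) for loading a batch and creating supercells;
# in practice this is where runSuperCellCyto would be called
process_batch_stub <- function(batch_samples) {
  data.frame(Sample = batch_samples, n_supercells = 500L)
}

out_dir <- file.path(tempdir(), "supercell_batches")
dir.create(out_dir, showWarnings = FALSE)

# Process one batch at a time and write its result to disk
for (i in seq_along(batches)) {
  res <- process_batch_stub(batches[[i]])
  out_file <- file.path(out_dir, paste0("batch_", i, ".csv"))
  write.csv(res, out_file, row.names = FALSE)
}

# Once all batches are done, load and combine the CSV files
csv_files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(csv_files, read.csv))
combined
```

Because only one batch is held in memory at a time, peak RAM usage is driven by the largest batch rather than the whole dataset.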
sessionInfo()
#> R version 4.5.1 Patched (2025-08-23 r88802)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BiocParallel_1.43.4 SuperCellCyto_0.99.2 BiocStyle_2.37.1
#>
#> loaded via a namespace (and not attached):
#> [1] cli_3.6.5 knitr_1.50 rlang_1.1.6
#> [4] xfun_0.53 jsonlite_2.0.0 data.table_1.17.8
#> [7] plyr_1.8.9 htmltools_0.5.8.1 sass_0.4.10
#> [10] rmarkdown_2.30 grid_4.5.1 evaluate_1.0.5
#> [13] jquerylib_0.1.4 fastmap_1.2.0 yaml_2.3.10
#> [16] lifecycle_1.0.4 bookdown_0.45 BiocManager_1.30.26
#> [19] compiler_4.5.1 igraph_2.1.4 codetools_0.2-20
#> [22] Rcpp_1.1.0 pkgconfig_2.0.3 lattice_0.22-7
#> [25] digest_0.6.37 SuperCell_1.0.1 R6_2.6.1
#> [28] RANN_2.6.2 magrittr_2.0.4 bslib_0.9.0
#> [31] Matrix_1.7-4 tools_4.5.1 cachem_1.1.0