Introduction

Comparing samples defined over a single dimension is a straightforward task that relies on standard, well established methods. Meanwhile distance between samples in high dimensional space remains a largely unexplored field. Available solutions rely on multivariate normal distributions, a condition that is both difficult to check and overlooking key behaviors in biological samples, where populations of interest often correspond to a small proportion (<1%) of all the points that have been measured.

We have developed hilbertSimilarity to address the problem of sample similarity in mass cytometry where samples are measured over up to 100 dimensions, and where each sample contains a slightly different number of points (or cells). Our method first transforms each sample into a probability vector, by dividing each dimension into a fixed number of bins and by associating each cell to a specific multidimensional cube. The proportion of cells in each hypercube is characteristic of a given sample. To characterize an compare samples we use measures from Information Theory, since their interpretation in terms of similarity and complexity is straightforward.

After determining sample similarity, our method can also be used to determine which cells are different between samples, by comparing the number of cells in each hypercube to a reference treatment. Since every sample contains a different number of cells, we use a bootstrap approach where the same number of cells is sampled from each treatment to allow for a direct comparison.

To demonstrate the power of hilbertSimilarity we applied the method to a subset of the bodenmiller et al. dataset, comparing the effect of different stimulations and identifying groups of cells that are significantly affected by different treatments.

Compared to other methods, hilbertSimilarity does not rely on expert-driven gating, or require any hypothesis about the number of populations that are present in the data. This makes it a powerful tool to quickly assess a dataset before using traditional methods, or when populations a not known a priori.

Installation

hilbertSimilarity can be installed using the following command

devtools::install_github(yannabraham/hilbertSimilarity)

Once installed the package can be loaded using the standard library command.

library(hilbertSimilarity)

Loading some test data

We first need data from the bodenmiller package:

library(bodenmiller)
data(refPhenoMat)
data(untreatedPhenoMat)
data(refFuncMat)
data(untreatedFuncMat)
data(refAnnots)
refAnnots$Treatment <- 'reference'
data(untreatedAnnots)
fullAnnots <- rbind(refAnnots[,names(untreatedAnnots)],
                    untreatedAnnots)
fullAnnots$Treatment <- factor(fullAnnots$Treatment)
fullAnnots$Treatment <- relevel(fullAnnots$Treatment,'reference')
refMat <- cbind(refPhenoMat,refFuncMat)
untreatedMat <- cbind(untreatedPhenoMat,
                      untreatedFuncMat)
fullMat <- rbind(refMat,untreatedMat)
fullMat <- apply(fullMat,2,function(x) {
    x[x<0] <- 0
    return(x)
  }
)

In this dataset 51936 cells corresponding to 4 have been measured over 23 channels. Cells have been gated into 14 populations. The following treatments are compared to one another:

	reference	BCR/FcR-XL	PMA/Ionomycin	vanadate
cd14-hladr-	22	18	25	27
cd14-hladrhigh	60	19	4	1
cd14-hladrmid	244	149	54	18
cd14-surf-	1807	1845	1520	1824
cd14+hladr-	13	17	15	91
cd14+hladrhigh	21	3	5	1
cd14+hladrmid	1373	839	298	155
cd14+surf-	81	77	214	166
cd4+	3989	4076	4546	503
cd8+	5068	5098	5122	1200
dendritic	58	30	28	23
igm-	232	330	181	90
igm+	1069	863	992	395
nk	1755	1888	1572	1822

The smallest cell population, cd14+hladrhigh, corresponds to 0.058% of the total cells.

Computing the similarity between treatments

We will first compute the similarity between treatments, by generating a 3^rd-order Hilbert curve using combined cuts:

computing the cuts

# compute the 
nbins <- 3
cuts <- make.cut(fullMat,
                 n=nbins+1,
                 count.lim=40)

## CD20 
## IgM 
## CD4 
## CD33 
## HLA.DR 
## CD14 
## CD7 
## CD3 
## CD123 
## pStat1 
## pSlp76 
## pBtk 
## pPlcg2 
## pErk 
## pLat 
## pS6 
## pNFkB 
## pp38 
## pStat5 
## pAkt 
## pSHP2 
## pZap70 
## pStat3

cutting the matrix

cutFullMat <- do.cut(fullMat,cuts,type='combined')

ordering the dimensions

library(entropy)
miFullMat <- matrix(NA,nrow=ncol(fullMat),ncol = ncol(fullMat) )
for (i in seq(ncol(fullMat)-1)) {
  for (j in seq(i+1,ncol(fullMat))) {
    cur.tbl <- table(cutFullMat[,i],cutFullMat[,j])
    nent <- 1-mi.empirical(cur.tbl)/entropy.empirical(cur.tbl)
    miFullMat[i,j] <- miFullMat[j,i] <- nent
  }
}
dimnames(miFullMat) <- list(colnames(fullMat),colnames(fullMat))
hcFullMat <- hclust(as.dist(miFullMat))

generating the Hilbert curve

hc <- do.hilbert(cutFullMat[,rev(hcFullMat$order)],2)

Visualizing the different samples

library(reshape2)
library(dplyr)
library(ggplot2)
treatment.table <- with(fullAnnots,
                        table(hc,Treatment))
treatment.pc <- apply(treatment.table,2,function(x) x/sum(x))
av <- andrewsProjection(t(treatment.pc),breaks=40)
treatment.proj <- t(treatment.pc) %*% t(av$freq)
melt(treatment.proj) %>% 
    rename(AndrewsIndex=Var2) %>% 
    mutate(AndrewsIndex=av$i[AndrewsIndex]) %>% 
    ggplot(aes(x=AndrewsIndex,y=value))+
    geom_line(aes(group=Treatment,color=Treatment))+
    theme_light(base_size=16)+
    theme(axis.title.x=element_blank(),
          axis.text.x=element_blank(),
          axis.ticks.x=element_blank(),
          axis.title.y=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank())

Identifying treatment effects

For the comparison to be meaningful, we need to limit the analysis to bins that contain a minimum number of cells. This number is arbitrary, and represents the smallest group of cells that should be called a population. We also need to define the number of bootstrap, the significant fold change, and the limit of significance:

# total cells per Hilbert index per treatment
treatment.table <- with(fullAnnots,
                        table(hc,Treatment))
# minimum number of cells
lim <- 40
# number of bootstraps
N <- 400
# significant fold change
flim <- 2
# limit of significance
p <- 0.95
# bins with enough cells
boot.ind <- rownames(treatment.table)[which(rowSums(treatment.table)>=lim)]
# number of cells in each bootstrap
ncells <- floor(min(colSums(treatment.table[boot.ind,]))/1000)*1000

For this exercise we use 40 as the minimum population size. 218 indices out of 7114 contain enough cells.

Because each treatment corresponds to a different number of cells, we will use a bootstrap approach to identify indices that are significantly different between conditions. we will bootstrap 2000 cells from the selected bins, then check which bin contains at least 2 times more cells in the reference compared to other treatments. This operation will be repeated 400 times, and any bin that is consistently different in 380 bins will be considered significant.

library(abind)
boot.mat <- lapply(seq(1,N),function(x) {
    if (x%/%100==x/100) cat(x,'\n')
    # bootstrap hilbert indices
    cur.boot <- sample(which(hc %in% boot.ind),
                       size=ncells,
                       replace=TRUE)
    return(table(factor(hc[cur.boot],levels=boot.ind),
                 droplevels(fullAnnots[cur.boot,'Treatment'])))
  }
)

## 100 
## 200 
## 300 
## 400

boot.mat <- abind(boot.mat,along=3)

boot.log <- log2(boot.mat)
boot.fold <- sweep(boot.log[,-1,],c(1,3),boot.log[,1,])
boot.sign <- sign(boot.fold)
boot.sign[abs(boot.fold)<log2(flim)] <- 0
boot.res <- apply(boot.sign,c(1,2),function(x) table(sign(x))[c('-1','0','1')])
boot.res <- aperm(boot.res,c(2,1,3))
boot.res[is.na(boot.res)] <- 0
dimnames(boot.res)[[2]] <- c('-1','0','1')

After the analysis, the following number of bins are found significant:

sig.res <- apply(boot.res[,c('-1','1'),]>(N*p),c(1,3),any)
sum.sig.res <- cbind(colSums(sig.res),
                     colSums(treatment.table[boot.ind[rowSums(sig.res)>0],colnames(sig.res)]))

p.cells.sig.res <- lapply(colnames(sig.res),
                          function(cur.cl) {
                              x <- treatment.table[boot.ind,cur.cl]
                              return(sum(x[sig.res[,cur.cl]])/sum(x))
                            }
                          )
names(p.cells.sig.res) <- colnames(sig.res)
sum.sig.res <- cbind(sum.sig.res,
                     round(100*unlist(p.cells.sig.res),0))
colnames(sum.sig.res) <- c('Number of indices',
                           'Number of Significant cells',
                           'Percent Significant Cells')
kable(sum.sig.res)

	Number of indices	Number of Significant cells	Percent Significant Cells
BCR/FcR-XL	3	5797	1
PMA/Ionomycin	45	4142	45
vanadate	48	1742	30

Focusing on BCR/FcR-XL, which are the 3 indices that were found significant?

kable(treatment.table[boot.ind[sig.res[,'BCR/FcR-XL']],])

	reference	BCR/FcR-XL	PMA/Ionomycin	vanadate
2098176	1	97	32	0
4194304	69	3	7	13
6291455	198	14	23	48

Compared to the unstimulated sample, cells are decreasing in 2 bins and increasing in a 3^rd one. Based on manual gating, what do these bins correspond to?

cur.treatment <- 'BCR/FcR-XL'
sig.cells <- with(fullAnnots,table(droplevels(Cells[hc %in% boot.ind[sig.res[,cur.treatment]]])))
total.sig.cells <- with(fullAnnots,table(droplevels(Cells[Cells %in% names(sig.cells) &
                                                       hc %in% boot.ind])))
treatment.sig.cells <- with(fullAnnots,
                            table(droplevels(Cells[hc %in% boot.ind[sig.res[,cur.treatment]] &
                                                     Treatment==cur.treatment])))
treatment.total.cells <- with(fullAnnots,
                              table(droplevels(Cells[Cells %in% names(sig.cells) &
                                                       hc %in% boot.ind &
                                                       Treatment==cur.treatment])))
sum.sig.cells <- cbind(sig.cells,
                       total.sig.cells,
                       round(100*sig.cells/total.sig.cells),
                       treatment.sig.cells,
                       treatment.total.cells,
                       round(100*treatment.sig.cells/treatment.total.cells))

colnames(sum.sig.cells) <- c('Number of Significant Cells',
                             'Total number of Cells',
                             'Percent of Total',
                             paste('Number of Signigicant Cells in',cur.treatment),
                             paste('Number of Cells in',cur.treatment),
                             paste('Percent of',cur.treatment))
kable(sum.sig.cells)

	Number of Significant Cells	Total number of Cells	Percent of Total	Number of Signigicant Cells in BCR/FcR-XL	Number of Cells in BCR/FcR-XL	Percent of BCR/FcR-XL
igm-	30	288	10	23	119	19
igm+	475	1373	35	91	320	28

All the cells that are found significant are B cells, and they represent a large fraction of all B cells in the BCR/FcR-XL sample, even when only taking the bootstrapped indices into account.

Identifying Treatment effects using hilbertSimilarity

Yann Abraham

2026-01-13

Introduction

Installation

Loading some test data

Computing the similarity between treatments

Identifying treatment effects