# HCA human bone marrow (10X Genomics)
## Introduction
Here, we use an example dataset from the [Human Cell Atlas immune cell profiling project on bone marrow](https://preview.data.humancellatlas.org), which contains scRNA-seq data for 380,000 cells generated using the 10X Genomics technology.
This is a fairly big dataset that represents a good use case for the techniques in [Advanced Chapter 14](http://bioconductor.org/books/3.18/OSCA.advanced/dealing-with-big-data.html#dealing-with-big-data).
## Data loading
This dataset is loaded via the *[HCAData](https://bioconductor.org/packages/3.18/HCAData)* package, which provides a ready-to-use `SingleCellExperiment` object.
```r
library(HCAData)
sce.bone <- HCAData('ica_bone_marrow', as.sparse=TRUE)
sce.bone$Donor <- sub("_.*", "", sce.bone$Barcode)
```
We use symbols in place of IDs for easier interpretation later.
```r
library(EnsDb.Hsapiens.v86)
rowData(sce.bone)$Chr <- mapIds(EnsDb.Hsapiens.v86, keys=rownames(sce.bone),
column="SEQNAME", keytype="GENEID")
library(scater)
rownames(sce.bone) <- uniquifyFeatureNames(rowData(sce.bone)$ID,
names = rowData(sce.bone)$Symbol)
```
## Quality control
Cell calling was not performed (see [here](https://s3.amazonaws.com/preview-ica-expression-data/Brief+ICA+Read+Me.pdf)) so we will perform QC using all metrics and block on the donor of origin during outlier detection.
We perform the calculation across multiple cores to speed things up.
```r
library(BiocParallel)
bpp <- MulticoreParam(8)
sce.bone <- unfiltered <- addPerCellQC(sce.bone, BPPARAM=bpp,
subsets=list(Mito=which(rowData(sce.bone)$Chr=="MT")))
qc <- quickPerCellQC(colData(sce.bone), batch=sce.bone$Donor,
sub.fields="subsets_Mito_percent")
sce.bone <- sce.bone[,!qc$discard]
```
```r
unfiltered$discard <- qc$discard
gridExtra::grid.arrange(
plotColData(unfiltered, x="Donor", y="sum", colour_by="discard") +
scale_y_log10() + ggtitle("Total count"),
plotColData(unfiltered, x="Donor", y="detected", colour_by="discard") +
scale_y_log10() + ggtitle("Detected features"),
plotColData(unfiltered, x="Donor", y="subsets_Mito_percent",
colour_by="discard") + ggtitle("Mito percent"),
ncol=2
)
```
(\#fig:unref-hca-bone-qc)Distribution of QC metrics in the HCA bone marrow dataset. Each point represents a cell and is colored according to whether it was discarded.
(\#fig:unref-hca-bone-mito)Percentage of mitochondrial reads in each cell in the HCA bone marrow dataset compared to its total count. Each point represents a cell and is colored according to whether that cell was discarded.
## Normalization
For a minor speed-up, we use already-computed library sizes rather than re-computing them from the column sums.
```r
sce.bone <- logNormCounts(sce.bone, size_factors = sce.bone$sum)
```
```r
summary(sizeFactors(sce.bone))
```
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05 0.47 0.65 1.00 0.89 42.38
```
## Variance modeling
We block on the donor of origin to mitigate batch effects during HVG selection.
We select a larger number of HVGs to capture any batch-specific variation that might be present.
```r
library(scran)
set.seed(1010010101)
dec.bone <- modelGeneVarByPoisson(sce.bone,
block=sce.bone$Donor, BPPARAM=bpp)
top.bone <- getTopHVGs(dec.bone, n=5000)
```
```r
par(mfrow=c(4,2))
blocked.stats <- dec.bone$per.block
for (i in colnames(blocked.stats)) {
current <- blocked.stats[[i]]
plot(current$mean, current$total, main=i, pch=16, cex=0.5,
xlab="Mean of log-expression", ylab="Variance of log-expression")
curfit <- metadata(current)
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
}
```
(\#fig:unref-hca-bone-var)Per-gene variance as a function of the mean for the log-expression values in the HCA bone marrow dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances.
## Data integration
Here we use multiple cores, randomized SVD and approximate nearest-neighbor detection to speed up this step.
```r
library(batchelor)
library(BiocNeighbors)
set.seed(1010001)
merged.bone <- fastMNN(sce.bone, batch = sce.bone$Donor, subset.row = top.bone,
BSPARAM=BiocSingular::RandomParam(deferred = TRUE),
BNPARAM=AnnoyParam(),
BPPARAM=bpp)
reducedDim(sce.bone, 'MNN') <- reducedDim(merged.bone, 'corrected')
```
We use the percentage of variance lost as a diagnostic measure:
```r
metadata(merged.bone)$merge.info$lost.var
```
```
## MantonBM1 MantonBM2 MantonBM3 MantonBM4 MantonBM5 MantonBM6 MantonBM7
## [1,] 0.006922 0.006392 0.000000 0.000000 0.000000 0.000000 0.000000
## [2,] 0.006380 0.006863 0.023049 0.000000 0.000000 0.000000 0.000000
## [3,] 0.005068 0.003084 0.005178 0.019496 0.000000 0.000000 0.000000
## [4,] 0.002009 0.001891 0.001901 0.001786 0.023105 0.000000 0.000000
## [5,] 0.002452 0.002003 0.001770 0.002926 0.002646 0.023852 0.000000
## [6,] 0.003167 0.003222 0.003169 0.002636 0.003362 0.003442 0.024650
## [7,] 0.001968 0.001701 0.002441 0.002045 0.001585 0.002312 0.002003
## MantonBM8
## [1,] 0.00000
## [2,] 0.00000
## [3,] 0.00000
## [4,] 0.00000
## [5,] 0.00000
## [6,] 0.00000
## [7,] 0.03216
```
## Dimensionality reduction
We set `external_neighbors=TRUE` to replace the internal nearest neighbor search in the UMAP implementation with our parallelized approximate search.
We also set the number of threads to be used in the UMAP iterations.
```r
set.seed(01010100)
sce.bone <- runUMAP(sce.bone, dimred="MNN",
external_neighbors=TRUE,
BNPARAM=AnnoyParam(),
BPPARAM=bpp,
n_threads=bpnworkers(bpp))
```
## Clustering
Graph-based clustering generates an excessively large intermediate graph so we will instead use a two-step approach with $k$-means.
We generate 1000 small clusters that are subsequently aggregated into more interpretable groups with a graph-based method.
If more resolution is required, we can increase `centers` in addition to using a lower `k` during graph construction.
```r
library(bluster)
set.seed(1000)
colLabels(sce.bone) <- clusterRows(reducedDim(sce.bone, "MNN"),
TwoStepParam(KmeansParam(centers=1000), NNGraphParam(k=5)))
table(colLabels(sce.bone))
```
```
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 20331 11161 55464 47426 15731 10581 64721 26493 18703 15043 17097 4992 3157
## 14 15
## 3403 2422
```
We observe mostly balanced contributions from different samples to each cluster (Figure \@ref(fig:unref-hca-bone-ab)), consistent with the expectation that all samples are replicates from different donors.
```r
tab <- table(Cluster=colLabels(sce.bone), Donor=sce.bone$Donor)
library(pheatmap)
pheatmap(log10(tab+10), color=viridis::viridis(100))
```
(\#fig:unref-hca-bone-ab)Heatmap of log~10~-number of cells in each cluster (row) from each sample (column).
(\#fig:unref-hca-bone-umap)UMAP plots of the HCA bone marrow dataset after merging. Each point represents a cell and is colored according to the assigned cluster (top) or the donor of origin (bottom).
## Differential expression
We identify marker genes for each cluster while blocking on the donor.
```r
markers.bone <- findMarkers(sce.bone, block = sce.bone$Donor,
direction = 'up', lfc = 1, BPPARAM=bpp)
```
We visualize the top markers for a randomly chosen cluster using a "dot plot" in Figure \@ref(fig:unref-hca-bone-dotplot).
The presence of upregulated genes like _LYZ_, _S100A8_ and _VCAN_ is consistent with a monocyte identity for this cluster.
```r
top.markers <- markers.bone[["4"]]
best <- top.markers[top.markers$Top <= 10,]
lfcs <- getMarkerEffects(best)
library(pheatmap)
pheatmap(lfcs, breaks=seq(-5, 5, length.out=101))
```
(\#fig:unref-hca-bone-dotplot)Heatmap of log~2~-fold changes for the top marker genes (rows) of cluster 4 compared to all other clusters (columns).
## Cell type classification
We perform automated cell type classification using a reference dataset to annotate each cluster based on its pseudo-bulk profile.
This is faster than the per-cell approaches described in Chapter \@ref(cell-type-annotation) at the cost of the resolution required to detect heterogeneity inside a cluster.
Nonetheless, it is often sufficient for a quick assignment of cluster identity, and indeed, cluster 4 is also identified as consisting of monocytes from this analysis.
```r
se.aggregated <- sumCountsAcrossCells(sce.bone, id=colLabels(sce.bone), BPPARAM=bpp)
library(celldex)
hpc <- HumanPrimaryCellAtlasData()
library(SingleR)
anno.single <- SingleR(se.aggregated, ref = hpc, labels = hpc$label.main,
assay.type.test="sum")
anno.single
```
```
## DataFrame with 15 rows and 4 columns
## scores labels delta.next
##
## 1 0.366050:0.741975:0.637800:... GMP 0.3366255
## 2 0.399229:0.715314:0.628516:... Pro-B_cell_CD34+ 0.0690916
## 3 0.325812:0.654869:0.570039:... T_cells 0.1292127
## 4 0.296455:0.742466:0.529404:... Monocyte 0.3099254
## 5 0.345773:0.565378:0.479722:... T_cells 0.4943226
## ... ... ... ...
## 11 0.326882:0.646229:0.561060:... T_cells 0.027835817
## 12 0.380710:0.684123:0.784540:... BM & Prog. 0.000499754
## 13 0.368546:0.652935:0.580330:... B_cell 0.201098545
## 14 0.294361:0.706019:0.527282:... Monocyte 0.355212614
## 15 0.339786:0.687074:0.569933:... GMP 0.131830602
## pruned.labels
##
## 1 GMP
## 2 Pro-B_cell_CD34+
## 3 T_cells
## 4 Monocyte
## 5 T_cells
## ... ...
## 11 T_cells
## 12 BM & Prog.
## 13 NA
## 14 Monocyte
## 15 GMP
```
## Session Info {-}