SMDIC User Guide

Baotong Zheng, Junwei Han

1 Introduce

We developed a novel software package (SMDIC) that enables automated identification of Somatic Mutation-Driven Immune Cell. The operation modes include inference of the relative abundance matrix of tumor-infiltrating immune cells, detection of differential abundance immune cells concerning the gene mutation status, conversion of the abundance matrix of significantly dysregulated cells into two binary matrices (one for up-regulated and one for down-regulated cells), identification of somatic mutation-driven immune cells by comparing the gene mutation status with each immune cell in the binary matrices across all samples, and visualization of immune cell abundance of samples in different mutation status for each gene. Here we present the flow diagram of SMDIC.

#Flow diagram of SMDIC.
knitr::include_graphics("../inst/workflow.jpg")

SMDIC has three main functions (flow diagram):
(A) inferring the relative abundance matrix of tumor-infiltrating immune cells
(B) detecting differential abundance immune cells concerning a particular gene mutation status and converting the abundance matrix of significantly dysregulated immune cells into two binary matrices (one for up-regulated and one for down-regulated cells)
(C) identifying somatic mutation-driven immune cells by comparing the gene mutation status with each immune cell in the binary matrices across all samples.

This vignette illustrates how to easily use the SMDIC package. With the use of functions in this package, users could identify the immune cells driven by somatic mutations in the tumor microenvironment.

2 Example: The inference of the relative abundance matrix of immune cells

We can use the function GetExampleData to return example data and environment variables.

The function exp2cell can use gene expression profiles to quantify cell abundance matrix. ‘exp2cell’ provides three methods for estimating the relative infiltration abundance of different cell types in the tumor microenvironment (TME), which including xCell, ssGSEA estimated method proposed by Şenbabaoğlu et al. and CIBERSORT.Through these methods, the relative abundance matrix of immune cells is inferred with cells as the row and samples as the column.

We selected the breast cancer gene expression profile data in the GDC TCGA database for exp2cell and obtained the results of the cell abundance matrix. The commands are as follows.

library(SMDIC)
#get breast cancer gene expression profile.
exp.example<-GetExampleData("exp.example")

# perform the exp2cell method. The method must be one of "xCell","ssGSEA" and "CIBERSORT".
cellmatrix.example<-exp2cell(exp.example,method="ssGSEA")
#> Warning in .filterFeatures_newAPI(dataMatrix, dropConstantRows = FALSE): 659
#> rows with constant values throughout the samples.
#> Warning in .gsva_newAPI(expr = dataMatrix, gset.idx.list = mappedGeneSets, :
#> Some gene sets have size one. Consider setting 'minSize > 1'.
#get the result of the exp2cell function
#view the first six rows and six columns of the cell abundance matrix.
head(cellmatrix.example)
#>                 TCGA.A7.A13G.01 TCGA.A7.A26E.01 TCGA.A1.A0SQ.01 TCGA.GI.A2C9.11
#> aDC                  -0.1721896      -0.2365829     -0.09769543     -0.11102476
#> B cells              -0.1785598      -0.2060056     -0.21060641     -0.15950503
#> CD8 T cells           0.2603276       0.2939651      0.32180621      0.34343045
#> Cytotoxic cells      -0.1710623      -0.1744474     -0.13125948      0.04400304
#> DC                   -0.2110966      -0.2248506     -0.16642146      0.16974204
#> Eosinophils           0.1456574       0.1297275      0.16402532      0.12760198
#>                 TCGA.E9.A1NE.01 TCGA.A2.A0SX.01 TCGA.EW.A1PD.01 TCGA.BH.A0HN.01
#> aDC                  0.30829474     0.255985640     -0.13083749     -0.06885779
#> B cells             -0.06773615    -0.007057109     -0.25506334     -0.25029785
#> CD8 T cells          0.42561320     0.337348145      0.30512749      0.28520165
#> Cytotoxic cells      0.29537554     0.125879324     -0.16228083     -0.15758420
#> DC                  -0.01435208     0.109405174     -0.08489845     -0.26360357
#> Eosinophils          0.13450722     0.068610081      0.11624230      0.09294411
#>                 TCGA.E2.A1IU.01 TCGA.BH.A0BO.01
#> aDC                -0.008594315     -0.08444126
#> B cells            -0.124540116     -0.19944249
#> CD8 T cells         0.317656631      0.32995068
#> Cytotoxic cells    -0.048766592     -0.09360501
#> DC                 -0.158910587      0.04338637
#> Eosinophils         0.125777697      0.16156528

3 Example: Use mutation annotation file (MAF) format data to build a binary mutations matrix

The somatic mutation data used in the package is the mutation annotation file (MAF) format. We extract the non-silent somatic mutations (nonsense mutation, missense mutation, frame-shift indels, splice site, nonstop mutation, translation start site, inframe indels) in protein-coding regions, and built a binary mutations matrix, in which 1 represents any mutation occurs in a particular gene in a particular sample, otherwise the element is 0. The genes with a given mutation frequency greater than a threshold value (one percent as the default value) are retained for the following analysis.

We selected the breast cancer somatic mutation data in the GDC TCGA database for maf2matrix and obtained the results of a binary mutation matrix. The commands are as follows.

# get the path of the mutation annotation file.
maf <- system.file("extdata","example.maf.gz",package = "SMDIC") 

# perform the maf2matrix method.
mutmatrix.example<-maf2matrix(maffile = maf,percent = 0.01) 

#get the result of the exp2cell function
#view the first six rows and six columns of the binary mutations matrix
head(mutmatrix.example)[1:6,1:6]
#>         TCGA.E9.A22G.01 TCGA.OL.A6VO.01 TCGA.AR.A1AP.01 TCGA.BH.A0AW.01
#> TP53                  1               1               1               1
#> TP53BP2               0               0               0               0
#> TP53BP1               0               0               0               0
#> PCDH10                0               0               0               0
#> CDH1                  0               0               0               0
#> CDH10                 0               0               0               0
#>         TCGA.E2.A1LG.01 TCGA.A2.A04U.01
#> TP53                  1               1
#> TP53BP2               0               0
#> TP53BP1               0               0
#> PCDH10                1               0
#> CDH1                  0               0
#> CDH10                 0               0

4 Example: Identification of somatic mutation-driven immune cells.

For a particular mutation gene, we compare the binary mutation vector with each binary cell abundance vector in the up-regulated or down-regulated immune cell-matrix respectively, and a 2×2 contingency table is constructed.Fisher’s exact test is applied to recover the cells that had drastic mutation-correlated up-regulated or down-regulated response. This process is repeated for each immune cell in the binary abundance matrices. To correct for multiple comparisons, we adjust the exact test p-values by using the false discovery rate (FDR) method proposed by Benjamini and Hochberg. The immune cells with the default FDR<0.05 are deemed as statistically significant mutation-correlated, and which may be driven by the somatic mutation. We repeat the above process for each mutated gene.

The “mutcorcell” function is implemented to identify somatic mutation-specific immune cell response by inputting the abundance matrix of immune cells and binary mutations matrix. This function firstly detects differential abundance of immune cells concerning a particular gene mutation status with the Significance Analysis of Microarrays (SAM) method. Function mutcorcell detects the differential immune cells concerning a particular gene mutation status and construction of two binary matrices based on the abundance matrix of significant differential immune cell (one for up-regulated and one for down-regulated), identification of somatic mutation-driven immune cells by comparing the gene mutation status with each immune cell in the binary matrices.

We selected the abundance matrix of immune cells and binary mutations matrix for mutcorcell and obtained the results of A list of four matrices.
* A binary numerical matrix which shows the immune cells driven by somatic mutant gene;
* Two numerical matrices which show the pvalue and fdr of the immune cells driven by somatic mutant gene;
* A character matrix which shows the cell responses of the immune cells driven by a somatic mutant gene.
The commands are as follows.

# get breast cancer cell abundance matrix, which can be the result of exp2cell function.
cellmatrix<-GetExampleData("cellmatrix") 

# get breast cancer binary mutations matrix, which can be the result of maf2matrix function.
mutmatrix<-GetExampleData("mutmatrix")

# perform the function `mutcorcell`.
mutcell<-mutcorcell(cellmatrix= cellmatrix,mutmatrix = mutmatrix,fisher.adjust = TRUE) 

#get the result of the `mutcorcell` function
mutcell<-GetExampleData("mutcell")

# the binary numerical matrix which shows the immune cells driven by somatic mutant gene.
mutcell$mut_cell[1:6,1:6]

#the numerical matrix which shows the pvalue of the immune cells driven by a somatic mutant gene
#mutcell$mut_cell_p

#the numerical matrix which show the fdr of the immune cells driven by somatic mutant gene
#mutcell$mut_cell_fdr

#the character matrix which shows the cell responses of the immune cells driven by a somatic mutant gene."up" means up-regulation, "down" means down-regulation, and "0" means no significant adjustment relationship
#mutcell$mut_cell_cellresponses

tip: The binary matrix can not only come from the maf2matrix function but also any binary mutations matrix, in which 1 represents any mutation occurs in a particular gene in a particular sample, otherwise the element is 0.

Function mutcellsummary produces result summaries of the results of mutcorcell function. The summary has four columns. The first column is gene names, the second column is the cells driven by the gene, the third column is the number of cells driven by the gene, the fourth column is mutation rates of the gene.

We selected the result of mutcorcell function, a binary mutations matrix and a cell abundance matrix for mutcellsummary and obtained the summaries of mutcorcell function.
The commands are as follows.

# perform the function mutcellsummary
summary<-mutcellsummary(mutcell =mutcell,mutmatrix = mutmatrix,cellmatrix = cellmatrix)

# get the result of the mutcellsummary function
head(summary)
#>     gene
#> 1   TP53
#> 2   CDH1
#> 3 NBPF14
#> 4  OBSCN
#> 5    NF1
#> 6  KCNA4
#>                                                                                                                                                                                          cells
#> 1 Astrocytes,CD8+ naive T-cells,CD8+ Tem,DC,Keratinocytes,Macrophages,Macrophages M1,Melanocytes,NK cells,pDC,Pericytes,Plasma cells,pro B-cells,Sebocytes,Tgd cells,Th1 cells,Th2 cells,Tregs
#> 2                                          Adipocytes,CD4+ Tcm,Chondrocytes,CMP,Endothelial cells,Fibroblasts,HSC,ly Endothelial cells,Megakaryocytes,Mesangial cells,mv Endothelial cells,NKT
#> 3                                                                                                                                          CD8+ Tem,DC,iDC,Keratinocytes,pro B-cells,Sebocytes
#> 4                                                                                                                               Epithelial cells,Keratinocytes,pro B-cells,Sebocytes,Th1 cells
#> 5                                                                                                                                                  aDC,B-cells,Memory B-cells,pDC,Plasma cells
#> 6                                                                                                                             CD8+ T-cells,CD8+ Tcm,Class-switched memory B-cells,NK cells,pDC
#>   count   mut rate
#> 1    18 0.34394251
#> 2    12 0.12936345
#> 3     6 0.01334702
#> 4     5 0.02977413
#> 5     5 0.03696099
#> 6     5 0.01129363

Function gene2cellsummary produces result summaries of the immune cells driven by a somatic mutation. The summaries are a matrix that shows the short name, full name, pvalue, fdr, cell responses(up or down) of the cells driven by a somatic mutation.
We selected the result of mutcorcell function, the method we use in exp2cell and a gene we are interested in for gene2cellsummary ,and obtained the result summaries of the immune cells driven by a somatic mutation.
The commands are as follows.

# perform the function gene2cellsummary
gene2cellsummary(gene="TP53",method="xCell",mutcell = mutcell) 
#>                    gene          cell name                    full name
#> Astrocytes         TP53         Astrocytes                   Astrocytes
#> CD8+ naive T-cells TP53 CD8+ naive T-cells           CD8+ naive T-cells
#> CD8+ Tem           TP53           CD8+ Tem CD8+ effector memory T-cells
#> DC                 TP53                 DC              Dendritic cells
#> Keratinocytes      TP53      Keratinocytes                Keratinocytes
#> Macrophages        TP53        Macrophages                  Macrophages
#> Macrophages M1     TP53     Macrophages M1               Macrophages M1
#> Melanocytes        TP53        Melanocytes                  Melanocytes
#> NK cells           TP53           NK cells                     NK cells
#> pDC                TP53                pDC Plasmacytoid dendritic cells
#> Pericytes          TP53          Pericytes                    Pericytes
#> Plasma cells       TP53       Plasma cells                 Plasma cells
#> pro B-cells        TP53        pro B-cells                  pro B-cells
#> Sebocytes          TP53          Sebocytes                    Sebocytes
#> Tgd cells          TP53          Tgd cells          Gamma delta T-cells
#> Th1 cells          TP53          Th1 cells        Type 1 T-helper cells
#> Th2 cells          TP53          Th2 cells        Type 2 T-helper cells
#> Tregs              TP53              Tregs           Regulatory T-cells
#>                          pvalue          fdr cell responses
#> Astrocytes         1.010916e-07 1.162554e-06             up
#> CD8+ naive T-cells 3.992545e-03 1.311836e-02             up
#> CD8+ Tem           1.748251e-02 4.467753e-02             up
#> DC                 1.604871e-02 4.385580e-02             up
#> Keratinocytes      2.331616e-07 1.981328e-06             up
#> Macrophages        1.820606e-04 1.196398e-03             up
#> Macrophages M1     6.327255e-09 9.701790e-08             up
#> Melanocytes        5.155187e-03 1.580924e-02             up
#> NK cells           5.359277e-04 2.465267e-03             up
#> pDC                2.076020e-03 8.121323e-03             up
#> Pericytes          3.445410e-04 1.981111e-03             up
#> Plasma cells       1.620758e-02 4.385580e-02             up
#> pro B-cells        2.637040e-03 9.331064e-03             up
#> Sebocytes          1.510706e-09 3.474624e-08             up
#> Tgd cells          2.118606e-03 8.121323e-03             up
#> Th1 cells          2.584341e-07 1.981328e-06             up
#> Th2 cells          8.859960e-13 4.075581e-11             up
#> Tregs              4.775752e-04 2.440940e-03             up

5 Example: Visualization

We provide a set of visual analysis functions including heatmapcell,plotwaterfall,plotCoocMutex ,and survcell.

Function heatmapcell is a function to draw clustered heatmaps for the cells driven by a somatic mutation.
The commands are as follows.

# load dependent package.
require(pheatmap)
#> Loading required package: pheatmap

# plot significant up-regulation or down-regulation cells heat map specific for breast cancer
heatmapcell(gene = "TP53",mutcell = mutcell,cellmatrix = cellmatrix,mutmatrix = mutmatrix)

We use the TCGA breast cancer sample data set to display the waterfall and co-occurrence and mutual exclusivity plots, and users can also enter the cancer data set they are interested in.
Function plotwaterfall and plotCoocMutex provide visualization of the mutation genes correlated with immune cells using a waterfall plot and mutually exclusive or co-occurring plot.

The commands are as follows.

#maf<-"dir" 
#tips: dir is the name of the mutation annotation file (MAF) format data. It must be an absolute path or the name relative to the current working directory.

#plot the waterfall for mutation genes which drive immune cells
#plotwaterfall(maffile = maf,mutcell.summary = summary,cellnumcuoff =4)

#plot the co-occurrence and mutual exclusivity plots for mutation genes that drive immune cells.
#plotCoocMutex(maffile = maf,mutcell.summary = summary,cellnumcuoff =4)

#view the result of the plotwaterfall function
knitr::include_graphics("../inst/plotwaterfall.jpeg")


#view the result of the plotCoocMutex function
knitr::include_graphics("../inst/plotCoocMutex.jpeg")

Function survcell draws Kaplan–Meier survival curves in the above-median and below-median groups for cell risk score. The cell risk score is calculated by the weighted mean of cells driven by a gene mutation, where the weight of cells is estimated by the “Univariate” or “Multivariate” cox regression.

# get the result of `mutcorcell` function.
mutcell<-GetExampleData("mutcell") 

# get the result of `exp2cell` function.
cellmatrix<-GetExampleData("cellmatrix") 

#get the survival data, the first column is the sample name, the second column is the survival time, and the third is the survival event.
surv<-GetExampleData("surv")

#draw Kaplan–Meier survival curves
survcell(gene ="TP53",mutcell=mutcell,cellmatrix=cellmatrix,surv=surv,palette = c("#E7B800", "#2E9FDF"))