Rapid advances and reducing costs have enabled laboratories to apply a multi-omic approach to biological systems. Studies frequently measure several biological molecules, including mRNA, proteins, metabolites, lipids and phosphoproteins using different technological platforms generating multiple datasets on the same set of biological samples. We have described the following different exploratory data analysis methods that enable one to identify co-relationships between high dimensional datasets.
Multiple Co-Intertia Analysis (MCIA) simultaneously projects several datasets into the same dimensional space, transforming diverse sets of features onto the same scale and allows one identify the co-structure between datasets (Meng, Kuster, Culhane, and Moghaddas Gholami (2014)).
iBBiG is a biclustering algorithm which we wrote for analysis of noisy binary data (Gusenleitner, Howe, Bentink, Quackenbush, and Culhane (2012)). We developed it to identify biclusters in results from GSEA/pathway analysis from >1,000 different studies. Taking several thousand vectors which each contain p-values enrichment scores for many thousands of pathways, we discretize these to create a binary matrix and apply iBBiG to find pathways associated with covariates across many studies.
Today, I will provide an overview of the first MCIA.
In this workshop we will examine data from the the Cancer Genome Atlas (TCGA) Breast Cancer project (TCGA (2012)). We prepared a subset of the TCGA breast cancer study from the Nature 2012 publication. These are contained in a list and are multiple molecular data of 79 primary breast tumors. These data were downloaded from https://tcga-data.nci.nih.gov/docs/publications/brca_2012/. The data are miRNA, miRNAprecursor, RNAseq, Methylation, proteins measures on an RPPA array, and GISTIC SNP calls. I split the latter GISTIC data into 2 datasets; CNA and LOH (loss of heterzygousity), as these may be dervived by different biological mechanism and represent distinct molecular events in the cell (Wang, Birkbak, Culhane, Drapkin, Fatima, Tian, Schwede, Alsop, Daniels, Piao, and others (2012)). The eighth data.frame is the clinical data on these 79 cases. These data are in matrix or data.frame format.
(Note: more recently I have been using TCGA Assembler (Zhu, Qiu, and Ji (2014))) to download and process TCGA data. I am happy to describe this if there is interest.
I am also happy to review/demo how we reproduced the figures from our recent paper (Meng, Kuster, Culhane, et al. (2014)) and have uploaded code to regenerate the figures on the Bioconductor website.
load(file.path(datadir, "BRCA_TCGA_79_filtered_matchedNames_Basal+LuminalA.RData"))
summary(BRCA_TCGA_79)
## Length Class Mode
## miRNA 79 data.frame list
## miRNAprecursor 79 data.frame list
## RNAseq 1055440 -none- numeric
## Methyl 79 data.frame list
## RPPA 79 data.frame list
## LOH 996348 -none- numeric
## CNA 934649 -none- numeric
## clin 29 data.frame list
MCIA can be computed using principcal component analysis (pca), correspondence analysis (coa) or one of several other ordination approaches.
There are many functions in Bioconductor and R for dimension reduction analysis. Within ade4 these are computed using the duality diagram (dudi) framework (Dray and Dufour (2007); De la Cruz and Holmes (2011)) and hence are called dudi.pca(), dudi.coa() etc.
library(ade4)
BRCApca<-dudi.pca(BRCA_TCGA_79$RNAseq, scannf=FALSE, nf=5)
print(BRCApca)
## Duality diagramm
## class: pca dudi
## $call: dudi.pca(df = BRCA_TCGA_79$RNAseq, scannf = FALSE, nf = 5)
##
## $nf: 5 axis-components saved
## $rank: 79
## eigen values: 17.76 4.719 3.67 2.685 2.572 ...
## vector length mode content
## 1 $cw 79 numeric column weights
## 2 $lw 13360 numeric row weights
## 3 $eig 79 numeric eigen values
##
## data.frame nrow ncol content
## 1 $tab 13360 79 modified array
## 2 $li 13360 5 row coordinates
## 3 $l1 13360 5 row normed scores
## 4 $co 79 5 column coordinates
## 5 $c1 79 5 column normed scores
## other elements: cent norm
(In ade4, $co are columns and $li are lines or rows)
summary(BRCApca)
## Class: pca dudi
## Call: dudi.pca(df = BRCA_TCGA_79$RNAseq, scannf = FALSE, nf = 5)
##
## Total inertia: 79
##
## Eigenvalues:
## Ax1 Ax2 Ax3 Ax4 Ax5
## 17.763 4.719 3.670 2.685 2.572
##
## Projected inertia (%):
## Ax1 Ax2 Ax3 Ax4 Ax5
## 22.485 5.973 4.646 3.398 3.256
##
## Cumulative projected inertia (%):
## Ax1 Ax1:2 Ax1:3 Ax1:4 Ax1:5
## 22.49 28.46 33.10 36.50 39.76
##
## (Only 5 dimensions (out of 79) are shown)
We have a BioC packages called made4 which is a wrapper around the ade4 that utilizes Bioconductor data classes such as Expression Set. In made4 the simplest way to view results from the function ord is to use the function plot. This will draw a plot of the eigenvalues, along with plots of the variables (genes) and a plot of the cases (microarray samples).
library(made4)
cl<-as.character(BRCA_TCGA_79$clin$PAM50.mRNA)
BRCAord= ord(BRCA_TCGA_79$RNAseq, classvec=factor(cl))
summary(BRCAord$ord)
## Class: coa dudi
## Call: dudi.coa(df = data.tr, scannf = FALSE, nf = ord.nf)
##
## Total inertia: 0.006782
##
## Eigenvalues:
## Ax1 Ax2 Ax3 Ax4 Ax5
## 0.0015441 0.0004271 0.0002812 0.0002410 0.0001826
##
## Projected inertia (%):
## Ax1 Ax2 Ax3 Ax4 Ax5
## 22.767 6.297 4.146 3.553 2.692
##
## Cumulative projected inertia (%):
## Ax1 Ax1:2 Ax1:3 Ax1:4 Ax1:5
## 22.77 29.06 33.21 36.76 39.45
##
## (Only 5 dimensions (out of 78) are shown)
plot(BRCAord, nlab=3, arraylabels=rep("T", 79))
figure 1: Correspondence analysis plots. A. plot of the eigenvalues, B. projection of microarray samples C. projection of genes (gray filled circles) and D. biplot showing both genes and samples. Samples and genes with a strong association are projected in the same direction from the origin. The greater the distance from the origin the stronger the association . The distinction between molecular subtypes is captured on the first eigenvector.
The case and gene projections can be visualised with functions plotarrays and plotgenes (sorry the names of these functions steam from the olden days of microarrays). These are wrappers around ade4 plotting functions. The number of genes that are labelled at the end of the axis can be defined. The default is 10. These plots are simplistic and could be rendered more useful with shiny/ggplots/etc
par(mfrow=c(2,1))
plotarrays(BRCAord$ord$co, classvec=BRCA_TCGA_79$clin$PAM50.mRNA)
plotgenes(BRCAord, n=5, col="red")
Figure 2. Plot of A) sample coordinates b) feature coordinates
To extract a list of variables with greatest loadings or weights on an axes, (ie those at the end of an axes), use function topgenes. For example, to get a list of the 5 genes at the negative and postive ends of axes 1.
ax1<- topgenes(BRCAord, axis=1, n=5)
Coinertia analysis (CIA) identifies trends or co-relationships in two datasets which contain the same samples. Either the rows or the columns of a matrix must be “matchable”. CIA can be applied to datasets where the number of variables exceed the number of samples.
CIA is related to canonical correlation analysis, but CIA optimizes the squared covariance between the eigenvectors whereas CCA optimizes the correlation. There was a nice comparison of sparse CCA and CIA by LeCao et al., LeCao2009 (Le Cao, Martin, Robert-Granie, and Besse (2009))
The function cia calls the function coinertia in the R package ade4 .
## Coinertia analysis
## call: coinertia(dudiX = coa1, dudiY = coa2, scannf = cia.scan, nf = cia.nf)
## class: coinertia dudi
##
## $rank (rank) : 78
## $nf (axis saved) : 2
## $RV (RV coeff) : 0.8315
##
## eigenvalues: 9.302e-06 4.838e-07 2.381e-07 2.02e-07 1.475e-07 ...
##
## vector length mode content
## 1 $eig 78 numeric Eigenvalues
## 2 $lw 171 numeric Row weigths (for coa2 cols)
## 3 $cw 13360 numeric Col weigths (for coa1 cols)
##
## data.frame nrow ncol content
## 1 $tab 171 13360 Crossed Table (CT): cols(coa2) x cols(coa1)
## 2 $li 171 2 CT row scores (cols of coa2)
## 3 $l1 171 2 Principal components (loadings for coa2 cols)
## 4 $co 13360 2 CT col scores (cols of coa1)
## 5 $c1 13360 2 Principal axes (cols of coa1)
## 6 $lX 79 2 Row scores (rows of coa1)
## 7 $mX 79 2 Normed row scores (rows of coa1)
## 8 $lY 79 2 Row scores (rows of coa2)
## 9 $mY 79 2 Normed row scores (rows of coa2)
## 10 $aX 2 2 Corr coa1 axes / coinertia axes
## 11 $aY 2 2 Corr coa2 axes / coinertia axes
##
## CT rows = cols of coa2 (171) / CT cols = cols of coa1 (13360)
figure 4. Plot of coinertia analysis of gene and protein expression datasets.
The correlation between the coa axes and the coinertia axes are in \(aX and \)aY:
par(mfrow=c(1,2))
s.corcircle(BRCAcia$coinertia$aX)
s.corcircle(BRCAcia$coinertia$aY)
fig 5 Scatter diagram of a correlation circle of the coa and coinertia axes. The angle between two vectors gives the degree of correlation (adjacent = highly correlated, orthogonal (90°) = uncorrelated, and opposite (180°) = negatively correlated).
The RV coefficient is a measure of global similarity between the datasets. The RV coefficient of this coinertia analysis is: 0.8315.
Multiple Co-Interia Analysis (MCIA) simultaneously projects several datasets into the same dimensional space, transforming diverse sets of features onto the same scale, enabling one to extract the most variant from each dataset (Meng, Kuster, Culhane, et al. (2014)).
# Check that there are no zero or low variant low count rows.
for (i in 1:7) BRCA_TCGA_79[[i]]<-BRCA_TCGA_79[[i]][!apply(BRCA_TCGA_79[[i]],1, sum)==min(BRCA_TCGA_79[[i]]),]
library(omicade4)
mcia79<-mcia(BRCA_TCGA_79[1:7])
save(mcia79, file="mciaRes.RData")
plot(mcia79, axes=1:2, sample.lab=FALSE, sample.legend=FALSE, phenovec=as.numeric(BRCA_TCGA_79$clin$PAM50.mRNA), gene.nlab=2, df.color=c("navy", "cyan", "magenta", "red4", "brown","yellow", "orange"),df.pch=2:8)
It is useful to examine the RV coefficient, first examine the reference structure, what is the joint co-structure between each dataset and the reference, or each pair of datasets. Not surprisingly the proteomics and gene expression data share high similarity
RV.mcoa(mcia79$mcoa)
## miRNA miRNAprecursor RNAseq Methyl RPPA
## 0.8968 0.8989 0.8207 0.7759 0.7639
## LOH CNA
## 0.5377 0.5397
mcia79$mcoa$RV
## miRNA miRNAprecursor RNAseq Methyl RPPA LOH CNA
## miRNA 1.0000 0.9370 0.3780 0.3847 0.3514 0.4404 0.4087
## miRNAprecursor 0.9370 1.0000 0.3404 0.3424 0.3199 0.3819 0.3800
## RNAseq 0.3780 0.3404 1.0000 0.6553 0.8315 0.3496 0.5181
## Methyl 0.3847 0.3424 0.6553 1.0000 0.5746 0.3580 0.4773
## RPPA 0.3514 0.3199 0.8315 0.5746 1.0000 0.3192 0.4888
## LOH 0.4404 0.3819 0.3496 0.3580 0.3192 1.0000 0.3518
## CNA 0.4087 0.3800 0.5181 0.4773 0.4888 0.3518 1.0000
PseduoEigen values
mcia79$mcoa$cov2
## cov21 cov22
## miRNA 0.13060 0.12491
## miRNAprecursor 0.13197 0.12492
## RNAseq 0.21240 0.03951
## Methyl 0.13825 0.02503
## RPPA 0.20379 0.02861
## LOH 0.06203 0.01219
## CNA 0.08033 0.02284
plot(mcia79$mcoa$cov2, xlab = "pseudoeig 1", ylab = "pseudoeig 2", pch=19, col="red")
text(mcia79$mcoa$cov2, labels=rownames(mcia79$mcoa$cov2), cex=0.8, adj=0)
Eigenvalues (computed on the separate analysis) are in $lambda. These will either be weighted by the first eigenvalues (option=“lambda1”), or weighted by the total inertia (option=“inertia”) depending on the parameter “option”. See more in the help for the function ade4:::mcoa.
Examine the projects of the axes using the correlation circle
mcia79$mcoa$Tax
## Axis1 Axis2
## miRNA.1 0.956722 0.105140
## miRNA.2 -0.033670 0.948745
## miRNA.3 0.206512 -0.285106
## miRNA.4 -0.063743 0.051587
## miRNAprecursor.1 0.951915 0.166798
## miRNAprecursor.2 -0.090401 0.934020
## miRNAprecursor.3 0.221543 -0.308292
## miRNAprecursor.4 -0.038513 0.012779
## RNAseq.1 0.995599 -0.062898
## RNAseq.2 0.070107 0.896686
## RNAseq.3 -0.004684 -0.210508
## RNAseq.4 -0.019510 -0.080803
## Methyl.1 0.968830 -0.062894
## Methyl.2 0.212475 0.324962
## Methyl.3 0.052842 -0.585813
## Methyl.4 -0.017095 0.116557
## RPPA.1 0.991280 -0.038901
## RPPA.2 0.103149 0.457035
## RPPA.3 -0.035056 0.521484
## RPPA.4 0.014560 -0.206096
## LOH.1 0.820020 -0.133287
## LOH.2 0.303922 0.306115
## LOH.3 0.323478 -0.007005
## LOH.4 0.195270 -0.004495
## CNA.1 0.707346 -0.499733
## CNA.2 0.635149 0.477023
## CNA.3 -0.049951 0.285993
## CNA.4 -0.162432 0.152731
dev.off()
## null device
## 1
par(mfrow=c(4,2))
xx<-by(mcia79$mcoa$Tax, substr(rownames(mcia79$mcoa$Tax),1,3), s.corcircle)
Coordinates of reference samples (scores against which the sq covariance is maximized) are in mcia79\(mcoa\)SynVar. The reference and each data projection can be visualized using kplot. Again the graphics could be so much better.
#plotarrays(mcia79$mcoa$SynVar, classvec=BRCA_TCGA_79$clin$PAM50.mRNA)
kplot(mcia79$mcoa, mfrow = c(3,4), clab = .8, csub = 3, cpoi = 3)
All of the data have been transformed onto the same scale and the coordinates of genes tranformed in this space are avaialble in mcia79\(mcoa\)axis
summary(mcia79$mcoa$axis)
## Axis1 Axis2
## Min. :-7.332 Min. :-11.621
## 1st Qu.:-0.737 1st Qu.: -0.611
## Median :-0.085 Median : -0.066
## Mean :-0.042 Mean : 0.026
## 3rd Qu.: 0.525 3rd Qu.: 0.587
## Max. :14.518 Max. : 6.921
par(mfrow=c(1,2))
plot(mcia79$mcoa$axis[,1]~factor(mcia79$mcoa$TC[,1]), col=1:7, names=names(mcia79$coa), ylab="Gene Scores PC1", xlab="", las=2)
plot(mcia79$mcoa$axis[,2]~factor(mcia79$mcoa$TC[,1]), col=1:7, names=names(mcia79$coa), ylab="Gene Scores PC2", xlab="", las=2)
For example if we look at the top 10 scores of features on the negative end of PC1, it contains features from the miRNA (#1), miRNAprecursor (#2) and RNAseq (#3) datasets. In mcia79\(mcoa\)axis, each feature ID has a suffix number which refers to the dataset from which it originated.
mcia79$mcoa$axis[order(mcia79$mcoa$axis[,1]),][1:10,1, drop=FALSE]
## Axis1
## hsa.mir.490.MIMAT0004764.1 -7.332
## hsa.mir.124.1.2 -7.304
## hsa.mir.124.MIMAT0000422.1 -7.260
## hsa.mir.124.2.2 -7.240
## hsa.mir.124.3.2 -7.047
## hsa.mir.124.MIMAT0004591.1 -5.252
## ROPN1.3 -5.200
## hsa.mir.1267.2 -5.156
## hsa.mir.2117.MIMAT0011162.1 -5.140
## hsa.mir.1267.MIMAT0005921.1 -5.130
## Dataset suffix
cbind(1:7,rownames(mcia79$mcoa$cov2))
## [,1] [,2]
## [1,] "1" "miRNA"
## [2,] "2" "miRNAprecursor"
## [3,] "3" "RNAseq"
## [4,] "4" "Methyl"
## [5,] "5" "RPPA"
## [6,] "6" "LOH"
## [7,] "7" "CNA"
## To "concatentate"" data,
mm<-function(x) substr(x, 1, nchar(x)-2)
ids<-mm(rownames(mcia79$mcoa$axis))
# Whilst it would be great, to have alll of our data mapped to genome co-ordinates and really take the union of everything, to keep it simple, I will only look at the Gene Symbols (RNAseq, RPPA)
library(HGNChelper)
library(Biobase)
idsFix<- checkGeneSymbols(ids)
## Warning: Some lower-case letters were found and converted to upper-case.
## HGNChelper is intended for human symbols only, which should be all
## upper-case except for open reading frames (orf).
## Warning: x contains non-approved gene symbols
## Get PC1
idsPC<-cbind(idsFix, PC=mcia79$mcoa$axis)
GOs<-select(org.Hs.eg.db, columns="GO",keytype="SYMBOL",keys=idsPC$Suggested.Symbol)
## Warning: 'NA' keys have been removed
## Warning: 'select' and duplicate query keys resulted in 1:many mapping
## between keys and return rows
#Get Coordinates for GOs terms
res<-tapply(GOs$SYMBOL, GOs$GO, function(Syms) colMeans(idsPC[idsPC$Suggested.Symbol%in%Syms,c("PC.Axis1","PC.Axis2")]))
res<-do.call(rbind,res)
res[1:2,]
## PC.Axis1 PC.Axis2
## GO:0000002 -0.3018 -0.02972
## GO:0000003 0.9200 1.86450
plot(mcia79$mcoa$axis, col =as.numeric(mcia79$mcoa$TC[,1]), pch=as.numeric(mcia79$mcoa$TC[,1]))
tt<-topgenes(res, n=5)
points(res[tt,], pch=19, col="gray")
text(res[tt,], labels=tt, cex=0.8, adj=0)
Or we can perform pathway analysis on the features. There are many tools within BioC for performing pathway enrichments, for examples we could perform ssGSVA pcs
#Exclude NA
idsPC<-idsPC[!is.na(idsPC$Suggested.Symbol),]
#Reduce to GeneSymbol and MCIA score, taking max
idsPC1<-tapply(idsPC$PC.Axis1, idsPC$Suggested.Symbol,max)
idsPC2<-tapply(idsPC$PC.Axis2, idsPC$Suggested.Symbol,max)
if( !all(names(idsPC1)==names(idsPC2))) stop()
idsPCs<-cbind(idsPC1, idsPC2)
# The "built-in" gsva library are mapped to entrezIDs so map symbols
entrezPC1<-select(org.Hs.eg.db, columns="ENTREZID",keytype="SYMBOL",keys=rownames(idsPCs))
entrezPC1<-entrezPC1[!duplicated(entrezPC1[,1]),]
table(rownames(idsPCs)==entrezPC1[,1])
rownames(idsPCs)= entrezPC1[,2]
# Run any enrichment test with gsva
require(GSVA)
require(GSVAdata)
gsva(idsPCs, c2BroadSets)