Introduction

Rapid advances and reducing costs have enabled laboratories to apply a multi-omic approach to biological systems. Studies frequently measure several biological molecules, including mRNA, proteins, metabolites, lipids and phosphoproteins using different technological platforms generating multiple datasets on the same set of biological samples. We have described the following different exploratory data analysis methods that enable one to identify co-relationships between high dimensional datasets.

Multiple Co-Intertia Analysis (MCIA) simultaneously projects several datasets into the same dimensional space, transforming diverse sets of features onto the same scale and allows one identify the co-structure between datasets (Meng, Kuster, Culhane, and Moghaddas Gholami (2014)).
iBBiG is a biclustering algorithm which we wrote for analysis of noisy binary data (Gusenleitner, Howe, Bentink, Quackenbush, and Culhane (2012)). We developed it to identify biclusters in results from GSEA/pathway analysis from >1,000 different studies. Taking several thousand vectors which each contain p-values enrichment scores for many thousands of pathways, we discretize these to create a binary matrix and apply iBBiG to find pathways associated with covariates across many studies.

Today, I will provide an overview of the first MCIA.

TCGA Breast Cancer Dataset

In this workshop we will examine data from the the Cancer Genome Atlas (TCGA) Breast Cancer project (TCGA (2012)). We prepared a subset of the TCGA breast cancer study from the Nature 2012 publication. These are contained in a list and are multiple molecular data of 79 primary breast tumors. These data were downloaded from https://tcga-data.nci.nih.gov/docs/publications/brca_2012/. The data are miRNA, miRNAprecursor, RNAseq, Methylation, proteins measures on an RPPA array, and GISTIC SNP calls. I split the latter GISTIC data into 2 datasets; CNA and LOH (loss of heterzygousity), as these may be dervived by different biological mechanism and represent distinct molecular events in the cell (Wang, Birkbak, Culhane, Drapkin, Fatima, Tian, Schwede, Alsop, Daniels, Piao, and others (2012)). The eighth data.frame is the clinical data on these 79 cases. These data are in matrix or data.frame format.

(Note: more recently I have been using TCGA Assembler (Zhu, Qiu, and Ji (2014))) to download and process TCGA data. I am happy to describe this if there is interest.

I am also happy to review/demo how we reproduced the figures from our recent paper (Meng, Kuster, Culhane, et al. (2014)) and have uploaded code to regenerate the figures on the Bioconductor website.

load(file.path(datadir, "BRCA_TCGA_79_filtered_matchedNames_Basal+LuminalA.RData"))
summary(BRCA_TCGA_79)

##                Length  Class      Mode   
## miRNA               79 data.frame list   
## miRNAprecursor      79 data.frame list   
## RNAseq         1055440 -none-     numeric
## Methyl              79 data.frame list   
## RPPA                79 data.frame list   
## LOH             996348 -none-     numeric
## CNA             934649 -none-     numeric
## clin                29 data.frame list

Brief Background to ordination (PCA/COA)

MCIA can be computed using principcal component analysis (pca), correspondence analysis (coa) or one of several other ordination approaches.

There are many functions in Bioconductor and R for dimension reduction analysis. Within ade4 these are computed using the duality diagram (dudi) framework (Dray and Dufour (2007); De la Cruz and Holmes (2011)) and hence are called dudi.pca(), dudi.coa() etc.

library(ade4)
BRCApca<-dudi.pca(BRCA_TCGA_79$RNAseq, scannf=FALSE, nf=5)
print(BRCApca)

## Duality diagramm
## class: pca dudi
## $call: dudi.pca(df = BRCA_TCGA_79$RNAseq, scannf = FALSE, nf = 5)
## 
## $nf: 5 axis-components saved
## $rank: 79
## eigen values: 17.76 4.719 3.67 2.685 2.572 ...
##   vector length mode    content       
## 1 $cw    79     numeric column weights
## 2 $lw    13360  numeric row weights   
## 3 $eig   79     numeric eigen values  
## 
##   data.frame nrow  ncol content             
## 1 $tab       13360 79   modified array      
## 2 $li        13360 5    row coordinates     
## 3 $l1        13360 5    row normed scores   
## 4 $co        79    5    column coordinates  
## 5 $c1        79    5    column normed scores
## other elements: cent norm

(In ade4, $co are columns and $li are lines or rows)

summary(BRCApca)

## Class: pca dudi
## Call: dudi.pca(df = BRCA_TCGA_79$RNAseq, scannf = FALSE, nf = 5)
## 
## Total inertia: 79
## 
## Eigenvalues:
##     Ax1     Ax2     Ax3     Ax4     Ax5 
##  17.763   4.719   3.670   2.685   2.572 
## 
## Projected inertia (%):
##     Ax1     Ax2     Ax3     Ax4     Ax5 
##  22.485   5.973   4.646   3.398   3.256 
## 
## Cumulative projected inertia (%):
##     Ax1   Ax1:2   Ax1:3   Ax1:4   Ax1:5 
##   22.49   28.46   33.10   36.50   39.76 
## 
## (Only 5 dimensions (out of 79) are shown)

We have a BioC packages called made4 which is a wrapper around the ade4 that utilizes Bioconductor data classes such as Expression Set. In made4 the simplest way to view results from the function ord is to use the function plot. This will draw a plot of the eigenvalues, along with plots of the variables (genes) and a plot of the cases (microarray samples).

library(made4)
cl<-as.character(BRCA_TCGA_79$clin$PAM50.mRNA)
BRCAord= ord(BRCA_TCGA_79$RNAseq, classvec=factor(cl))
summary(BRCAord$ord)

## Class: coa dudi
## Call: dudi.coa(df = data.tr, scannf = FALSE, nf = ord.nf)
## 
## Total inertia: 0.006782
## 
## Eigenvalues:
##       Ax1       Ax2       Ax3       Ax4       Ax5 
## 0.0015441 0.0004271 0.0002812 0.0002410 0.0001826 
## 
## Projected inertia (%):
##     Ax1     Ax2     Ax3     Ax4     Ax5 
##  22.767   6.297   4.146   3.553   2.692 
## 
## Cumulative projected inertia (%):
##     Ax1   Ax1:2   Ax1:3   Ax1:4   Ax1:5 
##   22.77   29.06   33.21   36.76   39.45 
## 
## (Only 5 dimensions (out of 78) are shown)

plot(BRCAord, nlab=3,   arraylabels=rep("T", 79))

plot of chunk plotcoa figure 1: Correspondence analysis plots. A. plot of the eigenvalues, B. projection of microarray samples C. projection of genes (gray filled circles) and D. biplot showing both genes and samples. Samples and genes with a strong association are projected in the same direction from the origin. The greater the distance from the origin the stronger the association . The distinction between molecular subtypes is captured on the first eigenvector.

The case and gene projections can be visualised with functions plotarrays and plotgenes (sorry the names of these functions steam from the olden days of microarrays). These are wrappers around ade4 plotting functions. The number of genes that are labelled at the end of the axis can be defined. The default is 10. These plots are simplistic and could be rendered more useful with shiny/ggplots/etc

par(mfrow=c(2,1))
plotarrays(BRCAord$ord$co, classvec=BRCA_TCGA_79$clin$PAM50.mRNA)
plotgenes(BRCAord, n=5, col="red")

plot of chunk plotarrays2 Figure 2. Plot of A) sample coordinates b) feature coordinates

To extract a list of variables with greatest loadings or weights on an axes, (ie those at the end of an axes), use function topgenes. For example, to get a list of the 5 genes at the negative and postive ends of axes 1.

ax1<- topgenes(BRCAord, axis=1, n=5)

CIA: Two datasets

Coinertia analysis (CIA) identifies trends or co-relationships in two datasets which contain the same samples. Either the rows or the columns of a matrix must be “matchable”. CIA can be applied to datasets where the number of variables exceed the number of samples.

CIA is related to canonical correlation analysis, but CIA optimizes the squared covariance between the eigenvectors whereas CCA optimizes the correlation. There was a nice comparison of sparse CCA and CIA by LeCao et al., LeCao2009 (Le Cao, Martin, Robert-Granie, and Besse (2009))

The function cia calls the function coinertia in the R package ade4 .

## Coinertia analysis
## call: coinertia(dudiX = coa1, dudiY = coa2, scannf = cia.scan, nf = cia.nf)
## class: coinertia dudi 
## 
## $rank (rank)     : 78
## $nf (axis saved) : 2
## $RV (RV coeff)   : 0.8315
## 
## eigenvalues: 9.302e-06 4.838e-07 2.381e-07 2.02e-07 1.475e-07 ...
## 
##   vector length mode    content                    
## 1 $eig   78     numeric Eigenvalues                
## 2 $lw    171    numeric Row weigths (for coa2 cols)
## 3 $cw    13360  numeric Col weigths (for coa1 cols)
## 
##    data.frame nrow  ncol  content                                      
## 1  $tab       171   13360 Crossed Table (CT): cols(coa2) x cols(coa1)  
## 2  $li        171   2     CT row scores (cols of coa2)                 
## 3  $l1        171   2     Principal components (loadings for coa2 cols)
## 4  $co        13360 2     CT col scores (cols of coa1)                 
## 5  $c1        13360 2     Principal axes (cols of coa1)                
## 6  $lX        79    2     Row scores (rows of coa1)                    
## 7  $mX        79    2     Normed row scores (rows of coa1)             
## 8  $lY        79    2     Row scores (rows of coa2)                    
## 9  $mY        79    2     Normed row scores (rows of coa2)             
## 10 $aX        2     2     Corr coa1 axes / coinertia axes              
## 11 $aY        2     2     Corr coa2 axes / coinertia axes              
## 
## CT rows = cols of coa2 (171) / CT cols = cols of coa1 (13360)

plot of chunk BRCAciaplot figure 4. Plot of coinertia analysis of gene and protein expression datasets.

The correlation between the coa axes and the coinertia axes are in $aX and $aY:

par(mfrow=c(1,2))
s.corcircle(BRCAcia$coinertia$aX)
s.corcircle(BRCAcia$coinertia$aY)

plot of chunk ciaMoreCocircle fig 5 Scatter diagram of a correlation circle of the coa and coinertia axes. The angle between two vectors gives the degree of correlation (adjacent = highly correlated, orthogonal (90°) = uncorrelated, and opposite (180°) = negatively correlated).

The RV coefficient is a measure of global similarity between the datasets. The RV coefficient of this coinertia analysis is: 0.8315.

MCIA

Multiple Co-Interia Analysis (MCIA) simultaneously projects several datasets into the same dimensional space, transforming diverse sets of features onto the same scale, enabling one to extract the most variant from each dataset (Meng, Kuster, Culhane, et al. (2014)).

# Check that there are no zero or low variant low count rows.
for (i in 1:7) BRCA_TCGA_79[[i]]<-BRCA_TCGA_79[[i]][!apply(BRCA_TCGA_79[[i]],1, sum)==min(BRCA_TCGA_79[[i]]),]

library(omicade4)
mcia79<-mcia(BRCA_TCGA_79[1:7])
save(mcia79, file="mciaRes.RData")

plot(mcia79, axes=1:2, sample.lab=FALSE, sample.legend=FALSE, phenovec=as.numeric(BRCA_TCGA_79$clin$PAM50.mRNA), gene.nlab=2, df.color=c("navy", "cyan", "magenta", "red4", "brown","yellow", "orange"),df.pch=2:8)

plot of chunk plotmcia It is useful to examine the RV coefficient, first examine the reference structure, what is the joint co-structure between each dataset and the reference, or each pair of datasets. Not surprisingly the proteomics and gene expression data share high similarity

RV.mcoa(mcia79$mcoa)

##          miRNA miRNAprecursor         RNAseq         Methyl           RPPA 
##         0.8968         0.8989         0.8207         0.7759         0.7639 
##            LOH            CNA 
##         0.5377         0.5397

Between each pair of datasets

mcia79$mcoa$RV

##                 miRNA miRNAprecursor RNAseq Methyl   RPPA    LOH    CNA
## miRNA          1.0000         0.9370 0.3780 0.3847 0.3514 0.4404 0.4087
## miRNAprecursor 0.9370         1.0000 0.3404 0.3424 0.3199 0.3819 0.3800
## RNAseq         0.3780         0.3404 1.0000 0.6553 0.8315 0.3496 0.5181
## Methyl         0.3847         0.3424 0.6553 1.0000 0.5746 0.3580 0.4773
## RPPA           0.3514         0.3199 0.8315 0.5746 1.0000 0.3192 0.4888
## LOH            0.4404         0.3819 0.3496 0.3580 0.3192 1.0000 0.3518
## CNA            0.4087         0.3800 0.5181 0.4773 0.4888 0.3518 1.0000

PseduoEigen values

mcia79$mcoa$cov2

##                  cov21   cov22
## miRNA          0.13060 0.12491
## miRNAprecursor 0.13197 0.12492
## RNAseq         0.21240 0.03951
## Methyl         0.13825 0.02503
## RPPA           0.20379 0.02861
## LOH            0.06203 0.01219
## CNA            0.08033 0.02284

plot(mcia79$mcoa$cov2,  xlab = "pseudoeig 1", ylab = "pseudoeig 2", pch=19, col="red")
text(mcia79$mcoa$cov2, labels=rownames(mcia79$mcoa$cov2), cex=0.8, adj=0)

plot of chunk assesingMCIAcov

Eigenvalues (computed on the separate analysis) are in $lambda. These will either be weighted by the first eigenvalues (option=“lambda1”), or weighted by the total inertia (option=“inertia”) depending on the parameter “option”. See more in the help for the function ade4:::mcoa.

Examine the projects of the axes using the correlation circle

mcia79$mcoa$Tax

##                      Axis1     Axis2
## miRNA.1           0.956722  0.105140
## miRNA.2          -0.033670  0.948745
## miRNA.3           0.206512 -0.285106
## miRNA.4          -0.063743  0.051587
## miRNAprecursor.1  0.951915  0.166798
## miRNAprecursor.2 -0.090401  0.934020
## miRNAprecursor.3  0.221543 -0.308292
## miRNAprecursor.4 -0.038513  0.012779
## RNAseq.1          0.995599 -0.062898
## RNAseq.2          0.070107  0.896686
## RNAseq.3         -0.004684 -0.210508
## RNAseq.4         -0.019510 -0.080803
## Methyl.1          0.968830 -0.062894
## Methyl.2          0.212475  0.324962
## Methyl.3          0.052842 -0.585813
## Methyl.4         -0.017095  0.116557
## RPPA.1            0.991280 -0.038901
## RPPA.2            0.103149  0.457035
## RPPA.3           -0.035056  0.521484
## RPPA.4            0.014560 -0.206096
## LOH.1             0.820020 -0.133287
## LOH.2             0.303922  0.306115
## LOH.3             0.323478 -0.007005
## LOH.4             0.195270 -0.004495
## CNA.1             0.707346 -0.499733
## CNA.2             0.635149  0.477023
## CNA.3            -0.049951  0.285993
## CNA.4            -0.162432  0.152731

dev.off()

## null device 
##           1

par(mfrow=c(4,2))
xx<-by(mcia79$mcoa$Tax, substr(rownames(mcia79$mcoa$Tax),1,3), s.corcircle)

Coordinates of reference samples (scores against which the sq covariance is maximized) are in mcia79$mcoa$SynVar. The reference and each data projection can be visualized using kplot. Again the graphics could be so much better.

#plotarrays(mcia79$mcoa$SynVar, classvec=BRCA_TCGA_79$clin$PAM50.mRNA)
kplot(mcia79$mcoa, mfrow = c(3,4), clab = .8, csub = 3, cpoi = 3)

plot of chunk samplescores

Extracting tranformed features from MCIA results

All of the data have been transformed onto the same scale and the coordinates of genes tranformed in this space are avaialble in mcia79$mcoa$axis

summary(mcia79$mcoa$axis)

##      Axis1            Axis2        
##  Min.   :-7.332   Min.   :-11.621  
##  1st Qu.:-0.737   1st Qu.: -0.611  
##  Median :-0.085   Median : -0.066  
##  Mean   :-0.042   Mean   :  0.026  
##  3rd Qu.: 0.525   3rd Qu.:  0.587  
##  Max.   :14.518   Max.   :  6.921

par(mfrow=c(1,2))
plot(mcia79$mcoa$axis[,1]~factor(mcia79$mcoa$TC[,1]), col=1:7, names=names(mcia79$coa), ylab="Gene Scores PC1", xlab="", las=2)
plot(mcia79$mcoa$axis[,2]~factor(mcia79$mcoa$TC[,1]), col=1:7, names=names(mcia79$coa), ylab="Gene Scores PC2", xlab="", las=2)

plot of chunk genescores

For example if we look at the top 10 scores of features on the negative end of PC1, it contains features from the miRNA (#1), miRNAprecursor (#2) and RNAseq (#3) datasets. In mcia79$mcoa$axis, each feature ID has a suffix number which refers to the dataset from which it originated.

mcia79$mcoa$axis[order(mcia79$mcoa$axis[,1]),][1:10,1, drop=FALSE]

##                              Axis1
## hsa.mir.490.MIMAT0004764.1  -7.332
## hsa.mir.124.1.2             -7.304
## hsa.mir.124.MIMAT0000422.1  -7.260
## hsa.mir.124.2.2             -7.240
## hsa.mir.124.3.2             -7.047
## hsa.mir.124.MIMAT0004591.1  -5.252
## ROPN1.3                     -5.200
## hsa.mir.1267.2              -5.156
## hsa.mir.2117.MIMAT0011162.1 -5.140
## hsa.mir.1267.MIMAT0005921.1 -5.130

## Dataset suffix
cbind(1:7,rownames(mcia79$mcoa$cov2))

##      [,1] [,2]            
## [1,] "1"  "miRNA"         
## [2,] "2"  "miRNAprecursor"
## [3,] "3"  "RNAseq"        
## [4,] "4"  "Methyl"        
## [5,] "5"  "RPPA"          
## [6,] "6"  "LOH"           
## [7,] "7"  "CNA"

Project annotation on plot

## To "concatentate"" data, 
mm<-function(x) substr(x, 1, nchar(x)-2)
ids<-mm(rownames(mcia79$mcoa$axis))

# Whilst it would be great, to have alll of our data mapped to genome co-ordinates and really take the union of everything, to keep it simple, I will only look at the Gene Symbols (RNAseq, RPPA) 
library(HGNChelper)
library(Biobase)

idsFix<- checkGeneSymbols(ids)

## Warning: Some lower-case letters were found and converted to upper-case.
##                  HGNChelper is intended for human symbols only, which should be all
##                  upper-case except for open reading frames (orf).
## Warning: x contains non-approved gene symbols

## Get PC1
idsPC<-cbind(idsFix, PC=mcia79$mcoa$axis)

GOs<-select(org.Hs.eg.db, columns="GO",keytype="SYMBOL",keys=idsPC$Suggested.Symbol)

## Warning: 'NA' keys have been removed
## Warning: 'select' and duplicate query keys resulted in 1:many mapping
## between keys and return rows

#Get Coordinates for GOs terms
res<-tapply(GOs$SYMBOL, GOs$GO, function(Syms) colMeans(idsPC[idsPC$Suggested.Symbol%in%Syms,c("PC.Axis1","PC.Axis2")]))
res<-do.call(rbind,res)

res[1:2,]

##            PC.Axis1 PC.Axis2
## GO:0000002  -0.3018 -0.02972
## GO:0000003   0.9200  1.86450

plot(mcia79$mcoa$axis, col =as.numeric(mcia79$mcoa$TC[,1]), pch=as.numeric(mcia79$mcoa$TC[,1]))
tt<-topgenes(res, n=5)
points(res[tt,], pch=19, col="gray")
text(res[tt,], labels=tt, cex=0.8, adj=0)

plot of chunk catData

Pathway

Or we can perform pathway analysis on the features. There are many tools within BioC for performing pathway enrichments, for examples we could perform ssGSVA pcs

#Exclude NA
idsPC<-idsPC[!is.na(idsPC$Suggested.Symbol),]

#Reduce to GeneSymbol and MCIA score, taking max
idsPC1<-tapply(idsPC$PC.Axis1, idsPC$Suggested.Symbol,max)
idsPC2<-tapply(idsPC$PC.Axis2, idsPC$Suggested.Symbol,max)
if( !all(names(idsPC1)==names(idsPC2))) stop() 
idsPCs<-cbind(idsPC1, idsPC2)


# The "built-in" gsva library are mapped to entrezIDs so map symbols
entrezPC1<-select(org.Hs.eg.db, columns="ENTREZID",keytype="SYMBOL",keys=rownames(idsPCs))
entrezPC1<-entrezPC1[!duplicated(entrezPC1[,1]),]
table(rownames(idsPCs)==entrezPC1[,1])
rownames(idsPCs)= entrezPC1[,2]

# Run any enrichment test with gsva
require(GSVA)
require(GSVAdata)
gsva(idsPCs, c2BroadSets)

Any Questions?

Aedin@jimmy.harvard.edu

Integrated pathway analysis of multiple ’omics datasets

Aedin Culhane

Friday August 1, 2014