%\VignetteIndexEntry{genefu An Introduction (HowTo)}
%\VignetteDepends{survcomp, mclust, rmeta, Biobase, xtable}
%\VignetteSuggests{GeneMeta, breastCancerVDX, breastCancerMAINZ, breastCancerTRANSBIG, breastCancerUPP, breastCancerUNT, breastCancerNKI, knitr}
%\VignetteImports{amap}
%\VignetteKeywords{Breast Cancer, Survival Analysis, Prognosis, Classification}
%\VignettePackage{genefu}
%\VignetteEngine{knitr::knitr}

%%%http://yihui.name/knitr/options/

\documentclass{article}
\usepackage{graphicx}
\usepackage{microtype}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{lmodern}
\usepackage{geometry}
\usepackage{authblk}
\geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm}
\usepackage[table]{xcolor}

%------------------------------------------------------------
% newcommand
%------------------------------------------------------------
\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rexpression}[1]{\texttt{#1}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}

\begin{document}

\title{\Rpackage{genefu}: a package for breast cancer gene expression analysis}

\author[1,2]{Deena M.A. Gendoo}
\author[1]{Natchar Ratanasirigulchai}
\author[3]{Markus Schr\"{o}der}
\author[1,2]{Benjamin Haibe-Kains\thanks{benjamin.haibe.kains@utoronto.ca }}

\affil[1]{Bioinformatics and Computational Genomics Laboratory, Princess Margaret Cancer Center, University Health Network, Toronto, Ontario, Canada}
\affil[2]{Department of Medical Biophysics, University of Toronto, Toronto, Canada}
\affil[3]{UCD School of Biomolecular and Biomedical Science, Conway Institute, University College Dublin, Belfield, Dublin, Ireland}
\affil[4]{Machine Learning Group, Universit\'{e} Libre de Bruxelles}

\maketitle
\tableofcontents

%------------------------------------------------------------
\section{Introduction}
%------------------------------------------------------------ 

The genefu package is providing relevant functions for gene expression analysis, especially in breast cancer.
This package includes a number of algorithms for molecular subtype classification. 
The package also includes implementations of prognostic prediction algorithms, 
along with lists of prognostic gene signatures on which these algorithms were based. 

Please refer to the manuscript URL and Lab website: http://www.pmgenomics.ca/bhklab/software/genefu

Please also refer to the References section below, for additional information on publications that have cited Version 1 of genefu. 

%------------------------------------------------------------
\section{Loading package for case studies}
%------------------------------------------------------------ 
First we load the genefu into the workspace. 
The package is publicly available and can be installed from Bioconductor version 2.8 or higher in R version 2.13.0 or higher. 

To install the genefu package:
<<installAndLoadPackages,eval=FALSE,results='hide',message=FALSE>>=
knitr::opts_chunk$set(eval=TRUE,cache=TRUE)
source("http://bioconductor.org/biocLite.R")
biocLite("genefu")
@

For computing the risk scores, estimates of the performance of the risk scores, combining the estimates and comparing the estimates we have to load the genefu and survcomp packages into the workspace. 
We also load all the packages we need to conduct the case studies. 

<<loadPackages,eval=TRUE,results='hide',message=FALSE>>=
library(genefu)
library(xtable)
library(rmeta)
library(Biobase)
library(caret)
@


%------------------------------------------------------------
\section{Load Datasets and Packages for Case Studies}
%------------------------------------------------------------ 

The following case study compares risk prediction models. 
This includes computing risk scores, computing estimates of the performance of the risk scores, 
as well as combining the estimates and comparing them.  

The five data sets that we use in the case study are publicly available as experimental data packages on Bioconductor.org. 
In particular we used:

\begin{description}
\item breastCancerMAINZ:  bioconductor.org/packages/release/data/experiment/html/breastCancerMAINZ.html
\item breastCancerUPP:  bioconductor.org/packages/release/data/experiment/html/breastCancerUPP.html
\item breastCancerUNT:  bioconductor.org/packages/release/data/experiment/html/breastCancerUNT.html
\item breastCancerNKI:  bioconductor.org/packages/release/data/experiment/html/breastCancerNKI.html
\item breastCancerTRANSBIG: bioconductor.org/packages/release/data/experiment/html/breastCancerTRANSBIG.html
\end{description}

Please Note: We don't use the breastCancerVDX experimental package in this case study since it has been used as training data set for GENIUS. 
Please refer to Haibe-Kains et al, 2010. The breastCancerVDX is found at the following link:

\begin{description}
\item breastCancerVDX:  http://www.bioconductor.org/packages/release/data/experiment/html/breastCancerVDX.html
\end{description}

These experimental data packages can be installed from Bioconductor version 2.8 or higher in R version 2.13.0 or higher. 
For the experimental data packages the commands for installing the data sets are:

<<installPackages,eval=FALSE,results='hide',message=FALSE>>=
source("http://www.bioconductor.org/biocLite.R")
biocLite("breastCancerMAINZ")
biocLite("breastCancerTRANSBIG")
biocLite("breastCancerUPP")
biocLite("breastCancerUNT")
biocLite("breastCancerNKI")
@

And to load the packages into R, please use the following commands:
<<LoadPackages,eval=TRUE,comment="##",results='hide',message=FALSE>>=
library(breastCancerMAINZ)
library(breastCancerTRANSBIG)
library(breastCancerUPP)
library(breastCancerUNT)
library(breastCancerNKI)
@

Table1 shows an overview of the data sets and the patients. 
From those 1123 breast cancer patients we selected only the patients that are node negative and didn't receive any treatment 
(except local radiotherapy), which results in 713 patients.

\begin{table}
\begin{center}
\caption{Detailed overview for the data sets used in the case study}
\label{tab1}
\begin{tabular}{ l | l*{4}{c}r }
Dataset & Patients [\#] & ER+ [\#] & HER2+ [\#] & Age [years] & Grade [1/2/3] & Platform \\ \hline
MAINZ & 200 & 155 & 23 & 25-90 & 29/136/35 & HGU133A \\
TRANSBIG & 198 & 123 & 35 & 24-60 & 30/83/83 & HGU133A \\
UPP & 251 & 175 & 46 & 28-93 & 67/128/54 & HGU133AB \\
UNT & 137 & 94 & 21 & 24-73 & 32/51/29 & HGU133AB \\
NKI & 337 & 212 & 53 & 26-62 & 79/109/149 & Agilent \\ \hline
Overall & 1123 & 759 & 178 & 25-73 & 237/507/350 & Affy/Agilent \\
\end{tabular}
\end{center}
\end{table}

Since there are duplicated patients in the five data sets, we have to identify the duplicated patients 
and we subsequently store them in a vector.
<<findDuplicatedPatients,eval=TRUE,results='hide',message=FALSE>>=
data(breastCancerData)
cinfo <- colnames(pData(mainz7g))
data.all <- c("transbig7g"=transbig7g, "unt7g"=unt7g, "upp7g"=upp7g, "mainz7g"=mainz7g, "nki7g"=nki7g)

idtoremove.all <- NULL
duplres <- NULL

## No overlaps in the MainZ and NKI datasets. 

## Focus on UNT vs UPP vs TRANSBIG
demo.all <- rbind(pData(transbig7g), pData(unt7g), pData(upp7g))
dn2 <- c("TRANSBIG", "UNT", "UPP")

## Karolinska
## Search for the VDXKIU, KIU, UPPU series
ds2 <- c("VDXKIU", "KIU", "UPPU")
demot <- demo.all[complete.cases(demo.all[ , c("series")]) & is.element(demo.all[ , "series"], ds2), ]

# Find the duplicated patients in that series
duplid <- sort(unique(demot[duplicated(demot[ , "id"]), "id"]))
duplrest <- NULL
for(i in 1:length(duplid)) {
  tt <- NULL
  for(k in 1:length(dn2)) {
    myx <- sort(row.names(demot)[complete.cases(demot[ , c("id", "dataset")]) & 
                                   demot[ , "id"] == duplid[i] & demot[ , "dataset"] == dn2[k]])
    if(length(myx) > 0) { tt <- c(tt, myx) }
  }
  duplrest <- c(duplrest, list(tt))
}
names(duplrest) <- duplid
duplres <- c(duplres, duplrest)

## Oxford
## Search for the VVDXOXFU, OXFU series
ds2 <- c("VDXOXFU", "OXFU")
demot <- demo.all[complete.cases(demo.all[ , c("series")]) & is.element(demo.all[ , "series"], ds2), ]

# Find the duplicated patients in that series
duplid <- sort(unique(demot[duplicated(demot[ , "id"]), "id"]))
duplrest <- NULL
for(i in 1:length(duplid)) {
  tt <- NULL
  for(k in 1:length(dn2)) {
    myx <- sort(row.names(demot)[complete.cases(demot[ , c("id", "dataset")]) & 
                                   demot[ , "id"] == duplid[i] & demot[ , "dataset"] == dn2[k]])
    if(length(myx) > 0) { tt <- c(tt, myx) }
  }
  duplrest <- c(duplrest, list(tt))
}
names(duplrest) <- duplid
duplres <- c(duplres, duplrest)

## Full set duplicated patients
duPL <- sort(unlist(lapply(duplres, function(x) { return(x[-1]) } )))
@


%------------------------------------------------------------
\section{Case Study : Compare Molecular Subtype Classifications}
%------------------------------------------------------------ 


We now perform molecular subtyping on each of the datasets. 
Here, we perform subtyping using the PAM50 as well as the SCMgene subtyping algorithms. 
<<CalculateMolecularSubtypes>>=
dn <- c("transbig", "unt", "upp", "mainz", "nki")
dn.platform <- c("affy", "affy", "affy", "affy", "agilent")
res <- ddemo.all <- ddemo.coln <- NULL

for(i in 1:length(dn)) {

  ## load dataset
  dd <- get(data(list=dn[i]))
  #Remove duplicates identified first
  message("obtained dataset!")
  
  #Extract expression set, pData, fData for each dataset
  ddata <- t(exprs(dd)) 

  ddemo <- phenoData(dd)@data
  
  if(length(intersect(rownames(ddata),duPL))>0)
  {
  ddata<-ddata[-which(rownames(ddata) %in% duPL),]
  ddemo<-ddemo[-which(rownames(ddemo) %in% duPL),]
  }
  
  dannot <- featureData(dd)@data

  #MOLECULAR SUBTYPING
  #Perform subtyping using scmod2.robust
  # scmod2.robust: List of parameters defining the subtype clustering model 
  # (as defined by Wirapati et al)
#   SubtypePredictions<-subtype.cluster.predict(sbt.model=scmod2.robust,
#                                               data=ddata,annot=dannot,do.mapping=TRUE,verbose=TRUE)
   
  SubtypePredictions<-molecular.subtyping(sbt.model = "scmod2",data = ddata,
                                          annot = dannot,do.mapping = TRUE)
  
  #Get sample counts pertaining to each subtype
  table(SubtypePredictions$subtype)
  #Select samples pertaining to Basal Subtype
  Basals<-names(which(SubtypePredictions$subtype == "ER-/HER2-"))
  #Select samples pertaining to HER2 Subtype
  HER2s<-names(which(SubtypePredictions$subtype == "HER2+"))
  #Select samples pertaining to Luminal Subtypes
  LuminalB<-names(which(SubtypePredictions$subtype == "ER+/HER2- High Prolif"))
  LuminalA<-names(which(SubtypePredictions$subtype == "ER+/HER2- Low Prolif"))
  #ASSIGN SUBTYPES TO EVERY SAMPLE, ADD TO THE EXISTING PHENODATA
  ddemo$SCMGENE<-SubtypePredictions$subtype
  ddemo[LuminalB,]$SCMGENE<-"LumB"
  ddemo[LuminalA,]$SCMGENE<-"LumA"
  ddemo[Basals,]$SCMGENE<-"Basal"
  ddemo[HER2s,]$SCMGENE<-"Her2"
 
  # Perform subtyping using PAM50
  # Matrix should have samples as ROWS, genes as COLUMNS
  #rownames(dannot)<-dannot$probe<-dannot$EntrezGene.ID
 
  PAM50Preds<-molecular.subtyping(sbt.model = "pam50",data=ddata,
                                        annot=dannot,do.mapping=TRUE)
#   PAM50Preds<-intrinsic.cluster.predict(sbt.model=pam50,data=ddata,
#                                         annot=dannot,do.mapping=TRUE,verbose=TRUE)
  table(PAM50Preds$subtype)
  ddemo$PAM50<-PAM50Preds$subtype
  LumA<-names(PAM50Preds$subtype)[which(PAM50Preds$subtype == "LumA")]
  LumB<-names(PAM50Preds$subtype)[which(PAM50Preds$subtype == "LumB")]
  ddemo[LumA,]$PAM50<-"LumA"
  ddemo[LumB,]$PAM50<-"LumB"
  
  ddemo.all <- rbind(ddemo, ddemo.all)
}

@

We can compare the performance of both molecular subtyping methods 
and determine how concordant subtype predictions are across the global population.
We first generate a confusion matrix of the subtype predictions.
<<CompareMolecularSubtypesByConfusionMatrix>>=
# Obtain the subtype prediction counts for PAM50
table(ddemo.all$PAM50)
Normals<-rownames(ddemo.all[which(ddemo.all$PAM50 == "Normal"),])

# Obtain the subtype prediction counts for SCMGENE
table(ddemo.all$SCMGENE)

ddemo.all$PAM50<-as.character(ddemo.all$PAM50)
# We compare the samples that are predicted as pertaining to a molecular subtyp
# We ignore for now the samples that predict as 'Normal' by PAM50
confusionMatrix(ddemo.all[-which(rownames(ddemo.all) %in% Normals),]$SCMGENE,
                ddemo.all[-which(rownames(ddemo.all) %in% Normals),]$PAM50)
@

From these results, the concordance of the predictions between these models is around 83 percent. 

We can also compare the survival of patients for each subtype. 
We plot the surival curves of patients by subtype, 
based on each molecular classification algorithm
<<CompareSurvivalBySubtypes>>=
# http://www.inside-r.org/r-doc/survival/survfit.coxph
library(survival)
ddemo<-ddemo.all
data.for.survival.SCMGENE <- ddemo[,c("e.os", "t.os", "SCMGENE","age")]
data.for.survival.PAM50 <- ddemo[,c("e.os", "t.os", "PAM50","age")]
# Remove patients with missing survival information
data.for.survival.SCMGENE <- data.for.survival.SCMGENE[complete.cases(data.for.survival.SCMGENE),]
data.for.survival.PAM50 <- data.for.survival.PAM50[complete.cases(data.for.survival.PAM50),]

days.per.month <- 30.4368
days.per.year <- 365.242

data.for.survival.PAM50$months_to_death <- data.for.survival.PAM50$t.os / days.per.month
data.for.survival.PAM50$vital_status <- data.for.survival.PAM50$e.os == "1"
surv.obj.PAM50 <- survfit(Surv(data.for.survival.PAM50$months_to_death, 
                               data.for.survival.PAM50$vital_status) ~ data.for.survival.PAM50$PAM50)

data.for.survival.SCMGENE$months_to_death <- data.for.survival.SCMGENE$t.os / days.per.month
data.for.survival.SCMGENE$vital_status <- data.for.survival.SCMGENE$e.os == "1"
surv.obj.SCMGENE <- survfit(Surv(
  data.for.survival.SCMGENE$months_to_death, 
  data.for.survival.SCMGENE$vital_status) ~ data.for.survival.SCMGENE$SCMGENE)

message("KAPLAN-MEIR CURVE - USING PAM50")
# survMisc::autoplot(surv.obj.PAM50, title="Survival curves PAM50", censSize=0)$plot + 
#   scale_colour_manual(name="Strata", values=c("black", "green", "blue", "red"))

plot(main = "Surival Curves PAM50", surv.obj.PAM50,
     col =c("#006d2c", "#8856a7","#a50f15", "#08519c", "#000000"),lty = 1,lwd = 3,
     xlab = "Time (months)",ylab = "Survival")
legend("topright",
       fill = c("#006d2c", "#8856a7","#a50f15", "#08519c", "#000000"),
       legend = c("Basal","Her2","LumA","LumB","Normal"),bty = "n")

message("KAPLAN-MEIR CURVE - USING SCMGENE")
# survMisc::autoplot(surv.obj.SCMGENE, title="Survival curves SCMGENE", censSize=0)$plot + 
#   scale_colour_manual(name="Strata", values=c("black", "green", "blue"))

plot(main = "Surival Curves SCMGENE", surv.obj.SCMGENE,
     col =c("#006d2c", "#8856a7","#a50f15", "#08519c"),lty = 1,lwd = 3,
     xlab = "Time (months)",ylab = "Survival")
legend("topright",
       fill = c("#006d2c", "#8856a7","#a50f15", "#08519c"),
       legend = c("Basal","Her2","LumA","LumB"),bty = "n")

## GENERATE A OVERLAYED PLOT OF SURVIVAL CURVES
message("Overlayed Surival Plots based on PAM50 and SCMGENE")
                          ## Basal    Her2        LuminalA  LuminalB   Normal
plot(surv.obj.PAM50,col =c("#006d2c", "#8856a7","#a50f15", "#08519c", "#000000"),lty = 1,lwd = 3,
     xlab = "Time (months)",ylab = "Survival",ymin = 0.2)
legend("topright",
       fill = c("#006d2c", "#8856a7","#a50f15", "#08519c", "#000000"),
       legend = c("Basal","Her2","LumA","LumB","Normal"),bty = "n")

par(new=TRUE)
                            ## Basal    Her2        LuminalA  LuminalB
lines(surv.obj.SCMGENE,col =c("#006d2c", "#8856a7","#a50f15", "#08519c"),lwd=2,lty=5)
legend("bottomright",c("PAM50","SCMGENE"),lty=c("solid", "dashed"))

@

We can now compare which of the molecular subtyping algorithms is more prognostic.
To do this we use a Cross-validated Partial Likelihood (cvpl) calculation from survcomp. 
This returns the mean cross-validated partial likelihood, for each algorithm, using molecular subtypes for stratification
<<CalculatedCVPL>>=
set.seed(12345)

PAM5_CVPL<-cvpl(x=data.for.survival.PAM50$age, 
                surv.time=data.for.survival.PAM50$months_to_death, 
                surv.event=data.for.survival.PAM50$vital_status, 
                strata=as.integer(factor(data.for.survival.PAM50$PAM50)), 
                nfold=10, setseed=54321)$cvpl

SCMEGENE_CVPL<-cvpl(x=data.for.survival.SCMGENE$age, 
                    surv.time=data.for.survival.SCMGENE$months_to_death, 
                    surv.event=data.for.survival.SCMGENE$vital_status, 
                    strata=as.integer(factor(data.for.survival.SCMGENE$SCMGENE)), 
                    nfold=10, setseed=54321)$cvpl

print.data.frame(data.frame(cbind(PAM5_CVPL,SCMEGENE_CVPL)))
@

\newpage

%------------------------------------------------------------
\section{Case Study : Comparing risk prediction models}
%------------------------------------------------------------ 

We compute the risk scores using the following list of algorithms (and corresponding genefu functions):

\begin{description}
\item Subtype Clustering Model using just the AURKA gene: scmgene.robust()
\item Subtype Clustering Model using just the ESR1 gene: scmgene.robust()
\item Subtype Clustering Model using just the ERBB2 gene: scmgene.robust()
\item NPI: npi()
\item GGI: ggi()
\item GENIUS: genius()
\item EndoPredict: endoPredict()
\item OncotypeDx: oncotypedx()
\item TamR: tamr()
\item GENE70: gene70()
\item PIK3CA: pik3cags()
\item rorS: rorS()
\end{description}

<<computeRiskScore>>=
dn <- c("transbig", "unt", "upp", "mainz", "nki")
dn.platform <- c("affy", "affy", "affy", "affy", "agilent")

res <- ddemo.all <- ddemo.coln <- NULL
for(i in 1:length(dn)) {

  ## load dataset
  dd <- get(data(list=dn[i]))

  #Extract expression set, pData, fData for each dataset
  ddata <- t(exprs(dd)) 
  ddemo <- phenoData(dd)@data
  dannot <- featureData(dd)@data
  ddemo.all <- c(ddemo.all, list(ddemo))
  if(is.null(ddemo.coln)) 
  { ddemo.coln <- colnames(ddemo) } else 
  { ddemo.coln <- intersect(ddemo.coln, colnames(ddemo)) }
  rest <- NULL
  
  ## AURKA
  ## if affy platform consider the probe published in Desmedt et al., CCR, 2008
  if(dn.platform[i] == "affy") { domap <- FALSE } else { domap <- TRUE }
  modt <- scmgene.robust$mod$AURKA
  ## if agilent platform consider the probe published in Desmedt et al., CCR, 2008
  if(dn.platform[i] == "agilent") {
    domap <- FALSE
    modt[ , "probe"] <- "NM_003600"
  }
  rest <- cbind(rest, "AURKA"=sig.score(x=modt, data=ddata, annot=dannot, do.mapping=domap)$score)

  ## ESR1
  ## if affy platform consider the probe published in Desmedt et al., CCR, 2008
  if(dn.platform[i] == "affy") { domap <- FALSE } else { domap <- TRUE }
  modt <- scmgene.robust$mod$ESR1
  ## if agilent platform consider the probe published in Desmedt et al., CCR, 2008
  if(dn.platform[i] == "agilent") {
    domap <- FALSE
    modt[ , "probe"] <- "NM_000125"
  }
  rest <- cbind(rest, "ESR1"=sig.score(x=modt, data=ddata, annot=dannot, do.mapping=domap)$score)
  
  ## ERBB2
  ## if affy platform consider the probe published in Desmedt et al., CCR, 2008
  if(dn.platform[i] == "affy") { domap <- FALSE } else { domap <- TRUE }
  modt <- scmgene.robust$mod$ERBB2
  ## if agilent platform consider the probe published in Desmedt et al., CCR, 2008
  if(dn.platform[i] == "agilent") {
    domap <- FALSE
    modt[ , "probe"] <- "NM_004448"
  }
  rest <- cbind(rest, "ERBB2"=sig.score(x=modt, data=ddata, annot=dannot, do.mapping=domap)$score)
  
  ## NPI
  ss <- ddemo[ , "size"]
  gg <- ddemo[ , "grade"]
  nn <- rep(NA, nrow(ddemo))
  nn[complete.cases(ddemo[ , "node"]) & ddemo[ , "node"] == 0] <- 1 
  nn[complete.cases(ddemo[ , "node"]) & ddemo[ , "node"] == 1] <- 3
  names(ss) <- names(gg) <- names(nn) <- rownames(ddemo)
  rest <- cbind(rest, "NPI"=npi(size=ss, grade=gg, node=nn, na.rm=TRUE)$score)

  ## GGI
  if(dn.platform[i] == "affy") { domap <- FALSE } else { domap <- TRUE }
  rest <- cbind(rest, "GGI"=ggi(data=ddata, annot=dannot, do.mapping=domap)$score)
  
  ## GENIUS
  if(dn.platform[i] == "affy") { domap <- FALSE } else { domap <- TRUE }
  rest <- cbind(rest, "GENIUS"=genius(data=ddata, annot=dannot, do.mapping=domap)$score)
  
  ## ENDOPREDICT
  if(dn.platform[i] == "affy") { domap <- FALSE } else { domap <- TRUE }
  rest <- cbind(rest, "EndoPredict"=endoPredict(data=ddata, annot=dannot, do.mapping=domap)$score)
  
  # OncotypeDx
  if(dn.platform[i] == "affy") { domap <- FALSE } else { domap <- TRUE }
  rest <- cbind(rest, "OncotypeDx"=oncotypedx(data=ddata, annot=dannot, do.mapping=domap)$score)
  
  ## TamR
  # Note: risk is not implemented, the function will return NA values
  if(dn.platform[i] == "affy") { domap <- FALSE } else { domap <- TRUE }
  rest <- cbind(rest, "TAMR13"=tamr13(data=ddata, annot=dannot, do.mapping=domap)$score)

  ## GENE70
  # Need to do mapping for Affy platforms because this is based on Agilent. 
  # Hence the mapping rule is reversed here! 
  if(dn.platform[i] == "affy") { domap <- TRUE } else { domap <- FALSE }
  rest <- cbind(rest, "GENE70"=gene70(data=ddata, annot=dannot, std="none",do.mapping=domap)$score)
  
  ## Pik3cags
  if(dn.platform[i] == "affy") { domap <- FALSE } else { domap <- TRUE }
  rest <- cbind(rest, "PIK3CA"=pik3cags(data=ddata, annot=dannot, do.mapping=domap))
  
  ## rorS
  # Uses the pam50 algorithm. Need to do mapping for both Affy and Agilent
  rest <- cbind(rest, "rorS"=rorS(data=ddata, annot=dannot, do.mapping=TRUE)$score)
  
  ## GENE76
  # Mainly designed for Affy platforms. Has been excluded here
  
  # BIND ALL TOGETHER
  res <- rbind(res, rest)
}
names(ddemo.all) <- dn
@

For further analysis and handling of the data we store all information in one object. 
We also remove the duplicated patients from the analysis and take only those patients into account, 
that have complete information for nodal, survival and treatment status.
<<simplifyAndRemoveDuplicatePatients>>=
ddemot <- NULL
for(i in 1:length(ddemo.all)) {
  ddemot <- rbind(ddemot, ddemo.all[[i]][ , ddemo.coln, drop=FALSE])
}
res[complete.cases(ddemot[ ,"dataset"]) & ddemot[ ,"dataset"] == "VDX", "GENIUS"] <- NA

## select only untreated node-negative patients with all risk predictions
## ie(incomplete cases (where risk prediction may be missing for a sample) are subsequently removed))
# Note that increasing the number of risk prediction analyses 
# may increase the number of incomplete cases
# In the pconcordance indexevious vignette for genefu version1, we were only testing 4 risk predictors, 
# so we had a total of 722 complete cases remaining
# Here, we are now testing 12 risk predictors, so we only have 713 complete cases remaining. 
# The difference of 9 cases between the two versions are all from the NKI dataset. 
myx <- complete.cases(res, ddemot[ , c("node", "treatment")]) & 
  ddemot[ , "treatment"] == 0 & ddemot[ , "node"] == 0 & !is.element(rownames(ddemot), duPL)

res <- res[myx, , drop=FALSE]
ddemot <- ddemot[myx, , drop=FALSE]
@

To compare the risk score performances, we compute the concordance index\footnote{The same analysis could be performed with D index and hazard ratio by using the functions \Rfunction{D.index} and \Rfunction{hazard.ratio} from the \Rpackage{survcomp} package respectively}, which is the probability that, for a pair of randomly chosen comparable samples, the sample with the higher risk prediction will experience an event before the other sample or belongs to a higher binary class.

<<cindexComputation>>=
cc.res <- complete.cases(res)
datasetList <- c("MAINZ","TRANSBIG","UPP","UNT","NKI")
riskPList <- c("AURKA","ESR1","ERBB2","NPI", "GGI", "GENIUS", 
               "EndoPredict","OncotypeDx","TAMR13","GENE70","PIK3CA","rorS")
setT <- setE <- NULL
resMatrix <- as.list(NULL)

for(i in datasetList)
{
  dataset.only <- ddemot[,"dataset"] == i 
  patientsAll <- cc.res & dataset.only
	
  ## set type of available survival data
  if(i == "UPP") {
    setT <- "t.rfs"
    setE <- "e.rfs"
  } else {
    setT <- "t.dmfs"
    setE <- "e.dmfs"
  }

  # Calculate cindex computation for each predictor
  for (Dat in riskPList)
  {
    cindex <- t(apply(X=t(res[patientsAll,Dat]), MARGIN=1, function(x, y, z) {
    tt <- concordance.index(x=x, surv.time=y, surv.event=z, method="noether", na.rm=TRUE);
    return(c("cindex"=tt$c.index, "cindex.se"=tt$se, "lower"=tt$lower, "upper"=tt$upper)); },
    y=ddemot[patientsAll,setT], z=ddemot[patientsAll, setE]))
    
    resMatrix[[Dat]] <- rbind(resMatrix[[Dat]], cindex)
  }
}
@


Using a random-effects model we combine the dataset-specific performance estimated into overall estimates for each risk prediction model:
<<combineEstimations>>=
for(i in names(resMatrix)){
  #Get a meta-estimate
  ceData <- combine.est(x=resMatrix[[i]][,"cindex"], x.se=resMatrix[[i]][,"cindex.se"], hetero=TRUE) 
  cLower <- ceData$estimate + qnorm(0.025, lower.tail=TRUE) * ceData$se
  cUpper <- ceData$estimate + qnorm(0.025, lower.tail=FALSE) * ceData$se
	
  cindexO <- cbind("cindex"=ceData$estimate, "cindex.se"=ceData$se, "lower"=cLower, "upper"=cUpper)
  resMatrix[[i]] <- rbind(resMatrix[[i]], cindexO)
  rownames(resMatrix[[i]]) <- c(datasetList, "Overall")
}
@


In order to compare the different risk prediction models we compute one-sided p-values of the meta-estimates:
<<computePValues>>=
pv <- sapply(resMatrix, function(x) { return(x["Overall", c("cindex","cindex.se")]) })
pv <- apply(pv, 2, function(x) { return(pnorm((x[1] - 0.5) / x[2], lower.tail=x[1] < 0.5)) })
printPV <- matrix(pv,ncol=length(names(resMatrix)))
rownames(printPV) <- "P-value"
colnames(printPV) <- names(pv)
printPV<-t(printPV)
@

And print the table of P-values:
<<printPvalue,results="asis">>=
xtable(printPV, digits=c(0, -1))
@


The following figures represent the risk score performances measured 
by the concordance index each of the prognostic predictors. 
<<forestplotDatasets,echo=TRUE>>=
RiskPList <- c("AURKA","ESR1","ERBB2","NPI", "GGI", "GENIUS", 
               "EndoPredict","OncotypeDx","TAMR13","GENE70","PIK3CA","rorS")
datasetListF <- c("MAINZ","TRANSBIG","UPP","UNT","NKI", "Overall")
myspace <- "   "
par(mfrow=c(2,2))
  for (RP in RiskPList)
  {
    
  #<<forestplotDat,fig=TRUE>>=
  ## Forestplot
  tt <- rbind(resMatrix[[RP]][1:5,],
            "Overall"=resMatrix[[RP]][6,])
  
  tt <- as.data.frame(tt)
  labeltext <- (datasetListF)
  
  r.mean <- c(tt$cindex)
  r.lower <- c(tt$lower)
  r.upper <- c(tt$upper)
  
  metaplot.surv(mn=r.mean, lower=r.lower, upper=r.upper, labels=labeltext, xlim=c(0.4,0.9), 
                boxsize=0.5, zero=0.5, 
                col=meta.colors(box="royalblue",line="darkblue",zero="firebrick"), 
                main=paste(RP))
    
  }
#@
#
@


We can also represent the overall estimates across all prognostic predictors, across all the datasets. 
<<forestplotOverall,echo=TRUE>>=
## Overall Forestplot
mybigspace <- "       "
tt <- rbind("OverallA"=resMatrix[["AURKA"]][6,],
            "OverallE1"=resMatrix[["ESR1"]][6,],
            "OverallE2"=resMatrix[["ERBB2"]][6,],
            "OverallN"=resMatrix[["NPI"]][6,],
          "OverallM"=resMatrix[["GGI"]][6,],
          "OverallG"=resMatrix[["GENIUS"]][6,],
          "OverallE3"=resMatrix[["EndoPredict"]][6,],
          "OverallOD"=resMatrix[["OncotypeDx"]][6,],
          "OverallT"=resMatrix[["TAMR13"]][6,],
          "OverallG70"=resMatrix[["GENE70"]][6,],
          "OverallP"=resMatrix[["PIK3CA"]][6,],
          "OverallR"=resMatrix[["rorS"]][6,]
          )

tt <- as.data.frame(tt)
labeltext <- cbind(c("Risk Prediction","AURKA","ESR1","ERBB2","NPI",
                     "GGI","GENIUS","EndoPredict","OncotypeDx","TAMR13","GENE70","PIK3CA","rorS"))

r.mean <- c(NA,tt$cindex)
r.lower <- c(NA,tt$lower)
r.upper <- c(NA,tt$upper)

metaplot.surv(mn=r.mean, lower=r.lower, upper=r.upper, labels=labeltext, xlim=c(0.45,0.75), 
              boxsize=0.5, zero=0.5, 
              col=meta.colors(box="royalblue",line="darkblue",zero="firebrick"), 
              main="Overall Concordance Index")
@

In order to assess the difference between the risk scores, 
we compute the concordance indices with their p-values and compare the estimates 
with the \Rfunction{cindex.comp.meta} with a paired student t test.
<<computeCindexWithPvalue>>=
cc.res <- complete.cases(res)
datasetList <- c("MAINZ","TRANSBIG","UPP","UNT","NKI")
riskPList <- c("AURKA","ESR1","ERBB2","NPI","GGI","GENIUS",
               "EndoPredict","OncotypeDx","TAMR13","GENE70","PIK3CA","rorS")
setT <- setE <- NULL
resMatrixFull <- as.list(NULL)

for(i in datasetList)
{
  dataset.only <- ddemot[,"dataset"] == i
  patientsAll <- cc.res & dataset.only
	
  ## set type of available survival data
  if(i == "UPP") {
    setT <- "t.rfs"
    setE <- "e.rfs"
  } else {
    setT <- "t.dmfs"
    setE <- "e.dmfs"
  }
	
  ## cindex and p-value computation per algorithm
  for (Dat in riskPList)
  {
    cindex <- t(apply(X=t(res[patientsAll,Dat]), MARGIN=1, function(x, y, z) {
    tt <- concordance.index(x=x, surv.time=y, surv.event=z, method="noether", na.rm=TRUE);
    return(tt); },
    y=ddemot[patientsAll,setT], z=ddemot[patientsAll, setE]))
    
    resMatrixFull[[Dat]] <- rbind(resMatrixFull[[Dat]], cindex)
  }
}

for(i in names(resMatrixFull)){
  rownames(resMatrixFull[[i]]) <- datasetList
}

ccmData <- tt <- rr <- NULL
for(i in 1:length(resMatrixFull)){
  tt <- NULL
  for(j in 1:length(resMatrixFull)){
    if(i != j) { rr <- cindex.comp.meta(list.cindex1=resMatrixFull[[i]], 
                                        list.cindex2=resMatrixFull[[j]], hetero=TRUE)$p.value } 
    else { rr <- 1 }
    tt <- cbind(tt, rr)
  }
  ccmData <- rbind(ccmData, tt)
}
ccmData <- as.data.frame(ccmData)
colnames(ccmData) <- riskPList
rownames(ccmData) <- riskPList
@


Table 2 displays the for multiple testing uncorrected p-values for the comparison of the different methods:
<<printCCM,results="asis">>=
xtable(ccmData, digits=c(0, rep(-1,ncol(ccmData))), size="footnotesize")
@



We correct the p-value with Holms method:
<<computeCCMPval>>=
ccmDataPval <- matrix(p.adjust(data.matrix(ccmData), method="holm"),ncol=length(riskPList),
                      dimnames=list(rownames(ccmData),colnames(ccmData)))
@


Table 3 displays the corrected p-values:
<<printCCMPval,results='asis'>>=
xtable(ccmDataPval, digits=c(0, rep(-1,ncol(ccmDataPval))), size="footnotesize")
@



%------------------------------------------------------------
%------------------------------------------------------------
%%% CODE Stop
%------------------------------------------------------------
%------------------------------------------------------------

\newpage

%------------------------------------------------------------
\section{References}
%------------------------------------------------------------ 
The following is a list of publications that have cited genefu (Version 1) in the past.


\subsection*{Where genefu was used in subtyping}

\begin{description}
\item Larsen, M.J. et al., 2014. Microarray-Based RNA Profiling of Breast Cancer: Batch Effect Removal Improves Cross-Platform Consistency. BioMed Research International, 2014, pp.1-11.
\item Miller, T.W. et al., 2011. A gene expression signature from human breast cancer cells with acquired hormone independence identifies MYC as a mediator of antiestrogen resistance. Clinical cancer research : an official journal of the American Association for Cancer Research, 17(7), pp.2024-2034.
\item Karn, T. et al., 2011. Homogeneous Datasets of Triple Negative Breast Cancers Enable the Identification of Novel Prognostic and Predictive Signatures S. Ranganathan, ed. PloS one, 6(12), p.e28403.
\end{description}


\subsection*{Where genefu was used in Comparing Subtyping Schemes}

\begin{description}
\item Haibe-Kains, B. et al., 2012. A three-gene model to robustly identify breast cancer molecular subtypes. Journal of the National Cancer Institute, 104(4), pp.311-325.
\item Curtis, C. et al., 2012. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 
\item Balko, J.M. et al., 2012. Profiling of residual breast cancers after neoadjuvant chemotherapy identifies DUSP4 deficiency as a mechanism of drug resistance. Nature medicine, 18(7), pp.1052-1059.
\item Paquet, E.R. and Hallett, M.T., 2015. Absolute Assignment of Breast Cancer Intrinsic Molecular Subtype. Journal of the National Cancer Institute, 107(1), pp.dju357-dju357.
\item Patil, P. et al., 2015. Test set bias affects reproducibility of gene signatures. Bioinformatics, p.btv157.
\end{description}


\subsection*{Where genefu was used to Compute Prognostic gene signature scores}

\begin{description}
\item Haibe-Kains, B. et al., 2008. A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics, 24(19), pp.2200-2208.
\item Haibe-Kains, B. et al., 2010. A fuzzy gene expression-based computational approach improves breast cancer prognostication. Genome biology, 11(2), p.R18.
\item Madden, S.F. et al., 2013. BreastMark: An Integrated Approach to Mining Publicly Available Transcriptomic Datasets Relating to Breast Cancer Outcome. Breast Cancer Research, 15(4), p.R52.
\item Fumagalli, D. et al., 2014. Transfer of clinically relevant gene expression signatures in breast cancer: from Affymetrix microarray to Illumina RNA-Sequencing technology. BMC genomics, 15(1), p.1008.
\item Beck A.H. et al., 2013. Significance Analysis of Prognostic Signatures. PLoS Computational Biology, 9(1), e1002875.
\end{description}


\subsection*{As well as other publications}

\begin{description}
\item APOBEC3B expression in breast cancer reflects cellular proliferation, while a deletion polymorphism is associated with immune activation.Cescon DW, Haibe-Kains B, Mak TW. Proc Natl Acad Sci U S A. 2015 Mar 3;112(9):2841-6. doi: 10.1073/pnas.1424869112. Epub 2015 Feb 17. PMID: 25730878 
\item Radovich M. et al., 2014. Characterizing the heterogeneity of triple-negative breast cancers using microdissected normal ductal epithelium and RNA-sequencing. Breast cancer research and treatment, 143(1), pp.57-68. 
\item Tramm T. et al., 2014. Relationship between the prognostic and predictive value of the intrinsic subtypes and a validated gene profile predictive of loco-regional control and benefit from post-mastectomy radiotherapy in patients with high-risk breast cancer. Acta Oncologica 53(10), pp.1337-1346.
\item Doan, T.B. et al., 2014. Breast cancer prognosis predicted by nuclear receptor-coregulator networks. Molecular oncology 8(5), pp.998-1013. 
\end{description}


%------------------------------------------------------------
\section{Session Info}
%------------------------------------------------------------ 
<<sessionInfo,results='asis'>>=
toLatex(sessionInfo())
@

\end{document}