%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Luigi Marchionni, Wikum Dinalankara and Bahman Afsari %%% September 22 2016 %%% Baltimore %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % \VignetteIndexEntry{Working with the switchBox package} % \VignetteKeywords{Classification, Prediction, Microarray, Cancer, RNAExpressionData} % \VignettePackage{switchBox} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Begin Document \documentclass[12pt]{article} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Preamble % \input{preamble} %%% Additional packages \usepackage{Sweave} \usepackage{times} \usepackage[colorlinks=TRUE, urlcolor=blue, citecolor=blue]{hyperref} \usepackage{color} \usepackage[usenames, dvipsnames]{xcolor} % \usepackage{authblk} % \usepackage{fullpage} %%% Sweave options \SweaveOpts{prefix.string=plots, eps=FALSE,echo=TRUE, keep.source=TRUE} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% New commands for R stuff \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioc}{\software{Bioconductor}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% fancy Sweave \DefineVerbatimEnvironment{Sinput}{Verbatim}{xleftmargin=1em, fontshape=sl, formatcom=\color{MidnightBlue}, fontsize=\scriptsize} %\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=1em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=1em,fontshape=sl,formatcom=\color{OliveGreen}, fontsize=\scriptsize} %\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=1em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=1em,fontshape=sl,formatcom=\color{BrickRed}, fontsize=\scriptsize} %\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=1em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Title \title{Training and testing a K-Top-Scoring-Pair (KTSP) classifier with switchBox.} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Authors and \author{Bahman Afsari, Luigi Marchionni and Wikum Dinalankara\\\\ %%% Affiliations The Sidney Kimmel Comprehensive Cancer Center,\\ Johns Hopkins University School of Medicine } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Date \date{Modified: June 20, 2014. Compiled: \today} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Document \begin{document} \SweaveOpts{concordance=TRUE} \setlength{\parskip}{0.2\baselineskip} \setlength{\parindent}{0pt} \setkeys{Gin}{width=\textwidth} \maketitle \tableofcontents <>= options(width=85) options(continue=" ") rm(list=ls()) @ %\newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{ {\Large \bf {Introduction}} } The \Rpackage{switchBox} package allows to train and validate a K-Top-Scoring-Pair (KTSP) classifier, as used by Marchionni et al in \cite{Marchionni:2013aa}. KTSP is an extension of the TSP classifier described by Geman and colleagues \cite{Geman:2004aa,Tan:2005aa,Xu:2005aa}. The TSP algorithm is a simple binary classifier based on the ordering of two measurements.Basing the prediction solely on the ordering of a small number of features (e.g. gene expressions), known as ranked based methodology, seems a promising approach to to build robust classifiers to data normalization and rise to more transparent decision rules. The first and simplest of such methodologies, the Top-Scoring Pair ({\em TSP}) classifier, was introduced in \cite{Geman:2004aa} and is based on reversal of two features (e.g. the expressions of two genes). Multiple extensions were proposed afterwards, e.g. \cite{Tan:2005aa} and many of these extensions have been successfully applied for diagnosis and prognosis of cancer such as recurrence of breast cancer in \cite{Marchionni:2013aa}. A popular successor of {\em TSP} classifiers is {\em kTSP} (\cite{Tan:2005aa}), which applies the majority voting among multiple of the the reversal of pairs of features. In addition to being applied by peer scientists, {\em kTSP} shown its power by wining the ICMLA the challenge for cancer classification in the presence of other competitive methods such as Support Vector Machines (\cite{ICMLA08}). %\textcolor{red}{ADD about TSP and KTSP} kTSP decision is based on $k$ feature (e.g. gene) pairs, say, \(\Theta=\{(i_1,j_1),\dots,(i_k,j_k)\}\). If we denote the feature profile with \(\underline{X}=(X_1,X_2,\dots)\), the family of rank based classifiers is an aggregation of the comparisons \(X_{i_l}\tau\}\) provided the labels \(Y \in \{0,1\}\). The standard threshold is \(\tau=0\). The only parameters required for calculating \(\kappa\) is the feature pairs. Usually, disjoint feature pairs are desirable because an outlier feature value cannot heavily influence the decision. In the introductory paper to kTSP (\cite{kTSPPaper}), the authors proposed an ad-hoc method for feature selection. This method was based on score for each pair of features which measures how discriminative is a comparison of the feature values. If we denote the score related to the gene \(i\) and \(j\) by \(s_{ij}\), then the score was defined as \[s_{ij}=|P(X_i>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("switchBox") @ Load the library. <>= require(switchBox) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Data structure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Training set} Load the example training data contained in the \Rpackage{switchBox} package. <>= ### Load the example data for the TRAINING set data(trainingData) @ The object \Robject{matTraining} is a numeric matrix containing gene expression data for the 78 breast cancer patients and the 70 genes used to implement the MammaPrint assay \cite{Glas:2006aa}. This data was obtained from from the \Rpackage{MammaPrintData} package, as described in \cite{Marchionni:2013aa}. Samples are stored by column and genes by row. Gene annotation is stored as \Robject{rownames(matTraining)}. <>= class(matTraining) dim(matTraining) str(matTraining) @ The factor \Robject{trainingGroup} contains the prognostic information: <>= ### Show group variable for the TRAINING set table(trainingGroup) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Testing set} Load the example testing data contained in the \Rpackage{switchBox} package. <>= ### Load the example data for the TEST set data(testingData) @ The object \Robject{matTesting} is a numeric matrix containing gene expression data for the 307 breast cancer patients and the 70 genes used to validate the MammaPrint assay \cite{Buyse:2006aa}. This data was obtained from from the \Rpackage{MammaPrintData} package, as described in \cite{Marchionni:2013aa}. Also in this case samples are stored by column and genes by row. Gene annotation is stored as \Robject{rownames(matTraining)}. <>= class(matTesting) dim(matTesting) str(matTesting) @ The factor \Robject{testingGroup} contains the prognostic information: <>= ### Show group variable for the TEST set table(testingGroup) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Training KTSP algorithm} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Unrestricted KTSP classifiers} We can train the KTSP algoritm using all possible feature pairs -- unrestricted KTSP classifier -- with or without statistical feature filtering, using the \Rfunction{SWAP.Train.KTSP} function. Note that \Rfunction{SWAP.KTSP.Train} is deprecated and maintained only for legacy reasons. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Default statistical filtering} Training an unrestricted KTSP predictor using a statistical feature filtering is the default and it is achieved by using the default parameters, as follows: <>= ### The arguments to the "SWAP.Train.KTSP" function args(SWAP.Train.KTSP) ### Train a classifier using default filtering function based on the Wilcoxon test classifier <- SWAP.Train.KTSP(matTraining, trainingGroup, krange=c(3:15)) ### Show the classifier classifier ### Extract the TSP from the classifier classifier$TSPs @ Below is shown the way the default feature filtering works. The \Rfunction{SWAP.Filter.Wilcoxon} function takes the phenotype factor, the predictor data, the number of feature to be returned, and a logical value to decide whether to include equal number of featured positively and negatively associated with the phenotype to be predicted. <>= ### The arguments to the "SWAP.Train.KTSP" function args(SWAP.Filter.Wilcoxon) ### Retrieve the top best 4 genes using default Wilcoxon filtering ### Note that there are ties SWAP.Filter.Wilcoxon(trainingGroup, matTraining, featureNo=4) @ Train a classifier using the \Rfunction{SWAP.Filter.Wilcoxon} filtering function. <>= ### Train a classifier from the top 4 best genes ### according to Wilcoxon filtering function classifier <- SWAP.Train.KTSP(matTraining, trainingGroup, FilterFunc=SWAP.Filter.Wilcoxon, featureNo=4) ### Show the classifier classifier @ Train a classifier using all possible features: <>= ### To use all features "FilterFunc" must be set to NULL classifier <- SWAP.Train.KTSP(matTraining, trainingGroup, FilterFunc=NULL) ### Show the classifier classifier @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Altenative filtering methods} Training can also be achieved using alternative filtering methods. These methods can be specified by passing a different filtering function to \Rfunction{SWAP.Train.KTSP}. These functions should use th \Rfunarg{phenoGroup}, \Rfunarg{inputData} arguments, as well as any other necessary argument (passed using \Rfunarg{...}), as shown below. For instance, we can define an alternative filtering function selecting 10 random features. <>= ### An alternative filtering function selecting 20 random features random10 <- function(situation, data) { sample(rownames(data), 10) } random10(trainingGroup, matTraining) @ Below is a more realistic example of an alternative filtering function. In this case we use the {\R } \Rfunction{t.test} function to select the features with an absolute t-statistics larger than a specified quantile. <>= ### An alternative filtering function based on a t-test topRttest <- function(situation, data, quant = 0.75) { out <- apply(data, 1, function(x, ...) t.test(x ~ situation)$statistic ) names(out[ abs(out) > quantile(abs(out), quant) ]) } ### Show the top 5% features using the newly defined filtering function topRttest(trainingGroup, matTraining, quant=0.95) @ Train a classifier using the alternative filtering function based on the t-test and also define the max number of TSP using \Rfunarg{krange}. <>= ### Train with t-test and krange classifier <- SWAP.Train.KTSP(matTraining, trainingGroup, FilterFunc = topRttest, quant = 0.9, krange=c(15:30) ) ### Show the classifier classifier @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Training a Restricted KTSP algorithm} The \Rpackage{swithcBox} allows to training a KTSP classifier using a pre-specified set of restricted feature pairs. This can be useful to implement KTSP classifiers restricted to specific TSPs based, for instane, on prior biological information (\cite{iTSP}). To this end, the user must specify a set of candidate pairs by setting \Rfunarg{RestrictedPairs} argument. As an example, we can define a set of candidate pairs by randolmly selecting some of the rownames from the \Robject{inputMat} matrix and the classifier chooses from this set. In a real example these pairs would be provided by the user, for instance usinf prior biological knowledge. The restricted pairs must contain valid feature names, \textit{i.e.} the row names of \Robject{inputMat}. <>= set.seed(123) somePairs <- matrix(sample(rownames(matTraining), 6^2, replace=FALSE), ncol=2) head(somePairs) dim(somePairs) @ Train a classifier using the set of restricted feature pairs and the default filtering: <>= ### Train classifier <- SWAP.Train.KTSP(matTraining, trainingGroup, RestrictedPairs = somePairs, krange=3:16) ### Show the classifier classifier @ Train a classifier using a set of restricted feature pairs, defining the maximum number of TSP using \Rfunarg{krange} and also filtering the features by T-test. <>= ### Train classifier <- SWAP.Train.KTSP(matTraining, trainingGroup, RestrictedPairs = somePairs, FilterFunc = topRttest, quant = 0.3, krange=c(3:10) ) ### Show the classifier classifier @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Calculate and aggregates the TSP votes} The \Rfunction{SWAP.KTSP.Statistics} function can be used to compute and aggregate the TSP votes using alternative functions to combine the votes. The default method is the count of the signed TSP votes. We can also pass a different function to combine the KTSPs. This function takes an argument \Rfunarg{x} -- a logical vector corresponding to the TSP votes -- of length equal to the number of columns (\textit{e.g.}, the number of cancer patients under analysis) and aggregates the votes of all $K$ TSPs of the classifier identified by the training proces (see the \Rfunction{SWAP.Train.KTSP} function). Here we will use the default parameters (the count of the signed TSP votes) <>= ### Train a classifier classifier <- SWAP.Train.KTSP(matTraining, trainingGroup, FilterFunc = NULL, krange=2:8) ### Compute the statistics using the default parameters: ### counting the signed TSP votes ktspStatDefault <- SWAP.KTSP.Statistics(inputMat = matTraining, classifier = classifier) ### Show the components in the output names(ktspStatDefault) ### Show some of the votes head(ktspStatDefault$comparisons[ , 1:2]) ### Show default statistics head(ktspStatDefault$statistics) @ Here we will use the sum to aggregate the TSP votes <>= ### Compute ktspStatSum <- SWAP.KTSP.Statistics(inputMat = matTraining, classifier = classifier, CombineFunc=sum) ### Show statistics obtained using the sum head(ktspStatSum$statistics) @ Here, for instance, we will apply a hard treshold equal to 2 <>= ### Compute ktspStatThreshold <- SWAP.KTSP.Statistics(inputMat = matTraining, classifier = classifier, CombineFunc = function(x) sum(x) > 2 ) ### Show statistics obtained using the threshold head(ktspStatThreshold$statistics) @ We can also make a heatmap showing the individual TSPs votes (see Figure~\ref{fig:heatmap} below). <>= ### Make a heatmap showing the individual TSPs votes colorForRows <- as.character(1+as.numeric(trainingGroup)) heatmap(1*ktspStatThreshold$comparisons, scale="none", margins = c(10, 5), cexCol=0.5, cexRow=0.5, labRow=trainingGroup, RowSideColors=colorForRows) @ \newpage \begin{figure} \begin{center} \setkeys{Gin}{width=1.0\textwidth} <>= ### Make a heatmap showing the individual TSPs votes colorForRows <- as.character(1+as.numeric(trainingGroup)) heatmap(1*ktspStatThreshold$comparisons, scale="none", margins = c(10, 5), cexCol=0.85, cexRow=1, labRow=trainingGroup, RowSideColors=colorForRows) @ \end{center} \caption{\small Heatmap showing the individual TSP votes.} \label{fig:heatmap} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Classifiy samples and compute the classifier performance} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Classifiy training samples} The \Rfunction{SWAP.KTSP.Classify} function allows to classify one or more samples using the classifier identified by \Rfunction{SWAP.Train.KTSP}. The \textbf{resubstitution} performance in the training set is shown below. <>= ### Show the classifier classifier ### Apply the classifier to the TRAINING set trainingPrediction <- SWAP.KTSP.Classify(matTraining, classifier) ### Show str(trainingPrediction) ### Resubstitution performance in the TRAINING set table(trainingPrediction, trainingGroup) @ We can apply the classifier using a specific decision to combine the $K$ TSP as specified with the \Rfunarg{DecideFunc} argument of \Rfunction{SWAP.KTSP.Classify}. This argument is a function working on a logical vector \Rfunarg{x} containing the votes of each TSP. We can for instance count all votes for class one and then classify a patient in one class or the other based on a specific threshold. <>= ### Usr a CombineFunc based on sum(x) > 5.5 trainingPrediction <- SWAP.KTSP.Classify(matTraining, classifier, DecisionFunc = function(x) sum(x) > 5.5 ) ### Show str(trainingPrediction) ### Resubstitution performance in the TRAINING set table(trainingPrediction, trainingGroup) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Classifiy validation samples} We can apply the trained classifier to one new sample of the test set: <>= ### Classify one sample testPrediction <- SWAP.KTSP.Classify(matTesting[ , 1, drop=FALSE], classifier) ### Show testPrediction @ We can apply the trained classifier to a new set of samples, using the defaul decision rule based on the ``majority wins'' principle: <>= ### Apply the classifier to the complete TEST set testPrediction <- SWAP.KTSP.Classify(matTesting, classifier) ### Show table(testPrediction) ### Resubstitution performance in the TEST set table(testPrediction, testingGroup) @ We can apply the trained classifier to predict of a new set of samples, using an alternative decision rule specified by \Rfunarg{DecideFunc} For instance, we can classify by thresholding vote counts in favor of one of the classes. <>= ### APlly the classifier using sum(x) > 5.5 testPrediction <- SWAP.KTSP.Classify(matTesting, classifier, DecisionFunc = function(x) sum(x) > 5.5 ) ### Resubstitution performance in the TEST set table(testPrediction, testingGroup) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Compute the signed TSP scores} The \Rpackage{switchBox} allows also to compute the individual scores for each TSP of interest. This can be achieved by using the \Rfunction{SWAP.CalculateSignedScore} function as shown below. Compute the scores using all features for all possible pairs: <>= ### Compute the scores using all features for all possible pairs scores <- SWAP.CalculateSignedScore(matTraining, trainingGroup, FilterFunc=NULL) ### Show scores class(scores) dim(scores$score) @ Extract the TSP scores of interest -- the absolute value correspond to the scores returned by \Rfunction{SWAP.Train.KTSP}. <>= ### Get the scores scoresOfInterest <- diag(scores$score[ classifier$TSPs[,1] , classifier$TSPs[,2] ]) ### Their absolute value should corresponf to the scores returned by SWAP.Train.KTSP all(classifier$score == abs(scoresOfInterest)) @ The \Rfunction{SWAP.CalculateSignedScore} function accept the same argumets used by \Rfunction{SWAP.Train.KTSP}. It can compute the scores with or without a filtering function and using or not the restricted pairs, as specified by \Rfunarg{FilterFunc} and \Rfunarg{RestrictedPairs} respectively. <>= ### Compute the scores with default filtering function scores <- SWAP.CalculateSignedScore(matTraining, trainingGroup, featureNo=20 ) ### Show scores dim(scores$score) ### Compute the scores without the default filtering function ### and using restricted pairs scores <- SWAP.CalculateSignedScore(matTraining, trainingGroup, FilterFunc = NULL, RestrictedPairs = somePairs ) ### Show scores class(scores$score) length(scores$score) @ In Figure~\ref{fig:scores} is shown the histograms for all possible TSP scores. <>= hist(scores$score, col="salmon", main="TSP scores") @ \begin{figure} \begin{center} \setkeys{Gin}{width=0.75\textwidth} <>= hist(scores$score, col="salmon", main="TSP scores") @ \end{center} \caption{\small Histograms of all TSP socres.} \label{fig:scores} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Use of deprecated functions} The two functions \Rfunction{KTSP.Train} and \Rfunction{KTSP.Classify} are deprecated and are included in the package only for backward compatibility. They have been substituted by respectively \Rfunction{SWAP.Train.KTSP} and \Rfunction{SWAP.KTSP.Classify}. These functions were used to train and validate the 8-TSP classifier described by Marchionni et al \cite{Marchionni:2013aa} and are maintained for reproducibility purposes. Example on the way they are used follows. Preparation of phenotype information (a numeric vector with values equal to 0 or 1) for training the KTSP classifier: <>= ### Phenotypic group variable for the 78 samples table(trainingGroup) levels(trainingGroup) ### Turn into a numeric vector with values equal to 0 and 1 trainingGroupNum <- as.numeric(trainingGroup) - 1 ### Show group variable for the TRAINING set table(trainingGroupNum) @ KTSP classifier training using the deprected function: <>= ### Train a classifier using default filtering function based on the Wilcoxon test classifier <- KTSP.Train(matTraining, trainingGroupNum, n=8) ### Show the classifier classifier @ KTSP classifier performance using the deprected function: <>= ### Apply the classifier to one sample of the TEST set using ### sum of votes less than 2.5 trainPrediction <- KTSP.Classify(matTraining, classifier, combineFunc = function(x) sum(x) < 2.5) ### Contingency table table(trainPrediction, trainingGroupNum) @ Preparation of phenotype information (a numeric vector with values equal to 0 or 1) for testing the KTSP classifier on new data: <>= ### Phenotypic group variable for the 307 samples table(testingGroup) levels(testingGroup) ### Turn into a numeric vector with values equal to 0 and 1 testingGroupNum <- as.numeric(testingGroup) - 1 ### Show group variable for the TEST set table(testingGroupNum) @ Testing on new data and getting KTSP classifier performance using the deprected function: <>= ### Apply the classifier to one sample of the TEST set using ### sum of votes less than 2.5 testPrediction <- KTSP.Classify(matTesting, classifier, combineFunc = function(x) sum(x) < 2.5) ### Show prediction table(testPrediction) ### Contingency table table(testPrediction, testingGroupNum) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \pagebreak \section{ {\Large \bf {System Information}} } Session information: <>= toLatex(sessionInfo()) @ %\pagebreak \section{ {\Large \bf {Literature Cited}} } \bibliographystyle{unsrt} \bibliography{./switchBox} \end{document}