% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{MiPP Overview}
%\VignetteKeywords{classification}
%\VignetteDepends{MiPP}
%\VignettePackage{MiPP}
%\documentclass[12pt, a4paper]{article}
\documentclass[12pt]{article}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\newcommand{\word}{Misclassification-Penalized Posteriors}
\newcommand{\statm}{$\psi_p$}
\newcommand{\acr}{MiPP}
\newcommand{\acrs}{MiPP }
\newcommand{\words}{Misclassification-Penalized Posteriors }
\newcommand{\statms}{$\psi_p$ }
\newcommand{\stat}{\psi_p}

\author{Mat Soukup, HyungJun Cho, and Jae K. Lee}

\begin{document}

\title{How to use the MiPP Package}
\maketitle
\tableofcontents

\section{Introduction}

The \Rpackage{MiPP} package is designed to sequentially add genes to a
classification gene model based upon the \words (\acr) discussed in
Section~\ref{mipp}. The model is constructed on a training data set, and its
estimated actual performance is computed on an independent data set. When no
clear distinction between the training and independent data sets exists,
cross-validation is used to estimate actual performance. For the detailed
algorithms, see Soukup, Cho, and Lee (2005) and Soukup and Lee (2004).

The \Rpackage{MiPP} package employs the packages \Rpackage{MASS} for LDA/QDA
(linear/quadratic discriminant analysis) and \Rpackage{e1071} for SVM
(support vector machines). Users should install the \Rpackage{e1071} package
from CRAN (http://www.r-project.org/).

\section{\words (\acr)}\label{mipp}

Estimated actual performance was mentioned a number of times in the section
above. Classically, the accuracy of a classification model is summarized by
reporting its estimated actual error rate. The error rate, however, ignores
how likely it is that a particular sample belongs to a given class: it merely
dichotomizes each sample as correctly classified or not. Although the error
rate plays a key role in assessing how well a classification model performs,
it does not use all the information available from a classification rule.

The \words (\acr) takes into account how likely it is that a sample belongs
to a given class by using the posterior probability of correct
classification. Whenever a sample is misclassified, \acrs subtracts 1 from
its posterior probability of correct classification, so that each
misclassified sample contributes a negative term. If we define the posterior
probability of correct classification using genes $\mathbf{x}$ as
$\hat{f}(\mathbf{x})$, \acrs can be calculated as
\begin{eqnarray}
\psi_p=\sum\limits_{correct}\hat{f}(\mathbf{x})+
\sum\limits_{wrong}\left(\hat{f}(\mathbf{x})-1\right).\label{introSp}
\end{eqnarray}
Here, \emph{correct} refers to the subset of samples that are correctly
classified and \emph{wrong} refers to the subset of samples that are
misclassified. By introducing a random variable that indicates whether or
not a sample is misclassified, \acrs can be shown to equal the sum of the
posterior probabilities of correct classification minus the number of
misclassified samples.
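As a concrete illustration, the following minimal R sketch (illustrative
only; the posterior probabilities are made up and this code is not part of
the \Rpackage{MiPP} interface) computes $\stat$ directly from
Equation~(\ref{introSp}):

\begin{verbatim}
## Illustrative only: MiPP from hypothetical posterior probabilities
## of correct classification, one entry per sample.
post  <- c(0.95, 0.80, 0.99, 0.40, 0.70)
wrong <- c(FALSE, FALSE, FALSE, TRUE, FALSE)  # TRUE = misclassified

## Correctly classified samples add their posteriors; the misclassified
## sample adds 0.40 - 1 = -0.60, giving psi_p = 2.84.
psi.p <- sum(post[!wrong]) + sum(post[wrong] - 1)
\end{verbatim}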
From the definition, \acrs increases whenever the sum of the posterior
probabilities of correct classification increases, the number of
misclassified samples decreases, or both. We standardize the MiPP score by
dividing it by the number of samples in the data set; the standardized score
is denoted sMiPP. Thus the range of sMiPP is from $-1$ to 1, and as accuracy
increases, sMiPP converges to 1.

Some basic properties of \acrs are that its maximum value equals the sample
size (sMiPP $=1$) and its minimum value equals the negative of the sample
size (sMiPP $=-1$). Under a purely random model, the expected value of \acrs
is zero (sMiPP $=0$). For the two-class case the variance has been derived
and is available from the first author; for more than two classes an
explicit value cannot be derived analytically, so a bootstrap estimate is
the preferred way of estimating the variance.

\section{Examples}

\subsection{Acute Leukemia Data}

This data set has been used frequently for testing various methods of
classification and prediction of cancer subtypes. Two distinct subsets of
array data for AML and ALL leukemia patients are available: a training set
of 27 ALL and 11 AML samples and a test set of 20 ALL and 14 AML samples.
The training set was obtained from adult bone marrow samples, whereas the
test set was obtained from 24 bone marrow samples and 10 peripheral blood
samples, with 4 of the AML samples coming from adults. Gene expression
levels were measured with Affymetrix\texttrademark\ oligonucleotide
microarrays containing probes for 6817 human genes. Note that a subset of
the genes (713 probe sets) is stored in the \Rpackage{MiPP} package.

\newpage

To run \Rpackage{MiPP}, the data can be prepared as follows.

\begin{verbatim}
data(leukemia)

#Combine the two sub-data sets loaded by data(leukemia)
leukemia <- cbind(leuk1, leuk2)

#IQR normalization
leukemia <- mipp.preproc(leukemia, data.type="MAS4")

#Train set
x.train <- leukemia[,1:38]
y.train <- factor(c(rep("ALL",27),rep("AML",11)))

#Test set
x.test <- leukemia[,39:72]
y.test <- factor(c(rep("ALL",20),rep("AML",14)))
\end{verbatim}

Since two distinct data sets exist, the model is constructed on the training
data and evaluated on the test data as follows.

\begin{verbatim}
out <- mipp(x=x.train, y=y.train, x.test=x.test, y.test=y.test,
            n.fold=5, percent.cut=0.05, rule="lda")
\end{verbatim}

This selects genes sequentially, one gene at a time, with the LDA rule
(\Rfunarg{rule="lda"}) and 5-fold cross-validation (\Rfunarg{n.fold=5}) on
the training set. To reduce computing time, it pre-selects the most
plausible 5\% of the 713 genes by the two-sample t-test
(\Rfunarg{percent.cut=0.05}) and then performs the gene selection. To use
all genes without pre-selection, set the argument \Rfunarg{percent.cut=1}.
The above command generates the following output.

\begin{verbatim}
out$model

  Order Gene  Tr.ER Tr.MiPP Tr.sMiPP  Te.ER Te.MiPP Te.sMiPP Select
1     1  571 0.0526   30.86   0.8122 0.1176   23.92   0.7035
2     2  436 0.0000   36.89   0.9707 0.0294   30.41   0.8945
3     3  366 0.0000   37.95   0.9988 0.0294   31.35   0.9222
4     4  457 0.0000   38.00   0.9999 0.0294   32.14   0.9453
5     5  413 0.0000   38.00   1.0000 0.0294   32.18   0.9464
6     6  635 0.0000   38.00   1.0000 0.0000   33.75   0.9927     **
7     7  648 0.0000   38.00   1.0000 0.0000   33.57   0.9874
\end{verbatim}

The gene models are evaluated on both the training (denoted by {\bf Tr}) and
test (denoted by {\bf Te}) sets; however, we select the final model based on
the test set, independently of the training set used for gene selection. The
gene model with the maximum sMiPP is indicated by one star (*), and the
parsimonious model (indicated by **) contains the fewest genes with sMiPP
greater than or equal to (max sMiPP $-$ 0.01). In this example the maximum
and parsimonious models are the same (indicated by **). Thus, the final
model, with sMiPP 0.993, contains genes 571, 436, 366, 457, 413, and 635.
Note that the genes listed in the output correspond to row numbers of the
expression matrices.
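If, for example, \Robject{out\$model} is returned as a data frame whose
columns match the printed output above (an assumption made here for
illustration, not a documented guarantee), the genes of the final model
could be extracted along the following lines:

\begin{verbatim}
## Hypothetical sketch: assumes out$model is a data frame whose
## columns match the printed output (Order, Gene, ..., Select).
final <- out$model[out$model$Select == "**", ]  # parsimonious model row
genes <- out$model$Gene[seq_len(final$Order)]   # 571 436 366 457 413 635
\end{verbatim}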
\subsection{Colon Cancer Data}

The colon cancer data set consists of the 2000 genes with the highest
minimal intensity across the 62 tissue samples, chosen from the original set
of more than 6,500 genes. The data set was filtered using the procedures
described at the authors' web site. The 62 samples comprise 40 colon tumor
tissue samples and 22 normal colon tissue samples (Alon \emph{et al.},
1999). Li \emph{et al.} (2001) identified 5 samples (N34, N36, T30, T33, and
T36) that were likely to have been contaminated. As a result, these five
samples are excluded from all further analysis; our error rate would be
higher if they were included.

Since we are working with a small data set (57 samples), we use
cross-validation techniques. Lacking a `true' independent test set, we
randomly create a training data set with 38 samples (25 tumor and 13 normal)
and an independent data set with 19 samples (12 tumor and 7 normal). Since
this split is created at random, it is of interest to see which model is
selected under a different random split of the data. Note that the choice of
the sizes of the training and independent test sets is somewhat arbitrary,
but consistent results were found using training and test sets of sizes 29
(19 tumor and 10 normal) and 28 (18 tumor and 10 normal), respectively. The
colon data set in the \Rpackage{MiPP} package contains only 200 genes as an
example.

For the colon data with no independent test set, \Rpackage{MiPP} can be run
as follows.

\begin{verbatim}
data(colon)
x <- mipp.preproc(colon)
y <- factor(c("T", "N", "T", "N", "T", "N", "T", "N", "T", "N",
              "T", "N", "T", "N", "T", "N", "T", "N", "T", "N",
              "T", "N", "T", "N", "T", "T", "T", "T", "T", "T",
              "T", "T", "T", "T", "T", "T", "T", "T", "N", "T",
              "T", "N", "N", "T", "T", "T", "T", "N", "T", "N",
              "N", "T", "T", "N", "N", "T", "T", "T", "T", "N",
              "T", "N"))

#Deleting contaminated chips
x <- x[,-c(51,55,45,49,56)]
y <- y[-c(51,55,45,49,56)]

out <- mipp(x=x, y=y, n.fold=5, p.test=1/3, n.split=20,
            n.split.eval=100, percent.cut=0.1, rule="lda")
\end{verbatim}

This divides the whole data set into two groups, one for training
(two-thirds) and one for testing (one-third) (\Rfunarg{p.test=1/3}), and
performs the forward gene selection as was done for the acute leukemia data.
Splitting the data into training and independent data sets and then
selecting a model for the given split is repeated 20 times
(\Rfunarg{n.split=20}).
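Because the training/test splits are generated at random, results will vary
from run to run. Fixing R's random seed before the call (standard R
practice, not a \Rpackage{MiPP}-specific feature) makes a run reproducible,
for example:

\begin{verbatim}
## Optional: fix the random number generator before calling mipp()
## so that the random training/test splits can be reproduced.
set.seed(1234)
\end{verbatim}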
The run above generates the following output.

\begin{verbatim}
out$model

    Split Order Gene  Tr.ER Tr.MiPP Tr.sMiPP  Te.ER Te.MiPP Te.sMiPP Select
1       1     1   29 0.0263   34.75   0.9144 0.1579   13.38   0.7042
2       1     2   36 0.0263   35.95   0.9461 0.0000   18.35   0.9659
3       1     3   30 0.0000   37.89   0.9972 0.0000   18.88   0.9939     **
4       1     4  177 0.0000   38.00   0.9999 0.0000   18.98   0.9990
5       1     5   18 0.0000   38.00   1.0000 0.0000   18.96   0.9979
6       1     6  185 0.0000   38.00   1.0000 0.0000   19.00   1.0000
7       1     7   49 0.0000   38.00   1.0000 0.0000   19.00   1.0000
8       1     8  163 0.0000   38.00   1.0000 0.0000   19.00   1.0000
9       1     9   91 0.0000   38.00   1.0000 0.0000   19.00   1.0000
10      1    10   78 0.0000   38.00   1.0000 0.0000   19.00   0.9999
11      1    11  148 0.0000   38.00   1.0000 0.0000   19.00   1.0000
12      1    12   95 0.0000   38.00   1.0000 0.0000   19.00   1.0000      *
13      2     1  163 0.0789   30.22   0.7952 0.1053   14.39   0.7574
14      2     2  177 0.0000   37.12   0.9767 0.0000   18.65   0.9818
15      2     3   29 0.0000   37.76   0.9936 0.0000   18.98   0.9988     **
16      2     4   36 0.0000   37.97   0.9991 0.0000   18.99   0.9996
17      2     5   28 0.0000   37.96   0.9991 0.0000   18.99   0.9997
18      2     6  185 0.0000   37.98   0.9995 0.0000   18.99   0.9997      *
19      2     7  182 0.0000   37.98   0.9995 0.0000   18.98   0.9987
20      2     8   65 0.0000   37.99   0.9998 0.0000   18.99   0.9993
21      2     9   84 0.0000   37.99   0.9999 0.0000   18.99   0.9994
22      2    10   18 0.0000   37.99   0.9999 0.0000   18.99   0.9995
23      2    11  102 0.0000   38.00   0.9999 0.0000   18.99   0.9993
24      2    12   76 0.0000   38.00   0.9999 0.0000   18.99   0.9996
.
.
.
206    20     1   30 0.0263   35.31   0.9291 0.1579   12.39   0.6522
207    20     2   78 0.0263   36.35   0.9566 0.0526   16.54   0.8704
208    20     3   36 0.0000   38.00   0.9999 0.0000   18.86   0.9924     **
209    20     4   51 0.0000   38.00   1.0000 0.0526   17.05   0.8974
210    20     5   18 0.0000   38.00   1.0000 0.0526   16.93   0.8911
211    20     6  177 0.0000   38.00   1.0000 0.1053   15.32   0.8062
212    20     7   28 0.0000   38.00   1.0000 0.1053   15.03   0.7911
213    20     8  102 0.0000   38.00   1.0000 0.1579   13.38   0.7040
214    20     9  182 0.0000   38.00   1.0000 0.1053   14.81   0.7794
215    20    10  148 0.0000   38.00   1.0000 0.1053   14.98   0.7883
216    20    11   84 0.0000   38.00   1.0000 0.1579   13.35   0.7027
217    20    12  141 0.0000   38.00   1.0000 0.1053   14.86   0.7821
\end{verbatim}
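Before looking at the evaluation results, it can be useful to see which
genes recur across the 20 splits. Assuming again (for illustration only)
that \Robject{out\$model} is a data frame with the columns shown above, a
quick tally might look like this:

\begin{verbatim}
## Hypothetical sketch: tally how often each gene is selected
## across the 20 splits, assuming out$model has a Gene column.
gene.freq <- table(out$model$Gene)
head(sort(gene.freq, decreasing=TRUE))
\end{verbatim}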
For each split, the identified parsimonious model (denoted by **) is
evaluated on 100 additional independent splits
(\Rfunarg{n.split.eval=100}), generating the following output.

\begin{verbatim}
out$model.eval

    Split  G1  G2  G3  G4  G5  G6  G7  G8  G9 G10 G11 G12  mean ER mean MiPP mean sMiPP
S1      1  29  36  30  NA  NA  NA  NA  NA  NA  NA  NA  NA 0.022631     18.03  0.9487025
S2      2 163 177  29  NA  NA  NA  NA  NA  NA  NA  NA  NA 0.004210     18.70  0.9840302
S3      3  30  36 185  NA  NA  NA  NA  NA  NA  NA  NA  NA 0.000000     18.91  0.9951495
S4      4  29 102 177  36  84  NA  NA  NA  NA  NA  NA  NA 0.010526     18.51  0.9743218
S5      5  91 177  29  36  NA  NA  NA  NA  NA  NA  NA  NA 0.006842     18.62  0.9798232
S6      6  49  78 185 177  29  NA  NA  NA  NA  NA  NA  NA 0.015789     18.33  0.9646823
S7      7  29  78  91  30  51 102  36  18  76 141  NA  NA 0.006842     18.73  0.9857517
S8      8  30  36 185  NA  NA  NA  NA  NA  NA  NA  NA  NA 0.000000     18.91  0.9951495
S9      9  30  36  78  NA  NA  NA  NA  NA  NA  NA  NA  NA 0.000000     18.88  0.9938337
S10    10  29 185 177  NA  NA  NA  NA  NA  NA  NA  NA  NA 0.006315     18.67  0.9826242
S11    11  29 177  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 0.014736     18.08  0.9517509
S12    12  30  36 177 185  NA  NA  NA  NA  NA  NA  NA  NA 0.000000     18.97  0.9986411
S13    13  30  36  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 0.000526     18.76  0.9874807
S14    14  49 185 141  65 102  18  30  36 163  95  91  29 0.008947     18.62  0.9799060
S15    15 163 177 185  29  NA  NA  NA  NA  NA  NA  NA  NA 0.005263     18.73  0.9860384
S16    16  49  91 185  78  NA  NA  NA  NA  NA  NA  NA  NA 0.013684     18.32  0.9643648
S17    17 163  29  49  36 148  91  84  30  18 185 141 182 0.027894     17.85  0.9396381
S18    18  29 177 163  NA  NA  NA  NA  NA  NA  NA  NA  NA 0.004210     18.69  0.9840302
S19    19  29  36 185  91  NA  NA  NA  NA  NA  NA  NA  NA 0.020526     18.15  0.9551247
S20    20  30  78  36  NA  NA  NA  NA  NA  NA  NA  NA  NA 0.000000     18.88  0.9938337

         5% ER 50% ER 95% ER  5% sMiPP 50% sMiPP 95% sMiPP
S1  0.10526316      0      0 0.8083084 0.9898493 0.9969918
S2  0.05263158      0      0 0.9083072 0.9949838 0.9996731
S3  0.00000000      0      0 0.9858440 0.9965600 0.9990983
S4  0.05263158      0      0 0.8829518 0.9965193 0.9996263
S5  0.05263158      0      0 0.8964819 0.9946817 0.9993379
S6  0.05263158      0      0 0.8906567 0.9925318 0.9998255
S7  0.05263158      0      0 0.9074560 0.9992884 0.9999863
S8  0.00000000      0      0 0.9858440 0.9965600 0.9990983
S9  0.00000000      0      0 0.9856433 0.9947618 0.9987210
S10 0.05263158      0      0 0.8974827 0.9959758 0.9989191
S11 0.05526316      0      0 0.8655241 0.9726538 0.9902048
S12 0.00000000      0      0 0.9968295 0.9991665 0.9998100
S13 0.00000000      0      0 0.9744228 0.9896188 0.9954914
S14 0.05526316      0      0 0.8777725 0.9992962 0.9999910
S15 0.05263158      0      0 0.9071621 0.9976359 0.9997154
S16 0.05263158      0      0 0.8896781 0.9860472 0.9983609
S17 0.10526316      0      0 0.7931386 0.9778348 0.9995296
S18 0.05263158      0      0 0.9083072 0.9949838 0.9996731
S19 0.05263158      0      0 0.8801255 0.9952021 0.9989584
S20 0.00000000      0      0 0.9856433 0.9947618 0.9987210
\end{verbatim}

\subsection{Sequential selection}

A gene that classifies well (a masked gene) may be masked by other, better
classifying (masking) genes; such masked genes can be discovered once the
masking genes are no longer present. It is therefore worthwhile to select
gene models again after removing the genes selected in previous runs. This
sequential selection can be carried out by manually removing the genes
selected in the previous runs (a sketch of the manual approach is given
below); for users' convenience, the \Rpackage{MiPP} package can also perform
such a sequential selection process automatically many times.

For the acute leukemia data with an independent test set, the sequential
analysis can be performed with the following arguments to the
\Rfunction{mipp.seq} function:

\begin{verbatim}
out <- mipp.seq(x=x.train, y=y.train, x.test=x.test, y.test=y.test,
                n.seq=3)
\end{verbatim}

The argument \Rfunarg{n.seq=3} means that the sequential selection is
performed 3 times, each run starting after all the genes in the previously
selected gene models have been removed.
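The manual alternative mentioned above amounts to dropping the previously
selected gene rows and re-running the forward selection. A minimal sketch
(illustrative only), using the final leukemia model found earlier:

\begin{verbatim}
## Illustrative only: remove previously selected genes (rows) by hand,
## then re-run the forward selection on the reduced matrices.
prev.genes <- c(571, 436, 366, 457, 413, 635)  # final leukemia model
out2 <- mipp(x=x.train[-prev.genes, ], y=y.train,
             x.test=x.test[-prev.genes, ], y.test=y.test,
             n.fold=5, percent.cut=0.05, rule="lda")
\end{verbatim}

Note that gene numbers reported by this second run refer to rows of the
reduced matrices, not of the original data.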
For the colon cancer data with no independent test set, the sequential
analysis can be performed with the following arguments:

\begin{verbatim}
out <- mipp.seq(x=x, y=y, n.seq=3, cutoff.sMiPP=0.7)
\end{verbatim}

With the argument \Rfunarg{cutoff.sMiPP=0.7}, the gene models whose 5\%
sMiPP exceeds 0.7 are selected, and all the genes in the selected models are
removed before the next run. All the arguments of the \Rfunction{mipp}
function can also be used in the \Rfunction{mipp.seq} function.

\vspace{.5in}

{\large \bf References}

\begin{itemize}

\item[] Soukup M, Cho H, and Lee JK (2005). Robust classification modeling
on microarray data using misclassification penalized posterior,
{\it Bioinformatics}, 21(Suppl): i423--i430.

\item[] Soukup M and Lee JK (2004). Developing optimal prediction models
for cancer classification using gene expression data, {\it Journal of
Bioinformatics and Computational Biology}, 1(4): 681--694.

\end{itemize}

\end{document}