---
title: D. Machine Learning
author:
  Martin Morgan (mtmorgan@fredhutch.org)<br />
  Sonali Arora (sarora@fredhutch.org)
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{D. Machine Learning}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```

## Introduction to Machine Learning 

Lets say that we are interested in predicting the run time of an athlete 
depending on his shoe size, height and weight in a study of 100 people.
We can do so using a simple linear regression model where 

```{r eval=FALSE}
y = beta0 + beta1 * height + beta2 * weight + beta3 * shoe_size 
```

Here y is the response variable (run time), n is the number of observations 
(100 people), p is the number of variables/ features/ predictors (3 IE height, 
weight, shoe size), X is a nxp matrix

This data set is a low dimensional data where n >> p but most of the biological 
data sets coming out of modern biological techniques are high dimensional IE 
n << p This poses statistical challenge and simple linear regression can no 
longer help us.

For example, 

* Identify the risk factors(genes) for prostrate cancer based on gene
  expression data
* Predict the chances of breast cancer survival in a patient. 
* Identify patterns of gene expression among different sub types of
  breast cancer

In all of the 3 examples, listed above n, number of observations, is 30-40 patients
whereas p, number of features, is approximately 30,000 genes. Try writing a linear 
regression formula for the outcome variable, y, in any of the above three 
scenarios.. 

Listed below are things that can go wrong with high dimensional data
- some of these predictors are useful, some are not 
- if we include too many predictors, we can over fit the data 

This is why we need Machine Learning. Lets first introduce some basic concepts 
and then dive into examples and a lab session. 

**Supervised Learning** - Use a data set X to predict the association with a 
response variable Y. The response variable can be continuous or categorical. 
For example: Predicting the chances of breast cancer survival in a patient.

**Unsupervised Learning** - Discover the associations or patterns in X. No 
response variable is present. For example: Cluster similar genes into groups. 

**Training & Test Datasets** -  Usually we split observation into test and 
training data sets. We fit the model on the training data set and evaluate on the 
test data set. The test set error rate is an estimate of the models performance 
on future data sets. 

**Model Selection** - We usually consider numerous models for a given problem. 
For example, we are trying to identify the genes responsible for a given 
disease using gene expression data set- we could have the following models
a) model 1 - Use all 30000 genes from the array to build a model 
b) model 2 - we include only genes related to the pathway that we know is 
upregulated in that disease to build a model
c) model 3 - include genes found in literature which are known to influence
this disease 
It is highly recommended to use the test set only on our final model to see
how our model will do with new, unseen data. So how do we pick the best 
model which can be tested on the test data set?  

**Cross-validation**
We can use different approaches to find the best model. Lets look at the 
commonly used approaches, namely, validation set, leave one out 
cross-validation, k-fold cross validation. 

Briefly, the __validation set approach__ deals with diving the full data sets into 
3 groups - training set, validation set and the test set. We train the models on 
the training set, evaluate their performance on the validation set and then the
best model is chosen to fit on the test set. 

The __leave one out cross validation__ starts with fitting n models (where n is
number of observations in the training data set), each on n-1 observations, 
evaluating each model on the left-out observation. The best model is the one 
for which the total test error is the smallest and that is then used to predict 
the test set. 

Lastly the __5 fold cross validation__ (here k=5), is splitting the training 
data set into 5 sets and repeatedly training the model on the other 4 sets and 
evaluating the performance on the fifth.

**Bias, Variance, Overfitting** - Bias refers to the average difference between
the actual betas and the predicted betas, Variance refers to the amount by 
which the betas differ across experiments. As the model complexity(no of 
variables) increases, the bias decreases and the variance increases. This is 
know as the Bias-Variance Tradeoff and a model that has too much of variance, 
is said to be over fit. 

## Datasets

For **Unsupervised learning**, we will use RNA-Seq count data from the
Biocoductor package, `r Biocpkg("airway")`. From the abstract, a brief 
description of the RNA-Seq experiment on airway smooth muscle (ASM) cell 
lines: “Using RNA-Seq, a high-throughput sequencing method, we characterized 
transcriptomic changes in four primary human ASM cell lines that were treated 
with dexamethasone - a potent synthetic glucocorticoid (1 micromolar for 
18 hours).”

```{r message=FALSE}
library(airway)
data("airway")
se <- airway
colData(se)
library("DESeq2")
dds <- DESeqDataSet(se, design = ~ cell + dex)  
```

For **Supervised learning**, we will use cervical count data from the
Biocoductor package, `r Biocpkg("MLSeq")`. This data set contains
expressions of 714 miRNA's of human samples. There are 29 tumor and 29
non-tumor cervical samples. For learning purposes, we can treat these
as two separate groups and run various classification algorithms.

```{r message=FALSE}
library(MLSeq)
filepath = system.file("extdata/cervical.txt", package = "MLSeq")
cervical = read.table(filepath, header = TRUE)
```


## Unsupervised Learning 

Unsupervised Learning is a set of statistical tools intended for the setting
in which we have only a set of 'p' features  measured on 'n' observations. 
We are primarily interested in discovering interesting 
things about the 'p' features. 

Unsupervised Learning is often performed as a part of Exploratory Data Analysis. 
These tools help us to get a good idea about the data set. Unlike a supervised
learning problem, where we can use prediction to gain some confidence about our
learning algorithm, there is no way to check our model. The learning algorithm
is thus, aptly named "unsupervised".

**RLOG TRANSFORMATION** 

Many common statistical methods for exploratory analysis of multidimensional 
data, especially methods for clustering and ordination (e.g., 
principal-component analysis and the like), work best for (at least 
approximately) homoskedastic data; this means that the variance of an observed 
quantity (here, the expression strength of a gene) does not depend on the mean. 

In RNA-Seq data, the variance grows with the mean.If one performs PCA 
(principal components analysis) directly on a matrix of normalized read counts,
the result typically depends only on the few most strongly expressed genes 
because they show the largest absolute differences between samples.

As a solution, DESeq2 offers the regularized-logarithm transformation, or rlog 
for short. See the help for ?rlog for more information and options. 

The function rlog returns a SummarizedExperiment object which contains the 
rlog-transformed values in its assay slot:

```{r}    
rld <- rlog(dds)   
head(assay(rld))    
```

To assess overall similarity between samples: Which samples are similar to each 
other, which are different? Does this fit to the expectation from the 
experiment's design? We use the R function dist to calculate the Euclidean 
distance between samples. To avoid that the distance measure is dominated by 
a few highly variable genes, and have a roughly equal contribution from all 
genes, we use it on the rlog-transformed data

```{r}
sampleDists <- dist( t( assay(rld) ) )
sampleDists
```
Note the use of the function t to transpose the data matrix. We need this 
because dist calculates distances between data rows and our samples constitute 
the columns.

**HEATMAP**

We visualize the sample-to-sample distances in a heatmap, using the 
function heatmap.2 from  the gplots package. Note that we have changed the row 
names of the distance matrix to contain treatment type and patient number 
instead of sample ID, so that we have all this information in view when 
looking at the heatmap.

```{r message=FALSE}
library("gplots")
library("RColorBrewer")

sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep="-" )
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
hc <- hclust(sampleDists)
heatmap.2( sampleDistMatrix, Rowv=as.dendrogram(hc),
          symm=TRUE, trace="none", col=colors,
          margins=c(2,10), labCol=FALSE )

```

**PCA**

Another way to visualize sample-to-sample distances is a principal-components 
analysis (PCA). In this ordination method, the data points (i.e., here, the 
samples) are projected onto the 2D plane such that they spread out in the two 
directions which explain most of the differences in the data. The x-axis is 
the direction (or principal component) which separates the data points the most.
The amount of the total variance which is contained in the direction is 
printed in the axis label. Here, we have used the function plotPCA which comes
with DESeq2. The two terms specified by intgroup are the interesting groups 
for labelling the samples; they tell the function to use them to choose colors. 

```{r}
plotPCA(rld, intgroup = c("dex", "cell"))
```

From both visualizations, we see that the differences between cells are 
considerable, though not stronger than the differences due to treatment 
with dexamethasone. This shows why it will be important to account for this 
in differential testing by using a paired design (“paired”, because each dex 
treated sample is paired with one untreated sample from the same cell line). 
We are already set up for this by using the design formula ~ cell + dex when 
setting up the data object in the beginning.

**MDS**
Another plot, very similar to the PCA plot, can be made using the 
multidimensional scaling (MDS) function in base R. This is useful when we don't
have the original data, but only a matrix of distances. Here we have the MDS 
plot for the distances calculated from the rlog transformed counts:

```{r}
library(ggplot2)
mds <- data.frame(cmdscale(sampleDistMatrix))
mds <- cbind(mds, colData(rld))
qplot(X1,X2,color=dex,shape=cell,data=as.data.frame(mds))
```

### Exercise:
Use the plotMDS function from the limma package to make a simila plot. 
What is the advtange of using this function over base R's cmdscale?

**Solutions:**

A similar plot can be made using the plotMDS() function in limma where the input
is a matrix of log-fold expression values. Here the advantage is that the 
distances on plot are proportional to log2-fold change and not only is the plot
created, but the object (with distance matrix) is also returned.

```{r plotMDS}
suppressPackageStartupMessages({
   library(limma)
   library(DESeq2)
   library(airway)
})
plotMDS(assay(rld), col=as.integer(dds$dex), pch=as.integer(dds$cell))
```



## Supervised Learning 

In supervised learning, along with the 'p' features, we 
also have the a response Y measured on the same n observations. The goal is then
to predict Y using X (n x p matrix) for new observations.

For the cervical data, we know that the first 29 are non-Tumor samples 
whereas the last 29 are Tumor samples. We will code these as 0 and 1 
respectively.  We will randomly sample 30% of our data and use that as a 
test set. The remaining 70% of the data will be used as training data

```{r }
set.seed(9)

class = data.frame(condition = factor(rep(c(0, 1), c(29, 29))))

nTest = ceiling(ncol(cervical) * 0.2)
ind = sample(ncol(cervical), nTest, FALSE)

cervical.train = cervical[, -ind]
cervical.train = as.matrix(cervical.train + 1)
classtr = data.frame(condition = class[-ind, ])

cervical.test = cervical[, ind]
cervical.test = as.matrix(cervical.test + 1)
classts = data.frame(condition = class[ind, ])
```

MLSeq aims to make computation less complicated for a user and
allows one to learn a model using various classifier's with one single function. 

The main function of this package is classify which requires data in the form of 
a DESeqDataSet instance. The DESeqDataSet is a subclass of SummarizedExperiment,
used to store the input values, intermediate calculations and results of an 
analysis of differential expression.

So lets create DESeqDataSet object for both the training and test set, and run 
DESeq on it. 

```{r}
cervical.trainS4 = DESeqDataSetFromMatrix(countData = cervical.train, 
        colData = classtr, formula(~condition))
cervical.trainS4 = DESeq(cervical.trainS4, fitType = "local")

cervical.testS4 = DESeqDataSetFromMatrix(countData = cervical.test, colData = classts,
formula(~condition))
cervical.testS4 = DESeq(cervical.testS4, fitType = "local")

```
Classify using Support Vector Machines. 

```{r}
svm = classify(data = cervical.trainS4, method = "svm", normalize = "deseq",
deseqTransform = "vst", cv = 5, rpt = 3, ref = "1")
svm
```

It returns an object of class 'MLseq' and we observe that it successfully
fitted a model with 97.8% accuracy. We can access the slots of this S4 object by
```{r}
getSlots("MLSeq")
```
And also, ask about the model trained. 

```{r}
trained(svm)
```

We can predict the class labels of our test data using "predict"

```{r}
pred.svm = predictClassify(svm, cervical.testS4)
table(pred.svm, relevel(cervical.testS4$condition, 2))
```

The other classification methods available are 'randomforest', 'cart' and 
'bagsvm'.

### Exercise:

Train the same training data and test data using randomForest.

**Solutions:**

```{r}
rf = classify(data = cervical.trainS4, method = "randomforest", 
        normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref = "1")
trained(rf)
pred.rf = predictClassify(rf, cervical.testS4)
table(pred.rf, relevel(cervical.testS4$condition, 2))
```

## SessionInfo

```{r}
sessionInfo()
```

## References 

1. Zararsiz G, Goksuluk D, Korkmaz S, Eldem V, Duru IP, Unver T and Ozturk A (2014). MLSeq: Machine learning interface for RNA-Seq data. R package version 1.3.0.
2. Himes, E. B, Jiang, X., Wagner, P., Hu, R., Wang, Q., Klanderman, B., Whitaker, M. R, Duan, Q., Lasky-Su, J., Nikolos, C., Jester, W., Johnson, M., Panettieri, A. R, Tantisira, G. K, Weiss, T. S, Lu and Q. (2014). “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS ONE, 9(6), pp. e99625. http://www.ncbi.nlm.nih.gov/pubmed/24926665.
3. An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
4. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, Jerome Friedman