---
title: "Working with TCGAbiolinks package"
author: "Antonio Colaprico, Tiago Chedraoui Silva, Luciano Garofano, 
Catharina Olsen, Davide Garolini, Claudia Cava, Isabella Castiglioni,
Thais Sarraf Sabedot, Tathiane Maistro Malta, Stefano Pagnotta, Michele Ceccarelli,
Gianluca Bontempi, Houtan Noushmehr"
date: "`r Sys.Date()`"
output: 
    BiocStyle::html_document:
        toc: true
        number_sections: false
        toc_depth: 2
        highlight: haddock

references:
- id: ref1
  title: Orchestrating high-throughput genomic analysis with Bioconductor
  author: 
  - family: Huber, Wolfgang and Carey, Vincent J and Gentleman, Robert and Anders, Simon and Carlson, Marc and Carvalho, Benilton S and Bravo, Hector Corrada and Davis, Sean and Gatto, Laurent and Girke, Thomas and others
    given:
  journal: Nature methods
  volume: 12
  number: 2
  pages: 115-121
  issued:
    year: 2015

- id: ref2
  title: GC-content normalization for RNA-Seq data
  author: 
  - family: Risso, Davide and Schwartz, Katja and Sherlock, Gavin and Dudoit, Sandrine
    given:
  journal: BMC bioinformatics
  volume: 12
  number: 1
  pages: 480
  issued:
    year: 2011 
    
- id: ref3
  title: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments
  author: 
  - family: Bullard, James H and Purdom, Elizabeth and Hansen, Kasper D and Dudoit, Sandrine
    given:
  journal: BMC bioinformatics
  volume: 11
  number: 1
  pages: 94
  issued:
    year: 2010 

- id: ref4
  title: Inferring regulatory element landscapes and transcription factor networks from cancer methylomes
  author: 
  - family: Yao, L., Shen, H., Laird, P. W., Farnham, P. J., & Berman, B. P. 
    given:
  journal: Genome biology
  volume: 16
  number: 1
  pages: 105
  issued:
    year: 2015 

- id: ref5
  title: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments
  author: 
  - family:  James H Bullard, Elizabeth Purdom, Kasper D Hansen and Sandrine Dudoit	
    given:
  journal: BMC Bioinformatics
  volume: 11
  number: 1
  pages: 94
  issued:
    year: 2010 

- id: ref6
  title: GC-content normalization for RNA-Seq data
  author: 
  - family: Risso, D., Schwartz, K., Sherlock, G., & Dudoit, S. 
    given:
  journal: BMC Bioinformatics
  volume: 12
  number: 1
  pages: 480
  issued:
    year: 2011 

- id: ref7
  title: Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma
  author: 
  - family: Noushmehr, H., Weisenberger, D.J., Diefes, K., Phillips, H.S., Pujara, K., Berman, B.P., Pan, F., Pelloski, C.E., Sulman, E.P., Bhat, K.P. et al.
    given:
  journal: Cancer cell
  volume: 17
  number: 5
  pages: 510-522
  issued:
    year: 2010

- id: ref8
  title: Molecular Profiling Reveals Biologically Discrete Subsets and Pathways of Progression in Diffuse Glioma
  author: 
  - family: Ceccarelli, Michele and Barthel, Floris P and Malta, Tathiane M and Sabedot, Thais S and Salama, Sofie R and Murray, Bradley A and Morozova, Olena and Newton, Yulia and Radenbaugh, Amie and Pagnotta, Stefano M and others
    given:
  journal: Cell
  URL: "http://doi.org/10.1016/j.cell.2015.12.028"
  DOI: "10.1016/j.cell.2015.12.028"
  volume: 164
  number: 3
  pages: 550-563
  issued:
    year: 2016


- id: ref9
  title: Comprehensive molecular profiling of lung adenocarcinoma
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Nature
  URL: "http://doi.org/10.1038/nature13385"
  DOI: "10.1038/nature13385"
  volume: 511
  number: 7511
  pages: 543-550
  issued:
    year: 2014


- id: ref10
  title: Comprehensive molecular characterization of gastric adenocarcinoma
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Nature
  URL: "http://doi.org/10.1038/nature13480"
  DOI: "10.1038/nature13480"
  issued:
    year: 2014

- id: ref11
  title: Comprehensive molecular portraits of human breast tumours
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Nature
  URL: "http://doi.org/10.1038/nature11412"
  DOI: "10.1038/nature11412"
  volume: 490
  number: 7418
  pages: 61-70
  issued:
    year: 2012
  
- id: ref12
  title: Comprehensive molecular characterization of human colon and rectal cancer
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Nature
  URL: "http://doi.org/10.1038/nature11252"
  DOI: "10.1038/nature11252"
  volume: 487
  number: 7407
  pages: 330-337
  issued:
    year: 2012    

- id: ref13
  title: Genomic classification of cutaneous melanoma
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Cell
  URL: "http://doi.org/10.1016/j.cell.2015.05.044"
  DOI: "10.1016/j.cell.2015.05.044"
  volume: 161
  number: 7
  pages: 1681-1696
  issued:
    year: 2015    

- id: ref14
  title: Comprehensive genomic characterization of head and neck squamous cell carcinomas
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Nature
  URL: "http://doi.org/10.1038/nature14129"
  DOI: "10.1038/nature14129"
  volume: 517
  number: 7536
  pages: 576-582
  issued:
    year: 2015    

- id: ref15
  title: The somatic genomic landscape of chromophobe renal cell carcinoma
  author: 
  - family: Davis, Caleb F and Ricketts, Christopher J and Wang, Min and Yang, Lixing and Cherniack, Andrew D and Shen, Hui and Buhay, Christian and Kang, Hyojin and Kim, Sang Cheol and Fahey, Catherine C and others
    given:
  journal: Cancer Cell
  URL: "http://doi.org/10.1016/j.ccr.2014.07.014"
  DOI: "10.1016/j.ccr.2014.07.014"
  volume: 26
  number: 3
  pages: 319-330
  issued:
    year: 2014    


- id: ref16
  title: Comprehensive genomic characterization of squamous cell lung cancers
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Nature
  URL: "http://doi.org/10.1038/nature11404"
  DOI: "10.1038/nature11404"
  volume: 489
  number: 7417
  pages: 519-525
  issued:
    year: 2012   

- id: ref17
  title: Integrated genomic characterization of endometrial carcinoma
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Nature
  URL: "http://doi.org/10.1038/nature12113"
  DOI: "10.1038/nature12113"
  volume: 497
  number: 7447
  pages: 67-73
  issued:
    year: 2013   

- id: ref18
  title: Integrated genomic characterization of papillary thyroid carcinoma
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Cell
  URL: "http://doi.org/10.1016/j.cell.2014.09.050"
  DOI: "10.1016/j.cell.2014.09.050"
  volume: 159
  number: 3
  pages: 676-690
  issued:
    year: 2014   

- id: ref19
  title: The molecular taxonomy of primary prostate cancer
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Cell
  URL: "http://doi.org/10.1016/j.cell.2015.10.025"
  DOI: "10.1016/j.cell.2015.10.025"
  volume: 163
  number: 4
  pages: 1011-1025
  issued:
    year: 2015   
    

- id: ref20
  title: Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma
  author: 
  - family: Linehan, W Marston and Spellman, Paul T and Ricketts, Christopher J and Creighton, Chad J and Fei, Suzanne S and Davis, Caleb and Wheeler, David A and Murray, Bradley A and Schmidt, Laura and Vocke, Cathy D and others
    given:
  journal: NEW ENGLAND JOURNAL OF MEDICINE
  URL: "http://doi.org/10.1056/NEJMoa1505917"
  DOI: "10.1056/NEJMoa1505917"
  volume: 374
  number: 2
  pages: 135-145
  issued:
    year: 2016    
    
- id: ref21
  title: Comprehensive molecular characterization of clear cell renal cell carcinoma
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Nature
  URL: "http://doi.org/10.1038/nature12222"
  DOI: "10.1038/nature12222"
  volume: 499
  number: 7456
  pages: 43-49
  issued:
    year: 2013        
          
- id: ref22
  title: Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma
  author: 
  - family: Cancer Genome Atlas Research Network and others
    given:
  journal: Cancer Cell
  URL: "http://dx.doi.org/10.1016/j.ccell.2016.04.002"
  DOI: "10.1016/j.ccell.2016.04.002"
  volume: 29
  pages: 43-49
  issued:
    year: 2016 

- id: ref23
  title: Complex heatmaps reveal patterns and correlations in multidimensional genomic data
  author: 
  - family: Gu, Zuguang and Eils, Roland and Schlesner, Matthias
  given:
  journal: Bioinformatics
  URL: "http://dx.doi.org/10.1016/j.ccell.2016.04.002"
  DOI: "10.1016/j.ccell.2016.04.002"
  pages: "btw313"
  issued:
   year: 2016 

- id: ref24
  title: "TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages"
  author: 
  - family:  Silva, TC and Colaprico, A and Olsen, C and D'Angelo, F and Bontempi, G and Ceccarelli, M and Noushmehr, H
  given:
  journal: F1000Research
  URL: "http://dx.doi.org/10.12688/f1000research.8923.1"
  DOI: "10.12688/f1000research.8923.1"
  volume: 5
  number: 1542
  issued:
   year: 2016 

- id: ref25
  title: "TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data"
  author: 
  - family:  Colaprico, Antonio and Silva, Tiago C. and Olsen, Catharina and Garofano, Luciano and Cava, Claudia and Garolini, Davide and Sabedot, Thais S. and Malta, Tathiane M. and Pagnotta, Stefano M. and Castiglioni, Isabella and Ceccarelli, Michele and Bontempi, Gianluca and Noushmehr, Houtan
  given:
  journal: Nucleic Acids Research
  URL: "http://dx.doi.org/10.1093/nar/gkv1507"
  DOI: "10.1093/nar/gkv1507"
  volume: 44
  number: 8
  pages: e71
  issued:
   year: 2016 

vignette: >
  %\VignetteIndexEntry{Working with TCGAbiolinks package}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
  
```{r setup, include=FALSE}
knitr::opts_chunk$set(dpi = 300)
knitr::opts_chunk$set(cache=FALSE)
```

```{r, echo = FALSE,hide=TRUE, message=FALSE,warning=FALSE}
devtools::load_all(".")
```

## Updates
Recently the TCGA data has been moved from the DCC server to 
The National Cancer Institute (NCI) Genomic Data Commons (GDC) Data Portal 
In this version of the package, we rewrote all the functions that were  acessing 
the old TCGA server to GDC.

The GDC, which receives, processes, harmonizes, and distributes clinical, biospecimen, and genomic data from multiple cancer research programs, has data from the  following programs:

*  The Cancer Genome Atlas (TCGA)
*  Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
*  the Cancer Genome Characterization Initiative (CGCI)

The big change is that the GDC data is harmonized against GRCh38. However, not all 
data has been harmonized yet. The old TCGA data can be acessed through GDC legacy Archive,
in which the majority of data can be found.

More information about the project can be found in [GCD FAQS](https://gdc.nci.nih.gov/about-gdc/gdc-faqs)

The functions `TCGAquery`, `TCGAdownload`, `TCGAPrepare`, `TCGAquery_maf`, 
`TCGAquery_clinical`, were replaced by 
`GDCquery`, `GDCdownload`, `GDCprepare`, `GDCquery_maf`, `GDCquery_clinical`.

And it can acess both the GDC and GDC Legacy Archive.

Note: Not all the examples in this vignette were updated.



## Introduction 

Motivation: The Cancer Genome Atlas (TCGA) provides us with an enormous collection of data sets,
not only spanning a large number of cancers but also a large number of experimental platforms. 
Even though the data can be accessed and downloaded from the database, the possibility to analyse 
these downloaded data directly in one single R package has not yet been available. 

TCGAbiolinks consists of three parts or levels. Firstly, we provide different options to query and 
download from TCGA relevant data from all currently platforms and their subsequent pre-processing
for commonly used bio-informatics (tools) packages in Bioconductor or CRAN. Secondly, the package 
allows to integrate different data types and it can be used for different types of analyses dealing 
with all platforms such as diff.expression, network inference or survival analysis, etc, and then 
it allows to visualize the obtained results. Thirdly we added a social level where a researcher 
can found a similar intereset in a bioinformatic community, and allows both to find a validation 
of results in literature in pubmed and also to retrieve questions and answers from site such as 
support.bioconductor.org, biostars.org, stackoverflow,etc.

This document describes how to search, download and analyze TCGA data using the
`TCGAbiolinks` package.


## Installation

To install use the code below.

```{r, eval = FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite("TCGAbiolinks")
```

For a Graphical User Interface, please see [TCGAbiolinksGUI](https://github.com/BioinformaticsFMRP/TCGAbiolinksGUI).
The GUI in under review and will soon be available in Bioconductor repository.

## Citation

Please cite TCGAbiolinks package: 

* "TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data." Nucleic acids research (2015): [gkv1507](http://dx.doi.org/doi:10.1093/nar/gkv1507). [@ref25]

Related publications to this package:

* "TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages". F1000Research [10.12688/f1000research.8923.1](http://dx.doi.org/doi:10.12688/f1000research.8923.1) [@ref24]

Also, if you have used ELMER analysis please cite:

*  Yao, L., Shen, H., Laird, P. W., Farnham, P. J., & Berman, B. P. "Inferring regulatory element landscapes and transcription factor networks from cancer methylomes." Genome Biol 16 (2015): 105.
* Yao, Lijing, Benjamin P. Berman, and Peggy J. Farnham. "Demystifying the secret mission of enhancers: linking distal regulatory elements to target genes." Critical reviews in biochemistry and molecular biology 50.6 (2015): 550-573.

# `GDCquery`: Searching TCGA open-access data  

## `GDCquery`: Searching GDC data for download 

You can easily search GDC data using the `GDCquery` function.

Using a summary of filters as used in the TCGA portal, the function works
with the following arguments:

* **project** A list of valid project (see table below)
* **data.category** A valid project (see list with getProjectSummary(project))
* **data.type** A data type to filter the files to download
* **sample.type** A sample type to filter the files to download (See table below)
* **workflow.type** GDC workflow type
* **barcode** A list of barcodes to filter the files to download
* **legacy** Search in the legacy repository? Default: FALSE
* **platform** Experimental data platform (HumanMethylation450, AgilentG4502A_07 etc). Used only for legacy repository
* **file.type** A string to filter files, based on its names. Used only for legacy repository

The next subsections will detail each of the search arguments.
Below, we show some search examples:
```{r, eval = FALSE}
#---------------------------------------------------------------
#  For available entries and combinations please se table below
#---------------------------------------------------------------

# Gene expression aligned against Hg38
query <- GDCquery(project = "TARGET-AML",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts")

# All DNA methylation data for TCGA-GBM and TCGA-GBM
query.met <- GDCquery(project = c("TCGA-GBM","TCGA-LGG"),
                      legacy = TRUE,
                      data.category = "DNA methylation",
                      platform = c("Illumina Human Methylation 450", "Illumina Human Methylation 27"))

# Using sample type to get only Primary solid Tumor samples and Solid Tissue Normal
query.mirna <- GDCquery(project = "TCGA-ACC", 
                        data.category = "Transcriptome Profiling", 
                        data.type = "miRNA Expression Quantification",
                        sample.type = c("Primary solid Tumor","Solid Tissue Normal"))

# Example Using legacy to accessing hg19 and filtering by barcode
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "DNA methylation", 
                  platform = "Illumina Human Methylation 27", 
                  legacy = TRUE,
                  barcode = c("TCGA-02-0047-01A-01D-0186-05","TCGA-06-2559-01A-01D-0788-05"))

# Gene expression aligned against hg19.
query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results",
                  experimental.strategy = "RNA-Seq",
                  barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                  legacy = TRUE)

# Searching idat file for DNA methylation
query <- GDCquery(project = "TCGA-OV",
                  data.category = "Raw microarray data",
                  data.type = "Raw intensities", 
                  experimental.strategy = "Methylation array", 
                  legacy = TRUE,
                  file.type = ".idat",
                  platform = "Illumina Human Methylation 450")

```

The list of projects is below:
```{r, eval = TRUE, echo = FALSE}
knitr::kable(getGDCprojects(), digits = 2,
             caption = "List of projects",row.names = FALSE)
```

The list of sample.type is below:
```{r, eval = TRUE, echo = FALSE}
knitr::kable(getBarcodeDefinition(), digits = 2,
             caption = "List sample types",row.names = FALSE)
```


The other fields (data.category, data.type, workflow.type, platform, file.type) can be found below. 
Please, not that these tables are still incomplete.

## Harmonized data

| data.category               | data.type                         |  workflow.type   | platform |
|-----------------------------|-----------------------------------|-----------------|----------|
| Transcriptome Profiling     | Gene Expression Quantification    | HTSeq - Counts  |          |
|                             |                                   | HTSeq - FPKM-UQ |          |
|                             |                                   | HTSeq - FPKM    |          |
|                             | Isoform Expression Quantification | -               |          |
|                             | miRNA Expression Quantification   | -               |          |
| Copy Number Variation       | Copy Number Segment               |                 |          |
|                             | Masked Copy Number Segment        |                 |          |
| Simple Nucleotide Variation |                                   |                 |          |
| Raw Sequencing Data         |                                   |                 |          |
| Biospecimen                 |                                   |                 |          |
| Clinical                    |                                   |                 |          |
| DNA Methylation             |                                   |                 | Illumina Human Methylation 450| 
|                             |                                   |                 | Illumina Human Methylation 27 |    

## Legacy data
| data.category               | data.type                         | platform                            | file.type          |
|-----------------------------|-----------------------------------|-------------------------------------|--------------------|
| Copy number variation       | -                                 | Affymetrix SNP Array 6.0            | nocnv_hg18.seg     |
|                             | -                                 | Affymetrix SNP Array 6.0            | hg18.seg           |
|                             | -                                 | Affymetrix SNP Array 6.0            | nocnv_hg19.seg     |
|                             | -                                 | Affymetrix SNP Array 6.0            | hg19.seg           |
|                             | -                                 | Illumina HiSeq                      | -                  |
| Simple nucleotide variation |                                   |                                     |                    |
| Raw sequencing data         |                                   |                                     |                    |
| Biospecimen                 |                                   |                                     |                    |
| Clinical                    |                                   |                                     |                    |
| Protein expression          |                                   | MDA RPPA Core                       | -                  |
| Gene expression             | Gene expression quantification    | Illumina HiSeq                      | normalized_results |
|                             |                                   | Illumina HiSeq                      | results            |
|                             |                                   | HT_HG-U133A                         | -                  |
|                             |                                   | AgilentG4502A_07_2                  | -                  |
|                             |                                   | AgilentG4502A_07_1                  | -                  |
|                             |                                   | HuEx-1_0-st-v2                      | FIRMA.txt          |
|                             |                                   |                                     | gene.txt           |
|                             | Isoform expression quantification |                                     |                    |
|                             | miRNA gene quantification         |                                     | hg19.mirna         |
|                             |                                   |                                     | hg19.mirbase20     |
|                             |                                   |                                     | mirna              |
|                             | Exon junction quantification      |                                     |                    |
|                             | Exon quantification               |                                     |                    |
|                             | miRNA isoform quantification      |                                     | hg19.isoform       |
|                             |                                   |                                     | isoform            |
| DNA methylation             |                                   | Illumina Human Methylation 450      | -                  |
|                             |                                   | Illumina Human Methylation 27       | -                  |
|                             |                                   | Illumina DNA Methylation OMA003 CPI | -                  |
|                             |                                   | Illumina DNA Methylation OMA002 CPI | -                  |
|                             |                                   | Illumina Hi Seq                     |                    |
| Raw microarray data         |  Raw intensities                  | Illumina Human Methylation 450      |  idat              |
|                             |                                   | Illumina Human Methylation 27       |  idat              |
| Other                       |                                   |                                     |                    |

## Working with mutation files.
### `GDCquery_maf`: Mutation Annotation Format (MAF) aligned against hg38

In order to download the Mutation Annotation Format (MAF) aligned against hg38, we provide the user
with the `GDCquery_maf` function. Briefly, it will get the open acess maf files from
[https://gdc-docs.nci.nih.gov/Data/Release_Notes/Data_Release_Notes/](https://gdc-docs.nci.nih.gov/Data/Release_Notes/Data_Release_Notes/).
Four separate variant calling pipelines are implemented for GDC data harmonization which are described [here](https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/)

The arguments that will be used to filter the data are:

* tumor: A TCGA project without the TCGA- prefix.
* save.csv: Save the maf as csv file
* pipelines: Four separate variant calling pipelines are implemented for GDC data harmonization. Options: muse, varscan2, somaticsniper, mutect. 

```{r, eval = FALSE}
acc.muse.maf <- GDCquery_Maf("ACC", pipelines = "muse")
acc.varscan2.maf <- GDCquery_Maf("ACC", pipelines = "varscan2")
acc.somaticsniper.maf <- GDCquery_Maf("ACC", pipelines = "somaticsniper")
acc.mutect.maf <- GDCquery_Maf("ACC", pipelines = "mutect")
```

This mutation data can be visualized through a oncoprint from the  [ComplexHeatmap](http://bioconductor.org/packages/ComplexHeatmap/) package [@ref23].

```{r, eval=FALSE, message=FALSE, warning=FALSE}
mut <- GDCquery_Maf("ACC", pipelines = "mutect2")
clin <- GDCquery_clinic("TCGA-ACC","clinical")
clin <- clin[,c("bcr_patient_barcode","disease","gender","tumor_stage","race","vital_status")]
TCGAvisualize_oncoprint(mut = mut, genes = mut$Hugo_Symbol[1:20],
                        filename = "oncoprint.pdf",
                        annotation = clin,
                        color=c("background"="#CCCCCC","DEL"="purple","INS"="yellow","SNP"="brown"),
                        rows.font.size=10,
                        width = 10,
                        heatmap.legend.side = "right",
                        dist.col = 0,
                        label.font.size = 10)
```

```{r, fig.width=6, fig.height=4, echo=FALSE, fig.align="center"}
library(png)
library(grid)
img <- readPNG("oncoprint.png")
grid.raster(img)
```

### Mutation Annotation Format (MAF) aligned against hg19

In order to download the Mutation Annotation Format (MAF) aligned against hg19,
the user will need to use `GDCquery` with `legacy = TRUE`, `GDCdownload` and `GDCprepare`.

```{r, message=FALSE, warning=FALSE}
query.maf.hg19 <- GDCquery(project = "TCGA-COAD", 
                           data.category = "Simple nucleotide variation", 
                           data.type = "Simple somatic mutation",
                           access = "open", 
                           legacy = TRUE)
# Check maf availables
knitr::kable(getResults(query.maf.hg19)[,c("created_datetime","file_name")]) 
```
```{r, eval = FALSE}
query.maf.hg19 <- GDCquery(project = "TCGA-COAD", 
                           data.category = "Simple nucleotide variation", 
                           data.type = "Simple somatic mutation",
                           access = "open", 
                           file.type = "gsc_COAD_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf",
                           legacy = TRUE)
GDCdownload(query.maf.hg19)
coad.mutect.maf <- GDCprepare(query.maf.hg19)
```

## `GDCquery_clinic` and `GDCprepare_clinic`: Working with clinical data.

In GDC database the clinical data can be retrieved in two sources:

* indexed clinical (from json)
* xml files

there are two main differences:

* XML has more information: radiation, drugs information, follow-ups, biospecimen, etc. So the indexed one is only a subset of the XML files
* The indexed data contains the updated data with the follow up informaiton. For example: if the patient is alive in the first time clinical data was collect and the in the next follow-up he is dead, the indexed data will show dead. The XML will have two fields, one for the first time saying he is alive (in the clinical part) and the follow-up saying he is dead. You can see this case here: https://gist.github.com/tiagochst/cfa384297ff0950a6489e3f4b296ecd7

```{r, fig.height=6, message=FALSE, warning=FALSE, include=FALSE}
clin.query <- GDCquery(project = "TCGA-BLCA", data.category = "Clinical", barcode = "TCGA-FD-A5C0")
 json  <- tryCatch(GDCdownload(clin.query), 
                   error = function(e) GDCdownload(clin.query, method = "client"))
clinical.patient <- GDCprepare_clinic(clin.query, clinical.info = "patient")
clinical.patient.followup <- GDCprepare_clinic(clin.query, clinical.info = "follow_up")
clinical.index <- GDCquery_clinic("TCGA-BLCA")
```

```{r, eval=FALSE, fig.height=6, message=FALSE, warning=FALSE, include=TRUE}
clin.query <- GDCquery(project = "TCGA-BLCA", data.category = "Clinical", barcode = "TCGA-FD-A5C0")
 json  <- tryCatch(GDCdownload(clin.query), 
                   error = function(e) GDCdownload(clin.query, method = "client"))
clinical.patient <- GDCprepare_clinic(clin.query, clinical.info = "patient")
clinical.patient.followup <- GDCprepare_clinic(clin.query, clinical.info = "follow_up")
clinical.index <- GDCquery_clinic("TCGA-BLCA")
```

```{r, echo=TRUE, message=FALSE, warning=FALSE, fig.height=6}
# Example of the second difference:
clinical.patient[,c("vital_status","days_to_death","days_to_last_followup")]
clinical.patient.followup[,c("vital_status","days_to_death","days_to_last_followup")]
# indexed data is equivalent to follow ups information
clinical.index[clinical.index$submitter_id=="TCGA-FD-A5C0",
               c("vital_status","days_to_death","days_to_last_follow_up")]
```


You can retrieve clinical data using the `GDCquery_clinic` function. 
This will get only the indexed GDC clinical data. This is the same as clicking on the
download clinical buttun in the data portal.
This means this function is not parsing the XML files, see `GDCprepare_clinic` function.

The arguments of this function are: 

* project: project_id ("TCGA-OV","TCGA-BRCA","TARGET-WT", etc)
* type: ("clinical","biospecimen")
* save.csv: save the clinical/biospeciment into a csv file. Pattern: project_type.csv


Examples of use:
```{r, eval = FALSE}
clin <- GDCquery_clinic("TCGA-ACC", type = "clinical", save.csv = TRUE)
clin <- GDCquery_clinic("TCGA-ACC", type = "biospecimen", save.csv = TRUE)
```

To parse the TCGA clinical XML files, please use `GDCprepare_clinic` function. It receives as
argument the query object and which clinical information we should retrieve.
The possible values for the argument `clinical.info` are:

* For clinical XML files:
    * admin
    * drug
    * follow_up
    * patient
    * new_tumor_event
    * radiation
    * stage_event
* For biospecimen XML files:
    * admin
    * aliquot
    * analyte 
    * bio_patient
    * portion
    * protocol
    * sample
    * slide
    
```{r, eval = FALSE}
query <- GDCquery(project = "TCGA-COAD", 
                  data.category = "Clinical", 
                  barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
clinical.drug <- GDCprepare_clinic(query, clinical.info = "drug")
clinical.radiation <- GDCprepare_clinic(query, clinical.info = "radiation")
clinical.admin <- GDCprepare_clinic(query, clinical.info = "admin")

query <- GDCquery(project = "TCGA-COAD", 
                  data.category = "Biospecimen", 
                  barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
GDCdownload(query)
clinical.admin <- GDCprepare_clinic(query, clinical.info = "admin")
clinical.sample <- GDCprepare_clinic(query, clinical.info = "sample")
clinical.slide <- GDCprepare_clinic(query, clinical.info = "slide")
clinical.portion <- GDCprepare_clinic(query, clinical.info = "portion")
```

To get all the information for TGCA samples you can use the script below:
```{r, eval = FALSE}
# This code will get all clinical indexed data from TCGA
library(TCGAbiolinks)
library(data.table)
clinical <- TCGAbiolinks:::getGDCprojects()$project_id %>% 
            regexPipes::grep("TCGA",value=T) %>% 
            sort %>% 
            plyr::alply(1,GDCquery_clinic, .progress = "text") %>% 
            rbindlist
readr::write_csv(clinical,path = paste0("all_clin_indexed.csv"))

# This code will get all clinical XML data from TCGA
getclinical <- function(proj){
    message(proj)
    while(1){
        result = tryCatch({
            query <- GDCquery(project = proj, data.category = "Clinical")
            GDCdownload(query)
            clinical <- GDCprepare_clinic(query, clinical.info = "patient")
            for(i in c("admin","radiation","follow_up","drug","new_tumor_event")){
                message(i)
                aux <- GDCprepare_clinic(query, clinical.info = i)
                if(is.null(aux)) next
                # add suffix manually if it already exists
                replicated <- which(grep("bcr_patient_barcode",colnames(aux), value = T,invert = T) %in% colnames(clinical))
                colnames(aux)[replicated] <- paste0(colnames(aux)[replicated],".",i)
                if(!is.null(aux)) clinical <- merge(clinical,aux,by = "bcr_patient_barcode", all = TRUE)
            }
            readr::write_csv(clinical,path = paste0(proj,"_clinical_from_XML.csv")) # Save the clinical data into a csv file
            return(clinical)
        }, error = function(e) {
            message(paste0("Error clinical: ", proj))
        })
    }
}
clinical <- TCGAbiolinks:::getGDCprojects()$project_id %>% 
    regexPipes::grep("TCGA",value=T) %>% sort %>% 
    plyr::alply(1,getclinical, .progress = "text") %>% 
    rbindlist(fill = TRUE) %>% setDF %>% subset(!duplicated(clinical))
readr::write_csv(clinical,path = "all_clin_XML.csv")
# result: https://drive.google.com/open?id=0B0-8N2fjttG-WWxSVE5MSGpva1U
# Obs: this table has multiple lines for each patient, as the patient might have several followups, drug treatments,
# new tumor events etc...
```


Also, some functions to work with clinical data are provided.

For example the function `TCGAquery_SampleTypes` will filter barcodes based on a
type the argument typesample. 
Some of the typesamples possibilities are: 
TP (PRIMARY SOLID TUMOR),
TR (RECURRENT SOLID TUMOR),
NT (Solid Tissue Normal) etc. 
Please, see ?TCGAquery_SampleTypes for all the possible values.

The function `TCGAquery_MatchedCoupledSampleTypes` will filter the samples that 
have all the typesample provided as argument. For example, if TP and TR are set 
as typesample, the function will return the barcodes of a patient if it has both types.
So, if it has a TP, but not a  TR, no barcode will be returned. If it has a TP and a TR
both barcodes are returned.

An example of the function is below:

```{r, eval = TRUE}
bar <- c("TCGA-G9-6378-02A-11R-1789-07", "TCGA-CH-5767-04A-11R-1789-07",  
         "TCGA-G9-6332-60A-11R-1789-07", "TCGA-G9-6336-01A-11R-1789-07",
         "TCGA-G9-6336-11A-11R-1789-07", "TCGA-G9-7336-11A-11R-1789-07",
         "TCGA-G9-7336-04A-11R-1789-07", "TCGA-G9-7336-14A-11R-1789-07",
         "TCGA-G9-7036-04A-11R-1789-07", "TCGA-G9-7036-02A-11R-1789-07",
         "TCGA-G9-7036-11A-11R-1789-07", "TCGA-G9-7036-03A-11R-1789-07",
         "TCGA-G9-7036-10A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07",
         "TCGA-BH-A1F0-10A-11R-1789-07", "TCGA-BH-A0BZ-02A-11R-1789-07",
         "TCGA-B6-A0WY-04A-11R-1789-07", "TCGA-BH-A1FG-04A-11R-1789-08",
         "TCGA-D8-A1JS-04A-11R-2089-08", "TCGA-AN-A0FN-11A-11R-8789-08",
         "TCGA-AR-A2LQ-12A-11R-8799-08", "TCGA-AR-A2LH-03A-11R-1789-07",
         "TCGA-BH-A1F8-04A-11R-5789-07", "TCGA-AR-A24T-04A-55R-1789-07",
         "TCGA-AO-A0J5-05A-11R-1789-07", "TCGA-BH-A0B4-11A-12R-1789-07",
         "TCGA-B6-A1KN-60A-13R-1789-07", "TCGA-AO-A0J5-01A-11R-1789-07",
         "TCGA-AO-A0J5-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
         "TCGA-G9-6380-11A-11R-1789-07", "TCGA-G9-6380-01A-11R-1789-07",
         "TCGA-G9-6340-01A-11R-1789-07", "TCGA-G9-6340-11A-11R-1789-07")

S <- TCGAquery_SampleTypes(bar,"TP")
S2 <- TCGAquery_SampleTypes(bar,"NB")

# Retrieve multiple tissue types  NOT FROM THE SAME PATIENTS
SS <- TCGAquery_SampleTypes(bar,c("TP","NB"))

# Retrieve multiple tissue types  FROM THE SAME PATIENTS
SSS <- TCGAquery_MatchedCoupledSampleTypes(bar,c("NT","TP"))
```


## `TCGAquery_subtype`: Working with molecular subtypes data.

The Cancer Genome Atlas (TCGA) Research Network has reported integrated genome-wide 
studies of various diseases. We have added some of the subtypes defined by these
report in our package. The ACC[@ref22], BRCA [@ref11], COAD [@ref12], GBM [@ref8], HNSC [@ref14], KICH [@ref15], KIRC[@ref21], KIRP [@ref20], LGG [@ref8], LUAD [@ref9], 
LUSC[@ref16], PRAD[@ref19], READ [@ref12],  SKCM [@ref13], STAD [@ref10], THCA [@ref18], UCEC [@ref17] tumors have data added. 
These subtypes will be automatically added in the summarizedExperiment
object through TCGAprepare. But you can also use the `TCGAquery_subtype` function
to retrive this information.

```{r, eval = FALSE}
# Check with subtypes from TCGAprepare and update examples
GBM_path_subtypes <- TCGAquery_subtype(tumor = "gbm")

LGG_path_subtypes <- TCGAquery_subtype(tumor = "lgg")
```

A subset of the lgg subytpe is shown below:

```{r, eval = TRUE, echo = FALSE}
knitr::kable(lgg.gbm.subtype[1:10,c(1,2,3,4)], digits = 2, 
             caption = "Table with LGG molecular subtypes from TCGAquery_subtype",
             row.names = TRUE)
```


# `GDCdownload`: Downloading GDC data  
You can easily download data using the `GDCdownload` function.
It uses GDC transfer tool to download gdc data doing a system call.
For this reason some times the update will stop to show, which does not means that 
the download process has stopped. Once the process has finished it will give a signal to R.

The data from query will be save in a folder: project/source/data.category (where source is harmonized or legacy)

The arguments are:

*  **query**	A query for `GDCquery` function
*  **token.file** A token file to download protect data (not test yet)
*  **method** Use API (POST method) or gdc client tool (API is faster, but the data might get corrupted in the download, and it might need to be executed again). Options: "api", "client"
*  **directory** Directory/Folder where the data was downloaded. Default: GDCdata
*  **chunks.per.download**  This will make the API method only download n (chunks.per.download) files at a time. This may reduce the download problems when the data size is too large.

### `GDCdownload`: Example of use
```{r, eval = FALSE}
query <- GDCquery(project = "TCGA-ACC", data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment",
                  barcode = c( "TCGA-OR-A5KU-01A-11D-A29H-01", "TCGA-OR-A5JK-01A-11D-A29H-01"))
GDCdownload(query)
data <- GDCprepare(query)

#-------------------------------------- 
# Gene expression
#-------------------------------------- 
# Aligned against Hg38
# mRNA pipeline: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
query.exp.hg38 <- GDCquery(project = "TCGA-GBM", 
                  data.category = "Transcriptome Profiling", 
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ",
                  barcode =  c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
expdat <- GDCprepare(query = query.exp.hg38,
                     save = TRUE, 
                     save.filename = "exp.rda")

# Aligned against Hg19
query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results",
                  experimental.strategy = "RNA-Seq",
                  barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                  legacy = TRUE)
GDCdownload(query.exp.hg19)
data <- GDCprepare(query.exp.hg19)


#-------------------------------------- 
# DNA methylation data
#-------------------------------------- 
# DNA methylation aligned to hg38
query_met.hg38 <- GDCquery(project= "TCGA-LGG", 
                           data.category = "DNA Methylation", 
                           platform = "Illumina Human Methylation 450", 
                           barcode = c("TCGA-HT-8111-01A-11D-2399-05","TCGA-HT-A5R5-01A-11D-A28N-05"))
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38)

# DNA methylation aligned to hg19
query_meth.hg19 <- GDCquery(project= "TCGA-LGG", 
                            data.category = "DNA methylation", 
                            platform = "Illumina Human Methylation 450", 
                            barcode = c("TCGA-HT-8111-01A-11D-2399-05","TCGA-HT-A5R5-01A-11D-A28N-05"), 
                            legacy = TRUE)
GDCdownload(query_meth.hg19)
data.hg19 <- GDCprepare(query_meth.hg19)

# A function to download only 20 samples
legacyPipeline <- function(project, data.category, platform, n = 20){
    query <- GDCquery(project = project,
                      data.category = data.category,
                      platform = platform,
                      legacy = TRUE)
    cases <- getResults(query, rows = 1:n, cols="cases") # Get two samples from the search
    query <- GDCquery(project = project,
                      data.category = data.category,
                      platform = platform,
                      legacy = TRUE,
                      barcode = cases)
    GDCdownload(query,chunks.per.download = 5)
    data <- GDCprepare(query)
    return(data)
}
# DNA methylation
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina Human Methylation 27")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina Human Methylation 450")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA003 CPI")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA002 CPI")

#-------------------------------------------------------
# Example to  idat files from TCGA projects
#-------------------------------------------------------
projects <- TCGAbiolinks:::getGDCprojects()$project_id
projects <- projects[grepl('^TCGA',projects,perl=T)]
match.file.cases.all <- NULL
for(proj in projects){
    print(proj)
    query <- GDCquery(project = proj,
                      data.category = "Raw microarray data",
                      data.type = "Raw intensities", 
                      experimental.strategy = "Methylation array", 
                      legacy = TRUE,
                      file.type = ".idat",
                      platform = "Illumina Human Methylation 450")
    match.file.cases <- getResults(query,cols=c("cases","file_name"))
    match.file.cases$project <- proj
    match.file.cases.all <- rbind(match.file.cases.all,match.file.cases)
    tryCatch(GDCdownload(query, method = "api",chunks.per.download = 20),
             error = function(e) GDCdownload(query, method = "client"))
}
# This will create a map between idat file name, cases (barcode) and project
readr::write_tsv(match.file.cases.all, path =  "idat_filename_case.txt")
# code to move all files to local folder
for(file in dir(".",pattern = ".idat", recursive = T)){
    TCGAbiolinks::move(file,basename(file))
}
```

# `GDCprepare`: Preparing the data

This function is still under development, it is not working for all cases. See the tables below with the status.
Examples of query, download, prepare can be found in this [gist](https://gist.github.com/tiagochst/a701bad3fa3800ade7063760755e0aad).

## Harmonized data

| Data.category               | Data.type                         | Workflow Type   | Status                                                                      |
|-----------------------------|-----------------------------------|-----------------|-----------------------------------------------------------------------------|
| Transcriptome Profiling     | Gene Expression Quantification    | HTSeq - Counts  | Data frame or SE (losing 5% of information when mapping to genomic regions) |
|                             |                                   | HTSeq - FPKM-UQ | Returning only a (losing 5% of information when mapping to genomic regions) |
|                             |                                   | HTSeq - FPKM    | Returning only a (losing 5% of information when mapping to genomic regions) |
|                             | Isoform Expression Quantification | Not needed      |                                                                             |
|                             | miRNA Expression Quantification   | Not needed      | Returning only a dataframe for  the moment                                  |
| Copy number variation       | Copy Number Segment               |                 | Returning only a dataframe for  the moment                                  |
|                             | Masked Copy Number Segment        |                 | Returning only a dataframe for  the moment                                  |
| Simple Nucleotide Variation |                                   |                 |                                                                             |
| Raw Sequencing Data         |                                   |                 |                                                                             |
| Biospecimen                 |                                   |                 |                                                                             |
| Clinical                    |                                   |                 |                                                                             |

## Legacy data
| Data.category               | Data.type                         | Platform                            | file.type          | Status          |
|-----------------------------|-----------------------------------|-------------------------------------|--------------------|-----------------|
| Transcriptome Profiling     |                                   |                                     |                    |                 |
| Copy number variation       | -                                 | Affymetrix SNP Array 6.0            | nocnv_hg18.seg     | Working         |
|                             | -                                 | Affymetrix SNP Array 6.0            | hg18.seg           | Working         |
|                             | -                                 | Affymetrix SNP Array 6.0            | nocnv_hg19.seg     | Working         |
|                             | -                                 | Affymetrix SNP Array 6.0            | hg19.seg           | Working         |
|                             | -                                 | Illumina HiSeq                      | Several            | Working         |
| Simple Nucleotide Variation | Simple somatic mutation           |                                     |                    |                 |
| Raw Sequencing Data         |                                   |                                     |                    |                 |
| Biospecimen                 |                                   |                                     |                    |                 |
| Clinical                    |                                   |                                     |                    |                 |
| Protein expression          |                                   | MDA RPPA Core                       | -                  | Working         |
| Gene expression             | Gene expression quantification    | Illumina HiSeq                      | normalized_results | Working         |
|                             |                                   | Illumina HiSeq                      | results            | Working         |
|                             |                                   | HT_HG-U133A                         | -                  | Working         |
|                             |                                   | AgilentG4502A_07_2                  | -                  | Data frame only |
|                             |                                   | AgilentG4502A_07_1                  | -                  | Data frame only |
|                             |                                   | HuEx-1_0-st-v2                      | FIRMA.txt          | Not Preparing   |
|                             |                                   |                                     | gene.txt           | Not Preparing   |
|                             | Isoform expression quantification |                                     |                    |                 |
|                             | miRNA gene quantification         |                                     |                    |                 |
|                             | Exon junction quantification      |                                     |                    |                 |
|                             | Exon quantification               |                                     |                    |                 |
|                             | miRNA isoform quantification      |                                     |                    |                 |
|                             |                                   |                                     |                    |                 |
| DNA methylation             |                                   | Illumina Human Methylation 450      | Not used           | Working         |
|                             |                                   | Illumina Human Methylation 27       | Not used           | Working         |
|                             |                                   | Illumina DNA Methylation OMA003 CPI | Not used           | Working         |
|                             |                                   | Illumina DNA Methylation OMA002 CPI | Not used           | Working         |
|                             |                                   | Illumina Hi Seq                     |                    | Not  working    |
| Raw Microarray Data         |                                   |                                     |                    |                 |
| Structural Rearrangement    |                                   |                                     |                    |                 |
| Other                       |                                   |                                     |                    |                 |

You can easily read the downloaded data using the `GDCprepare` function.
This function will prepare the data into a 
[SummarizedExperiment](http://www.nature.com/nmeth/journal/v12/n2/abs/nmeth.3252.html) 
[@ref1] object for downstream analysis. 
For the moment this function works only with data level 3.


The arguments are:

* **query**	Data frame as the one returned from TCGAquery
* **save**	Save a rda object with the prepared object? Default: FALSE
* **save.filename** Name of the rda object that will be saved if `save` is `TRUE`
* **summarizedExperiment** Should the output be a SummarizedExperiment object? Default: `TRUE`
* **directory** Directory/Folder where the data was downloaded. Default: GDCdata
* **remove.files.prepared** Remove the files read? Default: FALSE. This argument will be considered only if save argument is set to true
* **add.gistic2.mut** 	If a list of genes (gene symbol) is given columns with gistic2 results from GDAC firehose (hg19) and a column indicating if there is or not mutation in that gene (hg38) (TRUE or FALSE - use the maf file for more information) will be added to the sample matrix in the summarized Experiment object

```{r, eval = FALSE}
library(TCGAbiolinks)
# Downloading and prepare 
query <- GDCquery(project = "TARGET-AML", 
                  data.category = "Transcriptome Profiling", 
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ")
GDCdownload(query)
data <- GDCprepare(query)

# Downloading and prepare using legacy
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Protein expression",
                  legacy = TRUE, 
                  barcode = c("TCGA-OX-A56R-01A-21-A44T-20","TCGA-08-0357-01A-21-1898-20"))
GDCdownload(query)
data <- GDCprepare(query, save = TRUE, 
                   save.filename = "gbmProteinExpression.rda",
                   remove.files.prepared = TRUE)

# Downloading and prepare using legacy
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "DNA methylation", 
                  platform = "Illumina Human Methylation 27",legacy = TRUE,
                  barcode = c("TCGA-02-0047-01A-01D-0186-05","TCGA-06-2559-01A-01D-0788-05"))
GDCdownload(query)
data <- GDCprepare(query, add.gistic2.mut = c("PTEN","FOXJ1"))

# To view gistic and mutation information please access the samples information matrix in the summarized Experiment object
library(SummarizedExperiment)
samples.information <- colData(data)

```

# `TCGAanalyze`: Analyze data from TCGA. 
You can easily analyze data using following functions:

## `TCGAanalyze_Preprocessing`: Preprocessing of Gene Expression data (IlluminaHiSeq_RNASeqV2)

You can easily search TCGA samples, download and prepare a matrix of gene expression.
```{r, eval = FALSE}
# You can define a list of samples to query and download providing relative TCGA barcodes.
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
                 "TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
                 "TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
                 "TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
                 "TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

# Query platform Illumina HiSeq with a list of barcode 
query <- GDCquery(project = "TCGA-BRCA", 
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = "Illumina HiSeq",
                  file.type = "results",
                  barcode = listSamples, 
                  legacy = TRUE)

# Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
GDCdownload(query)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCARnaseqSE <- GDCprepare(query)

BRCAMatrix <- assay(BRCARnaseqSE,"raw_count") # or BRCAMatrix <- assay(BRCARnaseqSE,"raw_count")

# For gene expression if you need to see a boxplot correlation and AAIC plot to define outliers you can run
BRCARnaseq_CorOutliers <- TCGAanalyze_Preprocessing(BRCARnaseqSE)
```

The result is shown below:

```{r, eval = TRUE, echo = FALSE,size = 8}
library(TCGAbiolinks)
dataGE <- dataBRCA[sample(rownames(dataBRCA),10),sample(colnames(dataBRCA),7)]

knitr::kable(dataGE[1:10,2:3], digits = 2, 
             caption = "Example of a matrix of gene expression (10 genes in rows and 2 samples in columns)",
             row.names = TRUE)
```

The result from TCGAanalyze_Preprocessing is shown below:
```{r, fig.width=6, fig.height=4, echo=FALSE, fig.align="center"}
library(png)
library(grid)
img <- readPNG("PreprocessingOutput.png")
grid.raster(img)
```

## `TCGAanalyze_DEA` `&` `TCGAanalyze_LevelTab`: Differential expression analysis (DEA)

Perform DEA (Differential expression analysis) to identify differentially expressed genes (DEGs) using the `TCGAanalyze_DEA` function. 

`TCGAanalyze_DEA` performs DEA using following functions from R \Biocpkg{edgeR}:

1. edgeR::DGEList converts the count matrix into an edgeR object.
2. edgeR::estimateCommonDisp each gene gets assigned the same dispersion estimate.
3. edgeR::exactTest performs pair-wise tests for differential expression between two groups.
4. edgeR::topTags takes the output from exactTest(), adjusts the raw p-values using the False Discovery Rate (FDR) correction, and returns the top differentially expressed genes.

This function receives as arguments:

* **mat1** The matrix of the first group (in the example, group 1 is the normal samples), 
* **mat2** The matrix of the  second group (in the example, group 2 is tumor samples)
* **Cond1type** Label for group 1
* **Cond1type** Label for group 2

Next, we filter the output of dataDEGs by abs(LogFC) >=1, and uses the
`TCGAanalyze_LevelTab` function to create a table with DEGs (differentially expressed genes), log Fold Change (FC), false discovery rate (FDR), the gene expression level for samples in Cond1type, and Cond2type, and Delta value (the difference of gene expression between the two conditions multiplied logFC).

```{r, eval = FALSE}
# Downstream analysis using gene expression data  
# TCGA samples from IlluminaHiSeq_RNASeqV2 with type rsem.genes.results
# save(dataBRCA, geneInfo , file = "dataGeneExpression.rda")
library(TCGAbiolinks)

# normalization of genes
dataNorm <- TCGAanalyze_Normalization(tabDF = dataBRCA, geneInfo =  geneInfo)

# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)

# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
                                   typesample = c("NT"))

# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), 
                                   typesample = c("TP"))

# Diff.expr.analysis (DEA)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,samplesNT],
                            mat2 = dataFilt[,samplesTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")

# DEGs table with expression values in normal and tumor samples
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs,"Tumor","Normal",
                                          dataFilt[,samplesTP],dataFilt[,samplesNT])

```
The result is shown below:

```{r, eval = TRUE, echo = FALSE}
library(TCGAbiolinks)
dataDEGsFiltLevel$FDR <- format(dataDEGsFiltLevel$FDR, scientific = TRUE)
knitr::kable(dataDEGsFiltLevel[1:10,], digits = 2,
             caption = "Table of DEGs after DEA", row.names = FALSE)
```

## `TCGAanalyze_EAcomplete & TCGAvisualize_EAbarplot`: Enrichment Analysis

Researchers, in order to better understand the underlying biological processes, 
often want to retrieve a functional profile of a set of genes that might 
have an important role. This can be done by performing an enrichment analysis. 

We will perform an enrichment analysis on gene sets using the `TCGAanalyze_EAcomplete` 
function. 
Given a set of genes that are up-regulated under certain conditions, 
an enrichment analysis will identify classes of genes or proteins that 
are over-represented using annotations for that gene set.


To view the results you can use the `TCGAvisualize_EAbarplot` function as shown below.

```{r, eval = FALSE}
library(TCGAbiolinks)
# Enrichment Analysis EA
# Gene Ontology (GO) and Pathway enrichment by DEGs list
Genelist <- rownames(dataDEGsFiltLevel)

system.time(ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist))

# Enrichment Analysis EA (TCGAVisualize)
# Gene Ontology (GO) and Pathway enrichment barPlot

TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP), 
                        GOBPTab = ansEA$ResBP,
                        GOCCTab = ansEA$ResCC,
                        GOMFTab = ansEA$ResMF,
                        PathTab = ansEA$ResPat,
                        nRGTab = Genelist, 
                        nBar = 10)

```
The result is shown below:
```{r, fig.width=6, fig.height=4, echo=FALSE, fig.align="center"}
library(png)
library(grid)
img <- readPNG("EAplot.png")
grid.raster(img)
```


The figure shows canonical pathways significantly overrepresented (enriched) by the DEGs 
(differentially expressed genes).
The most statistically significant canonical pathways identified 
in DEGs list are listed according to their p value corrected FDR (-Log) (colored bars) 
and the ratio of list genes found in each pathway over the total number of 
genes in that pathway (Ratio, red line).

## `TCGAanalyze_survival`: Survival Analysis: Cox Regression and dnet package
When analyzing survival, different problems come up than the ones discussed so far. 
One question is how do we deal with subjects dropping out of a
study. For example, assuming that we test a new cancer drug. While some subjects
die, others may believe that the new drug is not effective, and decide to drop out
of the study before the study is finished. A similar problem would be faced when
we investigate how long a machine lasts before it breaks down.

Using the clinical data, it is possible to create a survival plot with the 
function `TCGAanalyze_survival` as follows:

```{r, eval = FALSE}
clin.gbm <- GDCquery_clinic("TCGA-GBM", "clinical")
TCGAanalyze_survival(clin.gbm,
                     "gender",
                     main = "TCGA Set\n GBM",height = 10, width=10)
```

The arguments of `TCGAanalyze_survival` are:

*  **clinical_patient** TCGA Clinical patient with the information days_to_death
*  **clusterCol** Column with groups to plot. This is a mandatory field, 
the caption will be based in this column
*  **legend** Legend title of the figure
*  **xlim** xlim x axis limits e.g. xlim = c(0, 1000). Present narrower X axis, but not affect survival estimates. 
*  **main**	 main title of the plot
*  **ylab**	y-axis text of the plot
*  **xlab** x-axis text of the plot
*  **filename**	 The name of the pdf file
*  **color** Define the colors of the lines.
*  **pvalue** Show pvalue in the plot. 
*  **risk.table** 	Show or not the risk table
*  **conf.int** 	Show confidence intervals for point estimaes of survival curves.

The result is shown below:
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("case2_surv.png")
grid.raster(img)
```


```{r, eval = FALSE}
library(TCGAbiolinks)
# Survival Analysis SA

clinical_patient_Cancer <- GDCquery_clinic("TCGA-BRCA","clinical")
dataBRCAcomplete <- log2(BRCA_rnaseqv2)

tokenStop<- 1

tabSurvKMcomplete <- NULL

for( i in 1: round(nrow(dataBRCAcomplete)/100)){
    message( paste( i, "of ", round(nrow(dataBRCAcomplete)/100)))
    tokenStart <- tokenStop
    tokenStop <-100*i
    tabSurvKM<-TCGAanalyze_SurvivalKM(clinical_patient_Cancer,
                                      dataBRCAcomplete,
                                      Genelist = rownames(dataBRCAcomplete)[tokenStart:tokenStop],
                                      Survresult = F,
                                      ThreshTop=0.67,
                                      ThreshDown=0.33)
    
    tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}

tabSurvKMcomplete <- tabSurvKMcomplete[tabSurvKMcomplete$pvalue < 0.01,]
tabSurvKMcomplete <- tabSurvKMcomplete[order(tabSurvKMcomplete$pvalue, decreasing=F),]

tabSurvKMcompleteDEGs <- tabSurvKMcomplete[
    rownames(tabSurvKMcomplete) %in% dataDEGsFiltLevel$mRNA,
    ]
```
The result is shown below:
```{r, fig.width=6, fig.height=4, echo=FALSE, fig.align="center"}
tabSurvKMcompleteDEGs$pvalue <- format(tabSurvKMcompleteDEGs$pvalue, scientific = TRUE)
knitr::kable(tabSurvKMcompleteDEGs[1:5,1:4], 
             digits = 2,
             caption = "Table KM-survival genes after SA",
             row.names = TRUE)
knitr::kable(tabSurvKMcompleteDEGs[1:5,5:7], 
             digits = 2,
             row.names = TRUE)
```

## `TCGAanalyze_DMR`: Differentially methylated regions Analysis

We will search for differentially methylated CpG sites using the `TCGAanalyze_DMR`
function. In order to find these regions we use the beta-values 
(methylation values ranging from 0.0 to 1.0) to compare two groups. 

First, it calculates the difference between the mean DNA methylation of each group
for each probe. 

Second, it test for differential expression between two groups using the wilcoxon test 
adjusting by the Benjamini-Hochberg method. The default arguments was set to
require a minimum absolute beta-values difference of 0.2 and am adjusted p-value
of < 0.01. 

After these tests, we save a volcano plot (x-axis:diff mean methylation,
y-axis: statistical significance) that will help the user identify the differentially
methylated CpG sites, then the results are saved in a csv file (DMR_results.groupCol.group1.group2.csv) and finally the object is returned with the calculus in the rowRanges.

The arguments of TCGAanalyze_DMR are:

* **data** SummarizedExperiment obtained from the GDCprepare
* **groupCol**  Columns with the groups inside the SummarizedExperiment
object. (This will be obtained by the function colData(data)
* **group1 ** In case our object has more than 2 groups, you should set
the name of the group
* **group2 ** In case our object has more than 2 groups, you should set
the name of the group
* **plot.filename ** pdf filename. Default: volcano.pdf
* **legend ** Legend title
* **color **vector of colors to be used in graph
* **title **main title. If not specified it will be
"Volcano plot (group1 vs group2)
* **ylab **y axis text
* **xlab **x axis text
* **xlim **x limits to cut image
* **ylim **y limits to cut image
* **label **vector of labels to be used in the figure.
Example: c("Not Significant", "Hypermethylated in group1", 
"Hypomethylated in group1"))
* **p.cut** p values threshold. *Default: 0.01*
* **diffmean.cut** diffmean threshold. *Default: 0.2*
* **adj.method** Adjusted method for the p-value calculation
* **paired** Wilcoxon paired parameter. *Default: FALSE*
* **overwrite** Overwrite the pvalues and diffmean values if already in the object
for both groups? *Default: FALSE*
* **save **save the object with the results?
* **cores **use multiple cores for non-parametric test
* **filename ** Name of the rda file to save the object 

```{r, eval = FALSE}
data <- TCGAanalyze_DMR(data, groupCol = "methylation_subtype",
                        group1 = "CIMP.H",
                        group2="CIMP.L",
                        p.cut = 10^-5,
                        diffmean.cut = 0.25,
                        legend = "State",
                        plot.filename = "coad_CIMPHvsCIMPL_metvolcano.png")
```

The output will be a plot such as the figure below. The green dots are the 
probes that are hypomethylated in group 2 compared to group 1, 
while the red dots are the hypermethylated probes in group 2 compared to group 1


```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("figure5met.png")
grid.raster(img)
```


Also, the `TCGAanalyze_DMR` 
function will save the plot as pdf and return the same SummarizedExperiment
that was given as input with the values of p-value,
p-value adjusted, diffmean and the group it belongs in the graph 
(non significant, hypomethylated, hypermethylated) in the rowRanges. 
The collumns will be (where group1 and group2 are the names of the groups): 

* diffmean.group1.group2 (mean.group2 - mean.group1) 
* diffmean.group2.group1 (mean.group1 - mean.group2)
* p.value.group1.group2
* p.value.adj.group1.group2
* status.group1.group2 (Status of probes in group2 in relation to group1) 
* status.group2.group1 (Status of probes in group1 in relation to group2) 

This values can be view/acessed using the `rowRanges` 
acessesor (`rowRanges(data)`).

**Observation:** Calling the same function again, with the same arguments 
will only plot the results,  as it was already calculated. 
If you want to have them recalculated, please set
`overwrite` to `TRUE` or remove the calculated collumns.

# `TCGAvisualize`: Visualize results from analysis functions with TCGA's data. 
You can easily visualize results from some following functions:

## `TCGAvisualize_Heatmap`: Create heatmaps with cluster bars

In order to have a better view of clusters, we normally use heatmaps.
`TCGAvisualize_Heatmap` will plot a heatmap and add to each sample bars representing
different features. This function is a wrapper to the package 
[ComplexHeatmap](http://bioconductor.org/packages/ComplexHeatmap/)  package,

The arguments of this function are:


* **data**	The object with the heatmap data (expression, methylation)
* **col.metadata** Metadata for the columns (patients). It should have the column bcr_patient_barcode or patient or ID with the patients barcodes.
* **row.metadata** Metadata for the rows genes (expression) or probes (methylation)
* **col.colors ** A list of names colors
* **row.colors ** A list of named colors
* **show_column_names **	Show column names names? Dafault: FALSE
* **show_row_names ** Show row names? Dafault: FALSE
* **cluster_rows ** Cluster rows ? Dafault: FALSE
* **cluster_columns** Cluster columns ? Dafault: FALSE
* **sortCol** Name of the column to be used to sort the columns
* **title** Title of the plot
* **type** Select the colors of the heatmap values. Possible values are "expression" (default), "methylation"
* **scale**	Use z-score to make the heamat? If we want to show differences between genes, it is good to make Z-score by samples (force each sample to have zero mean and standard deviation=1). If we want to show differences between samples, it is good to make Z-score by genes (force each gene to have zero mean and standard deviation=1). Possibilities: "row", "col. Default "none"


For more information please take a look on case study #2.

The result is shown below:
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("case2_Heatmap.png")
grid.raster(img)
```

## `TCGAvisualize_Volcano`: Create volcano plot 

Creates a volcano plot for DNA methylation or expression

The arguments of this function are:

* **x** x-axis data
* **y** y-axis data
* **y.cut** p-values threshold. Default: 0.01
* **x.cut**  x-axis threshold. Default: 0.0
* **filename** Filename. Default: volcano.pdf, volcano.svg, volcano.png
* **legend** Legend title
* **color** vector of colors to be used in graph
* **title** main title. If not specified it will be Volcano plot (group1 vs group2)
* **ylab** y axis text
* **xlab** x axis text
* **xlim** x limits to cut image
* **ylim** y limits to cut image
* **height** Figure height
* **width** Figure width
* **names** Names to be ploted if significant. Should be the same size of x and y
* **names.fill** Names should be filled in a color box?  Default: TRUE
* **names.size** Size of the names text
* **dpi** Figure dpi
* **label** vector of labels to be used in the figure. Example: c("Not Significant","Hypermethylated in group1", "Hypomethylated in group1"))
* **highlight** List of genes/probes to be highlighted. It should be in the names argument.
* **highlight.color** Color of the points highlighted
* **show.names** What names will be showd? Possibilities: "both", "significant", "highlighted"

For more information please take a look on case study #3.

## `TCGAvisualize_oncoprint`

The visualize the MAF files we created a function called `TCGAvisualize_oncoprint` that
uses the [ComplexHeatmap](http://bioconductor.org/packages/ComplexHeatmap/) to create an oncoprint.

* **mut** Mutation file (see TCGAquery_maf from TCGAbiolinks)
* **genes** Gene list
* **filename** name of the pdf
* **color** named vector for the plot
* **height** pdf height
* **rm.empty.columns** If there is no alteration in that sample, whether remove it on the oncoprint
* **show.row.barplot**  Show barplot annotation on rows?
* **show.column.names** Show column names? Default: FALSE
* **rows.font.size** Size of the fonts
* **labels.font.size** Size of the fonts
* **annotation** Matrix or data frame with the annotation. Should have a column bcr_patient_barcode with the same ID of the mutation object
* **annotation.position** Position of the annotation "bottom" or "top"
* **label.title** Title of the label

```{r, eval=FALSE, message=FALSE, warning=FALSE}
mut <- GDCquery_Maf(tumor = "ACC", pipelines = "muse")
clin <- GDCquery_clinic("TCGA-ACC","clinical")
clin <- clin[,c("bcr_patient_barcode","disease","gender","tumor_stage","race","vital_status")]
TCGAvisualize_oncoprint(mut = mut, genes = mut$Hugo_Symbol[1:20],
                        filename = "oncoprint.pdf",
                        annotation = clin,
                        color=c("background"="#CCCCCC","DEL"="purple","INS"="yellow","SNP"="brown"),
                        rows.font.size=10,
                        heatmap.legend.side = "right",
                        dist.col = 0, 
                        width = 5,
                        label.font.size = 10)
```

```{r, fig.width=6, fig.height=4, echo=FALSE, fig.align="center"}
library(png)
library(grid)
img <- readPNG("oncoprint.png")
grid.raster(img)
```


The columns are the samples and the rows are genes. The upper histograms shows the number
of mutation of each samples, the right histogram shows the number of patients that has that mutation (which 
is also represented by the percentage numbers in the left side of the plot).

## `TCGAvisualize_PCA`: Principal Component Analysis plot for differentially expressed genes

In order to better understand our genes, we can perform a PCA to reduce the
number of dimensions of our gene set. The function `TCGAvisualize_PCA` will plot
the PCA for different groups.

The arguments of this function are:

* **dataFilt** The expression matrix after normalization and quantile filter
* **dataDEGsFiltLevel** The TCGAanalyze_LevelTab output
* **ntopgenes** number of DEGs genes to plot in PCA
* ** group1 a string containing the barcode list of the samples in control group
* ** group2 a string containing the barcode list of the samples in disease group


```{r, eval = FALSE}
# normalization of genes
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)

# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)

# selection of normal samples "NT" 
group1 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
# selection of normal samples "TP" 
group2 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))

# Principal Component Analysis plot for ntop selected DEGs
 pca <- TCGAvisualize_PCA(dataFilt,dataDEGsFiltLevel, ntopgenes = 200, group1, group2)
```

The result is shown below:
```{r, fig.width=6, fig.height=4, echo=FALSE, fig.align="center"}
library(png)
library(grid)
img <- readPNG("PCAtop200DEGs.png")
grid.raster(img)
```

## `TCGAvisualize_SurvivalCoxNET` Survival Analysis: Cox Regression and dnet package

TCGAvisualize_SurvivalCoxNET can help an user to identify a group of 
survival genes that are significant from univariate Kaplan Meier 
Analysis and also for Cox Regression.

It shows in the end a network build with community of genes with similar range of pvalues from
Cox regression (same color) and that interaction among those genes is already validated in
literatures using the STRING database (version 9.1).

```{r, eval = FALSE}
library(TCGAbiolinks)
# Survival Analysis SA

clinical_patient_Cancer <- TCGAquery_clinic("brca","clinical_patient")
dataBRCAcomplete <- log2(BRCA_rnaseqv2)

tokenStop<- 1

tabSurvKMcomplete <- NULL

for( i in 1: round(nrow(dataBRCAcomplete)/100)){
    message( paste( i, "of ", round(nrow(dataBRCAcomplete)/100)))
    tokenStart <- tokenStop
    tokenStop <-100*i
    tabSurvKM <- TCGAanalyze_SurvivalKM(clinical_patient_Cancer,
                                        dataBRCAcomplete,
                                        Genelist = rownames(dataBRCAcomplete)[tokenStart:tokenStop],
                                        Survresult = F,ThreshTop=0.67,ThreshDown=0.33)
    tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}

tabSurvKMcomplete <- tabSurvKMcomplete[tabSurvKMcomplete$pvalue < 0.01,]
tabSurvKMcomplete <- tabSurvKMcomplete[!duplicated(tabSurvKMcomplete$mRNA),]
rownames(tabSurvKMcomplete) <-tabSurvKMcomplete$mRNA
tabSurvKMcomplete <- tabSurvKMcomplete[,-1]
tabSurvKMcomplete <- tabSurvKMcomplete[order(tabSurvKMcomplete$pvalue, decreasing=F),]

tabSurvKMcompleteDEGs <- tabSurvKMcomplete[rownames(tabSurvKMcomplete) %in% dataDEGsFiltLevel$mRNA,]

tflist <- EAGenes[EAGenes$Family == "transcription regulator","Gene"]
tabSurvKMcomplete_onlyTF <- tabSurvKMcomplete[rownames(tabSurvKMcomplete) %in% tflist,]

TabCoxNet <- TCGAvisualize_SurvivalCoxNET(clinical_patient_Cancer,dataBRCAcomplete,
                                          Genelist = rownames(tabSurvKMcompleteDEGs),
                                          scoreConfidence = 700,
                                          titlePlot = "TCGAvisualize_SurvivalCoxNET Example")
```

In particular the survival analysis with kaplan meier and cox regression allow user to reduce the feature / number of genes significant for survival. And using 'dnet' pipeline with 'TCGAvisualize_SurvivalCoxNET' function the user can further filter those genes according some already validated interaction according STRING database.
This is important because the user can have an idea about the biology inside the 
survival discrimination and further investigate in a sub-group of genes that are working in as synergistic effect influencing the risk of survival.
In the following picture the user can see some community of genes with same color and survival pvalues.

The result is shown below:
```{r, fig.width=6, fig.height=4, echo=FALSE, fig.align="center"}
library(png)
library(grid)
img <- readPNG("SurvivalCoxNETOutput.png")
grid.raster(img)
```

## `TCGAvisualize_meanMethylation`: Mean DNA Methylation Analysis
Using the data and calculating the mean DNA methylation per group, it is possible 
to create a mean DNA methylation boxplot with the function 
`TCGAvisualize_meanMethylation` as follows:

```{r,include=FALSE,echo=FALSE, fig.height=5, message=FALSE, warning=FALSE}
data <- tryCatch({
    query <- GDCquery(project = "TCGA-GBM",
                  data.category = "DNA methylation",
                  platform = "Illumina Human Methylation 27",
                  legacy = TRUE, 
                  barcode = c("TCGA-02-0058-01A-01D-0186-05", "TCGA-12-1597-01B-01D-0915-05",
                              "TCGA-12-0829-01A-01D-0392-05", "TCGA-06-0155-01B-01D-0521-05",
                              "TCGA-02-0099-01A-01D-0199-05", "TCGA-19-4068-01A-01D-1228-05",
                              "TCGA-19-1788-01A-01D-0595-05", "TCGA-16-0848-01A-01D-0392-05"))
            GDCdownload(query, method = "api", chunks.per.download = 2)
            GDCdownload(query, method = "api")
            data <- GDCprepare(query)
            data
    }, error = function(e) {
         nrows <- 200; ncols <- 21
         counts <- matrix(runif(nrows * ncols, 0, 1), nrows)
         rowRanges <- GenomicRanges::GRanges(rep(c("chr1", "chr2"), c(50, 150)),
                            IRanges::IRanges(floor(runif(200, 1e5, 1e6)), width=100),
                            strand=sample(c("+", "-"), 200, TRUE),
                            feature_id=sprintf("ID%03d", 1:200))
        colData <- S4Vectors::DataFrame(shortLetterCode=rep(c("NT", "TP","TR"), 7),
                         row.names=LETTERS[1:21],
                         subtype_Pan.Glioma.DNA.Methylation.Cluster=rep(c("group1","group2","group3"),c(7,7,7)),
                         vital_status=rep(c("DEAD","ALIVE","DEAD"),7))
        data <- SummarizedExperiment::SummarizedExperiment(
                    assays=S4Vectors::SimpleList(counts=counts),
                    rowRanges=rowRanges,
                    colData=colData)
        data

        }
    )
```

```{r, eval=FALSE, echo=TRUE, fig.height=5, message=FALSE, warning=FALSE}
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "DNA methylation",
                  platform = "Illumina Human Methylation 27",
                  legacy = TRUE, 
                  barcode = c("TCGA-02-0058-01A-01D-0186-05", "TCGA-12-1597-01B-01D-0915-05",
                              "TCGA-12-0829-01A-01D-0392-05", "TCGA-06-0155-01B-01D-0521-05",
                              "TCGA-02-0099-01A-01D-0199-05", "TCGA-19-4068-01A-01D-1228-05",
                              "TCGA-19-1788-01A-01D-0595-05", "TCGA-16-0848-01A-01D-0392-05"))
GDCdownload(query, method = "api")
data <- GDCprepare(query)
```

```{r, echo=TRUE, fig.height=4, fig.width=3, out.width = 3, out.heigh=5, message=FALSE, warning=FALSE}
# "shortLetterCode" is a column in the SummarizedExperiment::colData(data) matrix
TCGAvisualize_meanMethylation(data, groupCol = "shortLetterCode",filename = NULL)
```
The arguments of `TCGAvisualize_meanMethylation` are:

* **data** SummarizedExperiment object obtained from `GDCprepare`
* **groupCol** Columns in colData(data) that defines the groups. If no
columns defined a columns called "Patients" will be used
* **subgroupCol** Columns in colData(data) that defines the subgroups.
* **shapes** Shape vector of the subgroups. It must have the size of the levels
of the subgroups. Example: shapes = c(21,23) if for two levels
* **filename** The name of the pdf that will be saved
* **subgroup.legend** Name of the subgroup legend. **DEFAULT: subgroupCol**
* **group.legend** Name of the group legend. **DEFAULT: groupCol**
* **color** vector of colors to be used in graph
* **title** main title in the plot
* **ylab** y axis text in the plot
* **print.pvalue** Print p-value for two groups in the plot
* **xlab** x axis text in the plot
* **labels** Labels of the groups
* **y.limits** Change lower/upper y-axis limit 

Other examples using the parameters:
```{r, echo=TRUE, fig.height=4, fig.width=7, out.width = 7, out.heigh=5, message=FALSE, warning=FALSE}

# setting y limits: lower 0, upper 1
TCGAvisualize_meanMethylation(data,groupCol = "shortLetterCode",filename = NULL, y.limits = c(0,1))
# setting y limits: lower 0
TCGAvisualize_meanMethylation(data,groupCol = "shortLetterCode",filename = NULL, y.limits = 0)
# Changing shapes of jitters to show subgroups
TCGAvisualize_meanMethylation(data,
                              groupCol  = "subtype_Pan.Glioma.DNA.Methylation.Cluster", 
                              subgroupCol="vital_status", filename = NULL)
# Sorting bars by descending mean methylation
TCGAvisualize_meanMethylation(data,
                              groupCol  = "subtype_Pan.Glioma.DNA.Methylation.Cluster",sort="mean.desc",filename=NULL)
# Sorting bars by asc mean methylation
TCGAvisualize_meanMethylation(data,groupCol  = "subtype_Pan.Glioma.DNA.Methylation.Cluster",
                              sort="mean.asc",filename=NULL)
TCGAvisualize_meanMethylation(data,groupCol  = "vital_status",sort="mean.asc",filename=NULL)
```


## `TCGAvisualize_starburst`: Integration of gene expression and DNA methylation data 

The starburst plot is proposed to combine information from two volcano plots, 
and is applied for a study of DNA methylation and gene expression. It first introduced in 2010 [@ref7]. In order to
reproduce this plot, we will use the `TCGAvisualize_starburst` function.

The function creates Starburst plot for comparison of DNA methylation
and gene expression. The log10 (FDR-corrected P value) for DNA methylation is plotted in the x axis, and for gene expression in the y axis, for each gene. 
The black dashed line shows the FDR-adjusted P value of 0.01.

The arguments of this function are:

* **met** SummarizedExperiment with methylation data obtained from the `GDCprepare` or Data frame from `DMR_results` file. Expected colData columns: diffmean, p.value.adj and p.value Execute volcanoPlot function in order to obtain these values for the object.
* **exp** Matrix with expression data obtained from the `TCGAanalyze_DEA` function. 
Expected colData columns: logFC, FDR
* **names**	Add the names of the significant genes? Default: FALSE
* **names.fill** Names should be filled in a color box? Default: TRUE
* **circle** Circle pair gene/probe that respects diffmean.cut and logFC.cut 
* **filename** pdf filename
* **legend**  legend title
* **color**  vector of colors to be used in graph
* **label**  vector of labels to be used in graph
* **title**  main title
* **ylab**  y axis text
* **xlab**  x axis text
* **xlim**  x limits to cut image
* **ylim**  y limits to cut image
* **p.cut**  p value cut-off
* **group1**  The name of the group 1  Obs: Column p.value.adj.group1.group2 should exist
* **group2**  The name of the group 2. Obs: Column p.value.adj.group1.group2 should exist
* **exp.p.cut** expression p value cut-off
* **met.p.cut**	methylation p value cut-off
* **diffmean.cut**	If set, the probes with diffmean higher than methylation cut-off will be highlighted in the plot. And the data frame return will be subseted.
* **logFC.cut** If set, the probes with expression fold change higher than methylation cut-off will be highlighted in the plot. And the data frame return will be subseted.
* **height**	Figure height
* **width**	Figure width
* **dpi** Figure dpi


```{r, eval = FALSE}
starburst <- TCGAvisualize_starburst(coad.SummarizeExperiment, 
                                     different.experssion.analysis.data,
                                     group1 = "CIMP.H",
                                     group2 = "CIMP.L",
                                     met.p.cut = 10^-5, 
                                     exp.p.cut=10^-5,
                                     names = TRUE)
```

As a result, the function will a plot the figure below and return a matrix with 
the Gene_symbol and it status in relation to expression (up regulated/down regulated) and to methylation (Hyper/Hypo methylated). The case study 3, shows the complete pipeline
for creating this figure.


```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("figure5star.png")
grid.raster(img)
```




# GDC Downstream Pipeline

## HTSeq data: Downstream analysis BRCA
```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}

CancerProject <- "TCGA-BRCA"
DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","HTSeq_Counts",".rda")

query <- GDCquery(project = CancerProject,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts")

samplesDown <- getResults(query,cols=c("cases"))

dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TP")

dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "NT")
dataSmTP_short <- dataSmTP[1:10]
dataSmNT_short <- dataSmNT[1:10]

queryDown <- GDCquery(project = CancerProject, 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - Counts", 
                      barcode = c(dataSmTP_short, dataSmNT_short))
                    
GDCdownload(query = queryDown,
            directory = DataDirectory)

dataPrep <- GDCprepare(query = queryDown, 
                       save = TRUE, 
                       directory =  DataDirectory,
                       save.filename = FileNameData)

dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep, 
                                      cor.cut = 0.6,
                                      datatype = "HTSeq - Counts")                      

dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "gcContent") 

boxplot(dataPrep, outline = FALSE)

boxplot(dataNorm, outline = FALSE)

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)   

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmTP_short],
                            mat2 = dataFilt[,dataSmNT_short],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")  

```


## miRNA expression data: Downstream analysis BRCA
```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
require(TCGAbiolinks)

CancerProject <- "TCGA-BRCA"
DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","miRNA_gene_quantification",".rda")

query.miR <- GDCquery(project = CancerProject, 
                  data.category = "Gene expression",
                  data.type = "miRNA gene quantification",
                  file.type = "hg19.mirna",
                  legacy = TRUE)

samplesDown.miR <- getResults(query.miR,cols=c("cases"))

dataSmTP.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
                                  typesample = "TP")

dataSmNT.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
                                  typesample = "NT")
dataSmTP_short.miR <- dataSmTP.miR[1:10]
dataSmNT_short.miR <- dataSmNT.miR[1:10]

queryDown.miR <- GDCquery(project = CancerProject, 
                      data.category = "Gene expression",
                      data.type = "miRNA gene quantification",
                      file.type = "hg19.mirna",
                      legacy = TRUE,
                      barcode = c(dataSmTP_short.miR, dataSmNT_short.miR))

GDCdownload(query = queryDown.miR,
            directory = DataDirectory)

dataAssy.miR <- GDCprepare(query = queryDown.miR, 
                           save = TRUE, 
                           save.filename = FileNameData, 
                           summarizedExperiment = TRUE,
                           directory =DataDirectory )
rownames(dataAssy.miR) <- dataAssy.miR$miRNA_ID

# using read_count's data 
read_countData <-  colnames(dataAssy.miR)[grep("count", colnames(dataAssy.miR))]
dataAssy.miR <- dataAssy.miR[,read_countData]
colnames(dataAssy.miR) <- gsub("read_count_","", colnames(dataAssy.miR))

dataFilt <- TCGAanalyze_Filtering(tabDF = dataAssy.miR,
                                  method = "quantile", 
                                  qnt.cut =  0.25)   

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmNT_short.miR],
                            mat2 = dataFilt[,dataSmTP_short.miR],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")  

```

# TCGA Downstream Analysis: Case Studies

### Introduction 

This vignette shows a complete workflow of the TCGAbiolinks package. 
The code is divided in 4 case study:

* 1. Expression pipeline (BRCA)
* 2. Expression pipeline (GBM)
* 3. Integration of DNA methylation and RNA expression pipeline (COAD)
* 4. Elmer pipeline (KIRC)


## Case study n. 1: Pan Cancer downstream analysis BRCA
```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
library(SummarizedExperiment)
library(TCGAbiolinks)
query.exp <- GDCquery(project = "TCGA-BRCA", 
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results",
                      experimental.strategy = "RNA-Seq",
                      sample.type = c("Primary solid Tumor","Solid Tissue Normal"))
GDCdownload(query.exp)
brca.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "brcaExp.rda")

# get subtype information
dataSubt <- TCGAquery_subtype(tumor = "BRCA")

# get clinical data
dataClin <- GDCquery_clinic(project = "TCGA-BRCA","clinical") 

# Which samples are primary solid tumor
dataSmTP <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"TP") 
# which samples are solid tissue normal
dataSmNT <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"NT")
```

Using `TCGAnalyze_DEA`, we identified 3,390 differentially expression genes (DEG) (log 
fold change >=1 and FDR < 1%) between 114 normal and 1097 BRCA samples. In 
order to understand the underlying biological process from DEGs we performed an 
enrichment analysis using `TCGAnalyze_EA_complete` function.  

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}

dataPrep <- TCGAanalyze_Preprocessing(object = brca.exp, cor.cut = 0.6)                      

dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfo,
                                      method = "gcContent")                

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)   

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmNT],
                            mat2 = dataFilt[,dataSmTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")  


```

TCGAbiolinks outputs bar chart with the number of genes for the main categories of 
three ontologies (GO:biological process, GO:cellular component, and GO:molecular 
function, respectively). 

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",
                                RegulonList = rownames(dataDEGs))  

TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                        GOBPTab = ansEA$ResBP,
                        GOCCTab = ansEA$ResCC,
                        GOMFTab = ansEA$ResMF,
                        PathTab = ansEA$ResPat,
                        nRGTab = rownames(dataDEGs),
                        nBar = 20)
```


The figure resulted from the code above is shown below.
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("case1_EA.png")
grid.raster(img)
```


The Kaplan-Meier analysis was used to compute survival univariate curves, and  
log-Ratio test was computed to assess the statistical significance by using 
TCGAanalyze_SurvivalKM function; starting with 3,390 DEGs genes we found 555 
significantly genes with p.value <0.05. 

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}

group1 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
group2 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))

dataSurv <- TCGAanalyze_SurvivalKM(clinical_patient = dataClin,
                                   dataGE = dataFilt,
                                   Genelist = rownames(dataDEGs),
                                   Survresult = FALSE,
                                   ThreshTop = 0.67,
                                   ThreshDown = 0.33,
                                   p.cut = 0.05, group1, group2)
```

Cox-regression analysis was used to compute survival multivariate curves, and cox 
p-value was computed to assess the statistical significance by using 
TCGAnalyze_SurvivalCoxNET function. Survival multivariate analysis found 160 
significantly genes according to the cox p-value FDR 5.00e-02. From DEGs that we 
found to correlate  significantly with survival by both univariate and multivariate 
analyses we analyzed the following network.

The interactome network graph was generated using STRING.,org.Hs.string version 
10 (Human functional protein association network). The network graph was resized 
by dnet package considering only multivariate survival genes, with strong interaction 
(threshold = 700) we obtained a subgraphsub graph of 24 nodes and 31 edges.

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}

require(dnet)  # to change
org.Hs.string <- dRDataLoader(RData = "org.Hs.string")

TabCoxNet <- TCGAvisualize_SurvivalCoxNET(dataClin,
                                          dataFilt, 
                                          Genelist = rownames(dataSurv),
                                          scoreConfidence = 700,
                                          org.Hs.string = org.Hs.string,
                                          titlePlot = "Case Study n.1 dnet")
```

The figure resulted from the code above is shown below.
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("case1_dnet.png")
grid.raster(img)
```

## Case study n. 2: Pan Cancer downstream analysis LGG

We focused on the analysis of LGG samples. In particular, we used TCGAbiolinks
to download 293 samples with molecular subtypes. Link the complete [complete code](https://gist.github.com/tiagochst/277651ebed998fd3d1952d3fbc376ef2).
.

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
library(TCGAbiolinks)
library(SummarizedExperiment)

query.exp <- GDCquery(project = "TCGA-LGG", 
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results",
                      experimental.strategy = "RNA-Seq",
                      sample.type = "Primary solid Tumor")
GDCdownload(query.exp)
lgg.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "lggExp.rda")

# get subtype information
dataSubt <- TCGAquery_subtype(tumor = "LGG")

# get indexed clinical data
dataClin <- GDCquery_clinic(project = "TCGA-LGG", "Clinical")

```

First, we searched for possible outliers using the `TCGAanalyze_Preprocessing` 
function, which performs an Array Array Intensity correlation AAIC.
We used all samples in expression data which contain molecular subtypes, filtering 
out samples without molecular information, and using only IDHmut-codel (n=85),
IDHmut-non-codel (n=141) and IDHwt (n=56), NA (11), to define a square 
symetric matrix of pearson correlation among all samples (n=293). 
According to this matrix we found no samples with low correlation (cor.cut = 0.6) 
that can be identified as possible outliers, so we continued our analysis 
with 70 samples. 

Second, using the `TCGAanalyze_Normalization` function we normalized mRNA 
transcripts and miRNA, using EDASeq package. This function does use 
Within-lane normalization procedures to adjust for GC-content effect 
(or other gene-level effects) on read counts: loess robust local regression, 
global-scaling, and full-quantile normalization [@ref6] and 
between-lane normalization procedures to adjust for distributional differences 
between lanes (e.g., sequencing depth): global-scaling and full-quantile 
normalization [@ref5].

Third, using the `TCGAanalyze_Filtering` function we applied 3 filters removing 
features / mRNAs with low signal across samples obtaining 4578, 4284, 
1187 mRNAs respectively. 

Then we applied two Hierarchical cluster analysis on 1187 mRNAs after the three 
filters described above, the first cluster using as method ward.D2, and the 
second with ConsensusClusterPlus.

After the two clustering analysis, with cut.tree =4 we obtained n=4 espression clusters (EC).


```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}

# expression data with molecular subtypes
lgg.exp <- subset(lgg.exp, select = colData(lgg.exp)$patient %in% dataSubt$patient)

dataPrep <- TCGAanalyze_Preprocessing(object = lgg.exp,cor.cut = 0.6)
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfo,
                                      method = "gcContent")

datFilt1 <- TCGAanalyze_Filtering(tabDF = dataNorm,method = "varFilter")
datFilt2 <- TCGAanalyze_Filtering(tabDF = datFilt1,method = "filter1")
datFilt <- TCGAanalyze_Filtering(tabDF = datFilt2,method = "filter2")

data_Hc1 <- TCGAanalyze_Clustering(tabDF = datFilt,
                                   method = "hclust",
                                   methodHC = "ward.D2")
data_Hc2 <- TCGAanalyze_Clustering(tabDF = datFilt,
                                   method = "consensus",
                                   methodHC = "ward.D2") 
```

For the next steps, we will merge the clinical data with with the
the cluster information and we will add the subtype information. 

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}

#------  Add cluster information
cluster <- data.frame("groupsHC" = data_Hc2[[4]]$consensusClass)
cluster$groupsHC <- paste0("EC",cluster$groupsHC)
cluster$patient <-  substr(colData(lgg.exp)$patient,1,12)

# Add information about gropus from consensus Cluster in clinical data
dataClin <- merge(dataClin,cluster, by.x="bcr_patient_barcode", by.y="patient")

# Merge subtype and clinical data
clin_subt <- merge(dataClin,dataSubt, by.x="bcr_patient_barcode", by.y="patient")
clin_subt_all <- merge(dataClin,dataSubt, 
                       by.x="bcr_patient_barcode", by.y="patient", all.x = TRUE)

```

The next steps will be to visualize the data. First, we created the survival plot.
```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
#----------- VISUALIZE --------------------
# plotting survival for groups EC1, EC2, EC3, EC4
TCGAanalyze_survival(data = clin_subt_all,
                     clusterCol = "groupsHC",
                     main = "TCGA kaplan meier survival plot from consensus cluster",
                     legend = "RNA Group",
                     color = c("black","red","blue","green3"),
                     filename = "case2_surv.png")
```


The result is showed below:
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("case2_surv.png")
grid.raster(img)
```


We will also, create a heatmap of the expression.
```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
TCGAvisualize_Heatmap(t(datFilt),
                      col.metadata =  clin_subt[,c("bcr_patient_barcode",
                                                   "groupsHC",
                                                   "Histology",
                                                   "IDH.codel.subtype")],
                      col.colors =  list(
                          groupsHC = c("EC1"="black",
                                       "EC2"="red",
                                       "EC3"="blue",
                                       "EC4"="green3"),
                          Histology=c("astrocytoma"="navy",
                                      "oligoastrocytoma"="green3",
                                      "oligodendroglioma"="red"),
                          IDH.codel.subtype = c("IDHmut-codel"="tomato",
                                                "IDHmut-non-codel"="navy",
                                                "IDHwt"="gold","NA"="white")),
                      sortCol = "groupsHC",
                      type = "expression", # sets default color
                      scale = "row", # use z-scores for better visualization
                      title = "Heatmap from concensus cluster", 
                      filename = "case2_Heatmap.pdf",
                      cluster_rows = TRUE)
```

The result is shown below:
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("case2_Heatmap.png")
grid.raster(img)
```

Finally, we will take a look in the mutation genes. We will first download the maf file
with `TCGAquery_maf`. In this example we will investigate the gene "ATRX".
```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
LGGmut <- GDCquery_Maf(tumor = "LGG", pipelines = "muse")
# Selecting gene
mRNAsel <- "ATRX"
LGGselected <- LGGmut[LGGmut$Hugo_Symbol == mRNAsel,]

dataMut <- LGGselected[!duplicated(LGGselected$Tumor_Sample_Barcode),]
dataMut$Tumor_Sample_Barcode <- substr(dataMut$Tumor_Sample_Barcode,1,12)

# Adding the Expression Cluster classification found before
dataMut <- merge(dataMut, cluster, by.y="patient", by.x="Tumor_Sample_Barcode")
dataMut <- dataMut[dataMut$Variant_Classification!=0,]
```

## Case study n. 3: Integration of methylation and expression for ACC

In recent years, it has been described the relationship between 
DNA methylation and gene expression and the study of this relationship 
is often difficult to accomplish.

This case study will show the steps to investigate the relationship
between the two types of data.

First, we downloaded ACC DNA methylation data for HumanMethylation450k platforms, and ACC RNA expression data for Illumina HiSeq platform. 

TCGAbiolinks adds by default the subtypes classification already published by researchers.

We will use this classification to do our examples. 
So, selected the groups CIMP-low and CIMP-high to do RNA expression and DNA methylation comparison.


```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
library(TCGAbiolinks)
library(SummarizedExperiment)
dir.create("case3")
setwd("case3")
#-----------------------------------
# STEP 1: Search, download, prepare |
#-----------------------------------
# 1.1 - DNA methylation
# ----------------------------------
query.met <- GDCquery(project = "TCGA-ACC", 
                      legacy = TRUE,
                      data.category = "DNA methylation",
                      platform = "Illumina Human Methylation 450")
GDCdownload(query.met)

acc.met <- GDCprepare(query = query.met,
                      save = TRUE, 
                      save.filename = "accDNAmet.rda",
                      summarizedExperiment = TRUE)

#-----------------------------------
# 1.2 - RNA expression
# ----------------------------------
query.exp <- GDCquery(project = "TCGA-ACC", 
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results")
GDCdownload(query.exp)
acc.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "accExp.rda")
```

We will first investigate the mean methylation of the methylation groups and highlight 
the hypermutated subgroup. 

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
#--------------------------------------------
# STEP 2: Analysis
#--------------------------------------------
# 2.1 - Mean methylation of samples
# -------------------------------------------
TCGAvisualize_meanMethylation(acc.met,
                              groupCol = "subtype_MethyLevel",
                              subgroupCol = "subtype_Histology",
                              group.legend  = "Groups",
                              subgroup.legend = "Histology",
                              filename = "acc_mean.png")
```

The figure resulted from the code above is shown below:
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("acc_mean.png")
grid.raster(img)
```

For DNA methylation, we do a DMR (different methylated region) analysis, which will give the difference of DNA methylation for the probes of the groups and their significance value.
The output can be seen in a volcano plot. 
Note: Depending on the number of samples this function can be very slow
due to the wilcoxon test, taking from hours to days.

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
# na.omit
acc.met <- subset(acc.met,subset = (rowSums(is.na(assay(acc.met))) == 0))

# Volcano plot
acc.met <- TCGAanalyze_DMR(acc.met, groupCol = "subtype_MethyLevel",
                           group1 = "CIMP-high",
                           group2="CIMP-low",
                           p.cut = 10^-5,
                           diffmean.cut = 0.25,
                           legend = "State",
                           plot.filename = "CIMP-highvsCIMP-low_metvolcano.png")
```

The figure resulted from the code above is shown below:
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("CIMP-highvsCIMP-low_metvolcano.png")
grid.raster(img)
```


For the expression analysis, we do a DEA (differential expression analysis) which will give the fold change 
of gene expression and their significance value.
```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
#-------------------------------------------------
# 2.3 - DEA - Expression analysis - volcano plot
# ------------------------------------------------
acc.exp.aux <- subset(acc.exp, 
                      select = colData(acc.exp)$subtype_MethyLevel %in% c("CIMP-high","CIMP-low"))

idx <- colData(acc.exp.aux)$subtype_MethyLevel %in% c("CIMP-high")
idx2 <- colData(acc.exp.aux)$subtype_MethyLevel %in% c("CIMP-low")

dataPrep <- TCGAanalyze_Preprocessing(object = acc.exp.aux, cor.cut = 0.6)

dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfo,
                                      method = "gcContent")

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  qnt.cut = 0.25,
                                  method='quantile')

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,idx],
                            mat2 = dataFilt[,idx2],
                            Cond1type = "CIMP-high",
                            Cond2type = "CIMP-low",
                            method = "glmLRT")

TCGAVisualize_volcano(dataDEGs$logFC,dataDEGs$FDR,
                      filename = "Case3_volcanoexp.png",
                      x.cut = 3,
                      y.cut = 10^-5,
                      names = rownames(dataDEGs),
                      color = c("black","red","darkgreen"),
                      names.size = 2,
                      xlab = " Gene expression fold change (Log2)",
                      legend = "State",
                      title = "Volcano plot (CIMP-high vs CIMP-low)",
                      width = 10)
```

The figure resulted from the code above is shown below:
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("figure5exp.png")
grid.raster(img)
```

Finally, using both previous analysis we do a starburst plot to select the genes 
that are Candidate Biologically Significant.

Observation: over the time, the number of samples has increased and the clinical data updated.
We used only the samples that had a classification in the examples.
```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}

#------------------------------------------
# 2.4 - Starburst plot
# -----------------------------------------
# If true the argument names of the geenes in circle 
# (biologically significant genes, has a change in gene
# expression and DNA methylation and respects all the thresholds)
# will be shown
# these genes are returned by the function see starburst object after the function is executed
starburst <- TCGAvisualize_starburst(acc.met, dataDEGs,"CIMP-high","CIMP-low",
                                     filename = "starburst.png",
                                     met.p.cut = 10^-5,
                                     exp.p.cut = 10^-5,
                                     diffmean.cut = 0.25,
                                     logFC.cut = 3,
                                     names = FALSE, 
                                     height=10,
                                     width=15,
                                     dpi=300)
```

The figure resulted from the code above is shown below:
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("figure5star.png")
grid.raster(img)
```

## Case study n. 4: Elmer pipeline - KIRC

One of the biggest problems related to anlysis of data is the preparation phase,
which often consists of successive steps in order to prepare it to a format acceptable by certain algorithms and software.

With the objective of assisting users in this arduous step, TCGAbiolinks will 
offer to prepare the data in order to obtain the correct format for different packages.

An example of package to perform DNA methylation and RNA expression analysis is ELMER [@ref1]. 
ELMER, which is designed to combine DNA methylation and gene expression data from human tissues to infer multi-level cis-regulatory networks. ELMER uses DNA methylation to 
identify enhancers, and correlates enhancer state with expression of nearby genes 
to identify one or more transcriptional targets. Transcription factor (TF) binding 
site analysis of enhancers is coupled with expression analysis of all TFs to 
infer upstream regulators. This package can be easily applied to TCGA public 
available cancer data sets and custom DNA methylation and gene expression data sets.

ELMER analyses have 5 main steps: 

1. Identify distal enhancer probes on HM450K.
2. Identify distal enhancer probes with significantly different DNA methyaltion level
in control group and experiment group.
3. Identify putative target genes for differentially methylated distal enhancer probes.
4. Identify enriched motifs for the distal enhancer probes which are significantly 
differentially methylated and linked to putative target gene.
5. Identify regulatory TFs whose expression associate with DNA methylation at motifs.


We will present this the study KIRC by TCGAbiolinks and 
ELMER integration. For more information, please consult the ELMER package.

For the DNA methylation data we will search the platform HumanMethylation450
and level 3 data with TCGAquery. After, we will download the data with
TCGAdownload and the data will be prepared into a data.frame object
with TCGAprepare. TCGAprepare_elmer will be used to prepare the data frame
to ELMER.
```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
library(TCGAbiolinks)
library(SummarizedExperiment)
library(ELMER)
library(parallel)
dir.create("case4")
setwd("case4")
#-----------------------------------
# STEP 1: Search, download, prepare |
#-----------------------------------
# 1.1 - DNA methylation
# ----------------------------------
query.met <- GDCquery(project = "TCGA-KIRC", 
                      legacy = TRUE,
                      data.category = "DNA methylation",
                      platform = "Illumina Human Methylation 450")
GDCdownload(query.met)
kirc.met <- GDCprepare(query = query.met,
                       save = TRUE, 
                       save.filename = "kircDNAmet.rda",
                       summarizedExperiment = TRUE)


kirc.met <- TCGAprepare_elmer(kirc.met, 
                              platform = "HumanMethylation450",
                              save = TRUE,
                              met.na.cut = 0.2)

```
For the gene expression data we will search the platform IlluminaHiSeq_RNASeqV2
and level 3 data with TCGAquery. Next, we will download the normalized_results 
data with TCGAdownload and prepare it into a data.frame object
with TCGAprepare. TCGAprepare_elmer will be used to prepare the data frame
to ELMER.

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
# Step 1.2 download expression data
#-----------------------------------
# 1.2 - RNA expression
# ----------------------------------
query.exp <- GDCquery(project = "TCGA-KIRC", 
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "normalized_results")
GDCdownload(query.exp)
kirc.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "kirkExp.rda")

kirc.exp <- TCGAprepare_elmer(kirc.exp,
                              save = TRUE,
                              platform = "IlluminaHiSeq_RNASeqV2") 
```

The ELMER input is a mee object that contains the methylation matrix,
the expression matrix, the probe information, the gene information and 
a table with a summary of the data.

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
#-----------------------------------
# STEP 2: ELMER integration         |
#-----------------------------------
# Step 2.1 prepare mee object       |
# -----------------------------------
library(ELMER)
library(parallel)

geneAnnot <- txs()
geneAnnot$GENEID <- paste0("ID",geneAnnot$GENEID)
geneInfo <- promoters(geneAnnot,upstream = 0, downstream = 0)
probe <- get.feature.probe()
mee <- fetch.mee(meth = kirc.met, exp = kirc.exp, TCGA = TRUE,
                 probeInfo = probe, geneInfo = geneInfo)
```

The code below excute steps 2-5 showed above: 
```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
direction <- c("hyper","hypo")

for (j in direction){
    print(j)
    dir.out <- paste0("kirc/",j)
    dir.create(dir.out, recursive = TRUE)
    #--------------------------------------
    # STEP 3: Analysis                     |
    #--------------------------------------
    # Step 3.1: Get diff methylated probes |
    #--------------------------------------
    Sig.probes <- get.diff.meth(mee, cores=detectCores(),
                                dir.out =dir.out,
                                diff.dir=j,
                                pvalue = 0.01)
    
    #-------------------------------------------------------------
    # Step 3.2: Identify significant probe-gene pairs            |
    #-------------------------------------------------------------
    # Collect nearby 20 genes for Sig.probes
    nearGenes <- GetNearGenes(TRange=getProbeInfo(mee, probe=Sig.probes$probe),
                              cores=detectCores(),
                              geneAnnot=getGeneInfo(mee))
    
    pair <- get.pair(mee=mee,
                     probes=Sig.probes$probe,
                     nearGenes=nearGenes,
                     permu.dir=paste0(dir.out,"/permu"),
                     dir.out=dir.out,
                     cores=detectCores(),
                     label= j,
                     permu.size=10000,
                     Pe = 0.001)
    
    Sig.probes.paired <- fetch.pair(pair=pair,
                                    probeInfo = getProbeInfo(mee),
                                    geneInfo = getGeneInfo(mee))
    Sig.probes.paired <-read.csv(paste0(dir.out,"/getPair.",j,".pairs.significant.csv"),
                                 stringsAsFactors=FALSE)[,1]
    
    
    #-------------------------------------------------------------
    # Step 3.3: Motif enrichment analysis on the selected probes |
    #-------------------------------------------------------------
    if(length(Sig.probes.paired) > 0 ){
        #-------------------------------------------------------------
        # Step 3.3: Motif enrichment analysis on the selected probes |
        #-------------------------------------------------------------
        enriched.motif <- get.enriched.motif(probes=Sig.probes.paired,
                                             dir.out=dir.out, label=j,
                                             background.probes = probe$name)
        if(length(enriched.motif) > 0){
            #-------------------------------------------------------------
            # Step 3.4: Identifying regulatory TFs                        |
            #-------------------------------------------------------------
            print("get.TFs")
            
            TF <- get.TFs(mee = mee,
                          enriched.motif = enriched.motif,
                          dir.out = dir.out,
                          cores = detectCores(), label = j)
            save(TF, enriched.motif, Sig.probes.paired,
                 pair, nearGenes, Sig.probes,
                 file=paste0(dir.out,"/ELMER_results_",j,".rda"))
        }
    }
}
```

From this analysis it is possible to verify the relationship between nearby 20 
gene expression vs DNA methylation at this probe. The result of this is 
show by the ELMER scatter plot.

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
scatter.plot(mee,byProbe=list(probe=c("cg00328720"),geneNum=20), category="TN", lm_line = TRUE)
```

The result is shown below:
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("case4_elmer.png")
grid.raster(img)
```

Each scatter plot showing the average DNA methylation level of sites with the UA6 motif in all KIRC samples plotted against the expression of the transcription factor ZNF677 and PEG3 respectively.
```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
scatter.plot(mee,byTF = list(TF = c("ZNF677","PEG3"),
                          probe = enriched.motif[["UA6"]]), category = "TN",
                          save = TRUE, lm_line = TRUE, dir.out = "kirc")
```

The result is shown below:
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("elmer1.png")
grid.raster(img)
```

The schematic plot shows probe colored in blue and the location of nearby 20 genes. 
The genes significantly linked to the probe were shown in red.

```{r,eval=FALSE,echo=TRUE,message=FALSE,warning=FALSE}
pair <- fetch.pair(pair="./kirc/getPair.hypo.pairs.significant.csv",
                   probeInfo = mee@probeInfo, geneInfo = mee@geneInfo)
schematic.plot(pair=pair, byProbe="cg15862394",save=FALSE)
```
The result is shown below:
```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("elmer2.png")
grid.raster(img)
```

The plot shows the odds ratio (x axis) for the selected motifs with OR above 1.3 and lower boundary of OR above 1.3. 
The range shows the 95% confidence interval for each Odds Ratio.

```{r, fig.width=6, fig.height=4, echo = FALSE, fig.align="center",hide=TRUE, message=FALSE,warning=FALSE}
library(png)
library(grid)
img <- readPNG("elmer3.png")
grid.raster(img)
```

******

### Session Information
******
```{r sessionInfo}
sessionInfo()
```

# References