---
title: "RTCGAToolbox"
author: "Mehmet Kemal Samur"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true

references:
- id: ref1
  title: Comprehensive genomic characterization defines human glioblastoma genes and core pathways
  author:
  - family: Cancer Genome Atlas Research Network
    given:
  journal: Nature
  volume: 455
  number: 7216
  pages: 1061-1068
  issued:
    year: 2008

- id: ref2
  title: GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers
  author:
  - family: Mermel, C. H. and Schumacher, S. E. and Hill, B. and Meyerson, M. L. and Beroukhim, R. and Getz, G
    given:
  journal: Genome Biol
  volume: 12
  number: 4
  pages: R41
  issued:
    year: 2011

- id: ref3
  title: RTCGAToolbox\:\ A New Tool for Exporting TCGA Firehose Data
  author:
  - family: Samur MK.
    given:
  journal: PLoS ONE
  volume: 9
  number: 9
  pages: e106397
  issued:
    year: 2014

vignette: >
  %\VignetteIndexEntry{RTCGAToolbox Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Introduction

Managing data from large-scale projects such as The Cancer Genome Atlas
(TCGA) [@ref1] is an important and time-consuming step in research projects.
Several efforts, such as the Firehose project, make pre-processed TCGA data
publicly available via web services and data portals, but users still have to
download, manage, and prepare the data for downstream analysis. We
developed an open-source and extensible R-based data client for Firehose Level
3 and Level 4 data and demonstrated its use with sample case studies.
RTCGAToolbox can simplify data management for researchers who are interested
in TCGA data. In addition, it can be integrated with other analysis
pipelines for further data analysis.

RTCGAToolbox is open source and licensed under the GNU General Public License
Version 2.0. All documentation and source code for RTCGAToolbox are freely
available. If you use the package, please cite [@ref3].

Currently, the following functions are provided to access and process
datasets.

* Control functions:
    + getFirehoseRunningDates: Returns the valid stddata run dates. Users have
to provide a valid date to access data.
    + getFirehoseAnalyzeDates: Returns the valid analysis run dates. Users have
to provide a valid date to access data. This date only affects the GISTIC2
[@ref2] processed copy number estimate matrices.
    + getFirehoseDatasets: Returns the valid dataset aliases.
* Data client function:
    + getFirehoseData: This is the core function of the package. It gives
users access to Firehose processed data. When it is called, the package
performs several steps to retrieve the data, and the function then returns an
S4 object that holds all the downloaded data.

# Installation

RTCGAToolbox can be installed from Bioconductor. The source code is also
available on GitHub. First-time users can install the package with the
following code snippet:

```{r eval=FALSE}
if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("RTCGAToolbox")
```

# Data Client

Before getting data from the Firehose pipelines, users have to check the valid
dataset aliases, stddata run dates, and analysis run dates. RTCGAToolbox
provides three control functions for this purpose. Users can list
datasets with the `getFirehoseDatasets` function. In addition, users have to
provide a stddata run date and/or an analysis run date to the client function.
Valid dates are available via the `getFirehoseRunningDates` and
`getFirehoseAnalyzeDates` functions. The code chunks below show how to list
datasets and dates.

```{r}
library(RTCGAToolbox)
# Valid aliases
getFirehoseDatasets()
```

```{r}
# Valid stddata runs
getFirehoseRunningDates(last = 3)
```

```{r}
# Valid analysis run dates (returns the 3 most recent dates)
getFirehoseAnalyzeDates(last=3)
```
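
The checks above can also be done programmatically before any download. The
following is a minimal sketch, assuming that `getFirehoseDatasets()` returns a
character vector of aliases and that `getFirehoseRunningDates(last = 1)`
returns the most recent stddata run date; the variable names `cohort` and
`latestRun` are only illustrative.

```{r eval=FALSE}
library(RTCGAToolbox)

# Cohort alias we intend to download (illustrative choice)
cohort <- "READ"

# Stop early if the alias is not among the valid dataset aliases
stopifnot(cohort %in% getFirehoseDatasets())

# Most recent stddata run date, to be passed later as 'runDate'
latestRun <- getFirehoseRunningDates(last = 1)
latestRun
```
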

Once the dataset and dates have been chosen, users can call the data client
function (`getFirehoseData`) to access the data. The current version can
download multiple data types, except isoform- and exon-level data, which are
excluded because of their size. The code chunk below downloads the READ
dataset with clinical and mutation data.

```{r, message=FALSE}
# READ mutation data and clinical data
readData <- getFirehoseData(dataset="READ", runDate="20160128",
    forceDownload=TRUE, clinical=TRUE, Mutation=TRUE)
```

Printing the object shows which datasets are stored in the `FirehoseData`
object:

```{r}
readData
```

Users have to set several parameters to get the data they need. The
`getFirehoseData` options are explained below:

* dataset: The cohort code of the dataset to download. The list of valid codes
is available via `getFirehoseDatasets()`, as explained above.
* runDate: The Firehose project releases data for each cohort at several time
points. Users can list the valid dates with `getFirehoseRunningDates()`.
* gistic2Date: In addition to the stddata runs, the Firehose project runs
analysis pipelines that process copy number data with GISTIC2 [@ref2]. Users
who want GISTIC2 processed copy number data should set this date. The list of
valid dates is available via `getFirehoseAnalyzeDates()`.

The following logical flags select the data types to download. By default, the
client only downloads clinical data.

* clinical
* RNASeqGene
* RNASeq2Gene
* RNASeq2GeneNorm
* miRNASeqGene
* CNASNP
* CNVSNP
* CNASeq
* CNACGH
* Methylation
* Mutation
* mRNAArray
* miRNAArray
* RPPAArray

Users can also set the following parameters to control client behavior.

* forceDownload: By default, RTCGAToolbox checks your working directory before
downloading data. If the directory contains data from a previous run, the
package loads those files instead of downloading them again. Set this
parameter to force RTCGAToolbox to download the data again.
* fileSizeLimit: Sets a limit on the size of downloaded files. Huge data files
take longer to download and need more memory to load. By default, this
parameter is set to 500 MB.
* getUUIDs: Firehose provides TCGA barcodes for every sample. In some cases
users may want to use UUIDs for samples. If this parameter is set,
RTCGAToolbox retrieves the UUID for each barcode after processing the data.
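
As an illustration of how these parameters combine, the sketch below requests
GISTIC2 copy number results together with clinical data. It makes two
assumptions that should be checked against `?getFirehoseData`: that a logical
`GISTIC` argument (not listed above) switches on the GISTIC2 download, and that
`"20160128"` is a valid analysis run date as reported by
`getFirehoseAnalyzeDates()`. The object name `readCN` is only illustrative.

```{r eval=FALSE}
library(RTCGAToolbox)

readCN <- getFirehoseData(
    dataset = "READ",
    runDate = "20160128",      # stddata run date used elsewhere in this vignette
    gistic2Date = "20160128",  # assumed valid; confirm with getFirehoseAnalyzeDates()
    clinical = TRUE,
    GISTIC = TRUE,             # assumed flag for GISTIC2 results; see ?getFirehoseData
    fileSizeLimit = 200,       # lower the file size limit from the default to 200 MB
    forceDownload = FALSE      # reuse files from a previous run when available
)

# Printing lists the data types stored in the object
readCN
```
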

## Example Dataset

We've provided an abbreviated dataset from the 'ACC' (Adrenocortical carcinoma)
cohort that contains only the top 6 rows of each data type together with the
full clinical dataset. This dataset can be loaded with:

```{r}
data(accmini)
accmini
```

* `accmini` is a `FirehoseData` object that stores RNAseq, copy number,
mutation, and clinical data from the Adrenocortical Carcinoma (ACC) study.

## Conversion to Bioconductor classes

The `biocExtract` function allows the user to take any downloaded dataset and
convert it into a standard Bioconductor object. Depending on the features of
the data, the result is a `SummarizedExperiment`, a
`RangedSummarizedExperiment`, or a `RaggedExperiment`. The user must provide
the `FirehoseData` object along with the desired data type as input to the
function. The converted objects work directly with other software in the
Bioconductor ecosystem.

```{r}
biocExtract(accmini, "RNASeq2Gene")

biocExtract(accmini, "CNASNP")
```
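
Once converted, the object can be explored with the usual Bioconductor
accessors. The sketch below assumes that `biocExtract(accmini, "RNASeq2Gene")`
returns a `SummarizedExperiment`; `dim()`, `colnames()`, and `assay()` are
standard accessors from the SummarizedExperiment package, and the names
`rnaSE`, `nr`, and `nc` are only illustrative.

```{r}
library(RTCGAToolbox)
library(SummarizedExperiment)

data(accmini)
rnaSE <- biocExtract(accmini, "RNASeq2Gene")

# Number of features (rows) and samples (columns)
dim(rnaSE)

# Sample identifiers are stored as column names
head(colnames(rnaSE))

# Peek at a corner of the expression matrix, guarding against small objects
nr <- min(3, nrow(rnaSE))
nc <- min(3, ncol(rnaSE))
assay(rnaSE)[seq_len(nr), seq_len(nc), drop = FALSE]
```
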

# Raw Data

You can obtain the downloaded data in tabular or list format from the
`FirehoseData` object by using the `getData()` function.

```{r}
head(getData(accmini, "clinical"))

getData(accmini, "RNASeq2GeneNorm")

getData(accmini, "GISTIC", "AllByGene")
```
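
The extracted tables can then be handled with base R. As a minimal sketch,
assuming the clinical slot is returned as a `data.frame` with one row per
patient, the following inspects its size and labels; the name `clin` is only
illustrative.

```{r}
library(RTCGAToolbox)

data(accmini)
clin <- getData(accmini, "clinical")

# Rows and columns of the clinical table
dim(clin)

# First few clinical variables and row labels
head(colnames(clin))
head(rownames(clin))
```
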

# Session Info

```{r}
sessionInfo()
```

# References