---
title: "terraTCGAdata Introduction"
author: "Marcel Ramos"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Obtain Terra TCGA data as MultiAssayExperiment}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
isAnVILWS <- function() {
    AnVILGCP::gcloud_exists() &&
    identical(AnVILBase::avplatform_namespace(), "AnVILGCP")
}
```

# terraTCGAdata

## Installation

```{r,eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("terraTCGAdata")
```

# Overview
 
The `terraTCGAdata` R package imports TCGA datasets available on the Terra
platform as
[MultiAssayExperiment](http://bioconductor.org/packages/MultiAssayExperiment/)
objects. The package also provides functions for discovering relevant
datasets. It has one main function and two helper functions:
 
1.  `terraTCGAdata` allows the creation of the `MultiAssayExperiment` object
    from the different indicated resources.
 
2.  The `getClinicalTable` and `getAssayTable` functions allow for the
    discovery of datasets within the Terra data model. The column names
    from these tables can be provided as inputs to the `terraTCGAdata`
    function.

## Data

Some public Terra workspaces come pre-packaged with TCGA data (i.e., cloud data
resources are linked within the data model), particularly the workspaces
labelled `OpenAccess_V1-0`. Datasets harmonized to the hg38 genome, such as
those from the Genomic Data Commons data repository, use a different data model
/ workflow and are not compatible with the functions in this package. For
compatible workspaces, we make use of the Terra data model and represent the
data as a `MultiAssayExperiment`.

For more information on `MultiAssayExperiment`, please see the vignette in
that package.

# Requirements

## Loading packages

```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(AnVIL)
library(terraTCGAdata)
```

## gcloud SDK installation

A valid gcloud SDK installation is required to use the package. To get set up,
see the Bioconductor tutorials for running RStudio on Terra. Use the
`gcloud_exists()` function from the `r BiocStyle::Biocpkg("AnVIL")` package to
check whether it is installed on your system.

```{r}
gcloud_exists()
```

You can also use the `gcloud_project()` function to display the current
project, or to set a project name by specifying the `project` argument:

```{r, eval=isAnVILWS()}
gcloud_project()
```

# Default Data Workspace

To get a table of available TCGA workspaces, use the `selectTCGAworkspace()`
function: 

```{r, eval=isAnVILWS()}
selectTCGAworkspace()
```

You can also set the package-wide option with the `terraTCGAworkspace` function
and check the setting with `getOption('terraTCGAdata.workspace')` or by running
the `terraTCGAworkspace` function.

```{r,eval=isAnVILWS()}
terraTCGAworkspace("TCGA_COAD_OpenAccess_V1-0_DATA")
getOption("terraTCGAdata.workspace")
```
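
Package-wide settings of this kind rest on base R's `options()` mechanism. The
following is a minimal sketch, assuming (as the `getOption()` call above
suggests) that the workspace name is stored under the
`terraTCGAdata.workspace` option. Setting it directly with `options()` is
shown only to illustrate the mechanism; `terraTCGAworkspace()` is the
interface this vignette documents.

```{r}
## Illustration only: assumes the option name used in the chunk above
old <- options(terraTCGAdata.workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")

## read the value back, as done above with getOption()
ws <- getOption("terraTCGAdata.workspace")
ws

## restore whatever value was in place before this chunk
options(old)
```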

# Clinical data resources

In order to determine what datasets to download, use the `getClinicalTable`
function to list all of the columns that correspond to clinical data
from the different collection centers.

```{r, eval=isAnVILWS()}
ct <- getClinicalTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
ct
names(ct)
```

# Clinical data download

After picking a column from the `getClinicalTable` output, use the column
name as input to the `getClinical` function to obtain the data:

```{r, eval=isAnVILWS()}
column_name <- "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin"
clin <- getClinical(
    columnName = column_name,
    participants = TRUE,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
clin[, 1:6]
dim(clin)
```

# Assay data resources

We use the same approach for assay data. We first produce a list of assays
with the `getAssayTable` function and then select one, along with any sample
codes of interest.

```{r, eval=isAnVILWS()}
at <- getAssayTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
at
names(at)
```
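
Assay column names are long, so it can help to filter them with base R string
functions before copying one into a function call. The sketch below uses a
stand-in character vector holding the two assay names that appear elsewhere
in this vignette rather than the live `names(at)` output. Treating the text
before the first `__` as an assay type is an assumption based on how these
two names look, not a documented convention.

```{r}
## Stand-in for names(at): two assay column names used in this vignette
assay_names <- c(
    "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
)

## keep only the names that start with "rnaseqv2"
rnaseq_names <- grep("^rnaseqv2", assay_names, value = TRUE)
rnaseq_names

## extract the leading token (the text before the first "__") of each name
assay_types <- vapply(
    strsplit(assay_names, "__", fixed = TRUE), `[`, character(1L), 1L
)
assay_types
```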

# Summary of sample types in the data

You can get a summary table of all the samples in the data by using the
`sampleTypesTable` function:

```{r, eval=isAnVILWS()}
sampleTypesTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
```

# Intermediate function for obtaining only the data

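The `getAssayData` function returns the data for a single assay without
building a `MultiAssayExperiment`. Note that if you have the package-wide
option set, the `workspace` argument is not needed in the function call.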

```{r, eval=isAnVILWS()}
prot <- getAssayData(
    assayName = "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    sampleCode = c("01", "10"),
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA",
    sampleIdx = 1:4
)
head(prot)
```
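
The `sampleCode` values used above are TCGA sample type codes, which occupy
characters 14 and 15 of a TCGA barcode: `01` denotes a primary solid tumor,
`10` a blood derived normal, and `11` a solid tissue normal. The following
sketch shows how such codes can be read from barcodes with base R. The
barcodes are made up for illustration, and the code does not claim to
reproduce how `getAssayData` filters samples internally.

```{r}
## Illustrative barcodes (made up for this example, not real samples)
barcodes <- c(
    "TCGA-AA-0001-01A-01R-0001-07",
    "TCGA-AA-0001-10A-01D-0001-01",
    "TCGA-AA-0002-11A-01R-0001-07"
)

## the sample type code occupies characters 14 and 15 of the barcode
sample_codes <- substr(barcodes, 14, 15)
sample_codes

## keep primary solid tumors ("01") and blood derived normals ("10")
keep <- sample_codes %in% c("01", "10")
barcodes[keep]
```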

# MultiAssayExperiment

Finally, once you have collected all the relevant column names,
pass them as inputs to the main `terraTCGAdata` function:

```{r, eval=isAnVILWS()}
mae <- terraTCGAdata(
    clinicalName = "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin",
    assays =
        c("protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
        "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"),
    sampleCode = NULL,
    split = FALSE,
    sampleIdx = 1:4,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
mae
```

We expect that most `OpenAccess_V1-0` cancer datasets follow this data model.
If you encounter any errors, please provide a minimal reproducible example
at https://github.com/waldronlab/terraTCGAdata.

# Session Info

```{r}
sessionInfo()
```