---
title: "AnVILWorkflow: Run non-R workflows implemented in Terra using R"
author: "Sehyun Oh"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Quickstart: RNAseq analysis using salmon}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: yes
    toc_depth: 4
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  comment = "#>", collapse = TRUE, fig.align = 'center',
  eval = AnVIL::gcloud_exists()
)
```

# Overview

The [AnVIL project][] is an analysis, visualization, and informatics
cloud-based space for data access, sharing and computing across large
genomic-related data sets.

For R users with the limited computing resources, we introduce the
AnVILWorkflow package. This package allows users to run workflows
implemented in [Terra][] without installing software, writing any 
workflow, or managing cloud resources. Terra is a cloud-based genomics 
platform and its computing resources rely on Google Cloud Platform (GCP). 

Use of this package requires AnVIL and Google cloud computing billing
accounts. Consult [AnVIL training guides][] for details on
establishing these accounts.

[AnVIL project]: https://anvilproject.org/
[AnVIL training guides]: https://anvilproject.org/training/guides
[Terra]: https://app.terra.bio/#


## Install and load package
```{r eval = FALSE}
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("AnVILWorkflow")
```

```{r results="hide", message=FALSE, warning=FALSE}
library(AnVIL)
library(AnVILWorkflow)
```

## Google Cloud SDK

If you use AnVILWorkflow within Terra's RStudio, you don't need extra
authentication and gcloud SDK. If you use this package locally, it 
requires gcloud SDK and the billing account used in Terra.
You can [install][] the gcloud sdk.

[isntall]: https://cloud.google.com/sdk/docs/install

Check whether your system has the installation with `AnVIL::gcloud_exists()`.
It should return `TRUE` to use AnVILWorkflow package.

```{r}
gcloud_exists()
```

If it returns `FALSE`, install the gcloud SDK following this script:

```{bash eval=FALSE, echo=FALSE}
## shell
$ curl -sSL https://sdk.cloud.google.com | bash
```

```{r eval=FALSE}
devtools::install_github("rstudio/cloudml")
cloudml::gcloud_install()
```

```{bash eval=FALSE}
## shell
$ gcloud auth login
```

```{r eval=FALSE, echo=FALSE}
## You can change the project using this script
gcloud config set project PROJECT_ID
```

## Create Terra account

You need [Terra account
setup](https://support.terra.bio/hc/en-us/articles/360034677651-Account-setup-and-exploring-Terra).
Once you have your own Terra account, you need two pieces of information
to use AnVILWorkflow package:

1)  The email address linked to your Terra account\
2)  Your billing project name

You can setup your working environment using `setCloudEnv()` function like
below. **Provide the input values with YOUR account information!**

```{r eval=FALSE}
accountEmail <- "YOUR_EMAIL@gmail.com"
billingProjectName <- "YOUR_BILLING_ACCOUNT"

setCloudEnv(accountEmail = accountEmail, 
            billingProjectName = billingProjectName)
```

```{r echo=FALSE}
## In case the environment is set already.
accountEmail <- gcloud_account()
billingProjectName <- gcloud_project()

setCloudEnv(accountEmail = accountEmail, 
            billingProjectName = billingProjectName,
            message = FALSE)
```

The remainder of this vignette assumes that an Terra account has been
established and successfully linked to a Google cloud computing
billing account.


## Major steps

Here is the table of major functions for three workflow steps - prepare,
run, and check result.

| Steps   | Functions           | Description                               |
|---------|---------------------|-------------------------------------------|
| Prepare | `cloneWorkspace`    | Copy the template workspace               |
|         | `updateInput`       | Take user's inputs                        |
| Run     | `runWorkflow`       | Launch the workflow in Terra              |
|         | `stopWorkflow`      | Abort the submission                      |
|         | `monitorWorkflow`   | Monitor the status of your workflow run   |
| Result  | `getOutput`         | List or download your workflow outputs    |


## Example in this vignette: bulk RNAseq analysis

You can find all the available workspaces you have access to using
`AnVIL::avworkspaces()` function. Workspaces manually curated by
this package are separately checked using `availableAnalysis()` function. 
The values under `analysis` column can be used for the analysis 
argument, simplifying the cloning process. For this vignette, 
we use `"salmon"`.

```
> availableAnalysis()
   analysis       workspaceNamespace                            workspaceName         configuration_namespace              configuration_name
1 bioBakery waldronlab-terra-rstudio mtx_workflow_biobakery_version3_template mtx_workflow_biobakery_version3 mtx_workflow_biobakery_version3
2    salmon  bioconductor-rpci-anvil             Bioconductor-Workflow-DESeq2         bioconductor-rpci-anvil                 AnVILBulkRNASeq
                                                                                             description
1                                                                    Microbiome analysis using bioBakery
2 Trascript quantification from RNAseq using Salmon | Differential gene expression analysis using DESeq2
```

```{r}
analysis <- "salmon"
```

# Setup

## Clone workspace
### Curated by this package

We will refer the existing workspaces, that you have access to and want
to use for your analysis, as 'template' workspaces. The first step of
using this package is cloning the template workspace using `cloneWorkspace`
function. Note that you need to provide a **unique** name for the cloned
workspace through `workspaceName` argument. Once you successfully clone 
the workspace, the function will return the name of the cloned workspace.
For example, the successfully execution of the below script will 
return `{YOUR_BILLING_ACCOUNT}/salmon_test`.

```{r echo=FALSE, eval=FALSE, error=TRUE}
# If you attempt to clone the template workspace using the existing
# workspaceName, you will get the error message.
cloneWorkspace(workspaceName = "salmon_test", analysis = analysis)
```

```{r}
salmonWorkspaceName <- basename(tempfile("salmon_")) # unique workspace name
salmonWorkspaceName
cloneWorkspace(workspaceName = salmonWorkspaceName, analysis = analysis)
```


### Any workspace you have access to
If you want to clone any other workspace that you have access to but 
is not curated by this pacakge, you can directly enter the name of 
the target workspace as a `templateName`. For example, to clone the
[Tumor_Only_CNV][] workspace:

[Tumor_Only_CNV]: https://anvil.terra.bio/#workspaces/waldronlab-terra/Tumor_Only_CNV

```{r eval=FALSE}
cnvWorkspaceName <- basename(tempfile("cnv_")) # unique workspace name
cnvWorkspaceName
cloneWorkspace(workspaceName = cnvWorkspaceName,
               templateName = "Tumor_Only_CNV")
```


```{r echo=FALSE}
## workspace used in this vignette
workspaceName <- "salmon_test"
```

## Prepare input

### Current input

You can review the current inputs using `currentInput` function. Below
shows all the required and optional inputs for the workflow. 

```{r}
current_input <- currentInput(salmonWorkspaceName)
current_input
```

<br>

### Update input
You can modify/update inputs of your workflow using `updateInput` function. To
minimize the formatting issues, we recommend to make any change in the current 
input table returned from the `currentInput` function. Under the default 
(`dry=TRUE`), the updated input table will be returned without actually
updating Terra/AnVIL. Set `dry=FALSE`, to make a change in Terra/AnVIL.

```{r}
new_input <- current_input
new_input[4,4] <- "athal_index"
new_input

updateInput(salmonWorkspaceName, inputs = new_input)
```


# Run workflow

You can launch the workflow using `runWorkflow()` function. You need to 
specify the `inputName` of your workflow. If you don't provide it, this
function will return the list of input names you can use for your workflow.

Example error outputs:
```{r eval=FALSE}
runWorkflow(slamonWorkspaceName)
# You should provide the inputName from the followings:
# [1] "AnVILBulkRNASeq_set"
#> Error in runWorkflow(workspaceName):
```

```{r}
runWorkflow(salmonWorkspaceName, inputName = "AnVILBulkRNASeq_set")
```

## Monitor progress

The last three columns (`status`, `succeeded`, and `failed`) show the 
submission and the result status. 

```{r}
submissions <- monitorWorkflow(workspaceName = salmonWorkspaceName)
submissions
```

## Abort submission

You can abort the most recently submitted job using the `stopWorkflow` 
function. You can abort any workflow that is not the most recently
submitted by providing a specific `submissionId`.

```{r}
stopWorkflow(salmonWorkspaceName)
```

# Result

The workspace `Bioconductor-Workflow-DESeq2` is the template workspace 
you cloned at the beginning using the `analysis = "salmon"` argument 
in `cloneWorkspace()` function. This template workspace has already a 
history of the previous submissions, so we will check the output examples 
in this workspace.

```{r}
submissions <- monitorWorkflow(workspaceName = "Bioconductor-Workflow-DESeq2")
submissions
```

You can check all the output files from the most recently succeeded
submission using `getOutput` function. If you specify the `submissionId` 
argument, you can get the output files of that specific submission.

```{r no-run-examples, eval=FALSE}
## Output from the successfully-done submission
successful_submissions <- submissions$submissionId[submissions$succeeded == 1]
out <- getOutput(workspaceName = "Bioconductor-Workflow-DESeq2",
                 submissionId = successful_submissions[1])
```

```{r echo=FALSE}
## Save the previous submission results from the above chunk
## write.table(out, "vignettes/salmon_test_outputs.csv")
out <- read.table("salmon_test_outputs.csv", header = TRUE) %>% 
    tibble::tibble()
```

```{r}
head(out)
```


```{r cleanup, echo=FALSE, message=FALSE, error=TRUE, warning=FALSE}
## Delete test workspaces
resp <- AnVIL::Terra()$deleteWorkspace(workspaceNamespace = billingProjectName,
                                       workspaceName = salmonWorkspaceName)
rm(resp)
```

# Session Info

```{r eval=TRUE}
sessionInfo()
```