---
title: "RcwlPipelines: Bioinformatics tools and pipelines based on Rcwl"
author: "Qiang Hu, Qian Liu"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{RcwlPipelines}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

`RcwlPipelines` is a _Bioconductor_ package that manages a collection
of commonly used bioinformatics tools and pipeline based on
`Rcwl`. These pre-built and pre-tested tools and pipelines are highly
modularized with easy customization to meet different bioinformatics
data analysis needs.  

`Rcwl` and `RcwlPipelines` together forms a _Bioconductor_ toolchain
for use and development of reproducible bioinformatics pipelines in
Common Workflow Language (CWL). The project also aims to develop a
community-driven platform for open source, open development, and open
review of best-practice CWL bioinformatics pipelines.


# Installation
1. Install the package from _Bioconductor_.

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RcwlPipelines")
```
The development version is also available to download from GitHub. 
```{r getDevel, eval=FALSE}
BiocManager::install("rworkflow/RcwlPipelines")
```

2. Load the package into the R session.
```{r Load, message=FALSE}
library(RcwlPipelines)
```
		
# Project resources 

The project website https://rcwl.org/ serves as a central hub for all
related resources. It provides guidance for new users and tutorials
for both users and developers. Specific resources are listed below. 

## The _R_ recipes and cwl scripts

The _R_ scripts to build the CWL tools and pipelines are now residing
in a dedicated [GitHub
repository](https://github.com/rworkflow/RcwlRecipes), which is
intended to be a community effort to collect and contribute
Bioinformatics tools and pipelines using `Rcwl` and CWL.

## Tutorial book

The [tutorial book](https://rcwl.org/RcwlBook/) provides detailed
instructions for developing `Rcwl` tools/pipelines, and also includes
examples of some commonly-used tools and pipelines that covers a wide
range of Bioinformatics data analysis needs.


# `RcwlPipelines` core functions

Here we show the usage of 3 core functions: `cwlUpdate`, `cwlSearch`
and `cwlLoad` for updating, searching, and loading the needed tools
or pipelines in _R_.

## `cwlUpdate` 

The `cwlUpdate` function syncs the current `Rcwl` recipes and returns
a `cwlHub` object which contains the most updated `Rcwl` recipes. The
`mcols()` function returns all related information about each
available tool or pipeline. 

The recipes will be locally cached, so users don't need to call
`cwlUpdate` every time unless they want to use a tool/pipeline that is
newly added to `RcwlPipelines`. Here we are using the recipes from
_Bioconductor_ devel version.
	
```{r}
## For vignette use only. users don't need to do this step.
Sys.setenv(cachePath = tempdir()) 
```

```{r, message=FALSE}
atls <- cwlUpdate(branch = "dev") ## sync the tools/pipelines.
atls
table(mcols(atls)$Type)
```
Currently, we have integrated `r sum(Type(atls)=="tool")` command
line tools and `r sum(Type(atls)=="pipeline")` pipelines.

## `cwlSearch`

We can use (multiple) keywords to search for specific tools/pipelines
of interest, which internally search the `mcols` of "rname", "rpath",
"fpath", "Command" and "Containers". Here we show how to search the
alignment tool `bwa mem`.

```{r}
t1 <- cwlSearch(c("bwa", "mem"))
t1
mcols(t1)
```

## `cwlLoad`

The last core function `cwlLoad` loads the `Rcwl` tool/pipeline into
the _R_ working environment. The code below loads the tool with a
user-defined name `bwa` to do the read alignment. 

```{r}
bwa <- cwlLoad(title(t1)[1])  ## "tl_bwa"
bwa <- cwlLoad(mcols(t1)$fpath[1]) ## equivalent to the above. 
bwa
```

Now the _R_ tool of `bwa` is ready to use. 

# Customize a tool or pipeline

To fit users' specific needsï¼Œthe existing tool or pipline can be
easily customized. Here we use the `rnaseq_Sf` pipeline to demonstrate
how to access and change the arguments of a specific tool inside a
pipeline. This pipeline covers RNA-seq reads quality summary by
`fastQC`, alignment by `STAR`, quantification by `featureCounts` and
quality control by `RSeQC`.

```{r, warning=FALSE}
rnaseq_Sf <- cwlLoad("pl_rnaseq_Sf")
plotCWL(rnaseq_Sf)
```

There are many default arguments defined for the tool of `STAR` inside
the pipeline. Users might want to change some of them. For example, we
can change the value for `--outFilterMismatchNmax` argument from 2 to
5 for longer reads.

```{r}
arguments(rnaseq_Sf, "STAR")[5:6]
arguments(rnaseq_Sf, "STAR")[[6]] <- 5
arguments(rnaseq_Sf, "STAR")[5:6]
```

We can also change the docker image for a specific tool (e.g., to a
specific version). First, we search for all available docker images
for `STAR` in biocontainers repository. The Source server could be
[quay](https://quay.io/) or [dockerhub](https://hub.docker.com).

```{r}
searchContainer("STAR", repo = "biocontainers", source = "quay")
```

Then, we can change the `STAR` version into 2.7.8a (tag name: 2.7.8a--0). 

```{r}
requirements(rnaseq_Sf, "STAR")[[1]]
requirements(rnaseq_Sf, "STAR")[[1]] <- requireDocker(
    docker = "quay.io/biocontainers/star:2.7.8a--0")
requirements(rnaseq_Sf, "STAR")[[1]]
```

# Run a tool or pipeline

Once the tool or pipeline is ready, we only need to assign values for
each of the input parameters, and then submit using one of the
functions: `runCWL`, `runCWLBatch` and `cwlShiny`. More detailed Usage
and examples can be refer to the `Rcwl`
[vignette](https://bioconductor.org/packages/devel/bioc/vignettes/Rcwl/inst/doc/Rcwl.html).

To successfully run the tool or pipeline, users either need to have
all required command line tools pre-installed locally, or using the
docker/singularity runtime by specifying `docker = TRUE` or `docker =
"singularity"` argument inside `runCWL` or `runCWLBatch`
function. Since the _Bioconductor_ building machine doesn't have all the
tools installed, nor does it support the docker runtime, here we use some
pseudo-code to demonstrate the tool/pipeline execution.

```{r, eval=FALSE}
inputs(rnaseq_Sf)
rnaseq_Sf$in_seqfiles <- list("sample_R1.fq.gz",
                              "sample_R2.fq.gz")
rnaseq_Sf$in_prefix <- "sample"
rnaseq_Sf$in_genomeDir <- "genome_STAR_index_Dir"
rnaseq_Sf$in_GTFfile <- "GENCODE_version.gtf"

runCWL(rnaseq_Sf, outdir = "output/sample", docker = TRUE)
```

Users can also submit parallel jobs to HPC for multiple samples using
`runCWLBatch` function. Different cluster job managers, such as
"multicore", "sge" and "slurm", are supported using the
`BiocParallel::BatchtoolsParam`.

```{r, eval=FALSE}
library(BioParallel)
bpparam <- BatchtoolsParam(workers = 2, cluster = "sge",
                           template = batchtoolsTemplate("sge"))

inputList <- list(in_seqfiles = list(sample1 = list("sample1_R1.fq.gz",
                                                    "sample1_R2.fq.gz"),
                                     sample2 = list("sample2_R1.fq.gz",
                                                    "sample2_R2.fq.gz")),
                  in_prefix = list(sample1 = "sample1",
                                   sample2 = "sample2"))

paramList <- list(in_genomeDir = "genome_STAR_index_Dir",
                  in_GTFfile = "GENCODE_version.gtf",
                  in_runThreadN = 16)

runCWLBatch(rnaseq_Sf, outdir = "output",
            inputList, paramList,
            BPPARAM = bpparam)
```

# SessionInfo
```{r}
sessionInfo()
```