---
title: "seqArchRplus facilitates downstream analysis of clusters of promoter sequence architectures"
author: 
- name: Sarvesh Nikumbh
  affiliation: MRC London Institute of Medical Sciences and Imperial College London
date: "`r Sys.Date()`"
package: seqArchRplus
output:
  BiocStyle::html_document:
    toc: true
  BiocStyle::pdf_document: default
bibliography: seqArchRplus.bib
vignette: >
  %\VignetteIndexEntry{seqArchRplus facilitates downstream analysis of clusters of promoter sequence architectures}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---



```{r setup, include=FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```


# Introduction

Analysis of the promoterome of any organism entails analyzing data from an 
experiment like CAGE (Cap analysis of gene expression) [@Kodzius:2006] that 
provides information on genome-wide transcription start sites at single 
nucleotide resolution.
These promoters can then be further studied to identify different 
classes [@Carninci:2006] among them based on different attributes, for instance, 
their shape (broad or focused/sharp promoters), gene function (tissue 
specific vs. housekeeping) etc.
These different promoter classes harbour a variety of promoter architectures 
orchestrated by different proteins together with sequence elements at 
near-fixed positions in the promoter that determine the position of the 
transcription start site.
The different promoter architectures are known to be used differentially by 
genes either in different tissues or at different developmental timepoints.
For instance, [@haberle2014shifting] have shown a dynamic interchange of 
promoter architecture within the same genomic region between maternal and 
zygotic stages of zebrafish development.
Identifying and studying these different promoter architectures further is thus 
vital.


While the R/Bioconductor package `r Biocpkg("seqArchR")` enables de novo 
identification of clusters of promoter sequence architectures [@seqArchR], 
this package, `r Biocpkg("seqArchRplus")`, enables performing various steps 
in their downstream bioinformatic analyses and produce publication-ready plots 
(building on various other Bioconductor packages).
The many steps in the downstream analyses of promoter sequence architectures
enabled by `r Biocpkg("seqArchRplus")` are:

- Curate the final set of clusters from `r Biocpkg("seqArchR")`. 
See the 
`r Biocpkg("seqArchR")` [vignette](http://bioconductor.org/packages/release/bioc/vignettes/seqArchR/inst/doc/seqArchR.html)
, or the [bioRxiv preprint](https://doi.org/10.1101/2023.03.02.530868) to 
understand in detail why this may be required 

- Order the sequence architectures by the interquantile widths (IQWs) of the 
tag clusters (_aka_ promoter shape).
See 
[`CAGEr` vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/CAGEr/inst/doc/CAGEexp.html) 
for more information on tag clusters and their IQWs

- Visualize distributions of IQW, TPM (Tags per million) and conservation 
scores (or other when available) per cluster

- Visualize the percentages of annotations for genome-wide CAGE-derived 
transcription tag clusters for each architecture-based clusters

- Ease of comparison across samples/stages: Visualize the above plots as 
(publication ready) combined panels

- Many per cluster visualizations including:
    - sequence logos of cluster architectures
    - strand-separated sequence logos of architectures
    - distributions of promoters on different chromosomes and strands
    - GO terms enriched for each cluster/architecture

- Produce BED track files of `r Biocpkg("seqArchR")` clusters for visualization 
in a genome browser or IGV

Examples of most of these capabilities are provided in this vignette.


We hope that using `r Biocpkg("seqArchRplus")` (together with 
`r Biocpkg("seqArchR")`) will be useful in swiftly, but comprehensively, 
analyzing promoters identified using CAGE for any organism, and enable 
insights and hypotheses generation with ease. 
So far, we have tested `r Biocpkg("seqArchR")` and `r Biocpkg("seqArchRplus")` 
on promoters in drosophila from [@schor2017promoter] and modENCODE 
[@chen2014comparative], zebrafish [@nepal2013dynamic], mice (Fantom 
[@Consortium:2014hz]), humans (ENCODE), and also in plants (barley and maize).

We already have a plan for a feature in the future version of 
`r Biocpkg("seqArchRplus")`: generate HTML reports that help you navigate this 
wealth of information with relative ease.



# Installation

The latest stable version of seqArchjRplus can be installed from Bioconductor 
as shown below.
```{r seqArchRplus-install, echo=TRUE, eval=FALSE}
## When available on Bioconductor
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("seqArchRplus")
```

In case of any errors or help required, please consider looking up:
[https://github.com/snikumbh/seqArchRplus](https://github.com/snikumbh/seqArchRplus)
and file a [new issue](https://github.com/snikumbh/seqArchRplus/issues/new).

# seqArchRplus for downstream analysis of promoter sequences

```{r setup-two, echo=TRUE,include=TRUE,results="hide",message=FALSE,warning=FALSE}
# Load seqArchRplus
library(seqArchRplus)
library(seqArchR)
library(Biostrings)
library(TxDb.Dmelanogaster.UCSC.dm6.ensGene)
library(org.Dm.eg.db)
library(ChIPseeker)

# Set seed for reproducibility
set.seed(1234)
```

## Setting up

To begin using `r Biocpkg("seqArchRplus")`, you require a set of promoters 
sequences (as a DNAStringSet object) and their clustering information (as a 
simple list storing the sequence IDs belonging to each cluster). 
With this, you can already visualize the cluster-wise sequence logos, 
distribution of chromosome and strand locations, GO term-enrichments, motif 
occurrence heatmaps.
When you have additional information such as the IQWs (shape information), 
`r Biocpkg("seqArchRplus")` can use these to order the clusters in the 
visualizations.
This visualization can be supplemented with the distribution of 
per-cluster TPM values and/or conservation scores etc. when provided.

If your workflow involves CAGEr (for pre-processing raw CAGE data) and 
`r Biocpkg("seqArchR")` (for clusters), they can be seamlessly used to utilize 
the full scope of `r Biocpkg("seqArchRplus")`.
In this case, it will be helpful if you have

- the CAGEr object or information on the tag clusters from 
[https://bioconductor.org/packages/release/bioc/html/CAGEr.html](CAGEr), 
specifically, the width of the tag clusters, total TPM value of the cluster and 
that of the dominant CTSS in the cluster

- the `r Biocpkg("seqArchR")` result object.


### Input data

In this vignette, we use the example data provided with the package. 
The raw CAGE data for different developmental timepoints in Drosophila 
melanogaster (Schor et al. 2017) were pre-processed with CAGEr to obtain 
promoter sequences. 
These were then processed with `r Biocpkg("seqArchR")` to cluster the promoter 
DNA sequences.
Only a subset of the complete data is provided with the package and used here 
to enable demonstration of package utility.
In particular, the data included is for the timepoint 10-12h after egg laying.

Specifically, the following files are provided: 

1. promoter sequences stored as a gzipped FASTA file: two kinds, with 45 and 
200 bp flanks around the TSS (filenames: `example_promoters45.fa.gz` and 
`example_promoters200.fa.gz`)
2. clustering information (filename: `example_clust_info.rds`)
3. GRanges object obtained from CAGEr which holds information on the 
tag clusters (promoters) with additional information such as the position of 
dominant TSS, TPM value for the dominant TSS, position of quantile boundaries 
(filename: `example_tc_gr.rds`)
4. BED file with same information as the GRanges object (filename: 
`example_info_df.rds`)
5. `r Biocpkg("seqArchR")` result object (filename: `seqArchR_result.rds`)

```{r prepare-fetch-data, eval=TRUE}
## 1. Raw DNA sequences
raw_seqs <- Biostrings::readDNAStringSet(
    filepath = system.file("extdata",
        "example_promoters45.fa.gz",
        package = "seqArchRplus",
        mustWork = TRUE
    )
)

## 2. Clustering information (arbitrary order/unordered)
unord_clusts <- readRDS(system.file("extdata", "example_clust_info.rds",
                            package = "seqArchRplus", mustWork = TRUE
                        ))

## 3. GRanges object
tc_gr <- readRDS(system.file("extdata", "example_tc_gr.rds",
                    package = "seqArchRplus", mustWork = TRUE
                ))

## 4. BED file
bed_info_fname <- system.file("extdata", "example_info_df.bed.gz",
                                package = "seqArchRplus", mustWork = TRUE
                            )
info_df <- read.delim(file = bed_info_fname, sep = "\t", header = TRUE)

## 5. seqArchR result
seqArchR_result <- readRDS(system.file("extdata", "seqArchR_result.rds",
      package = "seqArchRplus", mustWork = TRUE))

## **NOTE** Only for the example seqArchR result object provided with this
## package:
## 
## Any seqArchR result object has the raw DNA sequences as a part of it.
## See details here: https://snikumbh.github.io/seqArchR/reference/seqArchR.html
## 
## But, in order to reduce the size of the example seqArchR result object 
## provided with this (seqArchRplus) package, the `rawSeqs` element of the 
## result is missing. This is reassigned below for the purposes of this 
## vignette.
## 
## The seqArchR result was obtained by processing the DNA sequences in 
## `example_promoters45.fa.gz`. Thus we reassign them to 
## `seqArchR_result$rawSeqs` here
##  
if(!("rawSeqs" %in% names(seqArchR_result)))
    seqArchR_result$rawSeqs <- raw_seqs

```


If you are not using the result from `r Biocpkg("seqArchR")` or are already 
using a curated set of clusters, you can skip the following subsection and 
jump to section \@ref(generating-plots) for a demonstration of generating all 
plots.

### Curating raw clusters from `seqArchR` result {#curate-seqArchR-result}

Raw clusters from `r Biocpkg("seqArchR")` result (say, the final iteration) 
often require curation. 

The `r Biocpkg("seqArchR")` result does have a final `clustSol` (the clustering 
solution) where the clusters from the final iteration are collated. 
`r Biocpkg("seqArchR")` uses hierarchical clustering for this purpose. 
However, hierarchical clustering, with a chosen agglomeration and distance 
method, does not necessarily output the best suitable collated set of clusters. 
Therefore, some small amount of manual curation (re-assignments) may be 
required to reach the ideal collated set of clusters as the final solution. 
This is achieved with the help of the `seqArchRplus::curate_clusters` function.

The basic idea of the curate_clusters function is available in 
`help(seqArchRplus::curate_clusters)`. 
A more elaborate description is as follows.
It takes as input an agglomeration method and also a distance method (see more 
arguments in \code{help(curate_clusters)}). 
On the first call to the function, a plot with the associated dendrogram 
resulting from the hierarchical clustering is shown together with the per 
cluster sequence logos. 
This should help in identifying if the chosen agglomeration and distance 
methods worked well (i.e., are clusters with similar sequence logos also 
together in the dendrogram?).
If this is satisfactory (i.e., chosen agglomeration and distance method have 
done well already and only few curations will be required), count the number of 
overall clusters that you can visually see. 
Generally, if (estimated) 3-4 (even 5-6) clusters out of 30-40 require 
curation, I would consider it as a satisfactory outcome from the hierarchical 
clustering output.
Now, in the second call to `curate_clusters`, specify the number of clusters 
based on your count.
View the resulting clustering in the new plot produced as output.
This time the dendrogram shows this clustering with colors and grey-scaled 
boxes drawn around the clusters.
Now you can exactly note which clusters need curation. 

Imagine a hypothetical scenario where a total of 16 raw clusters are obtained 
from the final `r Biocpkg("seqArchR")` iteration.
Collating them into 4 clusters using hierarchical clustering results in:

- collated cluster 1: 1, 7, 9
- collated cluster 2: 2, 3, 4, 8, 10
- collated cluster 3: 11, 12, 13
- collated cluster 4: 5, 6, 14, 15, 16

But, due to very minor differences in the sequence logos, you may want to move 
raw cluster 9 from collated cluster 1 to collated cluster 3 (that of raw 
clusters 11-13).
To do this, set the `need_change` argument of `curate_clusters` to `c(9)` and 
`change_to` argument to `c(11)`. 
Any one of the destination cluster members can be specified in `change_to`. 
Similarly, any other such curations can be added. 
Specify all such curations together in one go like this: 

- `need_change <- list(c(9), c(4, 7), c(16))` and 
- `change_to <- list(11, 14, 0)`

Here, (a) raw cluster 9 is moved to the collated cluster containing raw cluster
11; (b) 4 and 7 are moved to that of 14; and (c) 16 is moved to a totally new 
cluster of itself.
The result is the re-assigned clusters.

Below, this procedure of curation is re-iterated with the help of actual 
function calls and resulting figures using a reduced example data available 
with the package.


```{r curation-call-1, echo=TRUE, eval=TRUE, warning=FALSE, fig.cap="Figure showing a combined panel of dendrogram + sequence logos at the first step of curation", fig.wide=TRUE, fig.height=3.5}

sn <- "RAL28_10_to_12h"
use_aggl <- "complete"
use_dist <- "euclid"

## get seqArchR clusters custom curated
seqArchR_curated_clusts <- curate_clusters(sname = sn,
                                use_aggl = use_aggl, use_dist = use_dist,
                                seqArchR_result = seqArchR_result, iter = 5,
                                pos_lab = NULL, regularize = TRUE, topn = 50,
                                use_cutk = 2, final = FALSE, dir_path = NULL
                            )
seqArchR_curated_clusts$curation_plot
```

The motive of Figure \@ref(fig:curation-call-1) is to look at the clusters 
ordered by the hierarchical clustering step. 
This way, counting the tentative number of clusters (K) is easier.

In Figure \@ref(fig:curation-call-1), we can see that the first and the second 
clusters can be collated together. 
Therefore, to demonstrate curation, we will set K=5 clusters and perform this 
minor curation as shown below. 

```{r curation-set-k, echo=TRUE, eval=TRUE, warning=FALSE, fig.cap="Figure showing a combined panel of dendrogram + sequence logos at the second step of curation", fig.wide=TRUE, fig.height=3.5}
## Let us set K=5 for this example, and combine clusters 1 and 2 into one.
set_cutk <- 5

## Form the lists `need_change` and `change_to` for re-assignments
need_change <- list(c(2))
change_to <- list(c(1))

seqArchR_curated_clusts <- curate_clusters(sname = sn,
                                use_aggl = use_aggl, use_dist = use_dist,
                                seqArchR_result = seqArchR_result, iter = 5,
                                regularize = TRUE, topn = 50,
                                use_cutk = set_cutk, #***
                                need_change = need_change,
                                change_to = change_to,
                                final = FALSE, #***
                                dir_path = NULL
                            )
seqArchR_curated_clusts$curation_plot
```

Notice that each dendrogram branch (and leaf node) has been assigned a different 
color because we set K=5.
We are now ready to make the final call to `curate_clusters` and obtain the 
curated clusters' list.

```{r curation-final-call, echo=TRUE, eval=TRUE, warning=FALSE, fig.cap="Figure showing three panels  combined: dendrogram + original sequence logos + collated cluster sequence logos at the final step of curation", fig.wide=TRUE, fig.height=4, fig.width=15}
## Satisfied with the re-assignments? now set final = TRUE
seqArchR_curated_clusts <- curate_clusters(sname = sn,
                                use_aggl = use_aggl, use_dist = use_dist,
                                seqArchR_result = seqArchR_result, iter = 5,
                                pos_lab = NULL, regularize = FALSE, topn = 50,
                                use_cutk = set_cutk, #***
                                need_change = need_change,
                                change_to = change_to,
                                final = TRUE, #***
                                dir_path = NULL
                            )
seqArchR_curated_clusts$curation_plot
```
Notice that the re-assigned clusters (1 and 2) now have different colors than 
the corresponding dendrogram branches.


```{r show-clusts}
str(seqArchR_curated_clusts$clusters_list)

```

This fetches us clusters in _arbitrary_ order, i.e., not related to the IQWs 
of the clusters.
[See below for the function call `order_clusters_iqw()` that orders these 
clusters by their median/mean interquantile widths (section \@ref(iqw-ord)).]
This curated set of clusters can now be used for further downstream analyses.

## Generating individual plots via individual functions {#generating-plots}

The following subsections demonstrate how the `r Biocpkg("seqArchRplus")` 
functions can be used to generate the various visualizations.

<!-- One can use the different individual functions from seqArchRplus to generate all  -->
<!-- the plots as demonstrated in the following subsections. -->


### Visualize architectures for each promoter cluster as sequence logos

One can visualize the architecture for each cluster of promoters as a sequence 
logo using the function `per_cluster_seqlogos`.
When the argument `one_plot` is set to TRUE, the function returns a single plot 
where sequence logos for all clusters are arranged one below the other.
When set to FALSE, a list of sequence logo plots is returned instead. 
In the later case, each plot has a title with information on the number of 
sequences in the cluster.

```{r seqlogos-one-plot, eval=TRUE, attr.source='.numberLines', warning=FALSE}
##
seqlogos_oneplot_pl <- per_cluster_seqlogos(
                            sname = "RAL28_10_to_12h",
                            seqs = raw_seqs,
                            clusts = unord_clusts,
                            pos_lab = -45:45, bits_yax = "auto",
                            strand_sep = FALSE, one_plot = TRUE,
                            txt_size = 12, dir_path = NULL
                        )
seqlogos_oneplot_pl
```

The sequence logos can also be obtained as a list instead of already combined 
into one plot by simply setting the argument `one_plot = FALSE`.
This gives you liberty to use them (either individually or otherwise) 
together with any other plots as is suitable.
```{r seqlogos-list, eval=FALSE, attr.source='.numberLines', warning=FALSE}
## Obtain the sequence logos as a list for combining later by setting the 
## 'one_plot' argument to FALSE
seqlogos_list_pl <- per_cluster_seqlogos(
                        sname = "RAL28_10_to_12h",
                        seqs = raw_seqs,
                        clusts = unord_clusts,
                        pos_lab = -45:45, bits_yax = "auto",
                        strand_sep = FALSE, one_plot = FALSE,
                        dir_path = NULL,
                        txt_size = 12
                    )
```

Obtaining strand-separated sequence logos is demonstrated in section 
\@ref(strand-seqlogos).

### Visualize the distribution of IQW and TPM values for each cluster {#iqw-ord}

The function `iqw_tpm_plots` orders the clusters by their IQW values 
(when quantitative information on promoter shape is available, for instance, the 
interquantile widths of the CAGEr tag clusters) and visualizes the distribution 
of the IQW and TPM values for each cluster as boxplots.
The input clusters are ordered from sharp (top) to broad (bottom).

```{r iqw-tpm-plot, eval=TRUE, attr.source='.numberLines'}
##
iqw_tpm_pl <- iqw_tpm_plots(sname = "RAL28_10_to_12h",
                            dir_path = NULL,
                            info_df = info_df,
                            iqw = TRUE, tpm = TRUE, cons = FALSE,
                            clusts = unord_clusts,
                            txt_size = 15
                            )
iqw_tpm_pl
```

### Fetch clusters ordered by promoter shape

The cluster order visualized in the IQW_TPM plots above can be fetched using 
the function `order_cluster_iqw`.

```{r order-clusters, eval=TRUE, attr.source='.numberLines'}
## get clusters ordered by median IQW values
seqArchR_clusts_ord <- order_clusters_iqw(sname = "RAL28_10_to_12h", 
                                                        clusts = unord_clusts,
                                                        info_df = info_df, 
                                                        order_by_median = TRUE
                                    )
str(unord_clusts)
str(seqArchR_clusts_ord)
```



### Visualize sequence clusters as images colored by the nucleotides

The function `seqs_acgt_image` enables visualizing a set of sequences as an 
image colored by the nucleotides.
Here, one can choose to order sequences by the clusters (see argument 
`seqs_ord`).

```{r acgt-image, eval=TRUE, attr.source='.numberLines'}
##
seqs_acgt_image(sname = "RAL28_10_to_12h",
                        seqs = raw_seqs,
                        seqs_ord = unlist(seqArchR_clusts_ord),
                        pos_lab = -45:45, dir_path = NULL
                    )
```


### Visualize motif occurrences as heatmaps

Dinucleotides are often deemed important within promoter sequences. 
For instance, the common dinucleotides at TSSs are PyPu (a pyrimidine-purine 
pair). 
Similarly, W-boxes (WW) that occur at ~30 bp upstream of the TSS are an 
important TSS-determining feature in Zebrafish maternal stage promoters.
Occurrences of various dinucleotides can be visualized as heatmaps using the 
function `plot_motif_heatmaps` (based on `r Biocpkg("heatmaps")`)) or 
`plot_motif_heatmaps2` (based on `r Biocpkg("seqPattern")`.
We demonstrate only `plot_motif_heatmaps` here.

```{r motif-heatmaps, eval=TRUE, attr.source='.numberLines'}
## Get larger flank FASTA sequences (larger than those used for seqArchR)
use_seqs <- Biostrings::readDNAStringSet(filepath = system.file("extdata",
                                            "example_promoters200.fa.gz",
                                            package = "seqArchRplus",
                                            mustWork = TRUE
                                        )
                                        )

plot_motif_heatmaps(sname = "RAL28_10_to_12h", seqs = use_seqs,
                                    flanks = c(200),
                                    clusts = seqArchR_clusts_ord,
                                    motifs = c("WW", "SS"),
                                    dir_path = NULL,
                                    fheight = 80, fwidth = 160
                                )

```

### Per cluster annotations' plot

In each cluster, the promoters (more specifically, the transcription start 
sites) can be annotated by their genomic feature, i.e., whether the start site 
is located within a promoter region of a gene, the 5' UTR, 3' UTR, exon, intron 
or is intergenic.
`r Biocpkg("seqArchRplus")` enables visualizing these annotation proportions 
per cluster as stacked bar plots via the function `per_cluster_annotations`.
When the argument `one_plot` is set to TRUE, this function returns all stacked 
bars combined within a single plot (see example below).
Otherwise, a list of individual/per-cluster plots is returned.

```{r annotation-one-plot, eval=TRUE, attr.source='.numberLines'}
##
annotations_oneplot_pl <-
    per_cluster_annotations(
        sname = "RAL28_10_to_12h",
        clusts = seqArchR_clusts_ord,
        tc_gr = tc_gr,
        cager_obj = NULL,
        qLow = 0.1, qUp = 0.9,
        txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene,
        tss_region = c(-500, 100),
        orgdb_obj = NULL, dir_path = NULL,
        one_plot = TRUE,
        txt_size = 12
    )
annotations_oneplot_pl
## Obtain the per cluster annotations as a list for combining later by setting 
## the 'one_plot' argument to FALSE

# annotations_list_pl <- per_cluster_annotations(
#                             sname = "RAL28_10_to_12h",
#                             clusts = unord_clusts,
#                             tc_gr = tc_gr,
#                             cager_obj = NULL,
#                             qLow = 0.1, qUp = 0.9,
#                             txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene,
#                             tss_region = c(-500, 100),
#                             orgdb_obj = NULL, dir_path = NULL,
#                             one_plot = FALSE,
#                             txt_size = 12
#                         )
```

### Per cluster strand-wise sequence logos {#strand-seqlogos}

Similar to the sequence logos of all promoters in a cluster (section XY), one 
can separate promoters in a cluster that are on the positive strand vs. the 
negative strand, and visualize them separately as shown below.
Note that this is the same function `per_cluster_seqlogos` with the argument 
`strand_sep` set to TRUE.

```{r stranded-seqlogos, eval=TRUE, attr.source='.numberLines', fig.height=10, warning=FALSE}
## Obtain strand-separated sequence logos corresponding to each cluster
stranded_seqlogos_pl <- per_cluster_seqlogos(
                            sname = "RAL28_10_to_12h",
                            seqs = raw_seqs,
                            clusts = seqArchR_clusts_ord,
                            pos_lab = -45:45, bits_yax = "auto",
                            info_df = info_df,
                            strand_sep = TRUE, #**
                            one_plot = FALSE, #**
                            dir_path = NULL,
                            txt_size = 12
                        )
one_plot <- cowplot::plot_grid(plotlist = stranded_seqlogos_pl, ncol = 1)
one_plot
```

### Per cluster chromosome- and strand-wise distribution of promoters

The `r Biocpkg("seqArchRplus")` function `per_cluster_strand_dist` can enable 
visualizing as barplots how promoters in each cluster are distributed on 
different chromosomes and strands.
```{r stranded-chrom-prop, eval=TRUE, attr.source='.numberLines', fig.height=7}

pair_colrs <- RColorBrewer::brewer.pal(n = 5, name = "Set3")
## Obtain strand-separated sequence logos corresponding to each cluster
stranded_dist_pl <- per_cluster_strand_dist(
                            sname = "RAL28_10_to_12h",
                            clusts = seqArchR_clusts_ord,
                            info_df = info_df,
                            dir_path = NULL,
                            txt_size = 12
                        )
one_plot <- cowplot::plot_grid(plotlist = stranded_dist_pl, ncol = 1)
one_plot
```

### Per cluster GO term enrichments

GO terms enriched per cluster can be obtained using the 
`r Biocpkg("seqArchRplus")` `per_cluster_go_term_enrichments()`.
This function returns a list of plots showing the GO terms enriched.

```{r go-enrichment, eval=TRUE, attr.source='.numberLines', fig.width=15, fig.height=15}

go_enrich_pl <- per_cluster_go_term_enrichments(
                            sname = "RAL28_10_to_12h",
                            clusts = seqArchR_clusts_ord,
                            tc_gr = tc_gr, 
                            txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene,
                            dir_path = NULL,
                            one_file = FALSE, #***
                            tss_region = c(-500,100),
                            orgdb_obj = "org.Dm.eg.db"
                        )

one_plot <- cowplot::plot_grid(plotlist = go_enrich_pl, ncol = 2)
one_plot
```


### Large panels combining multiple plots

```{r large-panels, eval=TRUE, attr.source='.numberLines', fig.wide=TRUE, fig.height=8, fig.width=20, fig.cap="Multiple plots combined as panels to generate one large figure."}


form_combined_panel(iqw_tpm_pl = iqw_tpm_pl, seqlogos_pl = seqlogos_oneplot_pl, 
                    annot_pl = annotations_oneplot_pl)

```

In a multi-sample study where the samples designate either different 
timepoints or cell types, all such (large) figures can be generated for all 
samples to produce a list of such plots. 


### HTML reports with large combined panel plots

```{r html-report, eval=TRUE, attr.source='.numberLines', fig.wide=TRUE, fig.height=8, fig.width=20, fig.cap="Multiple plots combined as panels to geenrate one large figure.", eval=FALSE}


generate_html_report(snames = c("RAL28_10_to_12h", "RAL28_10_to_12h"), 
                        file_type = "PDF", dir_path = tempdir())

```

In a multi-sample study where the samples designate either different 
timepoints or conditions, all such (large) figures can be generated for all 
samples to produce a list of such plots which can then be visualised in an 
HTML file that enables scrolling between the combined panel plots from one 
sample to another.


### Clusters as track BED files for visualization in the genome browser

We recommend viewing the promoters in the genome browser or IGV to 
observe additional details such as the position of the dominant TSS etc.
The relevant `r Biocpkg("seqArchRplus")` function that writes to disk this 
information as browser track BED files is `write_seqArchR_cluster_track_bed`.

```{r track-bed, eval=TRUE, attr.source='.numberLines'}
##
write_seqArchR_cluster_track_bed(
                    sname = "RAL28_10_to_12h",
                    clusts = seqArchR_clusts_ord,
                    info_df = info_df,
                    use_q_bound = FALSE,
                    one_zip_all = TRUE,
                    org_name = "Dmelanogaster",
                    dir_path = tempdir(),
                    include_in_report = FALSE,
                    strand_sep = FALSE
                )
```

### Some additional notes

Depending upon the use case, one may use any ordering for the input clusters 
for all functions.
Looking at their annotations, if some clusters happen to occur predominantly 
in 5' UTR or 3' UTR or exons or introns etc., such 
clusters can be sequestered from further downstream analyses if suitable for 
the end goal.

## Generating all plots at once

You can let `r Biocpkg("seqArchRplus")` generate all plots using default 
settings like so.

```{r generate-all-plots, eval=FALSE}

generate_all_plots(
    sname = "RAL28_10_to_12h",
    bed_info_fname = bed_info_fname,
    seqArchR_clusts = unord_clusts,
    raw_seqs = raw_seqs,
    tc_gr = tc_gr,
    use_q_bound = FALSE,
    order_by_iqw = TRUE,
    use_median_iqw = TRUE,
    iqw = TRUE, tpm = TRUE, cons = FALSE,
    pos_lab = -45:45,
    txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene,
    org_name = "Dmelanogaster",
    qLow = 0.1, qUp = 0.9,
    tss_region = c(-500, 100),
    raw_seqs_mh = use_seqs,
    motifs = c("WW", "SS", "TATAA", "CG"),
    motif_heatmaps_flanks = c(50, 100, 200),
    dir_path = tempdir(),
    txt_size = 25
)
```

# Conclusion

Promoter sequences can be identified using CAGE or variations of CAGE 
experiment.
Clusters of promoter sequences (identified by `r Biocpkg("seqArchR")` or 
otherwise) can be further analyzed with `r Biocpkg("seqArchRplus")`.
This vignette demonstrated many of the downstream analyses steps for studying 
these promoter sequence clusters/architectures.

# Session Info
```{r session_info, include=TRUE, echo=TRUE, results='markup'}
sessionInfo()
```

# References