---
title: "Genome-wide identification and classification of transcription factors in plant genomes"
author: 
  - name: Fabricio Almeida-Silva
    affiliation: |
      VIB-UGent Center for Plant Systems Biology, Ghent University, 
      Ghent, Belgium
  - name: Yves Van de Peer
    affiliation: |
      VIB-UGent Center for Plant Systems Biology, Ghent University, 
      Ghent, Belgium
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_depth: 2
    number_sections: yes
date: "`r Sys.Date()`"
bibliography: vignette_bibliography.bib
vignette: >
  %\VignetteIndexEntry{Genome-wide identification and classification of transcription factors in plant genomes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```


# Introduction

Transcription factors (TFs) are proteins that bind to cis-regulatory elements 
in promoter regions of genes and regulate their expression. 
Identifying them in a genome is useful for a variety of reasons, such as 
exploring their evolutionary history across clades and inferring 
gene regulatory networks. `r BiocStyle::Githubpkg("almeidasilvaf/planttfhunter")` 
allows users to identify plant TFs from whole-genome protein sequences and 
classify them into families and subfamilies (when applicable) using the 
classification scheme implemented in [PlantTFDB](http://planttfdb.gao-lab.org/).
As `r BiocStyle::Githubpkg("almeidasilvaf/planttfhunter")` interoperates with 
core Bioconductor packages (i.e., `AAStringSet` objects as input, 
`SummarizedExperiment` objects as output), it can be easily incorporated in 
pipelines for TF identification and classification in large-scale genomic
data sets.


# Installation

You can install `r BiocStyle::Githubpkg("almeidasilvaf/planttfhunter")` 
with the following code:

```{r "install", eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("planttfhunter")
```

Loading package after installation:

```{r load_package, message = FALSE}
library(planttfhunter)
```

# Data description

In this vignette, we will use protein sequences of TFs from the algae 
species *Galdieria sulphuraria* as an example, as its proteome is
very small. The proteome file was 
downloaded from the PLAZA Diatoms database [@osuna2020seminavis], and 
it was filtered to keep only TFs for demonstration purposes. 
The object `gsu` stores the protein sequences in an `AAStringSet` object.

```{r data}
data(gsu)
gsu
```


# Algorithm description

TF identification and classification is based on the presence of signature 
protein domains, which are identified using profile hidden Markov models (HMMs). 
The family classification scheme is the same as the one used by 
PlantTFDB [@jin2016planttfdb], and it is summarized below: [^1]

```{r scheme, echo = FALSE}
data(classification_scheme)
knitr::kable(classification_scheme)
```


[^1]: **Tip:** You can access this classification scheme in your R session by 
loading the data frame `data(classification_scheme)`.

# Identifying and classifying TFs

To identify TFs from protein sequence data, you will use the 
function `annotate_pfam()`. This function takes as input an `AAStringSet` 
object [^2] and returns a data frame of protein domains associated with 
each sequence. The HMMER program [@finn2011hmmer] is used to scan protein 
sequences for the presence of DNA-binding protein domains, as well as
auxiliary and forbidden domains. Pre-built HMM profiles can be found in the
*extdata/* directory of this package.

[^2]: **Tip:** If you have protein sequences in a FASTA file, you can read 
them into an `AAStringSet` object with the function `readAAStringSet()` from 
the `r BiocStyle::Biocpkg("Biostrings")` package.

This is how you can run `annotate_pfam()` [^3]:

[^3]: **Note:** in the code chunk below, the if statement is not required.
We just added it to make sure that the function `annotate_pfam()` is only
executed if HMMER is installed, to avoid problems when building this
vignette in machines that do not have HMMER installed. 

```{r identifying_tfs}
data(gsu_annotation)

# Annotate TF-related domains using a local installation of HMMER
if(hmmer_is_installed()) {
  gsu_annotation <- annotate_pfam(gsu)
} 

# Take a look at the first few lines of the output
head(gsu_annotation)
```

Now that we have our TF-related domains, we can classify TFs in families 
with the function `classify_tfs()`.

```{r classifying_tfs}
# Classify TFs into families
gsu_families <- classify_tfs(gsu_annotation)

# Take a look at the output
head(gsu_families)

# Count number of TFs per family
table(gsu_families$Family)
```

# Counting TFs per family in multiple species at once

If you want to get TF counts per family for multiple species, you can use
the function `get_tf_counts()`. This function takes a list of `AAStringSet`
objects containing proteomes as input [^4], and it returns a `SummarizedExperiment`
object containing TF counts per family in each species, as well as species
metadata (optional). If you are not familiar with the `SummarizedExperiment`
class, you should consider checking the vignettes of the 
`r BiocStyle::Biocpkg("SummarizedExperiment")` Bioconductor package.

[^4]: **Tip:** If you have whole-genome protein sequences for multiple
species as FASTA files in a given directory, you can read them all as a list
of `AAStringSet` objects with the function `fasta2AAStringSetlist()` from
the Bioconductor package `r BiocStyle::Biocpkg("syntenet")`.

To demonstrate how `get_tf_counts()` works, we will simulate a 
list of 4 `AAStringSet` objects by sampling 50 random genes from the example
data set `gsu` 4 times. 

```{r simulate_data_proteomes}
set.seed(123) # for reproducibility

# Simulate 4 different species by sampling 100 random genes from `gsu`
proteomes <- list(
    Gsu1 = gsu[sample(names(gsu), 50, replace = FALSE)],
    Gsu2 = gsu[sample(names(gsu), 50, replace = FALSE)],
    Gsu3 = gsu[sample(names(gsu), 50, replace = FALSE)],
    Gsu4 = gsu[sample(names(gsu), 50, replace = FALSE)]
)
proteomes
```

Great, we have a list of 4 `AAStringSet` objects. Now, let's also create a 
simulated species metadata data frame for each "species" (simulated).

```{r simulate_data_species_metadata}
# Create simulated species metadata
species_metadata <- data.frame(
    row.names = names(proteomes),
    Division = "Rhodophyta",
    Origin = c("US", "Belgium", "China", "Brazil")
)

species_metadata
```

You can add as many columns as you want to the species metadata data frame, 
but make sure that **species names are in row names**, and 
that `names(proteomes)` match `rownames(species)`, 
otherwise `get_tf_counts()` will return an error. 

Now that we have a list of `AAStringSet` objects and species metadata, we
can execute `get_tf_counts()`. This function uses `annotate_pfam()` under
the hood, so you also need to have HMMER installed and in your PATH to run it.
Here is how you can run it:

```{r get_tf_counts}
data(tf_counts)

# Get TF counts per family in each species as a SummarizedExperiment object
if(hmmer_is_installed()) {
    tf_counts <- get_tf_counts(proteomes, species_metadata)
}

# Take a look at the SummarizedExperiment object
tf_counts

# Look at the matrix of counts: assay() function from SummarizedExperiment
SummarizedExperiment::assay(tf_counts)

# Look at the species metadata: colData() function from SummarizedExperiment
SummarizedExperiment::colData(tf_counts)
```

Cool, huh? In real-world analyses, once you have TF counts per family in 
multiple species obtained with `get_tf_counts()`, 
you can try to find associations between TF counts and
eco-evolutionary aspects or traits of each species (e.g., higher frequencies of
a stress-related TF family in a species that inhabits a stressful environment).

# Session information {.unnumbered}

This document was created under the following conditions:

```{r session_info}
sessioninfo::session_info()
```

# References {.unnumbered}