---
title: "Identification and classification of duplicated genes"
author: 
  - name: Fabricio Almeida-Silva
    affiliation: |
      VIB-UGent Center for Plant Systems Biology, Ghent University, 
      Ghent, Belgium
  - name: Yves Van de Peer
    affiliation: |
      VIB-UGent Center for Plant Systems Biology, Ghent University, 
      Ghent, Belgium
output: 
  BiocStyle::html_document:
    toc: true
    number_sections: yes
bibliography: bibliography.bib
vignette: >
  %\VignetteIndexEntry{Identification and classification of duplicated genes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
)
```

# Introduction

Gene and genome duplications are a source of raw genetic material for evolution
[@ohno2013evolution]. However, whole-genome duplications (WGD) and small-scale
duplications (SSD) contribute to genome evolution in different manners. To
help you explore the different contributions of WGD and SSD to evolution, we
developed `r BiocStyle::Githubpkg("doubletrouble")`, a package that can be
used to identify and classify duplicated genes from whole-genome 
protein sequences, calculate substitution rates per substitution site (i.e.,
$K_a$ and $K_s$) for gene pairs, find peaks in $K_s$ distributions, and classify
gene pairs by age groups.

# Installation

You can install `r BiocStyle::Githubpkg("doubletrouble")` from Bioconductor
with the following code:

```{r installation, eval = FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("doubletrouble")

## Check that you have a valid Bioconductor installation
BiocManager::valid()
```

Then, you can load the package:

```{r load_package}
library(doubletrouble)
```

# Input data

To identify and classify duplicated gene pairs, users need two types of input 
data:

1. Whole-genome protein sequences (a.k.a. "proteome"), with only one protein 
sequence per gene (i.e., translated sequence of the primary transcript). 
These are typically stored in *.fasta* files.

2. Gene annotation, with genomic coordinates of all features (i.e., genes, exons, 
etc). These are typically stored in *.gff3/.gff/.gtf* files.

3. (Optional) Coding sequences (CDS), with only one DNA sequence sequence per 
gene. These are only required for users who want to calculate 
substitution rates (i.e., $K_a$, $K_s$, and their ratio $K_a/K_s$), 
and they are typically stored in *.fasta* files.

In the Bioconductor ecosystem, sequences and ranges are stored in 
standardized S4 classes named 
`XStringSet` (`AAStringSet` for proteins, `DNAStringSet` for DNA) and `GRanges`, 
respectively. This ensures seamless interoperability across packages, which
is important for both users and package developers. 
Thus, `r BiocStyle::Biocpkg("doubletrouble")` expects 
proteomes in lists of `AAStringSet` objects, and annotations in lists of
`GRanges` objects. Below you can find a summary of input data types, their
typical file formats, and Bioconductor class.

| Input data | File format | Bioconductor class | Requirement |
|:-----------|:------------|:-------------------|:------------|
| Proteome   | FASTA       | `AAStringSet`      | Mandatory   |
| Annotation | GFF3/GTF    | `GRanges`          | Mandatory   |
| CDS        | FASTA       | `DNAStringSet`     | Optional    |

Names of list elements represent species identifiers
(e.g., name, abbreviations, taxonomy IDs, or anything you like), and **must**
be consistent across different lists, so correspondence can be made. 
For instance, suppose you have an object named `seqs` with a list of 
`AAStringSet` objects (proteomes for each species) 
named *Athaliana*, *Alyrata*, and *Bnapus*. You also have an object 
named `annotation` with a list of `GRanges` objects (gene annotation for
each species). In this example, list names in `annotation` must also be
*Athaliana*, *Alyrata*, and *Bnapus* (not necessarily in that order), so that
`r BiocStyle::Biocpkg("doubletrouble")` knows that element *Athaliana* 
in `seqs` corresponds to element *Athaliana* in `annotation`. You can check
that with:

```{r eval=FALSE}
# Checking if names of lists match
setequal(names(seqs), names(annotation)) # should return TRUE
```

**IMPORTANT:** If you have protein sequences as FASTA files in a directory,
you can read them into a list of `AAStringSet` objects with the function
`fasta2AAStringSetlist()` from the Bioconductor package
`r BiocStyle::Biocpkg("syntenet")`. Likewise, you can get a `GRangesList`
object from GFF/GTF files with the function `gff2GRangesList()`, also
from `r BiocStyle::Biocpkg("syntenet")`.


# Getting to know the example data sets

In this vignette, we will use data (proteome, gene annotations, and CDS) from 
the yeast species *Saccharomyces cerevisiae* and *Candida glabrata*, 
since their genomes are relatively small (and, hence, great for 
demonstration purposes). Our goal here is to identify and classify duplicated 
genes in the *S. cerevisiae* genome. The *C. glabrata* genome will be used as an outgroup to identify transposed duplicates later in this vignette.


Data were obtained from Ensembl Fungi release 54 [@yates2022ensembl].
While you can download these data manually from the Ensembl Fungi webpage (or
through another repository such as NCBI RefSeq), here we will demonstrate how
you can get data from Ensembl using the `r BiocStyle::Biocpkg("biomartr")` 
package.

```{r eval=FALSE}
species <- c("Saccharomyces cerevisiae", "Candida glabrata")

# Download data from Ensembl with {biomartr}
## Whole-genome protein sequences (.fa)
fasta_dir <- file.path(tempdir(), "proteomes")
fasta_files <- biomartr::getProteomeSet(
    db = "ensembl", organisms = species, path = fasta_dir
)

## Gene annotation (.gff3)
gff_dir <- file.path(tempdir(), "annotation")
gff_files <- biomartr::getGFFSet(
    db = "ensembl", organisms = species, path = gff_dir
)

## CDS (.fa)
cds_dir <- file.path(tempdir(), "CDS")
cds_files <- biomartr::getCDSSet(
    db = "ensembl", organisms = species, path = cds_dir
)

# Import data to the R session
## Read .fa files with proteomes as a list of AAStringSet + clean names
seq <- syntenet::fasta2AAStringSetlist(fasta_dir)
names(seq) <- gsub("\\..*", "", names(seq))

## Read .gff3 files as a list of GRanges
annot <- syntenet::gff2GRangesList(gff_dir)
names(annot) <- gsub("\\..*", "", names(annot))

## Read .fa files with CDS as a list of DNAStringSet objects
cds <- lapply(cds_files, Biostrings::readDNAStringSet)
names(cds) <- gsub("\\..*", "", basename(cds_files))

# Process data
## Keep ranges for protein-coding genes only
yeast_annot <- lapply(annot, function(x) {
    gene_ranges <- x[x$biotype == "protein_coding" & x$type == "gene"]
    gene_ranges <- IRanges::subsetByOverlaps(x, gene_ranges)
    return(gene_ranges)
})

## Keep only longest sequence for each protein-coding gene (no isoforms)
yeast_seq <- lapply(seq, function(x) {
    # Keep only protein-coding genes
    x <- x[grep("protein_coding", names(x))]
    
    # Leave only gene IDs in sequence names
    names(x) <- gsub(".*gene:| .*", "", names(x))
    
    # If isoforms are present (same gene ID multiple times), keep the longest
    x <- x[order(Biostrings::width(x), decreasing = TRUE)]
    x <- x[!duplicated(names(x))]
    
    return(x)
})
```

Note that processing might differ depending on the data source. For instance, 
Ensembl adds gene 'biotypes' (e.g., protein-coding, pseudogene, etc) in sequence
names and in a field named *biotype* in annotation files. Other databases
might add these information elsewhere.

To avoid problems building this vignette (due to no/slow/unstable internet
connection), the code chunk above is not executed. Instead, we ran such code
and saved data in the following objects:

- **yeast_seq:** A list of `AAStringSet` objects with elements
named *Scerevisiae* and *Cglabrata*.

- **yeast_annot:** A `GRangesList` object with elements 
named *Scerevisiae* and *Cglabrata*.

Let's take a look at them.

```{r example_data}
# Load example data
data(yeast_seq)
data(yeast_annot)

yeast_seq
yeast_annot
```

# Data preparation

First of all, we need to process the list of protein sequences and gene ranges
to detect synteny with `r BiocStyle::Biocpkg("syntenet")`. We will do that
using the function `process_input()` from 
the `r BiocStyle::Biocpkg("syntenet")` package.

```{r process_input}
library(syntenet)

# Process input data
pdata <- process_input(yeast_seq, yeast_annot)

# Inspect the output
names(pdata)
pdata$seq
pdata$annotation
```

The processed data are represented as a list with the elements `seq` and
`annotation`, each containing the protein sequences and gene ranges for
each species, respectively.

Finally, we need to perform pairwise sequence similarity searches to
identify the whole set of paralogous gene pairs. We can do this 
using the function `run_diamond()` from the `r BiocStyle::Biocpkg("syntenet")`
package [^1], setting `compare = "intraspecies"` to perform only intraspecies
comparisons.

[^1]: **Note:** you need to have DIAMOND installed in your machine to run
this function. If you don't have it, you can use 
the `r BiocStyle::Biocpkg("Herper")` package to install DIAMOND in a Conda
environment and run DIAMOND from this virtual environment.

```{r run_diamond_intraspecies}
data(diamond_intra)

# Run DIAMOND in sensitive mode for S. cerevisiae only
if(diamond_is_installed()) {
    diamond_intra <- run_diamond(
        seq = pdata$seq["Scerevisiae"],
        compare = "intraspecies", 
        outdir = file.path(tempdir(), "diamond_intra"),
        ... = "--sensitive"
    )
}

# Inspect output
names(diamond_intra)
head(diamond_intra$Scerevisiae_Scerevisiae)
```

And voilà! Now that we have the DIAMOND output and the processed annotation,
you can classify the duplicated genes.

# Classifying duplicated gene pairs and genes

To classify duplicated gene pairs based on their modes of duplication,
you will use the function `classify_gene_pairs()`. This function offers
four different classification schemes, depending on how much detail you
want. The classification schemes, along with the duplication modes
they identify and their required input, are summarized in the table below:


| Scheme   | Duplication modes           | Required input                                     |
|:---------|:----------------------------|:---------------------------------------------------|
| binary   | SD, SSD                     | `blast_list`, `annotation`                         |
| standard | SD, TD, PD, DD              | `blast_list`, `annotation`                         |
| extended | SD, TD, PD, TRD, DD         | `blast_list`, `annotation`, `blast_inter`          |
| full     | SD, TD, PD, rTRD, dTRD, DD  | `blast_list`, `annotation`, `blast_inter`, `intron_counts` |

**Legend:** SD, segmental duplication. SSD, small-scale duplication.
TD, tandem duplication. PD, proximal duplication. 
TRD, transposon-derived duplication. rTRD, retrotransposon-derived duplication.
dTRD, DNA transposon-derived duplication. DD, dispersed duplication.


As shown in the table, the minimal input objects are:

- **blast_list**: A list of data frames with DIAMOND (or BLASTp, etc.) tabular 
output for intraspecies comparisons as returned 
by `syntenet::run_diamond(..., compare = 'intraspecies')`.
- **annotation**: The processed annotation list (a `GRangesList` object) 
as returned by `syntenet::process_input()`.


However, if you also want to identify transposon-derived duplicates (TRD)
and further classify them as retrotransposon-derived duplicates (rTRD) or 
DNA transposon-derived duplicates (dTRD), you will need the following objects:

- **blast_list**: A list of data frames with DIAMOND (or BLASTp, etc.) tabular 
output for interspecies comparisons (target species vs an outgroup) as returned 
by `syntenet::run_diamond(..., compare = <comparison_data_frame>)`.
- **intron_counts**: A list of data frames with the number of introns per gene
for each species, as returned by `get_intron_counts()`.


Below, we demonstrate each classification scheme with examples.

## The *binary* scheme (SD vs SSD)

The binary scheme classifies duplicates as derived from either 
segmental duplications (SD) or small-scale duplications (SSD).
To identify segmental duplicates, the function `classify_gene_pairs()` 
performs intragenome synteny detection scans 
with `r BiocStyle::Biocpkg("syntenet")` and classifies any detected anchor
pairs as segmental duplicates. The remaining pairs are classified as
originating from small-scale duplications.

This scheme can be used by specifying `scheme = "binary"` in the
function `classify_gene_pairs()`. 

```{r binary_classification}
# Binary scheme
c_binary <- classify_gene_pairs(
    annotation = pdata$annotation,
    blast_list = diamond_intra,
    scheme = "binary"
)

# Inspecting the output
names(c_binary)
head(c_binary$Scerevisiae)

# How many pairs are there for each duplication mode?
table(c_binary$Scerevisiae$type)
```

The function returns a list of data frames, each containing the duplicated
gene pairs and their modes of duplication for each species (here, because
we have only one species, this is a list of length 1).

## The *standard* scheme (SSD &rarr; TD, PD, DD)

Gene pairs derived from small-scale duplications can be further classified
as originating from tandem duplications (TD, genes are adjacent to each other),
proximal duplications (PD, genes are separated by only a few genes), or
dispersed duplications (DD, duplicates that do not fit in any of the previous
categories).

This is the default classification scheme in `classify_gene_pairs()`,
and it can be specified by setting `scheme = "standard"`.

```{r expanded_classification}
# Standard scheme
c_standard <- classify_gene_pairs(
    annotation = pdata$annotation,
    blast_list = diamond_intra,
    scheme = "standard"
)

# Inspecting the output
names(c_standard)
head(c_standard$Scerevisiae)

# How many pairs are there for each duplication mode?
table(c_standard$Scerevisiae$type)
```

## The *extended* scheme (SSD &rarr; TD, PD, TRD, DD)

To find transposon-derived duplicates (TRD), the 
function `classify_gene_pairs()` detects syntenic regions between a target
species and an outgroup species. Genes in the target species that are in 
syntenic regions with the outgroup are treated as *ancestral loci*. Then,
if only one gene of the duplicate pair is an ancestral locus, this 
duplicate pair is classified as originating from transposon-derived
duplications. 


Since finding transposon-derived duplicates requires detecting syntenic regions
between a target species and an outgroup species, you will first need to 
perform similarity searches with DIAMOND [@buchfink2021sensitive], 
BLAST [@altschul1997gapped], or similar programs. This can be done with
`syntenet::run_diamond(seq, compare = compare_df)`. For the parameter `compare`,
you will pass a 2-column data frame specifying the comparisons to be made. [^3]
Importantly, for a more accurate detection of interspecies synteny, you need to
perform bidirectional similarity searches for each comparison. For instance,
if you want to use `speciesA` as target species and `speciesB` as outgroup,
you need to perform similarity searches in both directions: 
`speciesA_speciesB` and `speciesB_speciesA`.

[^3]: **Pro tip:** If you want to identify and classify duplicated genes for
multiple species in batch, you must include the outgroup for each of them
in the comparisons data frame.

Here, we will identify duplicated gene pairs for *Saccharomyces cerevisiae*
using *Candida glabrata* as an outgroup. To create a data frame with the 
bidirectional comparisons to be made, we will use the helper function `make_bidirectional()` from the `r BiocStyle::Biocpkg("syntenet")` package.

```{r make_bidirectional_comparisons}
# Create a data frame of species and outgroups for `syntenet::run_diamond()`
spp_outgroup <- data.frame(
    species = "Scerevisiae",
    outgroup = "Cglabrata"
)
spp_outgroup

# Expand the data frame to make bidirectional comparisons
comparisons <- syntenet::make_bidirectional(spp_outgroup)
comparisons
```

Now that we have a data frame with our desired comparisons, we can pass it
to `syntenet::run_diamond()`.

```{r blast_interspecies}
data(diamond_inter) # load pre-computed output in case DIAMOND is not installed

# Run DIAMOND for the comparisons we specified
if(diamond_is_installed()) {
    diamond_inter <- run_diamond(
        seq = pdata$seq,
        compare = comparisons,
        outdir = file.path(tempdir(), "diamond_inter"),
        ... = "--sensitive"
    )
}

names(diamond_inter)
head(diamond_inter$Scerevisiae_Cglabrata)
```

As you can see, for each species-outgroup pair, `diamond_inter` contains two
data frames: one named `Scerevisiae_Cglabrata`, and one named 
`Cglabrata_Scerevisiae`. Before actually classifying gene pairs, we will
need to collapse these data frames so that we have 
**only one data frame per species-outgroup pair**. This can be easily done
with the function `collapse_bidirectional_hits()` from 
`r BiocStyle::Biocpkg("syntenet")`. As input, this function takes a 
list of interspecies DIAMOND data frames, and a 2-column data frame indicating
the target species and the outgroup species (columns 1 and 2, respectively;
double-check the order of the columns!).

```{r collapse_hits}
# For each species-outgroup pair, collapse bidirectional hits in one data frame
diamond_inter <- syntenet::collapse_bidirectional_hits(
    diamond_inter, compare = spp_outgroup
)
names(diamond_inter)
```

Then, we can pass this interspecies DIAMOND output as an argument to 
the parameter `blast_inter` of `classify_gene_pairs()`.

```{r full_classification}
# Extended scheme
c_extended <- classify_gene_pairs(
    annotation = pdata$annotation,
    blast_list = diamond_intra,
    scheme = "extended",
    blast_inter = diamond_inter
)

# Inspecting the output
names(c_extended)
head(c_extended$Scerevisiae)

# How many pairs are there for each duplication mode?
table(c_extended$Scerevisiae$type)
```

In the example above, we used only one outgroup species (*C. glabrata*). 
However, since results might change depending on the chosen outgroup, 
you can also use multiple outgroups in the comparisons data frame, and then
run interspecies DIAMOND searches as above. For instance, suppose you want
to use *speciesB*, *speciesC*, and *speciesD* as outgroups to *speciesA*.
In this case, your data frame of comparisons (to be passed to the `compare`
argument of `syntenet::run_diamond()`) would look like the following:

```{r}
# Example: multiple outgroups for the same species
spp_outgroup_many <- data.frame(
    species = rep("speciesA", 3),
    outgroup = c("speciesB", "speciesC", "speciesD")
)

comparisons_many <- syntenet::make_bidirectional(spp_outgroup_many)
comparisons_many
```

When multiple outgroups are present, `classify_gene_pairs()` will check if
gene pairs are classified as transposed (i.e., only one gene is an ancestral 
locus) in each of the outgroup species, and then calculate the percentage of 
outgroup species in which each pair is considered 'transposed'. For instance, 
you could have gene pair 1 as transposed based on 30\% of the outgroup species, 
gene pair 2 as transposed based on 100\% of the outgroup species, 
gene pair 3 based on 0\% of the outgroup species, and so on. By default, 
pairs are considered 'transposed' if they are classified as such 
in >70% of the outgroups, but you can choose a different minimum percentage 
cut-off using parameter `outgroup_coverage`.

## The *full* scheme (SSD &rarr; TD, PD, rTRD, dTRD, DD)

Finally, the full scheme consists in classifying transposon-derived
duplicates (TRD) further as originating from retrotransposons (rTRD) or
DNA transposons (dTRD). To do that, the function `classify_gene_pairs()`
uses the number of introns per gene to find TRD pairs for which
one gene has at least 1 intron, and the other gene has no introns; if that
is the case, the pair is classified as originating from the activity
of retrotransposons (rTRD, i.e., the transposed gene without introns is
a processed transcript that was retrotransposed back to the genome). All the 
other TRD pairs are classified as DNA transposon-derived duplicates (dTRD).


To classify duplicates using this scheme, you will first need to create a list
of data frames with the number of introns per gene for each species. This
can be done with the function `get_intron_counts()`, which takes a `TxDb` 
object as input. `TxDb` objects store transcript annotations, and they 
can be created with a family of functions
named `makeTxDbFrom*` from the `r BiocStyle::Biocpkg("txdbmaker")`
package (see `?get_intron_counts()` for a summary of all functions).


Here, we will create a list of `TxDb` objects from a list of `GRanges` objects
using the function `makeTxDbFromGRanges()` 
from `r BiocStyle::Biocpkg("txdbmaker")`. Importantly, to create
a `TxDb` from a `GRanges`, the `GRanges` object must contain genomic coordinates
for all features, including transcripts, exons, etc. Because of that, we
will use annotation from the example data set `yeast_annot`,
which was not processed with `syntenet::process_input()`.

```{r message=FALSE}
library(txdbmaker)

# Create a list of `TxDb` objects from a list of `GRanges` objects
txdb_list <- lapply(yeast_annot, txdbmaker::makeTxDbFromGRanges)
txdb_list
```

Once we have the `TxDb` objects, we can get intron counts per gene with
`get_intron_counts()`.

```{r}
# Get a list of data frames with intron counts per gene for each species
intron_counts <- lapply(txdb_list, get_intron_counts)

# Inspecting the list
names(intron_counts)
head(intron_counts$Scerevisiae)
```

Finally, we can use this list to classify duplicates using the full scheme
as follows:

```{r}
# Full scheme
c_full <- classify_gene_pairs(
    annotation = pdata$annotation,
    blast_list = diamond_intra,
    scheme = "full",
    blast_inter = diamond_inter,
    intron_counts = intron_counts
)

# Inspecting the output
names(c_full)
head(c_full$Scerevisiae)

# How many pairs are there for each duplication mode?
table(c_full$Scerevisiae$type)
```



# Classifying genes into unique modes of duplication

If you look carefully at the output of `classify_gene_pairs()`, you will notice
that some genes appear in more than one duplicate pair, and these pairs can
have different duplication modes assigned. There's nothing wrong with it.
Consider, for example, a gene that was originated by a segmental duplication
some 60 million years ago, and then it underwent a tandem duplication
5 million years ago. In the output of `classify_gene_pairs()`, you'd see
this gene in two pairs, one with **SD** in the `type` column, and one
with **TD**.

If you want to assign each gene to a unique mode of duplication, you can
use the function `classify_genes()`. This function assigns duplication modes
hierarchically using factor levels in column `type` as the priority order.
The priority orders for each classification scheme are:

1. **Binary:** SD > SSD.
2. **Standard:** SD > TD > PD > DD.
3. **Extended:** SD > TD > PD > TRD > DD. 
4. **Full:** SD > TD > PD > rTRD > dTRD > DD.

The input for `classify_genes()` is the list of gene pairs returned by
`classify_gene_pairs()`.

```{r classify_genes}
# Classify genes into unique modes of duplication
c_genes <- classify_genes(c_full)

# Inspecting the output
names(c_genes)
head(c_genes$Scerevisiae)

# Number of genes per mode
table(c_genes$Scerevisiae$type)
```

# Calculating substitution rates for duplicated gene pairs

You can use the function `pairs2kaks()` to calculate rates of nonsynonymous 
substitutions per nonsynonymous site ($K_a$), synonymouys substitutions per
synonymous site ($K_s$), and their ratios ($K_a/K_s$). These rates are calculated
using the Bioconductor package `r BiocStyle::Biocpkg("MSA2dist")`, which
implements all codon models in KaKs_Calculator 2.0 [@wang2010kaks_calculator].


For the purpose of demonstration, we will only calculate $K_a$, $K_s$, 
and $K_a/K_s$ for 5 TD-derived gene pairs. The CDS for TD-derived 
genes were obtained from Ensembl Fungi [@yates2022ensembl], and 
they are stored in `cds_scerevisiae`.

```{r kaks_calculation}
data(cds_scerevisiae)
head(cds_scerevisiae)

# Store DNAStringSet object in a list
cds_list <- list(Scerevisiae = cds_scerevisiae)

# Keep only top five TD-derived gene pairs for demonstration purposes
td_pairs <- c_full$Scerevisiae[c_full$Scerevisiae$type == "TD", ]
gene_pairs <- list(Scerevisiae = td_pairs[seq(1, 5, by = 1), ])

# Calculate Ka, Ks, and Ka/Ks
kaks <- pairs2kaks(gene_pairs, cds_list)

# Inspect the output
head(kaks)
```

Importantly, `pairs2kaks()` expects all genes in the gene pairs to be present
in the CDS, with matching names. Species abbreviations in gene pairs (added
by `r BiocStyle::Biocpkg("syntenet")`) are automatically removed, so you should
not add them to the sequence names of your CDS.

# Identifying and visualizing $K_s$ peaks

Peaks in $K_s$ distributions typically indicate whole-genome duplication (WGD) 
events, and they can be identified by fitting Gaussian mixture models (GMMs) to 
$K_s$ distributions. In `r BiocStyle::Githubpkg("doubletrouble")`, this can be 
performed with the function `find_ks_peaks()`.


However, because of saturation at higher $K_s$ values, only **recent WGD**
events can be reliably identified from $K_s$ 
distributions [@vanneste2013inference]. Recent WGD events are commonly found 
in plant species, such as maize, soybean, apple, etc.
Although the genomes of yeast species have signatures of WGD,
these events are ancient, so it is very hard to find evidence for them
using $K_s$ distributions. [^4] 

[^4]: **Tip:** You might be asking yourself: "How does one identify ancient
WGD, then?". A common approach is to look for syntenic blocks (i.e.,
regions with conserved gene content and order) within genomes. This is what
`classify_gene_pairs()` does under the hood to find SD-derived gene pairs.

To demonstrate how you can find peaks in $K_s$ distributions
with `find_ks_peaks()`, we will use a data frame containing $K_s$ values for
duplicate pairs in the soybean (*Glycine max*) genome, which has undergone 
2 WGDs events ~13 and ~58 million years ago [@schmutz2010genome]. 
Then, we will visualize $K_s$ distributions with peaks using the function
`plot_ks_peaks()`.

First of all, let's look at the data and have a quick look at the distribution
with the function `plot_ks_distro()` (more details on this function in the
data visualization section).

```{r ks_eda}
# Load data and inspect it
data(gmax_ks)
head(gmax_ks)

# Plot distribution
plot_ks_distro(gmax_ks)
```

By visual inspection, we can see 2 or 3 peaks. Based on our prior knowledge,
we know that 2 WGD events have occurred in the ancestral of the *Glycine* genus
and in the ancestral of all Fabaceae, which seem to correspond to the
peaks we see at $K_s$ values around 0.1 and 0.5, respectively. There could be
a third, flattened peak at around 1.6, which would represent the WGD shared
by all eudicots. Let's test which number of peaks has more support: 2 or 3.

```{r find_ks_peaks}
# Find 2 and 3 peaks and test which one has more support
peaks <- find_ks_peaks(gmax_ks$Ks, npeaks = c(2, 3), verbose = TRUE)
names(peaks)
str(peaks)

# Visualize Ks distribution
plot_ks_peaks(peaks)
```

As we can see, the presence of 3 peaks is more supported (lowest BIC). The
function returns a list with the mean, variance and amplitude 
of mixture components (i.e., peaks), as well as the $K_s$ distribution itself.

Now, suppose you just want to get the first 2 peaks. You can do that by
explictly saying to `find_ks_peaks()` how many peaks there are. 

```{r find_peaks_explicit}
# Find 2 peaks ignoring Ks values > 1
peaks <- find_ks_peaks(gmax_ks$Ks, npeaks = 2, max_ks = 1)
plot_ks_peaks(peaks)
```

**Important consideration on GMMs and $K_s$ distributions:**
Peaks identified with GMMs should not be blindly regarded as "the truth".
Using GMMs to find peaks in $K_s$ distributions can lead to problems such as
overfitting and overclustering [@tiley2018assessing]. Some general 
recommendations are:

1. Use your prior knowledge. If you know how many peaks there are (e.g.,
based on literature evidence), just tell the number to `find_ks_peaks()`.
Likewise, if you are not sure about how many peaks there are, but you know
the maximum number of peaks is N, don't test for the presence of >N peaks.
GMMs can incorrectly identify more peaks than the actual number.

2. Test the significance of each peak with SiZer (Significant ZERo crossings
of derivatives) maps [@chaudhuri1999sizer].
This can be done with the function `SiZer()` from 
the R package `r BiocStyle::CRANpkg("feature")`.

As an example of a SiZer map, let's use `feature::SiZer()` to assess
the significance of the 2 peaks we found previously.

```{r sizer}
# Get numeric vector of Ks values <= 1
ks <- gmax_ks$Ks[gmax_ks$Ks <= 1]

# Get SiZer map
feature::SiZer(ks)
```

The blue regions in the SiZer map indicate significantly increasing regions
of the curve, which support the 2 peaks we found. 

# Classifying genes by age groups

Finally, you can use the peaks you obtained before to classify gene pairs
by age group. Age groups are defined based on the $K_s$ peak to which pairs belong.
This is useful if you want to analyze duplicate pairs 
from a specific WGD event, for instance. You can do this with
the function `split_pairs_by_peak()`. This function returns a list containing
the classified pairs in a data frame, and a ggplot object with the 
age boundaries highlighted in the histogram of $K_s$ values.

```{r split_by_peak}
# Gene pairs without age-based classification
head(gmax_ks)

# Classify gene pairs by age group
pairs_age_group <- split_pairs_by_peak(gmax_ks[, c(1,2,3)], peaks)

# Inspecting the output
names(pairs_age_group)

# Take a look at the classified gene pairs
head(pairs_age_group$pairs)

# Visualize Ks distro with age boundaries
pairs_age_group$plot
```

Age groups can also be used to identify SD gene pairs that likely originated
from whole-genome duplications. The rationale here is that segmental duplicates
with $K_s$ values near $K_s$ peaks (indicating WGD events) were likely
created by such WGDs. In a similar logic, SD pairs with $K_s$ values that
are too distant from $K_s$ peaks (e.g., >2 standard deviations away from
the mean) were likely created by duplications of large genomic segments, but 
not duplications of the entire genome. 

As an example, to find gene pairs in the soybean genome that likely originated 
from the WGD event shared by all legumes (at ~58 million years ago), 
you'd need to extract SD pairs in age group 2 using the following code:

```{r}
# Get all pairs in age group 2
pairs_ag2 <- pairs_age_group$pairs[pairs_age_group$pairs$peak == 2, c(1,2)]

# Get all SD pairs
sd_pairs <- gmax_ks[gmax_ks$type == "SD", c(1,2)]

# Merge tables
pairs_wgd_legumes <- merge(pairs_ag2, sd_pairs)

head(pairs_wgd_legumes)
```

# Data visualization

Last but not least, `r BiocStyle::Biocpkg("doubletrouble")` provides users
with graphical functions to produce publication-ready plots from the output
of `classify_gene_pairs()`, `classify_genes()`, and `pairs2kaks()`.
Let's take a look at them one by one.

## Visualizing the frequency of duplicates per mode

To visualize the frequency of duplicated gene pairs or genes by duplication
type (as returned by `classify_gene_pairs()` and `classify_genes()`, 
respectively), you will first need to create a data frame of counts with
`duplicates2counts()`. To demonstrate how this works, we will use an
example data set with duplicate pairs for 3 fungi species (and substitution
rates, which will be ignored by `duplicates2counts()`).

```{r}
# Load data set with pre-computed duplicates for 3 fungi species
data(fungi_kaks)
names(fungi_kaks)
head(fungi_kaks$saccharomyces_cerevisiae)

# Get a data frame of counts per mode in all species
counts_table <- duplicates2counts(fungi_kaks |> classify_genes())

counts_table
```

Now, let's visualize the frequency of duplicate gene pairs by duplication
type with the function `plot_duplicate_freqs()`. You can visualize frequencies
in three different ways, as demonstrated below.

```{r}
# A) Facets
p1 <- plot_duplicate_freqs(counts_table)

# B) Stacked barplot, absolute frequencies
p2 <- plot_duplicate_freqs(counts_table, plot_type = "stack")

# C) Stacked barplot, relative frequencies
p3 <- plot_duplicate_freqs(counts_table, plot_type = "stack_percent")

# Combine plots, one per row
patchwork::wrap_plots(p1, p2, p3, nrow = 3) + 
    patchwork::plot_annotation(tag_levels = "A")
```

If you want to visually the frequency of duplicated **genes** (not gene pairs),
you'd first need to classify genes into unique modes of duplication
with `classify_genes()`, and then repeat the code above. For example:

```{r fig.height = 3, fig.width = 8}
# Frequency of duplicated genes by mode
classify_genes(fungi_kaks) |>   # classify genes into unique duplication types
    duplicates2counts() |>      # get a data frame of counts (long format)
    plot_duplicate_freqs()      # plot frequencies
```

## Visualizing $K_s$ distributions

As briefly demonstrated before, to plot a $K_s$ distribution for the
whole paranome, you will use the function `plot_ks_distro()`.

```{r fig.height=3, fig.width=9}
ks_df <- fungi_kaks$saccharomyces_cerevisiae

# A) Histogram, whole paranome
p1 <- plot_ks_distro(ks_df, plot_type = "histogram")

# B) Density, whole paranome
p2 <- plot_ks_distro(ks_df, plot_type = "density") 

# C) Histogram with density lines, whole paranome
p3 <- plot_ks_distro(ks_df, plot_type = "density_histogram")

# Combine plots side by side
patchwork::wrap_plots(p1, p2, p3, nrow = 1) +
    patchwork::plot_annotation(tag_levels = "A")
```

However, visualizing the distribution for the whole paranome can mask patterns
that only happen for duplicates originating from particular duplication types.
For instance, when looking for evidence of WGD events,
visualizing the $K_s$ distribution for SD-derived pairs only can reveal
whether syntenic genes cluster together, suggesting the presence of WGD history.
To visualize the distribution by duplication type, use `bytype = TRUE` in
`plot_ks_distro()`.

```{r fig.width = 8, fig.height=4}
# A) Duplicates by type, histogram
p1 <- plot_ks_distro(ks_df, bytype = TRUE, plot_type = "histogram")

# B) Duplicates by type, violin
p2 <- plot_ks_distro(ks_df, bytype = TRUE, plot_type = "violin")

# Combine plots side by side
patchwork::wrap_plots(p1, p2) +
    patchwork::plot_annotation(tag_levels = "A")
```

## Visualizing substitution rates by species

The function `plot_rates_by_species()` can be used to show distributions of
substitution rates ($K_s$, $K_a$, or their ratio $K_a/K_s$) by species.
You can choose which rate you want to visualize, and whether or not to
group gene pairs by duplication mode, as demonstrated below.

```{r fig.width = 6, fig.height = 4}
# A) Ks for each species
p1 <- plot_rates_by_species(fungi_kaks)

# B) Ka/Ks by duplication type for each species
p2 <- plot_rates_by_species(fungi_kaks, rate_column = "Ka_Ks", bytype = TRUE)

# Combine plots - one per row
patchwork::wrap_plots(p1, p2, nrow = 2) +
    patchwork::plot_annotation(tag_levels = "A")
```

# Session information {.unnumbered}

This document was created under the following conditions:

```{r session_info}
sessioninfo::session_info()
```

# References {.unnumbered}