---
title: "TEKRABber"
author:
- name: Yao-Chung Chen
  affiliation: Institute of Bioinformatics, Freie Universität Berlin, Berlin, Germany
  email: yao-chung.chen@fu-berlin.de
- name: Katja Nowick
  affiliation: Human Biology, Institute of Biology, Freie Universität Berlin, Berlin, Germany
  email: katja.nowick@fu-berlin.de
package: TEKRABber
output:
    BiocStyle::html_document: default
vignette: >
    %\VignetteIndexEntry{TEKRABber} 
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown} 
editor_options: 
  markdown: 
    wrap: 72
---

# Introduction

`r BiocStyle::Biocpkg("TEKRABber")` is used to estimate the correlations
between genes and transposable elements (TEs) from RNA-seq data
comparing between: **(1)** Two Species **(2)** Control vs. Experiment.
In the following sections, we will use built-in data to demonstrate how
to implement `r BiocStyle::Biocpkg("TEKRABber")` on you own analysis.

# Installation

To use `r BiocStyle::Biocpkg("TEKRABber")` from your R environment, you
need to install it using `r BiocStyle::Biocpkg("BiocManager")`:

```{r installation, eval=FALSE}
install.packages("BiocManager")
BiocManager::install("TEKRABber")
```

```{r load package, message=FALSE}
library(TEKRABber)
```

# Examples

## Comparing between two species, human and chimpanzee as an example

Gene and TE expression data are generated from randomly picked brain
regions FASTQ files from 10 humans and 10 chimpanzees (Khrameeva E et
al., Genome Research, 2020). The values for the first column of gene and
TE count table must be **Ensembl gene ID** and **TE name**:

```{r load built-in data (two species)}
# load built-in data
data(speciesCounts)
hmGene <- speciesCounts$hmGene
hmTE <- speciesCounts$hmTE
chimpGene <- speciesCounts$chimpGene
chimpTE <- speciesCounts$chimpTE
# the first column must be Ensembl gene ID for gene, and TE name for TE
head(hmGene)
```

### Query ortholog information and estimate scaling factor

In the first step, we use `orthologScale()` to get orthology information
and calculate the scaling factor between two species for normalizing
orthologous genes. The species name needs to be the abbreviation of
scientific species name used in Ensembl. (Note: (1)This step queries
information using `r BiocStyle::Biocpkg("biomaRt")` and it might need
some time or try different mirrors due to the connections to Ensembl
(2)It might take some time to calculate scaling factor based on your
data size). For normalizing TEs, you need to provide a RepeatMasker
track annotation table including four columns, (1) the name of TE (2)
the class of TE (3) the average gene length of TE from your reference
species (4) the average gene length from the species you want to
compare. A way to download RepeatMasker annotations is to query from
[UCSC Genome Table
Browser](http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1582443383_JGQc9hw61oGahqe20hrSWdfkjDWx&clade=mammal&org=Human&db=hg38&hgta_group=rep&hgta_track=hg38&hgta_table=0&hgta_regionType=genome&position=chr2%3A25%2C160%2C915-25%2C168%2C903&hgta_outputType=primaryTable&hgta_outFileName=hg38rmsk.csv)
and select the RepeatMasker track. In new version v1.8.0 and above,
TEKRABber provides `prepareRMSK()` to obtain RepeatMasker track from
UCSC and merge the table for you. However, there still remain a chance
that the species you are interested in cannot be obtain from this
method. You can use `GenomeInfoDb::registered_UCSC_genomes()` for
checking the track exists for your species.

```{r search species name, eval=FALSE}
# You can use the code below to search for species name
ensembl <- biomaRt::useEnsembl(biomart = "genes")
biomaRt::listDatasets(ensembl)
```

```{r orthology and normalizeation, message=FALSE}
# In order to save time, we provide the data for this tutorial.
# you can also uncomment the code below and run it for yourself.
data(fetchDataHmChimp)
fetchData <- fetchDataHmChimp

# Query the data and calculate scaling factor using orthologScale():
#' data(speciesCounts)
#' data(hg38_panTro6_rmsk) 
#' hmGene <- speciesCounts$hmGene
#' chimpGene <- speciesCounts$chimpGene
#' hmTE <- speciesCounts$hmTE
#' chimpTE <- speciesCounts$chimpTE
#' 
#' ## For demonstration, here we only select 1000 rows to save time
#' set.seed(1234)
#' hmGeneSample <- hmGene[sample(nrow(hmGene), 1000), ]
#' chimpGeneSample <- chimpGene[sample(nrow(chimpGene), 1000), ]
#' 
#' ## hg38_panTro6_rmsk = prepareRMSK("hg38", "panTro6")
#' fetchData <- orthologScale(
#'     speciesRef = "hsapiens",
#'     speciesCompare = "ptroglodytes",
#'     geneCountRef = hmGeneSample,
#'     geneCountCompare = chimpGeneSample,
#'     teCountRef = hmTE,
#'     teCountCompare = chimpTE,
#'     rmsk = hg38_panTro6_rmsk
#' )
```

### Create inputs for differentially expressed analysis and correlation estimation

We use `DECorrInputs()` to return input files for downstream analysis.

```{r create input files, warning=FALSE}
inputBundle <- DECorrInputs(fetchData)
```

### Differentially expressed analysis (DE analysis)

In this step, we need to generate a metadata contain species name (i.e.,
human and chimpanzee). The row names need to be same as the DE input
table and the column name must be **species** (see the example below).
Then we use `DEgeneTE()` to perform DE analysis. When you are comparing
samples between two species, the parameter **expDesign** should be
**TRUE** (as default).

```{r DE analysis (two species), message=FALSE, results='hide', warning=FALSE}
meta <- data.frame(
    species = c(rep("human", ncol(hmGene) - 1), 
    rep("chimpanzee", ncol(chimpGene) - 1))
)

meta$species <- factor(meta$species, levels = c("human", "chimpanzee"))
rownames(meta) <- colnames(inputBundle$geneInputDESeq2)
hmchimpDE <- DEgeneTE(
    geneTable = inputBundle$geneInputDESeq2,
    teTable = inputBundle$teInputDESeq2,
    metadata = meta,
    expDesign = TRUE
)
```

### Correlation analysis

Here we use `corrOrthologTE()` to perform correlation estimation
comparing each ortholog and TE. This is the most time-consuming step if
you have large data. For a quick demonstration, we use a relatively
small data. You can specify the correlation method and adjusted
*p-value* method. The default methods are Pearson's correlation and FDR.
**Note:** For more efficient and specific analysis, you can subset your
data in this step to focus on only the orthologs and TEs that you are
interested in.

```{r correlation (two species), warning=FALSE, eval=FALSE}
# we select the 200 rows of genes for demo
hmCorrResult <- corrOrthologTE(
    geneInput = hmchimpDE$geneCorrInputRef[c(1:200),],
    teInput = hmchimpDE$teCorrInputRef,
    numCore = 1,
    corrMethod = "pearson",
    padjMethod = "fdr"
)

chimpCorrResult <- corrOrthologTE(
    geneInput = hmchimpDE$geneCorrInputCompare[c(1:200), ],
    teInput = hmchimpDE$teCorrInputCompare,
    numCore = 1,
    corrMethod = "pearson",
    padjMethod = "fdr"
)
```

### Explore your result using `appTEKRABber()`:

`r BiocStyle::Biocpkg("TEKRABber")` provides an app function called
`appTEKRABber()` for you to quickly view your result and select data
that you are interested in. You will need to install
[gridlayout](https://github.com/rstudio/gridlayout) to run
`appTEKRABber()` function. **Note:** you might need to installed
additional packages to run this function.

```{r app visualize (two species), warning=FALSE, eval=FALSE}
remotes::install_github('rstudio/gridlayout')

library(plotly)
library(bslib)
library(shiny)
library(gridlayout)

appTEKRABber(
    corrRef = hmCorrResult,
    corrCompare = chimpCorrResult,
    DEobject = hmchimpDE
)
```

![](app.jpg)

The first time you opeining the app, you will see the distribution of
Gene and TE alongside pvalue axis and coefficient axis in your reference
group and comparision group. You can next select the Gene Name and
Transposable Elements which will plot a scatterplot indicating their
correlations, and also a expression plot showing the differentially
expression analysis. This help you to have a first glance at the pair of
Gene:TE which you are interested in.

## Comparing control and treatment samples within the same species

If you want to compare selected genes and TEs **(1)** from different
tissue in same species or **(2)** control and drug treatment in same
tissue in same species, please generate all the input files following
the input format. Here we show an example data of prepared input files
including expression counts from 10 control and 10 treatment samples.
The format of input data: row names should be gene name or id, and
column name is your sample id (please see details below).

```{r load built-in data (same species)}
# load built-in data
data(ctInputDE)
geneInputDE <- ctInputDE$gene
teInputDE <- ctInputDE$te

# you need to follow the input format as below
head(geneInputDE)
```

### DE analysis

For DE analysis in the same species, you also use `DEgeneTE()` function,
however, you need to set the parameter **expDesign** to **FALSE**. You
also need to provide a metadata which this time the column name must be
**experiment**. See demonstration below:

```{r DE analysis (same species), warning=FALSE, results='hide', message=FALSE}
metaExp <- data.frame(experiment = c(rep("control", 5), rep("treatment", 5)))
rownames(metaExp) <- colnames(geneInputDE)
metaExp$experiment <- factor(
    metaExp$experiment, 
    levels = c("control", "treatment")
)

resultDE <- DEgeneTE(
    geneTable = geneInputDE,
    teTable = teInputDE,
    metadata = metaExp,
    expDesign = FALSE
)
```

### Correlation analysis

Here we demonstrate using the first 200 rows of genes and all the TEs to
calculate their correlations.

```{r correlation (same species), warning=FALSE, eval=FALSE}
controlCorr <- corrOrthologTE(
    geneInput = resultDE$geneCorrInputRef[c(1:200),],
    teInput = resultDE$teCorrInputRef,
    numCore = 1,
    corrMethod = "pearson",
    padjMethod = "fdr"
)

treatmentCorr <- corrOrthologTE(
    geneInput = resultDE$geneCorrInputCompare[c(1:200),],
    teInput = resultDE$teCorrInputCompare,
    numCore = 1,
    corrMethod = "pearson",
    padjMethod = "fdr"
)

head(treatmentCorr)
```

### Explore your result using `appTEKRABber()`:

```{r app visualize (same species), warning=FALSE, eval=FALSE}
remotes::install_github('rstudio/gridlayout')
appTEKRABber(
    corrRef = controlCorr,
    corrCompare = treatmentCorr,
    DEobject = resultDE
)
```

```{r}
sessionInfo()
```