---
title: "Predicting TAD/loop boundaries using pre-trained models provided by preciseTADhub"
author:
- name: Spiro Stilianoudakis
  affiliation:
  - &1 Department of Biostatistics, Virginia Commonwealth University, Richmond, VA
- name: Mikhail Dozmorov
  affiliation:
  - *1
date: '`r format(Sys.Date(), "%B %e, %Y")`'
package: preciseTADhub
output:
    BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{preciseTADhub}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
    chunk_output_type: console
---

```{r set-options, echo=FALSE, cache=FALSE}
knitr::opts_chunk$set(warnings = FALSE, message = FALSE)
```

# Introduction

`preciseTADhub` is an ExperimentData R package that supplements the `preciseTAD` software R package. `preciseTADhub` offers users access to pre-trained random forest classification models used to predict TAD/loop boundary regions. The model building process introduced by `preciseTAD` (https://doi.org/10.1101/2020.09.03.282186) can be computationally intensive. To avoid this burden, we have provided users with 84 (2 cell lines $\times$ 2 ground truth boundaries $\times$ 21 autosomal chromosomes) .RDS files containing pre-trained models that can be leveraged to predict TAD and/or chromatin loop boundaries at base-level resolution using functionality provided by [preciseTAD](https://bioconductor.org/packages/devel/bioc/vignettes/preciseTAD/inst/doc/preciseTAD.html). 

Each of the 84 files are stored as lists containing two objects: 1) a train object from \code{caret} with RF model information, and 2) a data.frame of variable importance for each genomic annotation included in the model. The file names are structured as follows: 

$i$_$j$_$k$_$l$.rds

where $i$ denotes the chromosome that was used as a holdout {CHR1, CHR2, ..., CHR21, CHR22} (i.e. for testing; meaning all other chromosomes were used for training), $j$ denotes the cell line {GM12878, K562}, $k$ denotes the resolution (size of genomic bins) {5kb, 10kb}, and $l$ denotes the TAD/loop caller used to define ground truth {Arrowhead, Peakachu}.

For example the file named "CHR1_GM12878_5kb_Arrowhead.rds" is a list whose first item is a RF model that was built on data for chromosomes 2-22 (omitting CHR9; see https://doi.org/10.1101/2020.09.03.282186), binned using 5 kb bins, ground truth TAD boundaries were identified using the Arrowhead TAD caller at 5 kb on GM12878. All models included the same number of predictors including CTCF, RAD21, SMC3, and ZNF143. The second item in the list is a data.frame with variable importances for CTCF, RAD21, SMC3, and ZNF143.

The pre-trained models set up users to apply them to predict their own boundaries on chromosomes that were heldout, per the framework in the preciseTAD paper (https://doi.org/10.1101/2020.09.03.282186).

# Getting Started

The following is an example of how to predict TAD boundaries at base-level resolution for CHR22 on GM12878, using a pre-trained model stored in `preciseTADhub`.

## Installation

```{r}
#if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
#BiocManager::install(c("ExperimentHub"), version = "3.12")

library(ExperimentHub)
library(preciseTAD)
library(preciseTADhub)
```

## Reading in a file stored on ExperimentHub

Table 1 shows the file names and the corresponding ExperimentHub (EH) IDs. Since we want to make TAD boundary predictions on CHR22 for GM12878, we opt to read in the "CHR22_GM12878_5kb_Arrowhead.rds" file. This corresponds to the EH3895 EHID.

| FileName                        | EHID          |
|---------------------------------|---------------|
| CHR1_GM12878_5kb_Arrowhead.rds  |     EH3815    |
| CHR1_GM12878_10kb_Peakachu.rds  |     EH3816    |
| CHR1_K562_5kb_Arrowhead.rds     |     EH3817    |
| CHR1_K562_10kb_Peakachu.rds     |     EH3818    |
| CHR2_GM12878_5kb_Arrowhead.rds  |     EH3819    |
| CHR2_GM12878_10kb_Peakachu.rds  | EH3820        |
| CHR2_K562_5kb_Arrowhead.rds     | EH3821        |
| CHR2_K562_10kb_Peakachu.rds     | EH3822        |
| CHR3_GM12878_5kb_Arrowhead.rds  | EH3823        |
| CHR3_GM12878_10kb_Peakachu.rds  | EH3824        |
| CHR3_K562_5kb_Arrowhead.rds     | EH3825        |
| CHR3_K562_10kb_Peakachu.rds     | EH3826        |
| CHR4_GM12878_5kb_Arrowhead.rds  | EH3827        |
| CHR4_GM12878_10kb_Peakachu.rds  | EH3828        |
| CHR4_K562_5kb_Arrowhead.rds     | EH3829        |
| CHR4_K562_10kb_Peakachu.rds     | EH3830        |
| CHR5_GM12878_5kb_Arrowhead.rds  | EH3831        |
| CHR5_GM12878_10kb_Peakachu.rds  | EH3832        |
| CHR5_K562_5kb_Arrowhead.rds     | EH3833        |
| CHR5_K562_10kb_Peakachu.rds     | EH3834        |
| CHR6_GM12878_5kb_Arrowhead.rds  | EH3835        |
| CHR6_GM12878_10kb_Peakachu.rds  | EH3836        |
| CHR6_K562_5kb_Arrowhead.rds     | EH3837        |
| CHR6_K562_10kb_Peakachu.rds     | EH3838        |
| CHR7_GM12878_5kb_Arrowhead.rds  | EH3839        |
| CHR7_GM12878_10kb_Peakachu.rds  | EH3840        |
| CHR7_K562_5kb_Arrowhead.rds     | EH3841        |
| CHR7_K562_10kb_Peakachu.rds     | EH3842        |
| CHR8_GM12878_5kb_Arrowhead.rds  | EH3843        |
| CHR8_GM12878_10kb_Peakachu.rds  | EH3844        |
| CHR8_K562_5kb_Arrowhead.rds     | EH3845        |
| CHR8_K562_10kb_Peakachu.rds     | EH3846        |
| CHR10_GM12878_5kb_Arrowhead.rds | EH3847        |
| CHR10_GM12878_10kb_Peakachu.rds | EH3848        |
| CHR10_K562_5kb_Arrowhead.rds    | EH3849        |
| CHR10_K562_10kb_Peakachu.rds    | EH3850        |
| CHR11_GM12878_5kb_Arrowhead.rds | EH3851        |
| CHR11_GM12878_10kb_Peakachu.rds | EH3852        |
| CHR11_K562_5kb_Arrowhead.rds    | EH3853        |
| CHR11_K562_10kb_Peakachu.rds    | EH3854        |
| CHR12_GM12878_5kb_Arrowhead.rds | EH3855        |
| CHR12_GM12878_10kb_Peakachu.rds | EH3856        |
| CHR12_K562_5kb_Arrowhead.rds    | EH3857        |
| CHR12_K562_10kb_Peakachu.rds    | EH3858        |
| CHR13_GM12878_5kb_Arrowhead.rds | EH3859        |
| CHR13_GM12878_10kb_Peakachu.rds | EH3860        |
| CHR13_K562_5kb_Arrowhead.rds    | EH3861        |
| CHR13_K562_10kb_Peakachu.rds    | EH3862        |
| CHR14_GM12878_5kb_Arrowhead.rds | EH3863        |
| CHR14_GM12878_10kb_Peakachu.rds | EH3864        |
| CHR14_K562_5kb_Arrowhead.rds    | EH3865        |
| CHR14_K562_10kb_Peakachu.rds    | EH3866        |
| CHR15_GM12878_5kb_Arrowhead.rds | EH3867        |
| CHR15_GM12878_10kb_Peakachu.rds | EH3868        |
| CHR15_K562_5kb_Arrowhead.rds    | EH3869        |
| CHR15_K562_10kb_Peakachu.rds    | EH3870        |
| CHR16_GM12878_5kb_Arrowhead.rds | EH3871        |
| CHR16_GM12878_10kb_Peakachu.rds | EH3872        |
| CHR16_K562_5kb_Arrowhead.rds    | EH3873        |
| CHR16_K562_10kb_Peakachu.rds    | EH3874        |
| CHR17_GM12878_5kb_Arrowhead.rds | EH3875        |
| CHR17_GM12878_10kb_Peakachu.rds | EH3876        |
| CHR17_K562_5kb_Arrowhead.rds    | EH3877        |
| CHR17_K562_10kb_Peakachu.rds    | EH3878        |
| CHR18_GM12878_5kb_Arrowhead.rds | EH3879        |
| CHR18_GM12878_10kb_Peakachu.rds | EH3880        |
| CHR18_K562_5kb_Arrowhead.rds    | EH3881        |
| CHR18_K562_10kb_Peakachu.rds    | EH3882        |
| CHR19_GM12878_5kb_Arrowhead.rds | EH3883        |
| CHR19_GM12878_10kb_Peakachu.rds | EH3884        |
| CHR19_K562_5kb_Arrowhead.rds    | EH3885        |
| CHR19_K562_10kb_Peakachu.rds    | EH3886        |
| CHR20_GM12878_5kb_Arrowhead.rds | EH3887        |
| CHR20_GM12878_10kb_Peakachu.rds | EH3888        |
| CHR20_K562_5kb_Arrowhead.rds    | EH3889        |
| CHR20_K562_10kb_Peakachu.rds    | EH3890        |
| CHR21_GM12878_5kb_Arrowhead.rds | EH3891        |
| CHR21_GM12878_10kb_Peakachu.rds | EH3892        |
| CHR21_K562_5kb_Arrowhead.rds    | EH3893        |
| CHR21_K562_10kb_Peakachu.rds    | EH3894        |
| CHR22_GM12878_5kb_Arrowhead.rds | EH3895        |
| CHR22_GM12878_10kb_Peakachu.rds | EH3896        |
| CHR22_K562_5kb_Arrowhead.rds    | EH3897        |
| CHR22_K562_10kb_Peakachu.rds    | EH3898        |
Table: File names and corresponding ExperimentHub (EH) IDs for all 84 .RDS files stored in `preciseTADhub`.

Suppose we want to read in the model that was built using CHR1-CHR21, on GM12878, using Arrowhead defined TAD boundaries at 5kb resolution. We can do this with the following wrapper function. Note: you must initialize `ExperimentHub` first.

```{r}
#Initialize ExperimentHub
hub <- ExperimentHub()
query(hub, "preciseTADhub")
myfiles <- query(hub, "preciseTADhub")

CHR22_GM12878_5kb_Arrowhead <- readEH(chr = "CHR22", cl = "GM12878", gt = "Arrowhead", source = myfiles)
```

# Use the pre-trained model to predict TAD boundaries on the holdout chromosome (CHR22) for GM12878 between coordinates 18000000-19000000 at base-level resolution

```{r}
data("tfbsList")

# Restrict the data matrix to include only SMC3, RAD21, CTCF, and ZNF143
tfbsList_filt <- tfbsList[names(tfbsList) %in% c("Gm12878-Ctcf-Broad", 
                                            "Gm12878-Rad21-Haib",
                                            "Gm12878-Smc3-Sydh",
                                            "Gm12878-Znf143-Sydh")]
names(tfbsList_filt) <- c("Ctcf", "Rad21", "Smc3", "Znf143")


# Run preciseTAD
set.seed(123)
pt <- preciseTAD(genomicElements.GR = tfbsList_filt,
                featureType         = "distance",
                CHR                 = "CHR22",
                chromCoords         = list(18000000, 19000000),
                tadModel            = CHR22_GM12878_5kb_Arrowhead,
                threshold           = 1.0,
                verbose             = FALSE,
                parallel            = NULL,
                DBSCAN_params       = list(30000, 3))
                # flank               = 5000)
                # genome              = "hg19")

pt
```