---
title: "YY1 ChIA-PET motif analysis (step-by-step)"
author: "Jennifer Hammelman, Konstantin Krismer"
date: "2021-07-12"
output: rmarkdown::html_vignette
bibliography: bibliography.bib
vignette: >
  %\VignetteIndexEntry{YY1 ChIA-PET motif analysis (step-by-step)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```r
set.seed(17)
library(spatzie)
```

This vignette describes how to use spatzie to identify pairs of transcription factors whose sequence motifs (that describe their binding sites) are co-enriched in enhancers and promoters that interact with each other. ChIA-PET [@pmid19247990], HiChIP [@pmid27643841] or Hi-C[@pmid19815776] are molecular biology assays commonly used to investigate long-range genomic interactions and the data they generate, once properly processed (BEDPE format), serves as input to spatzie co-enrichment analyses. In this case we used interactions data in BEDPE format based on a ChIA-PET assay targeting the transcription factor YY1.

*Interactions data* in BEDPE format is a tab-separated file, where each line describes one interaction between two *anchors*, i.e., two regions of the genome that are potentially far away from each other.

In order to find motif pairs co-enriched in enhancers and promoters that interact with each other, we first need to annotate all interaction anchors and discard interactions that are not between enhancers and promoters.


# Genome assembly specific configuration

The interactions data was aligned to the mouse genome assembly *mm9* (i.e., the coordinates in the BEDPE file are *mm9* genome coordinates). The annotation package `BSgenome.Mmusculus.UCSC.mm9` contains the genomic sequence, which we need in order to detect transcription factor motifs in the interaction anchor regions. `TxDb.Mmusculus.UCSC.mm9.knownGene` contains promoter annotations.


```r
genome_id <- "BSgenome.Mmusculus.UCSC.mm9"
if (!(genome_id %in% rownames(utils::installed.packages()))) {
  BiocManager::install(genome_id, update = FALSE, ask = FALSE)
}
genome <- BSgenome::getBSgenome(genome_id)


txdb_id <- "TxDb.Mmusculus.UCSC.mm9.knownGene"
if (!(txdb_id %in% rownames(utils::installed.packages()))) {
  BiocManager::install(txdb_id, update = FALSE, ask = FALSE)
}
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene::TxDb.Mmusculus.UCSC.mm9.knownGene

ensembl_data_set <- "mmusculus_gene_ensembl"
gene_symbol <- "mgi_symbol"
```

# Load data

For this vignette we load toy BEDPE example data from a ChIA-PET experiment in murine embryonic stem cells.


```r
yy1_interactions_file <- system.file("extdata/yy1_interactions.bedpe.gz",
                                     package = "spatzie")
yy1_interactions <- GenomicInteractions::makeGenomicInteractionsFromFile(
  yy1_interactions_file,
  type = "bedpe",
  experiment_name = "yy1",
  description = "mESC yy1 chr1")
length(yy1_interactions)
```

```
## [1] 40000
```

# Load mouse promoter information

Unless the interactions data was preprocessed in such a way to include only enhancers in one anchor and only promoters in the other (or any other meaningful grouping), it is imperative to filter out non-enhancer-promoter interactions. To do this, we annotate interaction anchors as either close to a promoter or promoter-distal (i.e., putative enhancer).

Here, promoter regions are loaded, including the surrounding 2500 base pairs.


```r
promoter_ranges <- GenomicFeatures::promoters(txdb,
                                              upstream = 2500,
                                              downstream = 2500,
                                              columns = c("tx_name", "gene_id"))
# trims out-of-bound ranges located on non-circular sequences
promoter_ranges <- GenomicRanges::trim(promoter_ranges)
# remove duplicate promoters from transcript isoforms
promoter_ranges <- BiocGenerics::unique(promoter_ranges)

promoters_df <- as.data.frame(promoter_ranges)
promoters_df$gene_id <- as.character(promoters_df$gene_id)
```

# Annotate interactions with promoters

We use `annotateInteractions` from the `GenomicInteractions` package to annotate the interactions with the promoters obtained in the previous step, thus splitting the interactions into three groups: (1) interactions between promoter regions and regions far away from promoters (i.e., putative enhancers), (2) interactions between two promoter regions, and (3) interactions between two promoter-distal regions.


```r
annotation_features <- list(promoter = promoter_ranges)
GenomicInteractions::annotateInteractions(yy1_interactions, annotation_features)
GenomicInteractions::plotInteractionAnnotations(yy1_interactions)
```

![](figure/is_annotate_interactions-1.png)

# Limit and sort interactions to enhancer:promoter

As we have seen above, the interactions data contains not only interactions between enhancers and promoters. Here all interactions that are not between enhancers and promoters are discarded and enhancer-promoter interactions are sorted in a way that the promoters are always in anchor 1 and the enhancers are always in anchor 2.


```r
distal_promoter_idx <- GenomicInteractions::isInteractionType(
  yy1_interactions, "distal", "promoter")
yy1_pd <- yy1_interactions[distal_promoter_idx]
anchor1 <- GenomicInteractions::anchorOne(yy1_pd)
anchor2 <- GenomicInteractions::anchorTwo(yy1_pd)

promoter_left <- S4Vectors::elementMetadata(anchor1)[, "node.class"] == "promoter"
promoter_right <- S4Vectors::elementMetadata(anchor2)[, "node.class"] == "promoter"
promoter_ranges <- c(anchor1[promoter_left],
                     anchor2[promoter_right])
enhancer_ranges <- c(anchor2[promoter_left],
                     anchor1[promoter_right])
yy1_pd <- GenomicInteractions::GenomicInteractions(promoter_ranges,
                                                   enhancer_ranges)
```

# Scan motifs

For the purpose of this vignette, we use a toy motif database. The HOCOMOCO motif database [@Kulakovskiy2018] is commonly used, but any motif file compatible with `TFBSTools::readJASPARMatrix()` can be used.

`spatzie::scan_motifs()` takes the filtered interactions data and scans the anchors of each interaction with the motifs of the provided motif database, usually a set of transcription factor sequence motifs.

`spatzie::filter_motifs()` operates on the object returned by `spatzie::scan_motifs()` and removes motifs that are present in - in this case - fewer than 40% of the interactions.


```r
motifs_file <- system.file("extdata/motifs_subset.txt.gz",
                           package = "spatzie")
motifs <- TFBSTools::readJASPARMatrix(motifs_file, matrixClass = "PFM")
yy1_pd_interaction <- scan_motifs(yy1_pd, motifs, genome)
yy1_pd_interaction <- filter_motifs(yy1_pd_interaction, 0.4)
```

# Compute motif co-significance

`spatzie::anchor_pair_enrich()` operates on the object returned by `spatzie::filter_motifs()` and calculates the co-enrichment of all motif pairs. It supports three methods to determine co-enrichment, count-based correlation, score-based correlation, and match-based assocation. For details, see the help page (`?spatzie::anchor_pair_enrich`) or the spatzie paper (`citation("spatzie")`).

## Correlation between motif counts


```r
yy1_pd_count_corr <- anchor_pair_enrich(yy1_pd_interaction,
                                        method = "count")
pheatmap::pheatmap(-log2(yy1_pd_count_corr$pair_motif_enrich), fontsize = 6)
```

<img src="figure/is_compute_cosignificance_count-1.png" title="plot of chunk is_compute_cosignificance_count_save" alt="plot of chunk is_compute_cosignificance_count_save" width="95%" style="display: block; margin: auto;" />

## Correlation between max motif scores


```r
yy1_pd_score_corr <- anchor_pair_enrich(yy1_pd_interaction,
                                        method = "score")
pheatmap::pheatmap(-log2(yy1_pd_score_corr$pair_motif_enrich), fontsize = 6)
```

<img src="figure/is_compute_cosignificance_score-1.png" title="plot of chunk is_compute_cosignificance_score_save" alt="plot of chunk is_compute_cosignificance_score_save" width="95%" style="display: block; margin: auto;" />

YY1 binds enhancer and promoter sites providing scaffolding that forms enhancer-promoter interactions in mouse stem cells [@Weintraub2017]. As expected, spatzie identified a statistically significant co-occurrence of YY1 motifs indicating this dependency.

When interpreting spatzie results, keep in mind that motif databases such as HOCOMOCO often include groups of transcription factors with highly similar DNA-binding motifs (in this example YY1 and ZF.5), and the putative co-enrichment of one pair of transcription factor binding sites might be explained by another pair with highly similar motifs.

Please note that the motifs and the interactions data used in this vignette are dummy data used for demonstration purposes only.

# Additional information

Most of the functionality of the spatzie package is also offered through
the website at https://spatzie.mit.edu.

For a more detailed discussion on spatzie, please have a look at
the paper:

**spatzie: An R package for identifying significant transcription factor motif co-enrichment from enhancer-promoter interactions**  
Jennifer Hammelman, Konstantin Krismer, and David K. Gifford  
Nucleic Acids Research, 2022, gkac036; DOI: https://doi.org/10.1093/nar/gkac036  

# Session info


```r
sessionInfo()
```

```
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] BSgenome.Mmusculus.UCSC.mm9_1.4.0 BSgenome_1.60.0                   rtracklayer_1.52.0               
##  [4] Biostrings_2.60.1                 XVector_0.32.0                    GenomicRanges_1.44.0             
##  [7] GenomeInfoDb_1.28.1               IRanges_2.26.0                    S4Vectors_0.30.0                 
## [10] BiocGenerics_0.38.0               spatzie_0.99.6                    usethis_2.0.1                    
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-2                        rjson_0.2.20                            ellipsis_0.3.2                         
##   [4] biovizBase_1.40.0                       htmlTable_2.2.1                         base64enc_0.1-3                        
##   [7] fs_1.5.0                                dichromat_2.0-0                         rstudioapi_0.13                        
##  [10] farver_2.1.0                            bit64_4.0.5                             AnnotationDbi_1.54.1                   
##  [13] fansi_0.5.0                             xml2_1.3.2                              splines_4.1.0                          
##  [16] R.methodsS3_1.8.1                       cachem_1.0.5                            knitr_1.33                             
##  [19] Formula_1.2-4                           Rsamtools_2.8.0                         seqLogo_1.58.0                         
##  [22] annotate_1.70.0                         cluster_2.1.2                           GO.db_3.13.0                           
##  [25] dbplyr_2.1.1                            png_0.1-7                               R.oo_1.24.0                            
##  [28] pheatmap_1.0.12                         readr_1.4.0                             compiler_4.1.0                         
##  [31] httr_1.4.2                              backports_1.2.1                         lazyeval_0.2.2                         
##  [34] assertthat_0.2.1                        Matrix_1.3-4                            fastmap_1.1.0                          
##  [37] htmltools_0.5.1.1                       prettyunits_1.1.1                       tools_4.1.0                            
##  [40] igraph_1.2.6                            gtable_0.3.0                            glue_1.4.2                             
##  [43] TFMPvalue_0.0.8                         GenomeInfoDbData_1.2.6                  reshape2_1.4.4                         
##  [46] dplyr_1.0.7                             rappdirs_0.3.3                          Rcpp_1.0.7                             
##  [49] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2 Biobase_2.52.0                          vctrs_0.3.8                            
##  [52] xfun_0.24                               CNEr_1.28.0                             stringr_1.4.0                          
##  [55] lifecycle_1.0.0                         ensembldb_2.16.2                        restfulr_0.0.13                        
##  [58] poweRlaw_0.70.6                         gtools_3.9.2                            InteractionSet_1.20.0                  
##  [61] XML_3.99-0.6                            zlibbioc_1.38.0                         scales_1.1.1                           
##  [64] VariantAnnotation_1.38.0                ProtGenerics_1.24.0                     GenomicInteractions_1.26.0             
##  [67] hms_1.1.0                               MatrixGenerics_1.4.0                    SummarizedExperiment_1.22.0            
##  [70] AnnotationFilter_1.16.0                 RColorBrewer_1.1-2                      yaml_2.2.1                             
##  [73] curl_4.3.2                              memoise_2.0.0                           gridExtra_2.3                          
##  [76] ggplot2_3.3.5                           biomaRt_2.48.2                          rpart_4.1-15                           
##  [79] latticeExtra_0.6-29                     stringi_1.6.2                           RSQLite_2.2.7                          
##  [82] highr_0.9                               BiocIO_1.2.0                            checkmate_2.0.0                        
##  [85] GenomicFeatures_1.44.0                  caTools_1.18.2                          filelock_1.0.2                         
##  [88] BiocParallel_1.26.1                     rlang_0.4.11                            pkgconfig_2.0.3                        
##  [91] matrixStats_0.59.0                      bitops_1.0-7                            evaluate_0.14                          
##  [94] pracma_2.3.3                            lattice_0.20-44                         purrr_0.3.4                            
##  [97] labeling_0.4.2                          htmlwidgets_1.5.3                       GenomicAlignments_1.28.0               
## [100] bit_4.0.4                               tidyselect_1.1.1                        plyr_1.8.6                             
## [103] magrittr_2.0.1                          R6_2.5.0                                generics_0.1.0                         
## [106] Hmisc_4.5-0                             DelayedArray_0.18.0                     DBI_1.1.1                              
## [109] pillar_1.6.1                            foreign_0.8-81                          survival_3.2-11                        
## [112] KEGGREST_1.32.0                         RCurl_1.98-1.3                          nnet_7.3-16                            
## [115] tibble_3.1.2                            crayon_1.4.1                            utf8_1.2.1                             
## [118] BiocFileCache_2.0.0                     rmarkdown_2.9                           jpeg_0.1-8.1                           
## [121] progress_1.2.2                          TFBSTools_1.30.0                        grid_4.1.0                             
## [124] data.table_1.14.0                       blob_1.2.1                              digest_0.6.27                          
## [127] xtable_1.8-4                            R.utils_2.10.1                          munsell_0.5.0                          
## [130] DirichletMultinomial_1.34.0             motifmatchr_1.14.0                      Gviz_1.36.2
```

# References