---
title: "Introduction to HiCExperiment"
author: "Jacques Serizay"
date: "`r Sys.Date()`"
output: 
    BiocStyle::html_document:
        toc: true
        toc_depth: 2
vignette: >
    %\VignetteIndexEntry{Introduction to HiCExperiment}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r opts, eval = TRUE, echo=FALSE, results="hide", warning=FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>", 
    crop = NULL
)
suppressPackageStartupMessages({
    library(dplyr)
    library(GenomicRanges)
    library(HiContactsData)
    library(HiCExperiment)
})
```

# Introduction 

Hi-C experimental approach allows one to query contact frequency 
for all possible pairs of genomic loci simultaneously, in a genome-wide manner. 
The output of this next-generation sequencing-supported technique is a file 
describing every pair (a.k.a contact, or interaction) between two genomic loci. 
This so-called "pairs" file can be binned and transformed into a numerical 
matrix. In such matrix, each cell contains the raw or normalized 
interaction **frequency** between a pair of genomic loci (which location 
can be retrieved using the corresponding column and row indices). 

[HiC-Pro](https://github.com/nservant/HiC-Pro), 
[distiller](https://github.com/open2c/distiller-nf) and 
[Juicer](https://github.com/aidenlab/juicer/) 
are the three main pipelines used to align, filter and process 
paired-end fastq reads into pairs files and contact matrices. Each pipeline 
defined their own file formats to store these two types of 
files. 

- Pairs files are (gzipped) human-readable, text files that are 
a variant of the BEDPE format; however the column order varies depending on the 
pipeline being used. 

- Contact matrix file formats greatly vary depending on the pipeline: 

  - `HiC-Pro` generates two human-readable files: 
  a `regions` file describing each genomic interval, and a `matrix` file quantifying 
  interaction frequency between pairs of loci from the `regions` file, using a 
  standard triplet sparse matrix format. 
  - `Juicer` generates a `.hic` file, a highly compressed binary file storing
  sparse contact matrices from multiple resolutions into a single file. 
  - `distiller` uses the `.(m)cool` format, a sparse, compressed, binary 
  genomic matrix data model built on HDF5.

Each file format can contain roughly the same information, albeit with a largely 
improved compression for `.hic` and `.(m)cool` files, which can also contain 
multi-resolution matrices compared to the HiC-Pro derived files. The 
[4DN consortium](https://data.4dnucleome.org/help/about/about-dcic), 
deciphering the role nuclear organization plays in gene 
expression and cellular function, officially supports both the `.hic` and 
`.(m)cool` formats. Furthermore, the `.(m)cool` format has recently gained 
a lot of traction with the release of a series of `python` packages 
(`cooler`, `cooltools`, `pairtools`, `coolpuppy`) by the [Open2C organization](https://open2c.github.io/)
facilitating the investigation of Hi-C data stored in `.(m)cool` files in a 
`python` environment.

The R `HiCExperiment` package aims at unlocking HiC investigation within the 
rich, genomic-oriented Bioconductor environment. It provides a set of classes
and import functions to parse HiC files (both contact matrices and pairs) in R, 
allowing random access and efficient genome-based subsetting of contact matrices. 
It leverages pre-existing base Bioconductor classes, notably `GInteractions` and `ContactMatrix`
classes ([Lun, Perry & Ing-Simmons, F1000 Research 2016](https://f1000research.com/articles/5-950/v2)). 

# Installation

`HiCExperiment` package can be installed from Bioconductor using the following
command: 

```{r eval = FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("HiCExperiment")
```

All R dependencies will be installed automatically.

# The `HiCExperiment` class

```{r load_lib}
library(HiCExperiment)
showClass("HiCExperiment")
hic <- contacts_yeast()
hic
```

```{r graph, eval = TRUE, echo=FALSE, out.width='100%'}
knitr::include_graphics(
   "https://raw.githubusercontent.com/js2264/HiCExperiment/devel/man/figures/HiCExperiment_data-structure.png"
)
```

# Basics: importing `.(m)cool`, `.hic` or HiC-Pro-generated files as `HiCExperiment` objects

## Import methods

The implemented `import()` methods allow one to import Hi-C matrix files in R as 
`HiCExperiment` objects. 

```{r import, eval = FALSE}
## Change <path/to/contact_matrix>.cool accordingly
hic <- import(
    "<path/to/contact_matrix>.cool", 
    focus = "chr:start-end", 
    resolution = ...
)
```

To give real-life examples, we use the `HiContactsData` package to get access 
to a range of toy datasets available from the `ExperimentHub`. 

```{r evaled_import}
library(HiContactsData)
cool_file <- HiContactsData('yeast_wt', format = 'cool')
import(cool_file, format = 'cool')
```

## Supporting file classes 

There are currently three main standards to store Hi-C matrices in files:

- `.(m)cool` files
- `.hic` files
- `.matrix` and `.bed` files: generated by HiC-Pro.

Three supporting classes were specifically created to ensure that each of these 
file structures would be properly parsed into `HiCExperiment` objects: 

- `CoolFile` 
- `HicFile`
- `HicproFile`

For each object, an optional `pairsFile` can be associated and linked to the 
contact matrix file when imported as a `HiCExperiment` object.

```{r many_imports}
## --- CoolFile
pairs_file <- HiContactsData('yeast_wt', format = 'pairs.gz')
coolf <- CoolFile(cool_file, pairsFile = pairs_file)
coolf
import(coolf)
import(pairsFile(coolf), format = 'pairs')

## --- HicFile
hic_file <- HiContactsData('yeast_wt', format = 'hic')
hicf <- HicFile(hic_file, pairsFile = pairs_file)
hicf
import(hicf)

## --- HicproFile
hicpro_matrix_file <- HiContactsData('yeast_wt', format = 'hicpro_matrix')
hicpro_regions_file <- HiContactsData('yeast_wt', format = 'hicpro_bed')
hicprof <- HicproFile(hicpro_matrix_file, bed = hicpro_regions_file)
hicprof
import(hicprof)
```

# Import arguments

## Querying subsets of Hi-C matrix files

The `focus` argument is used to specifically import contacts within a genomic 
locus of interest. 

```{r focus}
availableChromosomes(cool_file)
hic <- import(cool_file, format = 'cool',  focus = 'I:20001-80000')
hic
focus(hic)
```

_Note:_  
Querying subsets of HiC-Pro formatted matrices is currently not 
supported. HiC-Pro formatted matrices will systematically be fully imported in 
memory when imported. 

One can also extract a count matrix from a Hi-C matrix file that is *not* 
centered at the diagonal. To do this, specify a couple of coordinates in the 
`focus` argument using a character string formatted as `"...|..."`: 

```{r asym}
hic <- import(cool_file, format = 'cool', focus = 'II:1-500000|II:100001-300000')
focus(hic)
```

## Multi-resolution Hi-C matrix files

`import()` works with `.mcool` and multi-resolution `.hic` files as well: 
in this case, the user can specify the `resolution` at which count values are recovered. 

```{r mcool}
mcool_file <- HiContactsData('yeast_wt', format = 'mcool')
availableResolutions(mcool_file)
availableChromosomes(mcool_file)
hic <- import(mcool_file, format = 'cool', focus = 'II:1-800000', resolution = 2000)
hic
```

# HiCExperiment accessors 

## Slots

Slots for a `HiCExperiment` object can be accessed using the following `getters`: 

```{r slots}
fileName(hic)
focus(hic)
resolutions(hic)
resolution(hic)
interactions(hic)
scores(hic)
tail(scores(hic, 1))
tail(scores(hic, 'balanced'))
topologicalFeatures(hic)
pairsFile(hic)
metadata(hic)
```

Several extra functions are available as well: 

```{r extra}
seqinfo(hic) ## To recover the `Seqinfo` object from the `.(m)cool` file
bins(hic) ## To bin the genome at the current resolution
regions(hic) ## To extract unique regions of the contact matrix
anchors(hic) ## To extract "first" and "second" anchors for each interaction
```

## Slot setters

### Scores 

Add any `scores` metric using a numerical vector. 

```{r scores}
scores(hic, 'random') <- runif(length(hic))
scores(hic)
tail(scores(hic, 'random'))
```

### Features 

Add `topologicalFeatures` using `GRanges` or `Pairs`. 

```{r features}
topologicalFeatures(hic, 'viewpoints') <- GRanges("II:300001-320000")
topologicalFeatures(hic)
topologicalFeatures(hic, 'viewpoints')
```

## Coercing `HiCExperiment`

Using the `as()` function, `HiCExperiment` can be coerced in `GInteractions`, 
`ContactMatrix` and `matrix` seamlessly.

```{r as}
as(hic, "GInteractions")
as(hic, "ContactMatrix")
as(hic, "matrix")[1:10, 1:10]
as(hic, "data.frame")[1:10, ]
```

# Importing pairs files

Pairs files typically contain chimeric pairs (filtered after mapping), 
corresponding to loci that have been religated together after restriction 
enzyme digestion. 
Such files have a variety of standards. 
 
- The `.pairs` file format, supported by the 4DN consortium: <ID> <chr1> <pos1> <chr2> <pos2> <str1> <str2> [<frag1> <frag2>]
- The pairs format generated by Juicer: [<readname>] <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> [<score>] [<mapq1> <mapq2>] [<mapq1> <cigar1> <sequence1> <mapq2> <cigar2> <sequence2> <readname1> <readname2>]
- The `.(all)validPairs` file format, defined in the HiC-Pro pipeline: <ID> <chr1> <pos1> <str1> <chr2> <pos2> <str2> <isize> <frag1> <frag2> <mapq1> <mapq2> [<allele-specific info>]

Pairs in any of these different formats are automatically detected and imported 
in R with the `import` function: 

```{r pairs}
import(pairs_file, format = 'pairs')
```

# Further documentation

Please check `?HiCExperiment` in R for a full description of available 
slots, getters and setters, and comprehensive examples of interaction with a 
HiCExperiment object. 

# Session info

```{r session}
sessionInfo()
```