---
title: "svaNUMT Quick Overview"
author: "Ruining Dong"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    toc_float:
      collapsed: yes
      smooth_scroll: yes
vignette: >
  %\VignetteIndexEntry{svaNUMT Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(#echo = TRUE,
  collapse = TRUE,
  comment = "#>")
```

## Introduction
This vignette outlines a workflow of detecting nuclear-mitochondrial DNA fusions 
from Variant Call Format (VCF) using the `svaNUMT` package. 


## Using GRanges for structural variants: a breakend-centric data structure

This package uses a breakend-centric event notation adopted from the 
`StructuralVariantAnnotation` package. In short, breakends are stored in a 
GRanges object with strand used to indicate breakpoint orientation. where 
breakpoints are represented using a `partner` field containing the name of the 
breakend at the other side of the breakend. This notation was chosen as it 
simplifies the annotations of RTs which are detected at breakend-level.

## Workflow
### Loading data from VCF

VCF data is parsed into a `VCF` object using the `readVCF` function from the
Bioconductor package `VariantAnnotation`. Simple filters could be applied to a 
`VCF` object to remove unwanted calls. The `VCF` object is then converted to a 
`GRanges` object with breakend-centric notations using 
`StructuralVariantAnnotation`. More information about `VCF` objects and 
breakend-centric GRanges object can be found by consulting the vignettes in the 
corresponding packages with `browseVignettes("VariantAnnotation")` and 
`browseVignettes("StructuralVariantAnnotation")`.

```{r input, include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(StructuralVariantAnnotation)
library(VariantAnnotation)
library(svaNUMT)

vcf <- readVcf(system.file("extdata", "chr1_numt_pe_HS25.sv.vcf", package = "svaNUMT"))
gr <- breakpointRanges(vcf)
```

Note that `StructuralVariantAnnotation` requires the `GRanges` object to be 
composed entirely of valid breakpoints. Please consult the vignette of the 
`StructuralVariantAnnotation` package for ensuring breakpoint consistency.


### Identifying Nuclear-mitochondrial Genome Fusion Events
Function `svaNUMT` searches for NUMT events by identifying breakends 
supporting the fusion of nuclear chromosome and mitochondrial genome. 
`svaNUMT` returns identified breakends supporting candidate NUMTs in 2 lists 
of list of GRanges, grouped by chromosome and insertion sites. 
```{r, include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(readr)
numtS <- read_table(system.file("extdata", "numtS.txt", package = "svaNUMT"), 
    col_names = FALSE)
colnames(numtS) <- c("bin", "seqnames", "start", "end", "name", "score", "strand")
numtS <- `seqlevelsStyle<-`(GRanges(numtS), "NCBI")

library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
genomeMT <- genome$chrMT
```

```{r}
NUMT <- numtDetect(gr, numtS, genomeMT, max_ins_dist = 20)
```


The breakends supporting the insertion sites and the MT sequence are arranged by
the order of events. Below is an example of a detected NUMT event, where 
MT sequence `MT:15737-15836` followed by polyadenylation is inserted between 
`chr1:1688363-1688364`.
```{r}
GRangesList(NU=NUMT$MT$NU$`1`[[1]], MT=NUMT$MT$MT$`1`[[1]])
```
Below is an example to subset the detected NUMTs by a genomic region 
given `seqnames`, `start`, and `end`. For region `chr1:1000000-3000000`,
there are 3 NUMTs detected.
 
```{r}
seqnames = 1
start = 1000000
end = 3000000
i <- sapply(NUMT$MT$NU[[seqnames]], function(x) 
  sum(countOverlaps(x, GRanges(seqnames = seqnames, IRanges(start, end))))>0)
list(NU=NUMT$MT$NU[[seqnames]][i], MT=NUMT$MT$MT[[seqnames]][i])
```


## Visualising breakpoint pairs via circos plots

One way of visualising paired breakpoints is by circos plots. Here we use the 
package [`circlize`](https://doi.org/10.1093/bioinformatics/btu393) to 
demonstrate breakpoint visualisation. The `bedpe2circos` function takes 
BEDPE-formatted dataframes (see `breakpointgr2bedpe()`) and plotting 
parameters for the `circos.initializeWithIdeogram()` and `circos.genomicLink()` 
functions from `circlize`.

To generate a simple circos plot of one candidate NUMT event:
```{r, include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(circlize)
numt_chr_prefix <- c(NUMT$MT$NU$`1`[[2]], NUMT$MT$MT$`1`[[2]])
GenomeInfoDb::seqlevelsStyle(numt_chr_prefix) <- "UCSC"
pairs <- breakpointgr2pairs(numt_chr_prefix)
pairs
```
To see supporting breakpoints clearly, we generate the circos plot according to 
the loci of event.
```{r}
circos.initializeWithIdeogram(
    data.frame(V1=c("chr1", "chrM"),
               V2=c(1791073,1),
               V3=c(1791093,16571),
               V4=c("p15.4",NA),
               V5=c("gpos50",NA)),  sector.width = c(0.2, 0.8))
#circos.initializeWithIdeogram()
circos.genomicLink(as.data.frame(S4Vectors::first(pairs)), 
                   as.data.frame(S4Vectors::second(pairs)))
circos.clear()
```


## SessionInfo
```{r}
sessionInfo()
```