---
title: |-
  DNAfusion
   \vspace{0.5in}
author:
- name: Christoffer Trier Maansson
  affiliation: 
    Department of Clinical Biochemistry, Aarhus University Hospital, Denmark
  email: ctm@clin.au.dk
- name: Emma Roger Andersen
  affiliation:
    Department of Clinical Biochemistry, Aarhus University Hospital, Denmark
date: "`r format(Sys.time(), '%d %b %Y')`"
output:
  BiocStyle::html_document:
    toc: yes
    toc_depth: 3
    number_sections: yes
    highlight: haddock
subtitle: |-
  https://github.com/CTrierMaansson/DNAfusion
   \vspace{0.3in}
abstract: "Circulating tumor DNA (ctDNA) containing somatic mutations 
  can be found in blood plasma. \nThis includes DNA fusions, 
  such as the EML4-ALK, which can be an oncogenic driver in non-small cell lung
  cancer. This is an introduction to the **DNAfusion** package for R,
  which can be used to evaluate whether EML4-ALK is present in blood plasma. \n"
vignette: |
  %\VignetteIndexEntry{Introduction to DNAfusion}
  %\VignetteEncoding{UTF-8} 
  %\VignetteEngine{knitr::rmarkdown}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    tidy.opts = list(width.cutoff=100),
    tidy = FALSE,
    message = FALSE,
    collapse = TRUE,
    comment = "#>"
)
```
\newpage
# Introduction
<img src="logo.png" width="200" align="right">

This package was created in order to increase the sensitivity of EML4-ALK
detection from commercially available NGS products such as the AVENIO (Roche)
pipeline. 

Paired-end sequencing of cfDNA generated BAM files can be used as input to
discover EML4-ALK variants. 
This package was developed using position deduplicated BAM files generated with
the AVENIO Oncology Analysis Software. 
These files are made using the AVENIO ctDNA surveillance kit and 
Illumina Nextseq 500 sequencing. 
This is a targeted hybridization NGS approach and includes ALK-specific but not
EML4-specific probes.

The package includes eight functions.

The output of the first function, `EML4_ALK_detection()`, is used to determine
whether EML4-ALK is detected and serves as input for the next four exploratory
functions characterizing the EML4-ALK variant. The last function
`EML4_ALK_analysis()` combines the output of the exploratory functions.
The `introns_ALK_EML4()` function identifies the introns of EML4 and ALK 
containing the breakpoint. This is used in the `find_variant()` function which
identifies the EML4-ALK variant. 

To serve as examples, this package includes BAM files representing the EML4-ALK
positive cell line H3122 and the EML4-ALK negative cell line, HCC827.  

# Installation

Use **Bioconductor** to install the most recent version of
**DNAfusion**


```{r pull_DNAfusion, message=FALSE, results = 'hide', echo = FALSE}
library(DNAfusion)

```

```{r setup_bioconductor, message=FALSE, results = 'hide', eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DNAfusion")
library(DNAfusion)

```

# Package data

BAM files from the cell lines, H3122 and HCC827, are included in the package and
can be used as examples to explore the functions. 

```{r examples}
H3122_bam <- system.file("extdata", 
                            "H3122_EML4.bam",
                            package = "DNAfusion")
HCC827_bam <-  system.file("extdata", 
                            "HCC827_EML4.bam", 
                            package = "DNAfusion")

```
\newpage
# Functions

## `EML4_ALK_detection()`
This function looks for EML4-ALK mate pair reads in the BAM file.

**Input:**
$$\\[0.1in]$$

**`file`**

    The name of the file which the data are to be read from.
$$\\[0.1in]$$

**`genome`**

    character representing the reference genome. 
    Can either be "hg38" or "hg19". 
    Default = "hg38".
$$\\[0.1in]$$ 

**`mates`**

    integer, the minimum number EML4-ALK mate pairs needed to be
    detected in order to call a variant. Default = 2.
$$\\[0.1in]$$ 

**Output:**

A `GAlignments` object with soft-clipped reads representing
EML4-ALK is returned. If no EML4-ALK is detected the the `GAlignments`
is empty.

**Examples:**
\scriptsize
```{r EML4_ALK_detection2}
H3122_result <- EML4_ALK_detection(file = H3122_bam, 
                        genome = "hg38", 
                        mates = 2) 
head(H3122_result)

```
\normalsize
```{r EML4_ALK_detection3}
HCC827_result <- EML4_ALK_detection(file = HCC827_bam, 
                    genome = "hg38", 
                    mates = 2)
HCC827_result

```

## `EML4_sequence()`
This function identifies the basepairs leading up to the EML4 breakpoint.

**Input:**
$$\\[0.1in]$$

**`reads`**

    GAlignments object returned by EML4_ALK_detection().
$$\\[0.1in]$$  

**`basepairs`**

    integer, number of basepairs identified from the EML4-ALK fusion.
    Default = 20.
$$\\[0.1in]$$ 

**`genome`**

    character representing the reference genome. 
    Can either be "hg38" or "hg19". 
    Default = "hg38".
$$\\[0.1in]$$ 

**Output:**

If EML4-ALK is detected, returns a `table` of identified
EML4 basepairs with the number of corresponding reads for each sequence.
If no EML4-ALK is detected "No EML4-ALK was detected" is returned.

**Examples:**

```{r EML4_sequence}
EML4_sequence(H3122_result, genome = "hg38", basepairs = 20)
EML4_sequence(HCC827_result, genome = "hg38", basepairs = 20)

```

## `ALK_sequence()`
This function identifies the basepairs following the ALK breakpoint.

**Input:**
$$\\[0.1in]$$

**`reads`**

    GAlignments object returned by EML4_ALK_detection().
$$\\[0.1in]$$  

**`basepairs`**

    integer, number of basepairs identified from the EML4-ALK fusion. 
    Default = 20.
$$\\[0.1in]$$ 

**`genome`**

    character representing the reference genome. 
    Can either be "hg38" or "hg19". 
    Default = "hg38".
$$\\[0.1in]$$ 

**Output:**

If EML4-ALK is detected, returns a `table` of identified
ALK basepairs with the number of corresponding reads for each sequence.
If no spanning reads in ALK is detected an empty `GAlignments` object 
is returned. If no EML4-ALK is detected "No EML4-ALK was detected" is returned.

**Examples:**

```{r ALK_sequence}
ALK_sequence(H3122_result, genome = "hg38", basepairs = 20)
ALK_sequence(HCC827_result, genome = "hg38", basepairs = 20)

```

## `break_position()`
This function identifies the genomic position in EML4 or ALK,
where the breakpoint has happened.

**Input:**
$$\\[0.1in]$$

**`reads`**

    GAlignments object returned by EML4_ALK_detection().
$$\\[0.1in]$$
**`genome`**

    character representing the reference genome. 
    Can either be "hg38" or "hg19". 
    Default = "hg38".
$$\\[0.1in]$$ 


**`gene`**

    Character string representing the gene. Can be either "ALK" or "EML4".
$$\\[0.1in]$$

**Output:**

If EML4-ALK is detected, it returns a `table` of genomic positions
with the number of corresponding reads for each sequence.
If no spanning reads in EML4 or ALK is detected an empty `GAlignments` object is
returned. If no EML4-ALK is detected "No EML4-ALK was detected" is returned.

**Examples:**

```{r break_position}
break_position(H3122_result, genome = "hg38", gene = "EML4")
break_position(HCC827_result, genome = "hg38", gene = "EML4")

```

## `break_position_depth()`
This function identifies the read depth at the basepair
before the breakpoint in EML4 or ALK.

**Input:**
$$\\[0.1in]$$

**`file`**

    The name of the file which the data are to be read from.
$$\\[0.1in]$$

**`reads`**

    GAlignments returned by EML4_ALK_detection().
$$\\[0.1in]$$
**`genome`**

    character representing the reference genome. 
    Can either be "hg38" or "hg19". 
    Default = "hg38".
$$\\[0.1in]$$ 


**`gene`**

    Character string representing the gene. Can be either "ALK" or "EML4".
$$\\[0.1in]$$

**Output:**

If EML4-ALK is detected a single `integer` corresponding
to the read depth at the breakpoint is returned.
If no spanning reads in EML4 or ALK is detected an empty GAlignments object is
returned. If no EML4-ALK is detected "No EML4-ALK was detected" is returned.

**Examples:**

```{r break_position_depth}
break_position_depth(H3122_bam, H3122_result, genome = "hg38", gene = "EML4")
break_position_depth(HCC827_bam, HCC827_result, genome = "hg38", gene = "EML4")

```

## `EML4_ALK_analysis()`
This functions collects the results from the other functions of the package.

**Input:**
$$\\[0.1in]$$

**`file`**

    The name of the file which the data are to be read from.
$$\\[0.1in]$$

**`genome`**

    character representing the reference genome. 
    Can be either "hg38" or "hg19".
    Default = "hg38".
$$\\[0.1in]$$ 

**`mates`**

    integer, the minimum number EML4-ALK mate pairs needed to be detected in
    order to call a variant. Default = 2.
$$\\[0.1in]$$

**`basepairs`**

    integer, number of basepairs identified from the EML4-ALK fusion. 
    Default = 20.
$$\\[0.1in]$$ 

**Output:** 

A `list` object with clipped_reads corresponding to `EML4_ALK_detection()`,
last_EML4 corresponding to `EML4_sequence()`,
first_ALK corresponding to `ALK_sequence()`,
breakpoint_ALK corresponding to `break_position()`, gene = "ALK",
breakpoint_EML4 corresponding to `break_position()`, gene = "EML4",
read_depth_ALK corresponding to `break_position_depth()`.gene = "ALK",
and read_depth_EML4 corresponding to `break_position_depth()` gene = "EML4".
If no EML4-ALK is detected an empty `GAlignments` is returned.

**Examples:**

```{r EML4_ALK_analysis_results, message=FALSE}
H3122_results <- EML4_ALK_analysis(file = H3122_bam, 
                                    genome = "hg38", 
                                    mates = 2, 
                                    basepairs = 20)
HCC827_results <- EML4_ALK_analysis(file = HCC827_bam, 
                                    genome = "hg38", 
                                    mates = 2, 
                                    basepairs = 20)

```


\scriptsize
```{r EML4_ALK_analysis_output1}
head(H3122_results$clipped_reads)

```
\normalsize
```{r EML4_ALK_analysis_output2}

H3122_results$last_EML4

H3122_results$first_ALK

H3122_results$breakpoint_ALK

H3122_results$breakpoint_EML4

H3122_results$read_depth_ALK

H3122_results$read_depth_EML4

HCC827_results

```

## `introns_ALK_EML4()`
This function identifies the introns of ALK and EML4 where 
the breakpoint has happened.

**Input:**
$$\\[0.1in]$$
**`file`**

    The name of the file which the data are to be read from.
$$\\[0.1in]$$

**`genome`**

    character representing the reference genome. 
    Can be either "hg38" or "hg19".
    Default = "hg38".
$$\\[0.1in]$$ 

**Output:** 

A`dataframe`of the ALK- and EML4-intron of the breakpoint is returned
corresponding to the transcript ENST00000389048.8 for ALK and
ENST00000318522.10 for EML4.
If the breakpoint is not located in introns of ALK or EML4,
"Breakpoint not located in intron of ALK" or
"Breakpoint not located in intron of EML4" is returned.
If no EML4-ALK is detected “No EML4-ALK was detected” is returned.


**Examples:**

```{r introns_ALK_EML4_results, message=FALSE}
introns_ALK_EML4(file=H3122_bam,genome="hg38")
introns_ALK_EML4(file=HCC827_bam,genome="hg38")
```

## `find_variants()`
This function identifies the EML4-ALK variants as defined by 
[Zhang et al. 2021](https://doi.org/10.1016/j.lungcan.2021.06.012 )

**Input:**
$$\\[0.1in]$$
**`file`**

    The name of the file which the data are to be read from.
$$\\[0.1in]$$

**`genome`**

    character representing the reference genome. 
    Can be either "hg38" or "hg19".
    Default = "hg38".
$$\\[0.1in]$$ 

**Output:** 

A `dataframe`of the EML4-ALK variant is returned.
If no variant is detected, "No ALK-EML4 was detected" is returned.
If the variant is not defined a `list` with identified introns with
breakpoints is returned.
If the breakpoint could not be identified in either of the genes a `list`
with identified introns with breakpoints is returned.

**Examples:**

```{r ifind_variants_results, message=FALSE}
find_variants(file=H3122_bam,genome="hg38")
find_variants(file=HCC827_bam,genome="hg38")
```

\newpage
# Session info
```{r session, echo = FALSE}
sessioninfo::session_info(
    pkgs = "attached",
    include_base = FALSE,
    dependencies = NA,
    to_file = FALSE
)

```