---
title: "miRNA affinity models and the KdModel class"
author: 
- name: Pierre-Luc Germain
  affiliation:
    - D-HEST Institute for Neuroscience, ETH
    - Lab of Statistical Bioinformatics, UZH
- name: Michael Soutschek
  affiliation: Lab of Systems Neuroscience, D-HEST Institute for Neuroscience, ETH
- name: Fridolin Gross
  affiliation: Lab of Systems Neuroscience, D-HEST Institute for Neuroscience, ETH
package: scanMiR
output:
  BiocStyle::html_document
abstract: |
  This vignette introduces the KdModel and KdModelList classes used for storing
  miRNA 12-mer affinities and predicting the dissociation constant of specific
  sites.
vignette: |
  %\VignetteIndexEntry{2_Kdmodels}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
library(BiocStyle)
```


# KdModels

The `KdModel` class stores the 12-mer sequence affinities of a given miRNA. It 
is meant to compress, and make easy to manipulate, the dissociation constant 
(Kd) predictions from 
[McGeary, Lin et al. (2019)](https://dx.doi.org/10.1126/science.aav1741). We 
can take a look at the example `KdModel`:

```{r}
library(scanMiR)
data(SampleKdModel)
SampleKdModel
```

In addition to the information necessary to predict the binding affinity to any 
given 12-mer sequence, the model contains, minimally, the name and sequence of 
the miRNA. Since the `KdModel` class extends the list class, any further 
information can be stored:

```{r}
SampleKdModel$myVariable <- "test"
```

An overview of the binding affinities can be obtained with the following plot:

```{r}
plotKdModel(SampleKdModel, what="seeds")
```

The plot gives the -log(Kd) values of the top 7-mers (including both canonical 
and non-canonical sites), with or without the final "A" opposite the first 
miRNA nucleotide.

To predict the dissociation constant (and binding type, if any) of a given 
12-mer sequence, you can use the `assignKdType` function:

```{r}
assignKdType("ACGTACGTACGT", SampleKdModel)
# or using multiple sequences:
assignKdType(c("CTAGCATTAAGT","ACGTACGTACGT"), SampleKdModel)
```

The `log_kd` column contains log(Kd) values multiplied by 1000 and stored as 
integers (which is more economical when dealing with millions of sites). In the 
example above, the stored value 
`r (lkd <- assignKdType("CTAGCATTAAGT", SampleKdModel)$log_kd)` corresponds to 
a log(Kd) of `r lkd/1000`, or a dissociation constant of `r exp(lkd/1000)`. The 
smaller the value, the stronger the relative affinity.
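
The conversion described above can be sketched in base R. The stored value 
below is hypothetical and chosen only for illustration, and the natural 
logarithm is assumed, as implied by the use of `exp()`:

```{r}
# Hypothetical stored value, for illustration only (not taken from a real model)
stored_log_kd <- -5129L
# Undo the 1000x scaling to recover log(Kd)
log_kd_value <- stored_log_kd / 1000
# Back-transform to the dissociation constant (natural log assumed)
kd_value <- exp(log_kd_value)
c(log_kd = log_kd_value, kd = kd_value)
# Going the other way: scale, round and store as an integer
restored <- as.integer(round(log(kd_value) * 1000))
identical(restored, stored_log_kd)
# R integers use 4 bytes per value, doubles use 8
as.numeric(object.size(integer(1e6))) < as.numeric(object.size(numeric(1e6)))
```

Keeping three decimals of log(Kd) in an integer roughly halves the memory 
needed per site compared to a double.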

## KdModelLists

A `KdModelList` object is simply a collection of `KdModel` objects. We can 
build one in the following way:

```{r}
# we create a copy of the KdModel, and give it a different name:
mod2 <- SampleKdModel
mod2$name <- "dummy-miRNA"
kml <- KdModelList(SampleKdModel, mod2)
kml
summary(kml)
```

Beyond operations typically performed on a list (e.g. subsetting), some 
specific slots of the respective KdModels can be accessed, for example: 

```{r}
conservation(kml)
```

# Creating a KdModel object

`KdModel` objects are meant to be created from a table assigning log_kd 
values to 12-mer target sequences, as produced by the CNN from McGeary, Lin et 
al. (2019). For the purpose of this example, we create such a dummy table:

```{r}
kd <- dummyKdData()
head(kd)
```

A `KdModel` object can then be created with:

```{r}
mod3 <- getKdModel(kd=kd, mirseq="TTAATGCTAATCGTGATAGGGGTT", name = "my-miRNA")
```

Alternatively, the `kd` argument can also be the path to the output file of the 
CNN (and if `mirseq` and `name` are in the table, they can be omitted).

# Common KdModel collections

The [scanMiRData](https://github.com/ETHZ-INS/scanMiRData) package contains 
`KdModel` collections corresponding to all human, mouse and rat miRBase miRNAs.

# Under the hood

When calling `getKdModel`, the dissociation constants are stored as a 
lightweight overfitted linear model. The model has a base Kd coefficient 
(stored as an integer in `object$mer8`) for each of the 1024 partially-matching 
8-mers (i.e. those with at least 4 consecutive matching nucleotides), to which 
is added an 8-mer-specific coefficient (stored in `object$fl`) multiplied by a 
flanking score generated from the flanking di-nucleotides. The flanking score 
is calculated based on the di-nucleotide effects experimentally measured by 
McGeary, Lin et al. (2019). To save space, the actual 8-mer sequences are not 
stored but are generated when needed in a deterministic fashion. The 8-mers can 
be obtained, in the right order, with the `getSeed8mers` function.
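
The structure of this model can be illustrated with a minimal base R sketch. 
All numbers and variable names below are hypothetical, and the sketch only 
assumes the additive form described above (base coefficient plus flanking 
coefficient times flanking score); it is not the package's internal code, and 
the actual scaling of the stored coefficients may differ:

```{r}
# All values below are hypothetical and only illustrate the structure
sketch_mer8 <- c(-5200L, -3100L, -900L)  # base coefficients, cf. object$mer8
sketch_fl   <- c(150L, 80L, 20L)         # 8-mer-specific coefficients, cf. object$fl
flank_score <- 0.4                       # assumed score for one pair of flanking di-nucleotides
# Base coefficient plus flanking coefficient times flanking score
sketch_pred <- sketch_mer8 + sketch_fl * flank_score
sketch_pred
```

With two coefficients per 8-mer, a model needs only a couple of thousand 
integers rather than one value for every possible 12-mer.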

<br/><br/>

# Session info {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```