---
title: Drug-Target Interactions 
author: "Authors: Kevin Horan, Yuzhu Duan, Austin Leong, Dan Evans, Jamison McCorrison, Nicholas Schork and Thomas Girke"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`" 
always_allow_html: yes
output:
  BiocStyle::html_document:
    toc_float: true
    code_folding: show
vignette: |
  %\VignetteIndexEntry{Drug-Target Interactions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}

fontsize: 14pt
bibliography: bibtex.bib
---

<style>
pre code {
  white-space: pre !important;
  overflow-x: scroll !important;
  word-break: keep-all !important;
  word-wrap: initial !important;
}
</style>

<!---
- Compile from command-line
Rscript -e "rmarkdown::render('drugTargetInteractions.Rmd', c('BiocStyle::html_document', 'pdf_document')); knitr::knit('drugTargetInteractions.Rmd', tangle=TRUE)"
-->

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```

# Introduction

## Overview

The `drugTargetInteractions` package provides utilities for identifying
drug-target interactions for sets of small molecule and/or gene/protein
identifiers [@Wu2006-jm]. The required drug-target interaction information is
obained from a downloaded SQLite instance of the [ChEMBL](https://www.ebi.ac.uk/chembl/) 
database [@Gaulton2012-dv; @Bento2014-ry]. ChEMBL has been chosen for this purpose, 
because it provides one of the most comprehensive and best annotatated knowledge resources
for drug-target information available in the public domain.

## Install Package

As Bioconductor package `drugTargetInteraction` can be installed with the `BiocManager::install()` function.

```{r pkg_install_bioc, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("drugTargetInteractions")
```

Alternatively, the package can be installed from GitHub as follows. 

```{r pkg_install_github, eval=FALSE}
devtools::install_github("girke-lab/drugTargetInteractions", build_vignettes=TRUE) # Installs from github
```

## Load Package and Access Help

To use `drugTargetInteractions`, the package needs to be loaded in a user's R session.

```{r package load, eval=TRUE, warning=FALSE}
library("drugTargetInteractions") # Loads the package
```

The following commands are useful to list the available help files and open the 
the vignette of the package.

```{r documentation, eval=FALSE, warning=FALSE}
library(help="drugTargetInteractions") # Lists package info
vignette(topic="drugTargetInteractions", package="drugTargetInteractions") # Opens vignette
```

# Working Environment

## Required Files and Directories

The `drugTargetInteractions` package uses a downloaded SQLite instance of the
[ChEMBL](https://www.ebi.ac.uk/chembl/) database. The following code in this
vignette uses a slimmed down toy version of this database that is small enough
to be included in this package for demonstrating the usage of its functions.
For real drug-target analysis work, it is important that users download and
uncompress a recent version of the ChEMBL SQLite database from
[here](https://chembl.gitbook.io/chembl-interface-documentation/downloads), and then replace
the path assigned to `chembldb` below with the path to the full version of the
ChEMBL database they have downloaded to their system. Since the SQLite database
from ChEMBL can be used by this package as is, creating a copy of the ChEMBL
SQLite database on Bioconductor's
[AnnotationHub](https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html)
is not necessary at this point. This way users can always use the latest or
historical versions of ChEMBL without the need of maintaining a mirror instance. 

The following `genConfig` function call creates a list containing the paths to
input and output directories used by the sample code introduced in this
vignette.  For real analyses, users want to customize these paths to match the
environment on their system. Usually, this means all paths generated by the
`system.file` function need to be changed, as those are specific to work with
the test data of a package (_e.g._ toy ChEMBL SQLite instance).

```{r dir_env, eval=TRUE}
chembldb <- system.file("extdata", "chembl_sample.db", package="drugTargetInteractions")
resultsPath <- system.file("extdata", "results", package="drugTargetInteractions")
config <- genConfig(chemblDbPath=chembldb, resultsPath=resultsPath)
```

In addition, a lookup table is downloaded on the fly from UniChem (see [web
page](https://www.ebi.ac.uk/unichem/ucquery/listSources) and corresponding [ftp
site](ftp://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/data/wholeSourceMapping/).
This lookup table is used by `drugTargetInteractions` to translate compound
identifiers across different drug databases. Currently, this includes three types of 
compound identifiers: DrugBank, PubChem, and ChEBI. 

```{r data_setup, eval=TRUE}
downloadUniChem(config=config)
cmpIdMapping(config=config)
```

# Produce Results Quickly

Users mainly interested in generating analysis results can skip the
technical details in the following sections and continue with the section
entitled [Workflow to Run Everything](#7_Workflow_to_Run_Everything).

# Retrieve UniProt IDs

The following returns for a set of query IDs (_e.g._ ENSEMBL gene IDs) the
corresponding UniProt IDs based on a stict ID matching as well as a more
relaxed sequence similarity-based approach. The latter sequence similarity
associations are obtained with the `getUniprotIDs` or the `getParalogs` functions
using UniProt's UNIREF cluster or BioMart's paralog annotations, respectively.

## UniProt's UNIREF Clusters

The `UniProt.ws` package is used to to return for a set of query IDs
(here Ensembl gene IDs) the corresponding UniProt IDs based on two independent
approaches: ID mappings (IDMs) and sequence similarity nearest neighbors
(SSNNs) using UNIREF clusters. The latter have been generated by UniProt with the MMSeqs2 
and Linclust algorithms [@Steinegger2017-uq; @Steinegger2018-ub]. Additional details 
on the UNIREF clusters are available [here](https://www.uniprot.org/help/uniref). The 
organism, query ID type and sequence similarity level can be selected under the `taxId`, 
`kt` and `seq_cluster` arguments, respectively. The `seq_cluster` argument can be
assigned one of: `UNIREF100`, `UNIREF90` or `UNIREF50`. The result is a list
with two `data.frames`. The first one is based on IDMs and the second one on
SSNNs.

```{r getUniprotIDs1, eval=FALSE, message=FALSE, warning=FALSE}
keys <- c("ENSG00000145700", "ENSG00000135441", "ENSG00000120071")
res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=keys, seq_cluster="UNIREF90") 
```

```{r load_res_list90, echo = FALSE, results = 'asis'}
res_list90 <- readRDS(system.file("extdata", "res_list90.rds", package="drugTargetInteractions"))
```
The following shows the first `data.frame` containing the ID mapping results.

```{r getUniprotIDs2, eval=TRUE, message=FALSE}
library(DT)
datatable(res_list90[[1]], options = list(scrollX=TRUE, scrollY="600px", autoWidth = TRUE))
```
The following shows how to return the dimensions of the two `data.frames` and how to
obtain the UniProt IDs as character vectors required for the downstream analysis steps.

```{r getUniprotIDs3, eval=TRUE}
sapply(res_list90, dim, simplify=FALSE)
sapply(names(res_list90), function(x) unique(na.omit(res_list90[[x]]$ID)))
```

## BioMart's Paralogs

The following obtains via `biomaRt` for a set of query genes the corresponding 
UniProt IDs as well as their paralogs. The query genes can be Gene Names or ENSEMBL
Gene IDs from _H. sapiens_. The result is similar to IDMs and SSNNs from the
`getUniprotIDs` function, but instead of UNIREF clusters, biomaRt's paralogs
are used to obtain SSNNs. 

```{r getParalogs1, eval=FALSE}
queryBy <- list(molType="gene", idType="external_gene_name", ids=c("ANKRD31", "BLOC1S1", "KANSL1"))
queryBy <- list(molType="gene", idType="ensembl_gene_id", ids=c("ENSG00000145700", "ENSG00000135441", "ENSG00000120071"))
res_list <- getParalogs(queryBy)
```

```{r getParalogs1_load, eval=TRUE,echo=FALSE}
res_list <- readRDS(system.file("extdata", "paralogs1_res_list.rds", package="drugTargetInteractions"))
```

The following shows the first `data.frame` containing the ID mapping results.

```{r getParalogs2, eval=TRUE, message=FALSE}
library(DT)
datatable(res_list[[1]], options = list(scrollX = TRUE, scrollY="400px", autoWidth = TRUE))
```
The following shows how to return the dimensions of the two `data.frames` and how to
obtain the UniProt IDs as character vectors required for the downstream analysis steps.

```{r getParalogs3, eval=TRUE}
sapply(res_list, dim, simplify=FALSE)
sapply(names(res_list), function(x) unique(na.omit(res_list[[x]]$ID_up_sp)))
```

# Query Drug-Target Annotations 

The `drugTargetAnnot` function returns for a set of compound or gene/protein
IDs the corresponding known drug-target annotation data available in ChEMBL. A
related function called `getDrugTarget` is described in the subsequent
subsection. This method generates very similar results, but uses internally pre-computed
annotation summary tables which is less flexible than the usage of pure SQL statements. 

## Using `drugTargetAnnot` 

The `drugTargetAnnot` function queries the ChEMBL database with SQL statements without
depending on pre-computed annotation tables. 

### Query with Compound IDs

```{r drugTargetQuery1cmp, eval=TRUE}
queryBy <- list(molType="cmp", idType="chembl_id", ids=c("CHEMBL17", "CHEMBL19", "CHEMBL1201117", "CHEMBL25", "nomatch", "CHEMBL1742471"))
qresult1 <- drugTargetAnnot(queryBy, config=config)
```

```{r cmp_query_result1, eval=TRUE, message=FALSE}
library(DT)
datatable(qresult1, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
```

### Query with Protein IDs

```{r drugTargetQuery1protein, eval=TRUE}
queryBy <- list(molType="protein", idType="UniProt_ID", ids=c("P43166", "P00915"))
qresult2 <- drugTargetAnnot(queryBy, config=config)
```

```{r protein_query_result1, eval=TRUE, message=FALSE}
library(DT)
datatable(qresult2, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
```

### Query with Gene IDs

The following returns drug-target annotations for a set of query Ensembl gene IDs.
For this the Ensembl gene IDs are translated into UniProt IDs using both the IDM 
and SSNN approaches. 

```{r getUniprotIDs, eval=FALSE, message=FALSE}
keys <- c("ENSG00000120088", "ENSG00000135441", "ENSG00000120071")
res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=keys, seq_cluster="UNIREF90") 
```
```{r getUniprotIDs4, eval=TRUE, message=FALSE}
id_list <- sapply(names(res_list90), function(x) unique(na.omit(res_list90[[x]]$ID)))
```

Next, drug-target annotations are returned for the Uniprot IDs with the IDM or
SSNN methods. The following example uses the UniProt IDs of the SSNN method.
Note, to include the upstream Ensembl gene IDs in the final result table, the
below Ensembl ID collapse step via a `tapply` is necessary since occasionally
UniProt IDs are assigned to several Ensembl gene IDs (_e.g._ recent gene duplications).

```{r drugTargetQuery2protein, eval=TRUE}
queryBy <- list(molType="protein", idType="UniProt_ID", ids=id_list[[2]])
qresultSSNN <- drugTargetAnnot( queryBy, config=config)
ensidsSSNN <- tapply(res_list90[[2]]$ENSEMBL, res_list90[[2]]$ID, paste, collapse=", ") 
qresultSSNN <- data.frame(Ensembl_IDs=ensidsSSNN[as.character(qresultSSNN$QueryIDs)], qresultSSNN)
```

```{r protein_query_result2, eval=TRUE, message=FALSE}
library(DT)
datatable(qresultSSNN, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
```

## Using `getDrugTarget` 

The `getDrugTarget` function generates similar results as `drugTargetAnnot`,
but depends on a pre-computed query table (here `drugTargetAnnot.xls`). 

### Query with Compound IDs

```{r cmp_query2, eval=TRUE}
id_mapping <- c(chembl="chembl_id", pubchem="PubChem_ID", uniprot="UniProt_ID", drugbank="DrugBank_ID")
queryBy <- list(molType="cmp", idType="pubchem", ids=c("2244", "65869", "2244"))
queryBy <- list(molType="protein", idType="uniprot", ids=c("P43166", "P00915", "P43166"))
queryBy <- list(molType="cmp", idType="drugbank", ids=c("DB00945", "DB01202"))
#qresult3 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17,39,46:53),config=config)
qresult3 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17),config=config)
```

```{r cmp_query_result2, eval=TRUE, message=FALSE}
library(DT)
datatable(qresult3, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
```

### Query with Protein IDs

```{r protein_query2, eval=TRUE}
queryBy <- list(molType="protein", idType="chembl", ids=c("CHEMBL25", "nomatch", "CHEMBL1742471"))
#qresult4 <- getDrugTarget(queryBy=queryBy, id_mapping, columns=c(1,5,8,16,17,39,46:52),config=config) 
qresult4 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17),config=config) 
```

```{r protein_query_result3, eval=TRUE, message=FALSE}
library(DT)
datatable(qresult4, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
```

# Query Bioassay Data

The `drugTargetBioactivity` function returns for a set of compound or gene/protein IDs the
corresponding bioassay data available in ChEMBL.

## Query with Compound IDs

Example query for compounds IDs.

```{r bioassayQuery1cmp, eval=TRUE, warning=FALSE}
queryBy <- list(molType="cmp", idType="DrugBank_ID", ids=c("DB00945", "DB00316", "DB01050"))
qresultBAcmp <- drugTargetBioactivity(queryBy, config=config)
```

```{r cmpBA_query_result, eval=TRUE, message=FALSE}
library(DT)
datatable(qresultBAcmp, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
```

## Query with Protein IDs

Example query for protein IDs. Note, the Ensembl gene to UniProt ID mappings are derived from 
above and stored in the named character vector `ensidsSSNN`.

```{r bioassayQuery1protein, eval=TRUE, warning=FALSE}
queryBy <- list(molType="protein", idType="uniprot", ids=id_list[[1]])                                                                                                             
qresultBApep <- drugTargetBioactivity(queryBy, config=config)                                                                                       
qresultBApep <- data.frame(Ensembl_IDs=ensidsSSNN[as.character(qresultBApep$UniProt_ID)], qresultBApep)
```

```{r pepBA_query_result, eval=TRUE, message=FALSE}
library(DT)
datatable(qresultBApep, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
```

# Workflow to Run Everything 

This section explains how to run all of the above drug-target interaction
analysis steps with a few convenience meta functions. Users mainly interested
in generating analysis results quickly can focus on this section only.

## ID mapping 

The `getSymEnsUp` function returns for a query of gene or protein IDs a mapping
table containing: ENSEMBL Gene IDs, Gene Names/Symbols, UniProt IDs and ENSEMBL
Protein IDs. Internally, the function uses the `ensembldb` package. Its results
are returned in a list where the first slot contains the ID mapping table, while
the subsequent slots include the corresponding named character vectors: `ens_gene_id`,
`up_ens_id`, and `up_gene_id`. Currently, the following query IDs are supported
by `getSymEnsUp`: `GENE_NAME`, `ENSEMBL_GENE_ID` and `UNIPROT_ID`.

### Query with Gene Names 

```{r gene_name, eval=TRUE, message=FALSE}
gene_name <- c("CA7", "CFTR")
idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=gene_name, idtype="GENE_NAME")
ens_gene_id <- idMap$ens_gene_id
ens_gene_id
```

### Query with ENSEBML Gene IDs

```{r ens_gene_id, eval=TRUE, message=FALSE}
ensembl_gene_id <- c("ENSG00000001626", "ENSG00000168748")
idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=ensembl_gene_id, idtype="ENSEMBL_GENE_ID")
ens_gene_id <- idMap$ens_gene_id
```

### Query with UniProt IDs

```{r ens_uniprot_id, eval=TRUE, message=FALSE}
uniprot_id <- c("P43166", "P13569") 
idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=uniprot_id, idtype="UNIPROT_ID")
ens_gene_id <- idMap$ens_gene_id
```

## Retrieve UniProt IDs 

The perfect match and nearest neighbor UniProt IDs can be obtained from UniProt's
UNIREF cluster or BioMart's paralog annotations.

### UNIREF Cluster

The corresponding IDM and SSNN UniProt IDs for the above ENSEMBL gene IDs are
obtained with the `getUniprotIDs` function. This step is slow since the queries
have to be performed with `chunksize=1` in order to reliably track the query ENSEMBL 
gene ID information in the results. 

```{r res_list_ens_gene_id_uniref, eval=FALSE, message=FALSE}
res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=names(ens_gene_id), seq_cluster="UNIREF90", chunksize=1)
sapply(res_list90, dim)
```

### BioMart Parlogs

Here the corresponding perfect match (IDM) and paralog (SSNN) UniProt IDs for
the above ENSEMBL gene IDs are obtained with the `getParalogs` function. The
latter is much faster than `getUniprotIDs`, and also covers wider evolutionary
distances. Thus, it may be the preferred methods for many use cases.

```{r res_list_ens_gene_id_paralog, eval=FALSE, message=FALSE}
queryBy <- list(molType="gene", idType="ensembl_gene_id", ids=names(ens_gene_id))
res_list <- getParalogs(queryBy)
```

```{r res_list_ens_gene_id_paralog_load , eval=TRUE,echo=FALSE}
res_list <- readRDS(system.file("extdata", "paralogs2_res_list.rds", package="drugTargetInteractions"))
```
```{r res_list_ens_gene_id_paralog_part2, eval=TRUE}
sapply(res_list, dim)
```

## Drug-Target Data

Both drug-target annotation and bioassay data are obtained with a meta
function called `runDrugTarget_Annot_Bioassay` that internally uses the main
processing functions `drugTargetAnnot` and `drugTargetBioactivity`. It
organizes the result in a list with the annotation and bioassay data
(`data.frames`) in the first and second slot, respectively. Importantly, the
results from the IDM and SSNN UniProt IDs are combined in a single table, where
duplicated rows have been removed. To track in the result table, which method
was used for obtaining UniProt IDs, an `IDM_Mapping_Type` column has been added.
Note, the gene IDs in the SSNN rows have the string `Query_` prepended to
indicate that they are not necessarily the genes encoding the SSNN UniProt
proteins listed in the corresponding rows. Instead they are the genes encoding
the query proteins used for searching for SSNNs.  

```{r runDrugTarget_Annot_Bioassay, eval=TRUE, message=FALSE}
drug_target_list <- runDrugTarget_Annot_Bioassay(res_list=res_list, up_col_id="ID_up_sp", ens_gene_id, config=config) 
sapply(drug_target_list, dim)
```

View content of annotation result slot:

```{r runDrugTarget_Annot_Bioassay_result1, eval=TRUE, message=FALSE}
datatable(drug_target_list$Annotation, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
```

View content of bioassay result slot (restricted to first 500 rows):

```{r runDrugTarget_Annot_Bioassay_result2, eval=TRUE, message=FALSE}
datatable(drug_target_list$Bioassay[1:500,], options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
```

## Drug-Target Frequency 

The following generates a summary table containing drug-target frequency information.

```{r DrugTargetSummary, eval=TRUE, message=FALSE}
df <- drug_target_list$Annotation
df[,"GeneName"] <- gsub("Query_", "", as.character(df$GeneName))
stats <- tapply(df$CHEMBL_CMP_ID, as.factor(df$GeneName), function(x) unique(x))
stats <- sapply(names(stats), function(x) stats[[x]][nchar(stats[[x]]) > 0])
stats <- sapply(names(stats), function(x) stats[[x]][!is.na(stats[[x]])])
statsDF <- data.frame(GeneNames=names(stats), Drugs=sapply(stats, paste, collapse=", "), N_Drugs=sapply(stats, length))
```

Print drug-target frequency table.

```{r DrugTargetSummary2, eval=TRUE, message=FALSE}
datatable(statsDF, options = list(scrollX = TRUE, scrollY="150px", autoWidth = TRUE))
```

## Write Results to Tabular Files

Both the annotation and bioassay data of the `drug_target_list` object can
be exported to separate tabular files as follows.

```{r export_annotation_bioassay, eval=FALSE, message=FALSE}
write.table(drug_target_list$Annotation, "DrugTargetAnnotation.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")
write.table(drug_target_list$Bioassay, "DrugTargetBioassay.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")
write.table(statDF, "statDF.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")
```

# Session Info

```{r sessionInfo}
sessionInfo()
```

# References