---
title: "rBLAST: R Interface for the Basic Local Alignment Search Tool"
author: "Michael Hahsler and Anurag Nagar"
vignette: >
  %\VignetteIndexEntry{rBLAST: R Interface for the Basic Local Alignment Search Tool}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  rmarkdown::html_vignette
---


```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

```{r setup}
library("rBLAST")
```


```{r, include = FALSE}
# only run code if blast+ is installed
run <- has_blast()
```

```{r, echo = FALSE, eval = !run, results='asis'}
cat("**Note: BLAST was not installed when this vignette was built. Some output is not available in this vignette!**")
```

# Introduction

This package provides an R interface to use 
a local installation of 
the Basic Local Alignment Search Tool (BLAST) executable. This allows the user
to download BLAST databases for querying or to create their own database from 
sets of sequences.
The interface integrates BLAST search directly with the Bioconductor 
infrastructure by using
the `XStringSet` (e.g., `RNAStringSet`) from the package `Biostrings`.

This package complements the function in `blastSequences()` in package `annotate` 
that runs a BLAST query not locally but by connecting to the NCBI server. 

# System requirements
The BLAST+ software needs to be installed on your system. Installation
instructions are available in this package's
[INSTALL](https://github.com/mhahsler/rBLAST/blob/devel/INSTALL) file and
at \url{https://www.ncbi.nlm.nih.gov/books/NBK569861/}.

R needs to be able to find the executable. After installing the software,
try in R
```{r, eval=run}
Sys.which("blastn")
```

If the command returns "" instead of the path to the executable,
then you need to set the environment variable called PATH. In R
```{r, eval=run}
Sys.setenv(PATH = paste(Sys.getenv("PATH"),
   "path_to_your_BLAST_installation", sep=.Platform$path.sep))
```

# Examples

## Use an existing database

You can download pretrained databases from NCBI at https://ftp.ncbi.nlm.nih.gov/blast/db/. 
Here we  download the 16S rRNA database. To avoid multiple downloads of the same file, we
use BiocFileCache in function `blast_db_cache` for download. The database compressed and 
needs to be expanded.

```{r, eval=run}
## download the 16S Microbial rRNA data base from NCBI
tgz_file <- blast_db_get("16S_ribosomal_RNA.tar.gz")
untar(tgz_file, exdir = "16S_rRNA_DB")
```

The extracted database consists of a folder with a set of database files.

```{r, eval=run}
list.files("./16S_rRNA_DB/")
```

Next, we open the database for querying using the `blast()` function which 
returns a BLAST database object.
```{r, eval=run}
bl <- blast(db = "./16S_rRNA_DB/16S_ribosomal_RNA")
bl
```

To demonstrate how to query the database, we read one sequence from an example 
FASTA file that 
is shipped with the package. Queries are performed using the `predict()` function. The result is a 
`data.frame` with one row per match. We will show the first 5 matches.

```{r, eval=run}
seq <- readRNAStringSet(system.file("examples/RNA_example.fasta",
    package = "rBLAST"
))[1]
seq

cl <- predict(bl, seq)
nrow(cl)
cl[1:5, ]
```

Additional arguments for BLAST can be passed on using the `BLAST_args`
parameter for `predict()`.
The output format
can be specified using `custom_format`. In the following, we specify a custom format and 
that the sequences need to have 99% identity. 
See the BLAST Command Line Applications User Manual for details 
(https://www.ncbi.nlm.nih.gov/books/NBK279690/).

```{r, eval=run}
fmt <- paste(
    "qaccver saccver pident length mismatch gapopen qstart qend",
    "sstart send evalue bitscore qseq sseq"
)
cl <- predict(bl, seq,
    BLAST_args = "-perc_identity 99",
    custom_format = fmt
)
cl
```


```{r, include = FALSE, eval=run}
## clean up
unlink("16S_rRNA_DB", recursive = TRUE)
```

## Create a custom BLAST database

The `makeblastdb` utility can be used to create a BLAST database from a 
FASTA file with sequences.
The package provides an R interface function of the same name. As an example,
we will create a searchable database from a sequence FASTA file shipped with the package. 

```{r}
seq <- readRNAStringSet(system.file("examples/RNA_example.fasta",
    package = "rBLAST"
))
seq
```

First, we write a FASTA file that will be used by blast to create the database.
```{r}
writeXStringSet(seq, filepath = "seqs.fasta")
```

Next, we create a BLAST database from the file. We need to specify that
the sequences contains RNA and thus nucleotides.
```{r, eval=run}
makeblastdb("seqs.fasta", db_name = "db/small", dbtype = "nucl")
```

Note that it is convenient to specify a folder and a name separated by 
a `/` to organize all index files in a folder. 

We can now open the database and use it for queries.
```{r, eval=run}
db <- blast("db/small")
db
```

Check if it finds a 100 nucleotide long fragment from the first sequence 
in the training data.

```{r, eval=run}
fragment <- subseq(seq[1], start = 101, end = 200)
fragment

predict(db, fragment)
```

We see that the found sequence ID (`ssequid`) matches the query sequence ID
(`qsequid`) and that it matches the correct region in the sequence 
(see `sstart` and `send`).

To permanently remove a database, the folder can be deleted.
```{r}
unlink("seqs.fasta")
unlink("db", recursive = TRUE)
```

# SessionInfo

```{r}
sessionInfo()
```