1 About the template

This section provides general description and how to use this BLAST workflow. In the actual analysis report, this section is usually removed.

This BLAST workflow template is based on the BLAST based R package rBLAST.

  • The BLAST software can be downloaded from NCBI. Please make sure it can be run from command-line.
  • rBLAST can be installed with install.packages('rBLAST', repos = 'https://mhahsler.r-universe.dev').

This workflow does: 1. Validate the BLAST installation 2. BLAST input fasta file with a reference genome. 3. BLAST input with a certain database 4. BLAST sequence with general databases to find out the source organism(s).

All are written in R (Linewise) steps, but BLAST+ must be installed.

2 Introduction

Users want to provide here background information about the design of their BLAST project.

This report describes the analysis of a BLAST project studying …

2.1 Experimental design

Typically, users want to specify here all information relevant for the analysis of their BLAST study. This includes detailed descriptions of files, experimental design, reference genome, gene annotations, etc.

3 Workflow environment

To create a Workflow within systemPipeR, we can start by defining an empty container and checking the directory structure:

library(systemPipeR)
sal <- SPRproject()
sal

To load the entire workflow with default settings, run

sal <- importWF(sal, "SPblast.Rmd")

3.1 Load packages

cat(crayon::blue$bold("To use this workflow, following R packages are expected:\n"))
cat(c("'rBLAST", "readr\n"), sep = "', '")
### pre-end
appendStep(sal) <- LineWise(code = {
    library(systemPipeR)
    library(rBLAST)
}, step_name = "load_packages")

3.2 Test BLAST install

Molecules can be loaded or downloaded. This example dataset has 100 molecules.

# Here, the dataset is downloaded. If you already have the
# data locally, change URL to local path.
appendStep(sal) <- LineWise(code = {
    # If you have a modular system, use following line
    moduleload("ncbi-blast")
    # If not, comment out line above you need to install
    # BLAST and configure the PATH.
    blast_check <- tryCMD("blastn", silent = TRUE)
    if (blast_check == "error")
        stop("Check your BLAST installation path.")
}, step_name = "test_blast", dependency = "load_packages")

3.3 Load query sequence

Load query sequence from a fasta file.

In this template, an example fasta is provided, with 10 sequences from Arabidopsis, Cholera, Human, Mouse, and COVID-19, 2 for each.

appendStep(sal) <- LineWise(code = {
    query <- readDNAStringSet("data/example.fasta")
}, step_name = "load_query", dependency = "test_blast")

3.4 BLAST against reference genome

In this step, we are trying to BLAST the query sequences to a reference genome and see if this genome contains the whole or part of the sequences.

In this example, a minimized tair10 genome is used. In the real analysis, please replace it with a full genome fasta file.

appendStep(sal) <- LineWise(code = {
    reference <- "data/tair10.fasta"
    # this command prepare BLAST-able database of genome
    makeblastdb(reference, dbtype = "nucl")
}, step_name = "build_genome_db", dependency = "load_query")

Next BLAST is performed. Since there are only 2 Arabidopsis sequences in the example fasta. Only these two sequences are expected to return statistically meaningful BLAST results.

appendStep(sal) <- LineWise(code = {
    bl_tair10 <- blast(db = reference, type = "blastn")
    cl_tair10 <- predict(bl_tair10, query)
    readr::write_csv(cl_tair10, "results/blast_tair10.csv")
}, step_name = "blast_genome", dependency = "build_genome_db")

3.5 BLAST existing databases

There are plenty of databases on NCBI that one could download and run BLAST on. Once the databases are downloaded, unzip all files into one directory. We need to provide the path to the database.

In this example, we want to know if COVID-19 is a beta coronavirus. Then, we can use some COVID sequence to BLAST all other existing beta coronavirus sequences and find the similarity. This resource is downloadable from NCBI. All downloaded Betacoronavirus.XX.tar.gz files are unzipped to /srv/projects/db/ncbi/preformatted/20220131/. Please change the path according to your project. Then, we can BLAST the last two sequence against the database.

appendStep(sal) <- LineWise(code = {
    bl_covid <- blast(db = "/srv/projects/db/ncbi/preformatted/20220131/Betacoronavirus",
        type = "blastn")
    cl_covid <- predict(bl_covid, query[9:10])
    readr::write_csv(cl_covid, "results/blast_covid.csv")
}, step_name = "blast_db", dependency = "load_query")

3.6 BLAST to general databases

Sometimes we do not know the origin of a sequence, for example, a sequence comes from a contaminated sample, and we want to know the source. In such cases, we would need to BLAST the sequence to a more generic database. The most generic nucleotide BLAST database is the nt database.

This database is extremely big and requires giant RAM and CPU cores to run. Please do not run the following example unless your system admin has provided you such store space and computational power. A better way for average the user is to use the website https://blast.ncbi.nlm.nih.gov/blast/Blast.cgi/ . The engine over there is optimized and can quickly search for the species information.

appendStep(sal) <- LineWise(code = {
    bl_nt <- blast(db = "/srv/projects/db/ncbi/preformatted/20220131/nt",
        type = "blastn")
    cl_nt <- predict(bl_nt, query[5])
    readr::write_csv(cl_nt, "results/blast_nt.csv")
}, step_name = "blast_nt", dependency = "load_query", run_step = "optional")

3.7 Workflow session

appendStep(sal) <- LineWise(code = {
    sessionInfo()
}, step_name = "wf_session", dependency = "blast_db")

4 Manage the workflow

To run the workflow, use runWF function. It executes all the steps store in the workflow container. The execution will be on a single machine without submitting to a queuing system of a computer cluster.

sal <- runWF(sal, run_step = "mandatory")  # remove `run_step` to run all steps to include optional steps

5 About the workflow

5.1 Tools used

To check command-line tools used in this workflow, use listCmdTools, and use listCmdModules to check if you have a modular system.

The following code will print out tools required in your custom SPR project in the report. In case you are running the workflow for the first time and do not have a project yet, or you just want to browser this workflow, following code displays the tools required by default.

if (file.exists(file.path(".SPRproject", "SYSargsList.yml"))) {
    local({
        sal <- systemPipeR::SPRproject(resume = TRUE)
        systemPipeR::listCmdTools(sal)
        systemPipeR::listCmdModules(sal)
    })
} else {
    cat(crayon::blue$bold("Tools and modules required by this workflow are:\n"))
    cat(c("bowtie2/2.4.5", "samtools/1.14", "macs2"), sep = "\n")
}
## Tools and modules required by this workflow are:
## bowtie2/2.4.5
## samtools/1.14
## macs2

5.2 Session Info

This is the session information for rendering this report. To access the session information of workflow running, check HTML report of renderLogs.

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets 
## [6] methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.33.1
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.36       R6_2.5.1           
##  [3] codetools_0.2-20    bookdown_0.40      
##  [5] fastmap_1.2.0       xfun_0.46          
##  [7] cachem_1.1.0        knitr_1.48         
##  [9] htmltools_0.5.8.1   rmarkdown_2.27     
## [11] lifecycle_1.0.4     cli_3.6.3          
## [13] sass_0.4.9          jquerylib_0.1.4    
## [15] compiler_4.4.1      tools_4.4.1        
## [17] evaluate_0.24.0     bslib_0.8.0        
## [19] yaml_2.3.10         formatR_1.14       
## [21] BiocManager_1.30.23 crayon_1.5.3       
## [23] jsonlite_1.8.8      rlang_1.1.4