AnnotationBustR Tutorial

Samuel R. Borstein

2017-03-02

1: Introduction

This is a tutorial for using the R package AnnotationBustR. AnnotationBustR reads in sequences from GenBank and allows you to quickly extract specific parts and write them to FASTA files given a set of search terms. This is useful as it allows users to quickly extract parts of concatenated or genomic sequences based on GenBank features and write them to FASTA files, even when feature annotations for homologous loci may vary (i.e. gene synonyms like COI, COX1, COXI all being used for cytochrome oxidase subunit 1).

In this tutorial we will cover the basics of how to use AnnotationBustR to extract parts of a GenBank sequences.This is considerably faster than extracting them manually and more accurate than using alignment methods like BLAST. For example, it is possible to extract into FASTA files every subsequence from a mitochondrial genome (38 sequences, 13 CDS, 22 tRNA, 2rRNA, 1 D-loop) in 26-36 seconds, which is significantly faster than if you were to do it manually from the online GenBank features table. In this tutorial, we will discuss how to install AnnotationBustR, the basic AnnotationBustR pipeline, and how to use the functions that are included in AnnotationBustR.

2: Installation

2.1: Installation From CRAN

In order to install the stable CRAN version of the AnnotationBustR package:

install.packages("AnnotationBustR")

2.2: Installation of Development Version From GitHub

While we recommend use of the stable CRAN version of this package, we recommend using the package devtools to temporarily install the development version of the package from GitHub if for any reason you wish to use it :

#1. Install 'devtools' if you do not already have it installed:
install.packages("devtools")

#2. Load the 'devtools' package and temporarily install the development version of 
#'AnnotationBustR' from GitHub:
library(devtools)
dev_mode(on=T)
install_github("sborstein/AnnotationBustR")  # install the package from GitHub
library(AnnotationBustR)# load the package

#3. Leave developers mode after using the development version of 'AnnotationBustR' so it will not #remain on your system permanently.
dev_mode(on=F)

3: Using AnnotationBustR

To load AnnotationBustR and all of its functions/data:

library(AnnotationBustR)

It is important to note that most of the functions within AnnotationBustR connect to sequence databases and require an internet connection.

3.0: AnnotationBustR Work Flow

Before we begin a tutorial on how to use AnnotationBustR to extract sequences, lets first discuss the basic workflow of the functions in the package (Fig. 1). The orange box represents the step that occur outside of using AnnotationBustR. The only step you must do outside of AnnotationBustR is obtain a target list of accession numbers. This can be done either by downloading the accession numbers themselves from GenBank (http://www.ncbi.nlm.nih.gov/nuccore) or using R packages like ape, seqinr and rentrez to find accessions of interest in R. All boxes in blue in the graphic below represent steps that occur using AnnotationBustR. Boxes in green represent steps that are not mandatory, but may prove to be useful features of AnnotationBustR. In this tutorial, we will go through the steps in order, including the optional steps to show how to fully use the AnnotationBustR package.

Fig. 1: AnnotationBustR Workflow. Steps in orange occur outside the package while steps in blue are core parts of AnnotationBustR and steps in green represent optional steps

Fig. 1: AnnotationBustR Workflow. Steps in orange occur outside the package while steps in blue are core parts of AnnotationBustR and steps in green represent optional steps

3.1:(Optional Step) Finding the Longest Available

AnnotationBustR’s FindLongestSeq function finds the longest available sequence for each species in a given set of GenBank accession numbers. All the user needs is to obtain a list of GenBank accession numbers they would like to input. The only function argument for FindLongestSeq is Accessions, which takes a vector of accession numbers as input. We can run the function below by:

#Create a vector of GenBank nucleotide accession numbers. In order this contains accessions for 
#two humans, a chimpanzee, and a gorilla. We would expect to get only 3 accessions out of this 
#function, the longest accession found for the two humans.

my.accessions<-c("KX702233.1","KT725960.1","JN191205.1","KF914214")#make the vector
my.longest.seqs<-FindLongestSeq(my.accessions)#Run the FindLongestSeq function
my.longest.seqs#return the longest seqs found

In this case we can see that the function worked and, for the two human accessions, only returned accession KT725960.1 (16570 bp) which was longer than accession KX702233.1 (16569 bp). The table returns a three-column data frame with the species name, the corresponding accession number, and the length.

3.2: Load a Data Frame of Search Terms of Gene Synonyms to Search With:

AnnotationBustR works by searching through the annotation features table for a locus of interest using search terms for it (i.e. possible synonyms it may be listed under). These search terms are formatted to have three columns:

Below (Figure 2) is an example of where these corresponding items would be in the GenBank features table:
Fig. 2: GenBank features annotation for accession G295784.1 that contains ATP8 and ATP6. The words highlighted in yellow would fall under the column of Type. Here they are both CDS. The type of sequence is always listed farthest to the left in the features table. Colors in blue indicate terms that would be placed in the Name column, here indicating that the two CDS in this example are ATP8, labeled as ATPase8 and ATP6 respectively.

Fig. 2: GenBank features annotation for accession G295784.1 that contains ATP8 and ATP6. The words highlighted in yellow would fall under the column of Type. Here they are both CDS. The type of sequence is always listed farthest to the left in the features table. Colors in blue indicate terms that would be placed in the Name column, here indicating that the two CDS in this example are ATP8, labeled as ATPase8 and ATP6 respectively.

So, if we wanted to use AnnotationBustR to capture these and write them to a FASTA, we could set up a data frame that looks like the following.

##   Locus Type    Name
## 1  ATP8  CDS ATPase8
## 2  ATP6  CDS ATPase6

While AnnotationBustR will work with any data frame formatted as discussed above, we have included in it pre-made search terms for mitochondrial DNA (mtDNA), chloroplast DNA (cpDNA), and ribosomal DNA (rDNA). These can be loaded from AnnotationBustR using:

#Load in pre-made data frames of search terms
data(mtDNAterms)#loads the mitochondrial DNA search terms
data(cpDNAterms)#loads the chloroplast DNA search terms
data(rDNAterms)#loads the ribosomal DNA search terms

These data frames can also easily be manipulated to select only the loci of interest. For instance, if we were only interested in tRNAs from mitochondrial genomes, we could easily subset out the tRNAs from the premade mtDNAterms object by:

data(mtDNAterms)#load the data frame of mitochondrial DNA terms
tRNA.terms<-mtDNAterms[mtDNAterms$Type=="tRNA",]#subset out the tRNAs into a new data frame

3.3:(Optional Step) Merge Search Terms If Neccessary

While we have tried to cover as many synonyms for genes in our pre-made data frames, it is likely that some synonyms may not be represented due to the vast array of synonyms a single gene may be listed under in the features table. To solve this we have included the function MergeSearchTerms.

For example, let’s imagine that we found a completely new annotation for the gene cytochrome oxidase subunit 1 (COI) listed as CX1. The MergeSearchTerms function only has two arguments, ..., which takes two or more objects of class data.frame and the logical Sort.Genes, which We could easily add this to other mitochondrial gene terms by:

add.name<-data.frame("COI","CDS", "CX1")#Add imaginary gene synonym for cytochrome oxidase subunit 1, CX1
colnames(add.name)<-colnames(mtDNAterms)#make the columnames for this synonym those needed for AnnotationBustR

#Run the merge search term function without sorting based on gene name.
new.terms<-MergeSearchTerms(add.name, mtDNAterms, SortGenes=FALSE)

#Run the merge search term function with sorting based on gene name.
new.terms<-MergeSearchTerms(add.name, mtDNAterms, SortGenes=TRUE)

3.4 Extract sequences with AnnotationBust

The main function of AnnotationBustR is AnnotationBust. This function extracts the sub-sequence(s) of interest from the accessions and writes them to FASTA files in the current working directory. In addition to writing sub-sequences to FASTA files, AnnotationBust also generates an accession table for all found sub-sequences written to FASTA files which can then be written to a csv file using base R write.csv. AnnotationBustR requires at least two arguments, a vector of accessions for Accessions and a data frame of search terms formatted as discussed in 3.2 and 3.3 for Terms.

Additional arguments include the ability to specify duplicate genes you wish to recover as a vector of gene names using the Duplicates argument and specifying the number of duplicate instances to extract using a numeric vector (which must be the same length as Duplicates) using the DuplicateInstances argument.

AnnotationBustR also has arguments to translate coding sequences into the corresponding peptide sequence setting the TranslateSeqs argument to TRUE. If TranslateSeqs=TRUE, users should also specify the GenBank translation code number corresponding to their sequences using the TranslateCode argument. A list of GenBank translaton codes for taxa is available here: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

An additional argument of AnnotationBust is DuplicateSpecies which when set to DuplicateSpecies=TRUE adds the accession number to the species name for the FASTA file. This can be useful for later analyses involving FASTA files as duplicate names in FASTA headers can pose problems with some programs. It is important to note that if users select DuplicateSpecies=TRUE that while FASTA file will contain species names with their respected accession number, the corresponding accession table will have a single row per species containing all the accession numbers for each sub-sequence locus found for that species seperated by a comma.

The final argument of AnnotantionBust is Prefix. This argument can either be NULL or a character vector of length 1. If Prefix is not NULL it will add the prefix specified to all FASTA files. If left as Prefix=NULL files names will just be the name of the locus.

For the tutorial, we will use the accessions we created in examples 3.1 in the object my.accessions. This is a vector that contains four accessions for mitogenomes of two humans, one chimpanzee, and one gorilla. For this example, we will use all the arguments of AnnotationBust to extract all 38 subsequences for the four accessions (22 tRNAs, 13 CDS, 2 rRNAs, and 1 D-loop). For this we will have to specify duplicates, in this case for tRNA-Leu and tRNA-Ser, which occur twice in vertebrate mitogenomes. We will translate the CDS using TranslateSeqs argument with TranslateCode=2, the code for vertebrate mitogenomes. Because we have to accessions for the same species (Humans), we will specify DuplicateSpecies=TRUE and have out FASTA files output with the prefix “Demo” by specifying Prefix="Demo". This will create 38 FASTA files for all mitochondrial loci in the mitochondrial genome in the working director.

#run AnnotationBust function for two duplicate tRNA genes occuring twice and translate CDS
my.seqs<-AnnotationBust(my.accessions, mtDNAterms,Duplicates=c("tRNA_Leu","tRNA_Ser"), DuplicateInstances=c(2,2), TranslateSeqs=TRUE, TranslateCode=2, DuplicateSpecies=TRUE, Prefix="Demo")

#We can return the accession table. It has our two accessions for Homo sapiens in the same row 
my.seqs#retutn the accession table
write.csv(my.seqs, file="AccessionTable.csv")#Write the accession table to a csv file

3.5 Troubleshooting

As AnnotationBustR is dependent on access to online databases, it is important that you make sure you have a reliable internet connection while using the package, especially for the FindLongestSeq and AnnotationBust functions. Additionally, as these functions require access to GenBank and ACNUC, if you have internet access yet are still haivng isues, check that these sites and there servers are not down or undergoing maintainence, which they do occasionally.

The AnnotationBust function may also provide the following warnings and report the following in the accessions table:

“Acc. # Not Found” = The supplied accession number does not seem to exist, check that it is correct.

“Acc. # Not On ACNUC GenBank” = Sometime ACNUC is behind a day or two from NCBI GenBank. To double check the date for the current version of the ACNUC database you can run the following code:

ACNUC.GB.Info<-seqinr::choosebank("genbank", infobank=T)
ACNUC.GB.INFO#return info on date

“type not fully Ann.” = Indicates that the sequence is on GenBank, but is lacking necessary annotation information to fully extract all subsequences. They may be concatenated and not contain position information for each subsequence.

4: Final Comments

Further information on the functions and their usage can be found in the helpfiles help(package=AnnotationBustR). For any further issues and questions send an email with subject ‘AnnotationBustR support’ to sborstei@vols.utk.edu or post to the issues section on GitHub(https://github.com/sborstein/AnnotationBustR/issues).