\documentclass[11pt]{article} %\VignetteIndexEntry{eudysbiome User Manual} %\VignettePackage{eudysbiome} \usepackage{hyperref} \usepackage{enumerate} \textwidth=6.2in \textheight=8.7in \oddsidemargin=0.2in \evensidemargin=0.2in \headheight=0in \headsep=0in \large \begin{document} \SweaveOpts{concordance=TRUE} \title{eudysbiome User Manual} \author{Xiaoyuan Zhou, Christine Nardini \\ zhouxiaoyuan@picb.ac.cn} \maketitle \section*{Introduction} \label{sec:introduction} Large amounts of data for metagenomics, especially the earliest studies on 16S ribosomal RNA gene, are produced by high-throughput screening methods. These are processed in the form of quantitative comparisons (between two microbiomes' conditions) of reads' counts. Reads' counts are interpreted as a taxon's \texttt{abundance} in a microbial community under given conditions, such as a medical treatments or environmental changes. The comparative analysis of such microbiomes with a baseline condition permits to identify a list of microbes (classified in species, genus or higher-order taxa) that are differential in abundance among the conditions. The taxonomic classification of 16S rRNA sequences is generally dependent on two strategies to assign sequences into populations: one is the phylotype-based method that assigns sequences to taxa based on the similarity to a reference database; the other is the operational taxonomic unit (OTU)-based method which does not rely on the association to known taxa, but consists of the definition of clusters of sequences sharing high similarity among them (typically a 97\% identity threshold). These clusters are identified by an OTU\_ID, and where possible associated to a species. The phylotype-based method limits the taxonomic classification of novel sequences from previously unknown taxa and is less sensitive to the sequencing errors\cite{Schloss2011}. The OTU-based method overcomes the limitations of phylotype-based method and permits the classification of OTUs down to the species level, as typically an OTU is thought of as representing a species, and for this the OTU-based method necessarily relies on a reference data set including species. \texttt{eudysbiome} applies the OTU-based classification by mapping OTUs' unclassified \emph{representative} sequences (produced via microbiome analysis pipelines such as emph{Mothur}\cite{Schloss2009} or \emph{QIIME}\cite{Caporaso2010} during the generation of the OTUs) to a selection of classified representative sequences of OTUs, obtained from clustering the truncated \emph{SILVA}\cite{silva} Small subunit (16S/18S, SSU) ribosomal RNA (rRNA) reference dataset at 97\% similarity. This representative dataset is built within QIIME and included in our package. In this mapping, each entry in the representative dataset is associated to a taxonomic path containing the taxonomy rank designations from kingdom to species. In order to translate the GI microbiome variations into potential clinical interpretation, it is helpful to assess whether the variation leads towards a microbiome that is more or less synergistic with the host, i.e. more or less pathogenic. To achieve this goal, in addition to the standard taxonomic annotation (\texttt{classification}, from now on) described above, this package annotates species as \texttt{harmful/harmless} based on their ability to contribute to mammals' host diseases (as indicated in literature) and hence based on their pathogenic potential (\texttt{annotation} from now on). Several assumptions are needed to achieve this goal, due to the quickly growing, but still limited amount of information currently available on microbes in our GI. First, the number of unknown (uncultured, ambiguous, unclassified, etc.) species resulting from the classification process leads to the practical impossibility to assess the pathogenic potential directly on the annotated species. For this reason, the analysis is run at the genus level, annotated based on the annotated species. The rationale used throughout the annotation process relies on the assumption that larger and successful efforts have been devoted to the study of pathogens, and we have observed that classified species that are not explicitly (literature based) annotated as \texttt{harmful}, can be safely annotated as \texttt{harmless}. Genera are then further annotated as harmful, in a conservative fashion, if at least one species in the genus is pathogenic or less pathogenic otherwise. This outputs the reference table \texttt{harmGenera} offered in the package, and allows for direct annotation of classified genera. However, to achieve a more precise annotation, it is recommended to provide (or compute via our package \texttt{tableSpecies} function) the data frame genus-species, to control for the very stringent harmful annotation. In fact, if, in the data to be analyzed, species are provided in addition to the genus, the package checks whether the genera defined as harmful in the annotation table actually include the harmful species. In this case the genus is coherently annotated as harmful. However, if the harmful species are not present in the data to be analyzed, the genus provides in fact a harmless contribution, and it is annotated as such. Similarly, if the genus turns out to include none known species, we discard the one listed in the harmGenera table from the analysis while annotate the others as harmless. Finally, the package statistically measures the \texttt{eubiotic} (harmless genera increase or harmful genera decrease) or \texttt{dysbiotic} (harmless genera decrease or harmful genera increase) relevance of a given treatment or environmental change in terms of its ability to modify the harmful/harmless frequencies. \begin{itemize} \item The package requires as inputs: \begin{itemize} \item a FASTA-formatted 16S rRNA sequence file to be classified to the species level; \item the list of microbe genera, for example, a list of differential genera identified by the comparative analysis to be annotated as \texttt{harmful/harmless}; \item the microbial abundance variations, a simple difference of the differential genera abundance (defined as $\Delta$g) in the two conditions to be compared, as defined above in \hyperref[sec:introduction]{Introduction}; \item a table qualifying the genera as \texttt{harmful}/\texttt{harmless}, as defined by literature. Such a table, manually curated, is included in this package, but is by no means exhaustive: continuous advances in microbiology make this input incomplete and flexible; we encourage users to share expansions of this table. \end{itemize} \item The package outputs: \label{itm:ouput} \begin{itemize} \item Species-classification results, a \texttt{*.taxonomy} file which contains a taxonomic path for each 16S rRNA sequence and a \texttt{*.tax.summary} file which contains a taxonomic outline indicating the number of sequences that were found at each level (kingdom to species); \item optional, a two-column Genus-Species data frame extracted from the assigned taxonomic paths in \texttt{*.taxonomy} file, which includes only the differential genera (as \emph{inputs}); \item a graphical output of the genus abundance difference-$\Delta$g across the tested conditions (y-axis) and their harmful/harmless nature (negative/positive x-axis); \item a contingency table showing as frequencies the cumulated contributions to an eubiotic/dystbiotic microbiome (see Table~\ref{tab:contingency}, columns, namely \texttt{EI} and \texttt{DI}) under different conditions (comparisons between a condition and a reference, listed in rows, namely \texttt{C1} and \texttt{C2}). The eubiotic impact (EI) is quantified by the |$\Delta$g| cumulation of increasing harmless genera and decreasing harmful genera, while the dysbiotic impact (DI) is quantified by the reverse, i.e. |$\Delta$g| accumulation of decreasing harmless genera and increasing harmful genera; \item the results (probability) of testing the null hypothesis that there is no difference in the proportions of frequencies of \texttt{EI} between \texttt{C1} and \texttt{C2} using Fisher exact test (two sided) or Chi-squared test\cite{Rice2007}, computed as the probability that the proportion of frequencies in \texttt{EI} under \texttt{C1} ($\frac{a}{a+b}$) is different from that in \texttt{DI} under \texttt{C2} ($\frac{c}{c+d}$). The results of the one-sided Fisher's exact test\cite{Rice2007} assess whether \texttt{C1} is more likely to be associated to a eubiotic microbiome than \texttt{C2}, and is computed as the probability that the proportion of EI under \texttt{C1} is higher than \texttt{C2}. \end{itemize} \end{itemize} \begin{table} \centering \begin{tabular}{c c c c} \hline\hline Comparison & EI & DI & \emph{Row Total} \\ [0.5ex] \hline C1 & \emph{a} & \emph{b} & \emph{a+b} \\ C2 & \emph{c} & \emph{d} & \emph{c+d} \\ \hline \emph{Column Total} & \emph{a+c} & \emph{b+d} & \emph{a+b+c+d(=n)} \\ \hline \end{tabular} \caption{Contingency Table} \label{tab:contingency} \end{table} \section*{Methods} \section{Representative Sequences} \label{sec:repre} To achieve the Species-level classification, we recommend classifying the unknown 16S rRNA sequences to a well-curated representative dataset of 16S rRNA reference sequences, such as Greengene and SILVA representative sets (as recommended by \emph{Mothur} with very stringent 99\% similarity and \emph{QIIME} with 97\% similarity ). We here use the SILVA representative set created by clustering at 97\% sequence identity, to guarantee a fast Species-level classification and also require less computational resources when assigning sequences to a reference dataset, both requirements are crucial to allow automation of this classification step, as we offer in this package. The representative dataset ``Silva\_119\_rep\_set97.fna" is downloaded in latest version SILVA119 provided by QIIME team from (\href{https://www.arb-silva.de/no_cache/download/archive/qiime}{https://www.arb\-silva.de/no\_cache/download/archive/qiime}). A taxonomic mapping file ``Silva\_119\_rep\_set97\_taxonomy.txt", mapping each entry in the representative dataset to a taxonomy rank designation, was also downloaded and prepared into the input format to \emph{Mothur}, which is included in our package for the usage of further sequence classification. \section{Taxonomy Assignment} \label{sec:assign} The assignment requires a FASTA-formatted input of unclassified sequence, a representative sequence file and a taxonomic mapping file for the representative sequences(Section~\ref{sec:repre}). Given a set of unclassified 16S rRNA sequences, e.g. a set of OTU representative sequences, we assign the taxonomic paths to these sequences by calling the \texttt{classify.seqs} command in \emph{Mothur} (\href{http://www.mothur.org/}{http://www.mothur.org/}). Of the two alternative methods (\texttt{Wang} and \texttt{k-Nearest Neighbor (knn)}) provided in the \texttt{classify.seqs} command for the taxonomic assignment, we use \texttt{\emph{Wang}}'s, implemented by the RDP classifier. This method queries both the unclassified and reference sequences k-mer by k-mer (subsequences of length k) and assigns the unclassified sequences to the appropriate taxa based on the highest matching probability. To calculate the confidence of the assignments, bootstrapping by random replacement of 1/8 (k = 8) of the k-mers in the unclassified sequence is used. <>= library("eudysbiome") input.fasta = "Unclassified.fasta" # using the extracted fasta and taxonomy as template assignTax(fasta = input.fasta,ksize = 8, iters = 100, cutoff = 80, processors=1, dir.out = "assignTax_out") @ The parameters, \texttt{k-size} (length k), \texttt{iterates} (bootstrap iterations) and \texttt{processors} (number of central processing units) are used as defaults in \emph{Mothur}. We set a cutoff of bootstrap confidence score to 80, which means a minimum 80\% sequences were assigned to the same taxonomy, a higher value gives a more strict and accurate taxonomy assignment. A \texttt{*.taxonomy} file and a \texttt{*.tax.summary} file of the classification results are outputd into the \texttt{assginTax} directory. \newpage \section{Genus-Species Table Construction} \label{sec:speciesTable} To identify only the species under certain genera from the \texttt{*.taxonomy} file, the \texttt{tableSpecies} function constructs a two-column Genus-Species data frame, where one column refers to the provided genera while the other refers to the species included in these genera. <<>>= genera = c("Lactobacillus","Bacteroides") #species = tableSpecies(tax.file = "*.taxonomy", microbe = genera) @ \section{Microbe Annotation} \label{sec:microAnnotate} A differential genera list (input) can be annotated as \texttt{harmless} or \texttt{harmful} by the function \texttt{microAnnotate} based on our manually curated table named \texttt{harmGenera} in this package. The table lists the harmful genera based on the pathogenic or opportunistic pathogenic species included in the genera, using a stringent approach: one armful species is sufficient to define the genus harmful (this is what indeed matters to the eubiotic/dysbiotic trend). Although a genus list is acceptable and can be processed with this genera annotation table, we recommend inputting for the data to be analyzed the Genus-Species data frame, as in the \texttt{diffGenera} table below to gain a more accurate annotation. In fact, if the species abundances are known, it is possible to discard a genus in case none of the taxonomically annotated species is present in the dataset (only unknown ones), or mark as harmless a genus that would be harmful by annotation table, in case the harmful species is not present in the dataset under study. For example, genus1 will be annotated as harmful if any of the three species (1, 2 and 3) under this genus is annotated as \texttt{harmful}, otherwise, genus1 will be annotated as \texttt{harmless}. For the data lacking of Species-level classifications, we suggest to do the classification and construct such a Genus-Species table for better annotation by functions described above (Section~\ref{sec:speciesTable}). <<>>= library("eudysbiome") data(diffGenera) head(diffGenera) data(harmGenera) annotation = microAnnotate(diffGenera, annotated.micro = harmGenera) @ \section{Cartesian Plane Plot} \label{sec:Cartesian} The function \texttt{Cartesian} accepts either a data frame or a numeric matrix of $\Delta$g, whose rows represent differential genera and columns represent condition comparisons, these are the argument to produce the cartesian plane (4 quadrants (see details below and in Figure~\ref{fig:carte1} below). The $\Delta$gs are log-2 converted and redundantly represented by the height on the y-axis and the dots diameter. Because of its definition, the increase of harmless (1st cartesian quadrant) and/or the decrease of harmful (3rd cartesian quadrant) define microbiome variation that are eubiotic (beneficial) and highlighted by a blue box, and the decrease of harmless (2nd quadrant) and/or the increase of harmful (4th quadrant) as dysbiotic (non-beneficial) and highlighted by a yellow box. The unknown genera are removed from the plot. For example below, a data frame \texttt{data} is constructed from the \texttt{microDiff} dataset with $\Delta$g of ten differential genera among comparisons \texttt{A vs C}, \texttt{B vs C} and \texttt{D vs C}, where \texttt{A}, \texttt{B} and \texttt{D} are three conditions and \texttt{C} is a control. The genera are annotated as \texttt{harmless}, \texttt{harmful} or \texttt{unknown} in \texttt{micro.ano} based on the output by the \texttt{microAnnotate} function, and comparisons are defined as \texttt{A-C} (A vs C), \texttt{B-C} (B vs C), and \texttt{D-C} (D vs C) in \texttt{comp.ano} and indicated by the column names of the input data if no other \texttt{comp.anno} is specified. Eubiotic changes associated to conditions \texttt{A, B, D} compared to control \texttt{C} are plotted in the up-utmost right and bottom-utmost left quadrants (increase of harmless and decrease of harmful genera) and dysbiotic variations are plotted on the bottom-utmost right and up-utmost left quadrants (increase of harmful and decrease of harmless genera) in Figure~\ref{fig:carte1}. <>= data(microDiff) microDiff attach(microDiff) par(mar = c(5,4.1,5.1,5)) Cartesian(data ,micro.anno = micro.anno,comp.anno= comp.anno, unknown=TRUE,point.col = c("blue","purple","orange")) @ \begin{figure} \centering \includegraphics[height=0.9\textwidth, width=0.9\textwidth]{Cartesian} \caption{\label{fig:carte1} Cartesian plane of the harmful/harmless annotated genera (on the x-axis) and their abundance variations among the condition comparisons (log2 ($\Delta$g), y-axis). The eubiotic microbiome impact is highlighted by a dark blue box while the dysbiotic one is highlighted by a yellow box.} \end{figure} \newpage \section{Contingency Table Construction} This function computes the frequencies of the contingency table as the cumulated |$\Delta$g| classified by each couple formed by a condition and an impact (eubiotic/dysbiotic, see Table~\ref{tab:contingency}). This outputs the significance of the association (contingency) between conditions and impacts by \texttt{contingencyTest}. For example, the benefits of conditions \texttt{A, B, D} are measured by the increase $\Delta$g of harmless genera and the decrease $\Delta$g of harmful genera in the comparisons to \texttt{C}, while the non-beneficial impact is evaluated in reverse by the decrease $\Delta$g of harmless genera and the increase $\Delta$g of harmful genera. Absolute values of $\Delta$g are cumulated as frequencies and used into the contingency table (Table~\ref{tab:Count}). <>= microCount = contingencyCount(data ,micro.anno = micro.anno, comp.anno= comp.anno) @ \begin{table} \centering \begin{tabular}{c c c} \hline\hline Condition & Eubiotic Impact & Dysbiotic Impact \\ [0.5ex] \hline A-C & 2068 & 315 \\ \hline B-C & 2270 & 313 \\ \hline D-C & 1369 & 264 \\ \hline \end{tabular} \caption{Condition-impact contingency table of microbial frequencies} \label{tab:Count} \end{table} \section{Contingency test for count data} To elaborate the significance of the association between conditions and eubiotic/dysbiotic impacts, Chi-squared test and Fisher's exact test (one- and two- sided) are performed on the frequencies from \texttt {contingencyCount} for testing the null hypothesis that conditions are equally likely to lead to a more eubiotic microbiome when compared to the control while the alternative hypothesis is that this probability is not equal or one condition is more likely to be associated to an eubiotic microbiomes than the other (only with Fisher test, one-sided). Taking Table~\ref{tab:Count} as an example, we hypothesize that the proportion of eubiotic frequencies are different (Chi-squared and two-sided Fisher test) between condition comparisons \texttt{A-C}, \texttt{B-C} and \texttt{D-C} or even higher (one-sided Fisher test) in one comparison than the other, and we want to test whether this difference is negligible or refers to a significant association between the condition and the (GI) microbiome composition modification. Both Fisher and Chi-squared tests are performed by the \texttt{contingencyTest} function and significance values are output in tables. <>= microTest = contingencyTest(microCount,alternative ="greater") microTest["Chisq.p"] microTest["Fisher.p"] @ \newpage \begin{thebibliography}{9} \bibitem{Schloss2011} Patrick D. Schloss, et al. Assessing and Improving Methods Used in Operational Taxonomic Unit-Based Approaches for 16S rRNA Gene Sequence Analysis. \emph{Applied and Environmental Microbiology} 2011; 77(10): p. 3219-3226 \bibitem{Schloss2009} Patrick D. Schloss, et al. Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. \emph{Applied and environmental microbiology} 2009; 75(23): 7537-7541. \bibitem{Caporaso2010} J Gregory Caporaso, et al. QIIME allows analysis of high-throughput community sequencing data. \emph{Nature Methods} 2010; 7(5):335-336. \bibitem{silva} Quast C.,et al. The SILVA ribosomal RNA gene database project:improved data processing and web-based tools. \emph{Nucleic Acids Research} 2013; 41(D1):D590-D596 \bibitem{Rice2007} Rice, John A., Mathematical statistics and data analysis, Belmont, CA, Thomson/Brooks/Cole, Duxbury advanced series, 3rd, 2007. \end{thebibliography} \section*{Session Information} The session information records the versions of all the packages used in the generation of the present document. \scriptsize{ <>= toLatex(sessionInfo()) @ } \end{document}