%\VignetteIndexEntry{Introduction to MutationalPatterns}
\documentclass{article}
\usepackage{float}
\usepackage[natbibapa]{apacite}
\bibliographystyle{apacite}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex(use.unsrturl=FALSE)
@

\title{Introduction to \Biocpkg{MutationalPatterns}}
\author{Francis Blokzijl}
\author{Roel Janssen}
\author{Ruben van Boxtel}
\author{Edwin Cuppen}
\affil{University Medical Center Utrecht, Utrecht, The Netherlands}
\date{\today}

\begin{document} 

\maketitle

\tableofcontents

\newpage{}

<<options, echo=FALSE>>=
options(width=96)
library(ggplot2)
@

\section{Introduction}

Mutational processes leave characteristic footprints in genomic DNA. This
package provides a comprehensive set of flexible functions that allows
researchers to easily evaluate and visualize a multitude of mutational patterns
in base substitution catalogues of e.g. tumour samples or DNA-repair deficient
cells. The package covers a wide range of patterns including: mutational
signatures, transcriptional and replicative strand bias, genomic distribution
and association with genomic features, which are collectively meaningful for
studying the activity of mutational processes. The package provides
functionalities for both extracting mutational signatures \emph{de novo} and
determining the contribution of previously identified mutational signatures on
a single sample level. MutationalPatterns integrates with common R genomic
analysis workflows and allows easy association with (publicly available)
annotation data.

Background on the biological relevance of the different mutational patterns, a
practical illustration of the package functionalities, comparison with similar
tools and software packages and an elaborate discussion, are described in the
MutationalPatterns article, of which a preprint is available at bioRxiv:
\url{https://doi.org/10.1101/071761}

\newpage{}
\section{Data}

To perform the mutational pattern analyses, you need to load one or multiple
VCF files with single-nucleotide variant calls and the corresponding reference
genome.
 
\subsection{List reference genome}

List available genomes using \Biocpkg{BSgenome}:

<<loading_reference_data>>=
library(BSgenome)
head(available.genomes())
@

Download and load your reference genome of interest:

<<loading_reference_data>>=
ref_genome <- "BSgenome.Hsapiens.UCSC.hg19"
library(ref_genome, character.only = TRUE)
@

\subsection{Load example data}

We provided an example data set with this package, which consists of a subset of
somatic mutation catalogues of 9 normal human adult stem cells from 3 different
tissues \citep{Blokzijl2016}.

Load the MutationalPatterns package:

<<load_package>>=
library(MutationalPatterns)
@

Locate the VCF files of the example data:
<<locate_files>>=
vcf_files <- list.files(system.file("extdata", package="MutationalPatterns"),
                        pattern = ".vcf", full.names = TRUE)
@ 

Define corresponding sample names for the VCF files:
<<set_sample_names>>=
sample_names <- c(
    "colon1", "colon2", "colon3",
    "intestine1", "intestine2", "intestine3",
    "liver1", "liver2", "liver3")
@

Load the VCF files into a \texttt{GRangesList}:
<<read_vcfs_as_granges>>=
vcfs <- read_vcfs_as_granges(vcf_files, sample_names, ref_genome)
summary(vcfs)
@

Define relevant metadata on the samples, such as tissue type:
<<store_tissue_variable>>=
tissue <- c(rep("colon", 3), rep("intestine", 3), rep("liver", 3))
@


\section{Mutation characteristics}

\subsection{Base substitution types}

We can retrieve base substitutions from the VCF GRanges object as "REF>ALT"
using \Rfunction{mutations\_from\_vcf}:

<<mutations_from_vcf>>=
muts = mutations_from_vcf(vcfs[[1]])
head(muts, 12)
@

We can retrieve the base substitutions from the VCF GRanges object and convert
them to the 6 types of base substitution types that are distinguished by
convention: C>A, C>G, C>T, T>A, T>C, T>G. For example, when the reference
allele is G and the alternative allele is T (G>T), \Rfunction{mut\_type}
returns the G:C>T:A mutation as a C>A mutation:

<<mut_type>>=
types = mut_type(vcfs[[1]])
head(types, 12)
@

To retrieve the sequence context (one base upstream and one base downstream) of
the base substitutions in the VCF object from the reference genome, you can use
the \Rfunction{mut\_context} function:

<<mut_context>>=
context = mut_context(vcfs[[1]], ref_genome)
head(context, 12)
@

With \Rfunction{type\_context}, you can retrieve the types and contexts
for all positions in the VCF GRanges object. For the base substitutions that are
converted to the conventional base substitution types, the reverse complement of
the sequence context is returned.

<<type_context>>=
type_context = type_context(vcfs[[1]], ref_genome)
lapply(type_context, head, 12)
@

With \Rfunction{mut\_type\_occurrences}, you can count mutation type
occurrences for all VCF objects in the \texttt{GRangesList}. For
C>T mutations, a distinction is made between C>T at CpG sites and other
sites, as deamination of methylated cytosine at CpG sites is a common mutational
process. For this reason, the reference genome is needed for this functionality.

<<mut_type_occurrences>>=
type_occurrences <- mut_type_occurrences(vcfs, ref_genome)
type_occurrences
@

\subsection{Mutation spectrum}

A mutation spectrum shows the relative contribution of each mutation type in
the base substitution catalogs. The \Rfunction{plot\_spectrum} function plots
the mean relative contribution of each of the 6 base substitution types over
all samples. Error bars indicate standard deviation over all samples. The total
number of mutations is indicated.

<<plot_spectrum>>=
p1 <- plot_spectrum(type_occurrences)
@

Plot the mutation spectrum with distinction
between C>T at CpG sites and other sites:

<<plot_spectrum_2>>=
p2 <- plot_spectrum(type_occurrences, CT = TRUE)
@

Plot spectrum without legend:
<<plot_spectrum_3>>=
p3 <- plot_spectrum(type_occurrences, CT = TRUE, legend = FALSE)
@

The gridExtra package will be used throughout this vignette to combine multiple
plots:
<<combine_plot_spectrum_noeval, eval=FALSE>>=
library("gridExtra")
grid.arrange(p1, p2, p3, ncol=3, widths=c(3,3,1.75))
@
<<combine_plot_spectrum, echo=FALSE>>=
library("gridExtra")
ggsave("combine_plot_spectrum.pdf",
       grid.arrange(p1, p2, p3, ncol=3, widths=c(3,3,1.75)),
       width=10,
       height=3)
@

\begin{figure}[H]
\begin{center}
\includegraphics[width=1.0\textwidth]{combine_plot_spectrum}
\end{center}
\end{figure}

You can facet the per sample group, e.g. plot the spectrum for each tissue
separately:
<<plot_spectrum_4>>=
p4 <- plot_spectrum(type_occurrences, by = tissue, CT = TRUE, legend = TRUE)
@

Define your own 7 colors for spectrum plotting:
<<plot_spectrum_5>>=
palette <- c("pink", "orange", "blue", "lightblue", "green", "red", "purple")
p5 <- plot_spectrum(type_occurrences, CT=TRUE, legend=TRUE, colors=palette)
@

<<combine_plot_spectrum_2_noeval, echo=FALSE>>=
ggsave("combine_plot_spectrum_2.pdf", 
       grid.arrange(p4, p5, ncol=2, widths=c(4,2.3)), 
       width=10, 
       height=3)
@

<<combine_plot_spectrum_2, eval=FALSE>>=
grid.arrange(p4, p5, ncol=2, widths=c(4,2.3))
@

\begin{figure}[H]
\begin{center}
\includegraphics[width=1.0\textwidth]{combine_plot_spectrum_2}
\end{center}
\end{figure}

\subsection{96 mutational profile}

Make a 96 trinucleodide mutation count matrix:
<<mut_matrix>>=
mut_mat <- mut_matrix(vcf_list = vcfs, ref_genome = ref_genome)
head(mut_mat)
@

Plot the 96 profile of two samples:
<<plot_96_profile_2_e, eval=FALSE, width=7, height=3>>=
plot_96_profile(mut_mat[,c(1,7)])
@

\begin{figure}[H]
\begin{center}
<<plot_96_profile_2, fig=TRUE, width=7, height=3, echo=FALSE>>=
plot_96_profile(mut_mat[,c(1,7)])
@
\end{center}
\end{figure}

Plot 96 profile of two samples in a more condensed plotting format:
<<plot_96_profile_3_e, eval=FALSE, width=7, height=3>>=
plot_96_profile(mut_mat[,c(1,7)], condensed = TRUE)
@

\begin{figure}[H]
\begin{center}
<<plot_96_profile_3, fig=TRUE, width=7, height=3, echo=FALSE>>=
plot_96_profile(mut_mat[,c(1,7)], condensed = TRUE)
@
\end{center}
\end{figure}

\newpage{}
\section{Mutational signatures}

\subsection{\textit{De novo} mutational signature extraction using NMF}

Mutational signatures are thought to represent mutational processes, and are
characterized by a specific contribution of 96 base substitution types with a
certain sequence context. Mutational signatures can be extracted from your
mutation count matrix, with non-negative matrix factorization (NMF). A critical
parameter in NMF is the factorization rank, which is the number of mutational
signatures. You can determine the optimal factorization rank using the NMF
package \citep{Gaujoux2010}. As described in their paper:

``...a common
way of deciding on the rank is to try different values, compute some quality
measure of the results, and choose the best value according to this quality
criteria. The most common approach is to choose the smallest rank for which
cophenetic correlation coefficient starts decreasing. Another approach is to
choose the rank for which the plot of the residual sum of squares (RSS) between
the input matrix and its estimate shows an inflection point.''

First add a small psuedocount to your mutation count matrix:

<<psuedo_count>>=
mut_mat <- mut_mat + 0.0001
@

Use the NMF package to generate an estimate rank plot:
<<use_nmf>>=
library("NMF")
estimate <- nmf(mut_mat, rank=2:5, method="brunet", nrun=10, seed=123456)
@

And plot it:
<<estimate_rank_e, eval=FALSE, width=7, height=5>>=
plot(estimate)
@

\begin{figure}[H]
\begin{center}
<<estimate_rank, fig=TRUE, width=7, height=5, echo=FALSE>>=
plot(estimate)
@
\end{center}
\end{figure}

Extract 2 mutational signatures from the mutation count matrix with
\Rfunction{extract\_signatures} (For larger datasets it is wise to perform more
iterations by changing the nrun parameter to achieve stability and avoid local
minima):

<<extract_signatures>>=
nmf_res <- extract_signatures(mut_mat, rank = 2, nrun = 10)
@

Assign signature names:
<<add_column_names>>=
colnames(nmf_res$signatures) <- c("Signature A", "Signature B")
rownames(nmf_res$contribution) <- c("Signature A", "Signature B")
@

Plot the 96-profile of the signatures:
<<plot_96_profile_e, eval=FALSE, width=7, height=3>>=
plot_96_profile(nmf_res$signatures, condensed = TRUE)
@

\begin{figure}[H]
\begin{center}
<<plot_96_profile, fig=TRUE, width=7, height=3, echo=FALSE>>=
plot_96_profile(nmf_res$signatures, condensed = TRUE)
@
\end{center}
\end{figure}

Visualize the contribution of the signatures in a barplot:
<<plot_contribution>>=
pc1 <- plot_contribution(nmf_res$contribution, nmf_res$signature,
                         mode = "relative")
@

Visualize the contribution of the signatures in absolute number of mutations:
<<plot_contribution_2>>=
pc2 <- plot_contribution(nmf_res$contribution, nmf_res$signature,
                         mode = "absolute")
@

Combine the two plots:
<<plot_contribution_2_fig_e, fig=TRUE, width=7, height=5, eval=FALSE>>=
grid.arrange(pc1, pc2)
@

\begin{figure}[H]
\begin{center}
<<plot_contribution_2_fig, fig=TRUE, width=7, height=5, echo=FALSE>>=
grid.arrange(pc1, pc2)
@
\end{center}
\end{figure}

Flip X and Y coordinates:

<<plot_contribution_4_e, eval=FALSE>>=
plot_contribution(nmf_res$contribution, nmf_res$signature,
                         mode = "absolute", coord_flip = TRUE)
@

\begin{figure}[H]
\begin{center}
<<plot_contribution_4, fig=TRUE, width=7, height=4, echo=FALSE>>=
plot_contribution(nmf_res$contribution, nmf_res$signature,
                         mode = "absolute", coord_flip = TRUE)
@
\end{center}
\end{figure}

The relative contribution of each signature for each sample can also be plotted
as a heatmap with \Rfunction{plot\_contribution\_heatmap}, which might be easier
to interpret and compare than stacked barplots. The samples can be
hierarchically clustered based on their euclidean distance. The signatures can
be plotted in a user-specified order.

Plot signature contribution as a heatmap with sample clustering dendrogram and
a specified signature order:

<<plot_contribution_heatmap>>=
pch1 <- plot_contribution_heatmap(nmf_res$contribution,
                                  sig_order = c("Signature B", "Signature A"))
@

Plot signature contribution as a heatmap without sample clustering:
<<plot_contribution_heatmap2_e>>=
pch2 <- plot_contribution_heatmap(nmf_res$contribution, cluster_samples=FALSE)
@

Combine the plots into one figure:
<<plot_contribution_heatmap2_e2, eval=FALSE>>=
grid.arrange(pch1, pch2, ncol = 2, widths = c(2,1.6))
@

\begin{figure}[H]
\begin{center}
<<plot_contribution_heatmap2, fig=TRUE, width=6, height=3, echo=FALSE>>=
grid.arrange(pch1, pch2, ncol = 2, widths = c(2,1.6))
@
\end{center}
\end{figure}

Compare the reconstructed mutational profile with the original mutational
profile:

<<plot_compare_profiles_e, eval=FALSE>>=
plot_compare_profiles(mut_mat[,1],
                        nmf_res$reconstructed[,1],
                        profile_names = c("Original", "Reconstructed"),
                        condensed = TRUE)
@

\begin{figure}[H]
\begin{center}
<<plot_compare_profiles, fig=TRUE, width=6, height=5.5, echo=FALSE>>=
plot_compare_profiles(mut_mat[,1], 
                        nmf_res$reconstructed[,1], 
                        profile_names = c("Original", "Reconstructed"),
                        condensed = TRUE)
@
\end{center}
\end{figure}

\subsection{Find optimal contribution of known signatures}

\subsubsection{COSMIC mutational signatures}

Download mutational signatures from the COSMIC website:

<<download_cancer_signatures>>=
sp_url <- paste("https://cancer.sanger.ac.uk/cancergenome/assets/",
                "signatures_probabilities.txt", sep = "")

cancer_signatures = read.table(sp_url, sep = "\t", header = TRUE)
# Match the order of the mutation types to MutationalPatterns standard
new_order = match(row.names(mut_mat), cancer_signatures$Somatic.Mutation.Type)
# Reorder cancer signatures dataframe
cancer_signatures = cancer_signatures[as.vector(new_order),]
# Add trinucletiode changes names as row.names
row.names(cancer_signatures) = cancer_signatures$Somatic.Mutation.Type
# Keep only 96 contributions of the signatures in matrix
cancer_signatures = as.matrix(cancer_signatures[,4:33])
@ 

Plot mutational profile of the first two COSMIC signatures:
<<plot_96_profile_COSMIC_e, eval=FALSE>>=
plot_96_profile(cancer_signatures[,1:2], condensed = TRUE, ymax = 0.3)
@

\begin{figure}[H]
\begin{center}
<<plot_96_profile_COSMIC, fig=TRUE, width=7, height=3, echo=FALSE>>=
plot_96_profile(cancer_signatures[,1:2], condensed = TRUE, ymax = 0.3)
@
\end{center}
\end{figure}

Hierarchically cluster the COSMIC signatures based on their similarity with
average linkage:

<<cluster_COSMIC, fig=TRUE, width=7, height=5>>=
hclust_cosmic = cluster_signatures(cancer_signatures, method = "average")
# store signatures in new order
cosmic_order = colnames(cancer_signatures)[hclust_cosmic$order]
plot(hclust_cosmic)
@

\subsubsection{Similarity between mutational profiles and COSMIC signatures}

The similarity between each mutational profile and each COSMIC signature, can be
calculated with \Rfunction{cos\_sim\_matrix}, and visualized with
\Rfunction{plot\_cosine\_heatmap}. The cosine similarity reflects how well each
mutational profile can be explained by each signature individually. The
advantage of this heatmap representation is that it shows in a glance the
similarity in mutational profiles between samples, while at the same time
providing information on which signatures are most prominent. The samples can be
hierarchically clustered in \Rfunction{plot\_cosine\_heatmap}.

The cosine similarity between two mutational profiles/signatures can be
calculated with \Rfunction{cos\_sim}:

<<cos_sim>>=
cos_sim(mut_mat[,1], cancer_signatures[,1])
@

Calculate pairwise cosine similarity between mutational profiles and COSMIC
signatures:

<<cos_sim_cosmic_samples, fig=TRUE, width=7, height=3.5>>=
cos_sim_samples_signatures = cos_sim_matrix(mut_mat, cancer_signatures)
# Plot heatmap with specified signature order
plot_cosine_heatmap(cos_sim_samples_signatures,
                    col_order = cosmic_order,
                    cluster_rows = TRUE)
@

\subsubsection{Find optimal contribution of COSMIC signatures to reconstruct 96 mutational profiles}

In addition to \textit{de novo} extraction of signatures, the contribution of
any set of signatures to the mutational profile of a sample can be quantified.
This unique feature is specifically useful for mutational signature analyses of
small cohorts or individual samples, but also to relate own findings to known
signatures and published findings. The  \Rfunction{fit\_to\_signatures} function
finds the optimal linear combination of mutational signatures that most closely
reconstructs the mutation matrix by solving a non-negative least-squares
constraints problem.

Fit mutation matrix to the COSMIC mutational signatures:

<<fit_to_signatures>>=
fit_res <- fit_to_signatures(mut_mat, cancer_signatures)
@ 

Plot the optimal contribution of the COSMIC signatures in each sample as a
stacked barplot.

<<plot_contribution_3_noeval, eval=FALSE>>=
# Select signatures with some contribution
select <- which(rowSums(fit_res$contribution) > 10)
# Plot contribution barplot
plot_contribution(fit_res$contribution[select,],
                    cancer_signatures[,select],
                    coord_flip = FALSE,
                    mode = "absolute")
@

<<plot_contribution_3, echo=FALSE>>=
# Select signatures with some contribution
select <- which(rowSums(fit_res$contribution) > 0)
# Plot contribution
ggsave("plot_contribution_3.pdf",
       plot_contribution(fit_res$contribution[select,],
                         cancer_signatures[,select],
                         coord_flip = FALSE,
                         mode = "absolute"),
       width=9,
       height=5)
@

\begin{figure}[H]
\begin{center}
\includegraphics[width=1.0\textwidth]{plot_contribution_3}
\end{center}
\end{figure}

Plot relative contribution of the cancer signatures in each sample as a heatmap
with sample clustering:
<<plot_contribution_heatmap3_e, eval=FALSE>>=
plot_contribution_heatmap(fit_res$contribution,
                          cluster_samples = TRUE,
                          method = "complete")
@

\begin{figure}[H]
\begin{center}
<<plot_contribution_heatmap3, fig=TRUE, width=8, height=3, echo=FALSE>>=
plot_contribution_heatmap(fit_res$contribution,
                          cluster_samples = TRUE,
                          method = "complete")
@
\end{center}
\end{figure}

Compare the reconstructed mutational profile of sample 1
with its original mutational profile:
<<plot_compare_profiles_2_e, eval=FALSE>>=
plot_compare_profiles(mut_mat[,1], fit_res$reconstructed[,1],
                        profile_names = c("Original", "Reconstructed"),
                        condensed = TRUE)
@

\begin{figure}[H]
\begin{center}
<<plot_compare_profiles_2, fig=TRUE, width=6, height=5.5, echo=FALSE>>=
plot_compare_profiles(mut_mat[,1], fit_res$reconstructed[,1],
                        profile_names = c("Original", "Reconstructed"),
                        condensed = TRUE)
@
\end{center}
\end{figure}

Calculate the cosine similarity between all original and reconstructed
mutational profiles with \Rfunction{cos\_sim\_matrix}:

<<cos_sim_ori_rec>>=
# calculate all pairwise cosine similarities
cos_sim_ori_rec <- cos_sim_matrix(mut_mat, fit_res$reconstructed)
# extract cosine similarities per sample between original and reconstructed
cos_sim_ori_rec <- as.data.frame(diag(cos_sim_ori_rec))
@

We can use ggplot to make a barplot of the cosine similarities between the
original and reconstructed mutational profile of each sample. This clearly
shows how well each mutational profile can be reconstructed with the COSMIC
mutational signatures. Two identical profiles have a cosine similarity of 1.
The lower the cosine similarity between original and reconstructed, the less
well the original mutational profile can be reconstructed with the COSMIC
signatures. You could use, for example, cosine similarity of 0.95 as a cutoff.

<<cos_sim_ori_rec>>=
# Adjust data frame for plotting with gpplot
colnames(cos_sim_ori_rec) = "cos_sim"
cos_sim_ori_rec$sample = row.names(cos_sim_ori_rec)
@

<<plot_cos_sim_ori_rec_e, eval=FALSE>>=
# Load ggplot2
library(ggplot2)
# Make barplot
ggplot(cos_sim_ori_rec, aes(y=cos_sim, x=sample)) +
  geom_bar(stat="identity", fill = "skyblue4") +
  coord_cartesian(ylim=c(0.8, 1)) +
  # coord_flip(ylim=c(0.8,1)) +
  ylab("Cosine similarity\n original VS reconstructed") +
  xlab("") +
  # Reverse order of the samples such that first is up
  # xlim(rev(levels(factor(cos_sim_ori_rec$sample)))) +
  theme_bw() +
  theme(panel.grid.minor.y=element_blank(),
        panel.grid.major.y=element_blank()) +
  # Add cut.off line
  geom_hline(aes(yintercept=.95))
@

\begin{figure}[H]
\begin{center}
<<plot_cos_sim_ori_rec, fig=TRUE, width=6, height=3, echo=FALSE>>=
# Load ggplot2
library(ggplot2)
# Make barplot
ggplot(cos_sim_ori_rec, aes(y=cos_sim, x=sample)) +
  geom_bar(stat="identity", fill = "skyblue4") +
  coord_cartesian(ylim=c(0.8, 1)) +
  # coord_flip(ylim=c(0.8,1)) +
  ylab("Cosine similarity\n original VS reconstructed") +
  xlab("") +
  # Reverse order of the samples such that first is up
  # xlim(rev(levels(factor(cos_sim_ori_rec$sample)))) +
  theme_bw() +
  theme(panel.grid.minor.y=element_blank(),
        panel.grid.major.y=element_blank()) +
  # Add cut.off line
  geom_hline(aes(yintercept=.95))
@
\end{center}
\end{figure}

\section{Strand bias analyses}

\subsection{Transcriptional strand bias analysis}

For the mutations within genes it can be determined whether the mutation is
on the transcribed or non-transcribed strand, which can be used to evaluate
the involvement of transcription-coupled repair. To this end, it is determined
whether the "C" or "T" base (since by convention we regard base substitutions
as C>X or T>X) are on the same strand as the gene definition. Base substitions
on the same strand as the gene definitions are considered "untranscribed", and
on the opposite strand of gene bodies as "transcribed", since the gene
definitions report the coding or sense strand, which is untranscribed. No
strand information is reported for base substitution that overlap with more
than one gene body on different strands.

Get gene definitions for your reference genome:

<<get_genes>>=
# For example get known genes table from UCSC for hg19 using 
# BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
genes_hg19 <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
genes_hg19
@

Get transcriptional strand information for all positions in the first VCF
object with \Rfunction{mut\_strand}. This function returns ``-'' for
positions outside gene bodies, and positions that overlap with more than one
gene on different strands.

<<mut_strand>>=
strand = mut_strand(vcfs[[1]], genes_hg19)
head(strand, 10)
@ 

Make mutation count matrix with transcriptional strand information (96
trinucleotides * 2 strands = 192 features). NB: only those mutations that are
located within gene bodies are counted.

<<mut_matrix_stranded>>=
mut_mat_s <- mut_matrix_stranded(vcfs, ref_genome, genes_hg19)
mut_mat_s[1:5,1:5]
@ 

Count the number of mutations on each strand, per tissue, per mutation type:

<<strand_occurrences>>=
strand_counts <- strand_occurrences(mut_mat_s, by=tissue)
head(strand_counts)
@ 

Perform Poisson test for strand asymmetry significance testing:

<<strand_bias_test>>=
strand_bias <- strand_bias_test(strand_counts)
strand_bias
@

Plot the mutation spectrum with strand distinction:
<<plot_strand>>=
ps1 <- plot_strand(strand_counts, mode = "relative")
@ 

Plot the effect size (log2(untranscribed/transcribed) of the strand bias.
Asteriks indicate significant strand bias.
<<plot_strand_bias_3>>=
ps2 <- plot_strand_bias(strand_bias)
@

Combine the plots into one figure:
<<plot_strand_bias_e, eval=FALSE>>=
grid.arrange(ps1, ps2)
@

\begin{figure}[H]
\begin{center}
<<plot_strand_bias, fig=TRUE, width=7.5, height=5.5, echo=FALSE>>=
grid.arrange(ps1, ps2)
@
\end{center}
\end{figure}


\subsection{Replicative strand bias analysis}

The involvement of replication-associated mechanisms can be evaluated by
testing for a mutational bias between the leading and lagging strand. The
replication strand is dependent on the locations of replication origins from
which DNA replication is fired. However, replication timing is dynamic and
cell-type specific, which makes replication strand determination less
straightforward than transcriptional strand bias analysis. Replication timing
profiles can be generated with Repli-Seq experiments. Once the replication
direction is defined, a strand asymmetry analysis can be performed similarly as
the transcription strand bias analysis.

Read example bed file provided with the package with replication direction
annotation:

<<repli_file>>=
repli_file = system.file("extdata/ReplicationDirectionRegions.bed",
                           package = "MutationalPatterns")
repli_strand = read.table(repli_file, header = TRUE)
# Store in GRanges object
repli_strand_granges = GRanges(seqnames = repli_strand$Chr,
  ranges = IRanges(start = repli_strand$Start + 1,
                   end = repli_strand$Stop),
  strand_info = repli_strand$Class)
# UCSC seqlevelsstyle
seqlevelsStyle(repli_strand_granges) = "UCSC"
repli_strand_granges
@

The GRanges object should have a ``strand\_info'' metadata column, which
contains only two different annotations, e.g. ``left'' and ``right'', or
``leading'' and ``lagging''. The genomic ranges cannot overlap, to allow only
one annotation per location.

Get replicative strand information for all positions in the first VCF
object. No strand information ``-'' is returned for base substitutions in
unannotated genomic regions.

<<strand_from_vcf_rep>>=
strand_rep <- mut_strand(vcfs[[1]], repli_strand_granges, mode = "replication")
head(strand_rep, 10)
@

Make mutation count matrix with transcriptional strand information (96
trinucleotides * 2 strands = 192 features).

<<mut_matrix_stranded_rep>>=
mut_mat_s_rep <- mut_matrix_stranded(vcfs, ref_genome, repli_strand_granges,
                                     mode = "replication")
mut_mat_s_rep[1:5, 1:5]
@

The levels of the "strand\_info" metadata in the GRanges object determines the
order in which the strands are reported in the mutation matrix that is returned
by \Rfunction{mut\_matrix\_stranded}, so if you want to count right before left,
you can specify this, before you run \Rfunction{mut\_matrix\_stranded}:

<<specify_levels>>=
repli_strand_granges$strand_info <- factor(repli_strand_granges$strand_info,
                                           levels = c("right", "left"))
mut_mat_s_rep2 <- mut_matrix_stranded(vcfs, ref_genome, repli_strand_granges,
                                      mode = "replication")
mut_mat_s_rep2[1:5, 1:5]
@

Count the number of mutations on each strand, per tissue, per mutation type:

<<strand_occurrences_rep>>=
strand_counts_rep <- strand_occurrences(mut_mat_s_rep, by=tissue)
head(strand_counts)
@

Perform Poisson test for strand asymmetry significance testing:

<<strand_bias_test_rep>>=
strand_bias_rep <- strand_bias_test(strand_counts_rep)
strand_bias_rep
@

Plot the mutation spectrum with strand distinction:

<<plot_strand_rep>>=
ps1 <- plot_strand(strand_counts_rep, mode = "relative")
@

Plot the effect size (log2(untranscribed/transcribed) of the strand bias. 
Asteriks indicate significant strand bias. 

<<plot_strand_bias_rep2>>=
ps2 <- plot_strand_bias(strand_bias_rep)
@

Combine the plots into one figure:
<<plot_strand_bias_rep_e, eval=FALSE>>=
grid.arrange(ps1, ps2)
@

\begin{figure}[H]
\begin{center}
<<plot_strand_bias_rep, fig=TRUE, width=7.5, height=5.5, echo=FALSE>>=
grid.arrange(ps1, ps2)
@ 
\end{center}
\end{figure}

\subsection{Extract signatures with strand bias}

Extract 2 signatures from mutation count matrix with strand features:

<<extract_signatures>>=
nmf_res_strand <- extract_signatures(mut_mat_s, rank = 2)

# Provide signature names
colnames(nmf_res_strand$signatures) <- c("Signature A", "Signature B")
@ 

Plot signatures with 192 features:

<<plot_192>>=
a <- plot_192_profile(nmf_res_strand$signatures, condensed = TRUE)
@

Plot strand bias per mutation type for each signature with significance test:
<<plot_strand_bias>>=
b <- plot_signature_strand_bias(nmf_res_strand$signatures)
@

Combine the plots into one figure:
<<plot_192_profile_noeval, eval=FALSE>>=
grid.arrange(a, b, ncol = 2, widths = c(5, 1.8))
@

<<plot_192_profile, echo=FALSE>>=
ggsave("plot_192_profile.pdf", 
       grid.arrange(a, b, ncol = 2, widths = c(5, 1.8)),
       width = 10,
       height = 5)
@

\begin{figure}[H]
\begin{center}
\includegraphics[width=1.0\textwidth]{plot_192_profile}
\end{center}
\end{figure}


\section{Genomic distribution}

\subsection{Rainfall plot}

A rainfall plot visualizes mutation types and intermutation distance. Rainfall
plots can be used to visualize the distribution of mutations along the genome
or a subset of chromosomes. The y-axis corresponds to the distance of a
mutation with the previous mutation and is log10 transformed. Drop-downs from
the plots indicate clusters or ``hotspots'' of mutations.

Make rainfall plot of sample 1 over all autosomal chromosomes

<<plot_rainfall_noeval, eval=FALSE>>=
# Define autosomal chromosomes
chromosomes <- seqnames(get(ref_genome))[1:22]

# Make a rainfall plot
plot_rainfall(vcfs[[1]], title = names(vcfs[1]),
                chromosomes = chromosomes, cex = 1.5, ylim = 1e+09)
@

<<plot_rainfall, echo=FALSE>>=
# Define autosomal chromosomes
chromosomes <- seqnames(get(ref_genome))[1:22]

# Make a rainfall plot
ggsave("plot_rainfall.pdf",
       plot_rainfall(vcfs[[1]], title = names(vcfs[1]),
                     chromosomes = chromosomes, cex = 1.5, ylim = 1e+09),
       width=9,
       height=3)
@

\begin{figure}[H]
\begin{center}
\includegraphics[width=1.0\textwidth]{plot_rainfall}
\end{center}
\end{figure}

% Make rainfall plot of the first sample over chromosome 1:
%
% \begin{figure}[H]
% \begin{center}
% <<plot_rainfall_2, fig=TRUE, width=6, height=3>>=
% chromosomes <- seqnames(get(ref_genome))[1]
% plot_rainfall(vcfs[[1]], title = names(vcfs[1]),
%                 chromosomes = chromosomes[1], cex = 2, ylim = 1e+09)
% @
% \end{center}
% \end{figure}

\subsection{Enrichment or depletion of mutations in genomic regions}

Test for enrichment or depletion of mutations in certain genomic regions, such
as promoters, CTCF binding sites and transcription factor binding sites.  To
use your own genomic region definitions (based on e.g. ChipSeq experiments)
specify your genomic regions in a named list of GRanges objects.  Alternatively,
use publicly available genomic annotation data, like in the example below.

\subsubsection{Example: regulation annotation data from Ensembl using 
    \Biocpkg{biomaRt}}

The following example displays how to download promoter, CTCF binding sites and 
transcription factor binding sites regions for
genome build hg19 from Ensembl using \Biocpkg{biomaRt}.  For other datasets,
see the  \Biocpkg{biomaRt} documentation \citep{Durinck2005}.

To install \Biocpkg{biomaRt}, uncomment the following lines:
<<install_biomaRt, eval=FALSE>>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("biomaRt")
@

Load the \Biocpkg{biomaRt} package.
<<load_biomart>>=
library(biomaRt)
@ 

Download genomic regions. NB: Here we take some shortcuts by loading the results 
from our example data. The corresponding code for downloading this data can be 
found above the command we run:

<<download_using_biomaRt>>=
# regulatory <- useEnsembl(biomart="regulation",
#                          dataset="hsapiens_regulatory_feature",
#                          GRCh = 37)

## Download the regulatory CTCF binding sites and convert them to
## a GRanges object.
# CTCF <- getBM(attributes = c('chromosome_name',
#                             'chromosome_start',
#                             'chromosome_end',
#                             'feature_type_name',
#                             'cell_type_name'),
#              filters = "regulatory_feature_type_name", 
#              values = "CTCF Binding Site", 
#              mart = regulatory)
#
# CTCF_g <- reduce(GRanges(CTCF$chromosome_name,
#                 IRanges(CTCF$chromosome_start,
#                 CTCF$chromosome_end)))

CTCF_g <- readRDS(system.file("states/CTCF_g_data.rds",
                    package="MutationalPatterns"))

## Download the promoter regions and convert them to a GRanges object.

# promoter = getBM(attributes = c('chromosome_name', 'chromosome_start',
#                                 'chromosome_end', 'feature_type_name'),
#                  filters = "regulatory_feature_type_name", 
#                  values = "Promoter", 
#                  mart = regulatory)
# promoter_g = reduce(GRanges(promoter$chromosome_name,
#                     IRanges(promoter$chromosome_start,
#                             promoter$chromosome_end)))

promoter_g <- readRDS(system.file("states/promoter_g_data.rds",
                        package="MutationalPatterns"))

## Download the promoter flanking regions and convert them to a GRanges object.

# flanking = getBM(attributes = c('chromosome_name',
#                                 'chromosome_start',
#                                 'chromosome_end',
#                                 'feature_type_name'),
#                  filters = "regulatory_feature_type_name", 
#                  values = "Promoter Flanking Region", 
#                  mart = regulatory)
# flanking_g = reduce(GRanges(
#                        flanking$chromosome_name,
#                        IRanges(flanking$chromosome_start,
#                        flanking$chromosome_end)))

flanking_g <- readRDS(system.file("states/promoter_flanking_g_data.rds",
                                    package="MutationalPatterns"))
@ 

Combine all genomic regions (GRanges objects) in a named list:

<<combine_genomic_regions>>=
regions <- GRangesList(promoter_g, flanking_g, CTCF_g)

names(regions) <- c("Promoter", "Promoter flanking", "CTCF")
@ 

Use the same chromosome naming convention consistently:

<<combine_genomic_regions_2>>=
seqlevelsStyle(regions) <- "UCSC"
@ 

\subsection{Test for significant depletion or enrichment in genomic regions}

It is necessary to include a list with Granges of regions that were surveyed
in your analysis for each sample, that is: positions in the genome at which
you have enough high quality reads to call a mutation. This can
be determined using e.g. CallableLoci tool by GATK. If you would not include the
surveyed area in your analysis, you might for example see a depletion of
mutations in a certain genomic region that is solely a result from a low
coverage in that region, and therefore does not represent an actual depletion
of mutations.

We provided an example surveyed region data file with the package. For simplicity, 
here we use the same surveyed file for each sample. For a proper analysis, determine 
the surveyed area per sample and use these in your analysis.

Download the example surveyed region data:

<<download_bed_data>>=
## Get the filename with surveyed/callable regions
surveyed_file <- system.file("extdata/callableloci-sample.bed",
                             package = "MutationalPatterns")

## Import the file using rtracklayer and use the UCSC naming standard
library(rtracklayer)
surveyed <- import(surveyed_file)
seqlevelsStyle(surveyed) <- "UCSC"

## For this example we use the same surveyed file for each sample.
surveyed_list <- rep(list(surveyed), 9)
@ 

Test for enrichment or depletion of mutations in your defined genomic
regions using a binomial test.  For this test, the chance of observing a
mutation is calculated as the total number of mutations, divided by the
total number of surveyed bases.

<<genomic_distribution>>=
## Calculate the number of observed and expected number of mutations in
## each genomic regions for each sample.
distr <- genomic_distribution(vcfs, surveyed_list, regions)
@ 

<<enrichment_depletion_test>>=
## Perform the enrichment/depletion test by tissue type.
distr_test <- enrichment_depletion_test(distr, by = tissue)
head(distr_test)
@ 

<<plot_enrichment_depletion_e, eval=FALSE>>=
plot_enrichment_depletion(distr_test)
@

\begin{figure}[H]
\begin{center}
<<plot_enrichment_depletion, fig=TRUE, width=7, height=5, echo=FALSE>>=
plot_enrichment_depletion(distr_test)
@
\end{center}
\end{figure}

\bibliography{references}

\section{Session Information}
<<sessionInfo, eval=TRUE, echo=FALSE, results=tex>>=
toLatex(sessionInfo())
@

\end{document}