%\VignetteIndexEntry{An R package for prediction of nucleosome positioning}
%\VignetteKeywords{Nucleosome}

\documentclass[a4paper]{article}

\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\usepackage{ccaption}
\usepackage{natbib}

\setlength{\textwidth}{6.2in}
\setlength{\textheight}{8.5in}
\setlength{\parskip}{1.5ex plus0.5ex minus 0.5ex}
\setlength{\oddsidemargin}{0.1cm}
\setlength{\evensidemargin}{0.1cm}
\setlength{\headheight}{0.3cm}
\setlength{\arraycolsep}{0.1cm}

\renewcommand{\baselinestretch}{1}

\begin{document}
\title{An introduction to the nuCpos package}
\author{Hiroaki Kato\thanks{hkato@med.shimane-u.ac.jp}, \  Takeshi Urano}

\maketitle

\section{About nuCpos}

\Rpackage{nuCpos}, a derivative of \Rpackage{NuPoP}, 
is an R package for predicting \textbf{\textit{nuc}}leosome 
\textbf{\textit{pos}}itions. In \Rpackage{nuCpos}, 
a duration hidden Markov model is trained with a 
\textbf{\textit{C}}hemical map of nucleosomes either from 
budding yeast \textit{Saccharomyces cerevisiae} 
(\cite{stat:Brogaard2012}), fission yeast 
\textit{Schizosaccharomyces pombe} (\cite{stat:Moyle-Heyrman2013}), 
or embryonic stem cells of house mouse \textit{Mus musculus} 
(\cite{stat:Voong2016}). \Rpackage{nuCpos} outputs the Viterbi 
(most probable) path of nucleosome-linker states, 
predicted nucleosome occupancy scores and histone binding affinity 
(HBA) scores as \Rpackage{NuPoP} does. \Rpackage{nuCpos} can also 
calculate local and whole nucleosomal HBA scores for 
a given 147-bp sequence. 

The parental package \Rpackage{NuPoP}, licensed under GPL-2, 
was developed by Ji-Ping Wang and Liqun Xi. Please refer to 
\cite{stat:XiWang2010} and \cite{stat:WangWidom2008} for 
technical details of \Rpackage{NuPoP}. Their excellent codes 
were adapted in \Rpackage{nuCpos} to demonstrate the usefulness 
of chemical maps in prediction. 
In dHMM-based predicton, 
users of \Rpackage{nuCpos} can choose whether HBA scores will be 
smoothed or not in the output. Note that when 
\Rpackage{nuCpos} was released, \Rpackage{NuPoP} only used an 
MNase-seq-based map of budding yeast nucleosomes to train 
a duration hidden Markov model. However, as \Rpackage{NuPoP} now 
provides chemical map-based prediction, users are encouraged 
to use \Rpackage{NuPoP} functions to conduct dHMM-based prediction 
in their original way. 


\section{nuCpos functions}
\Rpackage{nuCpos} has three functions: \verb@predNuCpos@, 
\verb@HBA@, and \verb@localHBA@. 
The \verb@predNuCpos@ function provides dHMM-based prediction of 
nucleosome occupancy. This function is built to demonstrate 
the usefulness of chemical maps in prediction. Users can obtain 
unsmoothed HBA scores with this function. Note: \Rpackage{NuPoP} 
now provides chemical map-based prediction. 

The functions \verb@HBA@ and \verb@localHBA@ receive a sequence of 
147-bp DNA and calculate whole nucleosomal and local HBA scores. 
These functions invoke core Fortran codes for HBA calculation 
that were adapted from the excellent dHMM code of \Rpackage{NuPoP}.

\Rpackage{nuCpos} requires the \Rpackage{Biostrings} package, 
especially when DNA sequences are given as DNAString objects to 
the functions \verb@HBA@, and \verb@localHBA@. 
These functions can also receive DNA sequences as simple character 
string objects without loading the \Rpackage{Biostrings} package. 
Note: \Rpackage{nuCpos} requires the \Rpackage{NuPoP} package to perform 
some example runs.

Load the \Rpackage{nuCpos} package as follows:

<<>>=
library(nuCpos)
@


\section{Performing predictions with predNuCpos}
The \verb@predNuCpos@ function acts like the \verb@predNuPoP@ 
function of \Rpackage{NuPoP}. 
When the $\Rfunarg{ActLikePredNuPoP}$ argument is set as TRUE, 
\verb@predNuCpos@ reads a DNA 
sequence file in FASTA format and invokes a Fortran subroutine 
to perform predictions. The prediction results will be saved 
in the working directory. \textit{TRP1ARS1x1.fasta}, the DNA 
sequence of \textit{TRP1ARS1} circular minichromosome (1,465 bp) 
(\cite{stat:Fuse2017}), in \Robject{extdata} can be used for 
an example run. Call the \verb@predNuCpos@ function as follows:

<<>>=
predNuCpos(file = system.file("extdata", "TRP1ARS1x1.fasta", 
    package = "nuCpos"), species = "sc", smoothHBA = FALSE, 
    ActLikePredNuPoP = TRUE)
@ 

The argument $\Rfunarg{file}$ is the path to the fasta file. 
The argument $\Rfunarg{species}$ can be specified as follows: 
mm = \textit{M. musculus};  sc = \textit{S. cerevisiae}; 
sp = \textit{S. pombe}. Re-scaling of the nucleosome and linker 
models for the prediction of other species' nucleosomes are 
not supported. \Rpackage{nuCpos} uses 4th order Markov chain 
models for the prediction. 

The name of the output file will be like 
\textit{TRP1ARS1x1.fasta\_Prediction4.txt}. 
As in the output file produced by the parental \Rpackage{NuPoP} 
package, it will contain five columns: 
\begin{enumerate}
    \item{\Robject{Position}: position in the input DNA sequence.}
    \item{\Robject{P-start}:  probability that a nucleosome starts at.}
    \item{\Robject{Occup}:    nucleosome occupancy score.}
    \item{\Robject{N/L}:      Viterbi path (1 and 0 for the 
        nucleosome and linker states, respectively).}
    \item{\Robject{Affinity}: histone binding affinity score.} 
\end{enumerate}

To import the output into R, the \verb@readNuPoP@ function of 
    \Rpackage{NuPoP} can be used:
<<>>=
library(NuPoP)
results.TRP1ARS1x1 <- readNuPoP("TRP1ARS1x1.fasta_Prediction4.txt", 
    startPos = 1, endPos = 1465)
results.TRP1ARS1x1[1:5,]
@ 

The arguments $\Rfunarg{startPos}$ and $\Rfunarg{endPos}$ are used 
to import a part of the prediction results. In this example, 
the prediction results for the whole tested sequence is imported. 
First and last 73-bp regions do not have HBA scores 
(\Robject{Affinity}) as they cannot be calculated. 
The HBA scores start from the 74th position:

<<>>=
results.TRP1ARS1x1[72:76,]
@

For visualization of the prediction results, the \verb@plotNuPoP@ 
function of \Rpackage{NuPoP} can be used. This function draws 
two plots in the graphical window. The top one shows predicted 
nucleosome occupancy. In the bottom one, probability of a nucleosome 
to start at the given position (blue vertical lines) and the 
Viterbi path (red lines) are shown as well as the nucleosome 
occupancy (gray). 

<<fig=TRUE>>=
plotNuPoP(results.TRP1ARS1x1)
@ 

For prediction of nucleosome positioning in short circular DNA, 
one can use a triplicated sequence for prediction and read only 
the central copy for the evaluation. By triplicating the DNA, 
inaccurate prediction near the DNA ends, which are joined to 
each other in the circular form, can be avoided.

<<>>=
predNuCpos(file = system.file("extdata", "TRP1ARS1x3.fasta", 
    package = "nuCpos"), species = "sc", smoothHBA = FALSE, 
    ActLikePredNuPoP = TRUE)
results.TRP1ARS1 <- readNuPoP("TRP1ARS1x3.fasta_Prediction4.txt", 
    startPos = 1466, endPos = 2930)
@ 

Here, \Robject{TRP1ARS1x3.fasta} in \Robject{extdata} is a 
triplicated sequence (4,395 bp) of the \textit{TRP1ARS1} 
minichromosome (1,465 bp). 
The central part (from the coordinate 1,466 to 2,930) of the 
prediction results is read by \verb@readNuPoP@. They are apparently 
different from the previous results near the terminal regions.

<<fig=TRUE>>=
par(mfrow = c(2, 1))
plot(x = 1:1465, y = results.TRP1ARS1x1[,3], type = "n", 
    ylim = c(-0.05, 1), xlab = "Position from the EcoRI site #1", 
    ylab = "probability/occupancy")
title("Not triplicated")
polygon(c(1, 1:1465, 1465), c(0, results.TRP1ARS1x1[,3], 0), col = 8)
points(x = 1:1465, y = results.TRP1ARS1x1[,4], type = "l", col = 2)
points(x = 1:1465, y = results.TRP1ARS1x1[, 2], type = "h", col = 4)

plot(x = 1:1465, y = results.TRP1ARS1[,3], type = "n", 
    ylim = c(-0.05, 1), xlab = "Position from the EcoRI site #1", 
    ylab = "probability/occupancy")
title("Triplicated: central copy")
polygon(c(1, 1:1465, 1465), c(0, results.TRP1ARS1[,3], 0), col = 8)
points(x = 1:1465, y = results.TRP1ARS1[,4], type = "l", col = 2)
points(x = 1:1465, y = results.TRP1ARS1[, 2], type = "h", col = 4)
@


By specifying the argument $\Rfunarg{smoothHBA}$ as TRUE, 
HBA scores can be smoothed in a 55-bp window as being done by 
the \verb@predNuPoP@ function of \Rpackage{NuPoP}.

<<fig=TRUE>>=
predNuCpos(file = system.file("extdata", "TRP1ARS1x3.fasta", 
    package = "nuCpos"), species = "sc", smoothHBA = TRUE, 
    ActLikePredNuPoP = TRUE)
results.TRP1ARS1.smooth <- readNuPoP("TRP1ARS1x3.fasta_Prediction4.txt", 
    startPos = 1466, endPos = 2930)

par(mfrow = c(2, 1))
plot(x = 1:1465, y = results.TRP1ARS1[,3], type = "n", 
    ylim = c(-0.05, 1), xlab = "Position from the EcoRI site #1", 
    ylab = "probability/occupancy")
title("Occupancy(grey)/probability(blue)/Viterbi(red)")
polygon(c(1, 1:1465, 1465), c(0, results.TRP1ARS1[,3], 0), col = 8)
points(x = 1:1465, y = results.TRP1ARS1[,4], type = "l", col = 2)
points(x = 1:1465, y = results.TRP1ARS1[, 2], type = "h", col = 4)

plot(x = 1:1465, y = results.TRP1ARS1[,5], type = "n", 
    xlab = "Position from the EcoRI site #1", 
    ylab = "HBA", main = "HBA(red)/smoothed HBA(blue)")
points(x = 1:1465, y = results.TRP1ARS1[,5], type = "l", col = 2)
points(x = 1:1465, y = results.TRP1ARS1.smooth[,5], type = "l", col = 4)
@

As shown as a red line in the bottom one of the above plots, 
non-smoothed HBA scores in eukaryotic sequences exhibit 
about 10-bp periodicity. The dyads of predicted nucleosomes usually 
locate at the coordinates with high HBA scores. HBA scores in 
the output of \verb@predNuCpos@ can be standardized as being done 
by the \verb@predNuPoP@ function of \Rpackage{NuPoP} by specifying 
the argument $\Rfunarg{std}$ of \verb@predNuCpos@ as TRUE. 
The default setting for $\Rfunarg{std}$ is FALSE.

When the argument $\Rfunarg{ActLikePredNuPoP}$ is set as FALSE, 
which is the default setting, 
\verb@predNuCpos@ receives a character string or DNAString object 
as $\Rfunarg{inseq}$. 
In this case, prediction results will be returned to the 
R environment, and no file will be generated in the working directory.
The input sequence ($\Rfunarg{inseq}$) must not contain characters 
other than A/C/G/T. 

The results will contain five columns: 
\begin{enumerate}
    \item{\Robject{pos}: position in the input DNA sequence.}
    \item{\Robject{pstart}:  probability that a nucleosome starts at.}
    \item{\Robject{nucoccup}:    nucleosome occupancy score.}
    \item{\Robject{viterbi}:      Viterbi path (1 and 0 for the 
        nucleosome and linker states, respectively).}
    \item{\Robject{affinity}: histone binding affinity score.} 
\end{enumerate}

<<>>=
TRP1ARS1 <- paste(scan(file = 
    system.file("extdata", "TRP1ARS1x1.fasta", package = "nuCpos"), 
    what = character(), skip = 1), sep = "", collapse = "")
results.TRP1ARS1.internal <- 
    predNuCpos(inseq = TRP1ARS1, species = "sc", smoothHBA = FALSE, 
    ActLikePredNuPoP = FALSE)
results.TRP1ARS1.internal[72:76,]
@


\section{Histone binding affinity score calculation with HBA}

HBA score can be calculated for a given 147-bp sequence with the 
\verb@HBA@ function. In the examples bellow, a character string 
object \Robject{inseq} and a DNAString object \Robject{INSEQ} with 
the same 147-bp DNA sequences are given to \verb@HBA@. Note: the 
\Rpackage{Biostrings} package is required for the latter case.

<<>>=
load(system.file("extdata", "inseq.RData", package = "nuCpos"))
HBA(inseq = inseq, species = "sc")
for(i in 1:3) cat(substr(inseq, start = (i-1)*60+1, 
    stop = (i-1)*60+60), "\n")
load(system.file("extdata", "INSEQ_DNAString.RData", 
    package = "nuCpos"))
INSEQ
HBA(inseq = INSEQ, species = "sc")
@

The argument $\Rfunarg{inseq}$ is the character string object to 
be given. Alternatively, a DNAString object can be used here. 
The length of DNA must be 147 bp. The argument $\Rfunarg{species}$ 
can be specified as follows: mm = \textit{M. musculus}; 
sc = \textit{S. cerevisiae}; sp = \textit{S. pombe}. 


\section{Local histone binding affinity score calculation 
    with localHBA}

Local HBA scores are defined as HBA scores for 13 overlapping 
subnucleosomal segments named A to M. They can be calculated for 
a given 147-bp sequence with the \verb@localHBA@ function. 
Like \verb@HBA@, this function can receive either a character 
string object or a DNAString object. The segment G corresponds to 
the central 21 bp region, in which the dyad axis passes through 
the 11th base position. This means that the local HBA score for 
the G segment implies the relationship between DNA and histone 
proteins at around superhelical locations -0.5 and +0.5. 
The neighboring F segment, which is 20 bp in length, 
is for SHLs -1.5 and -0.5. The result of example run shown below 
suggests that subsequence of \Robject{inseq} around SHL -3.5 and 
-2.5 is suitable for nucleosome formation. 


<<fig=TRUE>>=
localHBA(inseq = inseq, species = "sc")
barplot(localHBA(inseq = inseq, species = "sc"), 
    names.arg = LETTERS[1:13], xlab = "Nucleosomal subsegments", 
    ylab = "local HBA", main = "Local HBA scores for inseq")
@


\section{Acknowledgements}
We would like to thank Drs. Shimizu, Fuse and Ichikawa for 
sharing DNA sequences and \textit{in vivo} data, 
and giving fruitful comments. 
We would like to thank Dr. Ji-Ping Wang and his colleagues
for distributing NuPoP under the GPL-2 license. 
In this package, their excellent code for dHMM-based prediction 
was adapted for chemical map-based prediction to demonstrate 
the usefulness of chemical maps in prediction. 
As we noticed that canceling of HBA smoothing helps predicting 
rotational settings, predNuCpos provides this option. 
However, for those who want to predict nucleosome occupancy in 
the original way with chemical maps, we encourage users to 
use NuPoP functions as it now provides 
chemical map-based predictions. 
In our functions HBA and localHBA, their excellent code was 
also adapted to calculate the scores of given 147-bp sequences 
independently of the genomic context.

\bibliographystyle{apalike}
%\bibliography{nuCposBib.bib}

\begin{thebibliography}{}

\bibitem[Wang et~al., 2008]{stat:WangWidom2008}
    Wang JP, Fondufe-Mittendorf Y, Xi L, Tsai GF, Segal E and 
    Widom J (2008).
\newblock Preferentially quantized linker {DNA} lengths in 
    \textit{Saccharomyces cerevisiae}.
\newblock {\em PLoS Computational Biology}, 4(9):e1000175.

\bibitem[Xi et~al., 2010]{stat:XiWang2010}
    Xi L, Fondufe-Mittendorf Y, Xia L, Flatow J, Widom J and 
    Wang JP (2010).
    \newblock Predicting nucleosome positioning using a 
        duration hidden markov model.
    \newblock {\em BMC Bioinformatics}, 11:346.

\bibitem[Brogaard et~al., 2012]{stat:Brogaard2012}
    Brogaard K, Xi L, and Widom J (2012).
    \newblock A map of nucleosome positions in yeast at 
        base-pair resolution.
    \newblock {\em Nature}, 486(7404):496-501.

\bibitem[Moyle-Heyrman et~al., 2012]{stat:Moyle-Heyrman2013}
    Moyle-Heyrman G, Zaichuk T, Xi L, Zhang Q, Uhlenbeck OC, 
        Holmgren R, Widom J and Wang JP (2013).
    \newblock Chemical map of \textit{Schizosaccharomyces pombe} 
        reveals species-specific features in nucleosome positioning.
    \newblock {\em Proc. Natl. Acad. Sci. U. S. A.}, 
        110(50):20158-63.

\bibitem[Ichikawa et~al., 2014]{stat:Ichikawa2014}
    Ichikawa Y, Morohoshi K, Nishimura Y, Kurumizaka H and 
        Shimizu M (2014).
    \newblock Telomeric repeats act as nucleosome-disfavouring 
        sequences in vivo.
    \newblock {\em Nucleic Acids Res.}, 42(3):1541-1552.

\bibitem[Voong et~al., 2016]{stat:Voong2016}
    Voong LN, Xi L, Sebeson AC, Xiong B, Wang JP and Wang X (2016).
    \newblock Insights into Nucleosome Organization in 
        Mouse Embryonic Stem Cells through Chemical Mapping.
    \newblock {\em Cell}, 167(6):1555-1570.

\bibitem[Fuse et~al., 2017]{stat:Fuse2017}
    Fuse T, Katsumata K, Morohoshi K, Mukai Y, Ichikawa Y, 
        Kurumizaka H, Yanagida A, Urano T, Kato H, and Shimizu M (2017).
    \newblock Parallel mapping with site-directed hydroxyl radicals 
        and micrococcal nuclease reveals structural features of 
        positioned nucleosomes in vivo.
    \newblock {\em Plos One}, 12(10):e0186974.


\end{thebibliography}


\end{document}