%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{vignette source}
%\VignetteEncoding{UTF-8}
\documentclass{article}
\setlength\parindent{0pt}
\usepackage{amsmath,amssymb,tikz,hyperref}
\usetikzlibrary{positioning}
\hypersetup{colorlinks = true,allcolors = blue}
\title{semisup: detecting SNPs with interactive effects on a quantitative trait}
\date{\today}
\author{\textbf{A Rauschenberger}, \textbf{RX Menezes}, \textbf{MA van de Wiel}, \\ \textbf{NM van Schoor}, and \textbf{MA Jonker}}
\bibliographystyle{plain}

\begin{document}
\maketitle

<<echo=FALSE>>=
options(width=60)
knitr::knit_hooks$set(document = function(x) {sub('\\usepackage[]{color}', '\\usepackage{xcolor}', x, fixed = TRUE)})
@

This vignette explains how to use the R~package \textbf{semisup}. Use the function \hyperref[mixtura]{\mbox{mixtura}} for model fitting, and the function \hyperref[scrutor]{\mbox{scrutor}} for hypothesis testing.

%-------------------------------------------------------------------------------
\vspace{0cm} \section{Initialisation}%------------------------------------------
%-------------------------------------------------------------------------------

Start with installing \textbf{semisup} from Bioconductor\footnote{\href{https://cran.r-project.org/web/packages/devtools/README.html}{devtools} and \href{https://github.com/rauschenberger/semisup}{GitHub}: \texttt{devtools::install\_github("rauschenberger/semisup")}}:

<<eval=FALSE>>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("semisup")
@

Then load and attach the package:

<<>>=
library(semisup)
@

If you want to reproduce the examples, you should attach the toy database:

<<eval=FALSE>>=
attach(toydata)
@

<<echo=FALSE>>=
names <- names(toydata)
for(i in seq_along(names)){
    assign(names[i],toydata[[i]])
}
rm(names)
@

The following commands access the reference manual:

<<eval=FALSE>>=
help(semisup)
?semisup
@

%-------------------------------------------------------------------------------
\newpage \section{Scope}%-------------------------------------------------------
%-------------------------------------------------------------------------------

Data is available for $n$~samples. Let $\boldsymbol{y}={(y_1,\ldots,y_n)^T}$ represent the observations, $\boldsymbol{x}={(x_1,\ldots,x_n)^T}$ the groups, and $\boldsymbol{z}={(z_1,\ldots,z_n)^T}$ the classes. We assume all observations from the \textit{labelled} group are in class~\textcolor{blue}{A}, and those from the \textit{unlabelled} group are in class~\textcolor{blue}{A} or in class~\textcolor{red}{B}.
\begin{table}[!h]
\centering
\begin{tabular}{l|ccc|ccccc}
& $1$ & \ldots & $s$ & $s+1$ & \ldots & $n=s+u$ \\ \hline
$\boldsymbol{y}$ & $y_1$ & \ldots & $y_{s}$ & $y_{s+1}$ & \ldots & $y_{s+u}$ \\
$\boldsymbol{x}$ & $0$ & \ldots & $0$ & $1$ & \ldots & $1$ \\
$\boldsymbol{z}$ & \textcolor{blue}{A} & \ldots & \textcolor{blue}{A} & \textcolor{blue}{A}/\textcolor{red}{B} & \ldots & \textcolor{blue}{A}/\textcolor{red}{B}
\end{tabular}
\label{table}
\caption{Observations $\boldsymbol{y}$, groups $\boldsymbol{x}$, and classes $\boldsymbol{z}$. Here, the first~$s$ observations are \textit{labelled} (class~\textcolor{blue}{A}), and the last~$u$ observations are \textit{unlabelled} (class \textcolor{blue}{A} or \textcolor{red}{B}).}
\end{table}

We assume all observations come from the same probability distribution, but with different parameters for the two classes:
\begin{align*}
& Y_i | (Z_i = \textcolor{blue}{\mathrm{A}}) \sim F(\boldsymbol{\cdot},\textcolor{blue}{\boldsymbol{\theta_a}}), \\
& Y_i | (Z_i = \textcolor{red}{\mathrm{B}}) \sim {F(\boldsymbol{\cdot},\textcolor{red}{\boldsymbol\theta_b}}).
\end{align*}

The mixing proportion~$\tau$ is the probability that a random \textit{unlabelled} observations is in class~\textcolor{red}{B}. It is of interest to test whether $\tau$ is significantly larger than zero.
\begin{align*}
& \tau =~ \mathbb{P}[Z_i=\textcolor{red}{\mathrm{B}}|X_i=1], \\
& H_0:~ \tau = 0, \\
& H_1:~ \tau > 0.
\end{align*}

The function \hyperref[mixtura]{mixtura} estimates the unknown parameters ($\textcolor{blue}{\boldsymbol{\theta_a}}$, $\textcolor{red}{\boldsymbol{\theta_b}}$, $\tau$) and predicts the missing class labels in $\boldsymbol{z}=(z_1,\ldots,z_n)^T$. The function \hyperref[scrutor]{scrutor} tests homogeneity (${\tau=0}$) against heterogeneity (${\tau>0}$).

%-------------------------------------------------------------------------------
\newpage \section{Model fitting} \label{mixtura} %------------------------------
%-------------------------------------------------------------------------------

Observing two groups of observations, we assume the \textit{labelled} observations are in class~\textcolor{blue}{A}, and the \textit{unlabelled} observations are in class~\textcolor{blue}{A} or in class~\textcolor{red}{B}.

<<echo=FALSE,fig.width='\\linewidth',fig.height=3>>=
par(mar=(c(2,4,1,5)))
plot.new()
plot.window(xlim=c(1,length(y)),ylim=range(y))
box()
axis(side=2)
title(ylab="y")
mtext(text="labelled",at=25,side=1)
mtext(text="unlabelled",at=75,side=1)
abline(v=50.5,lty=2)
points(y,col=rep(c("blue","black"),each=50))
@

The function \hyperref[mixtura]{mixtura} estimates the unknown parameters and predicts the missing class labels:

<<>>=
fit <- mixtura(y,z)
@

<<echo=FALSE>>=
mean0 <- round(fit$estim1$mean0,0)
sd0 <- round(fit$estim1$sd0,0)
mean1 <- round(fit$estim1$mean1,0)
sd1 <- round(fit$estim1$sd1,0)
tau <- round(fit$estim1$p1,2)
@

Here, \Sexpr{100*tau}\% of the \textit{unlabelled} observations are assigned to class \textcolor{red}{B}, and all other observations are assigned to class \textcolor{blue}{A}:

<<results='hide'>>=
class <- round(fit$posterior)
@

% alternative hypothesis
<<echo=FALSE,fig.width='\\linewidth',fig.height=3>>=
par(mar=(c(2,4,1,5)))
plot.new()
plot.window(xlim=c(1,length(y)),ylim=range(y))
box()
axis(side=2)
title(ylab="y")
mtext(text="labelled",side=1,at=25)
mtext(text="unlabelled",side=1,at=75)
abline(v=50.5,lty=2)
points(y,col=ifelse(class,"red","blue"))

# component zero
text <- paste("N(",mean0,",",sd0,")",sep="")
mtext(text=text,side=4,at=0,line=0.5,las=1,col="blue")

# mixing proportion
text <- bquote(tau==.(tau))
mtext(text=text,side=4,at=2,line=0.5,las=2)

# component one
text <- paste("N(",mean1,",",sd1,")",sep="")
mtext(text=text,side=4,at=4,line=0.5,las=1,col="red")
@

These are the parameter estimates:
<<results='hide'>>=
fit$estim1
@

%-------------------------------------------------------------------------------
\newpage \section{Hypothesis testing} \label{scrutor}%--------------------------
%-------------------------------------------------------------------------------

Under the null hypothesis, all observations are in class~\textcolor{blue}{A}. Under the alternative hypothesis, some \textit{unlabelled} observations are in class~\textcolor{red}{B}. \newline

The function \hyperref[mixtura]{mixtura} not only fits the model under the alternative hypothesis (see above), but also under the null hypothesis:

<<results='hide'>>=
fit$estim0
@

<<echo=FALSE,fig.width='\\linewidth',fig.height=3>>=
par(mar=(c(2,4,1,5)))
plot.new()
plot.window(xlim=c(1,length(y)),ylim=range(y))
box()
axis(side=2)
title(ylab="y")
mtext(text="labelled",side=1,at=25)
mtext(text="unlabelled",side=1,at=75)
abline(v=50.5,lty=2)
points(y,col="blue")

# estimates
mean0 <- round(fit$estim0$mean0,0)
sd0 <- round(fit$estim0$sd0,0)

# component zero
text <- paste("N(",mean0,",",sd0,")",sep="")
mtext(text=text,side=4,at=0,line=0.5,las=1,col="blue")

# mixing proportion
text <- bquote(tau==0)
mtext(text=text,side=4,at=2,line=0.5,las=2)

# component one
text <- paste("N(-,-)",sep="")
mtext(text=text,side=4,at=4,line=0.5,las=1,col="red")
@

Because the null distribution of the likelihood-ratio test statistic is unknown, we compare the hypotheses by resampling. The function \hyperref[scrutor]{scrutor} uses parametric bootstrapping or permutation:

<<eval=FALSE>>=
scrutor(y,z)
@

If the \mbox{$p$-value} is less than or equal to the significance level, we reject the null hypothesis in favour of the alternative hypothesis.

%-------------------------------------------------------------------------------
\vspace{3cm} \section*{Options} %-----------------------------------------------
%-------------------------------------------------------------------------------

% \hspace{0.1cm} \textbf{Further arguments}\newline
\fbox{
\begin{minipage}{\textwidth}
The functions \hyperref[mixtura]{mixtura} and \hyperref[scrutor]{scrutor} have similar arguments. Set \texttt{dist} equal to \texttt{"norm"} or \texttt{"nbinom"} to choose between the Gaussian and the negative binomial distributions. In the latter case, optionally provide a dispersion estimate \texttt{phi} or an offset \texttt{gamma}. All other arguments are technical.
\end{minipage}}

%-------------------------------------------------------------------------------
\newpage \section{Application}%-------------------------------------------------
%-------------------------------------------------------------------------------

%-------------------------------------------------------------------------------
\subsection{Data preparation}%--------------------------------------------------
%-------------------------------------------------------------------------------

Let $n$ be the sample size, $q$ the number of quantitative traits, and $p$ the number of single nucleotide polymorphisms (\textsc{snp}s).

\begin{itemize}

\item Transform the quantitative trait to a vector of length~$n$, or transform the quantitative traits to a matrix with $n$~rows (samples) and $q$~columns (variables).

\item Transform the \textsc{snp} to a vector of length~$n$, or transform the \textsc{snp}s to a matrix with $n$~rows (samples) and $p$~columns (variables).

\item Binarise the \textsc{snp}(s), indicating the \textit{labelled} group by zero, and the \textit{unlabelled} group by a missing value.

\end{itemize}

For example, assign observations with zero minor alleles to the \textit{labelled} group, and those with one or two minor alleles to the \textit{unlabelled} group:

\vspace{0.4cm}
\begin{minipage}[left]{0.48\textwidth}
<<eval=FALSE>>=
n <- length(snp)


z <- rep(NA,times=n)
z[snp==0] <- 0
@
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}[right]{0.48\textwidth}
<<eval=FALSE>>=
n <- nrow(SNPs)
p <- ncol(SNPs)

Z <- matrix(NA,nrow=n,ncol=p)
Z[SNPs==0] <- 0 
@
\end{minipage}
\vspace{0.1cm}

%-------------------------------------------------------------------------------
\subsection{Test of association}%-----------------------------------------------
%-------------------------------------------------------------------------------

Use \hyperref[scrutor]{scrutor} to test for association between a quantitative trait (vector) and a \textsc{snp} (vector). The function returns a test statistic and a \mbox{$p$-value}.

\vspace{0.3cm}
\begin{minipage}{0.48\textwidth}
\begin{equation*}
\boldsymbol{y} = 
\begin{pmatrix}
y_{1} \\
y_{2} \\
\vdots  \\
y_{n} 
\end{pmatrix}
\qquad
\boldsymbol{z} = 
\begin{pmatrix}
z_{1} \\
z_{2} \\
\vdots  \\
z_{n},
\end{pmatrix}
\end{equation*}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}{0.48\textwidth}
\begin{center}
\begin{tikzpicture}[every node/.style={circle,draw}]
    \node (y) at (0,0) {$\boldsymbol{y}$};
    \node (z) at (2,0) {$\boldsymbol{z}$};
    \path[thick]
    (y) edge (z);
\end{tikzpicture}
\end{center}
\end{minipage}
\vspace{0.3cm}

<<eval=FALSE>>=
scrutor(y,z)
@

%-------------------------------------------------------------------------------
\newpage \subsection{Genome-wide association study}%----------------------------
%-------------------------------------------------------------------------------

Use \hyperref[scrutor]{scrutor} to test for association between a quantitative trait (vector) and several \textsc{snp}s (matrix). For each \textsc{snp}, the function returns a test statistic and a \mbox{$p$-value}.

\vspace{0.3cm}
\begin{minipage}{0.49\textwidth}
\begin{equation*}
\boldsymbol{y} = 
\begin{pmatrix}
y_{1} \\
y_{2} \\
\vdots  \\
y_{n} 
\end{pmatrix}
\end{equation*}
\begin{equation*}
\underset{n \times p}{\boldsymbol{Z}} = 
\begin{pmatrix}
z_{1,1} & z_{1,2} & \cdots & z_{1,p} \\
z_{2,1} & z_{2,2} & \cdots & z_{2,p} \\
\vdots  & \vdots  & \ddots & \vdots  \\
z_{n,1} & z_{n,2} & \cdots & z_{n,p} 
\end{pmatrix}
\end{equation*}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}{0.49\textwidth}
\vspace{0.2cm}
\begin{center}
\begin{tikzpicture}[every node/.style={circle,draw}]
    \node (y)               at (0,0) {$\boldsymbol{y}$};
    \node (z1)              at (-2,-2.3) {$\boldsymbol{z_1}$};
    \node (z2)              at (-1,-2.3) {$\boldsymbol{z_2}$};
    \node (z3)              at (0,-2.3) {$\boldsymbol{z_3}$};
    \node[draw=none] (dots) at (1,-2.3) {$\cdots$};
    \node (zp)              at (2,-2.3) {$\boldsymbol{z_p}$};
    \path[thick]
    (y) edge (z1)
    (y) edge (z2)
    (y) edge (z3)
    (y) edge (zp);
\end{tikzpicture}
\end{center}
\end{minipage}
\vspace{0.3cm}

<<eval=FALSE>>=
scrutor(y,Z)
@

%-------------------------------------------------------------------------------
\subsection{Differential expression analysis}%----------------------------------
%-------------------------------------------------------------------------------

Use \hyperref[scrutor]{scrutor} to test for association between several quantitative traits (matrix) and a \textsc{snp} (vector). For each quantitative trait, the function returns a test statistic and a \mbox{$p$-value}.

\vspace{0.3cm}
\begin{minipage}{0.49\textwidth}
\begin{equation*}
\underset{n \times q}{\boldsymbol{Y}} = 
\begin{pmatrix}
y_{1,1} & y_{1,2} & \cdots & y_{1,q} \\
y_{2,1} & y_{2,2} & \cdots & y_{2,q} \\
\vdots  & \vdots  & \ddots & \vdots  \\
y_{n,1} & y_{n,2} & \cdots & y_{n,q} 
\end{pmatrix}
\end{equation*}
\begin{equation*}
\boldsymbol{z} = 
\begin{pmatrix}
z_{1} \\
z_{2} \\
\vdots  \\
z_{n} 
\end{pmatrix}
\end{equation*}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}{0.49\textwidth}
\vspace{0.4cm}
\begin{center}
\begin{tikzpicture}[every node/.style={circle,draw}]
    \node (z)               at (0,0) {$\boldsymbol{z}$};
    \node (y1)              at (-2,2.4) {$\boldsymbol{y_1}$};
    \node (y2)              at (-1,2.4) {$\boldsymbol{y_2}$};
    \node (y3)              at (0,2.4) {$\boldsymbol{y_3}$};
    \node[draw=none] (dots) at (1,2.4) {$\cdots$};
    \node (yq)              at (2,2.4) {$\boldsymbol{y_q}$};
    \path[thick]
    (z) edge (y1)
    (z) edge (y2)
    (z) edge (y3)
    (z) edge (yq);
\end{tikzpicture}
\end{center}
\end{minipage}
\vspace{0.3cm}

<<eval=FALSE>>=
scrutor(Y,z)
@

%-------------------------------------------------------------------------------
\subsection{Expression quantitative trait loci analysis}%-----------------------
%-------------------------------------------------------------------------------

Use \hyperref[scrutor]{scrutor} to test for association between several quantitative traits (matrix) and several \textsc{snp}s (matrix). If their numbers are different, all pairwise combinations are considered. If their numbers are equal, a one-to-one correspondence is assumed. For each combination, the function returns a test statistic and a \mbox{$p$-value}.

\vspace{0.3cm}
\begin{minipage}{0.49\textwidth}
\begin{equation*}
\underset{n \times q}{\boldsymbol{Y}} = 
\begin{pmatrix}
y_{1,1} & y_{1,2} & \cdots & y_{1,q} \\
y_{2,1} & y_{2,2} & \cdots & y_{2,q} \\
\vdots  & \vdots  & \ddots & \vdots  \\
y_{n,1} & y_{n,2} & \cdots & y_{n,q} 
\end{pmatrix}
\end{equation*}
\begin{equation*}
\underset{n \times p}{\boldsymbol{Z}} = 
\begin{pmatrix}
z_{1,1} & z_{1,2} & \cdots & z_{1,p} \\
z_{2,1} & z_{2,2} & \cdots & z_{2,p} \\
\vdots  & \vdots  & \ddots & \vdots  \\
z_{n,1} & z_{n,2} & \cdots & z_{n,p} 
\end{pmatrix}
\end{equation*}
\end{minipage}
\hspace{0.02\textwidth}
\begin{minipage}{0.49\textwidth}
\begin{center}
\vspace{0.4cm}
\begin{tikzpicture}[every node/.style={circle,draw}]
    \node (y1)              at (0,0) {$\boldsymbol{y_1}$};
    \node (y2)              at (1,0) {$\boldsymbol{y_2}$};
    \node (y3)              at (2,0) {$\boldsymbol{y_3}$};
    \node[draw=none] (dots) at (3,0) {$\cdots$};
    \node (yq)              at (4,0) {$\boldsymbol{y_q}$};
    \node (z1)              at (0,-2.4) {$\boldsymbol{z_1}$};
    \node (z2)              at (1,-2.4) {$\boldsymbol{z_2}$};
    \node (z3)              at (2,-2.4) {$\boldsymbol{z_3}$};
    \node[draw=none] (dots) at (3,-2.4) {$\cdots$};
    \node (zp)              at (4,-2.4) {$\boldsymbol{z_p}$};
    \path[thick]
    (y1) edge[gray] (z2)
    (y1) edge[gray] (z3)
    (y1) edge[gray] (zp)
    (y2) edge[gray] (z1)
    (y2) edge[gray] (z3)
    (y2) edge[gray] (zp)
    (y3) edge[gray] (z1)
    (y3) edge[gray] (z2)
    (y3) edge[gray] (zp)
    (yq) edge[gray] (z1)
    (yq) edge[gray] (z2)
    (yq) edge[gray] (z3)
    (y1) edge[black] (z1)
    (y2) edge[black] (z2)
    (y3) edge[black] (z3)
    (yq) edge[black] (zp);
\end{tikzpicture}
\end{center}
\end{minipage}
\vspace{0.3cm}

<<eval=FALSE>>=
scrutor(Y,Z)
@

%-------------------------------------------------------------------------------
\section*{References}%----------------------------------------------------------
%-------------------------------------------------------------------------------

The R package \texttt{semisup} is based on Rauschenberger et al.~\cite{Rauschenberger2018}, where detailed references to previous work are given. If you use \texttt{semisup} for publications, please cite Rauschenberger et al.~\cite{Rauschenberger2018}. \newline

Consider shrinkage estimation (Robinson et al.~\cite{Robinson2008}) and scale normalisation (Robinson et al.~\cite{Robinson2010}) to improve the negative binomial mixture model (R package \href{http://bioconductor.org/packages/edgeR/}{edgeR}). Use the non-parametric mixture test (van Wieringen et al.~\cite{VanWieringen2008}) to increase robustness against outliers (R package \href{http://www.few.vu.nl/~wvanwie/software/TestForPDE/TestForPDE.html}{PDGEtest}).

\begingroup
\renewcommand{\section}[2]{}
\bibliography{references}
\endgroup

\end{document}