---
title: "SCBN Tutorial"
author: "Yan Zhou"
package: SCBN
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{SCBN Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


# Introduction
This package provides a statistical normalization method and differential 
expression analysis for RNA-seq data between different species. It consider
the different gene lengths and unmapped genes between species, and formulate 
the problem from the perspective of hypothesis testing and search for an 
optimal scaling factor that minimizes the deviation between the empirical and
nominal type I errors. The methods on this package are described in the article 
*A statistical normalization method and differential expression analysis for 
RNA-seq data between different species* by Yan Zhou, Jiadi Zhu, Tiejun Tong, 
Junhui Wang, Bingqing Lin, Jun Zhang (2018, pending publication).

The scale based normalization (SCBN) method is developed for analyzing cross 
species RNE-seq data, which is different from the data in same species. We have 
implemented the SCBN method via a set of R functions. We make a R package named 
SCBN, and give a tutorial for the package. The method consist three steps.

Step 1: Data Pre-processing;  

Step 2: Calculating scaling factor for data;

Step 3: Calculating p-values for each orthologous genes and select significants.

We employ a simulation dataset to illustrate the usage of the SCBN package.
The programs can run under the Linux system and windows 10 system. The
R versions should larger than 3.5.0.

# Preparations
To install the SCBN package into your R environment, start R and
enter:
```{r, eval=FALSE}
install.packages("BiocManager")
BiocManager::install("SCBN")
```
Then, the SCBN package is ready to load.
```{r}
library(SCBN)
```

# Data format
In order to reproduce the presented SCBN workflow, the package includes the 
example data sets, which is generated by function *generateDataset()*. The 
next we will give an example for how to generate simulation dataset:

```{r}
data(orthgenes)
orthgenes[, 6:9] <- round(orthgenes[, 6:9])
orthgenes1 <- orthgenes[!(is.na(orthgenes[,6])|is.na(orthgenes[,7])|
                        is.na(orthgenes[,8])|is.na(orthgenes[,9])), ]
sim_data <- generateDataset(commonTags=5000, uniqueTags=c(100, 300),
                            unmapped=c(400, 200),group=c(1, 2),
                            libLimits=c(.9, 1.1)*1e6,
                            empiricalDist=orthgenes1[, 6],
                            genelength=orthgenes1[, 2], randomRate=1/100,
                            pDifferential=.05, pUp=.5, foldDifference=2)
```

There are two dataset in the data subdirectory of SCBN package, that is 
**orthgenes.RData** and **sim_data**. **orthgenes.RData** is a real dataset 
come from *The evolution of gene expression levels in mammalian organs* by 
Brawand D, Soumillon M, Necsulea A, Julien P, Csardi G, Harrigan P, et al.
Nature. 2011;478:343-348. **sim_data** is a simulation dataset, which is 
generated by *generateDataset()* according to the previous steps.

```{r}
data(orthgenes)
head(orthgenes)
```

**orthgenes.RData** includes 9 columns, 
* the first and third columns are the ID for two species;
* the second and fourth columns are the gene length for two species;
* from sixth to ninth columns are RNA-seq reads for two species.

```{r}
data(sim_data)
head(sim_data)
```

**sim_data** includes 4 columns,
* the first and third columns are gene length for two species;
* the second and fourth columns are reads for two species.

# Calculate scaling factor for data
For different species, we need to consider not only the total read counts but 
also the different gene numbers and gene lengths. Therefore, we propose SCBN
method, by utilizing the available knowledge of housekeeping genes, it get the 
optimal scaling factor by minimizing the deviation between the empirical and the 
nominal type I error. 

```{r}
data(sim_data)
factor <- SCBN(orth_gene=sim_data, hkind=1:1000, a=0.05)
factor
```

The output for *SCBN()* is "*median_val*" and "*scbn_val*". *median_val* is a
scaling factor which is calculated by method in *The evolution of gene 
expression levels in mammalian organs* by Brawand D, Soumillon M, Necsulea A, 
Julien P, Csardi G, Harrigan P, et al. Nature. 2011;478:343-348. *scbn_val* is 
a scaling factor which in our paper. We assess the performance of
two method, and find the proposed method outperforms the other method.


# Calculate p-values for each orthologous genes and select significants
Getting the scaling factor, we calculate p-values with adjusted sage.test 
function. The step is as follows:

```{r}
data(sim_data)
orth_gene <- sim_data
hkind <- 1:1000
scale <- factor$scbn_val
x <- orth_gene[, 2]
y <- orth_gene[, 4]
lengthx <- orth_gene[, 1]
lengthy <- orth_gene[, 3]
n1 <- sum(x)
n2 <- sum(y)
p_value <- sageTestNew(x, y, lengthx, lengthy, n1, n2, scale)
head(p_value)
```

Getting p-values for each orthologous genes, we choose differentially expressed 
orthologous genes with p-values.  We assume the genes to be differentially 
expressed genes when p-values less than pre-setting cutoffs, such as $10^{-3}$ 
or $10^{-5}$.