---
title: "BioQC-kidney: The kidney expression example"
date: "`r Sys.Date()`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{BioQC-kidney: The kidney expression example}
  %\usepackage[utf8]{inputenc}
output:
  rmarkdown::html_vignette:
    self_contained: no
  md_document:
    variant: markdown_phpextra
    preserve_yaml: TRUE
---

```{r setup, include=FALSE}
options(fig_caption=TRUE)
library(knitr)
opts_chunk$set(out.extra='style="display:block; margin: auto"', fig.align="center")
```

Introduction
------------
In this vignette, we demonstrate the use of *BioQC* with a case study where mouse kidney samples were profiled for gene expression.
Results of BioQC pointed to potential tissue heterogeneity caused by pancreas contamination which was confirmed by qRT-PCR experiments.


Importing the data
------------------

Let's first load the BioQC signatures using `readGmt`: 

```{r lib, warning=FALSE, message=FALSE, results="hide"}
library(BioQC)
gmtFile <- system.file("extdata/exp.tissuemark.affy.roche.symbols.gmt",
                        package="BioQC")
gmt <- readGmt(gmtFile)
```

The example data for this vignette is shipped as part of the BioQC package. 
Alternatively, it is available in the [GitHub repository](https://github.com/accio/BioQC/inst/extdata/kidney_example.rda). 
The expression data is stored in a [ExpressionSet](https://www.bioconductor.org/packages/release/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf). 

```{r load_data, warning=FALSE, message=FALSE, results="hide"}
file <- system.file("extdata/kidney_example.rda", package="BioQC")
load(file)
print(eset)
```

The dataset contains expression of `r nrow(eset)` genes in `r ncol(eset)` samples. The expression profile was normalized with RMA normalization. The signals were  also `log2`-transformed; however, this step does not affect the result of *BioQC* since it is essentially a non-parametric statistical test.


Running BioQC
-------------
Next we run the core function of the *BioQC* package, `wmwTest`, to perform the analysis.

```{r run_bioqc}
system.time(bioqcRes <- wmwTest(eset, gmt,
                                 valType="p.greater"))
```

The function returns *one-sided* $p$-values of Wilcoxon-Mann-Whitney test. 

For better visualization we 
 * filter the p-value matrix (a sensible cutoff depends on the actual dataset and the expected signatures);
 * transform the p-values using $|\log_{10} p|$. We refer to thse transformed values as *BioQC score*. 

```{r visualize_bioqc_prepare}
bioqcResFil <- filterPmat(bioqcRes, 1E-6)
bioqcAbsLogRes <- absLog10p(bioqcResFil)
```

We inspect the BioQC scores by visualizing them as Heatmap. 
By closer examination we find that the expression of pancreas and adipose specific genes is significantly enriched in samples 23-25: 

```{r heatmap, fig.width=8, fig.height=4, dev='png', fig.cap="BioQC scores (defined as abs(log10(p))) of the samples visualized in heatmap. Red and blue indicate high and low scores respectively."}
library(RColorBrewer)
heatmap(bioqcAbsLogRes, Colv=NA, Rowv=TRUE,
        cexRow=0.85, scale="row",
        col=rev(brewer.pal(7, "RdBu")),
        labCol=1:ncol(bioqcAbsLogRes))
```

Visual inspection reveals that there might be contaminations in samples 23-25, potentially by pancreas and adipose tissue. 
```{r vis_bioqc, fig.width=8, fig.height=4, dev='png', fig.cap="BioQC scores (defined as abs(log10(p))) of the samples. K and P represent kidney and pancreas signature scores respectively."}
filRes <- bioqcAbsLogRes[c("Kidney_NGS_RNASEQATLAS_0.6_3",
                           "Pancreas_Islets_NR_0.7_3"),]
matplot(t(filRes), pch=c("K", "P"), type="b", lty=1L,
        ylab="BioQC score", xlab="Sample index")
```




Validation with quantitative RT-PCR
-----------------------------------
To confirm the hypothesis generated by *BioQC*, we performed qRT-PCR experiments to test two pancreas-specific genes' expression in the same  set of samples. Note that the two genes (amylase and elastase) are not included in the signature set provided by BioQC.

The results are shown in the figure below. It seems likely that sample 23-25 are contaminated by nearby pancreas tissues when the kidney was dissected. Potential contamination by adipose tissues remains to be tested.

```{r rt_pcr_result, fig.width=8, fig.height=4, dev='png', fig.cap="Correlation between qRT-PCR results and BioQC pancreas score"}
amylase <- eset$Amylase
elastase <- eset$Elastase
pancreasScore <- bioqcAbsLogRes["Pancreas_Islets_NR_0.7_3",]
par(mfrow=c(1,2), mar=c(3,3,1,1), mgp=c(2,1,0))
plot(amylase~pancreasScore, log="y", pch=21, bg="red",
     xlab="BioQC pancreas score", ylab="Amylase")
text(pancreasScore[23:25],amylase[23:25], 23:25, pos=1)
plot(elastase~pancreasScore, log="y", pch=21, bg="red",
     xlab="BioQC pancreas score", ylab="Elastase")
text(pancreasScore[23:25],elastase[23:25], 23:25, pos=1)

```



Impact of sample removal on differential gene expression analysis
-----------------------------------------------------------------
In this study, four mice of the *FVB/NJ* strain received nephrectomy operation and treatment of Losartan, an angiotensin II receptor antagonist drug, and four mice reveiced an sham operation and Losartan. Within the Nephrectomy+Losartan group, one sample (index 24) is possibly contaminated by pancreas. Suppose now we are interested in the differential gene expression between the conditions. We now run the analysis twice, once with and once without the contaminated sample, to study the impact of removing heterogenous samples detected by *BioQC*.

```{r load_limma, warning=FALSE, message=FALSE, results="hide"}
library(limma)
```

```{r deg, fig.width=5, fig.height=5, dev='png', fig.cap="Log2 fold change (logFC) values reported by *limma* with one contaminated sample included (y-axis) or excluded (x-axis). Genes strongly affected by the contamination are indicated by red dots."}

isNeph <- with(pData(eset), Strain=="FVB/NJ" &
                 TREATMENTNAME %in% c("Nephrectomy-Losartan", "Sham-Losartan"))
isContam <- with(pData(eset), INDIVIDUALNAME %in% c("BN7", "FNL8", "FN6"))
esetNephContam <- eset[,isNeph]
esetNephExclContam <- eset[, isNeph & !isContam]
getDEG <- function(eset) {
  group <- factor(eset$TREATMENTNAME, levels=c("Sham-Losartan","Nephrectomy-Losartan"))
  design <- model.matrix(~group)
  colnames(design) <- c("ShamLo", "NephLo")
  contrast <- makeContrasts(contrasts="NephLo", levels=design)
  exprs(eset) <- normalizeBetweenArrays(log2(exprs(eset)))
  fit <- lmFit(eset, design=design)
  fit <- contrasts.fit(fit, contrast)
  fit <- eBayes(fit)
  tt <- topTable(fit, n=nrow(eset))
  return(tt)
}

esetNephContam.topTable <- getDEG(esetNephContam)
esetNephExclContam.topTable <- getDEG(esetNephExclContam)
esetFeats <- featureNames(eset)
esetNephTbl <- data.frame(feature=esetFeats,
                          OrigGeneSymbol=esetNephContam.topTable[esetFeats,]$OrigGeneSymbol,
                          GeneSymbol=esetNephContam.topTable[esetFeats,]$GeneSymbol,
                          Contam.logFC=esetNephContam.topTable[esetFeats,]$logFC,
                          ExclContam.logFC=esetNephExclContam.topTable[esetFeats,]$logFC)
par(mfrow=c(1,1), mar=c(3,3,1,1)+0.5, mgp=c(2,1,0))
with(esetNephTbl, smoothScatter(Contam.logFC~ExclContam.logFC,
                                xlab="Excluding one contaminating sample [logFC]",
                                ylab="Including one contaminating sample [logFC]"))
abline(0,1)
isDiff <- with(esetNephTbl, abs(Contam.logFC-ExclContam.logFC)>=2)
with(esetNephTbl, points(Contam.logFC[isDiff]~ExclContam.logFC[isDiff], pch=16, col="red"))
diffTable <- esetNephTbl[isDiff,]
diffGenes <- unique(diffTable[,"GeneSymbol"])
pancreasSignature <- gmt[["Pancreas_Islets_NR_0.7_3"]]$genes
diffGenesPancreas <- diffGenes %in% pancreasSignature
diffTable$isPancreasSignature <- diffTable$GeneSymbol %in% pancreasSignature
colnames(diffTable) <- c("Probeset", "GeneSymbol", "Human ortholog",
                         "Log2FC", "Log2FC (excl. contam.)",
                         "IsPancreasSignature")
diffTable <- diffTable[order(diffTable$Log2FC, decreasing=TRUE),]
```

We found that `r nrow(diffTable)` probesets representing `r length(diffGenes)` genes are associated with much stronger expression changes if the contaminated sample is not excluded (table below). Not surprisingly almost all of these genes are highly expressed in normal human pancreas tissues, and `r sum(diffGenesPancreas)` genes belong to the pancreas signature used by *BioQC*. 

```{r kable_deg, include=FALSE}
kable(diffTable, caption="Genes that are identified as strongly changed ony if the contaminated sample is included.")
```

In summary, we observe that tissue heterogeneity can impact down-stream analysis results and negatively affect reproducibility of gene expression data if it remains overlooked. It underlines again the value of applying *BioQC* as a first-line quality control tool.




R Session Info
----------------
```{r session_info}
sessionInfo()
```