---
title: "phantasusLite tutorial"
output:
BiocStyle::html_document:
toc: true
toc_float: false
vignette: >
%\VignetteIndexEntry{phantasusLite tutorial}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Introduction
The `phantasusLite` package contains a set functions that facilitate working with public
gene expression datasets originally developed for [phantasus package](https://bioconductor.org/packages/phantasus). Unlike `phantasus`, `phantasusLite` aims to limit the amount of dependencies.
The current functionality includes:
* Providing an interface to precomputed RNA-seq gene counts from ARCHS4 and DEE2 projects stored at remote HSDS repositories.
* Inferring sample groups for cases when they are not provided in the original metadata.
* Saving and loading expression matrices from GCT format.
# Installation
It is recommended to install the release version of the package from Bioconductor using the following commands:
```{r eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phantasusLite")
```
Alternatively, the most recent version of the package can be installed from the GitHub repository:
```{r message=FALSE, eval=FALSE}
library(devtools)
install_github("ctlab/phantasusLite")
```
# Loading precomputed RNA-seq counts
```{r message=FALSE, warning=FALSE}
library(GEOquery)
library(phantasusLite)
```
Let's load dataset GSE53053 from GEO using `GEOquery` package:
```{r message=FALSE}
ess <- getGEO("GSE53053")
es <- ess[[1]]
```
RNA-seq dataset from GEO do not contain the expression matrix, thus `exprs(es)` is empty:
```{r}
head(exprs(es))
```
However, a number of precomputed gene count tables are available at HSDS server '<https://alserglab.wustl.edu/hsds/>'. It features HDF5 files with counts
from ARCHS4 and DEE2 projects:
```{r}
url <- 'https://alserglab.wustl.edu/hsds/?domain=/counts'
getHSDSFileList(url)
```
GSE53053 dataset was sequenced from *Mus musculus* and we can get an expression matrix
from the corresponding HDF5-file with DEE2 data:
```{r}
file <- "dee2/mmusculus_star_matrix_20240409.h5"
es <- loadCountsFromH5FileHSDS(es, url, file)
head(exprs(es))
```
Normally `loadCountsFromHSDS` can be used to automatically select the HDF5-file with the largest
number of quantified samples:
```{r}
es <- ess[[1]]
es <- loadCountsFromHSDS(es, url)
head(exprs(es))
```
The counts are different from the previous values as ARCHS4 counts were used -- ARCHS4 is prioritized when there are several files with the same number of samples:
```{r}
preproc(experimentData(es))$gene_counts_source
```
Further, gene symbols are also imported from ARCHS4 database and are available as feature data:
```{r}
head(fData(es))
```
# Inferring sample groups
For some of the GEO datasets, such as GSE53053, the sample annotation is not fully available.
However, frequently sample titles are structured in a way that allows to infer the groups.
For example, for GSE53053 we can see there are three groups: Ctrl, MandIL4, MandLPSandIFNg,
with up to 3 replicates:
```{r}
es$title
```
For such well-structured titles, `inferCondition` function can be used to automatically
identify the sample conditions and replicates:
```{r}
es <- inferCondition(es)
print(es$condition)
print(es$replicate)
```
# Working with GCT files
GCT text format can be used to store annotated gene expression matrices and load them in software
such as [Morpheus](https://software.broadinstitute.org/morpheus/) or [Phantasus](https://ctlab.itmo.ru/phantasus/).
For example, we can save the `ExpressionSet` object that we defined previously:
```{r}
f <- file.path(tempdir(), "GSE53053.gct")
writeGct(es, f)
```
And the load the file back:
```{r}
es2 <- readGct(f)
print(es2)
```
# Session info
```{r}
sessionInfo()
```
```