---
title: "GOpro: Determine groups of genes and find their most characteristic GO term"
author: "Lidia Chrabaszcz"
output: BiocStyle::html_document
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{GOpro: Determine groups of genes and find their characteristic GO term}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

# Installation

```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("GOpro", dependencies = TRUE)
```

# Loading

After the package is installed, it can be loaded into the R workspace by typing

```{r, eval=TRUE, results='hide', warning=FALSE, message=FALSE}
library(GOpro)
```

# Overview

This document presents an overview of the GOpro package. The package determines groups of genes and finds the characteristic functions of these groups, so that each group can be interpreted by its most characteristic biological function.

The package provides one function, *findGO*, which is built on a set of methods:

- Selecting genes that differ significantly between at least two distinct groups (e.g. patients with different medical conditions), using the ANOVA test with correction for multiple testing.
- Grouping genes by one of two methods. The first is all pairwise comparisons using Tukey's method. It determines gene profiles, i.e. genes are grouped according to the differences in their expression between the given cohorts. The second is hierarchical clustering.
- Finding the most characteristic gene ontology terms for the previously obtained groups, using the one-sided Fisher's test for overrepresentation of a gene ontology term. If genes were grouped by hierarchical clustering, the most characteristic function is found for all possible groups (for each node in the dendrogram).
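The two statistical building blocks named in the overview can be illustrated with base R alone. The following is a minimal sketch on made-up data, not the implementation used inside *findGO*: the toy expression matrix, the cohort labels, the contingency table, and all variable names (`expr`, `cohort`, `sig_level`, `tab`) are assumptions introduced only for illustration.

```r
# Hypothetical toy data: 6 genes x 9 samples, three cohorts of three samples.
set.seed(1)
cohort <- factor(rep(c("A", "B", "C"), each = 3))
expr <- matrix(rnorm(6 * 9), nrow = 6,
               dimnames = list(paste0("gene", 1:6), NULL))
# Shift the first two genes in cohort C so that they differ between cohorts.
expr[1:2, cohort == "C"] <- expr[1:2, cohort == "C"] + 10

# One-way ANOVA p-value for each gene (one row of the matrix per gene).
p_raw <- apply(expr, 1, function(x) {
  summary(aov(x ~ cohort))[[1]][["Pr(>F)"]][1]
})

# Benjamini-Hochberg correction for multiple testing, then thresholding.
sig_level <- 0.05  # assumed significance level for this sketch
p_adj <- p.adjust(p_raw, method = "BH")
selected <- names(p_adj)[p_adj < sig_level]

# One-sided Fisher's test for over-representation of a (hypothetical) GO term:
# rows = annotated with the term or not, columns = inside or outside the group.
tab <- matrix(c(8, 2, 20, 170), nrow = 2,
              dimnames = list(term = c("annotated", "not annotated"),
                              group = c("in group", "outside group")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

Here `selected` holds the names of the genes that pass the corrected ANOVA threshold, and `p_fisher` is small when the term is annotated to a larger share of the group than of the remaining genes. In the package itself these steps are driven by the parameters of *findGO* rather than called by hand.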
# Details

Genes must be named with gene aliases, and they must be arranged in the same order for each cohort.

## Determining significantly different genes based on their expressions

Genes whose expression differs significantly between cohorts are selected for further analysis with the ANOVA test. The *topAOV* parameter denotes the maximum number of significantly different genes to be selected. The significance level of the ANOVA test is specified by the *sig.levelAOV* parameter. This threshold is used as the significance level in the Benjamini-Hochberg (BH) correction for multiple testing. If several genes share the same p-value (below the given threshold) at the cut-off, all genes whose p-value equals that of the gene ranked *topAOV* are included in the result.

## Grouping genes based on their similarity

Two methods are provided for grouping genes. The method is chosen with the *grouped* parameter: Tukey's test is used when *grouped* equals *'tukey'*, and the second method is used when it equals *'clustering'*.

### All pairwise comparisons by Tukey's test

Tukey's test is applied to group genes based on their profiles. The *sig.levelTUK* parameter denotes the significance level of Tukey's test. For each gene, a two-sided Tukey's test is conducted among the cohorts. The mean expressions in the cohorts are arranged in ascending order and the result of the test is adapted to that order. All genes with the same order of means and the same result of the test are grouped together. I.e. notation *colon\=bladder\