%\VignetteIndexEntry{genphen overview}
\documentclass[a4paper, 11pt]{article}
\usepackage[utf8]{inputenc}
\let\chapter\section
\usepackage{libertine}
\usepackage[numbers]{natbib}
\bibliographystyle{plainnat}
\usepackage[english]{babel}
\selectlanguage{english}
\usepackage{graphicx}
\usepackage{placeins}
\usepackage{amsmath}
\usepackage{amscd}
\usepackage{ifthen}
\usepackage{float}
\usepackage{subfig}
\usepackage{lscape}
\usepackage{parskip}
\usepackage{multirow}
\usepackage{todonotes}
\usepackage{color}
\usepackage{colortbl}
\definecolor{hellgrau}{rgb}{0.9,0.9,0.9}
\definecolor{dunkelgrau}{rgb}{0.8,0.8,0.8}
\usepackage{hyperref}

%add-on renews
\renewcommand\topfraction{0.85}
\renewcommand\bottomfraction{0.85}
\renewcommand\textfraction{0.1}
\renewcommand\floatpagefraction{0.85}
\setlength\floatsep{1.25\baselineskip plus 3pt minus 2pt}
\setlength\textfloatsep{1.25\baselineskip plus 3pt minus 2pt}
\setlength\intextsep{1.25\baselineskip plus 3pt minus 2pt}
\setlength{\parindent}{0pt}

\begin{document}
\SweaveOpts{concordance=TRUE}

\title{Decoding genotype-phenotype associations with \texttt{genphen}}
\author{Simo Kitanovski, Daniel Hoffmann,\\
Bioinformatics and Computational Biophysics,\\
University of Duisburg-Essen, Essen, Germany}
\maketitle
\tableofcontents

This tutorial gives you some of the technical background underlying
\texttt{genphen} that should enable you to understand and use this tool.

\section{\texttt{genphen} quantifies genotype-phenotype associations}
Genome wide association studies (GWAS) have become an important tool for
understanding the association between genotypes and phenotypes. With GWAS we
try to answer questions such as ``what are the genotypes in the human genome
which predispose to a disease?'' or ``what are the genotypes in certain
strains of mice which confer resistance against a specific virus?''. The
advances in high-throughput sequencing technology (HTSeq) have provided
massive genetic data and thus potentially countless applications for GWAS,
with genotypes representing single nucleotide polymorphisms (SNPs) or single
amino acid polymorphisms (SAAPs) identified in a group of individuals, whereas
the phenotype can be any quantitative or discrete trait or characteristic
measured in each individual.

The classical (frequentist) statistical methods for GWAS rely on simple and
often inadequate methods to capture complex and potentially non-linear
genotype-phenotype associations. By quantifying the strength of the
association with P-values, the frequentist methods often run into the
multiple-comparison problem, which, when countered with a rigorous P-value
correction, results in large amounts of false negatives. In addition,
P-values are difficult to interpret and to compare between studies.

With \texttt{genphen} we provide a hybrid method for the joint analysis of
multiple traits of different types which reaps the benefits of two
approaches: i) statistical learning approaches such as random forest (RF) and
support vector machine (SVM) to capture complex associations; ii) Bayesian
inference using hierarchical models for accurate quantification of the
strength of association, whereby the models are robust to outliers,
consistent with the data and automatically deal with the multiple-hypothesis
problem.
Furthermore, \texttt{genphen} provides a set of additional procedures,
including a test for phylogenetic bias (used to discover biases in the data
due to the population structure) and a procedure for data reduction (used to
remove non-informative genotypes and thereby simplify the otherwise
computationally costly GWAS). Future updates will include procedures for data
augmentation (to augment small/noisy datasets) and methods for gene
prioritization based on network diffusion algorithms using functional network
data.

\newpage
\section{Methods}
\label{sec:methods}

\subsection{Input}
Two types of data are necessary to perform a genetic association study:

\begin{itemize}
\item genotype data (e.g. a set of 1,000 SNPs found along the aligned genomes
of 10 individuals), provided in one of three possible input types:
  \begin{itemize}
  \item \textbf{character vector} of length $N$ (if only a single SNP/SAAP is
  provided), containing the genotypes of $N$ individuals.
  \item \textbf{character matrix} with dimensions $N \times S$ (with $N$ =
  individuals, $S$ = SNPs/SAAPs)
  \item \textbf{AAMultipleAlignment} or \textbf{DNAMultipleAlignment} object
  (package \texttt{Biostrings}) - if the genotype data is a multiple sequence
  alignment composed of $N$ sequences (individuals).
  \end{itemize}
\item phenotype data, which can include a combination of dichotomous and
quantitative traits (experimental measurements made for each individual, such
as body mass index, immune response, survival, case-control status, etc.),
provided as:
  \begin{itemize}
  \item numerical vector of length $N$, if only a single phenotype is analyzed
  \item numerical matrix $N \times P$, if $P$ phenotypes are provided.
  \end{itemize}
\end{itemize}

\subsection{Association Scores}
With \texttt{genphen} we quantify the association between each genotype
(SNP/SAAP) and phenotype using multiple metrics of association, each of which
is explained in the following paragraphs.

\paragraph{Classification accuracy ($CA$)}
$CA$ quantifies the degree of accuracy with which one can classify (predict)
the alleles of a SNP from the phenotype. If there exists a strong association
between a particular SNP and the phenotype, one should be able to train a
statistical model (using RF or SVM) which accurately classifies the two
alleles of that SNP solely from the phenotype data (with $CA \approx 1$).
Otherwise, the model should perform poorly, with the classification accuracy
of the model being approximately similar to that of simple guessing
($CA \approx 0.5$). To estimate $CA$, \texttt{genphen} uses RF and SVM in a
cross-validation (CV) mode, computing a distribution of possible $CA$ values
for each SNP. During each iteration of the CV procedure, a subset (e.g. 66\%)
of the genotype-phenotype data is selected at random (with replacement) and
used to train a classifier, followed by testing (prediction) based on the
remaining data. To summarize $CA$, we compute its mean and 95\% highest
density interval (95\% HDI), which is defined as the interval that covers
95\% of the $CA$ distribution, with every point inside the interval having
higher credibility than any point outside it. SNPs with $CA \approx 1$ and a
narrow HDI have a strong association with the phenotype.
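The following is a minimal, self-contained sketch of this idea (an
illustration only, not the code that \texttt{genphen} runs internally): one
bi-allelic SNP is classified from a quantitative phenotype with a random
forest (here via the \texttt{randomForest} package) in repeated random
train/test splits, and the resulting $CA$ distribution is summarized. The
simulated data and the simple equal-tailed interval are assumptions made for
brevity; \texttt{genphen} reports an HDI instead.

\begin{scriptsize}
<<eval=FALSE>>=
# Conceptual sketch of the CA estimate (not genphen's internal code)
library(randomForest)

set.seed(100)
allele <- factor(rep(x = c("A", "T"), each = 30))            # one SNP, two alleles
pheno <- c(rnorm(n = 30, mean = 0), rnorm(n = 30, mean = 1)) # quantitative phenotype

ca <- replicate(n = 200, expr = {
  # random split: train on ~66% of the data (sampled with replacement),
  # test on the individuals not drawn into the training set
  train <- sample(x = seq_along(allele), size = 40, replace = TRUE)
  test <- setdiff(seq_along(allele), train)
  rf <- randomForest(x = data.frame(phenotype = pheno[train]),
                     y = allele[train])
  pred <- predict(rf, newdata = data.frame(phenotype = pheno[test]))
  mean(pred == allele[test])                                 # CA of this split
})

# genphen summarizes the CA distribution with its mean and 95% HDI; here we
# show the mean and a simple 95% equal-tailed interval for illustration
mean(ca)
quantile(x = ca, probs = c(0.025, 0.975))
@
\end{scriptsize}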
\paragraph{Cohen's $\kappa$ statistic}
There is one pitfall where the $CA$ estimate can be misleading, namely when
the analyzed SNP is composed of unevenly represented genetic states
(alleles). For instance, the allele A of a given SNP is found in 90\% of the
individuals, while the other allele T in only 10\%. Such an uneven
composition of the alleles can lead to a misleading $CA$, i.e. even without
learning, the algorithm can produce a high $CA \approx 0.9$ by always
predicting the dominant label. Cohen's $\kappa$ statistic can be used to
estimate the degree of $CA$ above the expected accuracy ($CA_{exp}$):
\begin{gather*}
\kappa = (CA - CA_{exp})/(1 - CA_{exp})
\end{gather*}
The $\kappa$ statistic is a quality metric that should be interpreted
together with $CA$. Cohen defines the following meaningful $\kappa$
intervals: [$\kappa$<0]: ``no agreement'', [0.0-0.2]: ``slight agreement'',
[0.2-0.4]: ``fair agreement'', [0.4-0.6]: ``moderate agreement'', [0.6-0.8]:
``substantial agreement'' and [0.8-1.0]: ``almost perfect agreement''. To
summarize Cohen's $\kappa$, we compute its mean and 95\% highest density
interval (95\% HDI).

\paragraph{Effect size}
Given $N$ individuals, each having genotype values for $S$ SNPs, we generate
the genotype matrix $X^{N \times S}$. The genotype matrix can also be
referred to as the design matrix of the genetic association study, with
$X_{ij}$ set to 1 if individual $i$ carries the first allele of SNP $j$, and
to -1 for the second allele. For multi-allelic genotypes, the genotype matrix
is expanded (column-wise) to include each bi-allelic combination. The $P$
phenotypes are likewise collected in the phenotype matrix $Y^{N \times P}$.
We model the effect of each SNP/SAAP on each phenotype using the following
Bayesian model:
\begin{gather*}
Y_{ik} \sim
\begin{cases}
\operatorname{Student-t}\left(\nu_k, \alpha_{jk} + \beta_{jk} \cdot X_{ij},
\sigma_k\right), & \text{if $k$ quant.}\\
\operatorname{Bernoulli}\left(\operatorname{logit^{-1}}(\alpha_{jk} +
\beta_{jk} \cdot X_{ij})\right), & \text{if $k$ dich.}
\end{cases}
\end{gather*}
where $i$, $j$ and $k$ index the individuals, SNPs and phenotypes,
respectively. For quantitative traits, the model assumes Student-t
distributed measurement errors with phenotype-specific standard deviation
($\sigma$) and degrees of freedom ($\nu$), with central tendency defined by
$\alpha_{jk}+\beta_{jk} \cdot X_{ij}$, where $\alpha$ and $\beta$ are the
inferred intercept and slope (effect size) coefficients of the model. For
dichotomous traits, the model assumes Bernoulli distributed measurement
errors.
Assuming that all SNPs are independent, we can use the univariate model
setting in \texttt{genphen} to place independent vague priors on all slope
and intercept coefficients as:
\begin{gather*}
\beta_{jk} \sim \text{Student-t}(1, 0, 10) \\
\alpha_{jk} \sim \text{Student-t}(1, 0, 100)
\end{gather*}
On the other hand, if we assume that the estimated slopes are not completely
independent and originate from a common overarching distribution, we can use
the hierarchical model setting in \texttt{genphen} to model the hierarchy as:
\begin{gather*}
\beta_{jk} \sim \text{Student-t}(\nu_{\beta_k},\mu_{\beta_k},\sigma_{\beta_k}) \\
\alpha_{jk} \sim \text{Student-t}(\nu_{\alpha_k},\mu_{\alpha_k},\sigma_{\alpha_k})
\end{gather*}
The remaining parameters of either model type receive the following priors:
\begin{gather*}
\mu_{\beta_k} \sim \operatorname{Student-t}\left(1, 0, 10\right) \\
\mu_{\alpha_k} \sim \operatorname{Student-t}\left(1, 0, 100\right) \\
\nu_k, \nu_{\beta_k} \sim \operatorname{Gamma}\left(1, 2\right) \\
\sigma_{k}, \sigma_{\beta_k} \sim \operatorname{Half-Cauchy}\left(0, 5\right)
\end{gather*}
Importantly, the hierarchical version of the model performs partial pooling
and therefore corrects automatically for multiple comparisons. The model was
implemented in Stan~\footnote{Stan Development Team. 2017. Stan Modeling
Language Users Guide and Reference Manual, Version 2.17.0.
http://mc-stan.org}.

We summarize each association using the mean of its slope coefficient
($\beta$) and the corresponding highest density interval (HDI) at a
user-defined level (e.g. 95\%), which is defined as the interval that covers
95\% of the posterior distribution, with every point inside the interval
having a higher credibility than any point outside it. Thus, we can call an
association significant if the null effect, i.e. $\beta = 0$, lies outside
the 95\% HDI. The complete posterior is provided to the user, enabling checks
for MCMC convergence and posterior prediction using built-in routines for
posterior predictive checks.
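To make the encoding of $X$ described above concrete, here is a small
illustration (a hypothetical helper written for this vignette, not part of
the \texttt{genphen} API) that turns one bi-allelic SNP into the
corresponding $\pm 1$ design-matrix column; for a multi-allelic SNP this step
would be repeated for every bi-allelic combination, adding one column per
combination.

\begin{scriptsize}
<<eval=FALSE>>=
# Hypothetical helper illustrating the design matrix encoding: the first
# allele is encoded as +1, the second allele as -1.
make.design.column <- function(snp, allele.1, allele.2) {
  x <- rep(x = NA_real_, times = length(snp))
  x[snp == allele.1] <- 1
  x[snp == allele.2] <- -1
  return(x)
}

snp <- c("A", "A", "T", "A", "T", "T") # toy SNP observed in 6 individuals
make.design.column(snp = snp, allele.1 = "A", allele.2 = "T")
# [1]  1  1 -1  1 -1 -1
@
\end{scriptsize}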
\subsection{Phylogenetic Bias ($B$)}
To control for potential phylogenetic biases (population structure), we
devised the following procedure. First, we use the complete genotype data
(all SNPs) to compute a kinship matrix ($K^{N \times N}$, a dissimilarity
matrix for the $N$ individuals). Alternatively, the users can provide their
own kinship matrix (e.g. kinship estimated using more accurate phylogenetic
methods). For the group of individuals that share an allele of a given SNP,
we next compute their mean kinship distance using the kinship matrix. If the
individuals in the group are related, the computed mean kinship distance
should be substantially lower than the mean kinship distance computed from
the complete kinship matrix. We define the phylogenetic bias as:
\begin{gather*}
B = 1 - \hat{d}_{g}/\hat{d}_{t}
\end{gather*}
where $\hat{d}_{g}$ is the mean kinship distance between the individuals who
share the genotype $g$, and $\hat{d}_{t}$ is the mean kinship distance of the
complete kinship matrix. For a complete phylogenetic bias, $B = 1$
($\hat{d}_{g} \ll \hat{d}_{t}$), and $B = 0$ (or slightly negative) for no
bias. This estimate is computed for each SNP and each genotype group within
each SNP.

To compute the phylogenetic bias associated with a SNP we compute:
\begin{gather*}
B = 1 - \min(\hat{d}_{g_{1}}, \hat{d}_{g_{2}})/\hat{d}_{t}
\end{gather*}
where $\hat{d}_{g_{1}}$ and $\hat{d}_{g_{2}}$ represent the mean kinship
distance between the individuals who share the genotype (allele) $g_{1}$ and
$g_{2}$ of a given SNP, and $\hat{d}_{t}$ is the mean kinship distance in the
complete kinship matrix. For a complete phylogenetic bias, $B = 1$, and
$B = 0$ (or slightly negative) for no bias. This estimate is computed for
each SNP and each pair of genotypes.

\subsection{Pareto Optimization}
We use Pareto optimization (with the R package \texttt{rPref}) to rank the
SNPs based on their multi-factorial association. Given that $CA$ is encoded
into $\kappa$, we use only $\beta$ and $\kappa$ with an objective function
that prioritizes SNPs which score high with respect to both of them. The
Pareto optimization procedure assigns each SNP to a non-dominated front
(rank).

\newpage
\section{Case studies}

\subsection{I: Association between SNPs and a quantitative phenotype}
\label{sec:case1}
In the first case study, we show a typical genotype-phenotype analysis,
whereby the genotype is a multiple sequence alignment containing 120 protein
sequences (individuals), each composed of 8 amino acids (positions), and a
quantitative phenotype measured for each individual.

\begin{scriptsize}
<<>>=
require(genphen)
require(ggplot2)
require(knitr)
require(ggrepel)
require(reshape)
require(ape)
require(xtable)
require(gridExtra)
options(xtable.floating = FALSE)
@
\end{scriptsize}

\begin{scriptsize}
<<>>=
# genotype as matrix (120x154), we will subset part of it:
data(genotype.saap)
# phenotype as vector (120 measurements):
data(phenotype.saap)
@
\end{scriptsize}

\paragraph{Genotype-phenotype data}
First we show an overview of the distribution of the phenotype across the
genetic states found at each of the 8 studied positions in the multiple
sequence alignment.

\begin{scriptsize}
<<>>=
# Format the genotype-phenotype data, such that it can then
# be visualized with ggplot
df <- data.frame(genotype.saap[, 82:89], phenotype = phenotype.saap,
                 stringsAsFactors = FALSE)
df <- melt(data = df, id.vars = "phenotype")
colnames(df) <- c("phenotype", "site", "genotype")
df$site <- gsub(pattern = "X", replacement = '', x = df$site)
df$site <- factor(x = df$site, levels = unique(df$site))
@

<<>>=
# Visualization
g <- ggplot(data = df)+
  facet_wrap(facets = ~site, nrow = 2, scales = "free_x")+
  geom_violin(aes(x = genotype, y = phenotype))+
  ylab(label = "Quantitative phenotype")+
  xlab(label = "Genotypes")+
  geom_point(aes(x = genotype, y = phenotype, col = genotype), size = 1,
             shape = 21, position = position_jitterdodge())+
  scale_color_discrete(name = "genotype")+
  theme_bw(base_size = 14)+
  theme(legend.position = "none")
g
@

\begin{figure}[H]
\centering
<<echo=FALSE, fig=TRUE>>=
g
@
\end{figure}
\end{scriptsize}

\textbf{Important remark:} We recommend that quantitative phenotypes be
roughly normally (or Student-t) distributed. While our models are designed to
be robust against outliers, we advise you to transform skewed phenotypes
(e.g. by log-transformation) before the analysis. Here the phenotype has
already been log10-transformed and is approximately normally distributed.
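For instance, a quick way to inspect a phenotype and decide whether such a
transformation is warranted (illustrative sketch; \texttt{raw.phenotype} is a
hypothetical, strictly positive and skewed measurement vector, not part of
the example data):

\begin{scriptsize}
<<eval=FALSE>>=
# Hypothetical skewed phenotype: compare its distribution before and after
# a log10 transformation, then pass the transformed vector to runGenphen.
raw.phenotype <- rlnorm(n = 120, meanlog = 0, sdlog = 1)
par(mfrow = c(1, 2))
hist(raw.phenotype, main = "raw", xlab = "phenotype")
hist(log10(raw.phenotype), main = "log10-transformed", xlab = "phenotype")
@
\end{scriptsize}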
\paragraph{Association analysis}
Next, we perform the genetic association study for a single
\underline{quantitative ('Q')} phenotype with \texttt{genphen} using the
following settings:

\begin{itemize}
\item a \underline{hierarchical} Bayesian model will be run with
\underline{2 MCMC chains} of \underline{1,500 iterations} each, including
\underline{500 warmup iterations}.
\item \underline{random forest} is selected for the statistical learning,
run in cross-validation mode with \underline{200 iterations}, whereby in each
iteration 66\% of the data (\underline{default: cv.fold = 0.66}) is used to
train the model.
\item each metric is reported with its mean and \underline{95\% HDI}
\item whenever possible, \underline{2 cores} will be used.
\end{itemize}

\begin{scriptsize}
<<>>=
# Run genphen
c.out <- genphen::runGenphen(genotype = genotype.saap[, 82:89],
                             phenotype = phenotype.saap,
                             phenotype.type = "Q",
                             model.type = "hierarchical",
                             mcmc.chains = 2,
                             mcmc.steps = 1500,
                             mcmc.warmup = 500,
                             cores = 2,
                             hdi.level = 0.95,
                             stat.learn.method = "rf",
                             cv.steps = 200)
@
\end{scriptsize}

A typical way of visualizing the \texttt{genphen} results is with the
following plot, where each point represents a SAAP, plotted with
x = classification accuracy ($CA$), y = effect size ($\beta$) and
color = Cohen's $\kappa$. The most promising SAAPs have $CA$ and $\kappa$
close to 1, with a non-null $\beta$, i.e. a $\beta$ whose 95\% HDI does not
overlap with 0 (shown as a dashed line in the figure). The labels show the
site of each SAAP in the genotype data and its constituent genotypes.

\begin{scriptsize}
<<>>=
# Get the scores data
c.score <- c.out$scores

# Some optional formatting for the SNPs (label = site : genotype1 -> genotype2)
c.score$label <- paste(c.score$site, ":", c.score$ref, "->",
                       c.score$alt, sep = '')

# Visualization
g <- ggplot(data = c.score)+
  geom_errorbar(aes(x = ca.mean, ymin = beta.hdi.low, ymax = beta.hdi.high),
                width = 0.015, col = "darkgray")+
  geom_point(aes(x = ca.mean, y = beta.mean, fill = kappa.mean),
             shape = 21, size = 4)+
  geom_text_repel(aes(x = ca.mean, y = beta.mean, label = label), size = 5)+
  theme_bw(base_size = 14)+
  ylab(label = expression("Effect size ("*beta*") (with 95% HDI)"))+
  scale_x_continuous(name = "CA", limits = c(0, 1.05))+
  geom_hline(yintercept = 0, linetype = "dashed")+
  theme(legend.position = "top")+
  scale_fill_distiller(palette = "Spectral", limits = c(-0.2, 1))+
  guides(fill = guide_colorbar(barwidth = 10, barheight = 1.5))
g
@
\end{scriptsize}

\begin{figure}[H]
\centering
<<echo=FALSE, fig=TRUE>>=
g
@
\end{figure}

The association scores are also shown in the following table:

\begin{scriptsize}
<<results=hide>>=
# Description:
# Rounds to 2 decimal digits and concatenates the lower and upper
# limits of the HDI to simplify the visualization
getHdiPretty <- function(x, digits = 2) {
  x[1] <- round(x = x[1], digits = digits)
  x[2] <- round(x = x[2], digits = digits)
  return(paste("(", x[1], ", ", x[2], ")", sep = ''))
}
c.score$beta.hdi <- apply(X = c.score[, c("beta.hdi.low", "beta.hdi.high")],
                          MARGIN = 1, getHdiPretty, digits = 2)
c.score$ca.hdi <- apply(X = c.score[, c("ca.hdi.low", "ca.hdi.high")],
                        MARGIN = 1, getHdiPretty, digits = 2)
c.score$kappa.hdi <- apply(X = c.score[, c("kappa.hdi.low", "kappa.hdi.high")],
                           MARGIN = 1, getHdiPretty, digits = 2)

# Print table
print(xtable(c.score[, c("label", "beta.mean", "beta.hdi", "ca.mean",
                         "ca.hdi", "kappa.mean", "kappa.hdi")],
             align = rep(x = "c", times = 8), digits = 2),
      include.rownames = FALSE, size = "scriptsize")
@
\end{scriptsize}

<<echo=FALSE, results=tex>>=
print(xtable(c.score[, c("label", "beta.mean", "beta.hdi", "ca.mean",
                         "ca.hdi", "kappa.mean", "kappa.hdi")],
             align = rep(x = "c", times = 8), digits = 2),
      include.rownames = FALSE, size = "scriptsize")
@

\paragraph{Pareto optimization}
We use Pareto optimization (with the R package \texttt{rPref}) to rank the
SNPs based on their multi-factorial association. Given that $CA$ is encoded
into $\kappa$, we use only $\beta$ and $\kappa$ with an objective function
that prioritizes SNPs which score high with respect to both of them. The
results of the Pareto optimization procedure are shown below:

\begin{scriptsize}
<<>>=
# Visualization
g <- ggplot(data = c.score)+
  facet_wrap(facets = ~phenotype.id, scales = "free")+
  geom_line(aes(y = abs(beta.mean), x = kappa.mean, group = rank))+
  geom_point(aes(y = abs(beta.mean), x = kappa.mean, fill = rank),
             shape = 21, size = 4)+
  geom_text_repel(aes(y = abs(beta.mean), x = kappa.mean, label = label),
                  size = 5)+
  theme_bw(base_size = 14)+
  ylab(label = expression("|"*beta*"|"))+
  xlab(label = expression(kappa))+
  scale_fill_gradientn(colours = terrain.colors(n = 10))+
  theme(legend.position = "top")
g
@
\end{scriptsize}

\begin{figure}[H]
\centering
<<echo=FALSE, fig=TRUE>>=
g
@
\end{figure}

\paragraph{MCMC convergence and sampling issues}
You might want to check the validity of your Bayesian inference by inspecting
the complete posterior provided in the \texttt{genphen} output, which
contains information about the Markov chain Monte Carlo (MCMC) simulation
performed with the R package \texttt{rstan}, including the potential scale
reduction factor (Rhat) and the effective sample size (ESS), as well as
information about potential sampling problems such as divergences and
exceeded tree depths. For detailed information about each warning please
consult the Stan documentation (\url{mc-stan.org/users/documentation/}).

\begin{scriptsize}
<<eval=FALSE>>=
rstan::check_hmc_diagnostics(c.out$complete.posterior)
rstan::stan_rhat(c.out$complete.posterior)
rstan::stan_ess(c.out$complete.posterior)
rstan::stan_diag(c.out$complete.posterior)
@
\end{scriptsize}

\paragraph{Phylogenetic bias control}
Next, we compute the phylogenetic bias of each mutation, shown in the table
below:

\begin{scriptsize}
<<results=hide>>=
# Compute the phylogenetic bias
bias <- runPhyloBiasCheck(genotype = genotype.saap,
                          input.kinship.matrix = NULL)

# Extract kinship matrix
kinship.matrix <- bias$kinship.matrix

# Extract the bias associated with mutations of the sites which
# were included in the association analysis
mutation.bias <- bias$bias

# To make the site ids concordant with the subset used above
mutation.bias$site <- mutation.bias$site - 81
mutation.bias <- merge(x = c.score, y = mutation.bias,
                       by = c("site", "ref", "alt"))

# Show the bias table
print(xtable(mutation.bias[, c("site", "ref", "alt", "bias.ref", "bias.alt")],
             align = rep(x = "c", times = 6), digits = 2),
      include.rownames = FALSE, size = "small")
@
\end{scriptsize}

<<echo=FALSE, results=tex>>=
print(xtable(mutation.bias[, c("site", "ref", "alt", "bias.ref", "bias.alt")],
             align = rep(x = "c", times = 6), digits = 2),
      include.rownames = FALSE, size = "scriptsize")
@

We use the kinship matrix to perform hierarchical clustering, visualizing the
population structure together with two example mutations, in which one
genotype is marked in blue and the other in orange. Individuals carrying
neither genotype are marked in gray. The two examples differ in their degree
of phylogenetic bias.
\begin{scriptsize}
<<>>=
color.a <- character(length = nrow(genotype.saap))
color.a[1:length(color.a)] <- "gray"
color.a[which(genotype.saap[, 82] == "h")] <- "orange"
color.a[which(genotype.saap[, 82] == "q")] <- "blue"

color.b <- character(length = nrow(genotype.saap))
color.b[1:length(color.b)] <- "gray"
color.b[which(genotype.saap[, 84] == "a")] <- "orange"
color.b[which(genotype.saap[, 84] == "d")] <- "blue"

c.hclust <- hclust(as.dist(kinship.matrix), method = "average")

par(mfrow = c(1, 2), mar = c(0,0,1,0) + 0.1)
plot(as.phylo(c.hclust), tip.color = color.a, cex = 0.6,
     type = "fan", main = "B = 0.15")
plot(as.phylo(c.hclust), tip.color = color.b, cex = 0.6,
     type = "fan", main = "B = 0.43")
@
\end{scriptsize}

\begin{figure}[H]
\centering
<<echo=FALSE, fig=TRUE>>=
par(mfrow = c(1, 2), mar = c(0,0,1,0) + 0.1)
plot(as.phylo(c.hclust), tip.color = color.a, cex = 0.6,
     type = "fan", main = "B = 0.15")
plot(as.phylo(c.hclust), tip.color = color.b, cex = 0.6,
     type = "fan", main = "B = 0.43")
@
\end{figure}

\subsection{II: Association between SNP and two phenotypes (quantitative and
dichotomous)}
\label{sec:case2}
In the second case study we show you how to use \texttt{genphen} with two
phenotypes of different types (quantitative and dichotomous). Here, the
genotype is a single simulated SNP in 40 individuals. First we show an
overview of the distribution of the phenotypes in each genotype group.

\begin{scriptsize}
<<>>=
# simulate genotype
genotype <- rep(x = c("A", "C", "T", "G"), each = 10)

# simulate quantitative and dichotomous phenotypes
phenotype.Q <- c(rnorm(n = 10, mean = 0, sd = 1),
                 rnorm(n = 10, mean = 0.5, sd = 1),
                 rnorm(n = 10, mean = -0.5, sd = 1),
                 rnorm(n = 10, mean = 2, sd = 1))
phenotype.D <- c(rbinom(n = 10, size = 1, prob = 0.3),
                 rbinom(n = 10, size = 1, prob = 0.5),
                 rbinom(n = 10, size = 1, prob = 0.6),
                 rbinom(n = 10, size = 1, prob = 0.7))
phenotype <- cbind(phenotype.Q, phenotype.D)
rm(phenotype.Q, phenotype.D)
@

<<>>=
# Format the genotype-phenotype data, such that it can then
# be visualized with ggplot
df <- data.frame(genotype = genotype,
                 phenotype.Q = phenotype[, 1],
                 phenotype.D = phenotype[, 2],
                 stringsAsFactors = FALSE)
@

<<>>=
# Visualization
g1 <- ggplot(data = df)+
  geom_point(aes(x = genotype, y = phenotype.Q, col = genotype), size = 1,
             shape = 21,
             position = position_jitterdodge(jitter.width = 0.2,
                                             jitter.height = 0,
                                             dodge.width = 0.5))+
  xlab(label = "Genotypes")+
  ylab(label = "Phenotype (Q)")+
  theme_bw(base_size = 14)+
  theme(legend.position = "none")

g2 <- ggplot(data = df)+
  geom_point(aes(x = genotype, y = phenotype.D, col = genotype), size = 1,
             shape = 21,
             position = position_jitterdodge(jitter.width = 0.2,
                                             jitter.height = 0.05,
                                             dodge.width = 0.5))+
  xlab(label = "Genotypes")+
  scale_y_continuous(name = "Phenotype (D)",
                     breaks = c(0, 1), labels = c(0, 1))+
  theme_bw(base_size = 14)+
  theme(legend.position = "none")

gridExtra::grid.arrange(g1, g2, ncol = 2)
@
\end{scriptsize}

\begin{figure}[H]
\centering
<<echo=FALSE, fig=TRUE>>=
gridExtra::grid.arrange(g1, g2, ncol = 2)
@
\end{figure}

\textbf{Important remark:} The dichotomous phenotype can be provided either
as a numeric or as a character vector. The elements of these vectors are
encoded internally into two categories (1 and 0). If the user has a
preference for how the encoding should be done (i.e. which category is
encoded as 1 and which as 0), the encoding should be performed prior to the
analysis.
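For instance, a minimal sketch of such a manual encoding (the character
vector \texttt{status} is hypothetical and not part of the example data):

\begin{scriptsize}
<<eval=FALSE>>=
# Hypothetical character phenotype with a preferred encoding: "case" should
# become 1 and "control" should become 0, so we encode it ourselves before
# passing it to runGenphen.
status <- c("case", "control", "control", "case", "case")
phenotype.D <- ifelse(test = status == "case", yes = 1, no = 0)
phenotype.D
# [1] 1 0 0 1 1
@
\end{scriptsize}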
\paragraph{Association analysis}
We perform a genetic association study for \underline{multiple phenotypes} of
different types, including one quantitative ('Q') and one dichotomous ('D')
phenotype, using the following settings:

\begin{itemize}
\item a \underline{univariate} Bayesian model will be run with
\underline{4 MCMC chains} of \underline{1,500 iterations} each, including
\underline{500 warmup iterations}.
\item a \underline{support vector machine} is selected for the statistical
learning, run in cross-validation mode with \underline{500 iterations},
whereby in each iteration 80\% of the data (\underline{cv.fold = 0.8}) is
used to train the model.
\item all estimates are reported with their mean and \underline{95\% HDI}
\item whenever possible, \underline{2 cores} will be used.
\end{itemize}

\begin{scriptsize}
<<>>=
# run genphen
m.out <- genphen::runGenphen(genotype = genotype,
                             phenotype = phenotype,
                             phenotype.type = c("Q", "D"),
                             model.type = "univariate",
                             mcmc.chains = 4,
                             mcmc.steps = 1500,
                             mcmc.warmup = 500,
                             cores = 2,
                             hdi.level = 0.95,
                             stat.learn.method = "svm",
                             cv.steps = 500,
                             cv.fold = 0.8)
@
\end{scriptsize}

Once again we visualize the \texttt{genphen} results with a plot in which
each point represents the SNP for one phenotype, plotted with
x = classification accuracy ($CA$), y = effect size ($\beta$) with error bars
representing the 95\% HDI, and color = Cohen's $\kappa$.

\begin{scriptsize}
<<>>=
# Get the scores data
m.score <- m.out$scores

# Some optional formatting for the SNPs
# (label = site : genotype1 -> genotype2)
m.score$label <- paste(m.score$site, ":", m.score$ref, "->",
                       m.score$alt, sep = '')

# Visualization
g1 <- ggplot(data = m.score[m.score$phenotype.id == 1, ])+
  geom_errorbar(aes(x = ca.mean, ymin = beta.hdi.low, ymax = beta.hdi.high),
                width = 0.015, col = "darkgray")+
  geom_point(aes(x = ca.mean, y = beta.mean, fill = kappa.mean),
             shape = 21, size = 4)+
  geom_text_repel(aes(x = ca.mean, y = beta.mean, label = label), size = 5)+
  theme_bw(base_size = 14)+
  ylab(label = expression("Effect size ("*beta*") (with 95% HDI)"))+
  scale_x_continuous(name = "CA", limits = c(0, 1.05))+
  geom_hline(yintercept = 0, linetype = "dashed")+
  theme(legend.position = "top")+
  scale_fill_distiller(palette = "Spectral", limits = c(-0.2, 1))+
  guides(fill = guide_colorbar(barwidth = 10, barheight = 1.5))+
  ggtitle(label = "Phenotype Q")

g2 <- ggplot(data = m.score[m.score$phenotype.id == 2, ])+
  geom_errorbar(aes(x = ca.mean, ymin = beta.hdi.low, ymax = beta.hdi.high),
                width = 0.015, col = "darkgray")+
  geom_point(aes(x = ca.mean, y = beta.mean, fill = kappa.mean),
             shape = 21, size = 4)+
  geom_text_repel(aes(x = ca.mean, y = beta.mean, label = label), size = 5)+
  theme_bw(base_size = 14)+
  ylab(label = expression("Effect size ("*beta*") (with 95% HDI)"))+
  scale_x_continuous(name = "CA", limits = c(0, 1.05))+
  geom_hline(yintercept = 0, linetype = "dashed")+
  theme(legend.position = "top")+
  scale_fill_distiller(palette = "Spectral", limits = c(-0.2, 1))+
  guides(fill = guide_colorbar(barwidth = 10, barheight = 1.5))+
  ggtitle(label = "Phenotype D")

grid.arrange(g1, g2, ncol = 2)
@
\end{scriptsize}

\begin{figure}[H]
\centering
<<echo=FALSE, fig=TRUE>>=
grid.arrange(g1, g2, ncol = 2)
@
\end{figure}

\newpage
\section{Extra Utilities}
\label{sec:utilities}

\subsection{Data Reduction}
The Bayesian models and statistical learning methods implemented in
\texttt{genphen} are more flexible than the tests used by
most classical (frequentist) tools for GWAS. The major challenge, however, is
the substantially increased computational cost when analyzing the effects of
hundreds of thousands of SNPs. Motivated by the biological assumption that
the major fraction of the studied SNPs are non-informative (genetic noise)
with respect to the selected phenotype, various data reduction techniques can
be used to quickly scan the SNP space and discard a substantial portion of
the SNPs deemed clearly non-informative. Our data reduction procedure
includes the following steps:

\begin{enumerate}
\item The complete data (genotypes and a single phenotype) is used to train a
random forest (RF) model, which quantifies the importance of each SNP/SAAP in
explaining the phenotype.
\item We then inspect the distribution of variable importances to get insight
into the structure of the importance values and to dissect the signal from
the noise.
\item The main analysis can then be performed with \texttt{runGenphen} using
only the subset of SNPs with high importance (see the sketch at the end of
this section).
\end{enumerate}

Using a case study based on simulated data with 50,000 SNPs and 60
individuals, we elaborate the typical data reduction steps in more detail.
The typical runtime for this dataset ($60 \times 50,000$) is a few minutes.

\begin{scriptsize}
<<>>=
# Simulate 50,000 SNPs for 60 individuals, and a quantitative phenotype
set.seed(seed = 551155)
g1 <- replicate(n = 5*10^4,
                expr = as.character(rbinom(n = 30, size = 1, prob = 0.49)))
g2 <- replicate(n = 5*10^4,
                expr = as.character(rbinom(n = 30, size = 1, prob = 0.51)))
gen <- rbind(g1, g2)
phen <- c(rnorm(n = 30, mean = 3, sd = 3),
          rnorm(n = 30, mean = 5, sd = 3))
@
\end{scriptsize}

\begin{scriptsize}
<<>>=
# Run diagnostics
diag <- genphen::runDiagnostics(genotype = gen,
                                phenotype = phen,
                                phenotype.type = "Q",
                                rf.trees = 50000)
@
\end{scriptsize}

We can inspect the distribution of importances and select a set of promising
SNPs based on their importance scores, which can then be studied in the main
association study as explained previously.

\begin{scriptsize}
<<>>=
# Visualization
g <- ggplot(data = diag)+
  geom_density(aes(importance))+
  xlab("Importance")+
  theme_bw(base_size = 14)+
  scale_x_continuous(trans = "log10")+
  annotation_logticks(base = 10, sides = "b")
g
@
\end{scriptsize}

\begin{figure}[H]
\centering
<<echo=FALSE, fig=TRUE>>=
g
@
\end{figure}

%By visualizing the estimated slope coefficients of the diagnosed SNPs, we can
%observe an enrichment of non-informative (statistically not-significant) SNPs
%beyond the rank of 1,000. We can thus narrow down our interval of interest to
%the top-ranked 2,500 SNPs, yielding massive data reduction of 95\%, while still
%retaining many SNPs with small effects.
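As a final step, one could keep only the SNPs whose importance clearly stands
out and pass that subset to \texttt{runGenphen}. The sketch below is an
illustration only: it assumes that the rows of the data frame returned by
\texttt{runDiagnostics} correspond to the columns of the genotype matrix (in
the same order), and the cutoff of the 2,500 top-ranked SNPs is arbitrary and
should be adapted to the observed importance distribution.

\begin{scriptsize}
<<eval=FALSE>>=
# Sketch of the subsetting step (assumptions: rows of 'diag' are in the same
# order as the columns of 'gen'; the cutoff of 2,500 SNPs is arbitrary)
top <- order(diag$importance, decreasing = TRUE)[1:2500]
gen.reduced <- gen[, top]

# The reduced genotype matrix is then analyzed as shown in the case studies,
# e.g. with the settings used in case study I
out <- genphen::runGenphen(genotype = gen.reduced,
                           phenotype = phen,
                           phenotype.type = "Q",
                           model.type = "hierarchical",
                           mcmc.chains = 2,
                           mcmc.steps = 1500,
                           mcmc.warmup = 500,
                           cores = 2,
                           hdi.level = 0.95,
                           stat.learn.method = "rf",
                           cv.steps = 200)
@
\end{scriptsize}

\end{document}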