--- title: "Use of the epistasisGA package to identify multi-SNP effects in nuclear families" author: - name: Michael Nodzenski affiliation: - Department of Biostatistics, University of North Carolina, Chapel Hill, NC - Graduate Partnerships Program, National Institutes of Health, Bethesda, MD - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC email: michael.nodzenski@gmail.com date: "April 26, 2021" package: epistasisGA output: BiocStyle::html_document: toc_float: true fig_width: 5 bibliography: library.bib vignette: > %\VignetteIndexEntry{GADGETS Usage} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, echo=FALSE, results="hide"} knitr::opts_chunk$set(tidy = FALSE, cache = FALSE, dev = "png", message = FALSE, error = FALSE, warning = FALSE) ``` # Introduction While methods for characterizing marginal relationships between individual SNPs and disease status have been well developed in high throughput genetic association studies of complex diseases, identifying joint associations between collections of genetic variants and disease remains challenging. To date, studies have overwhelmingly focused on detecting variant-disease associations on a SNP by SNP basis. Doing so allows researchers to scan millions of SNPs for evidence of association with disease, but only SNPs with large marginal disease associations can be identified, missing collections of SNPs with large joint disease associations despite small marginal associations. For many diseases, it is hypothesized that increased penetrance may result from the joint effect of variants at multiple susceptibility loci, suggesting methods focused on identifying multi-SNP associations may offer greater insight into the genetic architecture of complex diseases. The epistasisGA package presents one such approach. In this package, we implement the GADGETS method [@GADGET2020] for identifying multi-SNP disease associations in case-parent triad or affected/unaffected sibling-pair studies. Briefly, GADGETS uses a "genetic algorithm" (GA) [@holland] to identify collections of risk-relevant SNPs. Genetic algorithms are a class of general purpose optimization algorithms particularly well suited to combinatorial optimization for high dimensional problems. While we use a GA in the context of population genetics, genetic algorithms were not originally developed for genetic studies and can be used to solve a broad variety of problems. In a typical genetic algorithm, optimal solutions are identified by mimicking the process of Darwinian natural selection. In the first iteration of the algorithm, or 'generation', a set of potential solutions, collectively known as a 'population' with each population component referred to as a 'chromosome', is sampled. Individual chromosomes are made up of finite sets of discrete elements, just as biological chromosomes are comprised of SNPs. Unlike biological chromosomes, however, GA 'chromosomes' are unordered sets. A user defined function then assigns a 'fitness score' to each chromosome, with the fitness score constructed such that better solutions have higher fitness scores. Then, the chromosomes are subjected to 'crossing over', where component elements of different chromosomes are exchanged, and 'mutation', where component elements are arbitrarily replaced, to generate a new population for the next generation. Chromosomes with higher fitness scores are preferentially selected to produce 'offspring', analogous to how organisms with higher fitness reproduce more effectively under natural selection. The scoring and mutation/crossing over process continues iteratively until stopping criteria have been achieved, hopefully resulting in one or more acceptable solutions to the optimization problem. To maintain population diversity and help avoid premature convergence, GADGETS implements what is known as an 'island model' GA [@island1996], where multiple 'island' populations are evolved in parallel with limited 'migration' of chromosomes among islands permitted at pre-specified intervals of generations. In GADGETS, the term 'chromosome' refers to a collection or subset of SNPs we wish to examine for evidence of a strong multi-SNP association with disease. To be clear, these SNPs need not be located on the same biological chromosome. The details of the fitness score function and crossover/mutation operations are given in [@GADGET2020], but in short, the intuitive aim is to assign high scores to collections of SNPs that are jointly carried by the disease affected children (cases) much more frequently than not. For affected/unaffected sibling-pairs, we operationalize this by comparing the paired sibling genotypes. For case-parent triads, we imagine a possible sibling for each case based on the alleles that were not transmitted, which we call the 'complement', and we compare the frequency with which subsets of SNPs are jointly transmitted to cases versus complements. Importantly, we do not assume a single subset of risk-relevant SNPs, so GADGETS does not directly output a single optimal subset but rather a number provisionally interesting collections of SNPs. Some final processing is therefore required to carry out overall inference and to tease out the actual risk-relevant subset(s) of SNPs. As a final note, users should be aware this method does not currently scale up to genome wide, but it does accommodate larger numbers of input SNPs than comparable methods. In our simulations, we have used up to 10,000 input SNPs, but users are encouraged to experiment with larger numbers if desired. # Basic Usage ## Load Data We begin our example usage of GADGETS by loading a simulated example of case-parent triad data. ```{r} library(epistasisGA) data(case) data(dad) data(mom) ``` These data were simulated based on a case-parent triad study of children with orofacial-clefts using the method described by @Shi2018 for triads consisting of mothers, fathers, and affected children. For each dataset, the rows correspond to families, and the columns correspond to SNPs. Because this is a small example just for illustration, each of these datasets includes only `r ncol(case)` SNPs. The SNPs in columns 51, 52, 76, and 77 are a true risk pathway (rs1731422, rs4237892, rs7985535, rs1487251), where the joint combination of variants substantially increases the penetrance compared to other genotypes. At present, GADGETS does not support the use of genotypes imputed with uncertainty (genotype 'dosages'), so the most likely genotype should be used for any imputed SNPs. GADGETS does, however, support missing genotypes. If a genotype is missing for a particular SNP in any family member, that genotype will be set to missing (-9) for all members of the family, and the family will be considered uninformative for that SNP. ## Pre-process Data The second step in the analysis pipeline is to pre-process the data. We begin pre-processing by constructing a block diagonal matrix to indicate whether a given pair of SNPs should be considered to be in linkage, where `TRUE` indicates a pair of SNPs are in linkage and `FALSE` otherwise. Below, we default to the assumption that SNPs located on the same biological chromosome are in linkage, but users need not make this assumption and are encouraged to more carefully tailor this matrix based on individual circumstances. An important note is that the matrix must be block diagonal (or uniformly set to TRUE), and the ordering of the columns in the input triad data must be consistent with the specified LD structure. For the example data, the SNPs are drawn from chromosomes 10-13, with the columns sorted by chromosome and 25 SNPs per chromosome. We therefore construct the matrix as follows: ```{r} library(Matrix) block.ld.mat <- as.matrix(bdiag(list(matrix(rep(TRUE, 25^2), nrow = 25), matrix(rep(TRUE, 25^2), nrow = 25), matrix(rep(TRUE, 25^2), nrow = 25), matrix(rep(TRUE, 25^2), nrow = 25)))) ``` Now, we can execute pre-processing: ```{r} pp.list <- preprocess.genetic.data(case, father.genetic.data = dad, mother.genetic.data = mom, block.ld.mat = block.ld.mat) ``` For affected/unaffected sibling study, the sibling genetic data should be supplied as the `complement.genetic.data` argument: ```{r, eval=F} ### FOR ILLUSTRATION ONLY, DO NOT TRY TO RUN THIS CHUNK ### pp.list <- preprocess.genetic.data(affected.sibling.genotypes, complement.genetic.data = unaffected.sibling.genotypes, block.ld.mat = block.ld.mat) ``` GADGETS can also accommodate studies with a mix of triads and unaffected/affected siblings. To input this kind of data, analysts would first need to generate the 'complement' data for triads, as follows: ```{r} complement <- mom + dad - case ``` Then the unaffected sibling genotypes should be concatenated with the complement data, and, similarly, the triad case genotypes and affected sibling genotypes should be concatenated, and these genotype matrices should be entered as the `complement.genetic.data` and `case.genetic.data` arguments for `preprocess.genetic.data`: ```{r, eval = FALSE} ### STILL JUST FOR ILLUSTRATION ### combined.case <- rbind(case, affected.sibling.genotypes) combined.complement <- rbind(complement, unaffected.sibling.genotypes) pp.list <- preprocess.genetic.data(case.genetic.data = combined.case, complement.genetic.data = combined.complement, block.ld.mat = block.ld.mat) ``` Importantly, X chromosome SNPs can be used as candidates for GADGETS, but some extra attention is required. Specifically, for affected/unaffected sibling data, both siblings in each pair should be of the same biological sex. For case-parent triad data, the 'complement' should be assumed to be of the same sex as the case. As a consequence, analysts must input triad data containing X chromosome SNPs as `case.genetic.data` and `complement.genetic.data`, as shown above, after manually creating the complement data for these triads. To do so, autosomal SNP complements should still be created as above (mother genotype + father genotype - case genotype), but the X chromosome SNP complement genotypes should be the allele transmitted by the father to the case *and* the allele *not* transmitted by the mother to the case. Note this will result in male X chromosome SNP case and complement genotypes being coded as 0 or 1, compared to 0, 1, or 2 for females. If males and females are both contained in the analysis dataset, analysts may wish to double male X chromosome SNP allele counts (male genotype 1 re-coded as 2) because half of the X copies are typically silenced in females. This, however, is not required and analysts can decide on a SNP-by-SNP basis. Alternatively, GADGETS could be run separately for males and females, in which case doubling male X chromosome allele counts would not be necessary. The `preprocess.genetic.data` function performs a few disparate tasks that users should note. First, it identifies the minor allele for each input SNP based on the observed frequency in the mothers and fathers (or both siblings in sibling studies). The coding is then flipped for any SNPs where the allele count corresponds to the major allele. The identities of these re-coded SNPs can be found in the output from `preprocess.genetic.data`. Afterwards, any SNPs with minor allele frequency below a given value, 0\% by default (no filtering), are removed. Following SNP filtering, $\chi^2$ statistics for univariable SNP-disease associations are computed for each SNP assuming a log-additive model using the method of conditioning on the set of transmitted and untransmitted genotypes, regardless of phase, described by @Cordell2002. These statistics are used in GADGETS when SNPs are sampled for mutation, with the sampling probability for a given SNP proportional to its $\sqrt{\chi^2}$ value. Alternatively, users can manually specify a vector proportional to SNP sampling probabilities using argument `snp.sampling.probs`. This may be of interest to users who wish to incorporate prior biological or expert knowledge into the algorithm to prioritize sampling a subset of particularly interesting SNPs. ## Run GADGETS We now use GADGETS to identify interesting collections of SNPs that appear to be jointly risk-related using the `run.gadgets` function. The method requires a number of tuning parameters, but the function defaults are a good starting point. More information regarding these parameters can be found in the package documentation and the paper describing the method [@GADGET2020]. Briefly, GADGETS requires the user to pre-specify the number of SNPs that may be jointly associated with disease status. This is controlled by the `chromosome.size` argument. We recommend running the algorithm for a range of sizes, typically 2-6. For this simple example, however, we will only examine sizes 3 and 4. Note that `run.gadgets` does not output results directly to the R session, it will instead write results to the directory specified in `results.dir`. Furthermore, as mentioned previously, GADGETS uses a technique known as an 'island model' to identify potentially interesting collections of SNPs. Each 'island' corresponds to a separately initialized population of `n.chromosomes` chromosomes, with each chromosome comprised of `chromosome.size` SNPs and each island randomly assigned to cluster of `island.cluster.size` islands. When GADGETS is run, each island's population evolves independently for intervals of a pre-specified number of generations (controlled by argument `migration.generations`) after which the top `n.migrations` chromosomes from each island 'migrate' to other islands in the cluster. This continues until each island in the cluster converges or the maximum number of generations (argument `generations`) is reached. Island clusters evolve completely independently from one another and, where computational resources permit, in parallel. Results for each island (argument `n.islands`) are written separately to `results.dir`. ```{r, message = FALSE} run.gadgets(pp.list, n.chromosomes = 5, chromosome.size = 3, results.dir = "size3_res", cluster.type = "interactive", registryargs = list(file.dir = "size3_reg", seed = 1300), n.islands = 8, island.cluster.size = 4, n.migrations = 2) run.gadgets(pp.list, n.chromosomes = 5, chromosome.size = 4, results.dir = "size4_res", cluster.type = "interactive", registryargs = list(file.dir = "size4_reg", seed = 1400), n.islands = 8, island.cluster.size = 4, n.migrations = 2) ``` The population size argument, `n.chromosomes`, does not have a default, so users need to carefully consider how to specify this parameter. In general, run-times are faster with smaller populations, but decreasing the population size may increase the chance of failing to identify true risk pathways. However, this behavior is also depends on parameters `n.islands` and `island.cluster.size`. Broadly, our goal is to maximize the number of distinct subsets of SNPs we examine for evidence of association with disease while maintaining acceptable run-times. Because the package allows island clusters to evolve in parallel, we generally pay less of a run-time price for large island numbers compared to large population sizes. We therefore typically choose to run many islands in small clusters with relatively small population sizes. More concretely, in simulations involving 10,000 input SNPs, we have observed good performance using 1,000 islands in clusters of 4 with populations of 200 chromosomes per island. The `cluster.type` and `registry.args` parameters are also important. For the above example, the "interactive" cluster type indicates that all islands run sequentially in the R session. However this is not how we anticipate `run.gadgets` will be used in most cases. Rather, we strongly recommend this function be used with high performance computing clusters to avoid prohibitively long run-times. An example (not run) of this type of command is given below: ```{r, eval=F} ### FOR ILLUSTRATION ONLY, DO NOT TRY TO RUN THIS CHUNK ### library(BiocParallel) fname <- batchtoolsTemplate("slurm") run.gadgets(pp.list, n.chromosomes = 20, chromosome.size = 3, results.dir = "size3_res", cluster.type = "slurm", registryargs = list(file.dir = "size3_reg", seed = 1300), cluster.template = fname, resources = list(chunks.as.arrayjobs = TRUE), n.islands = 12, island.cluster.size = 4) ``` The `cluster.template` must be properly calibrated to the user's HPC cluster. Packages `r CRANpkg("batchtools")` and `r Biocpkg("BiocParallel")` both contain good documentation on how to construct these files. It is also possible to run `run.gadgets` on a single machine with multiple cores (see `run.gadgets` documentation). Regardless of the chosen `cluster.type`, `run.gadgets` uses the functionality from package `r CRANpkg("batchtools")` to run jobs. In the case of HPC cluster use, with a properly configured `cluster.template`, users simply need to execute `run.gadgets` from an interactive R session and the jobs will be submitted to and executed on the cluster. This approach is well described in section 4.3.2 of the vignette "Introduction to BiocParallel" from package `r Biocpkg("BiocParallel")`. Depending on `cluster.type`, jobs may take minutes to hours to complete. The status of jobs can be queried using the functions in package `r CRANpkg("batchtools")`, most commonly `getStatus`. For users of HPC clusters, commands such as 'squeue' can also be used. For larger numbers of submissions, users may also construct their own automated scripts for checking that jobs have successfully completed. Perhaps the most obvious indication of a job failure is the `results.dir` will not contain `n.islands` distinct sets of results. This is particularly important when running jobs on an HPC cluster, where jobs may fail relatively cryptically. In the case of job failure due to problems with cluster schedulers (jobs fail to launch, node failure, etc.), the failed jobs can be re-run using the exact same `run.gadgets` command. The function will automatically identify the island jobs that still need to be run, and submit only those. This is also true for users running jobs on personal machines, who need to stop computations before all island results are available. Once users have confirmed `run.gadgets` has completed and run properly, the sets of results across islands should be condensed using the function `combine.islands`. Note that in addition to the results directory path, the function requires as input a data.frame indicating the RSIDs (or a placeholder name), reference, and alternate alleles for each SNP in the data passed to `preprocess.genetic.data` as well as the list output by `preprocess.genetic.data`. ```{r} data(snp.annotations) size3.combined.res <- combine.islands("size3_res", snp.annotations, pp.list, 2) size4.combined.res <- combine.islands("size4_res", snp.annotations, pp.list, 2) ``` Importantly, the function will write these results to the specified directory. The function condenses island results such that the rows are sorted in decreasing order according to fitness score, or, more plainly, with the most interesting SNPs appearing at the top of the dataset: ```{r} library(magrittr) library(knitr) library(kableExtra) kable(head(size4.combined.res)) %>% kable_styling() %>% scroll_box(width = "750px") ``` In this case, we see that the SNPs corresponding to rs1731422, rs4237892, rs7985535, and rs1487251 are the top collection, or, more simply, the group of 4 SNPs identified as most likely to exhibit a joint association with disease status. Note that these are the actual SNPs simulated to exhibit a joint effect, so we have identified the true risk pathway. Perhaps the most important elements of the output are the `n.cases.risk.geno`and `n.comps.risk.geno`columns. These report the number of cases and complements/unaffected siblings with the provisional risk genotypes, as determined by GADGETS, at each locus, among families where only one of the case and complement carries the risk genotype. For true epistatic SNP-sets, many more cases should carry the risk genotypes at each locus than the complements. A few important but subtle components of the results users should note are the `risk.allele`, `allele.copies`, `diff.vec`, `fitness.score`, and `n.islands.found` columns. For a given SNP within a chromosome, the corresponding `risk.allele` column reports the proposed risk-related nucleobase and the `allele.copies` column indicates the required number of copies for the proposed risk pathway. A '1+' indicates at least one copy of the risk allele is needed, while '2' indicates two copies are required. The `n.islands.found` column reports the number of distinct islands on which a specific chromosome was found. The `diff.vec` columns also may be descriptively interesting to some users. A positive value for a given SNP indicates the minor allele is positively associated with disease status, while a negative value implies the reference (‘wild type’) allele is positively associated with the disease. More generally, these values should be large in magnitude when a SNP is transmitted to cases much more often than complements. For true multi-SNP risk pathways, we expect pathway SNPs to be jointly transmitted to cases more often than compliments, therefore the magnitude of the `diff.vec` values should be large across the pathway. As such, high scoring chromosomes containing a subset of SNPs with small magnitude `diff.vec` values offer descriptive evidence that some of the chromosome's SNPs are not jointly transmitted to the cases and suggest a smaller chromosome size may be warranted. In the case of the true risk pathway SNPs, all `diff.vec` values are positive and reasonably large, and all `allele.copies` columns are '1+' indicating the risk pathway requires having at least one copy of the minor allele for all 4 SNPs. ## Permute Datasets Of course, in real studies we do not know the identity of true risk pathways. We would therefore like a way to determine whether the results from the observed data are consistent with what we expect under the global null hypothesis of no association between any input SNPs and disease status. We do so via permutation testing. Maintaining the case/complement (or case/unaffected sibling) nomenclature from above, in permuting the data, we randomly flip or do not flip the the disease status labels. We create these permuted datasets using the `permute.dataset` command. Here we generate 4 different permuted datasets. In real applications, users are advised to generate at least 100, more if computationally feasible. ```{r} set.seed(1400) perm.data.list <- permute.dataset(case, father.genetic.data = dad, mother.genetic.data = mom, n.permutations = 4) ``` ## Re-Run GADGETS Once we have created the permuted datasets, for each permutation, we perform exactly the same sequence of analyses shown above. Users should note this step is by far the most time consuming of the workflow and almost certainly requires access to a computing cluster. However, since this vignette provides only a very small example, we are able to run the analyses locally. We begin by pre-processing the permuted datasets: ```{r} preprocess.lists <- lapply(perm.data.list, function(permutation){ preprocess.genetic.data(permutation$case, complement.genetic.data = permutation$comp, block.ld.mat = block.ld.mat) }) ``` Then, we run GADGETS on each permuted dataset for each chromosome size and condense the results: ```{r, results = "hide"} #specify chromosome sizes chrom.sizes <- 3:4 #specify a different seed for each permutation seeds <- 4:7 #run GADGETS for each permutation and size lapply(chrom.sizes, function(chrom.size){ lapply(seq_along(preprocess.lists), function(permutation){ perm.data.list <- preprocess.lists[[permutation]] seed.val <- chrom.size*seeds[permutation] res.dir <- paste0("perm", permutation, "_size", chrom.size, "_res") reg.dir <- paste0("perm", permutation, "_size", chrom.size, "_reg") run.gadgets(perm.data.list, n.chromosomes = 5, chromosome.size = chrom.size, results.dir = res.dir, cluster.type = "interactive", registryargs = list(file.dir = reg.dir, seed = seed.val), n.islands = 8, island.cluster.size = 4, n.migrations = 2) }) }) #condense the results perm.res.list <- lapply(chrom.sizes, function(chrom.size){ lapply(seq_along(preprocess.lists), function(permutation){ perm.data.list <- preprocess.lists[[permutation]] res.dir <- paste0("perm", permutation, "_size", chrom.size, "_res") combine.islands(res.dir, snp.annotations, perm.data.list, 2) }) }) ``` ## Run Global Test Now we are ready to more formally examine whether our results are consistent with what would be expected under the null hypothesis of no association between the input variants and disease status. Our first step is to run a global test across all chromosome sizes examining whether the fitness scores from the observed data are drawn from the distribution expected under the null. Briefly, for each chromosome size separately, we sum the fitness scores of the top $k$ chromosomes for the observed data and each permuted dataset. We then rank the observed sum compared to the corresponding chromosome size specific sums for each permuted dataset. Letting $R_d$ denote the rank for chromosome size $d$, with 1 denoting the rank of the smallest sum of fitness scores, and with $N$ denoting the the number of permutes plus 1, we compute $T_d = -2ln((N - R_d + 0.5)/N)$. The test is based on summing $T_d$ over chromosome sizes, $T = \sum_d{T_d}$. We also compute this statistic for each permuted dataset and compute a permutation-based p-value using the $N$ values for $T$. Intuitively, this test assesses whether the top-scoring chromosomes are consistently higher than null data across chromosome sizes. A key feature of this approach is the test does not require adjustment for multiple comparisons despite the very large number of combinations being considered by the GA. The test is implemented in function `global.test`, with $k$ specified by argument `n.top.scores`. We generally specify `n.top.scores = 10`, but for this illustrative example, we only use 5. When GADGETS returns fewer than `n.top.scores` chromosomes for a given chromosome size for either the observed data or any permute, say $l^*$ chromosomes, the global test will be automatically computed using the top $l^*$ for that chromosome size. Note that the number of chromosomes actually used to compute the global test for each chromosome size will be stored in the `chrom.size.k` element of the `global.test` output. ```{r} # chromosome size 3 results # function requires a list of vectors of # permutation based fitness scores chrom3.perm.fs <- lapply(perm.res.list[[1]], function(x) x$fitness.score) chrom3.list <- list(observed.data = size3.combined.res$fitness.score, permutation.list = chrom3.perm.fs) # chromosome size 4 results chrom4.perm.fs <- lapply(perm.res.list[[2]], function(x) x$fitness.score) chrom4.list <- list(observed.data = size4.combined.res$fitness.score, permutation.list = chrom4.perm.fs) # list of results across chromosome sizes, with each list # element corresponding to a chromosome size final.results <- list(chrom3.list, chrom4.list) # run global test global.test.res <- global.test(final.results, n.top.scores = 5) # examine how many chromosomes were used for each chromosome size global.test.res$chrom.size.k # look at the global test stat and p-value global.test.res$obs.test.stat global.test.res$pval ``` We see the observed $T$ for our simple example is `r global.test.res$obs.test.stat` with associated permutation based p-value `r global.test.res$pval`. Note the permutation p-values for both for the `global.test` and `epistasis.test` functions are computed as one plus the number of permutation test statistics that exceed the observed test statistic divided the number of permutations plus one [@phipson], underscoring the need to run as many permutations as computationally possible. If we overlay the observed $T$ on a boxplot of the permutations, we see the observed distance is a huge outlier even when taking logs: ```{r, echo = FALSE} library(ggplot2) plot.data <- data.frame(distance_type = c(rep("original", 4), rep("log", 4)), data = rep(rep("Permuted", 4), 2), distance = c(global.test.res$perm.test.stats, log(global.test.res$perm.test.stats))) obs.plot.data <- data.frame(distance_type = c("original", "log"), data = rep("Observed", 2), distance = c(global.test.res$obs.test.stat, log(global.test.res$obs.test.stat))) ggplot(plot.data, aes(x = factor(""), y = distance, color = data)) + geom_boxplot() + geom_point() + geom_point(data = obs.plot.data, aes(x = factor(""), y = distance, color = data)) + facet_wrap(distance_type ~ ., scales = "free_y", strip.position = "left", nrow = 2, labeller = as_labeller(c(original = "T", log = "log(T)"))) + ylab(NULL) + xlab(NULL) + theme(strip.background = element_blank(), strip.placement = "outside", axis.ticks.x = element_blank(), axis.text.x = element_blank()) + guides(color=guide_legend(title="Data Type")) ``` ## Post-hoc Analyses Regardless of result, it is crucial that users are clear about what the test implies. The null hypothesis is that the observed test statistic is drawn from the null distribution, i.e., that the global patterns of transmissions to cases are not systematically different from those to the corresponding complements or unaffected siblings. In situations where we reject the null hypothesis, of course the conclusion is the observed test statistic is very different from from the null distribution. However, to best characterize results, it is incumbent on users to closely examine results beyond the global test. In particular, we would like to be able to identify the specific subsets of SNPs that are collectively related to risk. To that end, `global.test` provides additional, chromosome size specific information that users are encouraged to examine. First, users may look at the `obs.marginal.test.stats` and `marginal.pvals` objects from the `global.test` output: ```{r} global.test.res$obs.marginal.test.stats global.test.res$marginal.pvals ``` The `obs.marginal.test.stats` object reports the fitness score sums for each chromosome size for the observed data. The `marginal.pvals` object reports a permutation p-value for each distinct chromosome size. Second, users may also wish to examine the `max.obs.fitness` and `max.order.pvals` elements of the output: ```{r} global.test.res$max.obs.fitness global.test.res$max.order.pvals ``` The `max.obs.fitness` element reports the maximum fitness score in the observed data for each chromosome size, and the `max.order.pvals` element reports the corresponding permutation p-values. Given sufficient permutations to estimate the null distribution, these p-values correspond to the test of the null hypothesis that the maximum observed fitness score for a given chromosome size is not greater than what would be expected given no SNP-disease associations. Rejecting the null implies the observed maximum fitness score exceeds what would be expected by random chance. Finally, the function provides boxplots of the observed vs. permuted fitness scores for each element of the input results list: ```{r, fig.width=6} library(grid) grid.draw(global.test.res$boxplot.grob) ``` Note the numbers at the top of the plots correspond to the element of the input results list. In this example, the '1' corresponds to the chromosome size 3 results (the first element of the results list) and the '2' corresponds to the chromosome size 4 results (the second element of the results list). Above, we see the observed fitness scores tend to be higher than the null permutes, especially for chromosome size 4. Users should also be clear that rejecting the null hypothesis does not necessarily indicate that a collection of SNPs exhibit epistasis. Indeed, we may reject the null simply because we have identified a single SNP with a non-zero marginal disease association. To that end, we provide a permutation based procedure specifically examining evidence for epistasis among a collection of SNPs conditional on the marginal SNP-disease associations. To illustrate, in our example, we might be interested in whether the high score of the top 4 SNP chromosome was attributable to epistasis or at least one large marginal association. We can execute this test as follows: ```{r} top.snps <- as.vector(t(size4.combined.res[1, 1:4])) set.seed(10) epi.test.res <- epistasis.test(top.snps, pp.list) epi.test.res$pval ``` The test indicates the observed fitness score is unusual under the assumption of no epistasis among loci and given the observed marginal transmissions to the disease affected children. However, caution is warranted in interpreting this 'p-value'. The GADGETS stochastic search method identifies SNP-sets that appear to be interacting even under an independent-effects null, so the epistasis test p-value should be interpreted in an exploratory, hypothesis-generating context. We make use of these 'p-values' primarily in constructing network plots of potentially interesting SNP-sets. Users may also receive a warning when executing `epistasis.test` that says "All chromosome SNPs in linkage, returning NA for p-value". This is generated because the procedure can only run if at least two SNPs are not in linkage, as specified in matrix `block.ld.mat`, otherwise other methods (not provided in this package) are required. ## Visualize Results As a final step in the analytic pipeline, we recommend users examine network plots of the results using function `network.plot`. This may be particularly useful when trying to determine the true number of SNPs involved in multi-SNP risk pathways and to identify those SNPs. For instance, in the example above, the true risk pathway involves 4 SNPS, but we ran GADGETS for chromosome sizes of 3 and 4. For chromosome size 3, we saw that many of the top identified collections of SNPs were subsets of the true 4 SNP pathway. If we didn't know the true pathway size was 4, a network plot might help make this clearer. Briefly, we use our epistasis test 'p-values' to assign graphical scores to pairs of SNPs identified in top-scoring chromosomes, with higher scores corresponding to lower epistasis p-values (more substantial evidence for epistasis). We then aggregate those pair-scores across chromosome sizes to generate a final collection of SNP-pairs, which we display in a single network plot. We start by computing the SNP-pair scores using `compute.graphical.scores`. This function takes as required arguments a list of data.tables containing the results from GADGETS run for different chromosome sizes and the list of pre-processed data from function `preprocess.genetic.data`. For analysts who have run the permutation-based global test, we recommend restricting attention to chromosomes with fitness scores higher than what we would expect for null data. Specifically, the `global.test` function output contains a vector, `max.perm.95th.pctl`, that reports the $95^{th}$ percentile of the maximum observed fitness score across the null permutes for each chromosome size. We restrict our network plots to the chromosomes with fitness scores exceeding the corresponding null threshold for each chromosome size: ```{r} # vector of 95th percentile of null fitness scores chrom.size.thresholds <- global.test.res$max.perm.95th.pctl # chromosome size 3 threshold d3.t <- chrom.size.thresholds[1] # chromosome size 4 threshold d4.t <- chrom.size.thresholds[2] # create results list obs.res.list <- list(size3.combined.res[size3.combined.res$fitness.score >= d3.t, ], size4.combined.res[size4.combined.res$fitness.score >= d4.t, ]) ``` Alternatively, for analysts who have not run the global-test permutes, we recommend restricting attention to a subset of the top scoring chromosomes for each chromosome size. We've observed good results using the top 10, but we use the top 5 in the illustrative command below: ```{r} obs.res.list.no.permutes <- list(size3.combined.res[1:5, ], size4.combined.res[1:5, ]) ``` Once the results list has been prepared, we generate graphical scores for each SNP-pair. Since we've run the global test, we use the `obs.res.list` results below, but the steps would be exactly the same if we instead used the `obs.res.list.no.permutes` list. Note that for large numbers of top scoring chromosomes, this function may take at least 10-20 minutes to complete. ```{r} set.seed(10) graphical.scores <- compute.graphical.scores(obs.res.list, pp.list) ``` Next, we input these pair scores into function `network.plot` to actually plot the data. ```{r, fig.width = 14, fig.height = 12} network.plot(graphical.scores, pp.list, graph.area = 200, node.size = 40, vertex.label.cex = 2) ``` We see the true pathway SNPs appear prominently in this plot. Key features of the plotting approach are (1) the thickness of the SNP-to-SNP line segments is proportional to the log of the graphical score for that pair and (2) the area of each node circle is proportional to the log of the sum of the graphical scores for that SNP. Although not seen in the plot above, dashed connections indicate SNPs in high LD (pairwise $r^2 \geq 0.1$ by default) in the unaffected siblings/complements. This function uses the Fruchterman-Reingold force directed graph drawing algorithm, as implemented in the `qgraph` R package. If other layout algorithms are desired, users can specify `plot = FALSE` to `network.plot`, which will return an `igraph` object that can be used for more customized or interactive plotting. For instance, the `igraph` package function `tkplot` allows for interactive plotting of `igraph` objects. In real applications, there may be too many SNPs for a network plot to fit on a single page. In those instances, we plot a subset of the top scoring pairs. For example, we could plot the top 6 scoring pairs as follows: ```{r, fig.width = 14, fig.height = 12} network.plot(graphical.scores, pp.list, n.top.scoring.pairs = 6, graph.area = 10, node.size = 40, vertex.label.cex = 2) ``` As a more minor point, the SNP labels default to the RSIDs provided, as seen in the third and fourth columns of the `pair.scores` object, and the second column of `snp.scores`: ```{r} head(graphical.scores[["pair.scores"]]) head(graphical.scores[["snp.scores"]]) ``` However, if users desire custom labels, they can do so by changing the corresponding values. For instance, here we label the true risk SNPs with a star and leave all others blank: ```{r, fig.width = 14, fig.height = 12} risk.numbers <- c(51, 52, 76, 77) graphical.scores[[1]]$SNP1.rsid <- ifelse(graphical.scores[[1]]$SNP1 %in% risk.numbers, "*", "") graphical.scores[[1]]$SNP2.rsid <- ifelse(graphical.scores[[1]]$SNP2 %in% risk.numbers, "*", "") graphical.scores[[2]]$rsid <- ifelse(graphical.scores[[2]]$SNP %in% risk.numbers, "*", "") network.plot(graphical.scores, pp.list, graph.area = 200, node.size = 40, vertex.label.cex = 10) ``` # Cleanup and sessionInfo() {.unnumbered} ```{r, results="hide"} #remove all example directories perm.reg.dirs <- as.vector(outer(paste0("perm", 1:4), paste0("_size", chrom.sizes, "_reg"), paste0)) perm.res.dirs <- as.vector(outer(paste0("perm", 1:4), paste0("_size", chrom.sizes, "_res"), paste0)) lapply(c("size3_res", "size3_reg", "size4_res", "size4_reg", perm.reg.dirs, perm.res.dirs), unlink, recursive = TRUE) ``` ```{r} #session information sessionInfo() ``` # References {.unnumbered}