SimReg provides the function sim_reg for performing Bayesian Similarity Regression. The aim is to estimate the probability of an association between ontological term sets and a binary variable and, should such an association exist, to infer a characteristic ontological profile such that ontological similarity to the profile increases the probability of the binary variable taking the value TRUE. The procedure has been used to link an ontologically encoded phenotype (as HPO terms) to a binary genotype (indicating the presence or absence of a rare variant within given genes) [1], so this guide uses the same theme.

sim_reg is an MCMC routine, so its output consists of samples of parameter values from their posterior distributions. The function accepts many arguments, including the response variable y (logical), the ontologically encoded predictor variable x (list), arguments controlling the sampling pattern, arguments tuning the parameter proposal schemes and arguments specifying the prior distributions of the parameters.

It returns a list of traces for the sampled parameters. Two outputs are of particular interest. The first is gamma, the model selection indicator, which records whether each sample was drawn under the model with an association between x and y. Its posterior mean can be interpreted as an estimate of the probability of an association under the model assumptions; it is stored in the ‘mean_posterior_gamma’ slot of the output (i.e. result$mean_posterior_gamma) and can also be calculated using mean(result$gamma). The second is the posterior distribution of the characteristic ontological profile phi.
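As a minimal sketch of how the summary relates to the trace, the snippet below uses a made-up logical vector, gamma_trace, standing in for result$gamma (the values are purely illustrative and are not output of sim_reg). The mean of a logical vector in R is the proportion of TRUE entries, which is why mean(result$gamma) gives the estimated probability of association.

```r
# Hypothetical gamma trace: one logical value per MCMC sample
# (illustrative values only, not real sim_reg output).
gamma_trace <- c(FALSE, FALSE, TRUE, FALSE, TRUE,
                 FALSE, FALSE, FALSE, TRUE, FALSE)

# The mean of a logical vector is the proportion of TRUE values,
# i.e. the fraction of samples drawn under the association model.
mean_posterior_gamma <- mean(gamma_trace)
mean_posterior_gamma
```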

Because most of these arguments have default values, the algorithm can be demonstrated by passing only a few. To set up a workspace where we can run a simple example, we need an ontology_index object, so we load the ontologyIndex package, which contains an example: the Human Phenotype Ontology, hpo. We then create an HPO profile template and a superset of terms, terms, from which we’ll generate random term sets to run the algorithm on. In our setting, we’ll interpret this HPO profile template as the phenotype of a hypothetical disease. We set template to the set HP:0005537, HP:0000729 and HP:0001873, corresponding to the phenotype abnormalities ‘Decreased mean platelet volume’, ‘Autistic behavior’ and ‘Thrombocytopenia’ respectively.

suppressPackageStartupMessages(library(ontologyIndex))
suppressPackageStartupMessages(library(SimReg))
data(hpo)
set.seed(1)

template <- c("HP:0005537", "HP:0000729", "HP:0001873")
terms <- get_ancestors(hpo, c(template, sample(hpo$id, size=50)))

First, we’ll do an example where there is no association between x and y, and then one where there is an association.

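In the example with no association, we’ll fix y with 10 TRUEs and 90 FALSEs, and generate x randomly, with each set of ontological terms determined by sampling 5 random terms from terms. Each sample is passed through minimal_set, which removes redundant terms (those implied by a more specific term in the same set), so some sets end up with fewer than 5 terms.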

y <- c(rep(TRUE, 10), rep(FALSE, 90))
x <- replicate(simplify=FALSE, n=100, expr=minimal_set(hpo, sample(terms, size=5)))
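To illustrate why a sampled set can shrink after redundant terms are removed, here is a minimal base R sketch on a toy ontology. The term IDs, the toy_ancestors map and the function toy_minimal_set are all hypothetical; this is a simplified stand-in for the idea behind minimal_set, not the ontologyIndex implementation.

```r
# Toy ontology: each term mapped to all of its ancestors (hypothetical IDs).
toy_ancestors <- list(
    A=character(0),
    B="A",
    C=c("A", "B"),
    D="A"
)

# Hypothetical sketch of the idea behind minimal_set:
# drop any term that is an ancestor of another term in the set.
toy_minimal_set <- function(ancestors, terms) {
    implied <- unique(unlist(ancestors[terms]))
    setdiff(terms, implied)
}

reduced <- toy_minimal_set(toy_ancestors, c("A", "B", "C", "D"))
reduced
```

Here A and B are dropped because they are ancestors of C, leaving only C and D.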

Thus, our input data looks like:

y
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
head(x)
## [[1]]
## [1] "HP:0010337" "HP:0011947" "HP:0003256" "HP:0001155"
## 
## [[2]]
## [1] "HP:0004268" "HP:0011185" "HP:0100393" "HP:0008457" "HP:0002564"
## 
## [[3]]
## [1] "HP:0011304" "HP:0005420" "HP:0040133" "HP:0006563" "HP:0005625"
## 
## [[4]]
## [1] "HP:0004303" "HP:0010337" "HP:0000284" "HP:0012387"
## 
## [[5]]
## [1] "HP:0012823" "HP:0005938" "HP:0006563" "HP:0001991"
## 
## [[6]]
## [1] "HP:0040070" "HP:0012387" "HP:0030266" "HP:0000492" "HP:0009418"

Now we can call the sim_reg function to estimate the probability of an association (note: by default, the probability of an association has a prior of 0.05 and this can be set by passing a gamma_prior_prob argument), and print the mean posterior value of gamma, corresponding to our estimated probability of association.

no_assoc <- sim_reg(ontology=hpo, x=x, y=y)
no_assoc$mean_posterior_gamma
## [1] 0.04075
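One way to read this number is against the prior. The sketch below converts the default prior probability of 0.05 and the posterior estimate printed above into odds and takes their ratio, which is the Bayes factor implied by the two probabilities. The helper name implied_bayes_factor is ours, for illustration only, and is not part of SimReg.

```r
# Hypothetical helper (not part of SimReg): Bayes factor implied by a
# prior and a posterior probability of association.
implied_bayes_factor <- function(prior, posterior) {
    prior_odds <- prior / (1 - prior)
    posterior_odds <- posterior / (1 - posterior)
    posterior_odds / prior_odds
}

# Default prior (0.05) and the posterior estimate from the run above.
bf <- implied_bayes_factor(prior=0.05, posterior=0.04075)

# A ratio below 1 means the data shifted belief away from an association.
bf < 1
```

A posterior estimate of exactly 1 gives an infinite ratio, so this quantity is only informative when the gamma trace contains both TRUE and FALSE values.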

The estimated probability of association is low, slightly below the prior of 0.05. Now we sample x conditional on y, so that if y[i] == TRUE, then x_assoc[[i]] has 2 of the 3 terms in template added to its profile.

x_assoc <- lapply(y, function(y_i) minimal_set(hpo, c(
    sample(terms, size=5), if (y_i) sample(template, size=2))))

If we look at the first few values in x_assoc, for which y[i] == TRUE, we see that they contain terms from the template.

head(x_assoc)
## [[1]]
## [1] "HP:0008438" "HP:0004268" "HP:0100491" "HP:0006009" "HP:0045036"
## [6] "HP:0005537" "HP:0000729"
## 
## [[2]]
## [1] "HP:0100393" "HP:0010329" "HP:0006498" "HP:0001627" "HP:0002088"
## [6] "HP:0001873" "HP:0000729"
## 
## [[3]]
## [1] "HP:0005938" "HP:0100367" "HP:0010935" "HP:0000178" "HP:0006530"
## [6] "HP:0000729" "HP:0001873"
## 
## [[4]]
## [1] "HP:0006530" "HP:0001943" "HP:0006493" "HP:0001311" "HP:0003130"
## [6] "HP:0000729" "HP:0001873"
## 
## [[5]]
## [1] "HP:0006493" "HP:0011675" "HP:0200106" "HP:0001844" "HP:0008438"
## [6] "HP:0000729" "HP:0001873"
## 
## [[6]]
## [1] "HP:0001626" "HP:0012388" "HP:0011794" "HP:0009115" "HP:0000707"
## [6] "HP:0005537" "HP:0001873"
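Rather than inspecting the sets by eye, the template terms in each set can be counted. The following sketch does this on a small hand-made list, toy_x, with a matching toy_y (both are made-up stand-ins for x_assoc and y); the same sapply call should apply to the real objects. Because minimal_set drops redundant terms, a template term could in principle be removed when a more specific term is sampled alongside it, so a count below 2 would not necessarily indicate an error.

```r
template <- c("HP:0005537", "HP:0000729", "HP:0001873")

# Toy stand-ins for x_assoc and y (made-up term sets, for illustration only).
toy_x <- list(
    c("HP:0008438", "HP:0005537", "HP:0000729"),
    c("HP:0100393", "HP:0001873", "HP:0000729"),
    c("HP:0004268", "HP:0011185"),
    c("HP:0012823", "HP:0005938")
)
toy_y <- c(TRUE, TRUE, FALSE, FALSE)

# Number of template terms present in each term set.
n_template <- sapply(toy_x, function(term_set) sum(term_set %in% template))

# Compare the counts between cases (TRUE) and controls (FALSE).
split(n_template, toy_y)
```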

Now we run the procedure again with x_assoc and the same y, and print the mean posterior value of gamma.

assoc <- sim_reg(ontology=hpo, x=x_assoc, y=y)
assoc$mean_posterior_gamma
## [1] 1

This time we infer a much higher probability of association. We can also visualise the estimated characteristic ontological profile using the function phi_plot, and note that the inferred characteristic phenotype corresponds well to template.

phi_plot(hpo, assoc$phi[assoc$gamma], max_terms=10, fontsize=30)

Note that we must subset the $phi slot by $gamma, because the characteristic ontological profile phi has no effect when gamma == FALSE, so samples of phi drawn in that state should not contribute to the estimated profile. A more comprehensive summary of the output can be exported to PDF using the function sim_reg_summary.
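A simple numerical companion to the plot is the proportion of retained samples in which each term appears in phi. The sketch below assumes, as the indexing assoc$phi[assoc$gamma] suggests, that the phi trace is a list with one character vector of terms per sample. The objects toy_phi and toy_gamma are made-up stand-ins, not sim_reg output.

```r
# Made-up stand-ins for assoc$phi and assoc$gamma (illustration only).
toy_phi <- list(
    c("HP:0000729", "HP:0001873"),
    c("HP:0005537", "HP:0001873"),
    c("HP:0000118"),
    c("HP:0000729", "HP:0001873", "HP:0005537")
)
toy_gamma <- c(TRUE, TRUE, FALSE, TRUE)

# Keep only samples drawn under the association model (gamma == TRUE).
phi_given_assoc <- toy_phi[toy_gamma]

# Proportion of retained samples whose phi contains each term.
term_counts <- table(unlist(lapply(phi_given_assoc, unique)))
term_freq <- sort(term_counts / length(phi_given_assoc), decreasing=TRUE)
term_freq
```

Terms with a proportion close to 1 are those the sampler almost always includes in the profile once an association is assumed.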

References

  1. D. Greene, NIHR BioResource, S. Richardson, E. Turro, ‘Phenotype similarity regression for identifying the genetic determinants of rare diseases’, American Journal of Human Genetics, 2016 (to be released)