%
% NOTE -- ONLY EDIT .Rnw!!!
% .tex file will get overwritten.
%
%\VignetteIndexEntry{ontoTools: sgdiOntology}
%\VignetteDepends{}
%\VignetteKeywords{Genomics, Ontology}
%\VignettePackage{ontoTools}
%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}


\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{SGDI ontology development}
\author{VJ Carey \url{stvjc@channing.harvard.edu}}
\maketitle

\section{Introduction}

This document describes tasks related to ontology
development for the SGDI (software for genomic
data integration) project.  Principal concerns 
include
\begin{itemize}
\item establishment of conventions for describing
designs of and samples from microarray experiments;
\item establishment of software tools that help
implement these conventions;
\item maximizing reuse of programmatically accessible
vocabulary resources, such as those provided
by caCORE/EVS;
\item employing appropriate standards for metadata
structure design and deployment, such as RDF/OWL
models and associated XML serializations.
\end{itemize}

\section{Implementation issues}

We will distinguish three basic information structures:
\begin{itemize}
\item {\bf provenanceStruct:} a container for
information regarding the source and maintenance
of ontology-related information (implemented by the
\Rclass{provStruct} class);
\item {\bf nomenclature:} a set of tokens representing
terms or concepts, with
specified provenance and definitions;
\item {\bf ontology:} a nomenclature with a hierarchical
structure reflecting semantic relations among the terms.
\end{itemize}

\subsection{Nomenclature example}
An example of a nomenclature structure is given here:
<<echo=FALSE,results=hide>>=
options(width=70)
require(ontoTools, quietly=TRUE)
library(KEGG.db)
#require(KEGG, quietly=TRUE)
<<keggdemo,echo=FALSE>>=
KPL <- eapply(KEGGPATHID2NAME, function(x)x)
GDI_KEGGPATH <- new("nomenclature", name="KEGGPATH",
 provenance=new("provStruct", 
 URI="ftp://ftp.genome.ad.jp/pub/kegg/pathways/map_title.tab", 
 captureDate="June 30 2004", comment="Rel 30.0"), 
 inMappings=c("LL2KEGGmap.hsa", "LL2KEGGmap.rno"),
 terms=names(KPL), definitions=as.character(unlist(KPL)))
GDI_KEGGPATH
@
The information on ``GDI maps'' pertains to usage of the
nomenclature in other information structures.  In general
it will be important to catalog not only the terms and
vocabularies in which they are used, but also the
substantive data resources in which these vocabulary
resources are deployed.  For example, in cross-organism inference,
it will be useful to be able to identify which data resources
use KEGG identifiers to characterize genomic components or
gene products.  One can then iterate over only these
resources, searching for the presence of a given
set of KEGG identifiers.
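As a rough sketch of how such a search might look (not evaluated
here), the following code iterates over the data resources named in
the \verb+inMappings+ slot of \verb+GDI_KEGGPATH+; the slot name
mirrors the constructor argument used above, and the retrieval of
each mapping object by name, along with its structure as a list keyed
by gene identifier, is assumed purely for illustration.
<<keggsearchsketch,eval=FALSE>>=
## Sketch only (not run): search just the resources that declare
## usage of KEGG identifiers.  The get() retrieval and the list
## structure of each mapping object are assumed for illustration.
targetKEGG <- c("00010", "00020")      # example pathway IDs of interest
resources <- GDI_KEGGPATH@inMappings
found <- lapply(resources, function(rname) {
    map <- get(rname)                  # hypothetical mapping object
    names(map)[sapply(map, function(ids) any(ids %in% targetKEGG))]
})
names(found) <- resources
found
@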
\subsection{Ontology example}
An example of an ontology is given here.  We use the
class \Rclass{taggedHierNomenclature} to emphasize
\begin{enumerate}
\item the existence of tags, which are short, typically
semantically opaque tokens used for compact reference to
terms of interest;
\item the existence of hierarchical semantic relationships
among terms;
\item the extension of the formal nomenclature class.
\end{enumerate}
The data resource in this example is the NCI \textit{Thesaurus},
as opposed to the \textit{Metathesaurus}.  The Thesaurus
is made available to support ontological inference in ways
that are not straightforward for the Metathesaurus at this time.
<<lkGDINCI>>=
data(GDI_NCIThesaurus)
GDI_NCIThesaurus
@
There are helper functions that navigate the
ontology; at present a true graph is not employed.
<<lkpar>>=
mpar <- parents("Mesna", GDI_NCIThesaurus)
mpar
children( mpar, GDI_NCIThesaurus )
@
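Because no true graph object is employed, multi-step navigation is a
matter of repeated calls; a minimal sketch (not evaluated here) of
walking two levels up from a term follows.
<<walkup,eval=FALSE>>=
## Sketch only (not run): two-step upward navigation by repeated
## calls to parents(), collapsing duplicates along the way.
p1 <- parents("Mesna", GDI_NCIThesaurus)
p2 <- unique(unlist(lapply(p1, parents, GDI_NCIThesaurus)))
p2
@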
General regular expression matching can be used:
<<lkHER>>=
substring(grep("HER-2", GDI_NCIThesaurus),1,70)
@
When definitions are present, they can be obtained:
<<lkdef>>=
getDefs("Mesna", GDI_NCIThesaurus)
@
\section{Building a new ontology}
A workflow for building a new ontology is not
clearly established at present, but the basic tasks
appear to be
\begin{enumerate}
\item determine a set of concepts and associated terms;
\item examine existing ontologies for coverage of the
terms of interest;
\item if the application can proceed on the basis of the
harvesting of a single pre-existing ontology, it may suffice to
build a \Rclass{taggedHierNomenclature} instance
based on this ontology;
\item if the application requires a separate ontology
or desired concepts are not adequately covered, 
\begin{enumerate}
\item construct
a new OWL model for the ontology using Protege;
\item deserialize the OWL to an \Rclass{ontModel} using Rswub;
\item create a \Rclass{taggedHierNomenclature} instance
on the basis of lists derivable from the \Rclass{ontModel} instance.
\end{enumerate}
\end{enumerate}
We'll illustrate this process using some terms
related to breast cancer identified by Sridhar.
We'll work backwards from a data structure, \Robject{SGDIvocab},
currently in \Rpackage{ontoTools}.
Sridhar looked the terms up in the NCI Metathesaurus
and determined that they were covered in some sense, but he did not
provide the exact entry matching the intended concept.
The terms are
<<echo=FALSE>>=
data(SGDIvocab)
SGDIvocab@terms
@
The exact meanings of these terms are not completely clear.
The use of the \verb+_array+ suffix has no conventional
interpretation that I am aware of; likewise the \verb+_clinical+
suffix.  We may need to invent new terms and definitions
to clarify the intended meaning of these tokens.

We proceed provisionally; the resulting tools may be modified
at any time in the future as usage patterns emerge.

\subsection{Protege-based management of terms and structure}

Figure \ref{protfig} shows the Protege ontology editor
in use to define the BCTerms class.  There are 13 instances.
Note that, to the right of the display, there are a variety
of fields to be defined, including an RDFS comment field
and an \verb+NCI_Meta_tag+ field.
Figure \ref{prot2} shows the editor focused on the formal
tags provided by NCI Metathesaurus for terms semantically
similar (by informal matching) to
the BCTerms of interest.

\begin{figure}
\setkeys{Gin}{width=1.0\textwidth}
\includegraphics{protlk}
\caption{View of the Protege ontology editor for the SGDI vocabulary.
Focus on BCTerms.}
\label{protfig}
\end{figure}
\begin{figure}
\setkeys{Gin}{width=1.0\textwidth}
\includegraphics{prot2}
\caption{View of the Protege ontology editor for the SGDI vocabulary.
Focus on NCI metatags.}
\label{prot2}
\end{figure}

\subsection{Importation of OWL model}
The \Rpackage{Rswub} package is not yet ready for distribution,
but I will convert OWL to R structures as needed.
The ontology model is prescribed by the Jena system
from HP.  \Rfunction{readSWModel} returns an anonymous
omegahat reference to a Java class instance.
<<loadRswub,results=hide,eval=FALSE>>=
library(Rswub)
<<useRswub,eval=FALSE>>=
omod <- readSWModel("http://www.biostat.harvard.edu/~carey/SGDI.owl", asURL=TRUE)
<<local,echo=FALSE,results=hide,eval=FALSE>>=
omod <- readSWModel("/home/stvjc/Protege_3.0_beta/SGDI.owl")
omod@documentName <- "http://www.biostat.harvard.edu/~carey/SGDI.owl"
@
<<lkmod,eval=FALSE>>=
omod
@
\begin{Soutput}
Ontology model (instance of com.hp.hpl.jena.ontology.impl.OntModelImpl )
source: http://www.biostat.harvard.edu/~carey/SGDI.owl
There are 4 named classes.
Base namespace: http://www.owl-ontologies.com/unnamed.owl .
\end{Soutput}

The Java-based ontology model object here is completely
independent of the
tagged hierarchical nomenclature structures which are
implemented purely in R.  For our purposes, the \Robject{omod}
object is just a bridge from OWL to R.
\Robject{omod} can be interrogated for the underlying RDF model.
This is just a set of triples (subject, predicate, object);
\Rfunction{getSplits} returns a list of two elements, \Robject{bypred} and \Robject{bysub}.
<<lkont,eval=FALSE>>=
somod <- getSplits(omod)
names(somod)
@
We have defined \verb+NCI_Meta_tag+ as a property
with domain BCTerms and range \verb+NCI_Meta_Termset+.
<<lkmt,eval=FALSE>>=
somod$bysub$NCI_Meta_tag
@

The value of the property for each BCTerm instance is:
<<lkbc,eval=FALSE>>=
somod$bypred$NCI_Meta_tag
@

This table is the basis of the tagged nomenclature that
we need in the SGDI ontology.  The ``subj'' values
are the tokens we are interested in.  The ``obj'' values
are the formal identifiers of the NCI Metathesaurus
entries to which we want to map our tokens.  When caCORE
is working, we will use these formal identifiers to retrieve
detailed definitions, focused keyword matches, and so forth.
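As a minimal sketch (not evaluated here), a named lookup vector from
tokens to Metathesaurus identifiers could be assembled along the
following lines; the \verb+subj+ and \verb+obj+ column layout of the
split table is an assumption made for illustration.
<<tok2meta,eval=FALSE>>=
## Sketch only (not run): map each BCTerm token to its formal NCI
## Metathesaurus identifier.  The subj/obj column layout of the
## bypred table is assumed for illustration.
tagtab <- somod$bypred$NCI_Meta_tag
tok2meta <- structure(as.character(tagtab$obj),
                      names=as.character(tagtab$subj))
head(tok2meta)
@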

\subsection{The nomenclature}
<<lknom>>=
data(SGDIvocab)
SGDIvocab
@

\section{Summary and future work}
We have defined a few classes representing
nomenclatures and hierarchical nomenclatures.
Populating instances of these classes with existing
genomic information was illustrated with
KEGG and the NCI Thesaurus (not the Metathesaurus).
Search and navigation of these structures
are supported, but additional facilities will
be required as the workflow clarifies.

When a specific collection of clinical and
technical terms is identified, we propose
to formally manage the collection using
Protege to define an OWL/RDF model.  The model
can be deserialized to Java and R structures
using \Rpackage{Rswub}, which will be placed in Bioc
development in the near future.

The OWL/RDF model has a regimented graph
structure.  It is a collection of linked
ordered triples with interpretation (subject, predicate,
object).  When OWL objectProperty status is
conferred on an entity (property term), a domain and range can
be defined for that property.  A benefit of formal
specification of domain and range is that misuse of
the property can be programmatically forbidden.
In our example, the \verb+NCI_Meta_tag+ property maps from 
the BCTerms set to an instance of the \verb+NCI_Meta_termset+.
If we ask for the \verb+NCI_Meta_tag+ property of any BCTerm instance,
we are guaranteed to receive an instance of \verb+NCI_Meta_termset+.
An instance of the \verb+NCI_Meta_termset+ is composed of a formal
NCI Metathesaurus alphanumeric tag, and a brief verbal definition.
Ultimately the tag will be used in programmatic interrogation
of the metathesaurus for ancillary information such as
detailed definition, links to defining and illustrative documents,
and so forth.

Use of OWL technology provides access to techniques
for vocabulary harmonization.  Schemas may be defined
that establish equivalent properties with different names,
or set-theoretic combinations of formal classes,
and models may be revised using formal inference based
on the schema.  Logical inference within the OWL model
is also supported.


\end{document}