%% Sweave("GeneR/inst/doc/GeneR.Rnw")
\documentclass[a4paper]{article}

\usepackage{graphics}

\title{GeneR Package}
\author{L. Cottret, A. Lucas, O. Rogier, E. Marrakchi, V. Lefort,\\
 P. Durosay,
	A. Viari, C. Thermes \& Y. d'Aubenton-Carafa}

%\VignetteIndexEntry{Introduction to GeneR}
%\VignettePackage{GeneR}
% Because NCBI changed the web service the seqNcbi() function
% needs to connect to, the function is broken and consequently
% some of the code in this vignette too. I'm turning off code
% evaluation so GeneR can make it into the BioC 2.7 release but
% seqNcbi() will need to be fixed ASAP. Thanks!
%\SweaveOpts{echo=FALSE}
\SweaveOpts{eval=FALSE,echo=FALSE}
\usepackage{a4wide}

\begin{document}

\maketitle

\tableofcontents

\section{Overview}
  To summarize, we can split major functions into 5 categories.

  \subsection*{Reading and writing sequences}

    GeneR has been designed for fast sequence retrieving even from very large sequence databanks, in Fasta, Embl or Genbank formats. It is also possible to enter sequences directly from a R command line.

  \subsection*{Handling sequences}
  
  The usual copy-paste of parts of 
  sequences or other manipulations can be performed by functions using vectors
  of strands and positions.
  Annotations from the features field within formatted sequence entries can be extracted and used directly in vectors. By this way, it is easy to extract sequence fragments of interest.

  \subsection*{Analyzing sequences }
  
  Some tools are designed to count oligo-nucleotides, to look for exact word positions or to shuffle sequences (useful for statistical validations).


  \subsection*{Manipulation of regions on a chromosome}

  In all the amount of annotations, we often have mass informations on
genes (introns, CDS, isoforms) but we have to deduce intergenic
regions, exons or more sophisticated regions. We provide a 
complete set of tools to easily compute any subregions, without
an exhaustive texture on a whole chromosome.

  \subsection*{Genetic tools }
  
  Finally, the package also contains functions related to genetic and structural aspects of the sequences : ORF localization, translation,
 or RNA secondary structure determination
  (with extention of GeneR: GeneRfold package).\\


\section{Installation}

%From your R session, type:
%\begin{verbatim}
%source("http://www.bioconductor.org/getBioC.R")
%getBioC(libName='GeneR',develOK=TRUE)
%\end{verbatim}

%Or  as for all R package sources, you can try also something like:
With package sources, you can try also something like:
\begin{verbatim}
R CMD INSTALL GeneR_x.x-x.tar.gz
\end{verbatim}
or try in the menu (for Windows and mac users). 

\section{Quick Start}

First steps on GeneR could be to check man page of GeneR:
\begin{verbatim}
library(GeneR)
help(GeneR)
\end{verbatim} 

\noindent
And for impatient:
\begin{verbatim}
demo(GeneR)
\end{verbatim} 



\section{Usage}


\subsection{Basique manipulation with sequences as character string}

Standard R commands allow to extract subpart of a character string, 
or append two character string, put in upper case (functions
{\em substr, paste, toupper}). We add tools for the specific use
of genetic sequences: insert a sequence into a first one, compute
the reverse complementary, count mono, di or tri-nucleotides of 
sequence and the possibility to import sequence from a Embl, Fasta 
or GeneBank file.\\

\noindent
{\bf Example}

\noindent
Insert a poly A into sequence
<<echo=TRUE>>=
library(GeneR)
s <- "gtcatgcatgctaggtgacagttaaaatgcgtctaggtgacagtctaacaa"
s2 <- insertSeq(s,"aaaaaaaaaaaaaa",20) 
s2
@ 


\noindent
Compute the reverse complementary:
<<echo=TRUE>>=
library(GeneR)
strComp(s)
@ 


\noindent
Count mono nucleotides, di-nucleotides (wsize can be from 1 to 15 if computer
has enough memory).
<<echo=TRUE>>=
strCompoSeq(s2,wsize=1)

strCompoSeq(s2,wsize=2)
@ 

\noindent
Get some data from the web
<<echo=TRUE>>=
seqNcbi("BY608190",file='toto.fa')
strReadFasta('toto.fa',from=10,to=35)
@ 

\subsection{Manipulation of sequences with buffers}

\subsubsection{Introduction}
To work on large sequences (i.e. a whole chromosome), we use a
C++ adapted class that requiers a minimum memory allocation.


Standard use: a chromosome load in the buffer, all computation
done with GeneR functions, and only usefull results returns to
standard R objects.\\


\subsubsection{Basic manipulation}

\noindent
Sequence read from a fasta file:
<<echo=TRUE>>=
readFasta('toto.fa')
readFasta('toto.fa',seqno=1)
@ 


\noindent
Show current buffer
<<echo=TRUE>>=
getSeq(from=10,to=35)
@ 

\noindent
Sames previous functions with buffers: Extract from multiple positions
<<echo=TRUE>>=
getSeq(seqno=0,from=c(1,10,360),to=c(10,20,0))
assemble(seqno=0,destSeqno=1,from=c(1,10,360),to=c(10,20,0))
getSeq(seqno=1)
@ 

\noindent
See the end of sequence
<<echo=TRUE>>=
sizeSeq(seqno=0)
size0 <- sizeSeq(seqno=0)
getSeq(seqno=0,from=size0-20,to=0)
@ 

\noindent
Append and concat:
<<echo=TRUE>>=
getSeq(0,from=1,to=35)
getSeq(1)
appendSeq(0,1)
getSeq(seqno=0,from=size0-20,to=size0+10)

concat(seqno1=0,seqno2=1,destSeqno=3, from1=2,to1=10,from2=8,to2=0, strand1=1)
getSeq(3)
@ 

\subsubsection{Look for motifs, composition, masked position}

Several tools are designed to
\begin{itemize}
\item  gets match positions of an oligomer in fragments of a sequence
\item  gets ORFs (Open Reading Frames) from a sequence. 
\item Get composition in  mono, di or trinucleotides of 
  fragments of a sequence 
\item mask sequence in lower case or with letter 'N'
\item get masked position from a sequence.
\item change sequence to Dna or Rna.
\end{itemize}


\noindent
Look for motifs
<<echo=TRUE>>=
exactWord("AAAG",seqno=0)
@ 

\noindent
 Find Orfs 
<<echo=TRUE>>=
getOrfs(seqno=0)
maxOrf(seqno=0)

@ 

\noindent
Mask sequences while lowering case of parts of sequence

<<echo=T>>=
x=c(5,15,30);y=c(10,20,35)
lowerSeq(from=x,to=y)
getSeq(from=1,to=35)
@ 

\noindent
And the reverse: get masked positions from a sequence:

<<echo=T>>=
posMaskedSeq(seqno=0,type="lower")
posMaskedSeq(seqno=0,type="upper")
@ 

\noindent
Or masked with N:
<<echo=T>>=
s <- "ATGCtgTGTTagtacATNNNNNNNNNNNNNNNTGGGTTTaAAAattt"
placeString(s,upper=FALSE,seqno=0)
posMaskedSeq(seqno=0,type="N")
@ 


\noindent Change from Dna to Rna alphabet
<<echo=T>>=
dnaToRna()
getSeq()
@
\subsubsection{Adresses manipulations}

\paragraph{Presentation}

  When GeneR load a subset of a larger sequence stored in a bank file,
%  (by example a gene from a whole chromosome).
  it will store the following informations in the C adapted class
  (buffers, by default 100 buffers than can be extended if necessary):
  \begin{itemize}
    \item subsequence (i.e. the succession of A,T,G,C).
    \item postions of the extremities of the subsequence  in the
    master sequence
    \item size of the whole sequence in the bank file
    \item name of the sequence
 \end{itemize}
  
    For specific purposes as renaming a sequence, all these variables
    can be viewed and carefully changed at any time  (here functions
    {\em getAccn} and {\em setAccn}).
    
    Several sequences can be stored
    simultaneously and called by their buffer number.

    Strand is another global variable which can be set and viewed 
    (functions {\em getStrand} and {\em setStrand}). It
    is used as input parameter in many functions to analyze
    complementary strand. It was designed to avoid doing explicitly the
    complement of the loaded strand then to store it in a buffer with,
    as consequence, loss of the informations linked to the master sequence.
    
  
    We have defined 3 types of addresses on a subsequence extracted from
    a master sequence:

  \begin{itemize}
    \item Absolute addresses i.e. addresses on the master sequence, from the 5' end
    of the input strand refered as forward (noted A)
    \item Real addresses, i.e.  addresses on the master sequence, from the 5'
    end of one of strands (noted R)
    \item Relative addresses, i.e.  addresses on working subsequence,
    from the 5' end of one of strands (noted T).
  \end{itemize}
  
  Let's show an example, if we read sequence from 11 to 25 from
  a gene of size 40 (see Figure\ref{adressage}).



  Obviously, when an entire sequence is stored, real and relative
  addresses will be the same.
    
  Although all functions using positions need and return absolute
  addresses, 6 functions allow to convert R, A, T into any other type
  (functions {\em RtoA, RtoT, AtoR,
    AtoT, TtoR, TtoA}).
  
  A global variable {\em strand} is used to convert positions
  (see  {\em setStrand getStrand}).

We keep globals variables of sequences...
We implement tools to ...

\begin{figure}
\resizebox{\textwidth}{!}{\includegraphics{GeneR_address}}
\caption{Adresses of a sequence}
\label{adressage}
\end{figure}

\paragraph{Example}

\noindent
Imagine our chromosome is as follow, and we study only part of
chromosome between position 11 to 25:
<<echo=TRUE>>=
s <- "xxxxxxxxxxATGTGTCGTTAATTGxxxxxxxxxxxxxxx"
placeString(s)
setStrand(0)
writeFasta(file="toto.fa")
readFasta(file="toto.fa",from=11,to=25)
getSeq()
@ 

\noindent
Orfs on 'Absolute' and 'relative' Adresses
<<echo=TRUE>>=
getOrfs()
AtoT(getOrfs())
@ 

\noindent
More Fun:
<<echo=TRUE>>=
getSeq()
exactWord(word="AA")
AtoT(exactWord(word="AA")[[1]])

setStrand(1)
getSeq()
exactWord(word="AA")
AtoT(exactWord(word="AA")[[1]])
exactWord(word="TT")
AtoT(exactWord(word="TT")[[1]])
@ 


\subsection{Bank files}

\subsubsection{Retrieve sequences}

We develop 2 functions to get sequence files from internet:
seqUrl (from a srs server) and seqNcbi (from Ncbi). 

Sequence are retrieve in Fasta or Embl format for seqUrl;
Fasta or GenBank format for seqNcib.

<<echo=TRUE,eval=false>>=
seqNcbi("BY608190",file="bank.fa")
seqNcbi("BY608190",file="BY608190.gbk",type="genbank")

#seqSrs("embl|BY608190|BY608190",file="BY608190.embl",submotif=TRUE)
@ 
<<echo=FALSE,results=hide>>=
setStrand(0)
seqNcbi("BY608190",file="bank.fa")
seqNcbi("BY608190",file="BY608190.gbk",type="genbank")

#seqSrs("embl|BY608190|BY608190",file="BY608190.embl",submotif=TRUE)
@ 


\subsubsection{Fasta files}

Basic tools to get or write informations from bank file
<<echo=T>>=
readFasta(file="bank.fa",
          name="gi|26943372|gb|BY608190.1|BY608190",
          from=1,to=30)
getSeq()
fastaDescription(file="bank.fa",
                 name="gi|26943372|gb|BY608190.1|BY608190")
@ 

And to write a to the fasta bank file:
<<echo=TRUE>>=
s="cgtagctagctagctagctagctagctagcta"
placeString(s,seqno=0)
writeFasta(seqno=0,file="bank.fa",name="MySequence",
           comment="A sequence Generated by R",append=TRUE)

## Show bank file
cat(paste(readLines("bank.fa"),collapse='\n'))
@ 

\subsubsection{Embl files}

We made basic tools to get sequence and write Embl files.
<<echo=T>>=
#     readEmbl(file='BY608190.embl',name='BY608190',from=10,to=30)
#     getSeq()

     s<-"gtcatgcatcctaggtgtcagggaaaatgcgtctacgtgacagtctaacaa"
     placeString(s)

     # Add lines with "CC   bla bla bla" and a line "XX"
     writeEmblComment(file="toto.embl",code="CC",text="This is a comment for \
     this dummy sequence... I try to be long enough to show that this comment \
     will be written on several lines",append=FALSE)

     # Add a line with "FT  CDS  bla bla bla"
     writeEmblLine(file="toto.embl",code="FT",header="CDS",text="<1..12",
                   nextfield = FALSE)
     # Add lines with "FT       bla bla bla"
     writeEmblLine(file="toto.embl",code="FT",header="",text="/codon_start=2",
                   nextfield = FALSE)
     writeEmblLine(file="toto.embl",code="FT",header="",text="/gene=\"toto\"",
                   nextfield = FALSE)
     writeEmblLine(file="toto.embl",code="FT",header="",text="/note=\"Here is \
     what I think about this gene\"",nextfield = TRUE)

     ## Translation
     prot <- translate(seqno=0,from=getOrfs()[1,1],to=getOrfs()[1,2])
     writeEmblLine (file="toto.embl",code='FT',header='',
                    text=paste('/translation="',prot ,'\"',
                    sep=''),nextfield =FALSE)

     # Add sequence
     writeEmblSeq(file="toto.embl")

     ## Show file
     cat(paste(readLines("toto.embl"),collapse='\n'))
@ 


\subsubsection{GenBank files}

We have a function to open any sequence from a GeneBank files
<<echo=T>>=
     readGbk(file='BY608190.gbk',name='BY608190',from=10,to=30)
     getSeq()
@ 


\subsection{Segments manipulations}

\subsubsection{Overview}

We made a set of tools manipulation segments. The aim of these 
classes is to describe regions on chromosomes
that are discontinuous segments on a line like:

\begin{verbatim}            
            1        10             25             40
Region A:   ##########              #######
Region B:   ####  ####              #############
Region C:                                          ######  ####
\end{verbatim}

\noindent
We made two kind of class
  
\begin{itemize}
\item {\tt segSet}: segments set, is a matrix nx2 composed of a
  column of "from", and a column of "to". Used to describe a region like
  'A' or 'B' in our example. (A matrix 3x2 to describe region 'B').
\item {\tt globalSeg}: a list of {\tt segSet}. It allows the
  notion of list of discontinuous segments (our use: a list of gene's
  models as a gene's model is stored as a list of its exons). In our
  sample, globalSeg will be the list of the 3 regions A,B and C. Note
  that it store more information than just a matrix with 2 columns
  containing all {\tt segments} of theses 3 regions.    
\end{itemize}

\noindent
For a better comprehension of other man pages, we introduce this
notation:
\begin{itemize}
\item a segment is just a part of a line determined by two values
      (from and to)
\item an object of class segSet is a set of {\tt n} segments,
  determined by a matrix {\tt n}x2
\item an  object of class globalSeg is a set of
  segSets, determined by a list of matrix.
\end{itemize}



\subsubsection{Examples}

We show segments usage with figure:
\begin{itemize}
\item Figure~\ref{segset} presents our objects
\item Figure~\ref{unionseg} show union, intersection, and basic manipulation
      on object of class {\em segSet}.
\item Figure~\ref{unionglobseg} show union, intersection, and basic manipulation
      on object of class {\em globalSeg}.
\item Figure~\ref{unary} show unary operators on our objects.
\end{itemize}


\begin{center}
\begin{figure}
<<echo=TRUE,results=hide,fig=T,width=8,height=5>>=
a = matrix(c(1,5,15,45,17,38,
             100,120,130,140,
             135,145,142,160),
             ncol=2,byrow=TRUE)
b  = matrix(c(15,18, 28,45,
              1,10, 15,20, 25,40,
              17,23, 35,38,100,105,
              110,120),ncol=2,byrow=TRUE)

a <- as.segSet(a)
b <- as.segSet(b)

A = list(
         matrix( c( 1, 15, 17,  5, 45, 38),ncol=2),
         matrix( c( 100 , 120),ncol=2),
         matrix( c( 130, 135, 140, 145),ncol=2),
         matrix( c( 142 , 160),ncol=2))

B = list(
         matrix( c(15, 28, 18, 45),ncol=2),
         matrix( c(1, 15, 25, 10, 20, 40),ncol=2),
         matrix( c(17, 35, 23, 38),ncol=2),
         matrix( c(100, 110, 105, 120),ncol=2))

A = as.globalSeg(A)
B = as.globalSeg(B)


c = or(a,b)
d = and(a,b)
e = not(a,b)
f = Xor(a,b)
par(mfrow=c(4,1))
##par(mfrow=c(2,1))
plot(a,xlim=c(1,160),main="a (segSet)")
plot(b,xlim=c(1,160),main="b (segSet)")
plot(A,xlim=c(1,160),main="A (globalSet)")
plot(B,xlim=c(1,160),main="B (globalSet)")
@ 
\caption{Representation of the two class of elements.}
\label{segset}
\end{figure}
\end{center}

\begin{center}
\begin{figure}
<<echo=F,results=hide,fig=T,width=8,height=5>>=
g = or(a)
h = and(a)
#i = not(a)
#j = Xor(a)

C = or(A,B)
D = and(A,B)
E = not(A,B)
F0 = Xor(A,B)
G = or(A)
H = and(A)
#I = not(A)
#J = Xor(A)

par(mfrow=c(4,1))
##par(mfrow=c(2,1))
plot(c,xlim=c(1,160),main="a or b")
plot(d,xlim=c(1,160),main="a and b")
plot(e,xlim=c(1,160),main="a not b")
plot(f,xlim=c(1,160),main="a Xor b")
@ 
\caption{Union, intersection, substraction on segments set.}
\label{unionseg}
\end{figure}


\begin{figure}
<<echo=FALSE,results=hide,fig=T,width=8,height=5>>=
par(mfrow=c(4,1))
##par(mfrow=c(2,1))
plot(C,xlim=c(1,160),main="A or B")
plot(D,xlim=c(1,160),main="A and B")
plot(E,xlim=c(1,160),main="A not B")
plot(F0,xlim=c(1,160),main="A Xor B")
@ 
\caption{Union, intersection, substraction on global segments.}
\label{unionglobseg}
\end{figure}
\noindent


\begin{figure}
<<echo=FALSE,results=hide,fig=T,width=8,height=5>>=
par(mfrow=c(4,1))
##par(mfrow=c(2,1))
plot(G,xlim=c(1,160),main="or A")
plot(H,xlim=c(1,160),main="and A")
#plot(I,xlim=c(1,160),main="not A")
#plot(J,xlim=c(1,160),main="Xor A")

plot(g,xlim=c(1,160),main="or a")
##plot.segSet(h,xlim=c(1,160),main="and a")
#plot(i,xlim=c(1,160),main="not a")
#plot(j,xlim=c(1,160),main="Xor a")
@ 
\label{unary}
\caption{Unary operator.}
\end{figure}
\end{center}

\subsection{Random sequences}

\subsubsection{Presentation}

We made two tools to generate a random sequence, or to randomize
an existing sequence. 

Function {\em randomSeq} creates a random sequence from a
distribution of nulcleotides, of ploy-nucleotides. A real
composition of nucleotides can be use from function {\em 
compoSeq}, with param {\tt p=TRUE}.

Function {\em shufleSeq} generate a sequence while assembling 
at random specific number of each nucleotides (or ploy-nucleotides).
These number of nucleotide can be provided by function {\em
compoSeq}, with param {\tt p=FALSE}: it is then a re-assemblage
of all nucleotides (or tri-nucleotides, or ploy-nucleotides)
of a real sequence.


\subsubsection{Example}

<<echo=TRUE>>=
## Set seed of your choice (not requiered)
set.seed(3)

####  ----  RANDOMSEQ  ----
## Create a sequence of size 30, GC rich
randomSeq(prob = c(0.20, 0.30, 0.20, 0.30), 
                 letters = c("T", "C","A", "G"), n = 30)
## [1] "CTGGAACCGAGGGGTTCATCCCCCCAGTGA"

## use with bi-nucleotides
randomSeq(prob=rep(0.0625,16),letters = c("TT","TC","TA","TG",
          "CT","CC","CA","CG","AT","AC","AA","AG","GT","GC","GA","GG"),n=10)
## [1] "CGCATGATCCCAGGCTAACT"

####  ----  SHUFFLESEQ  ----
## Create a sequence with 7 T, 3 C and A, and 4 G.
shuffleSeq(count=c(7,3,3,4,0),letters=c("T","C","A","G","N"))
## [1] "TATCTTTTGTCGGACGA"

## Same with bi-nucleotides
shuffleSeq(count=c(rep(4,4),rep(2,4),rep(1,4),rep(0,4)),
           letters = c("TT","TC","TA","TG","CT","CC","CA",
                       "CG","AT","AC","AA","AG","GT","GC","GA","GG"))
## [1] "TCTTTCCATTCCTTCTAGTGTACCCGTATACGTGTCTGTGTACTTCAACAACTTAT"


## From a real sequence:
seqNcbi("BY608190",file="BY608190.fa")
readFasta("BY608190.fa")

## create a random sequence from a real tri-nucleotides distribution
## Size of sequence will be 10*3.
randomSeq(compoSeq(wsize=3,p=TRUE),n=10)

## re assemble real tri-nucleotides of a real sequence
shuffleSeq(compoSeq(wsize=3,from=1,to=30,p=FALSE))
@


\subsection{Profiles}

\subsubsection{Overview}

Computes profile(s) of user defined quantities around sites of interest in sequence fragments. Profile(s) is(are) constituted of bins of equal size around the sites of interest named origins. It produces for each bin, and for each quantity the mean, the standard deviation and the number of valid events.

\subsubsection{Examples}


<<echo=TRUE,results=hide,fig=T>>=
  ## We create a sequence with a biais every 100 nucleotides
  s <- ""
  for(i in 1:20)
     s <- paste(s, randomSeq(n=100),randomSeq(prob=c(0.3,1,1,1,0)/3.3,n=100),sep="")
 
  placeString(s,seqno=0)

  dens <- densityProfile(ori=200*(1:20)-100,from=200*(0:19)+1,to=200*(1:20),
              seqno=0, fun=seqSkew,nbinL=10,nbinR=10,sizeBin=10)


  plot(dens$skta,main="TA skew")

  ## Example with flagged 'N'

  ## We create a sequence with a biais every 100 nucleotides
  s <- ""
  for(i in 1:20)
     s <- paste(s, randomSeq(n=100),randomSeq(prob=c(0.3,1,1,1,0.2)/3.5,n=100),sep="")
 
  placeString(s,seqno=1)

  dens2 <- densityProfile(ori=200*(1:20)-100,from=200*(0:19)+1,to=200*(1:20),
              seqno=1, fun=compoSeq,nbinL=10,nbinR=10,sizeBin=10)

  plot(dens2$T,main="#T")

  ## The same but more permissive (allow 4 N in each bin)

  dens3 <- densityProfile(ori=200*(1:20)-100,from=200*(0:19)+1,to=200*(1:20),
              seqno=1, fun=compoSeq,nbinL=10,nbinR=10,sizeBin=10,threshold=4)

  plot(dens3$T,main="#T")
@

\section{A complete Example}

In this section, we will gets intronic, 3' UTR and intergenic
compositions on a whole chromosome (Human: chromosome 3). 
Data will be retrieve from
Ucsc Mysql server, (it requiers to install RMySql package).


First download a fasta file of chromosome 3:

\begin{verbatim}
download.file('http://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/chr3.fa.gz',
               destfile='chr3.fa.gz')
\end{verbatim}


Gets all information on this gene from Ucsc Mysql database:

\begin{verbatim}
library(RMySQL)
con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", 
         user="genome", dbname="hg17")

rs <- dbSendQuery(con, statement = paste(
                         "SELECT * ",
                         "FROM knownGene",
                         "WHERE chrom = 'chr3'",
                         "AND strand = '+'"
                         ))
ucscSQLdataKnownGene <- fetch(rs, n = -1)

rs <- dbSendQuery(con, statement = paste(
                         "SELECT * ",
                         "FROM chr3_mrna"
                         ))
ucscSQLdataMRNA <- fetch(rs, n = -1)
\end{verbatim} 

Now we transfrom matrix of data (n row: one by gene) to intervals 
(a list of n elements, one by gene).

We will transform something like

\begin{verbatim}
ucscSQLdataMRNA[1:3,]
\end{verbatim}
to a list adapted to our computations:

\begin{verbatim}
Genes3plus <- apply(ucscSQLdataKnownGene,1,function(u){
     matrix( c(as.integer(strsplit(as.character(u[9]),",")[[1]]),
       as.integer(strsplit(as.character(u[10]),",")[[1]])),
       ncol=2) 
     })
Genes3plus <- as.globalSeg(Genes3plus)

getsOneExon <- function(u)
{
  start<- as.integer(strsplit(as.character(u[22]),",")[[1]])
  stop <- as.integer(strsplit(as.character(u[20]),",")[[1]]) + start 
  if(length(start) >0)
    return(matrix(c(start,stop),ncol=2))
  else
    return(NULL)
}

AllMrna3 <- apply(ucscSQLdataMRNA,1,getsOneExon)
AllMrna3 <- as.golbalSeg(AllMrna3)
rm(ucscSQLdataKnownGene,ucscSQLdataMRNA,rs)

AllMrna3[1:4]
\end{verbatim}

Here we gets {\tt Genes3plus}: an object with all known human genes
on strand forward and on chromosome 3, and {\tt AllMrna3}:
an object with all possible human genes on chromosome 3.

No we easily deduce all known introns (on strand forward), 
and all possible genic regions:

\begin{verbatim}
Genes3ranges <- range(Genes3plus)
possibleExons <- or(AllMrna3)
rangeDNA <- matrix(range.globalSeg(possibleExons,global=TRUE),ncol=2)

notExon <-  xorRecouvr(as.matrix(possibleExons),rangeDNA)
knownIntrons <- xorRecouvr(as.matrix(Genes3plus),as.matrix(Genes3ranges))

## Safe Introns
safeIntrons <- and(knownIntrons,notExon)


rangeMRNA <- range.intervals(possibleExons)
safeIntergenic <-  xorRecouvr(as.matrix(possibleExons),as.matrix(rangeMRNA))
\end{verbatim}


Now compute\dots
\begin{verbatim}
readFasta("chr3.fa")
safeIntrons <- as.matrix(safeIntrons)
safeIntergenic <- as.matrix (safeIntergenic)

composIntrons <-compoSeq(from=safeIntrons[,1],to=safeIntrons[,2])
composIntergenic <-compoSeq(from=safeIntergenic[,1],to=safeIntergenic[,2])

composIntrons
composIntergenic
\end{verbatim}


\section*{Aknowledgments}

A. Viary

\end{document}