\name{compareGOProfiles}
\alias{compareGOProfiles}
\title{Comparison of lists of genes through their functional profiles}
\description{
Compare two samples of genes in terms of their GO profiles \code{pn} and \code{qm}. Both
samples may share a common subsample of genes, with GO profile \code{pqn0}.
\code{compareGOProfiles} implements some inferential procedures based on
asymptotic properties of the squared Euclidean distance between
the contracted versions of \code{pn} and \code{qm}.
}
\usage{                                
compareGOProfiles(pn, qm = NULL, pqn0 = NULL, n = ngenes(pn), m = ngenes(qm),
    n0 = ngenes(pqn0), method = "lcombChisq", ab.approx = "asymptotic",
    confidence = 0.95, nsims = 10000, simplify = T, ...)
}
\arguments{
\item{pn}{an object of class ExpandedGOProfile representing one or more
"sample" expanded GO profiles for a fixed ontology (see the 'Details' section)}
\item{qm}{an object of class ExpandedGOProfile representing one or more
"sample" expanded GO profiles for a fixed ontology (see the 'Details' section)}
\item{pqn0}{an object of class ExpandedGOProfile representing one or more
"sample" expanded GO profiles, for a fixed ontology, of the genes common to pn and qm (see the 'Details' section)}
\item{n}{a numeric vector with the number of genes profiled in each column of pn.
It allows exploring the consequences of sample sizes other than the true sample size stored in pn.}
\item{m}{a numeric vector with the number of genes profiled in each column of qm.}
\item{n0}{a numeric vector with the number of genes profiled in each column of pqn0.}
\item{method}{the method used to approximate the sampling distribution under the null hypothesis that the samples pn and qm come from the same population, either "lcombChisq" or "chi-square". See the 'Details' section.}
\item{confidence}{the confidence level of the confidence interval in the result}
\item{ab.approx}{the approximation used for computing the 'a' and 'b' coefficients (see the 'Details' section)}
\item{nsims}{some inferential methods require a simulation step; the number of simulation replicates is specified with this parameter}
\item{simplify}{should the result be simplified, if possible? See the 'Value' section}
\item{...}{Other arguments needed}
}

\details{
 An object of S3 class 'ExpandedGOProfile' is, essentially, a 'data.frame' object
 with each column representing the relative frequencies in all observed node
 combinations, resulting from profiling a set of genes, for a given and fixed
 ontology. The row.names attribute codifies the node combinations and each
 data.frame column (say, each profile) has an attribute, 'ngenes', indicating the
 number of profiled genes.
 The arguments 'pn', 'qm' and 'pqn0' are compared column by column,
 recycling columns, if necessary, in order to perform max(ncol(pn),ncol(qm),ncol(pqn0))
 comparisons (each comparison resulting in an object of class 'GOProfileHtest',
 a specialization of 'htest').
 To be properly compared, these arguments are first aligned by row, according
 to their row names: the data arguments may have different numbers of rows, so
 rows with zero frequencies are added where needed to make them
 comparable.

 In the i-th comparison (i from 1 to max(ncol(pn),ncol(qm),ncol(pqn0))),
 the parameters n, m and n0 allow exploring the consequences of sample
 sizes other than the true sample sizes stored as an attribute in pn, qm and pqn0.

 When qm = NULL, the genes profiled in pn are compared with a subsample of them, those profiled in pqn0 (for example,
 to compare a set of genes with a restricted subset, such as those overexpressed under a disease). In this case we take qm=pqn0.
 When pqn0 = NULL, two profiles with no genes in common are compared.

 Let Pn and Qm denote the contracted functional profiles (the total counts or relative frequencies
 of hits in each one of the s GO categories being compared) obtained from pn and qm.
 Let P stand for the "population" profile originating the sample profile Pn[,j], Q for the profile originating Qm[,j],
 and d(,) for the squared Euclidean distance. If P != Q, the distribution of sqrt(nm/(n+m))(d(Pn[,j],Qm[,j]) - d(P,Q))/se(d) is
 approximately standard normal, N(0,1). This provides the basis for the confidence interval in the result field
 'conf.int'.
 When P = Q, the asymptotic distribution of (nm/(n+m)) d(Pn[,j],Qm[,j]) is that of a linear combination of independent
 chi-square random variables, each one with one degree of freedom. The sampling distribution under H0: P = Q may be
 computed directly from this distribution, approximating it by simulation (method="lcombChisq"), or from a chi-square
 approximation to it, based on two correcting constants a and b (method="chi-square").
 These constants are chosen to equate the first two moments of both distributions (the distribution of the linear
 combination of chi-square random variables and the approximating chi-square distribution).
 When method="chi-square", the returned test statistic value is the chi-square approximation (n d(pn[,j],qm[,j]) - b) / a, and
 the result field 'parameter' is a vector containing the 'a' and 'b' values and the number of degrees of freedom, 'df'.
 Otherwise, the returned test statistic value is (nm/(n+m)) d(Pn[,j],Qm[,j]) and 'parameter' contains the coefficients of the linear
 combination of chi-squares.
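
 As a compact restatement of the asymptotic results described in this section, and only as a sketch: the symbol
 lambda_i, standing for the i-th coefficient of the linear combination of chi-squares, is notation assumed here
 for illustration and does not name anything in the package. Approximately, for large n and m,
 \deqn{\sqrt{\frac{nm}{n+m}}\,\frac{d(P_n,Q_m)-d(P,Q)}{se(d)} \sim N(0,1) \quad \mbox{if } P \neq Q,}{sqrt(nm/(n+m)) (d(Pn,Qm) - d(P,Q)) / se(d) ~ N(0,1)  if P != Q,}
 \deqn{\frac{nm}{n+m}\,d(P_n,Q_m) \sim \sum_i \lambda_i\,\chi^2_{1,i} \quad \mbox{if } P = Q.}{(nm/(n+m)) d(Pn,Qm) ~ sum_i lambda_i chisq_(1,i)  if P = Q.}
 Since independent chi-square variables with one degree of freedom have mean 1 and variance 2, the first two
 moments of the linear combination, the ones matched when choosing a and b, are
 \deqn{E = \sum_i \lambda_i, \qquad Var = 2 \sum_i \lambda_i^2.}{E = sum_i lambda_i,  Var = 2 sum_i lambda_i^2.}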
}
\value{
A list containing max(ncol(pn),ncol(qm),ncol(pqn0)) objects of class 'GOProfileHtest', directly inheriting from 'htest'
or a single 'GOProfileHtest' object if max(ncol(pn),ncol(qm),ncol(pqn0))==1 and simplify == T.
Each object of class 'GOProfileHtest' has the following fields:
\item{profilePn}{the first contracted profile to compute the squared Euclidean distance}
\item{profileQm}{the second contracted profile to compute the squared Euclidean distance}
\item{statistic}{test statistic; its meaning depends on the value of "method", see the 'Details' section.}
\item{parameter}{parameters of the sampling distribution of the test statistic, see the 'Details' section.}
\item{p.value}{p-value associated with the test of the null hypothesis of equality of profiles.}
\item{conf.int}{asymptotic confidence interval for the squared Euclidean distance. Its attribute "conf.level" contains its nominal confidence level.}
\item{estimate}{squared Euclidean distance between the contracted profiles. Its attribute "se"
contains its standard error estimate.}
\item{method}{a character string indicating the method used to perform the test.}
\item{data.name}{a character string giving the names of the data.}
\item{alternative}{a character string describing the alternative hypothesis (always 'true squared Euclidean distance between the contracted profiles is greater than zero')}
}
\seealso{\code{\link{fitGOProfile}}, \code{\link{equivalentGOProfiles}}}
\references{Sanchez-Pla, A., Salicru, M. and Ocana, J. Statistical methods for the analysis of high-throughput data based on functional profiles derived from the gene ontology.
Journal of Statistical Planning and Inference, 2007.}
\author{Jordi Ocana}

\examples{
# Load the example gene identifier lists used below
data(prostateIds)
# Expanded MF profiles (level 2) for the first 100 genes of each list
expandedWelsh <- expandedProfile(welsh01EntrezIDs[1:100], onto="MF",
                        level=2, orgPackage="org.Hs.eg.db")
expandedSingh <- expandedProfile(singh01EntrezIDs[1:100], onto="MF",
                        level=2, orgPackage="org.Hs.eg.db")
# Expanded profile of the genes shared by both lists
commonGenes <- intersect(welsh01EntrezIDs[1:100], singh01EntrezIDs[1:100])
commonExpanded <- expandedProfile(commonGenes, onto="MF", level=2,
                        orgPackage="org.Hs.eg.db")
# Compare both profiles, taking their common genes into account
comparedMF <- compareGOProfiles(pn   = expandedWelsh,
                                qm   = expandedSingh,
                                pqn0 = commonExpanded)
print(comparedMF)
# print(compSummary(comparedMF))
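
# A hedged sketch of how the result might be inspected, using only the fields
# listed in the 'Value' section. The names 'res', 'comparedChisq' and 'resChisq'
# are introduced here for illustration. With one profile per argument and
# simplify = T a single 'GOProfileHtest' object is expected; otherwise the
# first element of the returned list is taken.
res <- if (inherits(comparedMF, "GOProfileHtest")) comparedMF else comparedMF[[1]]
res$estimate              # squared Euclidean distance between the contracted profiles
attr(res$estimate, "se")  # its standard error estimate
res$conf.int              # asymptotic confidence interval for the distance
res$p.value               # p-value for the null hypothesis of equal profiles

# The same comparison with the chi-square approximation described in 'Details';
# according to that section, 'parameter' should then hold 'a', 'b' and 'df'.
comparedChisq <- compareGOProfiles(pn = expandedWelsh, qm = expandedSingh,
                                   pqn0 = commonExpanded, method = "chi-square")
resChisq <- if (inherits(comparedChisq, "GOProfileHtest")) comparedChisq else comparedChisq[[1]]
resChisq$statistic
resChisq$parameter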

}
\keyword{htest}