The package MORSE is devoted to the analysis of data from standard toxicity bioassays. It provides a simple workflow to explore and visualize a dataset, and to compute estimates of risk assessment indicators. This document illustrates a typical use of MORSE on survival and reproduction data, and can be followed step by step to analyze new datasets.
The following example shows all the stages needed to perform a survival analysis on standard bioassay data and to produce estimates of the \(LC_x\). We will use the dataset cadmium2 from the package, which contains both survival and reproduction data from a chronic laboratory bioassay. In this experiment, snails were exposed to six concentrations of a metal contaminant (cadmium) for 56 days.
The data from a survival bioassay should be gathered in a data.frame with a specific layout. This layout is documented in the paragraph on survData in the reference manual, and you can also inspect one of the datasets provided in the package (such as cadmium2). First, we load the dataset and use the function survDataCheck to verify that it has the expected layout:
data(cadmium2)
survDataCheck(cadmium2)
## No error detected.
The class survData represents survival data and is the basic representation used for all subsequent operations. Note that if the call to survDataCheck reports no error, the call to survData is guaranteed not to fail.
dat <- survData(cadmium2)
head(dat)
## replicate conc time Nsurv Nrepro Ninit Nindtime
## 1 A 0 0 5 0 5 0
## 2 A 0 3 5 262 5 15
## 3 A 0 7 5 343 5 35
## 4 A 0 10 5 459 5 50
## 5 A 0 14 5 328 5 70
## 6 A 0 17 5 742 5 85
The function plot can be used to represent the number of surviving individuals as a function of time, for all concentrations and replicates.
plot(dat, style = "ggplot", pool.replicate = FALSE)
Two graphical styles are available: "generic" for standard R plots, and "ggplot", which calls the package ggplot2. If the argument pool.replicate is TRUE, the data points for a given time and concentration are pooled and only the mean number of survivors is plotted. To observe the full dataset, we set this option to FALSE.
By fixing the concentration to a (tested) value, we can visualize one subplot in particular:
plot(dat, concentration = 124, addlegend = TRUE,
pool.replicate = FALSE, style = "ggplot")
We can also plot the number of surviving individuals at a given time point as a function of the concentration, by fixing target.time instead of concentration:
plot(dat, target.time = 21, pool.replicate = FALSE, style = "ggplot",
addlegend = TRUE)
Finally, when fixing both concentration and target.time, we can compare one condition to the control:
plot(dat, concentration = 232, target.time = 21)
The function summary provides some descriptive statistics on the experimental design.
summary(dat)
##
## Number of replicate per time and concentration:
## time
## conc 0 3 7 10 14 17 21 24 28 31 35 38 42 45 49 52 56
## 0 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 53 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 78 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 124 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 232 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 284 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
##
## Number of survival (sum of replicate) per time and concentration:
## 0 3 7 10 14 17 21 24 28 31 35 38 42 45 49 52 56
## 0 30 30 30 30 29 29 29 29 29 28 28 28 28 28 28 28 28
## 53 30 30 29 29 29 29 29 29 29 29 28 28 28 28 28 28 28
## 78 30 30 30 30 30 30 29 29 29 29 29 29 29 29 29 27 27
## 124 30 30 30 30 30 29 28 28 27 26 25 23 21 18 11 11 9
## 232 30 30 30 22 18 18 17 14 13 12 8 4 3 1 0 0 0
## 284 30 30 15 7 4 4 4 2 2 1 1 1 1 1 1 0 0
Now we are ready to fit a probabilistic model to the survival data, which aims at finding the relation between the concentration of pollutant and the mean survival rate at the target time. Our model assumes the latter is a log-logistic function of the former, and the work here is to estimate the parameters of this log-logistic function. Once the parameters are estimated, we can calculate the \(LC_x\) for any \(x\). All this work is performed by the survFitTT function, which requires a survData object and the levels of \(LC_x\) we need:
fit <- survFitTT(dat,
target.time = 21,
lcx = c(10, 20, 30, 40, 50))
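For reference, the underlying log-logistic model can be sketched as follows, using the parameter names that appear in the summary output (this is only a sketch; see the Modeling vignette for the full model definition). The mean survival rate at the target time for a concentration \(c\) is

\[ f(c) = \frac{d}{1 + (c/e)^b} \]

where \(d\) is the survival rate in the control, \(e\) is the concentration at which the survival rate is halved compared to the control, and \(b\) controls the steepness of the curve. The \(LC_x\) values then follow from the fitted parameters:

\[ LC_x = e \left( \frac{x}{100 - x} \right)^{1/b} \]

so that \(LC_{50} = e\), which is why the quantiles reported for the \(LC_{50}\) and for \(e\) coincide in the summary.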
The returned value is an object of class survFitTT and provides the estimated parameters as a posterior1 distribution, which quantifies our uncertainty about their true values. For the parameters of the model, as well as for the \(LC_x\) values, we report the median (as the estimated value) and the 2.5% and 97.5% quantiles of the posterior (as a measure of uncertainty, a.k.a. the credible interval). These can be seen using the summary method:
summary(fit)
## Summary:
##
## The loglogisticbinom_3 model with a binomial stochastic part was used !
##
## Priors on parameters (quantiles):
##
## 50% 2.5% 97.5%
## b 1.000 0.013 79.433
## d 0.500 0.025 0.975
## e 122.687 53.898 279.268
##
## Posterior of the parameters (quantiles):
##
## 50% 2.5% 97.5%
## b 8.540 3.996 15.993
## d 0.957 0.909 0.986
## e 236.257 210.113 253.734
##
## Posterior of the LCx (quantiles):
##
## 50% 2.5% 97.5%
## LC10 182.901 126.753 213.787
## LC20 200.962 154.160 225.938
## LC30 213.995 174.446 235.127
## LC40 225.375 192.972 243.782
## LC50 236.257 210.113 253.734
If the inference went well, we expect the quantiles of the posterior to be closer together than those of the prior, meaning that the data were informative and reduced the uncertainty on the true values of the parameters. This simple check can be performed using the summary function.
The fit can also be plotted:
plot(fit, log.scale = TRUE, ci = TRUE, style = "ggplot",
addlegend = TRUE)
This representation shows the estimated relation between the concentration of pollutant and the survival rate (red curve). It is computed by choosing, for each parameter, the median value of its posterior. To assess the uncertainty of this estimation, we compute many such curves, this time sampling the parameters from the posterior distribution. This gives rise to the pink band, showing for any given concentration an interval (called a credible interval) containing the survival rate 95% of the time in the posterior distribution. The experimental data points are represented in black and correspond to the observed survival rate when pooling all replicates. The black error bars correspond to a 95% confidence interval, which is another, more straightforward way to bound the most probable value of the survival rate for a tested concentration. In favorable situations, we expect that the credible interval around the estimated curve and the confidence interval around the experimental points largely overlap.
Note that survFitTT will warn you if the estimated \(LC_{50}\) lies outside the range of tested concentrations, as in the following example:
data("cadmium1")
wrong_fit <- survFitTT(survData(cadmium1),
target.time = 21,
lcx = c(10, 20, 30, 40, 50))
## Warning in survFitTT(survData(cadmium1), target.time = 21, lcx = c(10,
## 20, : The LC50 estimation lies outsides the range of tested concentration
## and may be unreliable !
plot(wrong_fit, log.scale = TRUE, ci = TRUE, style = "ggplot",
addlegend = TRUE)
In this example, the experimental design did not include sufficiently high concentrations, and we are missing measurements that would have a major influence on the final estimation. For this reason this result should be considered unreliable.
The estimation can be further validated using so-called posterior predictive checks: the idea is to plot the observed values against the corresponding estimated predictions, along with their 95% credible interval. If the fit is correct, we expect to see 95% of the data inside the intervals.
ppc(fit, style = "ggplot")
In this plot, each black dot represents an observation made at some concentration, and the number of survivors at the target time is given by its position on the X-axis. Using the concentration and the fitted model, we can produce a corresponding prediction of the expected number of survivors; this prediction is given by the Y-axis. Ideally, observations and predictions should coincide, so we expect the black dots to lie along the line \(Y = X\). Our model provides a tolerable variation around the predicted mean value, as an interval where we expect 95% of the dots to lie on average. The intervals are represented in green if they overlap with the line \(Y = X\), and in red otherwise.
The steps in reproduction data analysis are entirely analogous to what we described for survival data. This time the goal is to estimate the relation between pollutant concentration and the reproduction rate per individual-day.
Here is a typical session:
# (1) load dataset
data(cadmium2)
# (2) check structure and integrity of the dataset
reproDataCheck(cadmium2)
## No error detected.
# (3) create a `reproData` object
dat <- reproData(cadmium2)
# (4) represent the cumulated number of offspring as a function of time
plot(dat, style = "ggplot", pool.replicate = FALSE)
plot(dat, target.time = 21, addlegend = TRUE, style = "ggplot",
pool.replicate = FALSE)
plot(dat, concentration = 124, addlegend = TRUE, style = "ggplot",
pool.replicate = FALSE)
plot(dat, concentration = 124, target.time = 21, style = "ggplot")
# (5) check information on the experimental design
summary(dat)
##
## Number of replicate per time and concentration:
## time
## conc 0 3 7 10 14 17 21 24 28 31 35 38 42 45 49 52 56
## 0 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 53 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 78 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 124 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 232 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 284 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
##
## Number of survival (sum of replicate) per time and concentration:
## 0 3 7 10 14 17 21 24 28 31 35 38 42 45 49 52 56
## 0 30 30 30 30 29 29 29 29 29 28 28 28 28 28 28 28 28
## 53 30 30 29 29 29 29 29 29 29 29 28 28 28 28 28 28 28
## 78 30 30 30 30 30 30 29 29 29 29 29 29 29 29 29 27 27
## 124 30 30 30 30 30 29 28 28 27 26 25 23 21 18 11 11 9
## 232 30 30 30 22 18 18 17 14 13 12 8 4 3 1 0 0 0
## 284 30 30 15 7 4 4 4 2 2 1 1 1 1 1 1 0 0
##
## Number of offspring (sum of replicate) per time and concentration:
## 0 3 7 10 14 17 21 24 28 31 35 38 42 45
## 0 0 1659 1587 2082 1580 2400 2069 2316 1822 2860 2154 3200 1603 2490
## 53 0 1221 1567 1710 1773 1859 1602 1995 1800 2101 1494 2126 935 1629
## 78 0 1066 2023 1752 1629 1715 1719 1278 1717 1451 1826 1610 1097 1727
## 124 0 807 1917 1423 567 383 568 493 605 631 573 585 546 280
## 232 0 270 1153 252 30 0 37 28 46 119 19 9 0 0
## 284 0 146 275 18 1 0 0 0 0 0 0 0 0 0
## 49 52 56
## 0 1609 2149 2881
## 53 2108 1686 1628
## 78 2309 1954 1760
## 124 594 328 380
## 232 0 0 0
## 284 0 0 0
# (6) fit an exposure-response model at target-time
fit <- reproFitTT(dat, stoc.part = "bestfit",
target.time = 21,
ecx = c(10, 20, 30, 40, 50),
quiet = TRUE)
summary(fit)
## Summary:
##
## The loglogistic model with a Gamma Poisson stochastic part was used !
##
## Priors on parameters (quantiles):
##
## 50% 2.5% 97.5%
## b 1.000 0.013 79.433
## d 18.303 15.538 21.067
## e 148.835 79.015 280.350
## omega 1.000 0.000 6309.573
##
## Posterior of the parameters (quantiles):
##
## 50% 2.5% 97.5%
## b 3.853 2.843 5.638
## d 17.765 15.469 20.186
## e 136.896 114.477 171.119
## omega 1.437 0.824 2.808
##
## Posterior of the ECx (quantiles):
##
## 50% 2.5% 97.5%
## EC10 77.246 54.364 113.598
## EC20 95.343 72.004 131.595
## EC30 109.780 86.507 145.796
## EC40 123.117 100.293 158.063
## EC50 136.896 114.477 171.119
plot(fit, log.scale = TRUE, ci = TRUE,
style = "ggplot")
ppc(fit, style = "ggplot")
As in survival analysis, we assume that the reproduction rate per individual-day is a log-logistic function of the concentration. More details and parameter signification can be found in the Modeling vignette.
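Concretely, with the same notation as before (and again as a sketch only; the Modeling vignette gives the exact definition), the mean reproduction rate per individual-day at concentration \(c\) is

\[ f(c) = \frac{d}{1 + (c/e)^b} \]

where \(d\) is now the reproduction rate in the control and \(e\) is the \(EC_{50}\). The \(EC_x\) are obtained as \(EC_x = e\,(x/(100-x))^{1/b}\), consistent with the summary above, where the quantiles of the \(EC_{50}\) and of \(e\) coincide.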
For reproduction analysis, we consider one model which neglects inter-individual variability (named "Poisson") and another which takes it into account (named "gamma Poisson"). You can choose either one using the option stoc.part, but by setting it to "bestfit", you let reproFitTT decide which model fits the data best. The choice can be seen by calling the summary function:
summary(fit)
## Summary:
##
## The loglogistic model with a Gamma Poisson stochastic part was used !
##
## Priors on parameters (quantiles):
##
## 50% 2.5% 97.5%
## b 1.000 0.013 79.433
## d 18.303 15.538 21.067
## e 148.835 79.015 280.350
## omega 1.000 0.000 6309.573
##
## Posterior of the parameters (quantiles):
##
## 50% 2.5% 97.5%
## b 3.853 2.843 5.638
## d 17.765 15.469 20.186
## e 136.896 114.477 171.119
## omega 1.437 0.824 2.808
##
## Posterior of the ECx (quantiles):
##
## 50% 2.5% 97.5%
## EC10 77.246 54.364 113.598
## EC20 95.343 72.004 131.595
## EC30 109.780 86.507 145.796
## EC40 123.117 100.293 158.063
## EC50 136.896 114.477 171.119
When the gamma-Poisson model is selected, the summary shows an additional parameter called omega, which quantifies the inter-individual variability (the higher omega, the higher the variability).
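One common parameterization of such a gamma-Poisson (negative binomial) model, given here as an illustration only (the exact parameterization used by MORSE is described in the Modeling vignette), is

\[ \lambda_i \sim \mathrm{Gamma}\!\left(\frac{f(c_i)}{\omega},\, \frac{1}{\omega}\right), \qquad N_i \sim \mathrm{Poisson}(\lambda_i) \]

where \(f(c_i)\) is the mean reproduction rate at concentration \(c_i\). With this shape/rate parameterization, \(\mathbb{E}[\lambda_i] = f(c_i)\) and \(\mathrm{Var}[\lambda_i] = \omega f(c_i)\): the model reduces to the plain Poisson model as \(\omega \to 0\), and larger values of \(\omega\) correspond to more inter-individual variability.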
In MORSE, reproduction datasets are a special case of survival datasets: a reproduction dataset includes the same information as a survival dataset, plus the information on reproduction outputs. For this reason the S3 class reproData inherits from the class survData, which means that any operation legal on a survData object is also legal on a reproData object. In particular, to use the plot functions related to survival analysis on a reproData object, we can use the as.survData conversion function:
dat <- reproData(cadmium2)
plot(as.survData(dat))
In Bayesian inference, the parameters of a model are estimated from the data starting from a so-called prior, which is a probability distribution representing an initial guess on the true parameters, before seeing the data. The posterior distribution represents the uncertainty on the parameters after seeing the data and combining them with the prior. To obtain a point estimate of the parameters, it is typical to compute the mean or median of the posterior, and to quantify the uncertainty by reporting the standard deviation or an interquantile distance.↩
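In symbols, writing \(\theta\) for the parameters and \(D\) for the data, Bayes' theorem combines the prior \(p(\theta)\) with the likelihood \(p(D \mid \theta)\) to give the posterior:

\[ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \propto p(D \mid \theta)\, p(\theta). \]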