Approximate Bayesian Computation for Biological Science

Approximate Bayesian computation (ABC) provides a rigorous tool for performing parameter inference for models without an easily accessible likelihood function. Here we give a short introduction to ABC, focusing on applications in biological science. Furthermore, we introduce users to a Python suite implementing ABC algorithms that makes efficient use of high-performance computing facilities.


Introduction
With recent innovations in biological science, we are increasingly faced with large datasets of varied types and with more realistic but complex models of natural phenomena. This trend has led to a scenario where a likelihood function available in closed form, and thus easy to evaluate at any given point (as required by most Monte Carlo and Markov chain Monte Carlo methods), is often out of reach. Traditional likelihood-based inference, such as maximum likelihood or Bayesian methodology, is therefore not possible. Still, if the complex model allows us to forward simulate pseudo-datasets given values of the parameters that index it, a different methodology becomes available, namely Approximate Bayesian computation (ABC). Models with this forward-simulation capability are known as simulator-based models and are becoming increasingly popular in diverse fields of science [1][2][3]. Restricting attention to the biological domain, many examples can be found: the evolution of genomes (Marttinen et al. [4]), numerical models of platelet deposition [5], and the demographic spread of a species, among many others [6]. Research in statistical science over the last decade or so has illustrated how ABC can be used to infer and calibrate the parameters of these models.
The fundamental rejection ABC sampling scheme iterates between three steps: first, a parameter value is drawn from the prior distribution; second, a pseudo-dataset x_sim is forward simulated from the model at that parameter value; third, the parameter value is accepted if the discrepancy between x_sim and the observed dataset x_0 falls below a threshold ε. For a better approximation of the likelihood function, computationally efficient sequential ABC algorithms (Marin et al. [7], Lenormand et al. [8], Albert et al. [9]) decrease the value of the threshold ε adaptively while exploring the parameter space. The crucial aspect for a good ABC approximation to the likelihood function is the choice of the summary statistics, as the discrepancy measure between x_sim and x_0 is defined through a distance between the summary statistics extracted from x_sim and x_0.

Summary statistics driven by domain knowledge are normally chosen with the aim of minimizing the loss of information on the parameter φ contained in the data. Alternatively, one can rely on automatic summary selection for ABC, thus removing a subjective component from this choice, as described in Fearnhead & Prangle [10], Pudlo et al. [11], Jiang et al. [12] and Gutmann et al. [13]. ABC provides a tool for statistical inference for simulator-based models; still, the need to simulate large numbers of pseudo-datasets makes the algorithms extremely computationally expensive when data simulation itself is costly. Further, the varied types of datasets available in different domain-specific problems have hindered the applicability of ABC algorithms in many applied science domains. Recently, Dutta et al. [14,15] have developed a high-performance computing framework to efficiently parallelize different ABC algorithms, which we believe will be extremely beneficial for inferential problems across different scientific domains. To highlight the versatility of ABC and ABCpy in diverse applied problems, we point the interested reader to two recent research papers of ours with applications to biology:

a. Estimation of parameters of a numerical platelet deposition model, where each forward simulation takes 10 minutes [16], and
b. Estimation of parameters of spreading processes on a network (e.g., epidemics on a contact network, but also fake news on a social network, where the datasets are series of networks) [17].
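The three-step rejection scheme described above can be sketched in a few lines of Python. The following is a minimal illustration, not the ABCpy implementation: the Gaussian simulator, the sample mean as summary statistic, and the uniform prior are all our own toy assumptions chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulator-based model (an illustrative assumption, not from the text):
# data are n i.i.d. draws from Normal(phi, 1).
def simulate(phi, n=100):
    return rng.normal(phi, 1.0, size=n)

def summary(x):
    # Sample mean as a (hand-picked) summary statistic.
    return np.mean(x)

def distance(s_sim, s_obs):
    return abs(s_sim - s_obs)

def rejection_abc(x_obs, prior_sample, eps, n_samples=200):
    """Basic rejection ABC: keep prior draws whose simulated summaries
    fall within eps of the observed summary."""
    s_obs = summary(x_obs)
    accepted = []
    while len(accepted) < n_samples:
        phi = prior_sample()                       # step 1: draw from the prior
        x_sim = simulate(phi)                      # step 2: forward simulate
        if distance(summary(x_sim), s_obs) < eps:  # step 3: accept/reject
            accepted.append(phi)
    return np.array(accepted)

x_obs = simulate(2.0)  # pretend observed data generated with phi = 2
post = rejection_abc(x_obs, lambda: rng.uniform(-5, 5), eps=0.1)
print(post.mean())     # posterior mean close to the true value 2
```

The accepted draws approximate the posterior of φ; shrinking eps tightens the approximation at the price of more rejected simulations, which is exactly the trade-off the sequential ABC algorithms cited above manage adaptively.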

Conclusion
We would like to stress that the ABC inference scheme provides not only a point estimate of the parameters of interest but also their entire (approximate) posterior distribution, thus allowing for uncertainty quantification: the higher the variability of the posterior distribution, the higher the uncertainty inherent in the inferential scheme. From the ABC-approximated posterior one can then construct credible intervals and perform hypothesis testing. Furthermore, ABC allows one to compare alternative models by simply adding, to the three-step ABC scheme illustrated above, an initial layer in which a model index is first sampled from the model prior distribution; then, once a model has been selected, a regular ABC scheme is performed within that model. For details on ABC model selection via a random forest approach, see Pudlo et al. [11].
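The extra model layer can be sketched as follows. This is again a hedged toy illustration under our own assumptions (two candidate location models, a uniform model prior, and the sample mean as summary), not the random forest approach of [11]:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two candidate toy models for the same data (illustrative assumptions):
# model 0 draws from Normal(phi, 1), model 1 from Laplace(phi, 1).
simulators = [
    lambda phi, n=100: rng.normal(phi, 1.0, size=n),
    lambda phi, n=100: rng.laplace(phi, 1.0, size=n),
]

def summary(x):
    return np.mean(x)

def abc_model_selection(x_obs, eps, n_samples=500):
    """ABC with an initial model layer: sample a model index from a
    uniform model prior, then run one rejection-ABC step within it."""
    s_obs = summary(x_obs)
    kept_models, kept_params = [], []
    while len(kept_models) < n_samples:
        m = int(rng.integers(len(simulators)))  # sample model index from model prior
        phi = rng.uniform(-5, 5)                # sample parameter from its prior
        x_sim = simulators[m](phi)              # forward simulate under model m
        if abs(summary(x_sim) - s_obs) < eps:
            kept_models.append(m)
            kept_params.append(phi)
    return np.array(kept_models), np.array(kept_params)

x_obs = rng.normal(1.0, 1.0, size=100)  # observed data, here from model 0
models, params = abc_model_selection(x_obs, eps=0.1)
# Posterior model probabilities are estimated by the acceptance frequencies:
print(np.bincount(models, minlength=2) / len(models))
```

Note that with a weakly discriminating summary statistic (here the sample mean, which both toy models share) the estimated model probabilities stay close to the model prior; choosing summaries that separate the candidate models is as crucial for model selection as it is for parameter inference.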