REVERSE ENGINEERING OF GENE REGULATORY NETWORKS. METHODS AND CHALLENGES.
The aim of this project is to use the computational resources of national grid infrastructure in order to develop advanced algorithm of reverse engineering of gene regulatory networks, which will be applied to the study of rat hepatocytes response to different perturbations. The modern molecular biology has raised to the qualitatively new level. Simultaneous and independent development of high-throughput technologies, informatics and computer technologies led to the emergence of Systems Biology, aimed to represent biological processes in their maximal complexity by analyzing immense amount of data. One of the Systems Biology approaches is a modeling of gene regulatory networks, which rapidly spreads in medicine and biology. The modeling presupposes the integration of high-throughput approaches in biology and informatics areas, namely large-scale research of genes expression subjected to repeated systems perturbations and gene network modeling with the help of state-of-the-art methods of information technologies.
At present there are several approaches to reverse engineering of gene regulatory networks: machine learning, bayesian networks, boolean networks, differential equations, information theory, neural networks, genetic algorithms, Petri networks [1 – 3]. The search space of optimal regulatory network is exponential: for 9 genes we have 1.21*1015 possible variations, 20 genes - 2.34*1072, 30 genes – 2.71*10158 possible networks in the case of bayesian networks usage . Though we have much less search space when using information theory, it's still may take weeks of serial computations when dealing with large networks . That is why almost every attempt of network reconstruction is limited by not very precise heuristics, which on the one hand decrease search space (annealing, greedy, genetic algorithms), but on the other hand lead to low precision of reconstructed network . This problem becomes more important when dealing with real biological data (instead of training models), which amount is usually limited because of high price of their mining. Therefore we intend to minimize the usage of algorithmic heuristics in this project. The efforts will be concentrated on two mutually complementary solutions of a problem connected with the exponential increase of search space: 1) development of advanced algorithm of reverse engineering of gene regulatory networks with the usage of biological heuristics; 2) implementation of developed algorithm in programming language, which supports high computing parallelization on GRID.
To reconstruct the gene regulatory network we use two approaches. One of them is the development of advanced algorithm on the basis of real data and the generation of the set of synthetic networks containing numerous genes with the help of Grid rechnology. Another one is a reconstruction of gene regulatory network on the basis of our own data on liver transcriptome under the influence of interferon-alpha and the other data from GEO (Gene Expression Omnibus) database obtained under the influence of various perturbations.
Different methods of gene networks representation
The same network can be represented differently, basically by directed or undirected graph. Directed graph G is pair , where V denotes a set of vertices and E — a set of edges. Vertices correspond to the genes (or other system components) and the edges, which are denoted as pair of vertices , correspond to regulatory interactions between the genes. A graph is directed if i and j can be assigned to the head and the tail of the edge. The definition of the vertices and edges can be expanded – they may contain additional information about genes and their interactions. For example, the edge can be expressed as . The entry properties may indicate whether j inhibits (-) or activates (+) i .
In the Boolean network the expression level of each gene is assigned to a binary variable, which possesses a value 1 or 0, i.e. the gene is considered to be transcribed or not. In synchronous Boolean networks the states of the genes are updated simultaneously in discrete time steps. The new state can depend on the previous state of the same gene or other genes. The N nodes of Boolean network correspond to N genes of the regulatory network, k inputs denotes maximal amount of inputs for each node and k interactions, which regulate expression of a certain gene. k inputs to certain node eventually denotes the binary level of expression of corresponding gene. Since every node can be only in two states, a network with N genes can assume different states. An N-dimensional vector of variables can describe the state at time t. The value of each variable at time t+1 depends on input data, and can be computed by means of the Boolean rules. For a node with k inputs, the number of possible boolean rules is . The sequence of states given by the Boolean transitions represents the trajectory of the system. Since the number of states is finite, the number of possible transitions is also finite. Therefore each trajectory will lead either to steady or cyclic state. These states are entitled as attractors. All states that lead to the same attractor constitute the basin of attraction.
Boolean networks have been used to explore general properties of large gene networks. Considering random networks (the number of k inputs per gene and the corresponding Boolean rules were chosen by chance), Kauffman [7, 8] has shown that the system exhibits highly ordered dynamics for small k and chooses the certain rules. The mean expected number of attractors is about , and the length of attractors is restricted to a value proportional to . Kauffman suggested that the number of possible attractors corresponds to the number of cell types with the same genome. This number correlates with the present knowledge about the cell types .
For the boolean network reconstruction from microarray time-series data we can use the algorithm, described in  and . This algorithm determines whether the set of vertices explains an expression of a certain vertex . Boolean activator-inhibitor function, which assigned to vertex , can be found by brute-force search and is described as , where the first round bracket denotes activation vertices, and the second one – inhibition vertices. Apparently, when is small, the algorithm complexity will be polinomial, but it can affect the quality of obtained network. As it was mentioned above, the number of all possible boolean functions is , that is why increasing of leads to exponential complexity.
Besides the mentioned problem we can emphasize a range of general drawbacks of boolean network reverse engineering from real data :
Binarization is a complicate process which substantially influences the result. It is not so easy to determine from gene expression data which level of binarization should be .
The states are incomplete. In practice, most of the state transitions are missing after binarization.
The availability of many time points is crucial. In order to filter correct states from false states, we have to screen many state transitions to get a stable result.
Time points should not be too close to each other. It is a tradeoff between the detection of as many state transitions as possible and avoiding false-positive transitions. If two time points are too close, the transition shows no changes as binarization is a very rough threshold that does not detect small changes of concentration. This will lead to many false-positive self-loops in the corresponding graph.
A Bayesian network represents the regulatory network as a directed acyclic graph G=. According to conventional graph definition, the vertices represent genes, and the edges - regulatory interactions. Variables belonging to the vertices denote a property relevant to the regulation, e. g. the expression level of a gene or the amount of active protein. A conditional probability distribution is defined for each , where are the parent variables belonging to direct regulators of . The directed graph and the conditional distributions together specify a joint probability distribution that determines the Bayesian network. The joint probability distribution can be decomposed into:
The directed graph expresses the dependences of probabilities: the expression level of a gene represented by a child vertex depends on the expression level of genes belonging to the parent vertices. Hence, it also implies conditional independencies , meaning that is independent of the set of variables given the set of variables . Two graphs or Bayesian networks are equivalent if they imply the same set of independencies. In this case they can be considered as the same undirected graph, but with varying direction of edges. Equivalent graphs cannot be distinguished by observation of the variables .
However, it is more reasonable to use dynamic bayesian networks (DBN). They can be considered as extended conventional bayesian networks which are able to represent the dynamics of genetic networks. Let us consider that a microarray time series variable , where for number of time slices and for number of genes, represents an observation of gene at time , then observation vector at the time can be represent as vector , and gene at all time points can be represented as vector . DBN models assume a time dependence in which directed arcs should “ﬂoat” forward in time .
Usually, one assumes that such models are the ﬁrst order Markov model, in which each gene is only under directly inﬂuence of preceeding genes . As these models depend on time , the feedback loop networks can be easily constructed.
The joint probability distribution for DNB can be decomposed as follows:
The main goal of the modeling of gene regulatory network is to construct such a model which would explain in the best way the experimental data. To get this goal, one should learn the structure and parameters of DBNs using experimental data. This task may be formulated in the following way.
Given a time series data set , one should ﬁnd a model that best of all conforms with , where is speciﬁed by the structure of DBN and the corresponding parameter from the family of conditional probabilities distribution. According to the Bayes rule, the posterior distribution over a model is
where denominator is a normalization factor which is independent of M, therefore, taking logarithm, a scoring function for a model can be calculated as follows,
where is a prior for a model. is denoted as a marginal likelihood over data given .
, where is a prior distribution for parameters. The selection of the best model can be reached by maximizing the marginal likelihood. In order to obtain a solution for the integration of marginal likelihood, normally one can choose the conjugate prior distribution, such as Dirichlet family for a discrete multinomial distribution, and Normal-Wishart family for a continual Gaussian distribution .
Even having a scoring function, the selection of an optimal DBN for the modeling of gene network is a big challenge. Firstly, the set of parent vertices for each node is , where - the number of nodes. So, the optimization problem of identifying high scoring model is known to be NP-hard . Secondly, the searching algorithm does not always choose the best model, usually a selected model is localy optimal. That is why the single model with high score is not obviously the best one.
There are several traditional approaches how to solve the described problem. One of them is greedy hill-climbing search with restarts (GSR) . At each restart, a random structure should be produced. The mutation of such structure is made by a single edge addition or removal. The algorithm will check all the possible mutations of initial structure and choose the one with the highest scoring, and set it as a new base structure. This procedure is iterated until the model score reaches a local maximum, and then the model is stored and new restart is initiated. Finally, a number of models are obtained equal to the number of restarts. This algorithm is well described by Huihai Wu та Xiaohui Liu .
Another class of heuristic algorithm is Markov Chain Monte Carlo (MCMC) method , which generates samples from some high dimensional complex distribution. The principle of MCMC method is the construction of a Markov chain in which a new model is generated only on the basis of previous model . Eventually a chain of models will be produced, that will convergence to the target distribution. The sufficient condition for such convergence is a balance equation appropriate for all the models :
, where is a transition probability from to .
One of the most important MCMC algorithms is a Metropolis-Hastings algorithm, which is based on acceptance-rejection sampling algorithm . For each run, the algorithm will sample a new candidate model from the jumping distribution, which is the probability of returning a new model given a current model . Given the candidate model , the acceptance probability can be computed as . If the probability meets the requirements the Markov chain will choose the current candidate model .
Applying MCMC to the search of optimal bayesian network is a computationally complex task. The increasing amount of nodes leads to exponential complexity . In comparison with greedy search MCMC shows better results and is more fast .
Considering two different approaches to gene network reconstruction, it becomes obvious, that with the increasing amount of nodes we are facing with exponentially complex problem.
In case of boolean networks binarization of gene expression values (active or inhibited states only) allows us to examine large network with their general features. We can additionally limit the number of regulators of certain genes and thus simplify our algorithm and gain time. However, above-mentioned simplifications affect the quality of obtained network. In case of bayesian network the described algorithms will allow to reach local, not global, maximum because of exponential complexity of the model and the usage of heuristics .
In both cases more qualitative result may be reached by distribution of computational workload among several computer clusters. We can use both data and algorithm parallelism. Furthermore, it is reasonable to use ensembl-methods, e.g. combination of different methods, which undoubtedly will provide more precise reconstruction.
Thus, we can make a conclusion, that algorithms of reverse engineering of gene regulatory networks using boolean and bayesian networks require elaboration of detailed mathematical basis for opimal correspondence of reconstructed model with experimental data, and modern computational approaches. That is why the aim of this project is to use grid resources in construction of gene regulatory networks.
Wei-Po Lee, Wen-Shyong Tzou. Computational methods for discovering gene networks from expression data // Brief Bioinform.–2009.–Vol.10,N 4.–P.408-423.
Hecker M., Lambeck S., Toepfer S., et al. Gene regulatory network inference: data integration in dynamic models—a review // Biosystems.–2009.–Vol. 96.–P. 86-103.
Karlebach G., Shamir R. Modelling and analysis of gene regulatory networks // Nat Rev Mol Cell Biol.–2008.–Vol. 9.–P. 770-80.
Ott S., Imoto S., Miyano S. Finding optimal models for small gene networks // Pacific Symposium on Biocomputing.–2004.–P. 557-567.
Adam A Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla Favera, and Andrea Califano. ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context // BMC Bioinformatics.–2006. –Vol. 7(Suppl 1): S7.
Klipp E. Systems biology in practice: concepts, implementation and application.–Wiley-VCH, 2005. –465p.
Kauffman S. Antichaos and adaptation // Scientific American.–1991.–Vol. 265, N. 2.–P. 78–84.
Kauffman S. The Origins of Order.–Oxford University Press, 1993. –709p.
Kauffman S. Investigations.–Oxford University Press, 2002. –308p.
Akutsu T., Miyano S., Kuhara S. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model // Pacific Symposium on Biocomputing.–1999.–Vol. 4.–P. 17–28.
Martin S., Zhang Z., Martino A., Faulon J. L. Boolean dynamics of genetic regulatory networks inferred from microarray time series data // Bioinformatics.–2007.–Vol. 23, N. 7.–P. 866.
Wu H., Liu X. Dynamic bayesian networks modeling for inferring genetic regulatory networks by search strategy: Comparison between greedy hill climbing and mcmc methods // Proceedings of World Academy of Science, Engineering and Technology.–2008.–Vol. 34.–P. 224–234.
Sima C., Hua J., S. Jung S. Inference of Gene Regulatory Networks Using Time-Series Data: A Survey // Current Genomics. –2009.–Vol. 10, N. 6.–P. 416-429.
Yu J., Smith V., Wang P., Hartemink A., Jarvis E. Using Bayesian network inference algorithms to recover molecular genetic regulatory networks // 3rd International Conference on Systems Biology (ICSB02). –2002.
Chickering D., Heckerman D., Meek C. Large-sample learning of Bayesian networks is NP-hard // The Journal of Machine Learning Research.–2004.–Vol. 5.–P. 1287–1330.
De Campos L., Fernandez-Luna J., Puerta J. An iterated local search algorithm for learning Bayesian networks with restarts based on conditional independence tests // International Journal of Intelligent Systems.–2003.–Vol. 18, No. 2.–P. 221-235.
Scollnik D. An introduction to Markov Chain Monte Carlo methods and their actuarial applications // Proceedings of the Casualty Actuarial Society. –1996.–Vol. 83.–P .114–165.
Chib S., Greenberg E. Understanding the Metropolis-Hastings Algorithm // The American Statistician.–1995.–Vol. 49, N. 4.–P. 327-335.
Friedman N., Koller D. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks // Machine learning. –2003.–Vol. 50, N. 1.–P. 95–125.