Machine Learning Applications in
Systems Biology


Published on: Mar 3, 2016

Transcripts - NatashaBME1450.doc

Machine Learning Applications in Systems Biology
Natasha Alves, M.A.Sc. Candidate, ECE
Manuscript received November 1, 2004.

Abstract: Recent advances in high-throughput technologies have led to an immense flow of biological data. Extracting the information hidden in the ever-expanding biological databases has been an obstacle in the progress of systems biology. Machine Learning has proved to be an efficient and inexpensive approach to organizing data, developing new tools to analyze data, and discovering new knowledge from data. This paper introduces Machine Learning techniques such as inductive logic programming, clustering, Bayesian networks, and decision trees in the context of their applications in systems biology. The shortcomings of these Machine Learning techniques are also addressed.

Index Terms: Artificial Intelligence, Bayesian Networks, Clustering, Decision Trees, Inductive Logic Programming, Machine Learning, Systems Biology.

I. INTRODUCTION
Systems Biology is an in-depth, systems-level analysis of biological systems grounded on the molecular level [1]. It differs from other methods of biological study, where the focus is on the characteristics of isolated parts of a cell or organism. Systems biology examines the structure and dynamics of cellular and organism functions, and their interconnections and interrelationships. One ultimate goal of systems biology is to use the knowledge of the complete genome sequence and all proteins encoded by that genome to reconstruct the biological systems that are implied [2].

The development of systems biology is driven by technology. Sophisticated computational techniques are needed to analyze biological systems because of the complexity and dynamics involved. Machine Learning, which is an automatic and intelligent learning technique, has long been used to discover meaningful associations between proteins, and for scientific hypothesis formation [3].

The aim of this paper is to introduce Machine Learning techniques in the context of their application in systems biology.

II. CHALLENGES IN SYSTEMS BIOLOGY
Much of our failure to fully understand biological systems has been due to their size and complexity. Systems biology emphasizes large-scale discovery of the interactions of genes, proteins, and other cell elements. It is confronted with dynamic biological responses, a huge number of interactions, and inherent redundancy in many pathways and feedback systems.

A lot of useful and important information about biological systems is hidden in high volumes of experimental data. For instance, there are 37 billion bases of DNA in 32,000 sequence records in GenBank alone (Feb. 2004) [12]. Analyzing high volumes of data to understand biological systems demands tedious experimentation and modern computational technology. This is the grand challenge for systems biology in this era.

An intelligent approach is needed to extract the hidden information from the data and to cope with the rapid rate of data deposition.

III. MACHINE LEARNING
Machine Learning (ML) is the capability of computer algorithms to improve automatically through experience (i.e. the computer programs itself by seeing examples of the behavior we want). ML approaches are ideally suited for domains characterized by the presence of large amounts of data, noisy patterns and the absence of general theories [4]. The fundamental idea behind these approaches is to learn the theory automatically from the data through a process of inference and model fitting. A system that can learn from experience and improve its performance automatically could serve as a tool for analyzing biological systems.

The main goal of ML is to induce general functions from a specific training data set. The learning agent is given a set of training examples, and it defines the hypothesis for them. The agent must search through the hypothesis space and locate the best hypothesis when given the test set [5].

Because ML is concerned with learning from data examples, it often uses a probabilistic approach.

IV. OVERCOMING THE CHALLENGES IN SYSTEMS BIOLOGY
ML approaches have gained popularity in systems biology. The characteristics of ML that make it well suited for systems biology are:
1. Many problems in biological systems are not well defined, but have a lot of experimental data. ML is useful when the structure of the task is not well understood but the task can be characterized by a data set with strong statistical regularity. While input/output pairs can be easily specified, the relationship between the inputs and outputs is often unknown (e.g. the protein folding mechanism). ML approaches can extract relationships and correlations hidden under
   large volumes of data (data mining). They can thus extract the information encoded in biological databases and use the available data to predict meaningful biological properties.
2. ML approaches can adjust their internal structure to produce correct outputs for a large number of sample inputs. They can thus constrain their input/output function to approximate the implicit relationship in the training examples [6].
3. ML approaches adapt themselves to new information (training examples). This is important in systems biology because new data are generated every day. The newly generated data might update the initial learning hypotheses.

ML thus provides efficient approaches to analyze biological data.

V. MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY
A variety of ML techniques can be used to solve most of the problems in systems biology. In a systems biology context, ML is used to discover meaningful knowledge from existing biological databases and to present that knowledge in an understandable pattern. The tasks of ML in systems biology can be divided into seven categories, as shown in Table I [5]. These techniques, operating individually or in combination, can meet the various challenges in systems biology.

TABLE I
MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY [5]

   Application           Description
1  Classification        Predicting an item's class.
2  Forecasting           Predicting a parameter value.
3  Clustering            Finding groups of items.
4  Description           Describing a group.
5  Deviation Detection   Finding changes.
6  Link Analysis         Finding relationships.
7  Visualization         Presenting data visually to facilitate human discovery.

Machine learning approaches to protein structure prediction and gene pathway discovery are examined in the following sections.

A. Protein Structure Prediction
Proteins are the essence of life. The secondary structure of a protein consists of α-helices, β-strands and coils. The folding of these secondary structure elements forms the unique 3D structure of a protein. A lot of useful information is contained in this 3D structure. However, predicting a protein's structure is a central problem in bioinformatics. It is the bottleneck between sequencing efforts and drug design. ML approaches like Inductive Logic Programming can be used to predict protein structure.

1) Inductive Logic Programming
Inductive logic programming (ILP) is a research area formed at the intersection of ML and Logic Programming. ILP systems develop predicate descriptions from observations and background knowledge. There are three main elements in an ILP learning system: observations, background knowledge, and hypothesis [5]. Each of these elements of ILP is a logic program. Fig. 1 shows the general scheme for ILP methods. Observations and background knowledge are combined by an ILP program to form a hypothesis. A set of IF-THEN rules can then be derived from the hypothesis. For example:

Hypothesis:
    fold('Four-helical up-and-down bundle',P) :-
        helix(P,H1), length(H1,hi), position(P,H1,Pos),
        interval(1 =< Pos =< 3), adjacent(P,H1,H2),
        helix(P,H2).

Rule: The protein P has fold class 'Four-helical up-and-down bundle' if it contains a long helix H1 at a secondary structure position between 1 and 3, and H1 is followed by a second helix H2. [5]

The rules are tested on additional data. If experimentation leads to high confidence in the validity of the hypothesis, the hypothesis is added to the background knowledge.

Figure 1. Scheme for ILP Methods [7]

ILP has been used for protein structure prediction. Muggleton et al. implemented ILP by separating the data set of proteins into groups of the same type of domain structure (e.g. α-type domains). This gave the system a more homogeneous data set, thus allowing better prediction [7]. The ILP program used in this method was Golem. The basic algorithm was as follows:
1. Take a random sample of pairs of residues from the training set. This represents a set of pairs of residues chosen randomly from the set of all residues in all proteins represented.
2. Compute all the common properties for each pair of residues.
3. Convert the common properties into a rule that is true for the residue pair under consideration.
4. Choose the rule for the best residue pair. For example, choose the rule that predicts the most true α-helix residues while predicting fewer than a pre-defined threshold of non-α-helix residues from the training set.
5. Take another sample of unpredicted residue pairs.
6. Form rules which express the common properties of the best pair together with each of the individual residue pairs in the sample.
7. Repeat steps 4-6 until no improvement in prediction is produced.

The algorithm uses the best rule to eliminate a set of predicted residues from the training set. The reduced training set is then used to build up further rules. The process terminates when no further rules can be found. Golem produced an accuracy of about 81% when applied to 16 proteins with α-type domains.

The disadvantage of ILP is the lack of probability in its rules. Biological systems are characterized by a high degree of uncertainty; thus, the hypotheses will have a higher descriptive power if they incorporate a certain degree of probability [5].

To date, ML methods cannot, by themselves, completely describe a new protein's structure; however, they can provide valuable information regarding numerous structural attributes.

B. Gene Pathway Discovery
Systems biology seeks to discover causal relationships among a large number of genes and other cellular constituents. From a system-level point of view, the various interactions and control loops, which form a genetic network, represent the basis upon which the vast complexity and flexibility of life processes emerges.

ML techniques like clustering, Bayesian networks and decision trees can be used to discover gene regulation pathways.

1) Gene Clustering
Clustering is a discovery approach that organizes and identifies subsets of data and groups them into classes. Each class represents data with similar attributes. A derivative clustering algorithm can also be used to predict and explain complex data.

Clustering algorithms are used to discover groups of genes that show similar expression patterns under different experimental conditions. By this procedure, different families of cell-cycle regulated genes in the baker's yeast, Saccharomyces cerevisiae, have been identified [8].

Gene clustering has several drawbacks. Firstly, the assignment of genes to single clusters by most clustering methods potentially prevents the exposure of complex interrelationships among genes. Secondly, clustering does not always provide causal information. Genes sharing similar expression profiles may not always share a function. Even when similar expression levels correspond to similar functions, the functional relationships among genes in a cluster cannot be determined from the cluster data alone [9]. Conversely, a gene may be suppressed to allow another to be expressed; thus, functionally related genes may be clustered separately, blurring the existing relationship.

A system named GEEVE, introduced by Yoo and Cooper, uses gene expression data to learn models of gene regulation and thus discover causal gene pathways [10]. The GEEVE system, shown in Fig. 2, consists of two modules: the causal Bayesian network update module, and the decision tree generation and evaluation module.

Figure 2: The GEEVE system [10]

2) Causal Bayesian Networks
A Bayesian network is a directed, acyclic graph of nodes representing variables and arcs representing dependencies among the variables. A Bayesian network encodes the joint probability distribution over all the variables. The joint distribution of a Bayesian network with N variables can be factored as follows:

    P(x_1, x_2, \ldots, x_N \mid K) = \prod_{i=1}^{N} P(x_i \mid \pi_i, K),    (1)

where x_i is the state of variable X_i, \pi_i is a joint state of the parents of X_i, and K denotes background knowledge [10].

Bayesian networks are capable of handling incomplete data sets, and are able to learn and predict the missing data. They also provide models of causal influence. These properties make Bayesian networks a promising tool for analyzing gene expression patterns.

In the context of genetic pathway inference, each node of a Bayesian network is assigned to a gene, and can assume the different expression levels of this gene throughout the training data. Each edge between the nodes (genes) denotes a regulatory relationship between them. If the edge is directed, as shown in Fig. 3, it denotes that one gene controls the other. Fig. 4 shows the feature graph trained for a genetic sub-network of the baker's yeast.
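The factorization of a Bayesian network's joint distribution into one conditional probability per node can be sketched in a few lines of Python. The three-gene network below (gene A regulating genes B and C), the binary on/off expression states, and every probability value are invented for illustration; they are not taken from [10] and do not reproduce the GEEVE system.

```python
from itertools import product

# Hypothetical network structure: A -> B and A -> C (assumed for illustration).
parents = {"A": (), "B": ("A",), "C": ("A",)}

# cpt[gene][parent_states] = P(gene is "on" (1) | parent states).
# All numbers are made up.
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.5, (1,): 0.2},
}


def joint_probability(state):
    """Product over all nodes of P(x_i | state of the parents of X_i)."""
    prob = 1.0
    for gene, gene_parents in parents.items():
        parent_state = tuple(state[p] for p in gene_parents)
        p_on = cpt[gene][parent_state]
        prob *= p_on if state[gene] == 1 else 1.0 - p_on
    return prob


# P(A=1, B=1, C=0) = 0.3 * 0.8 * (1 - 0.2)
p = joint_probability({"A": 1, "B": 1, "C": 0})

# Summing the joint over all 2^3 states should give 1.
total = sum(
    joint_probability(dict(zip("ABC", states)))
    for states in product((0, 1), repeat=3)
)
```

Storing only one small table per gene, rather than one entry per joint state, is what keeps such a model tractable as the number of genes grows.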
Figure 3: The structure of a causal Bayesian network that represents a portion of a hypothetical gene regulation pathway [10]

Figure 4: Genetic sub-networks of the baker's yeast. [11]

While Bayesian networks produce better results than rule-based learning methods, there is no clear explanation of the learning process. It is therefore hard to understand the results and to interpret them as useful knowledge [5].

3) Decision Trees
The decision tree is a simple inductive learning system that uses discrete-valued functions to estimate and classify the provided training set. The system is represented by a tree whose internal nodes are tests (boolean decisions) and whose leaf nodes are classes. The tree can make predictions about the probability of a particular case belonging to a particular class.

Decision trees can be used to model gene perturbation in experiments. The GEEVE system, for example, builds and evaluates a decision tree based on pair-wise gene relationships. Thus, the effects on gene X when gene Y is perturbed can be modeled [11].

The drawbacks of decision trees are over-fitting of data and overlapping in the classes. These and other factors make decision trees difficult to optimize.

VI. CONCLUSION
Since ML primarily deals with the extraction of knowledge from data, redundancy of data is an important issue facing ML. The quality of biological data is usually compromised by experimental errors, wrong interpretation by biologists, non-standardized experimental techniques, etc. The uncertainty associated with experiment-based research is very high.

Despite these challenges, ML techniques have promoted the success of systems biology in recent years. ML has helped accelerate research in several areas of systems biology including protein structure prediction, inference of genetic and molecular networks, and gene-protein interactions. The author believes that systems biology will continue to benefit from ML techniques in coming years.

REFERENCES
[1] Kitano, H., "Looking beyond the details: a rise in system-oriented approaches in genetics and molecular biology", Curr. Genet., Vol. 41(1), 2002, pp. 1-10.
[2] Lathrop, R., "Intelligent Systems in Biology: Why the Excitement?", IEEE Intelligent Sys., Vol. 16(6), 2001, pp. 8-13.
[3] Luke, S., Hamahashi, S., Kyoda, K., Ueda, H., "Biology: see it again-for the first time", IEEE Intelligent Systems, Vol. 13(5), 1998, pp. 6-8.
[4] Hu, Y., Kibler, D., "Combinatorial motif analysis and hypothesis generation on a genomic scale", Bioinformatics, Vol. 16(3), 2000, pp. 222-32.
[5] Tan, A., Gilbert, G., "Machine Learning and its Application to Bioinformatics: An Overview", www.brc.dcs.gla.ac.uk/~actan/publications.html, Retrieved: Oct. 27, 2004.
[6] Nilsson, N., "Introduction to Machine Learning", unpublished, http://robotics.stanford.edu/people/nilsson/mlbook.html, 1996, Retrieved: Oct. 27, 2004.
[7] Muggleton, S., King, R., Sternberg, M., "Using logic for protein structure prediction", Proceedings of the 25th Hawaii Int. Conf. on System Sciences, IEEE Computer Society Press, 1992.
[8] Spellman, P.T., "Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization", Molecular Biology of the Cell, 1998, pp. 3273-3297.
[9] Shatkay, H., Edwards, S., Boguski, M., "Information retrieval meets gene analysis", IEEE Intelligent Systems, Vol. 17(2), 2002, pp. 45-53.
[10] Yoo, C., Cooper, G., "An Evaluation of a System that Recommends Microarray Experiments to Perform to Discover Gene-Regulation Pathways", Journal of Artificial Intelligence in Medicine, Vol. 31(2), 2004, pp. 169-182.
[11] Stetter, M., "Large-Scale Computational Modeling of Genetic Regulatory Networks", Artificial Intelligence Review 20, 2003, pp. 75-93.
[12] National Center for Biotechnology Information: GenBank Overview, www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html, Retrieved: Oct. 27, 2004.
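As a supplementary illustration of the decision trees described in Section V, the following minimal Python sketch builds the smallest possible tree for a pair-wise gene relationship: one internal node testing whether gene Y was perturbed, and two leaves holding class frequencies for gene X. The observations are invented, the single test is fixed rather than selected by a learning algorithm, and nothing here reproduces the actual GEEVE procedure.

```python
from collections import Counter

# Hypothetical observations: (was gene Y perturbed?, expression class of gene X).
observations = [
    (True, "down"), (True, "down"), (True, "down"), (True, "up"),
    (False, "up"), (False, "up"), (False, "up"), (False, "down"),
]


def fit_stump(data):
    """Build a one-test tree: each leaf counts the classes that reached it."""
    leaves = {True: Counter(), False: Counter()}
    for y_perturbed, x_class in data:
        leaves[y_perturbed][x_class] += 1
    return leaves


def predict_proba(leaves, y_perturbed, x_class):
    """Estimated probability that a case reaching this leaf has class x_class."""
    counts = leaves[y_perturbed]
    return counts[x_class] / sum(counts.values())


tree = fit_stump(observations)
p_down_if_perturbed = predict_proba(tree, True, "down")
p_up_if_unperturbed = predict_proba(tree, False, "up")
```

With so few cases per leaf the estimated probabilities follow the sample exactly, which is a small-scale view of the over-fitting drawback noted for decision trees.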
