Discovering Patterns in Gene Ontology Using Association Rule Mining

Gene Ontology (GO) is one of the largest interdisciplinary bioinformatics projects and aims to provide a uniform and consistent representation of genes and gene products across different species. It has quickly become a vast repository of data consisting of biological terms arranged in three different ontologies, together with annotation files that record how these terms are linked to genes across different organisms. This dataset continues to grow as a result of the various genomic projects underway. While this growth is a welcome development, there is a critical need for data mining tools and algorithms that can extract summaries and discover useful knowledge in an automated way. This paper reviews the efforts in this area, focusing on information discovery in the form of association rule mining.


Introduction
Gene Ontology (GO) is one of the largest interdisciplinary projects in bioinformatics and seeks to develop a consistent vocabulary and structured organization of gene-related terms and products [1]. It consists of biomedical terms, their inter-relationships, and term-gene associations for different organisms stored in annotation files. Terms are categorized into three different ontologies, namely Biological Process (BP), Cellular Component (CC), and Molecular Function (MF), each of which is organized in the form of a directed acyclic graph (DAG). These ontologies are constructed independently of species and represent the current knowledge in the form of a term hierarchy and term relations, such as "is-a" or "part-of" relationships. Another aspect of GO is the annotation of ontology terms to genes of different species. The Gene Ontology Consortium manages the annotations for various species in specific databases such as the Saccharomyces cerevisiae database or the Homo sapiens database [2]. Annotations are constantly added and updated by various research projects, and the data can be downloaded in various formats from the GO website. As of October 2015, there were 43,835 terms in GO that were related by 73,776 explicitly encoded "is-a" relationships, 7,436 explicitly encoded "part-of" relationships, and 8,263 other explicitly encoded relationships [2]. A recent check revealed over 6.8 million annotations to genes across different organisms [3].
With so many active research projects producing and updating information in GO, there is a need for effective information extraction tools that can automatically discover patterns and knowledge from this massive dataset. One of the reasons behind the development of GO was the observation that genes that are similar across different organisms are likely to have similar functions. While the annotations are developed by the GO Consortium, which comprises projects for different organisms such as UniProt, Mouse Genome Informatics, and the Saccharomyces Genome Database, the real challenge lies in extracting knowledge that can relate gene functions across different organisms and species. Much effort is ongoing in this direction with the development of tools like AmiGO, GOOSE [3], and GO enrichment analysis [4].
While much work has been done in the semantic similarity area, the task of discovering patterns in the term annotations has not been investigated extensively. In this review, we examine the use of Association Rule Mining (ARM) to investigate whether statistically significant rules can be extracted from the annotation data. Association Rule (AR) discovery is generally performed on a set of transactions. In the original retail sales application area, this type of transaction data is also referred to as market basket data [5]. Each transaction either includes or does not include a particular item, and thus the data can be represented as a binary matrix. In a retail setting, items may be milk, bread, butter, etc., and each trip to the supermarket represents a transaction that includes a subset of these items. We are often interested in finding which items are frequently bought together, and in extracting rules that indicate a high probability of an item being purchased given that the customer has purchased other items. The strength of such rules is usually evaluated using measures such as support, confidence, and lift [5].
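As a minimal sketch of these measures, the Python snippet below computes support, confidence, and lift on a small invented market basket. The transactions and function names are illustrative assumptions and are not taken from any of the cited works. For GO data, each gene would play the role of a transaction and its annotated terms the role of items.

```python
# Toy market basket data (invented for illustration): one set of items per transaction.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs, so confidence is undefined")
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's support; 1.0 suggests independence."""
    cons = support(consequent, transactions)
    if cons == 0:
        raise ValueError("consequent never occurs, so lift is undefined")
    return confidence(antecedent, consequent, transactions) / cons


if __name__ == "__main__":
    rule = ({"milk"}, {"bread"})
    print("support   :", support(rule[0] | rule[1], transactions))
    print("confidence:", confidence(*rule, transactions))
    print("lift      :", lift(*rule, transactions))
```

In this toy data, milk and bread appear together in three of five transactions, while each appears alone in four of five, so the rule milk -> bread has a lift slightly below 1. A lift below 1 indicates that the two items co-occur less often than would be expected if they were independent.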
ARM has been used in bioinformatics for applications such as finding associations in gene expression datasets [6], association analysis of microarray data [7], and association rule discovery from protein-protein interaction data [8]. Specifically for GO, Carmona-Saez et al. [9] were among the earliest to apply association rules to an integrated dataset of gene expression and gene ontology annotations. Martinez et al. [10] developed a tool for association rule discovery called GenMiner that works on integrated data sources, such as gene expression data and gene ontology data. Another such tool, designed to discover frequently co-occurring annotations in genes, is GENECODIS [11].
While the earlier approaches tried to discover association rules in datasets that combined gene data from expression profiles and term annotations, several more recent works have focused on GO data alone. Manda et al. [12] applied association rule mining to find patterns across the three ontologies of GO using a level-by-level ontology traversal mechanism. In a subsequent work, Manda et al. [13] used the structure and relationships of terms to discover multi-ontology, multi-level association rules, and also proposed support and confidence measures for multiple ontologies. Kumar et al. [14] developed association rules in which the antecedent and consequent terms come from different ontologies within GO. Agapito et al. [15] also addressed cross-ontology association rule extraction. Nagar et al. [16] proposed a method for discovering association rules among terms that have a similar level of specificity in the ontology. Using this approach, they were able to discover strong rules that were validated using biological evidence in the literature.

Discussion
Association Rule Mining is a powerful technique for discovering associations in large datasets. It has been widely used in many areas of bioinformatics, but it has not been fully exploited for GO. One of the reasons is that, during the annotation process, genes are annotated with GO terms that can have different levels of specificity because of their position in the ontology structure. A term at the bottom of the structure is likely to be very specific, whereas a term at the top is very generic. Mining association rules over terms of such mixed specificity is a challenge, because the resulting rules may relate terms that describe genes at very different levels of detail. One solution could be to normalize terms so that they represent the same level of detail. Other solutions might involve pruning the terms so that only those within a specified threshold of specificity are retained.
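One way to picture such normalization is sketched below in Python. This is only an illustration under stated assumptions and not the method of any cited work: the term names and hierarchy are hypothetical, only "is-a" style parent links are modeled, and the depth of a term is taken to be its shortest distance from a root, which is one of several possible definitions in a DAG. Each annotated term is replaced by its ancestors at a chosen depth, and terms shallower than that depth are dropped, which corresponds to the pruning idea.

```python
# Hypothetical toy hierarchy: each term maps to its set of parent terms.
parents = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
    "B1": {"B"},
    "AB": {"A1", "B1"},  # two parents, so the structure is a DAG rather than a tree
}


def depth(term, parents):
    """Shortest distance from `term` to a root (assumed definition of depth)."""
    if not parents[term]:
        return 0
    return 1 + min(depth(p, parents) for p in parents[term])


def ancestors(term, parents):
    """All ancestors of `term`, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def normalize(terms, parents, level):
    """Map each term to its ancestors at exactly `level`; shallower terms are dropped."""
    out = set()
    for t in terms:
        out |= {a for a in ancestors(t, parents) if depth(a, parents) == level}
    return out


if __name__ == "__main__":
    # A gene annotated with one specific and one generic term.
    print(sorted(normalize({"AB", "B"}, parents, 1)))  # → ['A', 'B']
```

After normalization, every gene is described by terms at the same depth, so the resulting term sets can be passed to a standard association rule miner as ordinary transactions. The choice of depth trades off specificity against coverage: a deeper level keeps more detail but discards more of the generic annotations.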

Conclusion
Gene Ontology has quickly become one of the largest repositories of gene product and annotation data. It stores a massive amount of data that need to be analyzed and converted into useful knowledge across various genomes. Association rule mining can be a very effective tool in this direction, as it can automatically extract significant associations and rules from the ontology. This review presented some of the work that has been done on GO using association rule mining, the challenges involved, and some possible solutions. There is significant room for more work in this area, especially for predicting terms that could be annotated to genes whose characteristics, such as expression profiles or sequences, are known but whose function and exact role in biological pathways are still unknown.