Regularized Network-Based Algorithm for Predicting Gene Functions with High-Imbalanced Data
DOI:
https://doi.org/10.14806/ej.18.A.377
Abstract
Motivations. Gene function prediction is the real-world problem of finding new bio-molecular functions of genes and gene products; it is characterized by hundreds or thousands of functional classes structured according to a predefined hierarchy. It can be formalized as a semi-supervised, multi-class, multi-label classification problem in which the biological functions of new genes are predicted by exploiting their connections with genes whose biological functions are known. Many different approaches have been proposed to address this problem, including "guilt-by-association" [1], "label propagation" [2], module-assisted techniques [3], and SVMs [4]. Nevertheless, these methods usually suffer a decay in performance when the input data are highly unbalanced, that is, when positive examples are far fewer than negatives. This scenario characterizes in particular the most specific classes of the ontology, which are the classes farthest from the root classes and the ones that best describe the functions of genes.

Methods. To address these issues, we propose a regularization of COSNet, a Hopfield-based cost-sensitive algorithm recently proposed to predict gene functions [5]. This algorithm, although designed to manage the imbalance in labeled data, tends to predict an excessively high proportion of positives when data are particularly unbalanced (that is, in particular, on the most specific classes). By adding a term to the energy function of the network, we are able to modify the dynamics so as to prevent the number of positives from becoming too large. This energy term is minimized when the proportion of positive neurons (current positive rate) matches the rate of positive labels in the training set (expected positive rate). The larger the difference between the current and expected positive rates, the larger the penalty added to the energy function. We call this regularized version R-COSNet.

Results.
We tested R-COSNet on the prediction of the functions of yeast genes, using four different data sets and the classes of the FunCat ontology [6]. This ontology is structured as a forest of trees, in which each node belongs to one of six levels of specificity. Level 1 refers to the root nodes, level i to nodes at distance i from the root. The considered classes are those with at least 20 positives, and they span levels 1 to 5. We compared our method with a label propagation algorithm, LP-Zhu [2], and a Support Vector Machine (SVM) with probabilistic output [4]. In Figure 1 we report the results in terms of F-score averaged across the functional classes belonging to levels 4 and 5 of the hierarchy.

References
1. Oliver, S. Guilt-by-association goes global. Nature 2000, 403: 601-603.
2. Zhu, X, Ghahramani, Z, and Lafferty, J. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003, 912-919.
3. Sharan, R, Ulitsky, I, and Shamir, R. Network-based prediction of protein function. Molecular Systems Biology 2007, 3:88.
4. Lin, HT, Lin, CJ, and Weng, R. A note on Platt's probabilistic outputs for support vector machines. Machine Learning 2007, 68(3): 267-276.
5. Bertoni, A, Frasca, M, and Valentini, G. COSNet: A cost sensitive neural network for semi-supervised learning in graphs. ECML/PKDD (1) 2011, Lecture Notes in Computer Science, 6911: 219-234.
6. Ruepp, A, et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Research 2004, 32(18): 5539-5545.
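The regularization idea described under Methods can be illustrated with a toy example. The abstract does not give the functional form of the added energy term, so the sketch below assumes a quadratic penalty, eta * (current positive rate - expected positive rate)^2, added to a standard Hopfield energy with states in {+1, -1}. The names `positive_rate`, `regularized_energy`, `run_dynamics`, and `eta` are hypothetical, and the energy used here is a generic Hopfield form, not the COSNet parametrization of [5]; labeled and unlabeled neurons are also not distinguished.

```python
def positive_rate(x):
    """Fraction of neurons currently in the positive state (+1)."""
    return sum(1 for s in x if s == 1) / len(x)


def hopfield_energy(W, x, theta):
    """Generic Hopfield energy: -1/2 * sum_ij W[i][j] x_i x_j + sum_i theta_i x_i."""
    n = len(x)
    pairwise = sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return -0.5 * pairwise + sum(t * s for t, s in zip(theta, x))


def regularized_energy(W, x, theta, expected_rate, eta):
    """Hopfield energy plus an ASSUMED quadratic penalty on the rate gap.

    The penalty is zero when the current positive rate equals the expected
    one and grows with the difference between the two.
    """
    gap = positive_rate(x) - expected_rate
    return hopfield_energy(W, x, theta) + eta * gap ** 2


def run_dynamics(W, x0, theta, expected_rate, eta, max_sweeps=100):
    """Asynchronous updates: keep a single-neuron flip only if it strictly
    lowers the regularized energy; stop when a full sweep changes nothing."""
    x = list(x0)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(x)):
            before = regularized_energy(W, x, theta, expected_rate, eta)
            x[i] = -x[i]
            if regularized_energy(W, x, theta, expected_rate, eta) < before:
                changed = True
            else:
                x[i] = -x[i]  # revert: the flip did not lower the energy
        if not changed:
            break
    return x


# Toy network: 6 fully connected neurons with weak positive weights.
W = [[0.0 if i == j else 0.1 for j in range(6)] for i in range(6)]
theta = [-0.5] * 6   # negative thresholds bias every neuron toward +1
start = [1] * 6      # start from the "too many positives" state

plain = run_dynamics(W, start, theta, expected_rate=1 / 3, eta=0.0)
regularized = run_dynamics(W, start, theta, expected_rate=1 / 3, eta=36.0)
print(positive_rate(plain), positive_rate(regularized))
```

In this toy setting the unregularized run (eta = 0) keeps all six neurons positive, while the regularized run settles with two of six neurons positive, which equals the expected rate of 1/3. The weight `eta` controls how strongly the dynamics are pulled toward the expected rate; its value here was chosen only to make the effect visible.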
Published
2012-04-29
Section
Oral Presentations
License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).