"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of."

Ronald Fisher

Field of expertise: prediction, machine learning, statistics, computational methods, data preparation

Statistical thinking is a vital part of the natural sciences. Our machine learning and statistical data analysis group is concerned with the methodological aspects of quantitative scientific research, from hypothesis formulation and experiment design to data collection and statistical analysis of the results. The insights gained are used to build predictive models of the observed phenomena. We believe that involving researchers with both biological and statistical backgrounds in all parts of this process benefits the overall quality of the research.

To make sense of vast quantities of biological data generated by high-throughput experiments, we apply various machine learning techniques to extract complex multidimensional dependencies not visible to the "naked eye". Such methods allow us to select important features, compare multidimensional objects, find similarity clusters or measure strength of linear or non-linear relations. The same tools are used to develop useful predictors of various biological properties working on previously unseen data and providing guidelines for planning further experiments.

Apart from applying existing tools for data analysis, we also develop new computational methods able to cope with the challenges posed by biological data. We are aware of the specific problems concerning data quality, data volume, scarcity of observations and missing records. We tackle these problems from both theoretical and practical perspectives, aiming to develop efficient tools that are easy to run in parallel on modern multi-core computer architectures.

Michał Denkiewicz
PhD candidate

Ziad Al Bkhetan
PhD student

Paulina Urban
PhD student

Abstract: With the avalanche of genomic and proteomic data generated in the postgenomic age, it is highly desirable to develop automated methods for rapidly and effectively analyzing and predicting the structure, function, and other properties of DNA and protein. Researchers realize the importance of machine learning methods and feature selection algorithms for potential knowledge finding tasks in genomics and proteomics. Recent years have shown tremendous advances in the properties prediction of DNA fragments and protein sequences by various pattern recognition methods. These techniques provide economical and time-saving solutions for identifying the properties of DNA and protein. This special issue will focus on various aspects of the application of machine learning methods in genomics and proteomics bioinformatics. The recent developments on the prediction of protein subcellular localization, posttranslational modification sites, DNA-binding site, protein-protein interaction, nucleosome positioning, transcription factor binding site, exon/intron splice site, translation initiation site, and transcription start site will be included in the special issue.

Authors: Lin H, Chen W, Anandakrishnan R, Plewczynski D

Note: 'Application of machine learning method in genomics and proteomics' by Lin H, Chen W, Anandakrishnan R, Plewczynski D. Scientific World Journal. 2015; 2015:914780. doi:10.1155/2015/914780. Epub 2015 Apr 19.

Abstract: The Cyclin-Dependent Kinases (CDKs) are the core components coordinating eukaryotic cell division cycle. Generally, the crystal structure of CDKs provides information on possible molecular mechanisms of ligand binding. However, reliable and robust estimation of ligand binding activity has been a challenging task in drug design. In this regard, various machine learning techniques, such as Support Vector Machine, Naive Bayesian classifier, Decision Tree, and K-Nearest Neighbor classifier, have been used. The performance of these heterogeneous classification techniques depends on proper selection of features from the data set. This fact motivated us to propose an integrated classification technique using Genetic Algorithm (GA), Rotational Feature Selection (RFS) scheme, and Ensemble of Machine Learning methods, named as the Genetic Algorithm integrated Rotational Ensemble based classification technique, for the prediction of ligand binding activity of CDKs. This technique can automatically find the important features and the ensemble size. For this purpose, GA encodes the features and ensemble size in a chromosome as a binary string. Such encoded features are then used to create diverse sets of training points using RFS in order to train the machine learning method multiple times. The RFS scheme works on Principal Component Analysis (PCA) to preserve the variability information of the rotational nonoverlapping subsets of original data. Thereafter, the testing points are fed to the different instances of trained machine learning method in order to produce the ensemble result. Here accuracy is computed as a final result after 10-fold cross validation, which is also used as an objective function for GA to maximize. The effectiveness of the proposed classification technique has been demonstrated quantitatively and visually in comparison with different machine learning methods for 16 ligand binding CDK docking and rescoring data sets. In addition, the best possible features have been reported for CDK docking and rescoring data sets separately. Finally, the Friedman test has been conducted to judge the statistical significance of the results produced by the proposed technique. The results indicate that the integrated classification technique has high relevance in predicting protein-ligand binding activity.

Authors: Saha I, Rak B, Bhowmick SS, Maulik U, Bhattacharjee D, Koch U, Lazniewski M, Plewczynski D

Note: 'Binding Activity Prediction of Cyclin-Dependent Inhibitors' by Saha I, Rak B, Bhowmick SS, Maulik U, Bhattacharjee D, Koch U, Lazniewski M, Plewczynski D. J Chem Inf Model. 2015 Jul 27;55(7):1469-82. doi: 10.1021/ci500633c.
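
The abstract above states that the GA encodes the selected features and the ensemble size in one binary chromosome. The following is only a minimal sketch of how such a string could be decoded. The layout (feature mask first, ensemble-size bits last), the +1 offset that keeps the ensemble non-empty, and the name `decode_chromosome` are assumptions made for illustration, not details taken from the paper.

```python
def decode_chromosome(bits, n_features):
    """Split a binary chromosome into a feature subset and an ensemble size.

    Assumed layout (hypothetical): the first `n_features` bits form a
    feature mask, and the remaining bits encode the ensemble size as an
    unsigned binary integer, offset by 1 so the ensemble is never empty.
    """
    if len(bits) <= n_features:
        raise ValueError("chromosome must contain ensemble-size bits")
    if set(bits) - {"0", "1"}:
        raise ValueError("chromosome must be a string of 0s and 1s")

    # Indices of the features switched on in the mask.
    selected = [i for i, bit in enumerate(bits[:n_features]) if bit == "1"]
    # Remaining bits read as a binary number, plus 1.
    ensemble_size = int(bits[n_features:], 2) + 1
    return selected, ensemble_size


if __name__ == "__main__":
    features, size = decode_chromosome("10110" + "011", n_features=5)
    print(features, size)
```

In a GA loop, the decoded pair would be passed to the training routine, and the cross-validated accuracy would serve as the fitness of the chromosome.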

Abstract: Protein–protein interactions are important for the majority of biological processes. A significant number of computational methods have been developed to predict protein–protein interactions using protein sequence, structural and genomic data. Vast experimental data is publicly available on the Internet, but it is scattered across numerous databases. This fact motivated us to create and evaluate new high-throughput datasets of interacting proteins. We extracted interaction data from the DIP, MINT, BioGRID and IntAct databases. Then we constructed descriptive features for machine learning purposes based on data from Gene Ontology and DOMINE. Thereafter, four well-established machine learning methods (Support Vector Machine, Random Forest, Decision Tree and Naïve Bayes) were used on these datasets to build an Ensemble Learning method based on majority voting. In a cross-validation experiment, sensitivity exceeded 80% and classification/prediction accuracy reached 90% for the Ensemble Learning method. We extended the experiment to a bigger and more realistic dataset, maintaining sensitivity over 70%. These results confirmed that our datasets are suitable for performing PPI prediction and that the Ensemble Learning method is well suited for this task. Both the processed PPI datasets and the software are available at http://sysbio.icm.edu.pl/indra/EL-PPI/home.html.

Authors: Saha I, Zubek J, Klingström T, Forsberg S, Wikander J, Kierczak M, Maulik U, Plewczynski D

Note: 'Ensemble learning prediction of protein-protein interactions using proteins functional annotations' by Saha I, Zubek J, Klingström T, Forsberg S, Wikander J, Kierczak M, Maulik U, Plewczynski D. Mol Biosyst. 2014 Apr;10(4):820-30. doi: 10.1039/c3mb70486f.
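
The majority-voting step named in the abstract above can be illustrated with a short sketch. It assumes each base classifier has already produced one label per sample. The function names and the tie-breaking rule (the earliest classifier in the list wins) are hypothetical choices, not details from the paper.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that most base classifiers voted for.

    Ties are broken in favour of the label that appears first in `votes`
    (an arbitrary choice made for this sketch).
    """
    counts = Counter(votes)
    best = max(counts.values())
    for label in votes:
        if counts[label] == best:
            return label


def ensemble_predict(per_classifier_predictions):
    """Combine per-classifier prediction lists into one list of labels."""
    # zip(*) regroups the predictions so each tuple holds all votes
    # cast for a single sample.
    return [majority_vote(list(votes))
            for votes in zip(*per_classifier_predictions)]


if __name__ == "__main__":
    svm = [1, 0, 1, 1]
    random_forest = [1, 0, 0, 1]
    decision_tree = [0, 0, 1, 0]
    naive_bayes = [1, 1, 1, 0]
    print(ensemble_predict([svm, random_forest, decision_tree, naive_bayes]))
```

With an even number of base classifiers, as here, ties are possible, so some explicit tie-breaking rule is needed.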

Abstract: We study mathematical models of the collaborative solving of a two-choice discrimination task. We estimate the difference between the shared performance of a group of n observers and the performance of a single person. Our paper is a theoretical extension of the recent work of Bahrami, Olsen, Latham, Roepstorff, and Frith (2010) from a dyad (a pair) to a group of n interacting minds. We analyze several models of communication, decision-making and hierarchical information-aggregation. The maximal slope of the psychometric function is a convenient parameter characterizing performance. For every model we investigated, the group performance turns out to be a product of two numbers: a scaling factor depending on the group size and an average performance. The scaling factor is a power function of the group size (with the exponent ranging from 0 to 1), whereas the average is the arithmetic mean, quadratic mean, or maximum of the individual slopes. Moreover, voting can be almost as efficient as more elaborate communication models, provided the participants have similar individual performances.

Authors: P Migdał, J Rączaszek-Leonardi, M Denkiewicz and D Plewczynski

Note: 'Information-sharing and aggregation models for interacting minds' by P. Migdał, J. Rączaszek-Leonardi, M. Denkiewicz and D. Plewczynski. Journal of Mathematical Psychology 56: 417-426 (2013).
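
The general form reported in the abstract above (group performance as a power of the group size multiplied by an average of the individual slopes) can be written down as a small sketch. Which exponent and which average belong to which communication model is not stated in the abstract, so both are left as free parameters here. The function and parameter names are hypothetical.

```python
import math


def arithmetic_mean(slopes):
    return sum(slopes) / len(slopes)


def quadratic_mean(slopes):
    return math.sqrt(sum(s * s for s in slopes) / len(slopes))


# The three kinds of average named in the abstract.
AVERAGES = {
    "arithmetic": arithmetic_mean,
    "quadratic": quadratic_mean,
    "max": max,
}


def group_slope(slopes, exponent, average="arithmetic"):
    """Group slope = n**exponent * average(individual slopes).

    `exponent` must lie in [0, 1], the range given in the abstract.
    """
    if not 0.0 <= exponent <= 1.0:
        raise ValueError("exponent must be between 0 and 1")
    n = len(slopes)
    return n ** exponent * AVERAGES[average](slopes)


if __name__ == "__main__":
    individual = [1.0, 1.0, 1.0, 1.0]
    print(group_slope(individual, exponent=0.5, average="quadratic"))
```

An exponent of 0 means the group gains nothing from its size, while an exponent of 1 means performance grows linearly with the number of observers.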

Abstract: The physico-chemical properties of interaction interfaces have a crucial role in the characterization of protein-protein interactions (PPI). In silico prediction of participating amino acids helps to identify interface residues for further experimental verification using mutational analysis, or inhibition studies by screening a library of ligands against a given protein. Given the unbound structure of a protein and the fact that it forms a complex with another known protein, the objective of this work is to identify the residues that are involved in the interaction. We attempt to predict interaction sites in protein complexes using the local composition of amino acids together with their physico-chemical characteristics. The local sequence segments (LSS) are dissected from the protein sequences using a sliding window of 21 amino acids. The list of LSSs is passed to the support vector machine (SVM) predictor, which identifies interacting residue pairs considering their inter-atom distances. We have analyzed three different model organisms, Escherichia coli, Saccharomyces cerevisiae and Homo sapiens, where the numbers of considered hetero-complexes are equal to 40, 123 and 33 respectively. Moreover, a unified multi-organism PPI meta-predictor is also developed in the current work by combining the training databases of the above organisms. The PPIcons interface residue prediction method achieves an area under the ROC curve (AUC) of 0.82, 0.75, 0.72 and 0.76 for the aforementioned organisms and the meta-predictor respectively.

Authors: BK Sriwastava, S Basu, U Maulik, D Plewczynski

Note: 'PPIcons: identification of protein-protein interaction sites in selected organisms' by BK. Sriwastava, S. Basu, U. Maulik, D. Plewczynski. J Mol Model 19(9):4059-70 (2013).
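
The sliding-window step mentioned in the abstract above (local sequence segments of 21 amino acids) can be sketched as follows. The abstract does not say how sequence ends are handled, so this sketch assumes one window centred on every residue, with the ends padded by a placeholder character `X`. Both the padding and the function name are hypothetical.

```python
def local_sequence_segments(sequence, window=21, pad="X"):
    """Return one fixed-length segment centred on each residue.

    The sequence is padded on both sides with `pad` (an assumption of
    this sketch) so that terminal residues also get a full window.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it has a central residue")
    half = window // 2
    padded = pad * half + sequence + pad * half
    # Segment i starts at padded position i, so its centre is residue i.
    return [padded[i:i + window] for i in range(len(sequence))]


if __name__ == "__main__":
    # A short toy sequence and a small window, for readability only.
    for segment in local_sequence_segments("ACDEFGHIKL", window=5):
        print(segment)
```

Each segment would then be turned into a numeric feature vector (for example from amino acid composition and physico-chemical indices) before being passed to a classifier.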

Abstract: We present here the 2011 update of the AutoMotif Service (AMS 4.0) that predicts a wide selection of 88 different types of single amino acid post-translational modifications (PTM) in protein sequences. The selection of experimentally confirmed modifications is acquired from the latest UniProt and Phospho.ELM databases for training. The sequence vicinity of each modified residue is represented using amino acid physico-chemical features encoded using high quality indices (HQI) obtained by automatic clustering of known indices extracted from the AAindex database. For each type of numerical representation, the method builds an ensemble of Multi-Layer Perceptron (MLP) pattern classifiers, each optimising a different objective during training (for example recall, precision or area under the ROC curve (AUC)). The consensus is built using brainstorming technology, which combines multi-objective instances of the machine learning algorithm and the data fusion of different training object representations, in order to boost the overall prediction accuracy of conserved short sequence motifs. The performance of AMS 4.0 is compared with the accuracy of previous versions, which were constructed using single machine learning methods (artificial neural networks, support vector machine). Our software improves the average AUC score of the earlier version by close to 7% as calculated on the test datasets of all 88 PTM types. Moreover, for the selected most difficult sequence motif types it is able to improve the prediction performance by almost 32% when compared with previously used single machine learning methods. Summarising, the brainstorming consensus meta-learning methodology on average boosts the AUC score up to around 89%, averaged over all 88 PTM types. Detailed results for single machine learning methods and the consensus methodology are also provided, together with a comparison to previously published methods and state-of-the-art software tools. The source code and precompiled binaries of the brainstorming tool are available at http://code.google.com/p/automotifserver/ under Apache 2.0 licensing.

Authors: D Plewczynski, S Basu and I Saha

Note: 'AMS 4.0: consensus prediction of post-translational modifications in protein sequences' by D. Plewczynski, S. Basu and I. Saha. Amino Acids 43(2):573-82 (2012).

Abstract: In this article, we categorize presently available experimental and theoretical knowledge of various physicochemical and biochemical features of amino acids, as collected in the AAindex database of 544 known amino acid (AA) indices. The previously reported 402 indices were categorized into six groups using a hierarchical clustering technique and 142 were left unclustered. However, due to the increasing diversity of the database these indices overlap, so a crisp clustering method may not provide optimal results. Moreover, in various large-scale bioinformatics analyses of whole proteomes, the proper selection of amino acid indices representing their biological significance is crucial for efficient and error-free encoding of the short functional sequence motifs. In most cases, researchers perform exhaustive manual selection of the most informative indices. These two facts motivated us to analyse the widely used AA indices. The main goal of this article is twofold. First, we present a novel method of partitioning the bioinformatics data using consensus fuzzy clustering, where the recently proposed fuzzy clustering techniques are exploited. Second, we prepare three high quality subsets of all available indices. Superiority of the consensus fuzzy clustering method is demonstrated quantitatively, visually and statistically by comparing it with the previously proposed hierarchical clustering results. The processed AAindex1 database, supplementary material and the software are available at http://sysbio.icm.edu.pl/aaindex/.

Authors: I Saha, U Maulik, S Bandyopadhyay and D Plewczynski

Note: 'Fuzzy Clustering of Physicochemical and Biochemical Properties of Amino Acids' by I. Saha, U. Maulik, S. Bandyopadhyay and D. Plewczynski. Amino Acids 43(2):583-94 (2012).
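
To illustrate the difference between crisp and fuzzy clustering raised in the abstract above, the sketch below computes membership degrees with the standard fuzzy c-means membership formula. This is a generic textbook formula chosen for illustration, not the consensus method of the paper. The function name, the fuzzifier default `m=2.0` and the toy coordinates are assumptions.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Membership of `point` in each cluster (fuzzy c-means formula).

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the
    Euclidean distance from the point to cluster centre i and m > 1
    is the fuzzifier. The memberships sum to 1.
    """
    if m <= 1.0:
        raise ValueError("fuzzifier m must be greater than 1")
    distances = [math.dist(point, c) for c in centers]

    # A point lying exactly on a centre belongs fully to that centre
    # (shared equally if several centres coincide with it).
    if any(d == 0.0 for d in distances):
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]

    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in distances)
            for d_i in distances]


if __name__ == "__main__":
    # A point twice as far from the second centre as from the first.
    print(fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)]))
```

A crisp method would assign the point only to its nearest centre, whereas the fuzzy memberships record how strongly it also belongs to the other cluster. That is the property that matters when indices overlap.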

Abstract: Clustering is an important tool for analysing the microarray data to identify groups of co-expressed genes. The problem of fuzzy clustering in microarray data motivated us to develop an improved clustering algorithm. In this paper, an improved differential evolution based fuzzy clustering technique is proposed. The performance of the proposed improved differential evolution based fuzzy clustering technique has been compared with other state-of-the-art clustering algorithms for publicly available benchmark microarray data sets. Statistical and biological significance tests have been carried out to establish the statistical superiority of the proposed clustering approach and biological relevance of clusters of co-expressed genes, respectively.

Authors: I Saha, D Plewczynski, U Maulik and S Bandyopadhyay

Note: 'Improved differential evolution for microarray analysis' by I. Saha, D. Plewczynski, U. Maulik and S. Bandyopadhyay. J. Data Mining and Bioinformatics 6(1):86-103 (2012).

Abstract: Secondary structure prediction is a crucial task for understanding the variety of protein structures and the biological functions they perform. Prediction of secondary structures for new proteins using their amino acid sequences is of fundamental importance in bioinformatics. We propose a novel technique to predict protein secondary structures based on position-specific scoring matrices (PSSMs) and physico-chemical properties of amino acids. It is a two-stage approach involving multiclass support vector machines (SVMs) as classifiers for three different structural conformations, viz., helix, sheet and coil. In the first stage, PSSMs obtained from PSI-BLAST and five specially selected physicochemical properties of amino acids are fed into SVMs as features for sequence-to-structure prediction. Confidence values for forming helix, sheet and coil that are obtained from the first-stage SVM are then used in the second-stage SVM for performing structure-to-structure prediction. The two-stage cascaded classifiers (PSP_MCSVM) are trained with proteins from the RS126 dataset. The classifiers are finally tested on target proteins of the Critical Assessment of protein Structure Prediction experiment 9 (CASP9). PSP_MCSVM with the brainstorming consensus procedure performs better than prediction servers such as Predator, DSC and SIMPA96 for randomly selected proteins from CASP9 targets. The overall performance is found to be comparable with the current state of the art. PSP_MCSVM source code, train-test datasets and supplementary files are freely available in the public domain at: http://sysbio.icm.edu.pl/secstruct and http://code.google.com/p/cmater-bioinfo/

Authors: P Chatterjee, S Basu, M Kundu, M Nasipuri, and D Plewczynski

Note: 'PSP_MCSVM: brainstorming consensus prediction of protein secondary structures using two-stage multiclass support vector machines' by P. Chatterjee, S. Basu, M. Kundu, M. Nasipuri, and D. Plewczynski. J Mol Model. 17(9):2191-201 (2011).