The iCell project proposes the development of a novel computational modelling framework for multiscale analysis of tumor growth. The framework builds on the idea that information processing in a population of cells is a driving force for pathogenicity and tumor growth. Our research concerns the modeling of transitions between three scales: metabolic, signaling, and the tumor cell population. iCell is a variant of the simulation performed with the Timo_2 simulator, which describes the behavior of Her2-positive breast cancer cells. To build the tumor model we will use existing models and data-integration tools (Recon2, Timothy, iMate, QSSPN), significantly improved for the project. The result will be a unique model that allows the transition from the level of single-cell metabolic and signaling pathways to the higher level of a complex population of tumor cells.

Weronika Wronowska
researcher
MSc

Grzegorz Bokota
researcher
MSc

Anna Maria Rusek
intern
PhD student

Denis Kazakievich
researcher
PhD student

Paulina Urban
researcher
PhD student

Abstract: Here, we present two perspectives on the task of predicting post translational modifications (PTMs) from local sequence fragments using machine learning algorithms. The first is the description of the fundamental steps required to construct a PTM predictor from the very beginning. These steps include data gathering, feature extraction, or machine-learning classifier selection. The second part of our work contains the detailed discussion of more advanced problems which are encountered in PTM prediction task. Probably the most challenging issues which we have covered here are: (1) how to address the training data class imbalance problem (we also present statistics describing the problem); (2) how to properly set up cross-validation folds with an approach which takes into account the homology of protein data records, to address this problem we present our folds-over-clusters algorithm; and (3) how to efficiently reach for new sources of learning features. Presented techniques and notes resulted from intense studies in the field, performed by our and other groups, and can be useful both for researchers beginning in the field of PTM prediction and for those who want to extend the repertoire of their research techniques.

Authors: Tatjewski M, Kierczak M, Plewczynski D

Note: 'Predicting Post-Translational Modifications from Local Sequence Fragments Using Machine Learning Algorithms: Overview and Best Practices.' by Tatjewski M, Kierczak M, Plewczynski D. Methods Mol Biol. 2017;1484:275-300.
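
The abstract above mentions setting up cross-validation folds so that homologous protein records are not split between training and test data. The following is a minimal sketch of that general idea, not the published folds-over-clusters algorithm: it assumes each record already has a homology cluster label (the `cluster_of` mapping, the record names, and the greedy balancing rule are all illustrative assumptions) and places whole clusters into folds.

```python
from collections import defaultdict


def cluster_aware_folds(cluster_of, n_folds):
    """Assign whole homology clusters to folds so no cluster is split.

    cluster_of: dict mapping a record id to its (hypothetical) cluster label.
    Returns a list of n_folds lists of record ids.
    """
    members = defaultdict(list)
    for record, cluster in cluster_of.items():
        members[cluster].append(record)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing (an assumption of this sketch): place the largest
    # clusters first, each into the fold that currently holds fewest records.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[cluster])
    return folds


# Made-up example: seven records grouped into four clusters.
cluster_of = {
    "P1": "c1", "P2": "c1", "P3": "c1",
    "P4": "c2", "P5": "c2",
    "P6": "c3",
    "P7": "c4",
}
folds = cluster_aware_folds(cluster_of, 3)
```

Each fold can then serve once as the test set while the remaining folds form the training set; because clusters are never divided, close homologs of a test record do not leak into training.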

Abstract: The influenza virus type A (IVA) is an important pathogen which is able to cause annual epidemics and even pandemics. This fact is the consequence of the antigenic shifts and drifts capabilities of IVA, caused by the high mutation rate and the reassortment capabilities of the virus. The hemagglutinin (HA) protein constitutes the main IVA antigen and has a crucial role in the infection mechanism, being responsible for the recognition of host-specific sialic acid derivatives. Despite the relative abundance of HA sequence and serological studies, comparative structure-based analysis of HA are less investigated. The 3DFlu database contains well annotated HA representatives: 1192 models and 263 crystallographic structures. The relations between these proteins are defined using different metrics and are visualized as a network in the provided web interface. Moreover structural and sequence comparison of the proteins can be explored. Metadata information (e.g. protein identifier, IVA strain, year and location of infection) can enhance the exploration of the presented data. With our database researchers gain a useful tool for the exploration of high quality HA models, viewing and comparing changes in the HA viral subtypes at several information levels (sequence, structure, ESP). The complete and integrated view of those relations might be useful to determine the efficiency of transmission, pathogenicity and for the investigation of evolutionary tendencies of the influenza virus. Database URL: http://nucleus3d.cent.uw.edu.pl/influenza.

Authors: Mazzocco G, Lazniewski M, Migdał P, Szczepińska T, Radomski JP, Plewczynski D

Note: '3DFlu: database of sequence and structural variability of the influenza hemagglutinin at population scale.' by Mazzocco G, Lazniewski M, Migdał P, Szczepińska T, Radomski JP, Plewczynski D. Database (Oxford). 2016 Oct 2;2016. pii: baw130.

Abstract: Motivation: Accurate and effective dendritic spine segmentation from the dendrites remains as a challenge for current neuroimaging research community. In this paper, we present a new method (2dSpAn) for 2-d segmentation, classification and analysis of structural/plastic changes of hippocampal dendritic spines. A user interactive segmentation method with convolution kernels is designed to segment the spines from the dendrites. Formal morphological definitions are presented to describe key attributes related to the shape of segmented spines. Spines are automatically classified into one of four classes: Stubby, Filopodia, Mushroom and Spine-head Protrusions. Results: The developed method is validated using confocal light microscopy images of dendritic spines from dissociated hippocampal cultures for: 1) quantitative analysis of spine morphological changes, 2) reproducibility analysis for assessment of user-independence of the developed software, 3) accuracy analysis with respect to the manually labeled ground truth images, and also with respect to the available state-of-the-art. The developed method is monitored and used to precisely describe the morphology of individual spines in real-time experiments, i.e., consequent images of the same dendritic fragment. Availability: The software and the source code are available at https://sites.google.com/site/2dspan/ under open-source license for non-commercial use.

Authors: Subhadip Basu, Dariusz Plewczynski, Satadal Saha, Matylda Roszkowska, Marta Magnowska, Ewa Baczynska and Jakub Wlodarczyk

Note: '2dSpAn: semiautomated 2-d segmentation, classification and analysis of hippocampal dendritic spine plasticity' by Subhadip Basu, Dariusz Plewczynski, Satadal Saha, Matylda Roszkowska, Marta Magnowska, Ewa Baczynska and Jakub Wlodarczyk. Bioinformatics 2016 doi: 10.1093/bioinformatics/btw172 First published online: April 1, 2016

Abstract: ChIA-PET and Hi-C are high throughput versions of 3C-based mapping technologies that reveal long-range chromatin interactions and provide insights into the basic principles of spatial genome organization and gene regulation. Recently, we showed that a single ChIA-PET experiment provides information at all genomic scales of interest, from the high resolution locations of binding sites and enriched chromatin interactions mediated by specific protein factors, to the low resolution non-enriched interactions that reflect topological neighborhoods of higher-order associations. This multilevel nature of ChIA-PET data offers us an opportunity to use multiscale 3D models to study structural-functional relationships at multiple length scales, but doing so requires a structural modeling platform, which takes advantage of the full range of ChIA-PET data. Here we report 3D-NOME (3-Dimensional NucleOme Modeling Engine), a complete computational pipeline for processing and analyzing ChIA-PET data. 3D-NOME consists of three integrated tools: a graph-distance-based heatmap normalization tool, a 3D modeling platform, and an interactive 3D visualization tool. We use ChIA-PET and Hi-C data of human B-lymphocytes to demonstrate the effectiveness of 3D-NOME in building 3D genome models at multiple levels, including the entire nucleome, individual chromosomes, and specific segments at megabase (Mb) and kilobase (kb) resolutions. Our simulation protocol generates a single average structure or an ensemble of structures. We incorporate CTCF-motif orientation and high-resolution looping patterns in order to achieve more reliable, biologically plausible structures.

Authors: Szałaj P, Tang Z, Michalski P, Pietal MJ, Luo OJ, Sadowski M, Li X, Radew K, Ruan Y, Plewczynski D

Note: 'An integrated 3-dimensional genome modeling engine for data-driven simulation of spatial genome organization.' by Szałaj P, Tang Z, Michalski P, Pietal MJ, Luo OJ, Sadowski M, Li X, Radew K, Ruan Y, Plewczynski D. Genome Res. 2016 Oct 27. pii: gr.205062.116. [Epub ahead of print]

Abstract: Accurate identification of protein–protein interactions (PPI) is the key step in understanding proteins’ biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features from protein sequences, however only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach for prediction of protein–protein interactions. We start with the carefully filtered data on protein complexes available for Saccharomyces cerevisiae in the Protein Data Bank (PDB) database. First, we build linear descriptions of interacting and non-interacting sequence segment pairs based on their inter-residue distances. Secondly, we train machine learning classifiers to predict binary segment interactions for any two short sequence fragments. The final prediction of the protein–protein interaction is done using the 2D matrix representation of all-against-all possible interacting sequence segments of both analysed proteins. The level-I predictor achieves 0.88 AUC for micro-scale, i.e., residue-level prediction. The level-II predictor improves the results further by a more complex learning paradigm. We perform 30-fold macro-scale, i.e., protein-level cross-validation experiment. The level-II predictor using PSIPRED-predicted secondary structure reaches 0.70 precision, 0.68 recall, and 0.70 AUC, whereas other popular methods provide results below 0.6 threshold (recall, precision, AUC). Our results demonstrate that multi-scale sequence features aggregation procedure is able to improve the machine learning results by more than 10% as compared to other sequence representations. Prepared datasets and source code for our experimental pipeline are freely available for download from: http://zubekj.github.io/mlppi/ (open source Python implementation, OS independent).

Authors: Zubek J, Tatjewski M, Boniecki A, Mnich M, Basu S, Plewczynski D

Note: 'Multi-level machine learning prediction of protein-protein interactions in Saccharomyces cerevisiae' by Zubek J, Tatjewski M, Boniecki A, Mnich M, Basu S, Plewczynski D. PeerJ. 2015 Jul 2;3:e1041. doi: 10.7717/peerj.1041.
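
The abstract above describes a 2D matrix of all-against-all sequence segment pairs from the two analysed proteins. The sketch below only illustrates that data structure. The window width, the example sequences, and in particular `toy_score` are assumptions made for illustration; `toy_score` is a placeholder standing in for a trained segment-level classifier and is not the method from the paper.

```python
def segments(sequence, width):
    """Return all overlapping windows of the given width."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


def segment_pair_matrix(seq_a, seq_b, width, score):
    """Score every segment of seq_a against every segment of seq_b."""
    rows = segments(seq_a, width)
    cols = segments(seq_b, width)
    return [[score(a, b) for b in cols] for a in rows]


def toy_score(seg_a, seg_b):
    """Placeholder scorer: fraction of aligned positions with opposite charge.

    Purely illustrative; a real pipeline would call a trained classifier here.
    """
    positive, negative = set("KR"), set("DE")
    hits = sum(
        (x in positive and y in negative) or (x in negative and y in positive)
        for x, y in zip(seg_a, seg_b)
    )
    return hits / len(seg_a)


# Made-up sequences; rows index segments of the first, columns of the second.
matrix = segment_pair_matrix("MKRDE", "DEKR", 3, toy_score)
```

A protein-level predictor would then take this matrix, or features aggregated from it, as its input.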

Abstract: Bacteria are increasingly resistant to existing antibiotics, which target a narrow range of pathways. New methods are needed to identify targets, including repositioning targets among distantly related species. We developed a novel combination of systems and structural modeling and bioinformatics to reposition known antibiotics and targets to new species. We applied this approach to Mycoplasma genitalium, a common cause of urethritis. First, we used quantitative metabolic modeling to identify enzymes whose expression affects the cellular growth rate. Second, we searched the literature for inhibitors of homologs of the most fragile enzymes. Next, we used sequence alignment to assess that the binding site is shared by M. genitalium, but not by humans. Lastly, we used molecular docking to verify that the reported inhibitors preferentially interact with M. genitalium proteins over their human homologs. Thymidylate kinase was the top predicted target and piperidinylthymines were the top compounds. Further work is needed to experimentally validate piperidinylthymines. In summary, combined systems and structural modeling is a powerful tool for drug repositioning.

Authors: Kazakiewicz D, Karr JR, Langner KM, Plewczynski D

Note: 'A combined systems and structural modeling approach repositions antibiotics for Mycoplasma genitalium' by Kazakiewicz D, Karr JR, Langner KM, Plewczynski D. Comput Biol Chem. 2015 Jul 30. pii: S1476-9271(15)30089-X. doi: 10.1016/j.compbiolchem.2015.07.007.

Abstract: Whole-cell models that explicitly represent all cellular components at the molecular level have the potential to predict phenotype from genotype. However, even for simple bacteria, whole-cell models will contain thousands of parameters, many of which are poorly characterized or unknown. New algorithms are needed to estimate these parameters and enable researchers to build increasingly comprehensive models. We organized the Dialogue for Reverse Engineering Assessments and Methods (DREAM) 8 Whole-Cell Parameter Estimation Challenge to develop new parameter estimation algorithms for whole-cell models. We asked participants to identify a subset of parameters of a whole-cell model given the model’s structure and in silico “experimental” data. Here we describe the challenge, the best performing methods, and new insights into the identifiability of whole-cell models. We also describe several valuable lessons we learned toward improving future challenges. Going forward, we believe that collaborative efforts supported by inexpensive cloud computing have the potential to solve whole-cell model parameter estimation.

Authors: Karr JR, Williams AH, Zucker JD, Raue A, Steiert B, Timmer J, Kreutz C, DREAM8 Parameter Estimation Challenge Consortium, Wilkinson S, Allgood BA, Bot BM, Hoff BR, Kellen MR, Covert MW, Stolovitzky GA, Meyer P

Note: 'Summary of the DREAM8 Parameter Estimation Challenge: Toward Parameter Identification for Whole-Cell Models' by Karr JR, Williams AH, Zucker JD, Raue A, Steiert B, Timmer J, Kreutz C; DREAM8 Parameter Estimation Challenge Consortium, Wilkinson S, Allgood BA, Bot BM, Hoff BR, Kellen MR, Covert MW, Stolovitzky GA, Meyer P. PLoS Comput Biol. 2015 May 28;11(5):e1004096. doi: 10.1371/journal.pcbi.1004096.

Abstract: Motivation: To date, only a few distinct successful approaches have been introduced to reconstruct a protein 3D structure from a map of contacts between its amino acid residues (a 2D contact map). Current algorithms can infer structures from information-rich contact maps that contain a limited fraction of erroneous predictions. However, it is difficult to reconstruct 3D structures from predicted contact maps that usually contain a high fraction of false contacts. Results: We describe a new, multi-step protocol that predicts protein 3D structures from the predicted contact maps. The method is based on a novel distance function acting on a fuzzy residue proximity graph, which predicts a 2D distance map from a 2D predicted contact map. The application of a Multi-Dimensional Scaling algorithm transforms that predicted 2D distance map into a coarse 3D model, which is further refined by typical modeling programs into an all-atom representation. We tested our approach on contact maps predicted de novo by MULTICOM, the top contact map predictor according to CASP10. We show that our method outperforms FT-COMAR, the state-of-the-art method for 3D structure reconstruction from 2D maps. For all predicted 2D contact maps of relatively low sensitivity (60–84%), GDFuzz3D generates more accurate 3D models, with the average improvement of 4.87 Å in terms of RMSD. Availability: GDFuzz3D server and standalone version are freely available at http://iimcb.genesilico.pl/gdserver/GDFuzz3D/.

Authors: Michal J Pietal, Janusz M Bujnicki, and Lukasz P Kozlowski

Note: 'GDFuzz3D: a method for protein 3D structure reconstruction from contact maps, based on a non-Euclidean distance function' by Michal J. Pietal, Janusz M. Bujnicki, and Lukasz P. Kozlowski. Bioinformatics 2015 doi: 10.1093/bioinformatics/btv390 First published online: June 30, 2015
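
The abstract above describes deriving a 2D distance map from a 2D contact map by means of a distance function on a residue proximity graph. As a rough illustration of the graph-distance idea only, the sketch below treats a binary contact map as an unweighted graph and computes shortest-path hop counts with breadth-first search. This plain hop count is a stand-in chosen for simplicity; it is not the fuzzy, non-Euclidean distance function of GDFuzz3D, and the example contact map is made up.

```python
from collections import deque


def hop_distance_map(contacts):
    """Shortest-path hop counts between residues in a binary contact map.

    contacts: symmetric square matrix of 0/1 values.
    Unreachable pairs are left as None.
    """
    n = len(contacts)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if contacts[i][j] and dist[start][j] is None:
                    dist[start][j] = dist[start][i] + 1
                    queue.append(j)
    return dist


# Made-up contact map: five residues in contact around a ring.
contacts = [
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
]
distances = hop_distance_map(contacts)
```

A distance map of this kind is what an embedding step such as multidimensional scaling would take as input to produce coarse 3D coordinates.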

Abstract: Class II human leukocyte antigens (HLA II) are proteins involved in the human immunological adaptive response by binding and exposing some pre-processed, non-self peptides in the extracellular domain in order to make them recognizable by the CD4+ T lymphocytes. However, the understanding of HLA-peptide binding interaction is a crucial step for designing a peptide-based vaccine because the high rate of polymorphisms in HLA class II molecules creates a big challenge, even though the HLA II proteins can be grouped into supertypes, where members of different class bind a similar pool of peptides. Hence, first we performed the supertype classification of 27 HLA II proteins using their binding affinities and structural-based linear motifs to create a stable group of supertypes. For this purpose, a well-known clustering method was used, and then, a consensus was built to find the stable groups and to show the functional and structural correlation of HLA II proteins. Thus, the overlap of the binding events was measured, confirming a large promiscuity within the HLA II-peptide interactions. Moreover, a very low rate of locus-specific binding events was observed for the HLA-DP genetic locus, suggesting a different binding selectivity of these proteins with respect to HLA-DR and HLA-DQ proteins. Secondly, a predictor based on a support vector machine (SVM) classifier was designed to recognize HLA II-binding peptides. The efficiency of prediction was estimated using precision, recall (sensitivity), specificity, accuracy, F-measure, and area under the ROC curve values of random subsampled dataset in comparison with other supervised classifiers. Also the leave-one-out cross-validation was performed to establish the efficiency of the predictor. The HLA II-peptide interaction dataset, HLA II-binding motifs, high-quality amino acid indices, peptide dataset for SVM training, and MATLAB code of the predictor are available at http://sysbio.icm.edu.pl/HLA.

Authors: I Saha, G Mazzocco and D Plewczynski

Note: 'Consensus classification of Human Leukocyte Antigens class II proteins' by I. Saha, G. Mazzocco and D. Plewczynski. Immunogenetics 65(2):97-105 (2013).
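
The abstract above lists precision, recall (sensitivity), specificity, accuracy, and F-measure as evaluation measures. For readers less familiar with them, the sketch below computes each from the four confusion-matrix counts using the standard definitions. The counts in the example are invented for illustration and are not results from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts.

    tp/fp/tn/fn: true positives, false positives, true negatives,
    false negatives. Assumes no denominator below is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Invented counts for 100 peptides: 40 binders and 60 non-binders.
metrics = binary_metrics(tp=30, fp=10, tn=50, fn=10)
```

Reporting several of these together matters when classes are unbalanced, since accuracy alone can look high while recall on the minority class is poor.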

Abstract: Protein-protein interactions (PPI) control most of the biological processes in a living cell. In order to fully understand protein functions, a knowledge of protein-protein interactions is necessary. Prediction of PPI is challenging, especially when the three-dimensional structure of interacting partners is not known. Recently, a novel prediction method was proposed by exploiting physical interactions of constituent domains. We propose here a novel knowledge-based prediction method, namely PPI_SVM, which predicts interactions between two protein sequences by exploiting their domain information. We trained a two-class support vector machine on the benchmarking set of pairs of interacting proteins extracted from the Database of Interacting Proteins (DIP). The method considers all possible combinations of constituent domains between two protein sequences, unlike most of the existing approaches. Moreover, it deals with both single-domain proteins and multi domain proteins; therefore it can be applied to the whole proteome in high-throughput studies. Our machine learning classifier, following a brainstorming approach, achieves accuracy of 86%, with specificity of 95%, and sensitivity of 75%, which are better results than most previous methods that sacrifice recall values in order to boost the overall precision. Our method has on average better sensitivity combined with good selectivity on the benchmarking dataset. The PPI_SVM source code, train/test datasets and supplementary files are available freely in the public domain at: http://code.google.com/p/cmater-bioinfo/.

Authors: P Chatterjee, S Basu, M Kundu, M Nasipuri, and D Plewczynski

Note: 'PPI_SVM: prediction of protein-protein interactions using machine learning, domain-domain affinities and frequency tables' by P. Chatterjee, S. Basu, M. Kundu, M. Nasipuri, and D. Plewczynski. Cell Mol Biol Lett 16(2):264-78 (2011).

Abstract: Studying the interactome is one of the exciting frontiers of proteomics, as shown lately at the recent bioinformatics conferences (for example ISMB 2010, or ECCB 2010). Distribution of data is facilitated by a large number of databases. Metamining databases have been created in order to allow researchers access to several databases in one search, but there are serious difficulties for end users to evaluate the metamining effort. Therefore we suggest a new standard, 'Good Interaction Data Metamining Practice' (GIDMP), which could be easily automated and requires only very minor inclusion of statistical data on each database homepage. Widespread adoption of the GIDMP standard would provide users with: a standardized way to evaluate the statistics provided by each metamining database, thus enhancing the end-user experience; a stable contact point for each database, allowing the smooth transition of statistics; a fully automated system, enhancing time- and cost-effectiveness. The proposed information can be presented as a few hidden lines of text on the source database www page, and a constantly updated table for a metamining database included in the source/credits web page.

Authors: D Plewczynski and T Klingström

Note: 'GIDMP: good protein-protein interaction data metamining practice' by D. Plewczynski and T. Klingström. Cell Mol Biol Lett. 16(2):258-63 (2011).