The goal of the Bioinformatics component of the Center is to develop effective tools for simulation and prediction of complex biological processes. This includes, in particular, genome sequence analysis. While bioinformatic problems have always been computation-intensive, the rapid rise in the amount of available data is making them also data-intensive. For example, large repositories storing thousands of (very similar) genomes of the same species are emerging. We therefore strive not only for good algorithms for simulating complex processes and performing complex searches, but also for good data representations that fit large amounts of data in highly compressed space.
We have developed efficient representations of large collections of similar sequences, building on various measures of repetitiveness: the size of the Lempel-Ziv parsing, the size of the smallest string attractor, the number of runs in the Burrows-Wheeler Transform, and others. The main achievement is that the size of those representations is proportional to the amount of unique information in the dataset, as opposed to the plain dataset size. As a result, space reductions of up to 100:1 are obtained, which allows handling much larger datasets. In the simplest case, we aim at providing direct access to any position of any sequence in the collection. We also developed a compressed filesystem adapted to managing collections of similar sequences on disk transparently, so that it can be integrated into any application without changes and deployed either on a centralized server or on the client's machine.
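As a small illustration of one such repetitiveness measure, the following Python sketch counts the runs in the Burrows-Wheeler Transform of a string. The naive rotation-sorting construction is shown for exposition only; practical tools build the BWT through suffix arrays.

```python
# Didactic sketch only: real tools build the BWT via suffix arrays,
# not by sorting all rotations (which costs O(n^2 log n) here).

def bwt(s: str) -> str:
    """Burrows-Wheeler Transform of s, with '$' as end marker."""
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def bwt_runs(s: str) -> int:
    """Number of maximal runs of equal symbols in the BWT of s."""
    b = bwt(s)
    return 1 + sum(1 for i in range(1, len(b)) if b[i] != b[i - 1])

# A highly repetitive string has very few runs relative to its length:
print(bwt_runs("ACGT" * 50))  # → 5, although the string has 200 symbols
```

The run count grows with the amount of distinct content rather than with the string length, which is what makes it a useful yardstick for compressed representations.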
The next level of functionality of such compressed representations is pattern matching, that is, finding all the places where a given short pattern appears in the sequences. We have obtained a number of so-called text indexes that support optimal or near-optimal pattern matching within highly compressed space. On top of that functionality, more sophisticated representations have been developed that support document retrieval, for example, finding the sequences where a pattern appears most often. Another sophisticated functionality is the simulation of suffix trees, which enables pattern mining operations such as discovering sufficiently long frequent substrings in the collection. In all those cases we have obtained novel results of both theoretical and practical significance, and we are now integrating the best algorithms into public software used for managing large collections of reads.
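To make the pattern-matching functionality concrete, the sketch below uses a plain (uncompressed) suffix array with binary search; it answers the same occurrence queries that the compressed indexes mentioned above support in far less space. This is a didactic stand-in, not the implementation of those indexes.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text: str) -> list[int]:
    """Starting positions of all suffixes, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All starting positions of pattern in text, by binary-searching
    the sorted suffixes (materialized here only for clarity)."""
    suffixes = [text[i:] for i in sa]
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\uffff")  # sentinel > any base
    return sorted(sa[lo:hi])

text = "GATTACAGATTACA"
sa = suffix_array(text)
print(occurrences(text, sa, "ATTA"))  # → [1, 8]
```

All occurrences of a pattern occupy a contiguous interval of the suffix array, which is why two binary searches suffice; compressed indexes locate the same interval without storing the suffixes explicitly.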
Recent research on metagenomic sequencing is leaning towards graph representations that encompass all the genetic variation arising in a population; these graphs are then used for various kinds of analyses. We have also developed space-efficient representations of the overlap graph and the de Bruijn graph of a collection of reads.
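A minimal sketch of the underlying structure (not of our space-efficient representation): a node-centric de Bruijn graph built from a read set, where each k-mer induces an edge from its length-(k-1) prefix to its length-(k-1) suffix.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph

# Overlapping reads contribute the same edges only once:
g = de_bruijn(["ACGTC", "CGTCA"], k=3)
print({u: sorted(vs) for u, vs in sorted(g.items())})
# → {'AC': ['CG'], 'CG': ['GT'], 'GT': ['TC'], 'TC': ['CA']}
```

Because identical k-mers from different reads collapse onto the same edge, the graph size is governed by the distinct genetic content rather than by sequencing depth, which is what makes compressed representations of it attractive.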
On the side of simulation and prediction, an important achievement was the development of Gdrift++, a simulation package for population and evolutionary genetics based on approximate Bayesian computation (ABC). Gdrift++ performs multiple forward-in-time simulations. We investigated the application of metaheuristics designed to reduce the computational effort without degrading the quality of the results.
We also developed a parallel algorithm for the physical characterization of small non-coding RNAs in bacterial genomes. The algorithm proceeds by calculating a Z-score, which requires significant computational effort; it exploits multi-core architectures to speed up the calculation of the minimum folding energy of each RNA segment, these energies being the ingredients needed to compute the Z-score.
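The Z-score step can be sketched as follows. Here `folding_energy` is a toy stand-in for a real minimum-folding-energy computation (in practice obtained from an RNA folding tool), the shuffling is a simplification of the dinucleotide-preserving shuffles normally used, and the concurrent map merely indicates where the parallelism pays off.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, stdev

def folding_energy(seq: str) -> float:
    """Toy stand-in that rewards adjacent complementary bases; a real
    pipeline would call an RNA folding tool here."""
    pairs = {"AU", "UA", "GC", "CG", "GU", "UG"}
    return -sum(1.0 for a, b in zip(seq, seq[1:]) if a + b in pairs)

def z_score(seq: str, shuffles: int = 100, seed: int = 0) -> float:
    """Z-score of seq's energy against shuffled controls."""
    rng = random.Random(seed)
    def control(_):
        s = list(seq)
        rng.shuffle(s)
        return folding_energy("".join(s))
    # The controls are independent, so their energies can be computed
    # concurrently; this is where multi-core hardware helps.
    with ThreadPoolExecutor() as pool:
        energies = list(pool.map(control, range(shuffles)))
    return (folding_energy(seq) - mean(energies)) / stdev(energies)

z = z_score("GCGCGCGCAUAUAUAU")
print(z < 0)  # a structured sequence folds better than its shuffles
```

A strongly negative Z-score flags a segment whose folding energy is lower than expected by chance, the usual signal for a structured non-coding RNA candidate.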
In addition, we have developed new effective techniques for predicting protein complexes and signaling pathways by combining information on dense subgraphs in protein-protein interaction (PPI) networks with information on gene expression and phenotypes.
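As an illustration of the dense-subgraph ingredient (a generic technique, not our exact method), greedy peeling finds a subgraph of high average degree by repeatedly removing the minimum-degree node and keeping the densest intermediate subgraph; this is the classical 2-approximation for the density |E|/|V|.

```python
def densest_subgraph(edges):
    """Greedy peeling: return the vertex set maximizing |E|/|V|
    among the subgraphs seen while removing min-degree vertices."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    def density(a):  # |E|/|V| = (sum of degrees) / (2 |V|)
        return sum(len(nb) for nb in a.values()) / (2 * len(a)) if a else 0.0
    best, best_d = set(adj), density(adj)
    while len(adj) > 1:
        u = min(adj, key=lambda x: len(adj[x]))  # minimum-degree vertex
        for v in adj.pop(u):
            adj[v].discard(u)
        d = density(adj)
        if d > best_d:
            best, best_d = set(adj), d
    return best

# A 4-clique with a pendant node: peeling strips the pendant first.
edges = [("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
print(sorted(densest_subgraph(edges)))  # → ['A', 'B', 'C', 'D']
```

In a PPI setting the recovered dense core is a candidate protein complex, which is then filtered against gene-expression and phenotype evidence as described above.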
In the last couple of years we have also developed new automated workflows to calculate protein properties, which are being used to predict the effect of mutations in engineered enzymes, as well as mathematical methods to represent a peptide sequence in a multidimensional Fast Fourier Transform space of transformed physicochemical characteristics. Finally, we have started to develop informatics tools for the analysis of high-dimensional datasets using data mining and pattern recognition techniques, implemented in Python and as web-based tools, which are especially useful for users who have large volumes of data to analyze but little informatics, mathematical, or statistical background.
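The peptide-to-spectrum idea can be sketched as follows: encode each residue by a physicochemical value (here the Kyte-Doolittle hydropathy scale, one of several possible property scales; only a subset of residues is listed) and take the magnitude spectrum of the resulting numeric signal. A naive DFT keeps the example dependency-free; in practice one would use NumPy's FFT.

```python
import cmath

# Kyte-Doolittle hydropathy values (subset of the 20 residues):
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4,
    "I": 4.5, "K": -3.9, "L": 3.8, "S": -0.8, "T": -0.7, "V": 4.2,
}

def peptide_spectrum(peptide: str) -> list[float]:
    """Magnitude of the DFT of the residue-property signal."""
    x = [HYDROPATHY[aa] for aa in peptide]
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

spec = peptide_spectrum("ALIVE")
print(round(spec[0], 1))  # → 10.8, the |sum| of the property values
```

Periodic patterns in a property, such as the hydrophobic periodicity of an amphipathic helix, show up as peaks at the corresponding nonzero frequencies, which is what makes such spectra useful features for prediction.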