Big genetic data: powerful indexing
The future of biomedical research is closely linked to a better understanding of the genome. In particular, success depends on storing, analysing, and logically linking the genetic information contained in hundreds of thousands of samples.
Portrait / project description (completed research project)
Improved technologies in biomedical research now allow an individual's entire genome to be sequenced at low cost. In this project, new technical concepts were developed for a computational system that stores tens of thousands of such genomic data sets in one place and makes them accessible for research and clinical applications. The system is built on genome graphs, a data structure that combines the information from the genome sequences with other relevant clinical or experimental data. New information can be added to genome graphs efficiently, and they combine low storage requirements with rapid searching. The research focused primarily on reducing the storage space required while retaining all the information and keeping it efficiently accessible. In this way, the data becomes FAIR: findable, accessible, interoperable, and reusable.
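To make the idea of a graph that links sequence content to sample information more concrete, the following is a minimal sketch, not the project's implementation. It assumes a k-mer-based representation in which every distinct substring of length k is stored once and annotated with the labels of the samples that contain it. The class name `AnnotatedKmerIndex`, its methods, and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


class AnnotatedKmerIndex:
    """Toy index (hypothetical): each distinct k-mer is stored once,
    together with the labels of all samples that contain it."""

    def __init__(self, k):
        self.k = k
        self.labels_by_kmer = defaultdict(set)

    def add_sample(self, label, sequence):
        # Adding a sample only touches the k-mers it contains.
        for kmer in kmers(sequence, self.k):
            self.labels_by_kmer[kmer].add(label)

    def query(self, sequence):
        """Return, per label, the fraction of the query's k-mers found in that sample."""
        query_kmers = list(kmers(sequence, self.k))
        hits = defaultdict(int)
        for kmer in query_kmers:
            # .get() avoids creating empty entries for unknown k-mers.
            for label in self.labels_by_kmer.get(kmer, ()):
                hits[label] += 1
        return {label: n / len(query_kmers) for label, n in hits.items()}


index = AnnotatedKmerIndex(k=4)
index.add_sample("sample_A", "ACGTACGT")
index.add_sample("sample_B", "ACGTTTGA")
result = index.query("ACGTAC")
print(result)
```

In this toy setting, k-mers shared between samples are stored only once, which hints at why redundancy between similar genomes can be turned into storage savings. A production system would replace the Python dictionary with compressed data structures.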
Background
Living organisms carry their construction plan in their cells’ DNA. To understand life’s processes and the causes of diseases, this information must be read, stored, and compared. The research methods used in this context often rely on statistics or machine learning, which only become informative when they incorporate information from thousands of samples. This need for enormous data sets in turn creates a need for low-cost storage and rapid data comparison, as the rapidly growing global sequencing capacity generates petabytes of new data every month.
Aim
This project aimed to develop a computational system, based on new technical concepts, capable of recording the genomic information in tens of thousands of biological or medical samples and representing it efficiently. The system allows new samples to be searched quickly and compared with existing information. It also provides room for information about each sample's origin and other data relevant for research, within a constantly growing, continuously learning information storage system.
Relevance/application
Understanding the relationship between genomic information and biological characteristics necessitates comparing a broad spectrum of this information. For example, to better understand a genetic disease or cancer, the DNA of as many patients as possible needs to be analysed. One major challenge in this context is the correct and user-friendly storage of the enormous volumes of data required. The software system developed in this project enables biomedical research to be done efficiently by providing a technical basis for this work.
Results
This research addressed the problem of rapid data growth. It developed new algorithms and data structures to efficiently compress and search sequence collections at a petabyte scale. By reducing redundancies, these methods achieve compression ratios of up to 1000-fold, making the resulting data not only more accessible but also much more cost-efficient to store.
The following results are part of this achievement:
- A tool and a modular framework (MetaGraph) to index arbitrary biomedical sequencing data on a petabase scale. The MetaGraph framework directly addresses the inaccessibility of sequences in large public archives. Many MetaGraph features build on the project's theoretical contributions to the field of sequence bioinformatics.
- An index comprising more than four million sequencing samples has been computed using this framework, and the pre-computed indexes have been made publicly available. An interactive platform has been developed to query these indexes directly via a web interface or a publicly accessible API.
- Important methodological advances in the field of sequence bioinformatics have been achieved. Several new methods have been developed for efficient compression with the additional requirement of allowing for fast query or access without decompression.
- Progress has been made on how sequence graphs can represent information, with several contributions to efficiency and alignment strategies.
- Analysis methods have been developed that can be applied to the graphs instead of the much larger and redundant input data. A concept has been introduced for the sensitive detection and assembly of marker sequences across thousands of samples given external labels. This concept has important implications for detecting clinical sequence markers (e.g., from metagenomic samples) in the absence of taxonomic information, as shown in collaboration with the Institute of Medical Virology (University of Zurich).
- Lastly, fully functional reference implementations have been made publicly accessible for all developed methods. All software code is licensed as open-source software and publicly available for download and contribution.
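The label-based marker detection mentioned in the list above can be illustrated with a small sketch. This is a simplified set-based filter written for illustration, not the method developed in the project: it assumes that samples are reduced to k-mer sets and that each sample carries an external label of either "in" (for example, cases) or "out" (for example, controls). The function names, thresholds, and toy sequences are hypothetical.

```python
def kmer_set(sequence, k):
    """Return the set of distinct substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def marker_kmers(samples, groups, k, min_in=1.0, max_out=0.0):
    """Find k-mers that are common in 'in' samples and rare in 'out' samples.

    samples: sample name -> sequence
    groups:  sample name -> "in" or "out" (external labels, assumed given)
    min_in:  minimum fraction of 'in' samples that must contain the k-mer
    max_out: maximum fraction of 'out' samples allowed to contain it
    """
    in_names = [name for name, g in groups.items() if g == "in"]
    out_names = [name for name, g in groups.items() if g == "out"]
    sets = {name: kmer_set(seq, k) for name, seq in samples.items()}

    # Only k-mers seen in at least one 'in' sample can be markers.
    candidates = set().union(*(sets[name] for name in in_names))
    markers = set()
    for kmer in candidates:
        frac_in = sum(kmer in sets[n] for n in in_names) / len(in_names)
        frac_out = (
            sum(kmer in sets[n] for n in out_names) / len(out_names)
            if out_names else 0.0
        )
        if frac_in >= min_in and frac_out <= max_out:
            markers.add(kmer)
    return markers


samples = {
    "case_1": "AAGCTTGA",
    "case_2": "CCGCTTGG",
    "control_1": "AAGCAAGA",
}
groups = {"case_1": "in", "case_2": "in", "control_1": "out"}
markers = marker_kmers(samples, groups, k=4)
print(sorted(markers))
```

Because the filter works on k-mer membership rather than on the raw reads, the same logic could in principle be applied to an index instead of the much larger input data, and it needs no taxonomic information about the samples.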
Original title
Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation