Bioinformatics databases: queries in natural language

Complex bioinformatics databases hold enormous amounts of knowledge that can only be retrieved with technical know-how. The goal of this project is to develop an intuitive, Google-like search function designed to help identify new correlations in the stored data.

  • Portrait / project description (completed research project)

    The project's task is comparable to translating one language into another. To continue the analogy, the languages used to retrieve bioinformatics data can be compared with Esperanto and Latin: with little or no command of them, communication with the system is laborious and only limited bioscientific findings can be obtained. The objective of Bio-SODA (Search Over DAta Warehouse for Biology) is to convert intuitive search terms into complex search queries.
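
    To make the idea of converting search terms into a query concrete, the following is a minimal sketch, not Bio-SODA's actual method. It assumes SPARQL as the target query language and uses a hand-written lookup table; the `ex:` prefix, the class and property names, and the function `keywords_to_sparql` are all hypothetical and exist only for illustration.

```python
from typing import Dict, List

# Hypothetical vocabulary (an assumption for illustration only): each plain
# search term maps to one SPARQL triple pattern over a made-up "ex:" schema.
TERM_PATTERNS: Dict[str, str] = {
    "gene": "?gene a ex:Gene .",
    "human": "?gene ex:inTaxon ex:Human .",
    "expressed in brain": "?gene ex:expressedIn ex:Brain .",
}


def keywords_to_sparql(keywords: List[str]) -> str:
    """Combine the triple patterns of all search terms into one SELECT query."""
    patterns = []
    for keyword in keywords:
        key = keyword.strip().lower()  # normalise case and surrounding spaces
        if key not in TERM_PATTERNS:
            raise KeyError(f"no mapping for search term: {keyword!r}")
        patterns.append(TERM_PATTERNS[key])
    body = "\n  ".join(patterns)
    return (
        "PREFIX ex: <http://example.org/>\n"
        "SELECT DISTINCT ?gene WHERE {\n"
        f"  {body}\n"
        "}"
    )


if __name__ == "__main__":
    # Three intuitive terms become one structured query.
    print(keywords_to_sparql(["gene", "human", "expressed in brain"]))
```

    Even in this toy form, the user supplies three everyday terms while the generated query has to know prefixes, classes and properties. A realistic system would have to derive such mappings from the databases themselves rather than from a fixed table.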

  • Background

    Rapid advances in DNA sequencing are transforming the biosciences into a highly data-intensive discipline. Vast quantities of bioinformatics data are stored in complex databases that are built on powerful technologies but demand a great deal of IT expertise to query. New search technologies are needed to analyse dozens of bioinformatics databases efficiently.

  • Aim

    The present project is developing novel Google-like search options that allow researchers to query databases intuitively and concentrate on scientific questions.

  • Relevance/application

    Bio-SODA makes it easier to query huge quantities of bioinformatics data. The program also makes search suggestions in order to display information that has not been expressly sought. We aim to achieve easier access to knowledge and thus gain faster insights into biological correlations that may still be unknown.

  • Results

    Before the Bio-SODA project started, accessing the major bioinformatics databases required end users to be proficient in the query language SPARQL and to know the underlying structure of the databases. Since most end users lacked these skills, they either could not query the wealth of information sources effectively or needed help from a few specialists to access their data. This process was time-consuming and inefficient, since researchers' precious time was spent on data wrangling rather than on scientific research.

    The Bio-SODA project successfully laid the foundations for applying the developed system and the research approach well beyond the life sciences. For instance, Bio-SODA is now also applied in the project INODE (Intelligent Open Data Exploration), funded by the European Union’s Horizon 2020 programme. The goal of Bio-SODA in INODE is to enable natural language querying across datasets from three different scientific domains, namely Cancer Biomarker Research, Research and Innovation Policy Making, and Astrophysics.

    Three main messages:

    1) Digitalisation efforts across all fields of knowledge have advanced rapidly in recent years. However, to achieve the full potential of digitalisation, which is to empower domain experts to routinely extract insights and scientific findings from big data, we have to improve data sharing and integration and provide user-friendly interfaces for querying this data.

    2) The Bio-SODA project has shown how bioinformatics datasets from traditionally disconnected fields of comparative genomics can be made interoperable. The project illustrated through real-world use cases the benefits of data integration in enabling more powerful semantic queries than previously possible.

    3) Bio-SODA made a significant contribution towards talking to databases almost as one would to a human, by enabling intuitive natural language access to complex bioinformatics databases, while also highlighting the considerable potential for further improvement in performing complex queries across multiple resources.

  • Original title

    Bio-SODA: Enabling Complex, Semantic Queries to Bioinformatics Databases through Intuitive Searching over Data