Loosely structured data: new tools for integration
Integrating loosely structured data is currently an expensive and tedious task because it has proved very difficult to automate. This project aimed to develop new techniques for the efficient, automatic integration of data taken, for example, from the Web or social networks.
Portrait / project description (completed research project)
This project was divided into two parts. The first part consisted in developing and then testing new techniques for automatically characterising the available data, understanding the relationships between pieces of data and modelling their value distributions. In the second part, this information was used to facilitate the analysis and integration of the available data. This required new techniques capable of creating data patterns on demand and providing abstraction layers. The ultimate goal was to provide processes that allow data sets to be easily combined while preserving their specific features and history.
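To illustrate the first, profiling part, the following is a minimal sketch that only characterises fields and their value distributions over a handful of hypothetical micropost-like records; it is an assumption-laden illustration, not the project's actual profiling pipeline.

```python
from collections import Counter, defaultdict

# Hypothetical micropost-like records with inconsistent, partly missing fields.
records = [
    {"user": "alice", "text": "traffic jam downtown", "lang": "en", "retweets": 3},
    {"user": "bob", "text": "Stau auf der A1", "lang": "de"},
    {"user": "carol", "text": "heavy rain", "retweets": 0, "geo": (46.8, 7.1)},
]

def profile(records):
    """Per field, collect how often it occurs, which types appear,
    and the distribution of its observed values."""
    stats = defaultdict(lambda: {"count": 0, "types": Counter(), "values": Counter()})
    for rec in records:
        for field, value in rec.items():
            s = stats[field]
            s["count"] += 1
            s["types"][type(value).__name__] += 1
            s["values"][repr(value)] += 1
    return stats

for field, s in profile(records).items():
    coverage = s["count"] / len(records)  # fraction of records defining the field
    print(f"{field}: coverage={coverage:.0%}, types={dict(s['types'])}")
```

Such field-level statistics are what the second part of the project could then build on when deciding how heterogeneous data sets fit together.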
Background
One of the cornerstones of Big Data is combining several sources of information in order to model a specific phenomenon. Most current methods are based on the analysis of data patterns, and in particular on the metadata that unambiguously defines the structure of the information to be combined. In practice, however, these patterns often turn out to be incomplete, e.g. for data originating from social networks or the Web. Because such data currently cannot be combined automatically, experts have no choice but to prepare and integrate it manually. The resulting loss of time is one of the major problems of Big Data.
Aim
The aim of this project was to devise new techniques for the automatic or semi-automatic integration of data. Because the data structure is often not defined in advance, the central challenge for this research was to understand it retrospectively, by reconstructing patterns using the available data.
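To make the idea of reconstructing patterns retrospectively more concrete, here is a minimal sketch that proposes a schema from a collection of heterogeneous records. The example records, the `required_threshold` parameter and the notion of a "required" field are illustrative assumptions, not the project's algorithms.

```python
from collections import Counter

def reconstruct_schema(records, required_threshold=0.8):
    """Propose a schema retrospectively: a field is 'required' if it appears in
    at least `required_threshold` of the records, 'optional' otherwise; its type
    is the most frequent Python type observed for that field."""
    presence, types = Counter(), {}
    for rec in records:
        for field, value in rec.items():
            presence[field] += 1
            types.setdefault(field, Counter())[type(value).__name__] += 1
    return {
        field: {
            "type": types[field].most_common(1)[0][0],
            "required": count / len(records) >= required_threshold,
        }
        for field, count in presence.items()
    }

# Hypothetical loosely structured records, e.g. scraped from the Web.
records = [
    {"title": "Smart cities report", "year": 2019, "tags": ["urban", "data"]},
    {"title": "Personalised healthcare", "year": "2020"},
    {"title": "E-science survey", "author": "D. Example"},
]
print(reconstruct_schema(records))
```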
Relevance/application
This project is particularly important because of the disproportion between the ever-increasing volume of available data and the limited time analysts have to process it. Its results will help to substantially speed up the process of turning raw data into models and visualisations. Numerous fields that require the combination of heterogeneous data sets (e.g. smart cities, personalised healthcare and e-science) stand to benefit from new integration methods, resulting in more powerful analyses and models.
Results
The project resulted in a number of novel, next-generation integration algorithms, as well as several deployments over real data. In particular, new techniques were developed to integrate and query microposts [TKDE2021], along with new human-in-the-loop methods to analyse them [AAAI2020]. Significant progress was also made on graph integration and analysis: the project produced new embedding techniques that are one to two orders of magnitude faster than previous approaches [KDD2019], as well as new imputation techniques that improve the quality of the knowledge graphs used for data integration [WWW2021]. Finally, approaches for two real use cases were developed: one to analyse and integrate PDFs for the Swiss Federal Archives, and a second to integrate loosely structured data for cancer diagnosis [BigData2020].
Original title
Tighten-it-All: Big Data Integration for Loosely-Structured Data