Stream analytics: fast processing and privacy-preserving tools
Society produces data continuously, and at unprecedented speed. As a result, it is increasingly unrealistic to educate a sufficient number of skilled computer scientists to collect and analyse these data. Instead, new ways are needed to analyse data as it is being produced.
Portrait / project description (completed research project)
In this project, a petabyte-scale, privacy-preserving processing system for commodity (i.e. standard) hardware was developed. First, a user-friendly programming language was provided that builds on traditional querying and adds extensions for statistical and real-time operations. Second, the language lets users specify the desired level of privacy. Third, the system compiler translates the statistical functions and privacy specifications into executable computations. Finally, the runtime environment selects the best approach for optimising execution on existing systems (e.g. Apache Flink, Spark Streaming or Storm).
Background
Production of Big Data will soon outpace the availability of both storage and computer science experts who know how to handle such data. Moreover, society is increasingly concerned about data protection. Addressing these issues requires so-called stream-processing systems that continuously analyse incoming data (rather than store it) and allow non-computer scientists to specify its analysis in a privacy-preserving manner. This project could vastly simplify the development of new, societally acceptable applications of real-time data analytics.
Aim
A petabyte-scale analytics system (i.e. one processing millions of gigabytes) was developed that enables non-computer scientists to analyse high-performance data streams. The solution supports advanced statistical operations in real time and ensures the privacy of the data. To evaluate the robustness and functionality of the system, the processing pipeline of the Australian Square Kilometre Array Pathfinder radio telescope, which generates up to 2.5 gigabytes per second of raw data, was replicated. To evaluate privacy preservation, the TV viewing habits of around 3 million individuals were analysed.
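To put these figures side by side, the short calculation below estimates how long a stream at the stated peak rate takes to reach one petabyte. It assumes decimal units (1 petabyte = 1,000,000 gigabytes) and a sustained peak rate; the function name is illustrative.

```python
def seconds_to_accumulate(total_gb, rate_gb_per_s):
    """Time in seconds for a stream at a constant rate to reach total_gb."""
    return total_gb / rate_gb_per_s


PETABYTE_GB = 1_000_000   # assumption: decimal units
PEAK_RATE_GB_PER_S = 2.5  # peak raw data rate stated in the text

seconds = seconds_to_accumulate(PETABYTE_GB, PEAK_RATE_GB_PER_S)
days = seconds / 86_400   # seconds per day

print(f"{seconds:.0f} s, about {days:.1f} days")
# prints "400000 s, about 4.6 days"
```

At that rate a petabyte accumulates in under a week, which is why the data has to be analysed as it arrives rather than stored first.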
Relevance/application
The ubiquity of electronic devices and sensors is leading society to a data deluge. The results of this project allow non-computer scientists to efficiently analyse and explore these ever-increasing data sources while adhering to data protection laws.
Results
The main results of the project are:
- new algorithms to compute the Fourier transform in streaming environments,
- a functional extension of relational systems with linear algebra operations, and
- the easy-to-use privacy-preserving querying of RDF streams with SihlMill.
First, new algorithms were developed that make it possible to apply the Fourier transform to high-velocity, high-volume data streams. The first of these, the Single Point Incremental Fourier Transform (SPIFT), exploits twiddle factors to reduce the cost of processing a single new observation arriving in the data stream: it uses circular shifts to reduce the number of multiplications from quadratic to linear. The algorithm was also extended to process batches of observations efficiently, yielding the Multi Point Incremental Fourier Transform (MPIFT).
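The effect of reusing twiddle factors can be illustrated with a minimal sketch. It assumes a square n-by-n image that accumulates the inverse two-dimensional discrete Fourier transform of observations arriving one at a time at integer frequency coordinates (u, v). The function names and this setup are illustrative assumptions, not the project's SPIFT implementation.

```python
import cmath


def add_point_naive(image, u, v, value):
    """Add one observation using n*n complex multiplications."""
    n = len(image)
    for x in range(n):
        for y in range(n):
            twiddle = cmath.exp(2j * cmath.pi * (u * x + v * y) / n)
            image[x][y] += value * twiddle


def add_point_shifted(image, u, v, value):
    """Add the same observation using only n complex multiplications.

    There are only n distinct twiddle factors exp(2*pi*i*k/n), so the n
    products value * twiddle[k] are computed once and then reused through
    modular indexing. For v == 1 each row is a circular shift of the
    product vector by u*x positions.
    """
    n = len(image)
    products = [value * cmath.exp(2j * cmath.pi * k / n) for k in range(n)]
    start = 0                      # equals (u * x) mod n for the current row
    for x in range(n):
        idx = start                # equals (u * x + v * y) mod n
        row = image[x]
        for y in range(n):
            row[y] += products[idx]
            idx = (idx + v) % n
        start = (start + u) % n


# Usage: both variants should agree up to floating-point rounding.
n = 8
naive = [[0j] * n for _ in range(n)]
fast = [[0j] * n for _ in range(n)]
for u, v, value in [(1, 1, 2 + 0j), (3, 5, -1 + 0.5j)]:
    add_point_naive(naive, u, v, value)
    add_point_shifted(fast, u, v, value)

max_diff = max(abs(naive[x][y] - fast[x][y])
               for x in range(n) for y in range(n))
print(max_diff)
```

The sketch still performs n*n additions per observation; only the number of complex multiplications drops from quadratic to linear in n.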
Second, declarative high-level languages were advanced with functional extensions for linear algebra operations. Specifically, relational algebra and SQL were each extended elegantly with linear algebra operations, and a system was built that integrates the functional extension into the kernel of the MonetDB column store.
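To make concrete what a linear algebra operation over relations means, the sketch below stores two matrices as (row, column, value) relations and multiplies them in plain SQL with a join and an aggregation. It uses SQLite from the Python standard library purely for illustration. The table layout and function name are assumptions, and the query shows the conventional encoding, not the syntax or kernel integration of the project's extension in MonetDB.

```python
import sqlite3


def matmul_sql(a_rows, b_rows):
    """Multiply two matrices given as lists of (i, j, value) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (i INTEGER, j INTEGER, v REAL)")
    conn.execute("CREATE TABLE b (i INTEGER, j INTEGER, v REAL)")
    conn.executemany("INSERT INTO a VALUES (?, ?, ?)", a_rows)
    conn.executemany("INSERT INTO b VALUES (?, ?, ?)", b_rows)
    # C[i][j] = sum over k of A[i][k] * B[k][j]:
    # the join matches A's column index with B's row index,
    # and the GROUP BY sums the products for each output cell.
    result = conn.execute(
        "SELECT a.i, b.j, SUM(a.v * b.v) "
        "FROM a JOIN b ON a.j = b.i "
        "GROUP BY a.i, b.j "
        "ORDER BY a.i, b.j"
    ).fetchall()
    conn.close()
    return result


# Usage: A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] as relations.
A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
product = matmul_sql(A, B)
print(product)
```

Writing such queries by hand for every operation quickly becomes unwieldy, which is the gap that dedicated linear algebra operations in the query language are meant to close.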
Third, SihlQL was developed, a SPARQL-inspired query language for the privacy-preserving querying of RDF data streams. The starting point was an easy-to-understand probabilistic parameter for systems based on differential privacy, which allows domain experts to specify the desired level of privacy easily. Starting from SihlQL, a compiler was then developed that translates queries into Apache Flink workflows. The resulting system, SihlMill, is published as an open-source project and implements state-of-the-art privacy-preserving algorithms as well as new mechanisms designed to extend the expressiveness of SihlQL.
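As background on differential privacy over streams, the sketch below releases noisy counts per window using the standard Laplace mechanism. It assumes that each individual contributes at most one event per window (sensitivity 1) and uses the conventional privacy parameter epsilon, not the project's own probabilistic parameter, which is not specified here. The function names are illustrative and unrelated to SihlQL or SihlMill.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_window_counts(events, window, epsilon, rng=None):
    """Return one noisy count per consecutive window of the stream.

    events  -- list of 0/1 flags, one per arriving item
    window  -- number of items per window
    epsilon -- privacy parameter; smaller values add more noise
    """
    rng = rng or random.Random()
    noisy = []
    for start in range(0, len(events), window):
        true_count = sum(events[start:start + window])
        # Assumption: sensitivity 1, so the noise scale is 1 / epsilon.
        noisy.append(true_count + laplace_noise(1.0 / epsilon, rng))
    return noisy


# Usage: two windows of four items each, with a moderate privacy level.
stream = [1, 0, 1, 1, 0, 0, 1, 1]
print(private_window_counts(stream, window=4, epsilon=0.5,
                            rng=random.Random(42)))
```

The trade-off is visible in the single parameter: lowering epsilon strengthens the privacy guarantee but makes each released count less accurate.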
Original title
Privacy Preserving, Peta-scale Stream Analytics for Domain-Experts