Data streams: monitoring in real time
The operational procedures of numerous authorities and companies are subject to strict regulations. In practice, it is often difficult to establish whether the rules in place are actually being complied with. We developed algorithms that can automatically monitor regulatory compliance and handle large volumes of data.
Portrait / project description (completed research project)
We developed and integrated efficient, parallelised algorithms capable of monitoring whether data in a dynamic environment comply with given rules. The input for these algorithms is a real-time data feed as well as the rules to be monitored. The rules are formulated in an input language that allows users to express temporal and data dependencies between different events in the data feed in a simple and intuitive manner. If any of the rules are violated, a compact output of the data that caused the violation will be produced.
Background
Compliance is crucial, and companies have entire departments dedicated to overseeing it. Given voluminous log files and other inputs, these departments must quickly and reliably check whether procedures comply with the (possibly complex) rules in force. A typical rule for a bank might be: No customer may withdraw more than 5000 Swiss francs per week. Being able to identify individual violations is of considerable value.
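To make the bank rule concrete, the following minimal sketch checks it against a stream of withdrawal events. It is an illustration only, not the project's algorithm or input language: the class name WithdrawalMonitor, the event layout (timestamp in seconds, customer, amount), and the reading of "per week" as any sliding seven-day window are all assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

WEEK_SECONDS = 7 * 24 * 3600   # assumed reading of "per week": sliding 7-day window
LIMIT_CHF = 5000               # limit taken from the example rule


class WithdrawalMonitor:
    """Hypothetical online monitor for the example rule (illustrative sketch)."""

    def __init__(self, limit=LIMIT_CHF, window=WEEK_SECONDS):
        self.limit = limit
        self.window = window
        self.recent = defaultdict(deque)  # customer -> (timestamp, amount) pairs in window
        self.totals = defaultdict(int)    # customer -> sum of amounts in window

    def observe(self, timestamp, customer, amount):
        """Process one withdrawal; return a violation report or None.

        Assumes events arrive in non-decreasing timestamp order.
        """
        events = self.recent[customer]
        # Discard withdrawals that have fallen out of the sliding window.
        while events and events[0][0] <= timestamp - self.window:
            _, old_amount = events.popleft()
            self.totals[customer] -= old_amount
        events.append((timestamp, amount))
        self.totals[customer] += amount
        if self.totals[customer] > self.limit:
            # Compact output: only the withdrawals that caused the violation.
            return {
                "customer": customer,
                "total": self.totals[customer],
                "withdrawals": list(events),
            }
        return None


DAY = 24 * 3600
feed = [
    (0, "alice", 3000),
    (1 * DAY, "bob", 4000),
    (2 * DAY, "alice", 2500),   # alice now exceeds the limit within one week
    (10 * DAY, "alice", 4000),  # earlier withdrawals have expired by now
]
monitor = WithdrawalMonitor()
violations = []
for event in feed:
    report = monitor.observe(*event)
    if report is not None:
        violations.append(report)
```

The monitor keeps only the withdrawals inside the current window for each customer, so memory use depends on recent activity rather than on the full log. A hard-coded check like this covers a single rule; a general rule language and out-of-order event arrival both need considerably more machinery.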
Aim
We developed algorithms that continually monitor incoming data for rule violations. The more complex the rule, the greater the challenge of checking it efficiently against enormous volumes of data. The expressiveness of the input language influences the possible complexity of rules and therefore the efficiency of the monitoring algorithm. Our goal was to find efficient monitoring algorithms for highly expressive and hence practically useful input languages.
Relevance / application
The problem of automated rule monitoring can be approached from two directions. On the one hand, theoretical research is improving the expressiveness of the input language for rules and developing algorithms for these languages. However, the scalability of these algorithms to accommodate huge data volumes is frequently neglected. On the other hand, practical research is being conducted on implementing scalable algorithms for parallelised execution in computer clusters. In this context, work on designing input languages often does not receive the attention it merits. Our project combines the two approaches to the benefit of both.
Results
We have made significant contributions to the project’s overall objectives. With the results from our project, we have pushed the scalability boundaries for expressive first-order monitors, so that they can be deployed in big data settings. Moreover, we have identified alternative modes of operation which result in unprecedented efficiency for propositional monitors. All these foundational improvements are accompanied by publicly available, open-source implementations, whose efficiency and scalability are validated in synthetic and realistic case studies. In addition, we have addressed the correctness of monitoring algorithms and the understandability of the monitoring output, and we have devised algorithms for deploying monitors in a distributed setting, where events may arrive at the monitor out of order.
Original title
Big Data Monitoring