Artificial Intelligence is leading to ubiquitous sources of Big Data arriving at high-velocity and in real-time. To effectively deal with it, we need to be able to adapt to changes in the distribution of the data being produced, and we need to do it using a minimum amount of time and memory. In this paper, we detail modern applications falling into this context, and discuss some state-of-the-art methodologies in mining data streams in real-time, and the open source tools that are available to do machine learning/data mining in real-time for this challenging setting.
The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define "big data", i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on varying fronts. Among our contributions are: (1) establishing a vision for a database-inspired big data analytics system, which unifies the best of database and distributed systems technologies, and augments it with concepts drawn from compilers (e.g., iterations) and data stream processing, as well as (2) forming a community of researchers and institutions to create the Stratosphere platform to realize our vision. One major result from these activities was Apache Flink, an open-source big data analytics platform and its thriving global community of developers and production users. Although much progress has been made, when looking at the overall big data stack, a major challenge for database research community still remains. That is, how to maintain the ease-of-use despite the increasing heterogeneity and complexity of data analytics, involving specialized engines for various aspects of an end-to-end data analytics pipeline, including, among others, graph-based, linear algebra-based, and relational-based algorithms, and the underlying, increasingly heterogeneous hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Big Data Center (BBDC), we aim to advance research in this field via the Mosaics project. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to just the privileged few, who are coveted experts.
A crucial concern regarding cloud computing is the confidentiality of sensitive data being processed in the cloud. Trusted Execution Environments (TEEs), such as Intel Software Guard extensions (SGX), allow applications to run securely on an untrusted platform. However, using TEEs alone for stream processing is not enough to ensure privacy as network communication patterns may leak information about the data.
This paper introduces two techniques -- anycast and multicast --for mitigating leakage at inter-stage communications in streaming applications according to a user-selected mitigation level. These techniques aim to achieve network data obliviousness, i.e., communication patterns should not depend on the data. We implement these techniques in an SGX-based stream processing system. We evaluate the latency and throughput overheads, and the data obliviousness using three benchmark applications. The results show that anycast scales better with input load and mitigation level, and provides better data obliviousness than multicast.
An essential security concern in the publish/subscribe paradigm is that of guaranteeing the confidentiality of the data being transmitted. Existing solutions require that some initial parameters, keys or secrets be exchanged or otherwise established between communicating entities before secure end-to-end communication can occur. Most existing solutions in the literature either weaken the desirable decoupling properties of pub/sub or rely on a completely trusted out-of-band service to disseminate these values. This problem can be avoided through the use of Shamir's secret sharing scheme, at the cost of a prohibitively large number of messages, scaling exponentially with the path length between publisher and subscriber. Intel's Software Guard Extensions (SGX) offers trusted execution environments to shield application data from untrusted software running at a higher privilege level. Unfortunately, SGX requires the use of Intel's proprietary hardware and architecture.
We mitigate these problems through HyShare, a hybrid broker network used for the purposes of sharing a secret between communicating publishers and subscribers. The broker network is composed of regular brokers that use Shamir's secret sharing scheme and brokers with SGX to reduce the overall number of messages needed to share a secret. By fine tuning the combination of these brokers, it is possible to strike a balance between network resource use and hardware heterogeneity.
The Internet of Things (IoT) envisions a huge number of networked sensors connected to the internet. These sensors collect large streams of data which serve as input to wide range of IoT applications and services such as e-health, e-commerce, and automotive services. Complex Event Processing (CEP) is a powerful tool that transforms streams of raw sensor data into meaningful information required by these IoT services. Often these streams of data collected by sensors carry privacy-sensitive information about the user. Thus, protecting privacy is of paramount importance in IoT services based on CEP.
In this paper we present a novel pattern-level access control mechanism for CEP based services that conceals private information while minimizing the impact on useful non-sensitive information required by the services to provide a certain quality of service (QoS). The idea is to reorder events from the event stream to conceal privacy-sensitive event patterns while preserving non-privacy sensitive event patterns to maximize QoS. We propose two approaches, namely an ILP-based approach and a graph-based approach, calculating an optimal reordering of events. Our evaluation results show that these approaches are effective in concealing private patterns without significant loss of QoS.
In a networked world, events are transmitted from multiple distributed sources into CEP systems, where events are related to one another along multiple dimensions, e.g., temporal and spatial, to create complex events. The big data era brought with it an increase in the scale and frequency of event reporting. Internet of Things adds another layer of complexity with multiple, continuously changing event sources, not all of which are perfectly reliable, often suffering from late arrivals. In this work we propose a probabilistic model to deal with the problem of reduced reliability of event arrival time. We use statistical theories to fit the distributions of inter-generation at the source and network delays per event type. Equipped with these distributions we propose a predictive method for determining whether an event belonging to a window has yet to arrive. Given some user-defined tolerance levels (on quality and timeliness), we propose an algorithm for dynamically determining the amount of time a complex event time-window should remain open. Using a thorough empirical analysis, we compare the proposed algorithm against state-of-the-art mechanisms for delayed arrival of events and show the superiority of our proposed method.
Distributed pub/sub must make principal design choices with regards to overlay topologies and routing protocols. It is challenging to tackle both aspects together, and most existing work merely considers one. We argue the necessity to address both problems simultaneously since only the right combination of the two can deliver an efficient internet-scale pub/sub. Traditional design space spans from structured data-oblivious overlays employing greedy routing strategies all the way to unstructured data-driven overlays using naive broadcast-based routing. The two ends of the spectra come with unacceptable prices: the former often exerts considerable overhead on each node for forwarding irrelevant messages, while the latter is difficult to scale due to prohibitive latencies stemming from unbounded node degrees and network diameters.
To achieve the best of both worlds, we propose BeaConvey, a distributed pub/sub system for federated environments. First, we define the small-world and interest-close overlay (SWICO) that embraces both small-world properties and pub/sub semantics. To construct a SWCIO, we devise a greedy heuristic to assign small-world identifiers and fingers in a centralized manner. Second, we develop a family of peer-to-peer pub/sub routing protocols that leverages such SWICOs.
Empirical evaluation shows that BeaConvey achieves substantial improvement in routing overhead and propagation delays. For instance, the routing overhead of BeaConvey is only 20% to 40% of the state of the art. This acceleration is consistent across a variety of pub/sub workloads, and BeaConvey obtains such adaptability by optimizing both overlay and routing, which complement each other in different situations. Under one Facebook workload with a skewed distribution, 78% of the improvement is accredited to a better overlay. Under another non-skewed workload, more advanced routing contributes 95% of cost reduction.
State changes over time are inherent characteristics of stateful applications. So far, there are almost no attempts to make the past application history programmatically accessible or even modifiable. This is primarily due to the complexity of temporal changes and a difficult alignment with prevalent programming primitives and persistence strategies. Retroactive computing enables powerful capabilities though, including computations and predictions of alternate application timelines, post-hoc bug fixes, or retroactive state explorations. We propose an event-driven programming model that is oriented towards serverless computing and applies retroaction to the event sourcing paradigm. Our model is deliberately restrictive, but therefore keeps the complexity of retroactive operations in check. We introduce retro-λ, a runtime platform that implements the model and provides retroactive capabilites to its applications. While retro-λ only shows negligible performance overheads compared to similar solutions for running regular applications, it enables its users to execute retroactive computations on the application histories as part of its programming model.
Enterprise Application Integration is the centerpiece of current on-premise, cloud and device integration scenarios. We describe optimization strategies that help reduce the model complexity, and improve the process execution using design time techniques. In order to achieve this, we formalize compositions of Enterprise Integration Patterns based on their characteristics, and propose a realization of optimization strategies using graph rewriting. The framework is successfully evaluated on a real-world catalog of pattern compositions, containing over 900 integration scenarios.
The data acquisition system of the ATLAS experiment, a major experiment of the Large Hadron Collider (LHC) at CERN, will go through a major upgrade in the next decade. The upgrade is driven by experimental physics requirements, calling for increased data rates on the order of 6 TB/s. By contrast, the data rate of the existing system is 160 GB/s. Among the changes in the upgraded system will be a very large buffer with a projected size on the order of 70 PB. The buffer role will be decoupling of data production from on-line data processing, storing data for periods of up to 24 hours until it can be analyzed by the event processing system.
The larger buffer will allow a new data recording strategy, providing additional margins to handle variable data rates. At the same time it will provide sensible trade-offs between buffering space and on-line processing capabilities. This compromise between two resources will be possible since the data production cycle includes time periods where the experiment will not produce data.
In this paper we analyze the consequences of such trade-offs, and introduce a tool that allows a detailed exploration of different strategies for resource provisioning. It is based on a model of the upgraded data acquisition system, implemented in a simulation framework. From this model it is possible to obtain insight into the dynamics of the running system. Given predefined resource constraints, we provide bounds for the provisioning of buffering space and on-line processing requirements.
Mobile devices are increasingly being used in edge and fog computing environments to process contextual data collected by sensors. Although complex event processing (CEP) is a suitable approach for realizing context-aware services on mobile devices in these environments, existing mobile CEP engines do not leverage the full potential of modern mobile hardware/software architectures. In this paper, we present multimodal CEP, a novel approach to process streams of events on-device in user space (user mode), in the operating system (kernel mode), on the Wi-Fi chip (Wi-Fi mode), and/or on a sensor hub (hub mode), providing significant improvements in terms of power consumption and throughput. Multimodal CEP automatically breaks up CEP queries and selects the most adequate execution mode for the involved CEP operators. Filter, aggregation, and correlation operators can be expressed in a high-level language without requiring system-level domain-specific knowledge. Multimodal CEP enables developers to efficiently detect user activities, collect environmental conditions, or interpret operating system and network events. Furthermore, it facilitates novel context-aware services, demonstrated by a use case for gathering and analyzing mobility data by Wi-Fi probe request tracking.
To fully exploit the capabilities of sensors in real life, especially cameras, smart camera surveillance requires the cooperation from both domain experts in computer vision and systems. Existing alert-based smart surveillance is only capable of tracking a limited number of suspicious objects, while in most real-life applications, we often do not know the perpetrator ahead of time for tracking their activities in advance. In this work, we propose a radically different approach to smart surveillance for vehicle tracking. Specifically, we explore a smart camera surveillance system aimed at tracking all vehicles in real time. The insight is not to store the raw videos, but to store the space-time trajectories of the vehicles. Since vehicle tracking is a continuous and geo-distributed task, we assume a geo-distributed Fog computing infrastructure as the execution platform for our system. To bound the storage space for storing the trajectories on each Fog node (serving the computational needs of a camera), we focus on the activities of vehicles in the vicinity of a given camera in a specific geographic region instead of the time dimension, and the fact that every vehicle has a "finite" lifetime. To bound the computational and network communication requirements for detection, re-identification, and inter-node communication, we propose novel techniques, namely, forward and backward propagation that reduces the latency for the operations and the communication overhead. STTR is a system for smart surveillance that we have built embodying these ideas. For evaluation, we develop a toolkit upon SUMO to emulate camera detections from traffic flow and adopt MaxiNet to emulate the fog computing infrastructure on Microsoft Azure.
Operator placement has a profound impact on the performance of a distributed complex event processing system (DCEP). Since the behavior of a placement mechanism strongly depends on its environment; a single placement mechanism is often not enough to fulfill stringent performance requirements under environmental changes. In this paper, we show how DCEP can benefit from the adaptive use of multiple placement mechanisms. We propose Tcep, a DCEP system to integrate multiple placement mechanisms. By enabling transitions, Tcep can seamlessly exchange distinct operator mechanisms at runtime. We make two main contributions that are highly important for a cost-efficient transition: i) a transition strategy for efficiently scheduling state migrations and ii) a lightweight learning algorithm to adaptively select an appropriate placement mechanism as a consequence of a transition. Our evaluations for important decentralized placement mechanisms in the context of an IoT scenario show that transitions can better fulfill QoS demands in a dynamic environment. Thereby efficient scheduling of state migrations can help to faster complete transitions by up to 94 %.
We design Fogstore, a key-value store for event-based systems, that exploits the concept of relevance to guarantee low-latency access to relevant data with strong consistency guarantees, while providing tolerance from geographically correlated failures. Distributed event-based processing pipelines are envisioned to utilize the resources of densely geo-distributed infrastructures for low-latency responses - enabling real-time applications. Increasing complexity of such applications results in higher dependence on state, which has driven the incorporation of state-management as a core functionality of contemporary stream processing engines a la Apache Flink and Samza. Processing components executing under the same context (like location) often produce information that may be relevant to others, thereby necessitating shared state and an out-of-band globally-accessible data-store. Efficient access to application state is critical for overall performance, thus centralized data-stores are not a viable option due to the high-latency of network traversals. On the other hand, a highly geo-distributed datastore with low-latency implemented with current key-value stores would necessitate degrading client expectation of consistency as per the PACELC theorem. In this paper we exploit the notion of contextual relevance of events (data) in situation-awareness applications - and offer differential consistency guarantees for clients based on their context. We highlight important systems concerns that may arise with a highly geo-distributed system and show how Fogstore's design tackles them. We present, in detail, a prototype implementation of Fogstore's mechanisms on Apache Cassandra and a performance evaluation. Our evaluations show that Fogstore is able to achieve the throughput of eventually consistent configurations while serving data with strong consistency to the contextually relevant clients.
Distributed systems have become the preferred solution for dealing with Big Data analysis tasks. These systems are able to achieve superior performance by managing a large pool of resources as a single entity. However, in many contexts, performance is not the only metric to consider. When comparing two performance equivalent solutions, their cost becomes an important factor. Distributed systems are usually more expensive to deploy than traditional single-threaded applications.
In this work, we build on these considerations by presenting an empirical study that compares the cost of two performance equivalent solutions for a real streaming data analysis task for the Telecommunication industry. The first solution is built on popular distributed processing engines (Apache Spark), while the second solution is a single-threaded application built on an home-brew stream processing framework (Natron). We show that, in the case of continuous analysis, the benefits of distributed processing are outvalued by the distributed data ingestion costs. This is also the case for periodic analysis. However, if data ingestion costs are fixed and small, we show that the most cost-effective solution depends on the dataset size.
Smart Grids and Advanced Metering Infrastructures are rapidly replacing traditional energy grids. The cumulative computational power of their IT devices, which can be leveraged to continuously monitor the state of the grid, is nonetheless vastly underused.
This paper provides evidence of the potential of streaming analysis run at smart grid devices. We propose a structural component, which we name LoCoVolt (Local Comparison of Voltages), that is able to detect in a distributed fashion malfunctioning smart meters, which report erroneous information about the power quality. This is achieved by comparing the voltage readings of meters that, because of their proximity in the network, are expected to report readings following similar trends. Having this information can allow utilities to react promptly and thus increase timeliness, quality and safety of their services to society and, implicitly, their business value. As we show, based on our implementation on Apache Flink and the evaluation conducted with resource-constrained hardware (i.e., with capacity similar to that of hardware in smart grids) and data from a real-world network, the streaming paradigm can deliver efficient and effective monitoring tools and thus achieve the desired goals with almost no additional computational cost.
Evaluation of Distributed Complex Event Processing (CEP) systems is a rather challenging task. To simplify this task, we developed the open simulation framework for Distributed CEP, called DCEP-Sim. The goal of this tutorial is to facilitate the process of using DCEP-Sim. Since DCEP-Sim is designed and implemented in the popular network simulator ns-3 we introduce the most important concepts of ns-3. Simulations in ns-3 are configured and executed though a main program called an ns-3 script. We use a simple example script to explain how simulations with DCEP-Sim are set up and executed. To give an idea how DCEP-Sim can be adjusted to particular needs, we explain how DCEP-Sim can be adapted (e.g., through changing the workload and the network topology) and how new Distributed CEP solutions can be added by explaining how to add a new operator to DCEP-Sim.
Popularly known for powering cryptocurrencies such as Bitcoin and Ethereum, blockchains is seen as a disruptive technology capable of impacting a wide variety of domains, ranging from finance to governance, by offering superior security, reliability, and transparency in a decentralized manner. In this tutorial presentation, we first study the original Bitcoin design, as well as Ethereum and Hyperledger, and reflect on their design from an academic perspective. We provide an overview of potential applications and associated research challenges, as well as a survey of ongoing research projects. We mention opportunities blockchain creates for event-based systems. Finally, we conclude with a walkthrough showing the process of developing a decentralized application (ĐSApp), using a popular Smart Contract language (Solidity) for the blockchain platform of Ethereum.
The ACM DEBS 2018 Grand Challenge is the eighth in a series of challenges which seek to provide a common ground and evaluation criteria for a competition aimed at both research and industrial event-based systems. The focus of the 2018 Grand Challenge is on the application of machine learning to spatio-temporal streaming data. The goal of the challenge is to make the naval transportation industry more reliable by providing predictions for vessels' destinations and arrival times. This paper describes the specifics of the data streams and queries that define the DEBS 2018 Grand Challenge. It also describes the benchmarking platform that supports testing of corresponding solutions.
1Predicting the destination port and arrival time of a vessel is challenging, even with the availability of a tremendous amount of trace data. Our goal for this challenge is to build a solution to accurately predict the destination port and arrival times of a given vessel using Bayesian inference and heuristics.
In this paper, we present our approach for solving the DEBS Grand Challenge 2018. The challenge asks to provide a prediction for (i) a destination and the (ii) arrival time of ships in a streaming-fashion using Geo-spatial data in the maritime context. Novel aspects of our approach include the use of ensemble learning based on Random Forest, Gradient Boosting Decision Trees (GBDT), XGBoost Trees and Extremely Randomized Trees (ERT) in order to provide a prediction for a destination while for the arrival time, we propose the use of Feed-forward Neural Networks. In our evaluation, we were able to achieve an accuracy of 97% for the port destination classification problem and 90% (in minutes) for the ETA prediction.
The 2018 Grand Challenge targets the problem of accurate predictions on data streams produced by automatic identification system (AIS) equipment, describing naval traffic. This paper reports the technical details of a custom solution, which exposes multiple tuning parameters, making its configurability one of the main strengths. Our solution employs a cell grid architecture essentially based on a sequence of hash tables, specifically built for the targeted use case. This makes it particularly effective in prediction on AIS data, obtaining a high accuracy and scalable performance results. Moreover, the architecture proposed accommodates also an optionally semi-supervised learning process besides the basic supervised mode.
In this paper, we present MtDetector, a high performance marine traffic detector that can predict the destination and the arrival time of travelling vessels. MtDetector accepts streaming data reported by the moving vessels and generates continuous predictions of the arrival port and arrival time for those vessels. To predict the destination for a ship, MtDetector builds a neural network for every port and infers the arrival port for vessels based on their departure port. For the arrival time prediction, we derive informative features from training data and apply Deep Neural Network (DNN) to estimate the traveling time. MtDetector is built on top of DtCraft [1,2], a high-performance distributed execution engine for stream programming. By utilizing the task-based parallelism in DtCraft, MtDetector can process multiple predictions concurrently to achieve high throughput and low latency.
The ACM DEBS 2018 Grand Challenge focuses on (soft) real-time prediction of both the destination port and the time of arrival of vessels, monitored through the Automated Identification System (AIS). Venilia prediction mechanism is based on a variety of machine learning techniques, including Markov predictive models. To improve the accuracy of a model, trained off-line on historical data, Venilia supports also on-line continuous training using an incoming event stream. The software architecture enables a low latency, highly parallelized, and load balanced prediction pipeline. Aiming at a portable and reusable solution, Venilia is implemented on top of the Akka Actor framework. Finally, Venilia is also equipped with a visualization tool for data exploration.
In this paper, we propose scalable algorithms allowing primo to infer a map of vessels' trajectories and secundo to predict future locations of a vessel on sea. Our system is based on Apache Spark -a fast and scalable engine for large-scale data processing. The training dataset is event-based. Each event depicts the GPS position of the vessel at a timestamp. We propose and implement a workflow computing trips' patterns, with GPS locations of each trip summarized using GeoHashing. The latter is an efficient encoding of a geographic location into a short string of letters and digits. In order to perform prediction queries efficiently, we propose (i) a geohash positional index which maps each geohash to a list of pairs (trip-pattern-identifier, offset of the geohash in the geohash sequence of the trip-pattern), (ii) a departure-port index which maps each departure port to a list of trip-patterns' identifiers, as well as (iii) a pairwise geohash sequence alignment allowing to score the similarity of two geohash-sequences using queen-spatial neighborhood.
We propose a sequence-to-sequence based method to predict vessels' destination port and estimated arrival time. We consider this problem as an extension of trajectory prediction problem, that takes a sequence of historical locations as input and returns a sequence of future locations, which is used to determine arrival port and estimated arrival time. Our solution first represents the trajectories on a spatial grid covering Mediterranean Sea. Then, we train a sequence-to-sequence model to predict the future movement of vessels based on movement tendency and current location. We built our solution using distributed architecture model and applied load balancing techniques to achieve both high performance and scalability.
This paper presents a generic solution to the spatiotemporal prediction problem provided for the DEBS Grand Challenge 2018. Our solution employs an efficient multi-dimensional index to store the training and historical dataset. With the arrival of new tasks of events, we query our indexing structure to determine the closest points of interests. Based on these points, we select the ones with the highest overall score and predict the destination and time of the vessel in question. Our solution does not rely on existing machine learning techniques and provides a novel view of the prediction problem in the streaming settings. Hence, the prediction is not just based on the recent data, but on all the useful historical dataset.
The DEBS Grand Challenge 2018 is set in the context of maritime route prediction. Vessel routes are modeled as streams of Automatic Identification System (AIS) data points selected from real-world tracking data. The challenge requires to correctly estimate the destination ports and arrival times of vessel trips, as early as possible. Our proposed solution partitions the training vessel routes by reported destination port and uses a nearest neighbor search to find the training routes that are closer to the query AIS point. Particular improvements have been included as well, such as a way to avoid changing the predicted ports frequently within one query route and automating the parameters tuning by the use of a genetic algorithm. This leads to significant improvements on the final score.
Hyperledger Fabric (HLF) is a modular and extensible permissioned blockchain platform released to open-source and hosted by the Linux Foundation. The platform's design exhibits principles required by enterprise grade business applications like supply-chains, financial transactions, asset management, food safety, and many more. For that end HLF introduces several innovations, two of which are smart contracts in general purpose languages (chaincode in HLF), and flexible endorsement policies, which govern whether a transaction is considered valid.
Typical blockchain applications are comprised of two tiers: the first tier focuses on the modelling of the data schema and embedding of business rules into the blockchain by means of smart contracts (chaincode) and endorsment policies; and the second tier uses the SDK (Software Development Kit) provided by HLF to implement client side application logic.
However there is a gap between the two tiers that hinders the rapid adoption of changes in the chaincode and endorsement policies within the client SDK. Currently, the chaincode location and endorsement policies are statically configured into the client SDK. This limits the reliability and availability of the client in the event of changes in the platform, and makes the platform more difficult to use. In this work we address and bridge the gap by describing the design and implementation of Service Discovery.
Service Discovery provides APIs which allow dynamic discovery of the configuration required for the client SDK to interact with the platform, alleviating the client from the burden of maintaining it. This enables the client to rapidly adapt to changes in the platform, thus significantly improving the reliability of the application layer. It also makes the HLF platform more consumable, simplifying the job of creating blockchain applications.
Event sourcing is increasingly used and implemented in event-based systems for maintaining the evolution of application state. However, unbounded event logs are impracticable for many systems, as it is difficult to align scalability requirements and long-term runtime behavior with the corresponding storage requirements. To this end, we explore the design space of log pruning approaches suitable for event-sourced systems. Furthermore, we survey specific log pruning mechanisms for event-sourced logs. In a brief evaluation, we point out the trade-offs when applying pruning to event logs and highlight the applicability of log pruning to event-sourced systems.
Nowadays data stream processing systems need to efficiently handle large volumes of data in near real-time. To achieve this, the schedulers within such systems minimise the data movement between highly communicating tasks, improving system throughput. However, finding an optimal schedule for these systems is NP-hard. In this research, we propose a heuristic scheduling algorithm which reliably and efficiently finds the highly communicating tasks by exploiting graph partitioning algorithms and a mathematical optimisation software package. We evaluate our scheduler with two popular existing schedulers R-Storm and Aniello et al.'s 'Online scheduler' using two real-world applications and show that our proposed scheduler outperforms R-Storm, increasing throughput by between 3% and 30% and Online scheduler by 20--86% as a result of finding a more efficient schedule.
Social networks provide a rich data source for researchers that can be accessed in a comparatively effortless way. As data and text mining methods such as Sentiment Analysis are becoming increasingly refined, the wealth of social network data opens up entirely new possibilities for exploring specific in-depth research questions. In this paper an approach towards the retrieval, analysis and interpretation of social network data for research purposes is developed. The data is filtered according to relevant criteria and analyzed using Sentiment Analysis tools tailored specifically to the data source. The approach is verified by applying it to two example research questions, confirming past findings on cultural and gender differences in sentiment expression.
Distributed event-sourced systems adopt a fairly new architectural style for data-intensive applications that maintains the full history of the application state. However, the performance implications of such systems are not yet well explored, let alone how the performance of these systems can be improved. A central issue is the lack of systematic performance engineering approaches that take into account the specific characteristics of these systems. To address this problem, we suggest a methodology for performance engineering and performance analysis of distributed event-sourced systems based on specific measurements and subsequent, targeted optimizations. The methodology blends in well into existing software engineering processes and helps developers to identify bottlenecks and to resolve performance issues. Using our structured approach, we improved an existing event-sourced system prototype and increased its performance considerably.
In this paper, we design a novel platform that facilitates integrated healthcare services without a centralized orchestration. Events that reflect dynamically changing conditions of patients are published using a scalable messaging middleware built on top of a publish/subscribe broker overlay network. Events matching service rules are routed to the appropriate caretakers. Services rules are issued autonomously by the caretakers who subscribe to the future matching events. Through this event-driven system, we aim to help the caretakers and medical staff to recommend and offer services to patients in a more timely and seamless manner.
This paper presents Moscato, a web-based tool for a more effective management of large-scale data and event processing platforms. With Moscato, composing data and event processing services can be done intuitively. The process of deploying new service instances including the task of installation and configuration can be automated. With such automation feature, we expect administrators tedious and error-prone management tasks are reduced. Instead, administrators can leverage Moscato's various novel visual cues in order to conduct multilateral situation analysis.
Trade surveillance is an important concern in recent trading engines to detect and prevent fraudulent trades at earliest. In traditional trading platforms, to achieve high throughput and low latency requirements focus of developers has always been on high-performance languages such as C, C++ and FPGA based systems. These systems have limitations of scalability and fault-tolerance. With the arrival of in-memory technology these requirements can be met with Java-based frameworks like Ignite, Flink, Spark. In this paper, we propose a novel way of implementing trade surveillance architecture using Apache Ignite In-Memory Data Grid (IMDG). Paper discusses the engineering approach to tune system architecture on the single node in terms of achieving high throughput, low latency and then scaling out to multiple nodes.
The primary consumption of news is now increasingly online and has resulted in a large volume of online news from varied news outlets. Consequently news aggregators have become popular for clustering, ranking and personalization of news which process millions of news articles each day. In addition, since news articles stream constantly, there is a need for a scalable event-based system which can facilitate news mining in an online fashion. To address these challenges, we propose a distributed framework to process news articles and cluster them to facilitate many news mining tasks. The core of our system is a novel and scalable distributed clustering algorithm using Locality Sensitive Hashing which is robust to outliers and noise. In addition, we also propose an online version of the clustering algorithm to dynamically maintain the news event clusters. We implement the proposed solution on Apache Spark. Using a large news collection with over 8 million news articles, we show that our approach outperforms widely-used clustering techniques such as K-Means both in run time and clustering quality.
In this paper, we propose a neural network based system to predict vessels' trajectories including the destination port and estimated arrival time. The system is designed to address DEBS Grand Challenge 2018, which provides a set of data streams containing vessel information and coordinates ordered by time. Our goal is to design a system which can accurately predict future trajectories, destination port and arrival time for a vessel.
Our solution is based on the sequence-to-sequence model which uses a spatial grid for trajectory prediction. We divided sea area into a spatial grid and then used vessels' recent trajectory as a sequence of codes to extract movement tendency. The extracted movement tendency allowed us to predict future movements till the destination. We built our solution using distributed architecture model and applied load balancing techniques to achieve maximum performance and scalability. We also design an interactive user interface which showcases real-time trajectories of vessels including their predicted destination and arrival time.
The recent success of electric vehicles leads to unprecedentedly high peaks of demand on the electric grid at the times when most people charge their cars. In order to avoid unreasonably rising costs due to inefficient utilization of the electricity infrastructure, we propose EVA: a scheduling system to solve the valley filling problem by distributing the electricity demand generated by electric vehicles in a geographically limited area efficiently over time spans in which the electric grid is underutilized. EVA is based on a smart contract running on the Ethereum blockchain in combination with off-chain computational nodes performing the schedule calculation using the Alternating Direction Method of Multipliers (ADMM). This allows for a high degree of transparency and verifiability in the scheduling computation results while maintaining a reasonable level of efficiency. In order to interact with the scheduling system, we developed a decentralized app with a graphical frontend, where the user can enter vehicle information and future energy requirements as well as review upcoming schedules. The calculation of the schedule is performed on a daily basis, continuously providing schedules for participating users for the following day.
Stateful applications are based on the state they hold and how it changes over time. This history of state changes is usually discarded as the application progresses. By building on concepts from event processing and storing the application history we envision a novel programming paradigm that supports retroaction. Retroactive computing introduces new opportunities for a developer to access and even modify an application timeline. By enabling the exploration of alternative scenarios, retroactive computing establishes powerful new ways to debug systems and introduces new approaches to solve problems. Initial work has shown the practicality and possibilities of this new programming paradigm and introduces further research questions and challenges.
Data acquisition systems for particle physics experiments produce vasts amounts of data. It is sometimes unfeasible to store it all since the storage requirements will be enormous. For this reason, an on-line filtering system selects the relevant pieces of information according to the goals of the experiment, before finally sending them to permanent storage. While data is being analyzed, it is temporarily stored in a large high-speed buffering system. Data production follows a cycle, with long periods of many hours where no data is being produced by the experiment. Also, data production is not constant, and there are fluctuations in the input rate. This offers the possibility of over-provisioning the buffering system and trading processing power for storage space. This buffer can be used for storage for periods of many days. In this work, a model was created to study the behavior of some aspects of the ATLAS data acquisition system, and specifically the buffering system for the on-line filter.
Complex Event Processing (CEP) system enables extraction of higher-level information from real-time data streams produced by distributed sources. However, these systems are subject to changes in the user environment e.g., density of sources, rate at which events occur and mobile sources. Therefore, it becomes difficult to satisfy stringent performance requirements posed in terms of Quality of Service (QoS) demands under such a dynamic environment.
This work investigates adaptive use of CEP mechanisms e.g., operator placement and operator migration by supporting transitions i.e., dynamic exchange of these mechanisms. In particular, we build a transition-capable CEP system --- Tcep to enable integration of multiple heterogeneous CEP mechanisms and allow cost-efficient and seamless transitions between them. As a proof-of-concept, we have recently designed and developed an initial architecture named Tcep, where we have shown benefits of transitions among operator placement mechanisms. In an ongoing research, we explore other CEP mechanisms e.g., operator migration and investigate whether transitions can bring performance benefits, under the execution of different strategies. In the future, we will investigate if mechanism transitions in CEP are beneficial in middleware infrastructures including information-centric networks.