ICPE '21: Companion of the ACM/SPEC International Conference on Performance Engineering

SESSION: Auto-DaSP 2021 Workshop

An Online Approach to Determine Correlation between Data Streams

Real-time stream processing demands processed outcomes with minimal latency. Massive streams are generated in real time, and the linear relationship between them is determined using correlation. Existing approaches such as Kendall, Pearson, and Spearman are designed for correlating static data sets and are insufficient for noise-free online correlation. In this paper, we propose an online ordinal correlation approach that operates in a single pass, avoids recalculation from scratch, removes outliers, and has low memory requirements. In this approach, the Compare Reduce Aggregate (CRA) algorithm determines the association between two feature vectors in real time using a single scanning technique. The time and space complexities of the CRA algorithm are O(n) and O(1), respectively. The algorithm reduces noise or error in a stream and can be used as a replacement for rank-based correlation. To gain maximum performance from this algorithm, the streams should contain distinct elements and exhibit low variability.
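
The CRA algorithm itself is not reproduced in this abstract; purely as a hedged illustration of the single-pass, constant-memory style of computation it describes, the sketch below maintains running co-moments to update a Pearson-style correlation estimate per arriving pair. The class and variable names are illustrative assumptions, not the paper's method.

```python
class OnlineCorrelation:
    """Single-pass correlation estimate over two synchronized streams.

    Illustrative only: maintains running co-moments (Welford-style),
    giving O(1) memory and O(1) work per element, i.e. O(n) total time.
    This is not the CRA algorithm from the paper.
    """

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0   # sum of squared deviations of x
        self.m2_y = 0.0   # sum of squared deviations of y
        self.c_xy = 0.0   # co-moment of x and y

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x
        self.mean_x += dx / self.n
        dy = y - self.mean_y
        self.mean_y += dy / self.n
        # Standard online update: old delta times deviation from the new mean.
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def correlation(self):
        if self.n < 2 or self.m2_x == 0 or self.m2_y == 0:
            return 0.0
        return self.c_xy / (self.m2_x * self.m2_y) ** 0.5


if __name__ == "__main__":
    corr = OnlineCorrelation()
    for x, y in [(1, 2), (2, 4), (3, 5), (4, 9), (5, 10)]:
        corr.update(x, y)
    print(f"running correlation: {corr.correlation():.3f}")
```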

Elastic Pulsar Functions for Distributed Stream Processing

An increasing number of data-driven applications rely on the ability to process data flows in a timely manner, exploiting Data Stream Processing (DSP) systems for this purpose. Elasticity is an essential feature for DSP systems, as workload variability calls for automatic scaling of the application processing capacity to avoid both overload and resource wastage. In this work, we implement auto-scaling in Pulsar Functions, a function-based streaming framework built on top of Apache Pulsar, a distributed publish-subscribe messaging platform that natively supports serverless functions. Considering various state-of-the-art policies, we show that the proposed solution is able to scale application parallelism with minimal overhead.
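
The paper evaluates several state-of-the-art policies; as a minimal sketch of the kind of decision logic an auto-scaler for function parallelism might apply, the code below implements a simple threshold policy on measured utilization. The thresholds, bounds, and function names are assumptions for illustration and do not reflect the paper's implementation.

```python
def scaling_decision(parallelism, utilization,
                     scale_out_threshold=0.8, scale_in_threshold=0.3,
                     min_instances=1, max_instances=16):
    """Return the new parallelism for a function given its measured utilization.

    Illustrative threshold policy only, not one of the policies from the paper.
    """
    if utilization > scale_out_threshold and parallelism < max_instances:
        return min(parallelism * 2, max_instances)   # scale out aggressively
    if utilization < scale_in_threshold and parallelism > min_instances:
        return max(parallelism - 1, min_instances)   # scale in conservatively
    return parallelism                                # keep current parallelism


if __name__ == "__main__":
    p = 2
    for u in [0.9, 0.95, 0.5, 0.2, 0.1]:
        p = scaling_decision(p, u)
        print(f"utilization={u:.2f} -> parallelism={p}")
```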

Motivations and Challenges for Stream Processing in Edge Computing

The 2030 Agenda for Sustainable Development of the United Nations General Assembly defines 17 development goals to be met for a sustainable future. Goals such as Industry, Innovation and Infrastructure and Sustainable Cities and Communities depend on digital systems. As a matter of fact, billions of Euros are invested into digital transformation within the European Union, and many researchers are actively working to push state-of-the-art boundaries for techniques/tools able to extract value and insights from the large amounts of raw data sensed in digital systems. Edge computing aims at supporting such data-to-value transformation. In digital systems that traditionally rely on central data gathering, edge computing proposes to push the analysis towards the devices and data sources, thus leveraging the large cumulative computational power found in modern distributed systems. Some of the ideas promoted in edge computing are not new, though. Continuous and distributed data analysis paradigms such as stream processing have argued about the need for smart distributed analysis for basically 20 years. Starting from this observation, this talk covers a set of standing challenges for smart, distributed, and continuous stream processing in edge computing, with real-world examples and use-cases from smart grids and vehicular networks.

Towards Elastic and Sustainable Data Stream Processing on Edge Infrastructure

Much of the data produced today is processed as it is generated by data stream processing systems. Although the cloud is often the target infrastructure for deploying data stream processing applications, resources located at the edges of the Internet have increasingly been used to offload some of the processing performed in the cloud and hence reduce the end-to-end latency when handling data events. In this work, I highlight some of the challenges in executing data stream processing applications on edge computing infrastructure and discuss directions for future research on making such applications more elastic and sustainable.

SESSION: HotCloudPerf 2021 Workshop

Towards Independent Run-Time Cloud Monitoring

Cloud computing services are integral to the digital transformation. They deliver greater connectivity, tremendous savings, and lower total cost of ownership. Despite such benefits and benchmarking advances, costs are still quite unpredictable, performance is unclear, security is inconsistent, and there is minimal control over aspects like data and service locality. Estimating the performance of cloud environments is very hard for cloud consumers. They would like to make informed decisions about which provider better suits their needs using specialized evaluation mechanisms. Providers have their own tools reporting specific metrics, but these are potentially biased and often incomparable across providers. Current benchmarking tools allow comparison, but consumers need more flexibility to evaluate environments under actual operating conditions for specialized applications. Ours is early-stage work and a step towards a monitoring solution that enables independent evaluation of clouds for very specific application needs. In this paper, we present our initial architecture of the Cloud Monitor, which aims to integrate existing and new benchmarks in a flexible and extensible way, and we illustrate the concept with a simple demonstrator. Even after a brief monitoring period, our preliminary results reveal unexpected anomalies. The results suggest an independent monitoring solution is a powerful enabler of next-generation cloud computing, not only for the consumer but potentially the whole ecosystem.

Distributed Double Machine Learning with a Serverless Architecture

This paper explores serverless cloud computing for double machine learning. Being based on repeated cross-fitting, double machine learning is particularly well suited to exploit the high level of parallelism achievable with serverless computing, and it allows fast on-demand estimation without additional cloud maintenance effort. We provide a prototype Python implementation, DoubleML-Serverless, for the estimation of double machine learning models on the serverless computing platform AWS Lambda, and demonstrate its utility with a case study analyzing estimation times and costs.
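
As a rough sketch of how repeated cross-fitting maps onto serverless parallelism, the snippet below fans one Lambda invocation out per (repetition, fold) pair using boto3. The function name and payload schema are assumptions for illustration and do not reflect the DoubleML-Serverless API.

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3  # AWS SDK; requires configured credentials and region

lambda_client = boto3.client("lambda")

def invoke_fold(repetition, fold):
    """Invoke one cross-fitting task as a Lambda call (illustrative payload)."""
    response = lambda_client.invoke(
        FunctionName="double-ml-crossfit",   # hypothetical function name
        Payload=json.dumps({"repetition": repetition, "fold": fold}),
    )
    return json.loads(response["Payload"].read())

def run_cross_fitting(n_repetitions=10, n_folds=5, max_workers=50):
    """Fan out all (repetition, fold) estimation tasks in parallel."""
    tasks = [(r, k) for r in range(n_repetitions) for k in range(n_folds)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda t: invoke_fold(*t), tasks))
    return results  # partial estimates, aggregated client-side afterwards

if __name__ == "__main__":
    print(run_cross_fitting(n_repetitions=2, n_folds=2))
```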

Cloud Performance Variability Prediction

Cloud computing plays an essential role in our society nowadays. Many important services are highly dependent on the stable performance of the cloud. However, as prior work has shown, clouds exhibit large degrees of performance variability. Next to the stochastic variation induced by noisy neighbors, an important facet of cloud performance variability is given by changepoints: the instances where the non-stationary performance metrics exhibit persisting changes, which often last until subsequent changepoints occur. Such undesirable artifacts of unstable application performance lead to problems with application performance evaluation and prediction efforts. Thus, characterizing and understanding performance changepoints become important elements of studying application performance in the cloud. In this paper, we showcase and tune two different changepoint detection methods, and demonstrate how the timing of the changepoints they identify can be predicted. We present a gradient-boosting-based prediction method, show that it can achieve good prediction accuracy, and give advice to practitioners on how to use our results.
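
The abstract does not name the two detection methods; as a hedged example of offline changepoint detection on a noisy performance series, the sketch below uses the ruptures library's PELT implementation. The choice of library, cost model, penalty, and the synthetic data are assumptions for illustration only.

```python
import numpy as np
import ruptures as rpt  # pip install ruptures

# Synthetic performance metric with two persistent level shifts plus noise.
rng = np.random.default_rng(seed=42)
signal = np.concatenate([
    rng.normal(100, 3, 200),   # baseline latency (ms)
    rng.normal(120, 3, 150),   # changepoint: degraded performance
    rng.normal( 95, 3, 150),   # changepoint: performance recovers
])

# PELT with an RBF cost; the penalty controls sensitivity to changes.
algo = rpt.Pelt(model="rbf", min_size=20).fit(signal)
changepoints = algo.predict(pen=10)

print("detected changepoint indices:", changepoints)  # last index is len(signal)
```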

10 Years Later: Cloud Computing is Closing the Performance Gap

Can cloud computing infrastructures provide HPC-competitive performance for scientific applications broadly? Despite prolific related literature, this question remains open. Answers are crucial for designing future systems and democratizing high-performance computing. We present a multi-level approach to investigate the performance gap between HPC and cloud computing, isolating different variables that contribute to this gap. Our experiments are divided into (i) hardware and system microbenchmarks and (ii) user application proxies. The results show that today's high-end cloud computing can deliver HPC-competitive performance not only for computationally intensive applications, but also for memory- and communication-intensive applications -- at least at modest scales -- thanks to the high-speed memory systems and interconnects and dedicated batch scheduling now available on some cloud platforms.

Performance and Cost Comparison of Cloud Services for Deep Learning Workload

Many organizations are migrating their on-premise artificial intelligence workloads to the cloud due to the availability of cost-effective and highly scalable infrastructure, software, and platform services. To ease the process of migration, many cloud vendors provide services, frameworks, and tools that can be used to deploy applications on cloud infrastructure. Finding the most appropriate service and infrastructure for a given application, one that delivers the desired performance at minimal cost, is a challenge.

In this work, we present a methodology to migrate a deep-learning-based recommender system to an ML platform and to a serverless architecture. Furthermore, we show our experimental evaluation of the AWS ML platform SageMaker and the serverless platform service Lambda. In our study, we also discuss the performance and cost trade-offs of using cloud infrastructure.

GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows

Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has been put into making ML model-training fast and efficient, e.g., by proposing new ML frameworks (such as TensorFlow, PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelines, distributed training). Matching this trend, considerable effort has also been put into performance analysis tools focusing on ML model-training. However, as we identify in this work, ML model training rarely happens in isolation and is instead one step in a larger ML workflow. Therefore, it is surprising that there exists no performance analysis tool that covers the entire life-cycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance analysis tool for ML workflows. We analyze the state-of-practice and the state-of-the-art, presenting quantitative evidence about the performance of existing performance tools. We formulate our vision for holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient data aggregation and presentation, and close integration in ML systems. Finally, we propose first steps towards implementing our vision as GradeML, a holistic performance analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.

An Empirical Evaluation of the Performance of Video Conferencing Systems

The global COVID-19 pandemic forced society to shift to remote education and work. This shift relies on various video conference systems (VCSs) such as Zoom, Microsoft Teams, and Jitsi, consequently increasing pressure on their digital service infrastructure. Although understanding the performance of these essential cloud services could lead to better designs and improved service deployments, only limited research on this topic currently exists. Addressing this problem, in this work we propose an experimental method to analyze and compare VCSs. Our method is based on real-world experiments where the client-side is controlled, and focuses on VCS resource requirements and performance. We design and implement a tool to automatically conduct these real-world experiments, and use it to compare three platforms on the client side: Zoom, Microsoft Teams, and Jitsi. Our work exposes that there are significant differences between the systems tested in terms of resource usage and performance variability, and provides evidence for a suspected memory leak in Zoom, the system widely regarded as the industry market leader.

SESSION: LTB 2021 Workshop

Viability of Azure IoT Hub for Processing High Velocity Large Scale IoT Data

We utilize the Clemson supercomputer to generate a massive workload for testing the performance of Microsoft Azure IoT Hub. The workload emulates sensor data from a large manufacturing facility. We study the effects of message frequency, distribution, and size on round-trip latency for different IoT Hub configurations. Significant variation in latency occurs when the system exceeds IoT Hub specifications. The results are predictable and well-behaved for a well-engineered system and can meet soft real-time deadlines.

Enabling Containerized, Parametric and Distributed Database Deployment and Benchmarking as a Service

Containerized environments introduce a set of performance challenges that require extensive measurements and benchmarking to identify and model application behavior with regard to a variety of parameters. Databases present extra challenges given their extensive need for synchronization and orchestration of a benchmark run, especially in microservice-oriented technologies (such as container platforms) and dynamic business models such as DBaaS. In this work, we describe the adaptation of Flexibench, our open-source load-injection-as-a-service tool, to enable the automated, parametric launching and measurement of containerized and distributed databases as a service. We describe the adaptation and synchronization needed to ensure correct test sequencing and apply it in a case study on MySQL. As a result, a performance engineer can directly test a selected configuration and measure the performance of a database under a given target workload with simple REST invocations. The experimentation starts from adapting the official MySQL Docker images as well as the OLTP-Bench client images, and investigates scenarios such as parameter-sweep experiments and co-allocation scenarios where multiple DB instances share physical nodes, as expected in the DBaaS paradigm.
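
To illustrate what "simple REST invocations" against a benchmarking-as-a-service frontend might look like, the sketch below posts a parametric test definition and polls for results. The base URL, endpoint paths, and JSON fields are hypothetical and do not represent Flexibench's actual API.

```python
import time
import requests

BASE_URL = "http://flexibench.example.org/api"   # hypothetical endpoint

def launch_mysql_benchmark():
    """Submit a containerized MySQL benchmark run (illustrative schema only)."""
    test_spec = {
        "image": "mysql:8.0",
        "client_image": "oltpbench:latest",
        "instances": 2,                 # co-allocated DB instances per node
        "parameters": {"innodb_buffer_pool_size": "1G", "max_connections": 200},
        "workload": {"benchmark": "tpcc", "terminals": 32, "duration_s": 300},
    }
    run = requests.post(f"{BASE_URL}/tests", json=test_spec).json()
    run_id = run["id"]

    # Poll until the orchestrated run (clients + DB containers) completes.
    while True:
        status = requests.get(f"{BASE_URL}/tests/{run_id}").json()
        if status["state"] in ("FINISHED", "FAILED"):
            return status
        time.sleep(10)

if __name__ == "__main__":
    print(launch_mysql_benchmark())
```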

PIERES: A Playground for Network Interrupt Experiments on Real-Time Embedded Systems in the IoT

IoT devices have become an integral part of our lives and of industry. Many of these devices run real-time systems or are used as part of them. As these devices receive network packets over IP networks, the network interface informs the CPU about their arrival using interrupts that might preempt critical processes. Therefore, the question arises whether network interrupts pose a threat to the real-timeness of these devices. However, there are few tools to investigate this issue. We present a playground which enables researchers to conduct experiments in the context of network interrupt simulation. The playground comprises different network interface controller implementations, load generators, and timing utilities. It forms a flexible and easy-to-use foundation for future network interrupt research. We conduct two verification experiments and two real-world examples. The latter give insight into the impact of the interrupt-handling strategy parameters and the influence of different load types on execution time with respect to these parameters.

How to Measure Scalability of Distributed Stream Processing Engines?

Scalability is promoted as a key quality feature of modern big data stream processing engines. However, even though research has made huge efforts to provide precise definitions and corresponding metrics for the term scalability, experimental scalability evaluations or benchmarks of stream processing engines apply different and inconsistent metrics. With this paper, we aim to establish general metrics for the scalability of stream processing engines. Derived from common definitions of scalability in cloud computing, we propose two metrics: a load capacity function and a resource demand function. Both metrics relate provisioned resources and load intensities while requiring specific service level objectives to be fulfilled. We show how these metrics can be employed for scalability benchmarking and discuss their advantages in comparison to other metrics used for stream processing engines and other software systems.
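
A hedged reading of the two proposed metrics: the load capacity function maps provisioned resources to the highest load intensity that still meets the service level objectives, and the resource demand function maps a load intensity to the fewest resources that meet them. The sketch below computes both from a table of (resources, load, SLO fulfilled) benchmark observations; the data layout is an assumption for illustration, not the paper's formalization.

```python
def load_capacity(observations):
    """Map each resource amount to the highest load intensity meeting the SLOs.

    `observations` is an iterable of (resources, load, slo_fulfilled) tuples,
    e.g. gathered from benchmark runs at increasing load per configuration.
    """
    capacity = {}
    for resources, load, ok in observations:
        if ok:
            capacity[resources] = max(capacity.get(resources, 0), load)
    return capacity

def resource_demand(observations):
    """Map each load intensity to the smallest resource amount meeting the SLOs."""
    demand = {}
    for resources, load, ok in observations:
        if ok:
            demand[load] = min(demand.get(load, float("inf")), resources)
    return demand

if __name__ == "__main__":
    runs = [
        (1, 10_000, True), (1, 20_000, False),
        (2, 20_000, True), (2, 40_000, False),
        (4, 40_000, True),
    ]
    print("load capacity  :", load_capacity(runs))    # {1: 10000, 2: 20000, 4: 40000}
    print("resource demand:", resource_demand(runs))  # {10000: 1, 20000: 2, 40000: 4}
```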

Performance Interference on Key-Value Stores in Multi-tenant Environments: When Block Size and Write Requests Matter

Key-value stores are currently used by major cloud computing vendors, such as Google, Facebook, and LinkedIn, to support large-scale applications with concurrent read and write operations. Thanks to very simple data access APIs, key-value stores can deliver outstanding throughput, and they are commonly deployed on high-performance solid-state drives (SSDs) to boost this performance even further. However, measuring performance interference on SSDs while sharing cloud computing resources is complex and not well covered by current benchmarks and tools. Different applications can access these resources concurrently until they become overloaded, without either the benchmark or the cloud application noticing. In this paper, we define a methodology to measure this performance interference. Depending on the block size and the proportion of concurrent write operations, we show how a key-value store may quickly degrade in throughput until it becomes almost inoperative while sharing persistent storage resources with other tenants.

SESSION: PECS 2021 Workshop

Transactions in the Era of Non Volatile Memory and Heterogeneous Memory Architectures

Transactions are a simple, yet powerful, abstraction that aims at masking programmers from the complexity of having to ensure correct and efficient synchronization of concurrent code.

Originally introduced in the domain of database systems, transactions have recently garnered significant interest in the broader domain of concurrent programming, via the Transactional Memory (TM) paradigm. Nowadays, hardware support for TM is provided in commodity CPUs (e.g., by Intel and IBM) and, at the software level, TM has been integrated into mainstream programming languages, such as C/C++ and Java.

In this talk I will present the novel challenges and research opportunities that arise in the area of TM due to the emergence of two recent hardware trends, namely Non-Volatile Memory (NVM) and heterogeneous computing architectures.

On the front of NVM, I will focus on the problem of how to allow the execution of transactions over NVM using unmodified commodity hardware TM (HTM) implementations. The reliance of commodity HTM implementations on CPU caches raises a crucial problem when applications access data stored in NVM from within an HTM transaction. Since CPU caches are volatile in today's systems, HTM implementations do not guarantee that the effects of a hardware transaction are atomically transposed to NVM when the transaction commits, although such effects are immediately visible to subsequent transactions.

In this talk, I will overview some recent approaches to tackle this problem and present experimental results highlighting several bottlenecks that hinder the scalability of existing solutions. Next, I will show how these limitations can be addressed by presenting SPHT. SPHT introduces a novel commit logic that considerably mitigates the scalability bottlenecks of previous alternatives, providing up to 2.6x/2.2x speedups at 64 threads in STAMP and TPC-C, respectively. Moreover, SPHT introduces a novel approach to log replay that employs cross-transaction log linking and a NUMA-aware parallel background replayer. In large persistent heaps, the proposed approach achieves gains of 2.8x.

On the front of heterogeneous computing, I will present the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions.

Besides introducing the abstract semantics and programming model of HeTM, I will present the design and evaluation of a concrete implementation of the proposed abstraction, which we named Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages speculative techniques that aim at hiding the large communication latency between CPUs and discrete GPUs and at minimizing inter-device synchronization overhead.

An Experimental Evaluation of Workload Driven DVFS

Modern processors support dynamic voltage and frequency scaling (DVFS) that can be leveraged by BIOS or OS drivers to regulate energy consumption at run-time. In this paper, we describe the results of a study that explores the effectiveness of the existing DVFS governors by measuring performance, energy efficiency, and the product of performance and energy efficiency (PxEE) when running both the speed and throughput SPEC CPU2017 benchmark suites. We find that the processor operates at the highest clock frequency even when ~90% of all active CPU cycles are stalled, resulting in poor energy efficiency, especially in the case of memory-intensive benchmarks. To remedy this problem, we introduce two new workload-driven DVFS techniques that utilize hardware events, (i) the percentage of all stalls (FS-Total Stalls) and (ii) the percentage of memory-related stalls (FS-Memory Stalls), linearly mapping them onto the available clock frequencies every 10 ms. Our experimental evaluation finds that the proposed techniques considerably improve PxEE relative to the case when the processor runs at a fixed, nominal frequency. FS-Total Stalls improves PxEE by ~26% when all benchmarks are considered and ~67% when only memory-intensive benchmarks are considered, whereas FS-Memory Stalls improves PxEE by ~15% and ~41%, respectively. The proposed techniques thus outperform a prior proposal that utilizes cycles per instruction to control clock frequencies (FS-CPI), which improves PxEE by only 4% and 9%, respectively.
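
As a sketch of the linear mapping the abstract describes (a stall percentage read from hardware counters mapped onto the available frequency range every 10 ms), the code below shows one possible control step. The frequency range, the sysfs write path, and the control details are illustrative assumptions rather than the authors' implementation; FS-Total Stalls and FS-Memory Stalls would differ only in which stall counter feeds the mapping.

```python
F_MIN_KHZ = 1_200_000   # lowest available core frequency (illustrative)
F_MAX_KHZ = 3_400_000   # nominal/highest core frequency (illustrative)

def target_frequency(stall_fraction, f_min=F_MIN_KHZ, f_max=F_MAX_KHZ):
    """Linearly map the fraction of stalled cycles onto the frequency range.

    0% stalls -> run at f_max; 100% stalls -> drop to f_min.
    """
    stall_fraction = min(max(stall_fraction, 0.0), 1.0)
    return int(f_max - stall_fraction * (f_max - f_min))

def control_step(stalled_cycles, total_cycles, cpu=0):
    """One 10 ms control interval: read counters, pick and apply a frequency."""
    freq_khz = target_frequency(stalled_cycles / max(total_cycles, 1))
    path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_setspeed"
    try:
        with open(path, "w") as f:   # requires the userspace cpufreq governor + root
            f.write(str(freq_khz))
    except OSError:
        pass                          # illustrative: ignore if not permitted
    return freq_khz

if __name__ == "__main__":
    for stalled in (0.1, 0.5, 0.9):
        print(f"stall fraction {stalled:.1f} -> {target_frequency(stalled)} kHz")
```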

Investigating the Cause and Effect of an AMD Zen Energy Management Anomaly

This paper discusses an architectural anomaly observed on server processors of the AMD Zen microarchitecture: at a specific operating point, increasing the number of active cores reduces system power consumption while increasing performance more than proportionally to the additional cores. The anomaly is rooted in the hardware control loop for energy management and is independent of software. Experiments show a connection to the AMD turbo frequency feature Max Core Boost Frequency (MCBF). In less efficient configurations, this feature could be employed from the processor's perspective, even though it is not necessarily used on any core. Voltage measurements indicate that the availability of MCBF leads to a higher voltage from the mainboard voltage regulators, subsequently raising power consumption unnecessarily.

We describe the impact of this anomaly on the performance and energy-efficiency of several micro-benchmarks. The reduced power consumption when additional cores are enabled can lead to higher core frequencies and increased per-core-performance. The presented findings can be used to avoid inefficient core configurations and reduce the overall energy-to-solution.

SESSION: WEPPE 2021 Workshop

The Role of Analytical Models in the Engineering and Science of Computer Systems

2021 is the 50th anniversary of SIGMETRICS, the ACM Special Interest Group on Performance Evaluation. For this occasion, I wrote a review of the role played by analytical modeling, a major topic in SIGMETRICS, in the engineering and science of computer systems. This talk is a summary of that review.

Performance Monitoring Guidelines

Monitoring, that is, the process of collecting measurements on infrastructures and services, is an important subject of performance engineering. Although monitoring is not a new education topic, nowadays its relevance is rapidly increasing and its application is particularly demanding due to the complex distributed architectures of new and emerging technologies. As a consequence, monitoring has become a "must have" skill for students majoring in computer science and in computing-related fields. In this paper, we present a set of guidelines and recommendations to plan, design and setup sound monitoring projects. Moreover, we investigate and discuss the main challenges to be faced to build confidence in the entire monitoring process and ensure measurement quality. Finally, we describe practical applications of these concepts in teaching activities.

Experience with Teaching Performance Measurement and Testing in a Course on Functional Testing

Stevens Institute of Technology offers a graduate course on functional software testing that addresses test planning driven by use cases, the use of software tools, and the derivation of test cases to achieve coverage with minimal effort. The course also contains material on performance testing. Teaching performance testing and measurement in a university setting was challenging because giving the students access to a target system would have required more time, resources, and planning than were available. We addressed these challenges (a) by showing the students how resource usage could be measured in a controlled way with the instrumentation that comes with most modern laptops by default, and (b) by having the students use JMeter to measure the response times of existing websites. We describe how students were introduced to the concept of a controlled performance test by playing recordings of the same musical piece with and without video. We make recommendations for avoiding the ethical issue that emerged: one should not subject a system one does not own to anything but the most trivial loads. We also describe some successes and pitfalls in this effort.

An Analysis of Distributed Systems Syllabi With a Focus on Performance-Related Topics

We analyze a dataset of 51 current (2019-2020) Distributed Systems syllabi from top Computer Science programs, focusing on finding the prevalence and context in which topics related to performance are being taught in these courses. We also study the scale of the infrastructure mentioned in DS courses, from small client-server systems to cloud-scale, peer-to-peer, global-scale systems. We make eight main findings, covering goals such as performance and scalability with its variant elasticity; activities such as performance benchmarking and monitoring; eight selected performance-enhancing techniques (replication, caching, sharding, load balancing, scheduling, streaming, migrating, and offloading); and control issues such as trade-offs that include performance and performance variability.

A New Course on Systems Benchmarking - For Scientists and Engineers

A benchmark is a tool coupled with a methodology for the evaluation and comparison of systems or components with respect to specific characteristics, such as performance, reliability, or security. Benchmarks enable educated purchasing decisions and play an important role as evaluation tools during system design, development, and maintenance. In research, benchmarks play an integral part in the evaluation and validation of new approaches and methodologies. Traditional benchmarks have been focused on evaluating performance, typically understood as the amount of useful work accomplished by a system (or component) compared to the time and resources used. Ranging from simple benchmarks, targeting specific hardware or software components, to large and complex benchmarks focusing on entire systems (e.g., information systems, storage systems, cloud platforms), performance benchmarks have contributed significantly to improving successive generations of systems. Beyond traditional performance benchmarking, research on dependability benchmarking has increased in the past two decades. Due to the increasing relevance of security issues, security benchmarking has also become an important research field. Finally, resilience benchmarking faces challenges related to the integration of performance, dependability, and security benchmarking as well as to the adaptive characteristics of the systems under consideration. Each benchmark is characterized by three key aspects: metrics, workloads, and measurement methodology. The metrics determine what values should be derived based on measurements to produce the benchmark results. The workloads determine under which usage scenarios and conditions (e.g., executed programs, induced system load, injected failures/security attacks) measurements should be performed to derive the metrics. Finally, the measurement methodology defines the end-to-end process to execute the benchmark, collect measurements, and produce the benchmark results. The increasing size and complexity of modern systems make the engineering of benchmarks a challenging task. Thus, we see the need for better education on the theoretical and practical foundations necessary for gaining a deep understanding of benchmarking and the benchmark engineering process. In this talk, we present an overview of a new course focused on systems benchmarking, based on our book "Systems Benchmarking - For Scientists and Engineers" (http://benchmarking-book.com/). The course captures our experience gained over the past 15 years of teaching a regular graduate course on performance engineering of computing systems. The latter has been taught at four European universities since 2006: the University of Cambridge, the Technical University of Catalonia, the Karlsruhe Institute of Technology, and the University of Würzburg. The conception, design, and development of benchmarks requires a thorough understanding of benchmarking fundamentals beyond an understanding of the system under test, including statistics, measurement methodologies, metrics, and relevant workload characteristics. The course addresses these issues in depth; it covers how to determine relevant system characteristics to measure, how to measure these characteristics, and how to aggregate the measurement results into a metric. Further, the aggregation of metrics into scoring systems, as well as the design of workloads, including workload characterization and modeling, are additional challenging topics that are covered.
Finally, modern benchmarks and their application in industry and research are studied. We cover a broad range of different application areas for benchmarking, presenting contributions in specific fields of benchmark development. These contributions address the unique challenges that arise in the conception and development of benchmarks for specific systems or subsystems. They also demonstrate how the foundations and concepts of the first part of the course are being used in existing benchmarks.

Performance Engineering and Database Development at MongoDB

Performance and the related properties of stability and resilience are essential to MongoDB. We have invested heavily in these areas: involving all development engineers in aspects of performance, building a team of specialized performance engineers to understand issues that do not fit neatly within the scope of individual development teams, and dedicating multiple teams to develop and support tools for performance testing and analysis.

We have built an automated infrastructure for performance testing that is integrated with our continuous integration system. Performance tests routinely run against our development branch in order to identify changes in performance as early as possible. We have invested heavily to ensure both that results are low noise and reproducible and that we can detect when performance changes. We continue to invest to make the system better and to make it easier to add new workloads.

All development engineers are expected to interact with our performance system: investigating performance changes, fixing regressions, and adding new performance tests. We also expect performance to be considered at project design time. The project design should ensure that there is appropriate performance coverage for the project, which may require repurposing existing tests or adding new ones.

Not all performance issues are specific to a team or software module. Some issues emerge from the interaction of multiple modules or interaction with external systems or software. To attack these larger problems, we have a dedicated performance team. Our performance team is responsible for investigating these more complex issues, identifying high value areas for improvement, as well as helping guide the development engineers with their performance tests.

We have experience both hiring and training engineers for the performance engineering skills needed to ship a performant database system. In this talk we will cover the skills needed for our performance activities and which skills, if added to undergraduate curricula, would help us the most. We will address the skills we would like all development engineers to have, as well as those for our dedicated performance team.

Software Performance Engineering Education: What Topics Should be Covered?

This presentation considers elements of Software Performance Engineering (SPE) and how they have evolved. It addresses both skills needed by practitioners and areas of research. Which topics should be covered? How can the education cover realistic systems and problems? What is the history of SPE and is it relevant today? Are these topics unique to SPE education? How should SPE education be integrated with other specialties in Computer Science and Engineering?

SESSION: WOSP-C 2021 Workshop

Towards Extraction of Message-Based Communication in Mixed-Technology Architectures for Performance Model

Software systems architected using multiple technologies are becoming popular. Many developers adopt these technologies because they offer high service quality, often already optimized for performance. Despite the fact that performance is key for technology-mixed software applications, there is still little research on performance evaluation approaches that explicitly consider the extraction of the architecture for modelling and predicting performance.

In this paper, we discuss the opportunities and challenges in applying existing architecture extraction approaches to support model-driven performance prediction for technology-mixed software, and how these approaches can be extended to support message-based systems. We describe how the architecture derived from the various technologies can be transformed to create the performance model. To ground the work, we use a case study from the energy system domain as a running example to support our arguments and observations throughout the paper.

Performance Evaluation and Improvement of Real-Time Computer Vision Applications for Edge Computing Devices

Advances in deep neural networks have provided a significant improvement in accuracy and speed across a large range of Computer Vision (CV) applications. However, our ability to perform real-time CV on edge devices is severely restricted by their limited computing capabilities. In this paper we employ Vega, a parallel graph-based framework, to study the performance limitations of four heterogeneous edge-computing platforms, while running 12 popular deep learning CV applications.

We expand the framework's capabilities, introducing two new performance enhancements: 1) an adaptive stage instance controller (ASI-C) that can improve performance by dynamically selecting the number of instances for a given stage of the pipeline; and 2) an adaptive input resolution controller (AIR-C) to improve responsiveness and enable real-time performance. These two solutions are integrated together to provide a robust real-time solution.

Our experimental results show that ASI-C improves run-time performance by 1.4x on average across all heterogeneous platforms, achieving a maximum speedup of 4.3x while running face detection executed on a high-end edge device. We demonstrate that our integrated optimization framework improves performance of applications and is robust to changing execution patterns.
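
As a hedged sketch of the idea behind ASI-C (dynamically choosing how many instances serve a pipeline stage based on observed throughput), the controller below adds an instance while throughput keeps improving and backs off otherwise. It illustrates the control idea only; the class name, thresholds, and hill-climbing strategy are assumptions, not the Vega implementation.

```python
class AdaptiveStageInstanceController:
    """Hill-climbing controller for the instance count of one pipeline stage.

    Illustrative only: measures stage throughput per control interval and
    adds an instance while throughput improves, otherwise backs off.
    """

    def __init__(self, min_instances=1, max_instances=8):
        self.min_instances = min_instances
        self.max_instances = max_instances
        self.instances = min_instances
        self.best_throughput = 0.0

    def update(self, measured_throughput_fps):
        if measured_throughput_fps > self.best_throughput * 1.05:
            # Still gaining: try one more instance for this stage.
            self.best_throughput = measured_throughput_fps
            self.instances = min(self.instances + 1, self.max_instances)
        elif measured_throughput_fps < self.best_throughput * 0.9:
            # Regression (e.g. contention on the edge device): back off.
            self.instances = max(self.instances - 1, self.min_instances)
        return self.instances


if __name__ == "__main__":
    ctrl = AdaptiveStageInstanceController()
    for fps in [12.0, 18.0, 24.0, 23.0, 15.0]:
        print(f"fps={fps:5.1f} -> instances={ctrl.update(fps)}")
```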

Performance Models of Event-Driven Architectures

Event-driven architecture (EDAs) improves scalability by combining stateless servers and asynchronous interactions. Models to predict the performance of pure EDA systems are relatively easy to make, systems with a combination of event-driven components and legacy components with blocking service requests (synchronous interactions) require special treatment. Layered queueing was developed for such systems, and this work describes a method for combining event-driven behaviour and synchronous behaviour in a layered queueing model. The performance constraints created by the legacy components can be explored to guide decisions regarding converting them, or reconfiguring them, when the system is scaled.

On Preventively Minimizing the Performance Impact of Black Swans (Vision Paper)

Recent episodes of web overloads suggest the need to test system performance under loads that reflect extreme variations in usage patterns well outside normal anticipated ranges. These loads are sometimes expected or even scheduled. Examples of expected loads include surges in transactions or request submission when popular rock concert tickets go on sale, when the deadline for the submission of census forms approaches, and when a desperate population is attempting to sign up for a vaccination during a pandemic. Examples of unexpected loads are the surge in unemployment benefit applications in many US states with the onset of COVID19 lockdowns and repeated queries about the geographic distribution of signatories on the U.K. Parliament's petition website prior to a Brexit vote in 2019. We will consider software performance ramifications of these examples and the architectural questions they raise. We discuss how modeling and performance testing and known processes for evaluating architectures and designs can be used to identify potential performance issues that would be caused by sudden increases in load or changes in load patterns.

Performance Modelling of Intelligent Transportation Systems: Experience Report

Modern information systems connecting software, physical systems and people, are usually characterized by high dynamism. These dynamics introduce uncertainties, which in turn may harm the quality of systems and lead to incomplete, inaccurate, and unreliable results. To deal with this issue, in this paper we report our incremental experience on the usage of different performance modelling notations while analyzing Intelligent Transportation Systems. More specifically, Queueing Networks and Petri Nets have been adopted and interesting insights are derived.

SESSION: Tutorials

Enhancing Observability of Serverless Computing with the Serverless Application Analytics Framework

To improve the observability of workload performance, resource utilization, and infrastructure underlying serverless Function-as-a-Service (FaaS) platforms, we have developed the Serverless Application Analytics Framework (SAAF). SAAF provides a reusable framework supporting multiple programming languages that developers can leverage to inspect performance, resource utilization, scalability, and infrastructure metrics of function deployments to commercial and open-source FaaS platforms. To automate reproducible FaaS performance experiments, we provide the FaaS Runner as a multithreaded FaaS client. FaaS Runner provides a programmable client that can orchestrate over one thousand concurrent FaaS function calls. The ReportGenerator is then used to aggregate experiment output into CSV files for consumption by popular data analytics tools. SAAF and its supporting tools combined can assess forty-eight distinct metrics to enhance observability of serverless software deployments. In this tutorial paper, we describe SAAF and its supporting tools and provide examples of observability insights that can be derived.
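
As an illustration of the kind of concurrent orchestration FaaS Runner performs, the snippet below drives many concurrent invocations of an HTTP-triggered function and collects per-call latency. It is a generic sketch, not FaaS Runner's actual interface; the endpoint URL, payload fields, and parameters are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical HTTP trigger for a deployed FaaS function.
FUNCTION_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/hello"

def invoke_once(payload):
    """Call the HTTP-triggered function once and record round-trip latency."""
    start = time.perf_counter()
    response = requests.post(FUNCTION_URL, json=payload, timeout=60)
    return {
        "latency_s": time.perf_counter() - start,
        "status": response.status_code,
    }

def run_experiment(concurrency=100, calls=1000):
    """Issue many concurrent invocations, as a FaaS load experiment would."""
    payloads = [{"call_id": i} for i in range(calls)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(invoke_once, payloads))
    latencies = sorted(r["latency_s"] for r in results)
    return {"p50_s": latencies[len(latencies) // 2], "max_s": latencies[-1]}

if __name__ == "__main__":
    print(run_experiment(concurrency=10, calls=50))
```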

SESSION: Poster & Demo Papers

Comparison of Object Detectors for Fully Autonomous Aerial Systems Performance

Unmanned aerial vehicles (UAVs) are gaining popularity in many governmental and civilian sectors. The combination of aerial mobility and data sensing capabilities facilitates previously impossible workloads. UAVs are normally piloted by remote operators who determine where to fly and when to sense data, but operations over large areas put a heavy burden on human pilots. Fully autonomous aerial systems (FAAS) have emerged as an alternative to human piloting by using software combined with edge and cloud hardware to execute autonomous UAV missions. The compute and networking infrastructure required for autonomy has significant power and performance demands. FAAS deployed in remote environments, such as crop fields, must manage limited power and networking capabilities. To facilitate widespread adoption of FAAS, middleware must support heterogeneous compute and networking resources at the edge while ensuring that the workloads quickly produce effective and efficient autonomous flight paths. Object detectors are a vital component of FAAS. FAAS flight mission goals and flight path generation are often focused on locating and photographing phenomena identified using object detectors. Given the importance of object detection to FAAS, it is paramount that object detectors produce accurate results as quickly and efficiently as possible to elongate FAAS missions and save precious energy. In this poster, we analyze the performance of different object detection techniques for facial recognition, a core FAAS workload. We analyzed the accuracy and performance of three facial recognition techniques provided in SoftwarePilot, an FAAS middleware, on two architectural configurations for FAAS edge systems. These findings can be used when selecting an object detector for any FAAS mission type and hardware configuration.

SPEC — Spotlight on the International Standards Group (ISG)

The driving philosophy for the Standard Performance Evaluation Corporation (SPEC) is to ensure that the marketplace has a fair and useful set of metrics to differentiate systems, by providing standardized benchmark suites and international standards. This poster-paper gives an overview of SPEC with a focus on the newly founded International Standards Group (ISG).

Demonstration Paper: Monitoring Machine Learning Contracts with QoA4ML

Using machine learning (ML) services, both service customers and providers need to monitor complex contractual constraints of ML services that are strongly related to ML models and data. Therefore, establishing and monitoring comprehensive ML contracts is crucial in ML serving. This paper demonstrates a set of features and utilities of the QoA4ML framework for ML contracts.

SESSION: Works-in-Progress

An Empirical Evaluation of Video Conferencing Systems Used in Industry, Academia, and Entertainment

Video Conferencing Systems (VCS) are used daily: at work, in online education, and for get-togethers with friends and family. Many new VCSs have emerged in the past decade, and a new market leader has risen during the coronavirus period of 2020. Understanding how these systems work could help us improve them rapidly. However, no experimental comparison of such systems currently exists. In this work we propose a method to compare VCSs in real-world operation and implement it as a tool. Our method considers four main kinds of real-world experiments. Each captures different aspects, such as communication channels (audio, video, audio-video) and types of network environments (e.g., Ethernet, WiFi, 4G), and reports system and network utilization. We further implement an automated tool to conduct these real-world experiments, and experiment with three popular VCSs: Zoom, Microsoft Teams, and Discord. We find that there are significant differences in the performance of these systems and in their behavior across different environments.

Buzzy: Towards Realistic DBMS Benchmarking via Tailored, Representative, Synthetic Workloads: Vision Paper

Distributed Database Management Systems (DBMS) are a crucial component of modern IT applications. Understanding their performance and non-functional properties is of paramount importance. Yet, benchmarking distributed DBMS has proven to be difficult in practice. Either a realistic workload is mapped to a synthetic workload without knowing whether this mapping is correct, or available workload traces are replayed. While the latter approach provides more realistic results, real-world traces are hard to obtain and their scope is limited in time scale and variance.

We propose collecting real-world traces and then applying data generation techniques to synthesize similar realistic traces based on them. With this approach, we can obtain benchmarking workloads that exhibit variability with respect to different aspects of interest while still being similar to the original traces. By varying the generation parameters, we are able to support benchmarking of what-if scenarios with hypothetical workloads and introduced anomalies.
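
One hedged way to "synthesize similar realistic traces" is to fit distributions to properties of a collected trace (for example, inter-arrival times and the read/write mix) and then sample new traces, perturbing the fitted parameters to create what-if scenarios. The sketch below illustrates this idea only and is not the Buzzy tool; the trace format and model are assumptions.

```python
import random
import statistics

def fit_trace(trace):
    """Fit a simple model to a real trace: mean inter-arrival gap + op mix.

    `trace` is a list of (timestamp_s, operation) tuples, operation in {"read", "write"}.
    """
    gaps = [b[0] - a[0] for a, b in zip(trace, trace[1:])]
    return {
        "mean_gap_s": statistics.mean(gaps),
        "write_ratio": sum(op == "write" for _, op in trace) / len(trace),
    }

def synthesize(model, n_ops, load_factor=1.0, write_shift=0.0):
    """Sample a synthetic trace; load_factor/write_shift enable what-if scenarios."""
    t, out = 0.0, []
    write_ratio = min(max(model["write_ratio"] + write_shift, 0.0), 1.0)
    for _ in range(n_ops):
        t += random.expovariate(load_factor / model["mean_gap_s"])
        op = "write" if random.random() < write_ratio else "read"
        out.append((t, op))
    return out

if __name__ == "__main__":
    real = [(i * 0.01, "write" if i % 5 == 0 else "read") for i in range(1000)]
    model = fit_trace(real)
    hypothetical = synthesize(model, n_ops=1000, load_factor=2.0, write_shift=0.1)
    print(model, hypothetical[:3])
```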

Towards a Benchmark for Software Resource Efficiency

Data centers already account for over 250 TWh of energy usage every year, and their energy demand will grow beyond 1 PWh by 2030 even in the best-case scenarios of some studies. As this demand cannot be met with renewable sources as of today, this growth will lead to a further increase in CO2 emissions. Data center growth is mainly driven by software resource usage, but most energy efficiency improvements are nowadays made at the hardware level and cannot compensate for the growing demand. To reduce the resource demand of software in data centers, one needs to be able to quantify its resource usage. Therefore, we propose a benchmark to assess the resource consumption of data center software (i.e., cloud applications) and make the resource usage of standard application types comparable between vendors. This benchmark aims to support three main goals: (i) software vendors should be able to get an understanding of the resource consumption of their software; (ii) software buyers should be able to compare the software of different vendors; and (iii) the benchmark should spark competition between software vendors to make their software more efficient and thus, in the long term, curb data center growth as software systems require fewer resources.

Optimization of Java Virtual Machine Flags using Feature Model and Genetic Algorithm

Optimizing the Java Virtual Machine (JVM) options in order to get the best performance out of a program for production is a challenging and time-consuming task. HotSpot, Oracle's open-source Java VM implementation, offers more than 500 options, called flags, that can be used to tune the JVM's compiler, garbage collector (GC), heap size, and much more. In addition to being numerous, these flags are sometimes poorly documented, which creates a need for benchmarking to ensure that the flags and their associated values deliver the best performance and stability for a particular program.

Auto-tuning approaches have already been proposed to mitigate this burden. However, in spite of increasingly sophisticated search techniques allowing for powerful optimizations, these approaches take little account of the underlying complexities of JVM flags. Indeed, dependencies and incompatibilities between flags are non-trivial to express and, if not taken into account, may lead to invalid or spurious flag configurations that should not be considered by the auto-tuner.

In this paper, we propose a novel model, inspired by the feature models used in Software Product Lines, which takes the complexity of the JVM's flags into account. We then demonstrate the usefulness of this model by using it as input to a Genetic Algorithm (GA) that optimizes the execution times of the DaCapo benchmarks.
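
As a hedged sketch of the overall approach (a genetic algorithm searching over JVM flag configurations while a validity check, standing in for the feature model, rejects inconsistent combinations), the code below evolves a small flag set against a user-supplied benchmark function. The flag space, the example constraint, and the fitness function are illustrative assumptions, not the paper's model.

```python
import random

# Illustrative flag space; a real feature model would encode many more flags
# and their dependencies/incompatibilities.
FLAG_SPACE = {
    "gc": ["-XX:+UseG1GC", "-XX:+UseParallelGC", "-XX:+UseZGC"],
    "heap_mb": [512, 1024, 2048, 4096],
    "tiered": ["-XX:+TieredCompilation", "-XX:-TieredCompilation"],
}

def is_valid(config):
    """Stand-in for the feature model: reject inconsistent combinations.
    Example constraint (assumed for illustration): ZGC needs >= 1 GiB of heap."""
    return not (config["gc"] == "-XX:+UseZGC" and config["heap_mb"] < 1024)

def random_config():
    while True:
        cfg = {k: random.choice(v) for k, v in FLAG_SPACE.items()}
        if is_valid(cfg):
            return cfg

def crossover_mutate(a, b, mutation_rate=0.2):
    """Combine two parent configurations and randomly mutate some flags."""
    child = {k: random.choice([a[k], b[k]]) for k in FLAG_SPACE}
    for k in FLAG_SPACE:
        if random.random() < mutation_rate:
            child[k] = random.choice(FLAG_SPACE[k])
    return child if is_valid(child) else random_config()

def optimize(benchmark, generations=20, population_size=12):
    """benchmark(config) -> execution time in seconds (lower is better),
    e.g. obtained by launching a DaCapo benchmark with the corresponding flags."""
    population = [random_config() for _ in range(population_size)]
    for _ in range(generations):
        scored = sorted(population, key=benchmark)
        parents = scored[: population_size // 2]          # elitist selection
        population = parents + [
            crossover_mutate(random.choice(parents), random.choice(parents))
            for _ in range(population_size - len(parents))
        ]
    return min(population, key=benchmark)

if __name__ == "__main__":
    # Toy fitness standing in for a real benchmark run.
    fake_benchmark = lambda cfg: cfg["heap_mb"] ** -0.5 + (cfg["gc"] != "-XX:+UseG1GC")
    print(optimize(fake_benchmark, generations=5, population_size=8))
```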