ICPE '20: Proceedings of the ACM/SPEC International Conference on Performance Engineering


SESSION: Keynote Talks

Developing Effective Software Productively

It is not uncommon to hear laments about how long it takes to build software systems and how often, once built, those systems fail to meet the needs and desires of users. Given that attention has been paid to how we build large software systems for over fifty years, you might wonder why we have not figured out how to build the systems people want in a reasonable amount of time. To put the problems into perspective, fifty years is half the life-span of a Galapagos turtle, and many software systems may be amongst the most complex systems ever built by humans. In that light, perhaps it is not surprising that we have not figured it all out. In this talk, I will explore what productivity means to software developers, how we might track the value delivered in the software developers produce, and how we might begin to think about measuring the productive delivery of effective software.

Mining Traces of Embedded Software Systems for Insights

Embedded safety-critical systems are essential for today's society as we rely on them in all aspects of our life. Should safety-critical systems fail to meet their specified function, they have the potential to cause harm to people, cause loss of capital infrastructure, or cause significant damage to the environment. Safety-critical systems are becoming increasingly complex; the more complex, the higher the risk of safety hazards for the public. With the increase of automation in driving and other areas, the complexity and criticality of the software will continue to increase drastically. Computer assistance will become essential for humans to gain a deep understanding of the programs underlying modern systems. Mining specifications and properties from program traces is a promising approach to help humans understand modern complex programs. Understanding temporal dependencies in relation to performance is one aspect of such an endeavour. A specification mined from a system trace can allow the developer to understand, among other things, task dependencies, activation patterns, and response triggers. The artefacts produced by mining are useful for system designers, developers, and safety managers, and can even provide input for other tools. This talk introduces the concepts behind mining traces of embedded software programs and discusses the challenges of building practical tools.

SESSION: SESSION 1: Performance Portability

Out of Band Performance Monitoring of Server Workloads: Leveraging RESTful API to monitor compute resource utilization and performance related metrics for server performance analysis.

Performance monitoring is a useful tool to leverage when additional insight is needed or warranted while evaluating the performance results of a compute solution or benchmark. More often than not, however, performance monitoring is an afterthought, utilized only when unexpected results are encountered. Given that most methods for implementing performance monitoring require running additional applications or kernel code in parallel with the application or benchmark itself, it is understandable that there may be a bias against, or a reluctance to leverage, performance monitoring capabilities throughout the performance evaluation period, as well as to extend its use into a production environment. In this paper, we introduce a performance monitoring architecture that leverages an out-of-band (OOB) approach for measuring key server resource performance metrics. We demonstrate that this approach has zero impact on the performance of the workload running on the server itself and is suitable for use in a production environment. Although the out-of-band approach inherently limits the amount of information that can be gathered for performance analysis, we demonstrate the usefulness of the information that is available today in debugging performance-related issues.

Transferring Pareto Frontiers across Heterogeneous Hardware Environments

Software systems provide user-relevant configuration options called features. Features affect functional and non-functional system properties, and selections of features represent system configurations. A subset of the configuration space forms a Pareto frontier of configurations that are optimal with respect to multiple properties, from which a user can choose the best configuration for a particular scenario. However, when a well-studied system is redeployed on different hardware, information about property values and the Pareto frontier might no longer apply. We investigate whether it is possible to transfer this information across heterogeneous hardware environments. We propose a methodology for approximating and transferring Pareto frontiers of configurable systems across different hardware environments. We approximate a Pareto frontier by training an individual predictor model for each system property and by aggregating the predictions for each property into an approximated frontier. We transfer the approximated frontier across hardware by training a transfer model for each property, applying it to the respective predictor, and combining the transferred properties into a frontier. We evaluate our approach by modeling Pareto frontiers as binary classifiers that separate all system configurations into optimal and non-optimal ones. Thus, we can assess the quality of approximated and transferred frontiers using common statistical measures such as sensitivity and specificity. We test our approach using five real-world software systems from the compression domain, while paying special attention to their performance. Evaluation results demonstrate that the accuracy of approximated frontiers depends linearly on the predictors' training sample sizes, whereas transferring introduces only a minor additional error to a frontier, even for small training sizes.
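To make the approximation step concrete, the sketch below (not the authors' implementation; the synthetic data, the random-forest model choice, and the two-property setup are all assumptions) trains one predictor per property on a small measured sample, predicts both properties for every configuration, and treats the predicted Pareto-optimal set as a binary classification that can be scored with sensitivity and specificity.

# Illustrative sketch of approximating a Pareto frontier with per-property
# predictors, in the spirit of the abstract above (not the authors' code).
# All data here is synthetic and the model choices are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_cfg, n_feat = 400, 8
X = rng.integers(0, 2, size=(n_cfg, n_feat)).astype(float)   # feature selections

# Hypothetical ground-truth properties (lower is better for both).
time = 5 + X @ rng.uniform(0.1, 2.0, n_feat) + rng.normal(0, 0.1, n_cfg)
size = 3 + X @ rng.uniform(0.1, 1.5, n_feat) + rng.normal(0, 0.1, n_cfg)
props = np.column_stack([time, size])

def pareto_mask(P):
    """Boolean mask of Pareto-optimal rows (minimization in every column)."""
    optimal = np.ones(len(P), dtype=bool)
    for i, p in enumerate(P):
        dominated = np.any(np.all(P <= p, axis=1) & np.any(P < p, axis=1))
        optimal[i] = not dominated
    return optimal

# Train one predictor per property on a small measured sample.
train = rng.choice(n_cfg, size=80, replace=False)
models = [RandomForestRegressor(random_state=0).fit(X[train], props[train, j])
          for j in range(props.shape[1])]
pred = np.column_stack([m.predict(X) for m in models])

true_front, approx_front = pareto_mask(props), pareto_mask(pred)
tp = np.sum(true_front & approx_front)
sensitivity = tp / true_front.sum()
specificity = np.sum(~true_front & ~approx_front) / (~true_front).sum()
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")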

Modeling of Request Cloning in Cloud Server Systems using Processor Sharing

Interest in studying server systems subject to cloned requests has recently increased. In this paper we present a model that allows us to equivalently represent a system of servers with cloned requests as a single server. The model is very general, and we show that no assumptions on either inter-arrival or service time distributions are required, allowing for, e.g., both heterogeneity and dependencies. Further, we show that the model holds for any queuing discipline. However, we focus our attention on Processor Sharing, as the discipline has not been studied before in this context. The key requirement that enables us to use the single server G/G/1 model is that the request clones have to receive synchronized service. We show examples of server systems fulfilling this requirement. We also use our G/G/1 model to co-design traditional load-balancing algorithms together with cloning strategies, providing well-performing and provably stable designs. Finally, we also relax the synchronized service requirement and study the effects of non-perfect synchronization. We derive bounds for how common imperfections that occur in practice, such as arrival and cancellation delays, affect the accuracy of our model. We empirically demonstrate that the bounds are tight for small imperfections, and that our co-design method for the popular Join-Shortest-Queue (JSQ) policy can be used even under relaxed synchronization assumptions with small loss in accuracy.
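As a toy illustration of why synchronized service is the key requirement, the following sketch (not the paper's general G/G/1 model; the Poisson-arrival assumption, the lognormal service times, and the rates are all made up) treats a request with c synchronized clones as finishing when its fastest clone would, and plugs the resulting effective service time into the M/G/1 Processor Sharing mean response time formula E[T] = E[S]/(1 - lambda*E[S]).

# Toy illustration (not the paper's general G/G/1 result): with synchronized
# service, a request with c clones effectively completes when its fastest copy
# would, so one way to picture the cloned system is a single server whose
# service time is the minimum over the clones. For Poisson arrivals the mean
# response time of an M/G/1-PS queue depends only on the mean service time:
# E[T] = E[S]/(1 - lambda*E[S]).  The distributions below are assumptions.
import numpy as np

rng = np.random.default_rng(1)
lam = 0.8                      # arrival rate of (cloned) requests
c = 3                          # number of clones per request
samples = rng.lognormal(mean=-0.5, sigma=1.0, size=(100_000, c))

s_single = samples[:, 0].mean()          # mean service time without cloning
s_cloned = samples.min(axis=1).mean()    # effective service time with c clones

for label, s in [("no cloning", s_single), (f"{c} synchronized clones", s_cloned)]:
    rho = lam * s
    t = s / (1 - rho) if rho < 1 else float("inf")
    print(f"{label:>24}: E[S]={s:.3f}  rho={rho:.2f}  E[T]={t:.3f}")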

Taming Energy Consumption Variations In Systems Benchmarking

The past decade witnessed the inclusion of power measurements to evaluate the energy efficiency of software systems, thus making energy a prime indicator along with performance. Nevertheless, measuring the energy consumption of a software system remains a tedious task for practitioners. In particular, the energy measurement process may be subject to a lot of variations that hinder the relevance of potential comparisons. While the state of the art mostly acknowledged the impact of hardware factors (chip printing process, CPU temperature), this paper investigates the impact of controllable factors on these variations. More specifically, we conduct an empirical study of multiple controllable parameters that one can easily tune to tame the energy consumption variations when benchmarking software systems. To better understand the causes of such variations, we ran more than 1,000 experiments on more than 100 nodes with different workloads and configurations. The main factors we studied encompass: experimental protocol, CPU features (C-states, Turbo Boost, core pinning) and generations, as well as the operating system. Our experiments showed that, for some workloads, it is possible to tighten the energy variation by up to 30x. Finally, we summarize our results as guidelines to tame energy consumption variations. We argue that the guidelines we deliver are the minimal requirements to be considered prior to any energy efficiency evaluation.
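The following sketch shows the kind of repeated-measurement loop such a study relies on, assuming a Linux host that exposes Intel RAPL counters through the powercap sysfs interface; the RAPL path, the pinned core list, and the benchmark command are placeholders, and counter wraparound is ignored for brevity.

# Sketch of measuring per-run energy and its variation on Linux via the
# powercap/RAPL sysfs interface. The RAPL path, core list, and benchmark
# command are assumptions/placeholders; the counter is package-level and
# wraps around, which this toy loop ignores for brevity.
import statistics
import subprocess

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"   # package 0 energy counter
BENCH = ["taskset", "-c", "0-3", "./my_benchmark"]    # pin to fixed cores (placeholder)

def read_energy_uj() -> int:
    with open(RAPL) as f:
        return int(f.read().strip())

energies_j = []
for _ in range(30):                       # repeated runs to expose variation
    before = read_energy_uj()
    subprocess.run(BENCH, check=True)
    after = read_energy_uj()
    energies_j.append((after - before) / 1e6)

mean = statistics.mean(energies_j)
cv = statistics.stdev(energies_j) / mean  # coefficient of variation
print(f"mean energy = {mean:.1f} J, CV = {100 * cv:.1f} %")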

SESSION: SESSION 2: Performance Learning

An Automated Forecasting Framework based on Method Recommendation for Seasonal Time Series

Due to the fast-paced and changing demands of their users, computing systems require autonomic resource management. To enable proactive and accurate decision-making for changes causing a particular overhead, reliable forecasts are needed. In fact, choosing the best performing forecasting method for a given time series scenario is a crucial task. Taking the "No-Free-Lunch Theorem" into account, there exists no forecasting method that performs best on all types of time series. To this end, we propose an automated approach that (i) extracts characteristics from a given time series, (ii) selects the best-suited machine learning method based on recommendation, and finally, (iii) performs the forecast. Our approach offers the benefit of not relying on a single method with its possibly inaccurate forecasts. In an extensive evaluation, our approach achieves the best forecasting accuracy.
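A toy rendition of the extract-characteristics, recommend-a-method, forecast pipeline is sketched below; it is not the paper's recommendation system, and the three baseline methods, the seasonality-strength feature, and the thresholds are illustrative assumptions.

# Toy version of "extract characteristics, recommend a method, forecast"
# (not the paper's recommendation system). It picks between three simple
# baselines using a seasonality-strength feature; thresholds are assumptions.
import numpy as np

def seasonal_strength(y, period):
    """Share of variance explained by the mean seasonal profile."""
    k = len(y) // period
    folds = y[:k * period].reshape(k, period)
    profile = folds.mean(axis=0)
    residual = folds - profile
    return 1.0 - residual.var() / folds.var()

def forecast(y, period, horizon):
    strength = seasonal_strength(y, period)
    if strength > 0.6:                         # strong seasonality -> seasonal naive
        return np.tile(y[-period:], horizon // period + 1)[:horizon], "seasonal naive"
    trend = (y[-1] - y[0]) / (len(y) - 1)
    if abs(trend) > 0.01 * np.abs(y).mean():   # visible trend -> drift method
        return y[-1] + trend * np.arange(1, horizon + 1), "drift"
    return np.full(horizon, y[-period:].mean()), "moving average"

# Example: noisy daily-seasonal request-rate series (hourly samples, two weeks).
rng = np.random.default_rng(2)
t = np.arange(24 * 14)
series = 100 + 30 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 5, t.size)
yhat, method = forecast(series, period=24, horizon=24)
print("recommended method:", method, "- first forecast values:", np.round(yhat[:3], 1))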

Learning Queuing Networks by Recurrent Neural Networks

It is well known that building analytical performance models in practice is difficult because it requires a considerable degree of proficiency in the underlying mathematics. In this paper, we propose a machine-learning approach to derive performance models from data. We focus on queuing networks, and crucially exploit a deterministic approximation of their average dynamics in terms of a compact system of ordinary differential equations. We encode these equations into a recurrent neural network whose weights can be directly related to model parameters. This allows for an interpretable structure of the neural network, which can be trained from system measurements to yield a white-box parameterized model that can be used for prediction purposes such as what-if analyses and capacity planning. Using synthetic models as well as a real case study of a load-balancing system, we show the effectiveness of our technique in yielding models with high predictive power.
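A minimal sketch of the underlying idea, not the paper's architecture: the fluid approximation of a single queue, dq/dt = lambda - mu*min(q, 1), unrolled with forward Euler steps behaves like a recurrent cell whose weight is the service rate mu; here mu is recovered by a simple grid search against a synthetic trajectory instead of by training a neural network.

# Minimal illustration of the core idea (not the paper's architecture): the
# fluid ODE of a single queue, dq/dt = lam - mu*min(q, 1), unrolled with
# forward Euler steps acts like a recurrent cell whose weight is the service
# rate mu. Here mu is fitted by grid search rather than backpropagation, and
# the "measurements" are synthetic.
import numpy as np

def unroll(q0, lam, mu, dt, steps):
    """Recurrent cell: q_{t+1} = q_t + dt * (lam - mu * min(q_t, 1))."""
    q, traj = q0, []
    for _ in range(steps):
        q = max(q + dt * (lam - mu * min(q, 1.0)), 0.0)
        traj.append(q)
    return np.array(traj)

# Synthetic "measured" trajectory generated with an unknown true rate.
dt, steps, lam, true_mu = 0.1, 200, 3.0, 4.0
rng = np.random.default_rng(3)
measured = unroll(2.0, lam, true_mu, dt, steps) + rng.normal(0, 0.02, steps)

# Fit mu by minimizing squared error between the unrolled model and the data.
candidates = np.linspace(3.0, 6.0, 301)
errors = [np.mean((unroll(2.0, lam, mu, dt, steps) - measured) ** 2)
          for mu in candidates]
print("estimated service rate:", candidates[int(np.argmin(errors))])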

The Use of Change Point Detection to Identify Software Performance Regressions in a Continuous Integration System

We describe our process for automatic detection of performance changes for a software product in the presence of noise. A large collection of tests runs periodically as changes to our software product are committed to our source repository, and we would like to identify the commits responsible for performance regressions. Previously, we relied on manual inspection of time series graphs to identify significant changes. That was later replaced with a threshold-based detection system, but neither approach was sufficient for finding changes in performance in a timely manner. This work describes our recent implementation of a change point detection system built upon the E-Divisive means algorithm. The algorithm produces a list of change points representing significant changes from a given history of performance results. A human reviews the list of change points for actionable changes, which are then triaged for further inspection. Using change point detection has had a dramatic impact on our ability to detect performance changes. Quantitatively, it has substantially reduced our false positive rate for performance changes, while qualitatively it has made the entire performance evaluation process easier, more productive (e.g., catching smaller regressions), and more timely.
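A stripped-down sketch of the energy-statistic criterion at the heart of E-Divisive means is shown below; unlike the production system described above, it finds only a single change point and omits the permutation significance test, and the example series is synthetic.

# Stripped-down sketch of the energy-statistic criterion behind E-Divisive
# means: score every split point of a series and return the one that maximizes
# the between-segment divergence. Unlike the real algorithm this finds a single
# change point and performs no permutation significance test.
import numpy as np

def energy_divergence(x, y, alpha=1.0):
    """Scaled sample energy divergence between segments x and y (V-statistic form)."""
    n, m = len(x), len(y)
    dxy = np.abs(x[:, None] - y[None, :]) ** alpha
    dxx = np.abs(x[:, None] - x[None, :]) ** alpha
    dyy = np.abs(y[:, None] - y[None, :]) ** alpha
    e = 2 * dxy.mean() - dxx.mean() - dyy.mean()
    return (n * m) / (n + m) * e

def best_change_point(series, min_size=5):
    series = np.asarray(series, dtype=float)
    scores = {tau: energy_divergence(series[:tau], series[tau:])
              for tau in range(min_size, len(series) - min_size)}
    return max(scores, key=scores.get)

# Example: a performance series whose mean shifts after commit index 120.
rng = np.random.default_rng(4)
series = np.concatenate([rng.normal(100, 3, 120), rng.normal(110, 3, 80)])
print("detected change point at index", best_change_point(series))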

SESSION: SESSION 3: Performance as Throughput and Concerns

Throughput Prediction of Asynchronous SGD in TensorFlow

Modern machine learning frameworks can train neural networks using multiple nodes in parallel, each computing parameter updates with stochastic gradient descent (SGD) and sharing them asynchronously through a central parameter server. Due to communication overhead and bottlenecks, the total throughput of SGD updates in a cluster scales sublinearly, saturating as the number of nodes increases. In this paper, we present an approach for predicting training throughput from profiling traces collected in a single-node configuration. Our approach is able to model the interaction of multiple nodes and the scheduling of concurrent transmissions between the parameter server and each node. By accounting for the dependencies between received parts and pending computations, we predict overlaps between computation and communication and generate synthetic execution traces for configurations with multiple nodes. We validate our approach on TensorFlow training jobs for popular image classification neural networks, on AWS and on our in-house cluster, using nodes equipped with GPUs or only with CPUs. We also investigate the effects of data transmission policies used in TensorFlow and the accuracy of our approach when combined with optimizations of the transmission schedule.

Modeling Analytics for Computational Storage

Next generation flash storage will be armed with a substantial amount of computing power. In this paper, we investigate opportunities to utilize this computational capability to optimize Online Analytical Processing (OLAP) applications. We have directed our analysis at the performance of a subset of TPC-DS queries using Hadoop clusters and two database engines, SPARK-SQL and Presto. We model the expected speed-up achieved by offloading a few operations that are executed first within most SQL plans. Offloading these operations requires minimal cooperation from the database engine, and no changes to the existing plan. We show that the speed-up achieved varies significantly among queries and between engines, and that the queries benefiting the most are I/O heavy with high selectivity of the "needle in the haystack" variety. Our main contribution is estimating the speed-up anticipated from pushing the execution of a few key SQL building blocks (scan, filter, and project operations) to computational storage when using read optimized, columnar Parquet format files.
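A back-of-the-envelope model, not the paper's, of why I/O-heavy, highly selective queries benefit most from pushing scan, filter, and project to the drive: if the device filters at least as fast as the host could scan, the host only ingests the surviving fraction of the data. All parameters in the example are hypothetical.

# Back-of-the-envelope estimate (not the paper's model) of offloading
# scan/filter/project to computational storage. Assumes the drive filters at
# least as fast as the host scans, so the host-side time for that stage shrinks
# to the selected fraction. All numbers are hypothetical.
def offload_speedup(total_s, scan_filter_s, selectivity):
    """
    total_s        -- baseline query time on the host
    scan_filter_s  -- portion spent scanning/filtering/projecting Parquet data
    selectivity    -- fraction of data surviving the filter (0..1)
    """
    remaining = total_s - scan_filter_s          # joins, aggregation, etc.
    return total_s / (selectivity * scan_filter_s + remaining)

# "Needle in the haystack" query: 80% of time in scan/filter, 1% selectivity.
print(offload_speedup(total_s=100, scan_filter_s=80, selectivity=0.01))   # ~4.8x
# Compute-bound query: only 20% of time in scan/filter, 50% selectivity.
print(offload_speedup(total_s=100, scan_filter_s=20, selectivity=0.5))    # ~1.1x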

Duet Benchmarking: Improving Measurement Accuracy in the Cloud

We investigate the duet measurement procedure, which helps improve the accuracy of performance comparison experiments conducted on shared machines by executing the measured artifacts in parallel and evaluating their relative performance together, rather than individually. Specifically, we analyze the behavior of the procedure in multiple cloud environments and use experimental evidence to answer multiple research questions concerning the assumptions underlying the procedure. We demonstrate improvements in accuracy ranging from 2.3x to 12.5x (5.03x on average) for the tested ScalaBench (and DaCapo) workloads, and from 23.8x to 82.4x (37.4x on average) for the SPEC CPU 2017 workloads.
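The gist of the duet idea can be illustrated with the toy harness below (not the authors' tooling): the two compared artifacts run concurrently so they are exposed to the same machine and cloud noise, and their per-iteration times are compared pairwise. The two workload functions are placeholders.

# Toy illustration of duet measurement (not the authors' harness): run the two
# compared artifacts concurrently so they see the same machine/cloud noise,
# then compare their per-iteration times pairwise. Workloads are placeholders.
import multiprocessing as mp
import time

def workload_a():                     # placeholder for artifact version A
    sum(i * i for i in range(200_000))

def workload_b():                     # placeholder for artifact version B
    sum(i * i for i in range(220_000))

def runner(work, barrier, out, iterations=50):
    times = []
    for _ in range(iterations):
        barrier.wait()                # keep both artifacts in lockstep per iteration
        start = time.perf_counter()
        work()
        times.append(time.perf_counter() - start)
    out.put(times)

if __name__ == "__main__":
    barrier = mp.Barrier(2)
    qa, qb = mp.Queue(), mp.Queue()
    pa = mp.Process(target=runner, args=(workload_a, barrier, qa))
    pb = mp.Process(target=runner, args=(workload_b, barrier, qb))
    pa.start()
    pb.start()
    pa.join()
    pb.join()
    ta, tb = qa.get(), qb.get()
    ratios = [b / a for a, b in zip(ta, tb)]
    print("median pairwise slowdown of B vs. A:", sorted(ratios)[len(ratios) // 2])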

A Fully Structure-Driven Performance Analysis of Sparse Matrix-Vector Multiplication

Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific, machine-learning, and other compute-intensive applications. Performance characteristics, however, depend on a complex combination of storage format, machine capabilities, and choices in code-generation. A deep understanding of the relative impact of these properties is important in itself, and also to better understand the performance potential of alternative execution contexts such as web-based scientific computing, where the recent introduction of WebAssembly offers the potential for low-level, near-native performance within a web browser. In this work we characterize the performance of SpMV operations for different sparse storage formats based on the sparse matrix structure and the machine architecture. We extract structural properties from 2000 real-life sparse matrices to understand their impact on the choice of storage format and also on the performance within those storage formats for both WebAssembly and native C. We extend this with new matrix features based on a "reuse-distance" concept to identify performance bottlenecks, and evaluate the effect of interaction between the matrix structure and hardware characteristics on SpMV performance. Our study provides valuable insights to scientific programmers and library developers to apply best practices and guide future optimization for SpMV in general, and in particular for web-based contexts with abstracted hardware and storage models.
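For readers unfamiliar with the storage formats involved, the sketch below shows SpMV over CSR, one of the formats such studies compare, in plain Python; real kernels are written in C or WebAssembly, so this only serves to make the data structure and the structure-dependent access pattern concrete.

# Plain-Python rendition of SpMV over the CSR storage format, for illustration
# only; real kernels are written in C/WebAssembly.
import numpy as np

def dense_to_csr(A):
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def spmv_csr(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        start, end = row_ptr[i], row_ptr[i + 1]
        # Per-row dot product; irregular row lengths and the indirect access
        # x[col_idx[...]] are what make SpMV performance structure-dependent.
        y[i] = np.dot(values[start:end], x[col_idx[start:end]])
    return y

A = np.array([[4.0, 0, 0, 1], [0, 3, 0, 0], [0, 0, 5, 2]])
x = np.array([1.0, 2.0, 3.0, 4.0])
print(spmv_csr(*dense_to_csr(A), x), "==", A @ x)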

Can a Chatbot Support Software Engineers with Load Testing? Approach and Experiences

Even though load testing is an established technique to assess load-related quality properties of software systems, it is applied only seldom and with questionable results. Indeed, configuring, executing, and interpreting results of a load test require high effort and expertise. Since chatbots have shown promising results for interactively supporting complex tasks in various domains (including software engineering), we hypothesize that chatbots can provide developers suitable support for load testing. In this paper, we present PerformoBot, our chatbot for configuring and running load tests. In a natural language conversation, PerformoBot guides developers through the process of properly specifying the parameters of a load test, which is then automatically executed by PerformoBot using a state-of-the-art load testing tool. After the execution, PerformoBot provides developers a report that answers the respective concern. We report on results of a user study that involved 47 participants, in which we assessed our tool's acceptance and effectiveness. We found that participants in the study, particularly those with a lower level of expertise in performance engineering, had a mostly positive view of PerformoBot.

SESSION: SESSION 4: Serverless Apps

Had You Looked Where I'm Looking? Cross-user Similarities in Viewing Behavior for 360-degree Video and Caching Implications

The demand and usage of 360-degree video services are expected to increase. However, despite these services being highly bandwidth intensive, not much is known about the potential value that basic bandwidth saving techniques such as server or edge-network on-demand caching (e.g., in a CDN) could have when used for delivery of such services. This problem is both important and complicated as client-side solutions have been developed that split the full 360-degree view into multiple tiles, and adapt the quality of the downloaded tiles based on the user's expected viewing direction and bandwidth conditions. To better understand the potential bandwidth savings that caching-based techniques may offer for this context, this paper presents the first characterization of the similarities in the viewing directions of users watching the same 360-degree video, the overlap in viewports of these users (the area of the full 360-degree view they actually see), and the potential cache hit rates for different video categories and network conditions. The results provide substantial insight into the conditions under which overlap can be considerable and caching effective, and can inform the design of new caching system policies tailored for 360-degree video.

Microservices: A Performance Tester's Dream or Nightmare?

In recent years, there has been a shift in software development towards microservice-based architectures, which consist of small services that focus on one particular functionality. Many companies are migrating their applications to such architectures to reap the benefits of microservices, such as increased flexibility, scalability, and a smaller granularity of the functionality offered by a service. On the one hand, the benefits of microservices for functional testing are often praised, as the focus on one functionality and their smaller granularity allow for more targeted and more convenient testing. On the other hand, using microservices has consequences (both positive and negative) for other types of testing, such as performance testing. Performance testing is traditionally done by establishing the baseline performance of a software version, which is then used to compare the performance testing results of later software versions. However, as we show in this paper, establishing such a baseline performance is challenging in microservice applications. In this paper, we discuss the benefits and challenges of microservices from a performance tester's point of view. Through a series of experiments on the TeaStore application, we demonstrate how microservices affect the performance testing process, and we demonstrate that it is not straightforward to achieve reliable performance testing results for a microservice application.

A Framework for Satisfying the Performance Requirements of Containerized Software Systems Through Multi-Versioning

With the increasing popularity and complexity of containerized software systems, satisfying the performance requirements of these systems becomes more challenging as well. While a common remedy to this problem is to increase the allocated amount of resources by scaling up or out, this remedy is not necessarily cost-effective and, therefore, often problematic for smaller companies. In this paper, we study an alternative, more cost-effective approach for satisfying the performance requirements of containerized software systems. In particular, we investigate how we can satisfy such requirements by applying software multi-versioning to the system's resource-heavy containers. We present DockerMV, an open-source extension of the Docker framework, to support the multi-versioning of containerized software systems. We demonstrate the efficacy of multi-versioning for satisfying the performance requirements of containerized software systems through experiments on the TeaStore, a microservice reference test application, and Znn, a containerized news portal application. Our DockerMV extension can be used by software developers to introduce multi-versioning in their own containerized software systems, thereby better allowing them to meet the performance requirements of their systems.

Detecting Latency Degradation Patterns in Service-based Systems

Performance in heterogeneous service-based systems shows non-deterministic trends. Even for the same request type, latency may vary from one request to another. These variations can occur for several reasons at different levels of the software stack: operating system, network, software libraries, application code, or others. Furthermore, a request may involve several Remote Procedure Calls (RPCs), where each call can be subject to performance variation. Performance analysts inspect distributed traces and search for recurrent patterns in trace attributes, such as RPC execution times, in order to cluster traces in which variations may be induced by the same cause. Clustering "similar" traces is a prerequisite for effective performance debugging. Given the scale of the problem, such activity can be tedious and expensive. In this paper, we present an automated approach that detects relevant RPC execution time patterns associated with request latency degradation, i.e., latency degradation patterns. The presented approach is based on a genetic search algorithm driven by an information retrieval relevance metric and an optimized fitness evaluation. Each latency degradation pattern identifies a cluster of requests subject to latency degradation with similar patterns in RPC execution times. We show on a microservice-based application case study that the proposed approach can effectively detect clusters identified by artificially injected latency degradation patterns. Experimental results show that our approach outperforms, in terms of F-score, a state-of-the-art approach for latency profile analysis as well as widely popular machine learning clustering algorithms. We also show how our approach can be easily extended to trace attributes other than RPC execution time (e.g., HTTP headers, execution node, etc.).

SESSION: SESSION 5: Performance Issues

Software Performance Antipatterns in Cyber-Physical Systems

Software performance antipatterns (SPAs) document common performance problems in software architecture and design and how to fix them. They differ from software antipatterns in their focus on the performance of the software. This paper addresses performance antipatterns that are common in today's Cyber-Physical Systems (CPS). We describe the characteristics of today's CPS that cause performance problems that have been uncommon in real-time embedded systems of the past. Three new performance antipatterns are defined and their impact on CPS is described. Six previously defined performance antipatterns are described that are particularly relevant to today's CPS. The paper concludes with observations on how this work is useful in the design, implementation, and operation of CPS.

How Are Performance Issues Caused and Resolved? An Empirical Study from a Design Perspective

Empirical experience regarding how real-life performance issues are caused and resolved can provide valuable insights for practitioners to effectively and efficiently prevent, detect, and fix performance issues. Prior work shows that most performance issues have their roots in poor architectural decisions. This paper contributes a large-scale empirical study of 192 real-life performance issues, with an emphasis on software design. First, this paper contributes a holistic view of eight common root causes and typical resolutions that recur in different projects, and surveys existing literature, in particular tools, that can detect and fix each type of performance issue. Second, this study is the first of its kind to investigate performance issues from a design perspective. Of the 192 issues, 33% required design-level optimization, i.e., simultaneously revising a group of related source files to resolve the issue. We reveal four design-level optimization patterns, which show different prevalence in resolving different root causes. Finally, this study investigated the Return on Investment of addressing performance issues, to help practitioners choose between localized and design-level optimization resolutions, and to prioritize issues due to different root causes.

Optimizing Interrupt Handling Performance for Memory Failures in Large Scale Data Centers

Intermittent hardware failures are generally non-catastrophic and typical large-scale service infrastructures are designed to tolerate them while still serving user traffic. However, intermittent errors cause performance aberrations if they are not handled appropriately. System error reporting mechanisms send hardware interrupts to the Central Processing Unit (CPU) for handling the hardware errors. This disrupts the CPU's normal operation, which impacts the performance of the server. In this paper, we describe common intermittent hardware errors observed on server systems in a large-scale data center environment. We discuss two methodologies of handling interrupts in server systems - System Management Interrupt (SMI) and Corrected Machine Check Interrupt (CMCI). We characterize the performance of these methods in live environments as compared to prior studies that used error injection to simulate error behavior. Our experience shows that error injection methods are not reflective of production behavior. We also present a hybrid approach for handling error interrupts that achieves better performance, while preserving monitoring granularity, in large scale data center environments.

SESSION: SESSION 6: Performance Costs and Emerging Problems

DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs

The past few years have seen a surge of applying Deep Learning (DL) models for a wide array of tasks such as image classification, object detection, machine translation, etc. While DL models provide an opportunity to solve otherwise intractable tasks, their adoption relies on them being optimized to meet target latency and resource requirements. Benchmarking is a key step in this process but has been hampered in part due to the lack of representative and up-to-date benchmarking suites. This paper proposes DLBricks, a composable benchmark generation design that reduces the effort of developing, maintaining, and running DL benchmarks. DLBricks decomposes DL models into a set of unique runnable networks and constructs the original model's performance using the performance of the generated benchmarks. Since benchmarks are generated automatically and the benchmarking time is minimized, DLBricks can keep up-to-date with the latest proposed models, relieving the pressure of selecting representative DL models. We evaluate DLBricks using 50 MXNet models spanning 5 DL tasks on 4 representative CPU systems. We show that DLBricks provides an accurate performance estimate for the DL models and reduces the benchmarking time across systems (e.g. within 95% accuracy and up to 4.4× benchmarking time speedup on Amazon EC2 c5.xlarge).

The Performance Cost of Software-based Security Mitigations

Historically, performance has been the most important feature when optimizing computer hardware. Modern processors are so highly optimized that every cycle of computation time matters. However, this practice of optimizing for performance at all costs has been called into question by new microarchitectural attacks, e.g. Meltdown and Spectre. Microarchitectural attacks exploit the effects of microarchitectural components or optimizations in order to leak data to an attacker. These attacks have caused processor manufacturers to introduce performance impacting mitigations in both software and silicon. To investigate the performance impact of the various mitigations, a test suite of forty-seven different tests was created. This suite was run on a series of virtual machines that tested both Ubuntu 16 and Ubuntu 18. These tests investigated the performance change across version updates and the performance impact of CPU core number vs. default microarchitectural mitigations. The testing proved that the performance impact of the microarchitectural mitigations is non-trivial, as the percent difference in performance can be as high as 200%.

Workload Diffusion Modeling for Distributed Applications in Fog/Edge Computing Environments

This paper addresses the problem of workload generation for distributed applications in fog/edge computing. Unlike most existing work that tends to generate workload data for individual network nodes using historical data from the targeted node, this work aims to extrapolate supplementary workloads for entire application/infrastructure graphs through diffusion of measurements from limited subsets of nodes. A framework for workload generation is proposed, which defines five diffusion algorithms that use different techniques for data extrapolation and generation. Each algorithm takes into account different constraints and assumptions when executing its diffusion task, and individual algorithms are applicable for modeling different types of applications and infrastructure networks. Experiments are performed to demonstrate the approach and evaluate the performance of the algorithms under realistic workload settings, and results are validated using statistical techniques.

MoVIE: A Measurement Tool for Mobile Video Streaming on Smartphones

Mobile video streaming is becoming increasingly popular. In this paper, we describe the design and implementation of a cross-platform measurement tool called MoVIE (Mobile Video Information Extraction) for video streaming on mobile devices. MoVIE is a client-side traffic analyzer that studies smartphone video streaming from different viewpoints. It collects information about network-level packet traffic, transport-layer flows, and application-level video player activities. Then it identifies relationships within the collected data to make mobile video streaming activities transparent. MoVIE is an open-source tool with a graphical user interface. In addition to network traffic measurement, MoVIE supports objective Quality of Experience (QoE) evaluation of video streaming. These features make MoVIE a powerful tool for network traffic measurement, multimedia streaming studies, and privacy analysis. We illustrate MoVIE's capabilities with a small case study of streaming 360° videos.

Aggregate Architecture Simulation in Event-Sourcing Applications using Layered Queuing Networks

Workload intensity in terms of arrival rate and think-time can be used to accurately simulate system performance in traditional systems. Most systems treat individual requests on a standalone basis, and resource demands typically do not vary too significantly, which in most cases can be addressed as a parametric dependency. New frameworks such as Command Query Responsibility Segregation and Event Sourcing change the paradigm: request processing is both parametrically dependent and dynamic, as the history of changes that have occurred is replayed to construct the current state of the system. This makes every request unique and difficult to simulate. While traditional systems have been studied extensively in the scientific community, the latter are still new and mainly used by practitioners. In this work, we study one such industrial application using the Command Query Responsibility Segregation and Event Sourcing frameworks. We propose new workload patterns suited to defining the dynamic behavior of these systems, define various architectural patterns possible in such systems based on domain-driven design principles, and create analytical performance models to make predictions. We verify the models by making measurements on an actual application running similar workloads and comparing the predictions. Furthermore, we discuss the suitability of the architectural patterns to different usage scenarios and propose changes to the architecture in each case to improve performance.

Contention Aware Web of Things Emulation Testbed

Since the advent of the Web, new Web benchmarking tools have frequently been introduced to keep up with evolving workloads and environments. The introduction of Web of Things (WoT) marks the beginning of another important paradigm that requires new benchmarking tools and testbeds. Such a WoT benchmarking testbed can enable the comparison of different WoT application configurations and workload scenarios under assumptions regarding WoT application resource demands and WoT device network characteristics. The powerful computational capabilities of modern commodity multicore servers along with the limited resource consumption footprints of WoT devices suggest the feasibility of a benchmarking testbed that can emulate the application behaviour of a large number of WoT devices on just a single multicore server. However, to obtain test results that reflect the true performance of the system being emulated, care must be exercised to detect and consider the impact of testbed bottlenecks on performance results. For example, if too many WoT devices are emulated then performance metrics obtained from a test run, e.g., WoT device response times, would only reflect contention among emulated devices for shared multicore server resources instead of providing a true indication of the performance of the WoT system being emulated. We develop a testbed that helps a user emulate a system consisting of multiple WoT devices on a single multicore server by exploiting Docker containers. Furthermore, we devise a novel mechanism for the user to check whether shared resource contention in the testbed has impacted the integrity of test results. Our solution allows for careful scaling of experiments and enables resource efficient evaluation of a wide range of WoT systems, architectures, application characteristics, workload scenarios, and network conditions.

SESSION: SESSION 7: Performance Techniques

GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications

We present a parallel profiling tool, GAPP, that identifies serialization bottlenecks in parallel Linux applications arising from load imbalance or contention for shared resources. It works by tracing kernel context switch events using kernel probes managed by the extended Berkeley Packet Filter (eBPF) framework. The overhead is thus extremely low (an average 4% runtime overhead for the applications explored), the tool requires no program instrumentation and works for a variety of serialization bottlenecks. We evaluate GAPP using the Parsec3.0 benchmark suite and two large open-source projects: MySQL and Nektar++ (a spectral/hp element framework). We show that GAPP is able to reveal a wide range of bottleneck-related performance issues, for example arising from synchronization primitives, busy-wait loops, memory operations, thread imbalance and resource contention.

Predicting the Costs of Serverless Workflows

Function-as-a-Service (FaaS) platforms enable users to run arbitrary functions without being concerned about operational issues, while only paying for the consumed resources. Individual functions are often composed into workflows for complex tasks. However, the pay-per-use model and nontransparent reporting by cloud providers make it challenging to estimate the expected cost of a workflow, which prevents informed business decisions. Existing cost-estimation approaches assume a static response time for the serverless functions, without taking input parameters into account. In this paper, we propose a methodology for the cost prediction of serverless workflows consisting of input-parameter sensitive function models and a Monte Carlo simulation of an abstract workflow model. Our approach enables workflow designers to predict, compare, and optimize the expected costs and performance of a planned workflow, which currently requires time-intensive experimentation. In our evaluation, we show that our approach can predict the response time and output parameters of a function based on its input parameters with an accuracy of 96.1%. In a case study with two audio-processing workflows, our approach predicts the costs of the two workflows with an accuracy of 96.2%.
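A toy Monte Carlo cost estimate for a hypothetical two-function workflow is sketched below; the duration models, memory sizes, and price constants are placeholders rather than the paper's calibrated models, but they show how per-invocation samples roll up into a cost distribution.

# Toy Monte Carlo cost estimate for a two-function workflow (not the paper's
# models). Duration distributions, memory sizes, and price constants are
# placeholders; FaaS providers typically bill per GB-second plus a small
# per-invocation fee.
import numpy as np

rng = np.random.default_rng(5)
PRICE_PER_GB_S = 0.0000166667      # example GB-second price (assumption)
PRICE_PER_INVOCATION = 0.0000002   # example per-invocation fee (assumption)

def simulate_workflow(n=100_000):
    # Hypothetical input-parameter-sensitive models: duration depends on an
    # input "size" drawn per request, plus noise.
    size_mb = rng.uniform(1, 50, n)
    decode_s = 0.2 + 0.05 * size_mb + rng.normal(0, 0.05, n)     # function 1, 512 MB
    analyze_s = 0.5 + 0.02 * size_mb + rng.normal(0, 0.10, n)    # function 2, 1024 MB
    gb_seconds = 0.5 * np.clip(decode_s, 0, None) + 1.0 * np.clip(analyze_s, 0, None)
    return gb_seconds * PRICE_PER_GB_S + 2 * PRICE_PER_INVOCATION

cost = simulate_workflow()
print(f"mean cost per execution: ${cost.mean():.7f}")
print(f"95th percentile:         ${np.percentile(cost, 95):.7f}")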

Sampling Effect on Performance Prediction of Configurable Systems: A Case Study

Numerous software systems are highly configurable and provide a myriad of configuration options that users can tune to fit their functional and performance requirements (e.g., execution time). Measuring all configurations of a system is the most obvious way to understand the effect of options and their interactions, but is too costly or infeasible in practice. Numerous works thus propose to measure only a few configurations (a sample) to learn and predict the performance of any combination of options' values. A challenging issue is to sample a small and representative set of configurations that leads to a good accuracy of performance prediction models. A recent study devised a new algorithm, called distance-based sampling, that obtains state-of-the-art accurate performance predictions on different subject systems. In this paper, we replicate this study through an in-depth analysis of x264, a popular and configurable video encoder. We systematically measure all 1,152 configurations of x264 with 17 input videos and two quantitative properties (encoding time and encoding size). Our goal is to understand whether there is a dominant sampling strategy over the very same subject system (x264), i.e., whatever the workload and targeted performance properties. The findings from this study show that random sampling leads to more accurate performance models. However, setting random sampling aside, there is no single "dominant" sampling strategy; instead, different strategies perform best on different inputs and non-functional properties, further challenging practitioners and researchers.
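The sample-then-learn workflow the study replicates can be demonstrated on synthetic data as follows (this is not the x264 setup): draw a random sample of configurations, fit a linear model with pairwise interactions, and evaluate the prediction error on the unmeasured configurations.

# Tiny demonstration of the sample-then-learn workflow on synthetic data (not
# the replication's x264 setup): random sampling of configurations plus a
# linear model with pairwise interactions; the ground truth is made up.
import numpy as np

rng = np.random.default_rng(6)
n_cfg, n_opt = 1152, 10
X = rng.integers(0, 2, size=(n_cfg, n_opt)).astype(float)

# Hypothetical ground truth: option effects plus one strong interaction.
perf = 20 + X @ rng.uniform(0, 5, n_opt) + 8 * X[:, 0] * X[:, 3] \
       + rng.normal(0, 0.5, n_cfg)

def with_interactions(X):
    pairs = [X[:, i] * X[:, j]
             for i in range(X.shape[1]) for j in range(i + 1, X.shape[1])]
    return np.column_stack([np.ones(len(X)), X] + pairs)

sample = rng.choice(n_cfg, size=100, replace=False)       # random sampling strategy
rest = np.setdiff1d(np.arange(n_cfg), sample)
F = with_interactions(X)
coef, *_ = np.linalg.lstsq(F[sample], perf[sample], rcond=None)
mape = np.mean(np.abs(F[rest] @ coef - perf[rest]) / perf[rest])
print(f"mean absolute percentage error on unseen configurations: {100 * mape:.1f}%")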

A Sampling-Based Tool for Scaling Graph Datasets

Graph processing has become a topic of interest in many domains. However, we still observe a lack of representative datasets for in-depth performance and scalability analysis. Neither data collections nor graph generators provide enough diversity and control for thorough analysis. To address this problem, we propose a heuristic method for scaling existing graphs. Our approach, based on sampling and interconnection, can provide a scaled "version" of a given graph. Moreover, we provide analytical models to predict the topological properties of the scaled graphs (such as the diameter, degree distribution, density, or the clustering coefficient), and further enable the user to tweak these properties. Property control is achieved through a portfolio of graph interconnection methods (e.g., star, ring, chain, fully connected) applied for combining the graph samples. We further implement our method as an open-source tool which can be used to quickly provide families of datasets for in-depth benchmarking of graph processing algorithms. Our empirical evaluation demonstrates that our tool provides scaled graphs of a wide range of sizes, whose properties match well with model predictions and/or user requirements. Finally, we also illustrate, through a case study, how scaled graphs can be used for in-depth performance analysis of graph processing algorithms.
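A small networkx sketch of the sampling-and-interconnection idea follows (not the released tool): draw several node samples from the input graph, keep their induced subgraphs, and stitch the samples together with a ring interconnection. The graph generator used as input and all sizes are placeholders.

# Small networkx sketch of sampling-and-interconnection scaling (not the tool
# itself): k node samples of the input graph, induced subgraphs, and a ring
# interconnection between consecutive samples. Input graph and sizes are
# placeholders.
import random
import networkx as nx

def scale_graph(G, k=3, sample_fraction=0.5, seed=0):
    rng = random.Random(seed)
    parts = []
    for i in range(k):
        nodes = rng.sample(list(G.nodes), int(sample_fraction * G.number_of_nodes()))
        sub = nx.Graph(G.subgraph(nodes))              # materialize the induced subgraph
        parts.append(nx.relabel_nodes(sub, {n: (i, n) for n in nodes}))
    scaled = nx.compose_all(parts)
    for i in range(k):                                 # ring interconnection between samples
        u = rng.choice(list(parts[i].nodes))
        v = rng.choice(list(parts[(i + 1) % k].nodes))
        scaled.add_edge(u, v)
    return scaled

G = nx.barabasi_albert_graph(1000, 3, seed=1)          # stand-in for a real dataset
H = scale_graph(G, k=3, sample_fraction=0.5)
print(H.number_of_nodes(), H.number_of_edges(), nx.density(H))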

SESSION: Workshop Summaries

3rd Workshop on Hot Topics in Cloud Computing Performance (HotCloudPerf'20): Performance Variability

Extended Abstract of Performance Analysis and Prediction of Model Transformation

The Eighth International Workshop on Load Testing and Benchmarking of Software Systems (LTB 2020)

The Eighth International Workshop on Load Testing and Benchmarking of Software Systems (LTB 2020) is a full-day event bringing together software testing researchers, practitioners and tool developers to discuss the challenges and opportunities of conducting research on load testing and benchmarking software systems. The workshop, co-located with the 11th International Conference on Performance Engineering (ICPE 2020), is held on April 20th, 2020 in Edmonton, Alberta, Canada.