ESEC/FSE 2020: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Full Citation in the ACM Digital Library

SESSION: Analysis

A behavioral notion of robustness for software systems

Software systems are designed and implemented with assumptions about the environment. However, once the system is deployed, the actual environment may deviate from its expected behavior, possibly undermining desired properties of the system. To enable systematic design of systems that are robust against potential environmental deviations, we propose a rigorous notion of robustness for software systems. In particular, the robustness of a system is defined as the largest set of deviating environmental behaviors under which the system is capable of guaranteeing a desired property. We describe a new set of design analysis problems based on our notion of robustness, and a technique for automatically computing robustness of a system given its behavior description. We demonstrate potential applications of our robustness notion on two case studies involving network protocols and safety-critical interfaces.

ARDiff: scaling program equivalence checking via iterative abstraction and refinement of common code

Equivalence checking techniques help establish whether two versions of a program exhibit the same behavior. The majority of popular techniques for formally proving/refuting equivalence relies on symbolic execution – a static analysis approach that reasons about program behaviors in terms of symbolic input variables. Yet, symbolic execution is difficult to scale in practice due to complex programming constructs, such as loops and non-linear arithmetic.

This paper proposes an approach, named ARDiff, for improving the scalability of symbolic-execution-based equivalence checking techniques when comparing syntactically-similar versions of a program, e.g., for verifying the correctness of code upgrades and refactoring. Our approach relies on a set of novel heuristics to determine which parts of the versions’ common code can be effectively pruned during the analysis, reducing the analysis complexity without sacrificing its effectiveness. Furthermore, we devise a new equivalence checking benchmark, extending existing benchmarks with a set of real-life methods containing complex math functions and loops. We evaluate the effectiveness and efficiency of ARDiff on this benchmark and show that it outperforms existing method-level equivalence checking techniques by solving 86% of all equivalent and 55% of non-equivalent cases, compared with 47% to 69% for equivalent and 38% to 52% for non-equivalent cases in related work.

C2S: translating natural language comments to formal program specifications

Formal program specifications are essential for various software engineering tasks, such as program verification, program synthesis, code debugging and software testing. However, manually inferring formal program specifications is not only time-consuming but also error-prone. In addition, it requires substantial expertise. Natural language comments contain rich semantics about behaviors of code, making it feasible to infer program specifications from comments. Inspired by this, we develop a tool, named C2S, to automate the specification synthesis task by translating natural language comments into formal program specifications. Our approach firstly constructs alignments between natural language word and specification tokens from existing comments and their corresponding specifications. Then for a given method comment, our approach assembles tokens that are associated with words in the comment from the alignments into specifications guided by specification syntax and the context of the target method. Our tool successfully synthesizes 1,145 specifications for 511 methods of 64 classes in 5 different projects, substantially outperforming the state-of-the-art. The generated specifications are also used to improve a number of software engineering tasks like static taint analysis, which demonstrates the high quality of the specifications.

Detecting and understanding JavaScript global identifier conflicts on the web

JavaScript is widely used for implementing client-side web applications, and it is common to include JavaScript code from many different hosts. However, in a web browser, all the scripts loaded in the same frame share a single global namespace. As a result, a script may read or even overwrite the global objects or functions in other scripts, causing unexpected behaviors. For example, a script can redefine a function in a different script as an object, so that any call of that function would cause an exception at run time.

We systematically investigate the client-side JavaScript code integrity problem caused by JavaScript global identifier conflicts in this paper. We developed a browser-based analysis framework, JSObserver, to collect and analyze the write operations to global memory locations by JavaScript code. We identified three categories of conflicts using JSObserver on the Alexa top 100K websites, and detected 145,918 conflicts on 31,615 websites.

We reveal that JavaScript global identifier conflicts are prevalent and could cause behavior deviation at run time. In particular, we discovered that 1,611 redefined functions were called after being overwritten, and many scripts modified the value of cookies or redefined cookie-related functions. Our research demonstrated that JavaScript global identifier conflict is an emerging threat to both the web users and the integrity of web applications.

Domain-independent interprocedural program analysis using block-abstraction memoization

Whenever a new software-verification technique is developed, additional effort is necessary to extend the new program analysis to an interprocedural one, such that it supports recursive procedures. We would like to reduce that additional effort. Our contribution is an approach to extend an existing analysis in a modular and domain-independent way to an interprocedural analysis without large changes: We present interprocedural block-abstraction memoization (BAM), which is a technique for procedure summarization to analyze (recursive) procedures. For recursive programs, a fix-point algorithm terminates the recursion if every procedure is sufficiently unrolled and summarized to cover the abstract state space.

BAM Interprocedural works for data-flow analysis and for model checking, and is independent from the underlying abstract domain. To witness that our interprocedural analysis is generic and configurable, we defined and evaluated the approach for three completely different abstract domains: predicate abstraction, explicit values, and intervals. The interprocedural BAM-based analysis is implemented in the open-source verification framework CPAchecker. The evaluation shows that the overhead for modularity and domain-independence is not prohibitively large and the analysis is still competitive with other state-of-the-art software-verification tools.

Flexeme: untangling commits using lexical flows

Today, most developers bundle changes into commits that they submit to a shared code repository. Tangled commits intermix distinct concerns, such as a bug fix and a new feature. They cause issues for developers, reviewers, and researchers alike: they restrict the usability of tools such as git bisect, make patch comprehension more difficult, and force researchers who mine software repositories to contend with noise. We present a novel data structure, the 𝛿-NFG, a multiversion Program Dependency Graph augmented with name flows. A 𝛿-NFG directly and simultaneously encodes different program versions, thereby capturing commits, and annotates data flow edges with the names/lexemes that flow across them. Our technique, Flexeme, builds a 𝛿-NFG from commits, then applies Agglomerative Clustering using Graph Similarity to that 𝛿-NFG to untangle its commits. At the untangling task on a C# corpus, our implementation, Heddle, improves the state-of-the-art on accuracy by 0.14, achieving 0.81, in a fraction of the time: Heddle is 32 times faster than the previous state-of-the-art.

HISyn: human learning-inspired natural language programming

Natural Language (NL) programming automatically synthesizes code based on inputs expressed in natural language. It has recently received lots of growing interest. Recent solutions however all require many labeled training examples for their data-driven nature. This paper proposes an NLU-driven approach, a new approach inspired by how humans learn programming. It centers around Natural Language Understanding and draws on a novel graph-based mapping algorithm, foregoing the need of large numbers of labeled examples. The resulting NL programming framework, HISyn, using no training examples, gives synthesis accuracy comparable to those by data-driven methods trained on hundreds of training numbers. HISyn meanwhile demonstrates advantages in interpretability, error diagnosis support, and cross-domain extensibility.

Inductive program synthesis over noisy data

We present a new framework and associated synthesis algorithms for program synthesis over noisy data, i.e., data that may contain incorrect/corrupted input-output examples. This framework is based on an extension of finite tree automata called state-weighted finite tree automata. We show how to apply this framework to formulate and solve a variety of program synthesis problems over noisy data. Results from our implemented system running on problems from the SyGuS 2018 benchmark suite highlight its ability to successfully synthesize programs in the face of noisy data sets, including the ability to synthesize a correct program even when every input-output example in the data set is corrupted.

Inherent vacuity for GR(1) specifications

Vacuity is a well-known quality issue in formal specifications, studied mostly in the context of model checking. Inherent vacuity is a type of vacuity that applies to specifications, without the context of a model. GR(1) is an expressive assume-guarantee fragment of LTL, which enables efficient symbolic synthesis.

In this work we investigate inherent vacuity for GR(1) specifications. We define several general types of inherent vacuity for GR(1), including specification element vacuity and domain value vacuity. We detect vacuities using a reduction to LTL satisfiability, specialized for the context of GR(1). We further extend vacuity detection to handle GR(1) specifications that are enriched with past LTL, monitors, and patterns. Finally, we define a novel notion of vacuity core, which provides means to localize the cause of vacuity.

We implemented our work and evaluated it on benchmarks from the literature. The evaluation shows that vacuities are indeed common in GR(1) specifications, and that we are able to efficiently detect them and effectively localize their causes. Moreover, our evaluation shows that removal of vacuous specification elements may significantly reduce synthesis time.

Interval counterexamples for loop invariant learning

Loop invariant generation has long been a challenging problem. Black-box learning has recently emerged as a promising method for inferring loop invariants. However, the performance depends heavily on the quality of collected examples. In many cases, only after tens or even hundreds of constraint queries, can a feasible invariant be successfully inferred.

To reduce the gigantic number of constraint queries and improve the performance of black-box learning, we introduce interval counterexamples into the learning framework. Each interval counterexample represents a set of counterexamples from constraint solvers. We propose three different generalization techniques to compute interval counterexamples. The existing decision tree algorithm is also improved to adapt interval counterexamples. We evaluate our techniques and report over 40% improvement on learning rounds and verification time over the state-of-the-art approach.

Java Ranger: statically summarizing regions for efficient symbolic execution of Java

Merging execution paths is a powerful technique for reducing path explosion in symbolic execution. One approach, introduced and dubbed “veritesting” by Avgerinos et al., works by translating abounded control flow region into a single constraint. This approach is a convenient way to achieve path merging as a modification to a pre-existing single-path symbolic execution engine. Previous work evaluated this approach for symbolic execution of binary code, but different design considerations apply when building tools for other languages. In this paper, we extend the previous approach for symbolic execution of Java.

Because Java code typically contains many small dynamically dispatched methods, it is important to include them in multi-path regions; we introduce dynamic inlining of method-regions to do so modularly. Java’s typed memory structure is very different from the binary representation, but we show how the idea of static single assignment (SSA) form can be applied to object references to statically account for aliasing. We have implemented our algorithms in Java Ranger, an extension to the widely used Symbolic Pathfinder tool. In a set of nine benchmarks, Java Ranger reduces the running time and number of execution paths by a total of 38% and 71% respectively as compared to SPF. Our results are a significant improvement over the performance of JBMC, a recently released verification tool for Java bytecode. We also participated in a static verification competition at a top theory conference where other participants included state-of-the-art Java verifiers. JR won first place in the competition’s Java verification track.

JShrink: in-depth investigation into debloating modern Java applications

Modern software is bloated. Demand for new functionality has led developers to include more and more features, many of which become unneeded or unused as software evolves. This phenomenon, known as software bloat, results in software consuming more resources than it otherwise needs to. How to effectively and automatically debloat software is a long-standing problem in software engineering. Various debloating techniques have been proposed since the late 1990s. However, many of these techniques are built upon pure static analysis and have yet to be extended and evaluated in the context of modern Java applications where dynamic language features are prevalent.

To this end, we develop an end-to-end bytecode debloating framework called JShrink. It augments traditional static reachability analysis with dynamic profiling and type dependency analysis and renovates existing bytecode transformations to account for new language features in modern Java. We highlight several nuanced technical challenges that must be handled properly and examine behavior preservation of debloated software via regression testing. We find that (1) JShrink is able to debloat our real-world Java benchmark suite by up to 47% (14% on average); (2) accounting for dynamic language features is indeed crucial to ensure behavior preservation---reducing 98% of test failures incurred by a purely static equivalent, Jax, and 84% for ProGuard; and (3) compared with purely dynamic approaches, integrating static analysis with dynamic profiling makes the debloated software more robust to unseen test executions---in 22 out of 26 projects, the debloated software ran successfully under new tests.

Making symbolic execution promising by learning aggressive state-pruning strategy

We present HOMI, a new technique to enhance symbolic execution by maintaining only a small number of promising states. In practice, symbolic execution typically maintains as many states as possible in a fear of losing important states. In this paper, however, we show that only a tiny subset of the states plays a significant role in increasing code coverage or reaching bug points. Based on this observation, HOMI aims to minimize the total number of states while keeping “promising” states during symbolic execution. We identify promising states by a learning algorithm that continuously updates the probabilistic pruning strategy based on data accumulated during the testing process. Experimental results show that HOMI greatly increases code coverage and the ability to find bugs of KLEE on open-source C programs.

Mining assumptions for software components using machine learning

Software verification approaches aim to check a software component under analysis for all possible environments. In reality, however, components are expected to operate within a larger system and are required to satisfy their requirements only when their inputs are constrained by environment assumptions. In this paper, we propose EPIcuRus, an approach to automatically synthesize environment assumptions for a component under analysis (i.e., conditions on the component inputs under which the component is guaranteed to satisfy its requirements). EPIcuRus combines search-based testing, machine learning and model checking. The core of EPIcuRus is a decision tree algorithm that infers environment assumptions from a set of test results including test cases and their verdicts. The test cases are generated using search-based testing, and the assumptions inferred by decision trees are validated through model checking. In order to improve the efficiency and effectiveness of the assumption generation process, we propose a novel test case generation technique, namely Important Features Boundary Test (IFBT), that guides the test generation based on the feedback produced by machine learning. We evaluated EPIcuRus by assessing its effectiveness in computing assumptions on a set of study subjects that include 18 requirements of four industrial models. We show that, for each of the 18 requirements, EPIcuRus was able to compute an assumption to ensure the satisfaction of that requirement, and further, ≈78% of these assumptions were computed in one hour.

Mining input grammars from dynamic control flow

One of the key properties of a program is its input specification. Having a formal input specification can be critical in fields such as vulnerability analysis, reverse engineering, software testing, clone detection, or refactoring. Unfortunately, accurate input specifications for typical programs are often unavailable or out of date.

In this paper, we present a general algorithm that takes a program and a small set of sample inputs and automatically infers a readable context-free grammar capturing the input language of the program. We infer the syntactic input structure only by observing access of input characters at different locations of the input parser. This works on all stack based recursive descent input parsers, including parser combinators, and works entirely without program specific heuristics. Our Mimid prototype produced accurate and readable grammars for a variety of evaluation subjects, including complex languages such as JSON, TinyC, and JavaScript.

Modular collaborative program analysis in OPAL

Current approaches combining multiple static analyses deriving different, independent properties focus either on modularity or performance. Whereas declarative approaches facilitate modularity and automated, analysis-independent optimizations, imperative approaches foster manual, analysis-specific optimizations.

In this paper, we present a novel approach to static analyses that leverages the modularity of blackboard systems and combines declarative and imperative techniques. Our approach allows exchangeability, and pluggable extension of analyses in order to improve sound(i)ness, precision, and scalability and explicitly enables the combination of otherwise incompatible analyses. With our approach integrated in the OPAL framework, we were able to implement various dissimilar analyses, including a points-to analysis that outperforms an equivalent analysis from Doop, the state-of-the-art points-to analysis framework.

Past-sensitive pointer analysis for symbolic execution

We propose a novel fine-grained integration of pointer analysis with dynamic analysis, including dynamic symbolic execution. This is achieved via past-sensitive pointer analysis, an on-demand pointer analysis instantiated with an abstraction of the dynamic state on which it is invoked. We evaluate our technique in three application scenarios: chopped symbolic execution, symbolic pointer resolution, and write integrity testing. Our preliminary results show that the approach can have a significant impact in these scenarios, by effectively improving the precision of standard pointer analysis with only a modest performance overhead.

TypeWriter: neural type prediction with search-based validation

Maintaining large code bases written in dynamically typed languages, such as JavaScript or Python, can be challenging due to the absence of type annotations: simple data compatibility errors proliferate, IDE support is limited, and APIs are hard to comprehend. Recent work attempts to address those issues through either static type inference or probabilistic type prediction. Unfortunately, static type inference for dynamic languages is inherently limited, while probabilistic approaches suffer from imprecision. This paper presents TypeWriter, the first combination of probabilistic type prediction with search-based refinement of predicted types. TypeWriter’s predictor learns to infer the return and argument types for functions from partially annotated code bases by combining the natural language properties of code with programming language-level information. To validate predicted types, TypeWriter invokes a gradual type checker with different combinations of the predicted types, while navigating the space of possible type combinations in a feedback-directed manner. We implement the TypeWriter approach for Python and evaluate it on two code corpora: a multi-million line code base at Facebook and a collection of 1,137 popular open-source projects. We show that TypeWriter’s type predictor achieves an F1 score of 0.64 (0.79) in the top-1 (top-5) predictions for return types, and 0.57 (0.80) for argument types, which clearly outperforms prior type prediction models. By combining predictions with search-based validation, TypeWriter can fully annotate between 14% to 44% of the files in a randomly selected corpus, while ensuring type correctness. A comparison with a static type inference tool shows that TypeWriter adds many more non-trivial types. TypeWriter currently suggests types to developers at Facebook and several thousands of types have already been accepted with minimal changes.

UBITect: a precise and scalable method to detect use-before-initialization bugs in Linux kernel

Use-before-Initialization (UBI) bugs in the Linux kernel have serious security impacts, such as information leakage and privilege escalation. Developers are adopting forced initialization to cope with UBI bugs, but this approach can still lead to undefined behaviors (e.g., NULL pointer dereference). As it is hard to infer correct initialization values, we believe that the best way to mitigate UBI bugs is detection and manual patching. Precise detection of UBI bugs requires path-sensitive analysis. The detector needs to track an associated variable’s initialization status along all the possible program execution paths to its uses. However, such exhaustive analysis prevents the detection from scaling to the whole Linux kernel. This paper presents UBITect, a UBI bug finding tool which combines flow-sensitive type qualifier analysis and symbolic execution to perform precise and scalable UBI bug detection. The scalable qualifier analysis guides symbolic execution to analyze variables that are likely to cause UBI bugs. UBITect also does not require manual effort for annotations and hence, it can be directly applied to the kernel without any source code or intermediate representation (IR) change. On the Linux kernel version 4.14, UBITect reported 190 bugs, among which 78 bugs were deemed by us as true positives and 52 were confirmed by Linux maintainers.


Exploring how deprecated Python library APIs are (not) handled

In this paper, we present the first exploratory study of deprecated Python library APIs to understand the status quo of API deprecation in the realm of Python libraries. Specifically, we aim to comprehend how deprecated library APIs are declared and documented in practice by their maintainers, and how library users react to them. By thoroughly looking into six reputed Python libraries and 1,200 GitHub projects, we experimentally observe that API deprecation is poorly handled by library contributors, which subsequently introduce difficulties for Python developers to resolve the usage of deprecated library APIs. This empirical evidence suggests that our community should take immediate actions to appropriately handle the deprecation of Python library APIs.

Selecting third-party libraries: the practitioners’ perspective

The selection of third-party libraries is an essential element of virtually any software development project. However, deciding which libraries to choose is a challenging practical problem. Selecting the wrong library can severely impact a software project in terms of cost, time, and development effort, with the severity of the impact depending on the role of the library in the software architecture, among others. Despite the importance of following a careful library selection process, in practice, the selection of third-party libraries is still conducted in an ad-hoc manner, where dozens of factors play an influential role in the decision.

In this paper, we study the factors that influence the selection process of libraries, as perceived by industry developers. To that aim, we perform a cross-sectional interview study with 16 developers from 11 different businesses and survey 115 developers that are involved in the selection of libraries. We systematically devised a comprehensive set of 26 technical, human, and economic factors that developers take into consideration when selecting a software library. Eight of these factors are new to the literature. We explain each of these factors and how they play a role in the decision. Finally, we discuss the implications of our work to library maintainers, potential library users, package manager developers, and empirical software engineering researchers.

SESSION: Cloud / Services

A principled approach to GraphQL query cost analysis

The landscape of web APIs is evolving to meet new client requirements and to facilitate how providers fulfill them. A recent web API model is GraphQL, which is both a query language and a runtime. Using GraphQL, client queries express the data they want to retrieve or mutate, and servers respond with exactly those data or changes. GraphQL’s expressiveness is risky for service providers because clients can succinctly request stupendous amounts of data, and responding to overly complex queries can be costly or disrupt service availability. Recent empirical work has shown that many service providers are at risk. Using traditional API management methods is not sufficient, and practitioners lack principled means of estimating and measuring the cost of the GraphQL queries they receive. In this work, we present a linear-time GraphQL query analysis that can measure the cost of a query without executing it. Our approach can be applied in a separate API management layer and used with arbitrary GraphQL backends. In contrast to existing static approaches, our analysis supports common GraphQL conventions that affect query cost, and our analysis is provably correct based on our formal specification of GraphQL semantics. We demonstrate the potential of our approach using a novel GraphQL query-response corpus for two commercial GraphQL APIs. Our query analysis consistently obtains upper cost bounds, tight enough relative to the true response sizes to be actionable for service providers. In contrast, existing static GraphQL query analyses exhibit over-estimates and under-estimates because they fail to support GraphQL conventions.

Beware the evolving ‘intelligent’ web service! an integration architecture tactic to guard AI-first components

Intelligent services provide the power of AI to developers via simple RESTful API endpoints, abstracting away many complexities of machine learning. However, most of these intelligent services---such as computer vision---continually learn with time. When the internals within the abstracted 'black box' become hidden and evolve, pitfalls emerge in the robustness of applications that depend on these evolving services. Without adapting the way developers plan and construct projects reliant on intelligent services, significant gaps and risks result in both project planning and development. Therefore, how can software engineers best mitigate software evolution risk moving forward, thereby ensuring that their own applications maintain quality? Our proposal is an architectural tactic designed to improve intelligent service-dependent software robustness. The tactic involves creating an application-specific benchmark dataset baselined against an intelligent service, enabling evolutionary behaviour changes to be mitigated. A technical evaluation of our implementation of this architecture demonstrates how the tactic can identify 1,054 cases of substantial confidence evolution and 2,461 cases of substantial changes to response label sets using a dataset consisting of 331 images that evolve when sent to a service.

Block public access: trust safety verification of access control policies

Data stored in cloud services is highly sensitive and so access to it is controlled via policies written in domain-specific languages (DSLs). The expressiveness of these DSLs provides users flexibility to cover a wide variety of uses cases, however, unintended misconfigurations can lead to potential security issues. We introduce Block Public Access, a tool that formally verifies policies to ensure that they only allow access to trusted principals, i.e. that they prohibit access to the general public. To this end, we formalize the notion of Trust Safety that formally characterizes whether or not a policy allows unconstrained (public) access. Next, we present a method to compile the policy down to a logical formula whose unsatisfiability can be (1) checked by SMT and (2) ensures Trust Safety. The constructs of the policy DSLs render unsatisfiability checking PSPACE-complete, which precludes verifying the millions of requests per second seen at cloud scale. Hence, we present an approach that leverages the structure of the policy DSL to compute a much smaller residual policy that corresponds only to untrusted accesses. Our approach allows Block Public Access to, in the common case, syntactically verify Trust Safety without having to query the SMT solver. We have implemented Block Public Access and present an evaluation showing how the above optimization yields a low-latency policy verifier that the S3 team at AWS has integrated into their authorization system, where it is currently in production, analyzing millions of policies everyday to ensure that client buckets do not grant unintended public access.

Efficient incident identification from multi-dimensional issue reports via meta-heuristic search

In large-scale cloud systems, unplanned service interruptions and outages may cause severe degradation of service availability. Such incidents can occur in a bursty manner, which will deteriorate user satisfaction. Identifying incidents rapidly and accurately is critical to the operation and maintenance of a cloud system. In industrial practice, incidents are typically detected through analyzing the issue reports, which are generated over time by monitoring cloud services. Identifying incidents in a large number of issue reports is quite challenging. An issue report is typically multi-dimensional: it has many categorical attributes. It is difficult to identify a specific attribute combination that indicates an incident. Existing methods generally rely on pruning-based search, which is time-consuming given high-dimensional data, thus not practical to incident detection in large-scale cloud systems. In this paper, we propose MID (Multi-dimensional Incident Detection), a novel framework for identifying incidents from large-amount, multi-dimensional issue reports effectively and efficiently. Key to the MID design is encoding the problem into a combinatorial optimization problem. Then a specific-tailored meta-heuristic search method is designed, which can rapidly identify attribute combinations that indicate incidents. We evaluate MID with extensive experiments using both synthetic data and real-world data collected from a large-scale production cloud system. The experimental results show that MID significantly outperforms the current state-of-the-art methods in terms of effectiveness and efficiency. Additionally, MID has been successfully applied to Microsoft's cloud systems and helped greatly reduce manual maintenance effort.

Identifying linked incidents in large-scale online service systems

In large-scale online service systems, incidents occur frequently due to a variety of causes, from updates of software and hardware to changes in operation environment. These incidents could significantly degrade system’s availability and customers’ satisfaction. Some incidents are linked because they are duplicate or inter-related. The linked incidents can greatly help on-call engineers find mitigation solutions and identify the root causes. In this work, we investigate the incidents and their links in a representative real-world incident management (IcM) system. Based on the identified indicators of linked incidents, we further propose LiDAR (Linked Incident identification with DAta-driven Representation), a deep learning based approach to incident linking. More specifically, we incorporate the textual description of incidents and structural information extracted from historical linked incidents to identify possible links among a large number of incidents. To show the effectiveness of our method, we apply our method to a real-world IcM system and find that our method outperforms other state-of-the-art methods.

Real-time incident prediction for online service systems

Incidents in online service systems could dramatically degrade system availability and destroy user experience. To guarantee service quality and reduce economic loss, it is essential to predict the occurrence of incidents in advance so that engineers can take some proactive actions to prevent them. In this work, we propose an effective and interpretable incident prediction approach, called eWarn, which utilizes historical data to forecast whether an incident will happen in the near future based on alert data in real time. More specifically, eWarn first extracts a set of effective features (including textual features and statistical features) to represent omen alert patterns via careful feature engineering. To reduce the influence of noisy alerts (that are not relevant to the occurrence of incidents), eWarn then incorporates the multi-instance learning formulation. Finally, eWarn builds a classification model via machine learning and generates an interpretable report about the prediction result via a state-of-the-art explanation technique (i.e., LIME). In this way, an early warning signal along with its interpretable report can be sent to engineers to facilitate their understanding and handling for the incoming incident. An extensive study on 11 real-world online service systems from a large commercial bank demonstrates the effectiveness of eWarn, outperforming state-of-the-art alert-based incident prediction approaches and the practice of incident prediction with alerts. In particular, we have applied eWarn to two large commercial banks in practice and shared some success stories and lessons learned from real deployment.

SESSION: Configuration

Configuration smells in continuous delivery pipelines: a linter and a six-month study on GitLab

An effective and efficient application of Continuous Integration (CI) and Delivery (CD) requires software projects to follow certain principles and good practices. Configuring such a CI/CD pipeline is challenging and error-prone. Therefore, automated linters have been proposed to detect errors in the pipeline. While existing linters identify syntactic errors, detect security vulnerabilities or misuse of the features provided by build servers, they do not support developers that want to prevent common misconfigurations of a CD pipeline that potentially violate CD principles (“CD smells”). To this end, we propose CD-Linter, a semantic linter that can automatically identify four different smells in pipeline configuration files. We have evaluated our approach through a large-scale and long-term study that consists of (i) monitoring 145 issues (opened in as many open-source projects) over a period of 6 months, (ii) manually validating the detection precision and recall on a representative sample of issues, and (iii) assessing the magnitude of the observed smells on 5,312 open-source projects on GitLab. Our results show that CD smells are accepted and fixed by most of the developers and our linter achieves a precision of 87% and a recall of 94%. Those smells can be frequently observed in the wild, as 31% of projects with long configurations are affected by at least one smell.

Dimensions of software configuration: on the configuration context in modern software development

With the rise of containerization, cloud development, and continuous integration and delivery, configuration has become an essential aspect not only to tailor software to user requirements, but also to configure a software system’s environment and infrastructure. This heterogeneity of activities, domains, and processes blurs the term configuration, as it is not clear anymore what tasks, artifacts, or stakeholders are involved and intertwined. However, each re- search study and each paper involving configuration places their contributions and findings in a certain context without making the context explicit. This makes it difficult to compare findings, translate them to practice, and to generalize the results. Thus, we set out to evaluate whether these different views on configuration are really distinct or can be summarized under a common umbrella. By interviewing practitioners from different domains and in different roles about the aspects of configuration and by analyzing two qualitative studies in similar areas, we derive a model of configuration that provides terminology and context for research studies, identifies new research opportunities, and allows practitioners to spot possible challenges in their current tasks. Although our interviewees have a clear view about configuration, it substantially differs due to their personal experience and role. This indicates that the term configuration might be overloaded. However, when taking a closer look, we see the interconnections and dependencies among all views, arriving at the conclusion that we need to start considering the entire spectrum of dimensions of configuration.

Global cost/quality management across multiple applications

Approximation is a technique that optimizes the balance between application outcome quality and its resource usage. Trading quality for performance has been investigated for single application scenarios, but not for environments where multiple approximate applications may run concurrently on the same machine, interfering with each other by sharing machine resources. Applying existing, single application techniques to this multi-programming environment may lead to configuration space size explosion, or result in poor overall application quality outcomes.

Our new RAPID-M system is the first cross-application con-figuration management framework. It reduces the problem size by clustering configurations of individual applications into local"similarity buckets". The global cross-applications configuration selection is based on these local bucket spaces. RAPID-M dynamically assigns buckets to applications such that overall quality is maximized while respecting individual application cost budgets.Once assigned a bucket, reconfigurations within buckets may be performed locally with minimal impact on global selections. Experimental results using six configurable applications show that even large configuration spaces of complex applications can be clustered into a small number of buckets, resulting in search space size reductions of up to 9 orders of magnitude for our six applications. RAPID-M constructs performance cost models with an average prediction error of ≤3%. For our application execution traces, RAPID-M dynamically selects configurations that lower the budget violation rate by 33.9% with an average budget exceeding rate of 6.6% as compared to other possible approaches. RAPID-M successfully finishes 22.75% more executions which translates to a 1.52X global output quality increase under high system loads. Theo verhead ofRAPID-Mis within≤1% of application execution times.

Understanding and discovering software configuration dependencies in cloud and datacenter systems

A large percentage of real-world software configuration issues, such as misconfigurations, involve multiple interdependent configuration parameters. However, existing techniques and tools either do not consider dependencies among configuration parameters— termed configuration dependencies—or rely on one or two dependency types and code patterns as input. Without rigorous understanding of configuration dependencies, it is hard to deal with many resulting configuration issues.

This paper presents our study of software configuration dependencies in 16 widely-used cloud and datacenter systems, including dependencies within and across software components. To understand types of configuration dependencies, we conduct an exhaustive search of descriptions in structured configuration metadata and unstructured user manuals. We find and manually analyze 521 configuration dependencies. We define five types of configuration dependencies and identify their common code patterns. We report on consequences of not satisfying these dependencies and current software engineering practices for handling the consequences.

We mechanize the knowledge gained from our study in a tool, cDep, which detects configuration dependencies. cDep automatically discovers five types of configuration dependencies from bytecode using static program analysis. We apply cDep to the eight Java and Scala software systems in our study. cDep finds 87.9% (275/313) of the related subset of dependencies from our study. cDep also finds 448 previously undocumented dependencies, with a 6.0% average false positive rate. Overall, our results show that configuration dependencies are more prevalent and diverse than previously reported and should henceforth be considered a first-class issue in software configuration engineering.

SESSION: Documentation

Docable: evaluating the executability of software tutorials

The typical software tutorial includes step-by-step instructions for installing developer tools, editing files and code, and running commands. When these software tutorials are not executable, either due to missing instructions, ambiguous steps, or simply broken commands, their value is diminished. Non-executable tutorials impact developers in several ways, including frustrating learning experiences, and limiting usability of developer tools.

To understand to what extent software tutorials are executable---and why they may fail---we conduct an empirical study on over 600 tutorials, including nearly 15,000 code blocks. We find a naive execution strategy achieves an overall executability rate of only 26%. Even a human-annotation-based execution strategy---while doubling executability---still yields no tutorial that can successfully execute all steps. We identify several common executability barriers, ranging from potentially innocuous causes, such as interactive prompts requiring human responses, to insidious errors, such as missing steps and inaccessible resources. We validate our findings with major stakeholders in technical documentation and discuss possible strategies for improving software tutorials, such as providing accessible alternatives for tutorial takers, and investing in automated tutorial testing to ensure continuous quality of software tutorials.

RulePad: interactive authoring of checkable design rules

Good documentation offers the promise of enabling developers to easily understand design decisions. Unfortunately, in practice, design documents are often rarely updated, becoming inaccurate, incomplete, and untrustworthy. A better solution is to enable developers to write down design rules which are checked against code for consistency. But existing rule checkers require learning specialized query languages or program analysis frameworks, creating a barrier to writing project-specific rules. We introduce two new techniques for authoring design rules: snippet-based authoring and semi-natural-language authoring. In snippet-based authoring, developers specify characteristics of elements to match by writing partial code snippets. In semi-natural language authoring, a textual representation offers a representation for understanding design rules and resolving ambiguities. We implemented these approaches in RulePad. To evaluate RulePad, we conducted a between-subjects study with 14 participants comparing RulePad to the PMD Designer, a utility for writing rules in a popular rule checker. We found that those with RulePad were able to successfully author 13 times more query elements in significantly less time and reported being significantly more willing to use RulePad in their everyday work.

SESSION: Empirical

A first look at good first issues on GitHub

Keeping a good influx of newcomers is critical for open source software projects' survival, while newcomers face many barriers to contributing to a project for the first time. To support newcomers onboarding, GitHub encourages projects to apply labels such as good first issue (GFI) to tag issues suitable for newcomers. However, many newcomers still fail to contribute even after many attempts, which not only reduces the enthusiasm of newcomers to contribute but makes the efforts of project members in vain. To better support the onboarding of newcomers, this paper reports a preliminary study on this mechanism from its application status, effect, problems, and best practices. By analyzing 9,368 GFIs from 816 popular GitHub projects and conducting email surveys with newcomers and project members, we obtain the following results. We find that more and more projects are applying this mechanism in the past decade, especially the popular projects. Compared to common issues, GFIs usually need more days to be solved. While some newcomers really join the projects through GFIs, almost half of GFIs are not solved by newcomers. We also discover a series of problems covering mechanism (e.g., inappropriate GFIs), project (e.g., insufficient GFIs) and newcomer (e.g., uneven skills) that makes this mechanism ineffective. We discover the practices that may address the problems, including identifying GFIs that have informative description and available support, and require limited scope and skill, etc. Newcomer onboarding is an important but challenging question in open source projects and our work enables a better understanding of GFI mechanism and its problems, as well as highlights ways in improving them.

A randomized controlled trial on the effects of embedded computer language switching

Polyglot programming, the use of multiple programming languages during the development process, is common practice in modern software development. This study investigates this practice through a randomized controlled trial conducted under the context of database programming. Participants in the study were given coding tasks written in Java and one of three SQL-like embedded languages. One was plain SQL in strings, one was in Java only, and the third was a hybrid embedded language that was closer to the host language. We recorded 109 valid data points. Results showed significant differences in how developers of different experience levels code using polyglot techniques. Notably, less experienced programmers wrote correct programs faster in the hybrid condition (frequent, but less severe, switches), while more experienced developers that already knew both languages performed better in traditional SQL (less frequent but more complete switches). The results indicate that the productivity impact of polyglot programming is complex and experience level dependent.

A theory of the engagement in open source projects via summer of code programs

Summer of code programs connect students to open source software (OSS) projects, typically during the summer break from school. Analyzing consolidated summer of code programs can reveal how college students, who these programs usually target, can be motivated to participate in OSS, and what onboarding strategies OSS communities adopt to receive these students. In this paper, we study the well-established Google Summer of Code (GSoC) and devise an integrated engagement theory grounded in multiple data sources to explain motivation and onboarding in this context. Our analysis shows that OSS communities employ several strategies for planning and executing student participation, socially integrating the students, and rewarding student’s contributions and achievements. Students are motivated by a blend of rewards, which are moderated by external factors. We presented these rewards and the motivation theory to students who had never participated in a summer of code program and collected their shift in motivation after learning about the theory. New students can benefit from the former students' experiences detailed in our results, and OSS stakeholders can leverage both the insight into students’ motivations for joining such programs as well as the onboarding strategies we identify to devise actions to attract and retain newcomers.

An empirical analysis of the costs of clone- and platform-oriented software reuse

Software reuse lowers development costs and improves the quality of software systems. Two strategies are common: clone & own (copying and adapting a system) and platform-oriented reuse (building a configurable platform). The former is readily available, flexible, and initially cheap, but does not scale with the frequency of reuse, imposing high maintenance costs. The latter scales, but imposes high upfront investments for building the platform, and reduces flexibility. As such, each strategy has distinctive advantages and disadvantages, imposing different development activities and software architectures. Deciding for one strategy is a core decision with long-term impact on an organization’s software development. Unfortunately, the strategies’ costs are not well-understood - not surprisingly, given the lack of systematically elicited empirical data, which is difficult to collect. We present an empirical study of the development activities, costs, cost factors, and benefits associated with either reuse strategy. For this purpose, we combine quantitative and qualitative data that we triangulated from 26 interviews at a large organization and a systematic literature review covering 57 publications. Our study both confirms and refutes common hypotheses on software reuse. For instance, we confirm that developing for platform-oriented reuse is more expensive, but simultaneously reduces reuse costs; and that platform-orientation results in higher code quality compared to clone & own. Surprisingly, refuting common hypotheses, we find that change propagation can be more expensive in a platform, that platforms can facilitate the advancement into innovative markets, and that there is no strict distinction of clone & own and platform-oriented reuse in practice.

An empirical study of bots in software development: characteristics and challenges from a practitioner’s perspective

Software engineering bots – automated tools that handle tedious tasks – are increasingly used by industrial and open source projects to improve developer productivity. Current research in this area is held back by a lack of consensus of what software engineering bots (DevBots) actually are, what characteristics distinguish them from other tools, and what benefits and challenges are associated with DevBot usage. In this paper we report on a mixed-method empirical study of DevBot usage in industrial practice. We report on findings from interviewing 21 and surveying a total of 111 developers. We identify three different personas among DevBot users (focusing on autonomy, chat interfaces, and “smartness”), each with different definitions of what a DevBot is, why developers use them, and what they struggle with.We conclude that future DevBot research should situate their work within our framework, to clearly identify what type of bot the work targets, and what advantages practitioners can expect. Further, we find that there currently is a lack of general purpose “smart” bots that go beyond simple automation tools or chat interfaces. This is problematic, as we have seen that such bots, if available, can have a transformative effect on the projects that use them.

Biases and differences in code review using medical imaging and eye-tracking: genders, humans, and machines

Code review is a critical step in modern software quality assurance, yet it is vulnerable to human biases. Previous studies have clarified the extent of the problem, particularly regarding biases against the authors of code,but no consensus understanding has emerged. Advances in medical imaging are increasingly applied to software engineering, supporting grounded neurobiological explorations of computing activities, including the review, reading, and writing of source code. In this paper, we present the results of a controlled experiment using both medical imaging and also eye tracking to investigate the neurological correlates of biases and differences between genders of humans and machines (e.g., automated program repair tools) in code review. We find that men and women conduct code reviews differently, in ways that are measurable and supported by behavioral, eye-tracking and medical imaging data. We also find biases in how humans review code as a function of its apparent author, when controlling for code quality. In addition to advancing our fundamental understanding of how cognitive biases relate to the code review process, the results may inform subsequent training and tool design to reduce bias.

Community expectations for research artifacts and evaluation processes

Background. Artifact evaluation has been introduced into the software engineering and programming languages research community with a pilot at ESEC/FSE 2011 and has since then enjoyed a healthy adoption throughout the conference landscape. Objective. In this qualitative study, we examine the expectations of the community toward research artifacts and their evaluation processes. Method. We conducted a survey including all members of artifact evaluation committees of major conferences in the software engineering and programming language field since the first pilot and compared the answers to expectations set by calls for artifacts and reviewing guidelines. Results. While we find that some expectations exceed the ones expressed in calls and reviewing guidelines, there is no consensus on quality thresholds for artifacts in general. We observe very specific quality expectations for specific artifact types for review and later usage, but also a lack of their communication in calls. We also find problematic inconsistencies in the terminology used to express artifact evaluation’s most important purpose – replicability. Conclusion. We derive several actionable suggestions which can help to mature artifact evaluation in the inspected community and also to aid its introduction into other communities in computer science.

Does stress impact technical interview performance?

Software engineering candidates commonly participate in whiteboard technical interviews as part of a hiring assessment. During these sessions, candidates write code while thinking aloud as they work towards a solution, under the watchful eye of an interviewer. While technical interviews should allow for an unbiased and inclusive assessment of problem-solving ability, surprisingly, technical interviews may be instead a procedure for identifying candidates who best handle and migrate stress solely caused by being examined by an interviewer (performance anxiety).

To understand if coding interviews—as administered today—can induce stress that significantly hinders performance, we conducted a randomized controlled trial with 48 Computer Science students, comparing them in private and public whiteboard settings. We found that performance is reduced by more than half, by simply being watched by an interviewer. We also observed that stress and cognitive load were significantly higher in a traditional technical interview when compared with our private interview. Consequently, interviewers may be filtering out qualified candidates by confounding assessment of problem-solving ability with unnecessary stress. We propose interview modifications to make problem-solving assessment more equitable and inclusive, such as through private focus sessions and retrospective think-aloud, allowing companies to hire from a larger and diverse pool of talent.

Exploring the evolution of software practices

When software products and services are developed and maintained over longer time, software engineering practices tend to drift away from both structured and agile methods. Nonetheless, in many cases the evolving practices are far from ad hoc or chaotic. How are the teams involved able to coordinate their joint development?This article reports on an ethnographic study of a small team at a successful provider of software as a service. What struck us was the very explicit way in which the team adopted and adapted their practices to fit the needs of the evolving development. The discussion relates the findings to the concepts of social practices and methods in software engineering, and explores the differences between degraded behavior and the coordinated evolution of development practices. The analysis helps to better understand how software engineering practices evolve, and thus provides a starting point for rethinking software engineering methods and their relation to software engineering practice.

Heard it through the Gitvine: an empirical study of tool diffusion across the npm ecosystem

Automation tools like continuous integration services, code coverage reporters, style checkers, dependency managers, etc. are all known to provide significant improvements in developer productivity and software quality. Some of these tools are widespread, others are not. How do these automation "best practices" spread? And how might we facilitate the diffusion process for those that have seen slower adoption? In this paper, we rely on a recent innovation in transparency on code hosting platforms like GitHub---the use of repository badges---to track how automation tools spread in open-source ecosystems through different social and technical mechanisms over time. Using a large longitudinal data set, multivariate network science techniques, and survival analysis, we study which socio-technical factors can best explain the observed diffusion process of a number of popular automation tools. Our results show that factors such as social exposure, competition, and observability affect the adoption of tools significantly, and they provide a roadmap for software engineers and researchers seeking to propagate best practices and tools.

Interactive, effort-aware library version harmonization

As a mixed result of intensive dependency on third-party libraries, flexible mechanisms to declare dependencies and increased number of modules in a project, different modules of a project directly depend on multiple versions of the same third-party library. Such library version inconsistencies could increase dependency maintenance cost, or even lead to dependency conflicts when modules are inter-dependent. Although automated build tools (e.g., Maven's enforcer plugin) provide partial support to detect library version inconsistencies, they do not provide any support to harmonize inconsistent library versions.

We first conduct a survey with 131 Java developers from GitHub to retrieve first-hand information about the root causes, detection methods, reasons for fixing or not fixing, fixing strategies, fixing efforts, and tool expectations on library version inconsistencies. Then, based on the insights from our survey, we propose LibHarmo, an interactive, effort-aware library version harmonization technique, to detect library version inconsistencies, interactively suggest a harmonized version with the least harmonization efforts based on library API usage analysis, and refactor build configuration files.

LibHarmo is currently developed for Java Maven projects. Our experimental study on 443 highly-starred Java Maven projects from GitHub shows that i) LibHarmo detected 621 library version inconsistencies in 152 (34.3%) projects with a false positive rate of 16.8%, while Maven's enforcer plugin only detected 219 of them; and ii) LibHarmo saved 87.5% of the harmonization efforts. Further, 31 library version inconsistencies have been confirmed, and 17 of them have been already harmonized by developers.

On the naturalness of hardware descriptions

Mining software repositories (MSR) has been shown effective for extracting data used to improve various software engineering tasks, including code completion, code repair, code search, and code summarization. Despite a large body of work on MSR, researchers have focused almost exclusively on repositories that contain code written in imperative programming languages, such as Java and C/C++. Unlike prior work, in this paper, we focus on mining publicly available hardware descriptions (HDs) written in hardware description languages (HDLs), such as VHDL. HDLs have unique syntax and semantics compared to popular imperative languages, and learning-based tools available to hardware designers are well behind those used in other application domains. We assembled large HD corpora consisting of source code written in several HDLs and report on their characteristics. Our language model evaluation reveals that HDs possess a high level of naturalness similar to software written in imperative languages. Further, by utilizing our corpora, we built several deep learning models for automated code completion in VHDL; our models take into account unique characteristics of HDLs, including similarities of nearby concurrent signal assignment statements, in-built concurrency, and the frequently used signal types. These characteristics led to more effective neural models, achieving a BLEU score of 37.3, an 8-14-point improvement over rule-based and neural baselines.

On the relationship between design discussions and design quality: a case study of Apache projects

Open design discussion is a primary mechanism through which open source projects debate, make and document design decisions. However, there are open questions regarding how design discussions are conducted and what effect they have on the design quality of projects. Recent work has begun to investigate design discussions, but has thus far focused on a single communication channel, whereas many projects use multiple channels. In this study, we examine 37 Apache projects and their design discussions, the project’s design quality evolution, and the relationship between design discussion and design quality. A mixed method empirical analysis (data mining and a survey of 130 developers) shows that: I) 89.51% of all design discussions occur in project mailing list, II) both core and non-core developers participate in design discussions, but core developers implement more design related changes (67.06%), and III) the correlation between design discussions and design quality is small. We conclude the paper with several observations that form the foundation for future research and development.

On the relationship between refactoring actions and bugs: a differentiated replication

Software refactoring aims at improving code quality while preserving the system's external behavior. Although in principle refactoring is a behavior-preserving activity, a study presented by Bavota etal in 2012 reported the proneness of some refactoring actions (eg pull up method) to induce faults. The study was performed by mining refactoring activities and bugs from three systems. Taking profit of the advances made in the mining software repositories field (eg better tools to detect refactoring actions at commit-level granularity), we present a differentiated replication of the work by Bavota etal in which we (i) overcome some of the weaknesses that affect their experimental design, (ii) answer the same research questions of the original study on a much larger dataset (3 vs 103 systems), and (iii) complement the quantitative analysis of the relationship between refactoring and bugs with a qualitative, manual inspection of commits aimed at verifying the extent to which refactoring actions trigger bug-fixing activities. The results of our quantitative analysis confirm the findings of the replicated study, while the qualitative analysis partially demystifies the role played by refactoring actions in the bug introduction.

Questions for data scientists in software engineering: a replication

In 2014, a Microsoft study investigated the sort of questions that data science applied to software engineering should answer. This resulted in 145 questions that developers considered relevant for data scientists to answer, thus providing a research agenda to the community. Fast forward to five years, no further studies investigated whether the questions from the software engineers at Microsoft hold for other software companies, including software-intensive companies with different primary focus (to which we refer as software-defined enterprises). Furthermore, it is not evident that the problems identified five years ago are still applicable, given the technological advances in software engineering. This paper presents a study at ING, a software-defined enterprise in banking in which over 15,000 IT staff provides in-house software solutions. This paper presents a comprehensive guide of questions for data scientists selected from the previous study at Microsoft along with our current work at ING. We replicated the original Microsoft study at ING, looking for questions that impact both software companies and software-defined enterprises and continue to impact software engineering. We also add new questions that emerged from differences in the context of the two companies and the five years gap in between. Our results show that software engineering questions for data scientists in the software-defined enterprise are largely similar to the software company, albeit with exceptions. We hope that the software engineering research community builds on the new list of questions to create a useful body of knowledge.

Reducing implicit gender biases in software development: does intergroup contact theory work?

The software development profession suffers from severe gender biases, which could be explicit and implicit. However, SE literature has not systematically explored and evaluated the methods for reducing gender biases, especially for implicit gender biases. This paper reports on a field experiment to examine whether the intergroup contact theory could reduce implicit gender biases in software development. In the field experiment, 280 undergraduate students taking a project-centric introductory software engineering course were assigned to 70 teams with different contact configurations. We measured and compared their explicit and implicit gender biases before and after contacts in their teams. The study yields a rich set of findings. First, we confirmed the positive effects of intergroup contact theory in reducing gender biases, particularly the implicit gender biases in both general and SE-specific contexts. We further revealed that such effects were subjected to different contact configurations. The intergroup contact theory's effects were maximized in teams where the number of females is greater than or equal to the number of males. When the female is the minority group in a team, contacts among members contribute to reducing male members' implicit gender biases but fail to result in the same scale of effects on female members' implicit gender biases. The findings provide insights into using intergroup contact theory in reducing implicit gender biases in software development contexts.

Robotics software engineering: a perspective from the service robotics domain

Robots that support humans by performing useful tasks (a.k.a., service robots) are booming worldwide. In contrast to industrial robots, the development of service robots comes with severe software engineering challenges, since they require high levels of robustness and autonomy to operate in highly heterogeneous environments. As a domain with critical safety implications, service robotics faces a need for sound software development practices. In this paper, we present the first large-scale empirical study to assess the state of the art and practice of robotics software engineering. We conducted 18 semi-structured interviews with industrial practitioners working in 15 companies from 9 different countries and a survey with 156 respondents from 26 countries from the robotics domain. Our results provide a comprehensive picture of (i) the practices applied by robotics industrial and academic practitioners, including processes, paradigms, languages, tools, frameworks, and reuse practices, (ii) the distinguishing characteristics of robotics software engineering, and (iii) recurrent challenges usually faced, together with adopted solutions. The paper concludes by discussing observations, derived hypotheses, and proposed actions for researchers and practitioners.

Thinking aloud about confusing code: a qualitative investigation of program comprehension and atoms of confusion

Atoms of confusion are small patterns of code that have been empirically validated to be difficult to hand-evaluate by programmers. Previous research focused on defining and quantifying this phenomenon, but not on explaining or critiquing it. In this work, we address core omissions to the body of work on atoms of confusion, focusing on the ‘how’ and ‘why’ of programmer misunderstanding.

We performed a think-aloud study in which we observed programmers, both professionals and students, as they hand-evaluated confusing code. We performed a qualitative analysis of the data and found several surprising results, which explain previous results, outline avenues of further research, and suggest improvements of the research methodology.

A notable observation is that correct hand-evaluations do not imply understanding, and incorrect evaluations not misunderstanding. We believe this and other observations may be used to improve future studies and models of program comprehension. We argue that thinking of confusion as an atomic construct may pose challenges to formulating new candidates for atoms of confusion. Ultimately, we question whether hand-evaluation correctness is, itself, a sufficient instrument to study program comprehension.

Understanding build issue resolution in practice: symptoms and fix patterns

Build systems are essential for modern software maintenance and development, while build failures occur frequently across software systems, inducing non-negligible costs in development activities. Build failure resolution is a challenging problem and multiple studies have demonstrated that developers spend non-trivial time in resolving encountered build failures; to relieve manual efforts, automated resolution techniques are emerging recently, which are promising but still limitedly effective. Understanding how build failures are resolved in practice can provide guidelines for both developers and researchers on build issue resolution. Therefore, this work presents a comprehensive study of fix patterns in practical build failures. Specifically, we study 1,080 build issues of three popular build systems Maven, Ant, and Gradle from Stack Overflow, construct a fine-granularity taxonomy of 50 categories regarding to the failure symptoms, and summarize the fix patterns for different failure types. Our key findings reveal that build issues stretch over a wide spectrum of symptoms; 67.96% of the build issues are fixed by modifying the build script code related to plugins and dependencies; and there are 20 symptom categories, more than half of whose build issues can be fixed by specific patterns. Furthermore, we also address the challenges in applying non-intuitive or simplistic fix patterns for developers.

Understanding type changes in Java

Developers frequently change the type of a program element and update all its references for performance, security, concurrency,library migration, or better maintainability. Despite type changes being a common program transformation, it is the least automated and the least studied. With this knowledge gap, researchers miss opportunities to improve the state of the art in automation for software evolution, tool builders do not invest resources where automation is most needed, language and library designers can-not make informed decisions when introducing new types, and developers fail to use common practices when changing types. To fill this gap, we present the first large-scale and most fine-grained empirical study on type changes in Java. We develop state-of-the-art tools to statically mine 297,543 type changes and their subsequent code adaptations from a diverse corpus of 129 Java projects containing 416,652 commits. With this rich data set we answer research questions about the practice of type changes. Among others, we found that type changes are actually more common than renaming,but the current research and tools for type changes are inadequate.Based on our extensive and reliable data, we present actionable,empirically-justified implications.

SESSION: Fairness

Do the machine learning models on a crowd sourced platform exhibit bias? an empirical study on model fairness

Machine learning models are increasingly being used in important decision-making software such as approving bank loans, recommending criminal sentencing, hiring employees, and so on. It is important to ensure the fairness of these models so that no discrimination is made based on protected attribute (e.g., race, sex, age) while decision making. Algorithms have been developed to measure unfairness and mitigate them to a certain extent. In this paper, we have focused on the empirical evaluation of fairness and mitigations on real-world machine learning models. We have created a benchmark of 40 top-rated models from Kaggle used for 5 different tasks, and then using a comprehensive set of fairness metrics, evaluated their fairness. Then, we have applied 7 mitigation techniques on these models and analyzed the fairness, mitigation results, and impacts on performance. We have found that some model optimization techniques result in inducing unfairness in the models. On the other hand, although there are some fairness control mechanisms in machine learning libraries, they are not documented. The mitigation algorithm also exhibit common patterns such as mitigation in the post-processing is often costly (in terms of performance) and mitigation in the pre-processing stage is preferred in most cases. We have also presented different trade-off choices of fairness mitigation decisions. Our study suggests future research directions to reduce the gap between theoretical fairness aware algorithms and the software engineering methods to leverage them in practice.

Fairway: a way to build fair ML software

Machine learning software is increasingly being used to make decisions that affect people's lives. But sometimes, the core part of this software (the learned model), behaves in a biased manner that gives undue advantages to a specific group of people (where those groups are determined by sex, race, etc.). This "algorithmic discrimination" in the AI software systems has become a matter of serious concern in the machine learning and software engineering community. There have been works done to find "algorithmic bias" or "ethical bias" in the software system. Once the bias is detected in the AI software system, the mitigation of bias is extremely important. In this work, we a)explain how ground-truth bias in training data affects machine learning model fairness and how to find that bias in AI software,b)propose a method Fairway which combines pre-processing and in-processing approach to remove ethical bias from training data and trained model. Our results show that we can find bias and mitigate bias in a learned model, without much damaging the predictive performance of that model. We propose that (1) testing for bias and (2) bias mitigation should be a routine part of the machine learning software development life cycle. Fairway offers much support for these two purposes.

Towards automated verification of smart contract fairness

Smart contracts are computer programs allowing users to define and execute transactions automatically on top of the blockchain platform. Many of such smart contracts can be viewed as games. A game-like contract accepts inputs from multiple participants, and upon ending, automatically derives an outcome while distributing assets according to some predefined rules. Without clear understanding of the game rules, participants may suffer from fraudulent advertisements and financial losses. In this paper, we present a framework to perform (semi-)automated verification of smart contract fairness, whose results can be used to refute false claims with concrete examples or certify contract implementations with respect to desired fairness properties. We implement FairCon, which is able to check fairness properties including truthfulness, efficiency, optimality, and collusion-freeness for Ethereum smart contracts. We evaluate FairCon on a set of real-world benchmarks and the experiment result indicates that FairCon is effective in detecting property violations and able to prove fairness for common types of contracts.

SESSION: Fuzzing

Boosting fuzzer efficiency: an information theoretic perspective

In this paper, we take the fundamental perspective of fuzzing as a learning process. Suppose before fuzzing, we know nothing about the behaviors of a program P: What does it do? Executing the first test input, we learn how P behaves for this input. Executing the next input, we either observe the same or discover a new behavior. As such, each execution reveals ”some amount” of information about P’s behaviors. A classic measure of information is Shannon’s entropy. Measuring entropy allows us to quantify how much is learned from each generated test input about the behaviors of the program. Within a probabilistic model of fuzzing, we show how entropy also measures fuzzer efficiency. Specifically, it measures the general rate at which the fuzzer discovers new behaviors. Intuitively, efficient fuzzers maximize information.

From this information theoretic perspective, we develop Entropic, an entropy-based power schedule for greybox fuzzing which assigns more energy to seeds that maximize information. We implemented Entropic into the popular greybox fuzzer LibFuzzer. Our experiments with more than 250 open-source programs (60 million LoC) demonstrate a substantially improved efficiency and confirm our hypothesis that an efficient fuzzer maximizes information. Entropic has been independently evaluated and invited for integration into main-line LibFuzzer. Entropic now runs on more than 25,000 machines fuzzing hundreds of security-critical software systems simultaneously and continuously.

CrFuzz: fuzzing multi-purpose programs through input validation

Fuzz testing has been proved its effectiveness in discovering software vulnerabilities. Empowered its randomness nature along with a coverage-guiding feature, fuzzing has been identified a vast number of vulnerabilities in real-world programs. This paper begins with an observation that the design of the current state-of-the-art fuzzers is not well suited for a particular (but yet important) set of software programs. Specifically, current fuzzers have limitations in fuzzing programs serving multiple purposes, where each purpose is controlled by extra options.

This paper proposes CrFuzz, which overcomes this limitation. CrFuzz designs a clustering analysis to automatically predict if a newly given input would be accepted or not by a target program. Exploiting this prediction capability, CrFuzz is designed to efficiently explore the programs with multiple purposes. We employed CrFuzz for three state-of-the-art fuzzers, AFL, QSYM, and MOpt, and CrFuzz-augmented versions have shown 19.3% and 5.68% better path and edge coverage on average. More importantly, during two weeks of long-running experiments, CrFuzz discovered 277 previously unknown vulnerabilities where 212 of those are already confirmed and fixed by the respected vendors. We would like to emphasize that many of these vulnerabilities were discoverd from FFMpeg, ImageMagick, and Graphicsmagick, all of which are targets of Google's OSS-Fuzz project and thus heavily fuzzed for last three years by far. Nevertheless, CrFuzz identified a remarkable number of vulnerabilities, demonstrating its effectiveness of vulnerability finding capability.

Detecting critical bugs in SMT solvers using blackbox mutational fuzzing

Formal methods use SMT solvers extensively for deciding formula satisfiability, for instance, in software verification, systematic test generation, and program synthesis. However, due to their complex implementations, solvers may contain critical bugs that lead to unsound results. Given the wide applicability of solvers in software reliability, relying on such unsound results may have detrimental consequences. In this paper, we present STORM, a novel blackbox mutational fuzzing technique for detecting critical bugs in SMT solvers. We run our fuzzer on seven mature solvers and find 29 previously unknown critical bugs. STORM is already being used in testing new features of popular solvers before deployment.

Fuzzing: on the exponential cost of vulnerability discovery

We present counterintuitive results for the scalability of fuzzing. Given the same non-deterministic fuzzer, finding the same bugs linearly faster requires linearly more machines. For instance, with twice the machines, we can find all known bugs in half the time. Yet, finding linearly more bugs in the same time requires exponentially more machines. For instance, for every new bug we want to find in 24 hours, we might need twice more machines. Similarly for coverage. With exponentially more machines, we can cover the same code exponentially faster, but uncovered code only linearly faster. In other words, re-discovering the same vulnerabilities is cheap but finding new vulnerabilities is expensive. This holds even under the simplifying assumption of no parallelization overhead.

We derive these observations from over four CPU years worth of fuzzing campaigns involving almost three hundred open source programs, two state-of-the-art greybox fuzzers, four measures of code coverage, and two measures of vulnerability discovery. We provide a probabilistic analysis and conduct simulation experiments to explain this phenomenon.

Intelligent REST API data fuzzing

The cloud runs on REST APIs. In this paper, we study how to intelligently generate data payloads embedded in REST API requests in order to find data-processing bugs in cloud services. We discuss how to leverage REST API specifications, which, by definition, contain data schemas for API request bodies. We then propose and evaluate a range of data fuzzing techniques, including structural schema fuzzing rules, various rule combinations, search heuristics, extracting data values from examples included in REST API specifications, and learning data values on-the-fly from previous service responses. After evaluating these techniques, we identify the top-performing combination and use this algorithm to fuzz several Microsoft Azure cloud services. During our experiments, we found 100s of “Internal Server Error” service crashes, which we triaged into 17 unique bugs and reported to Azure developers. All these bugs are reproducible, confirmed, and fixed or in the process of being fixed.

MTFuzz: fuzzing with a multi-task neural network

Fuzzing is a widely used technique for detecting software bugs and vulnerabilities. Most popular fuzzers generate new inputs using an evolutionary search to maximize code coverage. Essentially, these fuzzers start with a set of seed inputs, mutate them to generate new inputs, and identify the promising inputs using an evolutionary fitness function for further mutation.Despite their success, evolutionary fuzzers tend to get stuck in long sequences of unproductive mutations. In recent years, machine learning (ML) based mutation strategies have reported promising results. However, the existing ML-based fuzzers are limited by the lack of quality and diversity of the training data. As the input space of the target programs is high dimensional and sparse, it is prohibitively expensive to collect many diverse samples demonstrating successful and unsuccessful mutations to train the model.In this paper, we address these issues by using a Multi-Task Neural Network that can learn a compact embedding of the input space based on diverse training samples for multiple related tasks (i.e.,predicting for different types of coverage). The compact embedding can guide the mutation process by focusing most of the mutations on the parts of the embedding where the gradient is high. MTFuzz uncovers 11 previously unseen bugs and achieves an average of 2× more edge coverage compared with 5 state-of-the-art fuzzer on 10 real-world programs

SESSION: Machine Learning

A comprehensive study on challenges in deploying deep learning based software

Deep learning (DL) becomes increasingly pervasive, being used in a wide range of software applications. These software applications, named as DL based software (in short as DL software), integrate DL models trained using a large data corpus with DL programs written based on DL frameworks such as TensorFlow and Keras. A DL program encodes the network structure of a desirable DL model and the process by which the model is trained using the training data. To help developers of DL software meet the new challenges posed by DL, enormous research efforts in software engineering have been devoted. Existing studies focus on the development of DL software and extensively analyze faults in DL programs. However, the deployment of DL software has not been comprehensively studied. To fill this knowledge gap, this paper presents a comprehensive study on understanding challenges in deploying DL software. We mine and analyze 3,023 relevant posts from Stack Overflow, a popular Q&A website for developers, and show the increasing popularity and high difficulty of DL software deployment among developers. We build a taxonomy of specific challenges encountered by developers in the process of DL software deployment through manual inspection of 769 sampled posts and report a series of actionable implications for researchers, developers, and DL framework vendors.

AMS: generating AutoML search spaces from weak specifications

We consider a usage model for automated machine learning (AutoML) in which users can influence the generated pipeline by providing a weak pipeline specification: an unordered set of API components from which the AutoML system draws the components it places into the generated pipeline. Such specifications allow users to express preferences over the components that appear in the pipeline, for example a desire for interpretable components to appear in the pipeline. We present AMS, an approach to automatically strengthen weak specifications to include unspecified complementary and functionally related API components, populate the space of hyperparameters and their values, and pair this configuration with a search procedure to produce a strong pipeline specification: a full description of the search space for candidate pipelines. ams uses normalized pointwise mutual information on a code corpus to identify complementary components, BM25 as a lexical similarity score over the target API's documentation to identify functionally related components, and frequency distributions in the code corpus to extract key hyperparameters and values. We show that strengthened specifications can produce pipelines that outperform the pipelines generated from the initial weak specification and an expert-annotated variant, while producing pipelines that still reflect the user preferences captured in the original weak specification.

Correlations between deep neural network model coverage criteria and model quality

Inspired by the great success of using code coverage as guidance in software testing, a lot of neural network coverage criteria have been proposed to guide testing of neural network models (e.g., model accuracy under adversarial attacks). However, while the monotonic relation between code coverage and software quality has been supported by many seminal studies in software engineering, it remains largely unclear whether similar monotonicity exists between neural network model coverage and model quality. This paper sets out to answer this question. Specifically, this paper studies the correlation between DNN model quality and coverage criteria, effects of coverage guided adversarial example generation compared with gradient decent based methods, effectiveness of coverage based retraining compared with existing adversarial training, and the internal relationships among coverage criteria.

Deep learning library testing via effective model generation

Deep learning (DL) techniques are rapidly developed and have been widely adopted in practice. However, similar to traditional software systems, DL systems also contain bugs, which could cause serious impacts especially in safety-critical domains. Recently, many research approaches have focused on testing DL models, while little attention has been paid for testing DL libraries, which is the basis of building DL models and directly affects the behavior of DL systems. In this work, we propose a novel approach, LEMON, to testing DL libraries. In particular, we (1) design a series of mutation rules for DL models, with the purpose of exploring different invoking sequences of library code and hard-to-trigger behaviors; and (2) propose a heuristic strategy to guide the model generation process towards the direction of amplifying the inconsistent degrees of the inconsistencies between different DL libraries caused by bugs, so as to mitigate the impact of potential noise introduced by uncertain factors in DL libraries. We conducted an empirical study to evaluate the effectiveness of LEMON with 20 release versions of 4 widely-used DL libraries, i.e., TensorFlow, Theano, CNTK, MXNet. The results demonstrate that LEMON detected 24 new bugs in the latest release versions of these libraries, where 7 bugs have been confirmed and one bug has been fixed by developers. Besides, the results confirm that the heuristic strategy for model generation indeed effectively guides LEMON in amplifying the inconsistent degrees for bugs.

DeepSearch: a simple and effective blackbox attack for deep neural networks

Although deep neural networks have been very successful in image-classification tasks, they are prone to adversarial attacks. To generate adversarial inputs, there has emerged a wide variety of techniques, such as black- and whitebox attacks for neural networks. In this paper, we present DeepSearch, a novel fuzzing-based, query-efficient, blackbox attack for image classifiers. Despite its simplicity, DeepSearch is shown to be more effective in finding adversarial inputs than state-of-the-art blackbox approaches. DeepSearch is additionally able to generate the most subtle adversarial inputs in comparison to these approaches.

DENAS: automated rule generation by knowledge extraction from neural networks

Deep neural networks (DNNs) have been widely applied in the software development process to automatically learn patterns from massive data. However, many applications still make decisions based on rules that are manually crafted and verified by domain experts due to safety or security concerns. In this paper, we aim to close the gap between DNNs and rule-based systems by automating the rule generation process via extracting knowledge from well-trained DNNs. Existing techniques with similar purposes either rely on specific DNNs input instances or use inherently unstable random sampling of the input space. Therefore, these approaches either limit the exploration area to a local decision-space of the DNNs or fail to converge to a consistent set of rules. The resulting rules thus lack representativeness and stability.

In this paper, we address the two aforementioned shortcomings by discovering a global property of the DNNs and use it to remodel the DNNs decision-boundary. We name this property as the activation probability, and show that this property is stable. With this insight, we propose an approach named DENAS including a novel rule-generation algorithm. Our proposed algorithm approximates the non-linear decision boundary of DNNs by iteratively superimposing a linearized optimization function.

We evaluate the representativeness, stability, and accuracy of DENAS against five state-of-the-art techniques (LEMNA, Gradient, IG, DeepTaylor, and DTExtract) on three software engineering and security applications: Binary analysis, PDF malware detection, and Android malware detection. Our results show that DENAS can generate more representative rules consistently in a more stable manner over other approaches. We further offer case studies that demonstrate the applications of DENAS such as debugging faults in the DNNs and generating signatures that can detect zero-day malware.

Detecting numerical bugs in neural network architectures

Detecting bugs in deep learning software at the architecture level provides additional benefits that detecting bugs at the model level does not provide. This paper makes the first attempt to conduct static analysis for detecting numerical bugs at the architecture level. We propose a static analysis approach for detecting numerical bugs in neural architectures based on abstract interpretation. Our approach mainly comprises two kinds of abstraction techniques, i.e., one for tensors and one for numerical values. Moreover, to scale up while maintaining adequate detection precision, we propose two abstraction techniques: tensor partitioning and (elementwise) affine relation analysis to abstract tensors and numerical values, respectively. We realize the combination scheme of tensor partitioning and affine relation analysis (together with interval analysis) as DEBAR, and evaluate it on two datasets: neural architectures with known bugs (collected from existing studies) and real-world neural architectures. The evaluation results show that DEBAR outperforms other tensor and numerical abstraction techniques on accuracy without losing scalability. DEBAR successfully detects all known numerical bugs with no false positives within 1.7–2.3 seconds per architecture. On the real-world architectures, DEBAR reports 529 warnings within 2.6–135.4 seconds per architecture, where 299 warnings are true positives.

Dynamic slicing for deep neural networks

Program slicing has been widely applied in a variety of software engineering tasks. However, existing program slicing techniques only deal with traditional programs that are constructed with instructions and variables, rather than neural networks that are composed of neurons and synapses. In this paper, we introduce NNSlicer, the first approach for slicing deep neural networks based on data-flow analysis. Our method understands the reaction of each neuron to an input based on the difference between its behavior activated by the input and the average behavior over the whole dataset. Then we quantify the neuron contributions to the slicing criterion by recursively backtracking from the output neurons, and calculate the slice as the neurons and the synapses with larger contributions. We demonstrate the usefulness and effectiveness of NNSlicer with three applications, including adversarial input detection, model pruning, and selective model protection. In all applications, NNSlicer significantly outperforms other baselines that do not rely on data flow analysis.

Is neuron coverage a meaningful measure for testing deep neural networks?

Recent effort to test deep learning systems has produced an intuitive and compelling test criterion called neuron coverage (NC), which resembles the notion of traditional code coverage. NC measures the proportion of neurons activated in a neural network and it is implicitly assumed that increasing NC improves the quality of a test suite. In an attempt to automatically generate a test suite that increases NC, we design a novel diversity promoting regularizer that can be plugged into existing adversarial attack algorithms. We then assess whether such attempts to increase NC could generate a test suite that (1) detects adversarial attacks successfully, (2) produces natural inputs, and (3) is unbiased to particular class predictions. Contrary to expectation, our extensive evaluation finds that increasing NC actually makes it harder to generate an effective test suite: higher neuron coverage leads to fewer defects detected, less natural inputs, and more biased prediction preferences. Our results invoke skepticism that increasing neuron coverage may not be a meaningful objective for generating tests for deep neural networks and call for a new test generation technique that considers defect detection, naturalness, and output impartiality in tandem.

Machine translation testing via pathological invariance

Machine translation software has become heavily integrated into our daily lives due to the recent improvement in the performance of deep neural networks. However, machine translation software has been shown to regularly return erroneous translations, which can lead to harmful consequences such as economic loss and political conflicts. Additionally, due to the complexity of the underlying neural models, testing machine translation systems presents new challenges. To address this problem, we introduce a novel methodology called PatInv. The main intuition behind PatInv is that sentences with different meanings should not have the same translation. Under this general idea, we provide two realizations of PatInv that given an arbitrary sentence, generate syntactically similar but semantically different sentences by: (1) replacing one word in the sentence using a masked language model or (2) removing one word or phrase from the sentence based on its constituency structure. We then test whether the returned translations are the same for the original and modified sentences. We have applied PatInv to test Google Translate and Bing Microsoft Translator using 200 English sentences. Two language settings are considered: English-Hindi (En-Hi) and English-Chinese (En-Zh). The results show that PatInv can accurately find 308 erroneous translations in Google Translate and 223 erroneous translations in Bing Microsoft Translator, most of which cannot be found by the state-of-the-art approaches.

Model-based exploration of the frontier of behaviours for deep learning system testing

With the increasing adoption of Deep Learning (DL) for critical tasks, such as autonomous driving, the evaluation of the quality of systems that rely on DL has become crucial. Once trained, DL systems produce an output for any arbitrary numeric vector provided as input, regardless of whether it is within or outside the validity domain of the system under test. Hence, the quality of such systems is determined by the intersection between their validity domain and the regions where their outputs exhibit a misbehaviour.

In this paper, we introduce the notion of frontier of behaviours, i.e., the inputs at which the DL system starts to misbehave. If the frontier of misbehaviours is outside the validity domain of the system, the quality check is passed. Otherwise, the inputs at the intersection represent quality deficiencies of the system. We developed DeepJanus, a search-based tool that generates frontier inputs for DL systems. The experimental results obtained for the lane keeping component of a self-driving car show that the frontier of a well trained system contains almost exclusively unrealistic roads that violate the best practices of civil engineering, while the frontier of a poorly trained one includes many valid inputs that point to serious deficiencies of the system.

On decomposing a deep neural network into modules

Deep learning is being incorporated in many modern software systems. Deep learning approaches train a deep neural network (DNN) model using training examples, and then use the DNN model for prediction. While the structure of a DNN model as layers is observable, the model is treated in its entirety as a monolithic component. To change the logic implemented by the model, e.g. to add/remove logic that recognizes inputs belonging to a certain class, or to replace the logic with an alternative, the training examples need to be changed and the DNN needs to be retrained using the new set of examples. We argue that decomposing a DNN into DNN modules— akin to decomposing a monolithic software code into modules—can bring the benefits of modularity to deep learning. In this work, we develop a methodology for decomposing DNNs for multi-class problems into DNN modules. For four canonical problems, namely MNIST, EMNIST, FMNIST, and KMNIST, we demonstrate that such decomposition enables reuse of DNN modules to create different DNNs, enables replacement of one DNN module in a DNN with another without needing to retrain. The DNN models formed by composing DNN modules are at least as good as traditional monolithic DNNs in terms of test accuracy for our problems.

Operational calibration: debugging confidence errors for DNNs in the field

Trained DNN models are increasingly adopted as integral parts of software systems, but they often perform deficiently in the field. A particularly damaging problem is that DNN models often give false predictions with high confidence, due to the unavoidable slight divergences between operation data and training data. To minimize the loss caused by inaccurate confidence, operational calibration, i.e., calibrating the confidence function of a DNN classifier against its operation domain, becomes a necessary debugging step in the engineering of the whole system.

Operational calibration is difficult considering the limited budget of labeling operation data and the weak interpretability of DNN models. We propose a Bayesian approach to operational calibration that gradually corrects the confidence given by the model under calibration with a small number of labeled operation data deliberately selected from a larger set of unlabeled operation data. The approach is made effective and efficient by leveraging the locality of the learned representation of the DNN model and modeling the calibration as Gaussian Process Regression. Comprehensive experiments with various practical datasets and DNN models show that it significantly outperformed alternative methods, and in some difficult tasks it eliminated about 71% to 97% high-confidence (>0.9) errors with only about 10% of the minimal amount of labeled operation data needed for practical learning techniques to barely work


All your app links are belong to us: understanding the threats of instant apps based attacks

Android deep link is a URL that takes users to a specific page of a mobile app, enabling seamless user experience from a webpage to an app. Android app link, a new type of deep link introduced in Android 6.0, is claimed to offer more benefits, such as supporting instant apps and providing more secure verification to protect against hijacking attacks that previous deep links can not. However, we find that the app link is not as secure as claimed, because the verification process can be bypassed by exploiting instant apps.

In this paper, we explore the weakness of the existing app link mechanism and propose three feasible hijacking attacks. Our findings show that even popular apps are subject to these attacks, such as Twitter, Whatsapp, Facebook Message. Our observation is confirmed by Google. To measure the severity of these vulnerabilities, we develop an automatic tool to detect vulnerable apps, and perform a large-scale empirical study on 400,000 Android apps.

Experiment results suggest that app link hijacking vulnerabilities are prevalent in the ecosystem. Specifically, 27.1% apps are vulnerable to link hijacking with smart text selection (STS); 30.0% apps are vulnerable to link hijacking without STS, and all instant apps are vulnerable to instant app attack. We provide an in-depth understanding of the mechanisms behind these types of attacks. Furthermore, we propose the corresponding detection and defense methods that can successfully prevent the proposed hijackings for all the evaluated apps, thus raising the bar against the attacks on Android app links. Our insights and findings demonstrate the urgency to identify and prevent app link hijacking attacks.

Automated construction of energy test oracles for Android

Energy efficiency is an increasingly important quality attribute for software, particularly for mobile apps. Just like any other software attribute, energy behavior of mobile apps should be properly tested prior to their release. However, mobile apps are riddled with energy defects, as currently there is a lack of proper energy testing tools. Indeed, energy testing is a fledgling area of research and recent advances have mainly focused on test input generation. This paper presents ACETON, the first approach aimed at solving the oracle problem for testing the energy behavior of mobile apps. ACETON employs Deep Learning to automatically construct an oracle that not only determines whether a test execution reveals an energy defect, but also the type of energy defect. By carefully selecting features that can be monitored on any app and mobile device, we are assured the oracle constructed using ACETON is highly reusable. Our experiments show that the oracle produced by ACETON is both highly accurate, achieving an overall precision and recall of 99%, and efficient, detecting the existence of energy defects in only 37 milliseconds on average.

Borrowing your enemy’s arrows: the case of code reuse in Android via direct inter-app code invocation

The Android ecosystem offers different facilities to enable communication among app components and across apps to ensure that rich services can be composed through functionality reuse. At the heart of this system is the Inter-component communication (ICC) scheme, which has been largely studied in the literature. Less known in the community is another powerful mechanism that allows for direct inter-app code invocation which opens up for different reuse scenarios, both legitimate or malicious. This paper exposes the general workflow for this mechanism, which beyond ICCs, enables app developers to access and invoke functionalities (either entire Java classes, methods or object fields) implemented in other apps using official Android APIs. We experimentally showcase how this reuse mechanism can be leveraged to “plagiarize" supposedly-protected functionalities. Typically, we were able to leverage this mechanism to bypass security guards that a popular video broadcaster has placed for preventing access to its video database from outside its provided app. We further contribute with a static analysis toolkit, named DICIDer, for detecting direct inter-app code invocations in apps. An empirical analysis of the usage prevalence of this reuse mechanism is then conducted. Finally, we discuss the usage contexts as well as the implications of this studied reuse mechanism.

Static asynchronous component misuse detection for Android applications

Facing the limited resource of smartphones, asynchronous programming significantly improves the performance of Android applications. Android provides several packaged components to ease the development of asynchronous programming. Among them, the AsyncTask component is widely used by developers since it is easy to implement. However, the abuse of AsyncTask component can decrease responsiveness and even lead to crashes. By investigating the Android Developer Documentation and technical forums, we summarize five misuse patterns about AsyncTask. To detect them, we propose a flow, context, object and field-sensitive inter-procedural static analysis approach. Specifically, the static analysis includes typestate analysis, reference analysis and loop analysis. Based on the AsyncTask-related information obtained during static analysis, we check the misuse according to predefined detection rules. The proposed approach is implemented into a tool called AsyncChecker. We evaluate AsyncChecker on a self-designed benchmark suite called AsyncBench and 1,759 real-world apps. AsyncChecker finds 17,946 misused AsyncTask instances in 1,417 real-world apps (80.6%). The precision, recall and F-measure of AsyncChecker on real-world applications are 97.2%, 89.8% and 0.93, respectively. Compared with existing tools, AsyncChecker can detect more asynchronous problems. We report the misuse problems to developers via GitHub. Several developers have confirmed and fixed the problems found by AsyncChecker. The result implies that our approach is effective and developers do take the misuse of AsyncTask as a serious problem.

SESSION: Performance / QoS

Automatically identifying performance issue reports with heuristic linguistic patterns

Performance issues compromise the response time and resource consumption of a software system. Modern software systems use issue tracking systems to manage all kinds of issue reports, including performance issues. The problem is that performance issues are often not explicitly tagged. The tagging mechanism, if exists, is completely voluntary, depending on the project’s convention and on submitters’ discipline. For example, the performance tag rate in Apache’s Jira system is below 1%. This paper contributes a hybrid classification approach that combines linguistic patterns and machine/deep learning techniques to automatically detect performance issue reports. We manually analyzed 980 real-life performance issue reports and derived 80 project-agnostic linguistic patterns that recur in the reports. Our approach uses these linguistic patterns to construct the sentence-level and issue-level learning features for training effective machine/deep learning classifiers. We test our approach on two separate datasets, each consisting of 980 unclassified issue reports, and compare the results with 31 baseline methods. Our approach can reach up to 83% precision and up to 59% recall. The only comparable baseline method is BERT, which is still 25% lower in the F1-score.

Calm energy accounting for multithreaded Java applications

Energy accounting is a fundamental problem in energy management, defined as attributing global energy consumption to individual components of interest. In this paper, we take on this problem at the application level, where the components for accounting are application logical units, such as methods, classes, and packages. Given a Java application, our novel runtime system Chappie produces an energy footprint, i.e., the relative energy consumption of all programming abstraction units within the application.

The design of Chappie is unique in several dimensions. First, relative to targeted energy profiling where the profiler determines the energy consumption of a pre-defined application logical unit, e.g., a specific method, Chappie is total: the energy footprint encompasses all methods within an application. Second, Chappie is concurrency-aware: energy attribution is fully aware of the multi-threaded behavior of Java applications, including JVM bookkeeping threads. Third, Chappie is an embodiment of a novel philosophy for application-level energy accounting and profiling, which states that the accounting run should preserve the temporal phased power behavior of the application, and the spatial power distribution among the underlying hardware system. We term this important property as calmness. Against state-of-the-art DaCapo benchmarks, we show that the energy footprint generated by Chappie is precise while incurring negligible overhead. In addition, all results are produced with a high degree of calmness.

Dynamically reconfiguring software microbenchmarks: reducing execution time without sacrificing result quality

Executing software microbenchmarks, a form of small-scale performance tests predominantly used for libraries and frameworks, is a costly endeavor. Full benchmark suites take up to multiple hours or days to execute, rendering frequent checks, e.g., as part of continuous integration (CI), infeasible. However, altering benchmark configurations to reduce execution time without considering the impact on result quality can lead to benchmark results that are not representative of the software’s true performance.

We propose the first technique to dynamically stop software microbenchmark executions when their results are sufficiently stable. Our approach implements three statistical stoppage criteria and is capable of reducing Java Microbenchmark Harness (JMH) suite execution times by 48.4% to 86.0%. At the same time it retains the same result quality for 78.8% to 87.6% of the benchmarks, compared to executing the suite for the default duration.

The proposed approach does not require developers to manually craft custom benchmark configurations; instead, it provides automated mechanisms for dynamic reconfiguration. Hence, making dynamic reconfiguration highly effective and efficient, potentially paving the way to inclusion of JMH microbenchmarks in CI.

Testing self-adaptive software with probabilistic guarantees on performance metrics

This paper discusses the problem of testing the performance of the adaptation layer in a self-adaptive system. The problem is notoriously hard, due to the high degree of uncertainty and variability inherent in an adaptive software application. In particular, providing any type of formal guarantee for this problem is extremely difficult. In this paper we propose the use of a rigorous probabilistic approach to overcome the mentioned difficulties and provide probabilistic guarantees on the software performance. We describe the set up needed for the application of a probabilistic approach. We then discuss the traditional tools from statistics that could be applied to analyse the results, highlighting their limitations and motivating why they are unsuitable for the given problem. We propose the use of a novel tool – the scenario theory – to overcome said limitations. We conclude the paper with a thorough empirical evaluation of the proposed approach, using two adaptive software applications: the Tele-Assistance Service and the Self-Adaptive Video Encoder. With the first, we empirically expose the trade-off between data collection and confidence in the testing campaign. With the second, we demonstrate how to compare different adaptation strategies.

SESSION: Recommendation

API method recommendation via explicit matching of functionality verb phrases

Due to the lexical gap between functionality descriptions and user queries, documentation-based API retrieval often produces poor results.Verb phrases and their phrase patterns are essential in both describing API functionalities and interpreting user queries. Thus we hypothesize that API retrieval can be facilitated by explicitly recognizing and matching between the fine-grained structures of functionality descriptions and user queries. To verify this hypothesis, we conducted a large-scale empirical study on the functionality descriptions of 14,733 JDK and Android API methods. We identified 356 different functionality verbs from the descriptions, which were grouped into 87 functionality categories, and we extracted 523 phrase patterns from the verb phrases of the descriptions. Building on these findings, we propose an API method recommendation approach based on explicit matching of functionality verb phrases in functionality descriptions and user queries, called PreMA. Our evaluation shows that PreMA can accurately recognize the functionality categories (92.8%) and phrase patterns (90.4%) of functionality description sentences; and when used for API retrieval tasks, PreMA can help participants complete their tasks more accurately and with fewer retries compared to a baseline approach.

Code recommendation for exception handling

Exception handling is an effective mechanism to avoid unexpected runtime errors. However, novice programmers might fail to handle exceptions properly, causing serious errors like system crashing or resource leaking. In this paper, we introduce FuzzyCatch, a code recommendation tool for handling exceptions. Based on fuzzy logic, FuzzyCatch can predict if a runtime exception would occur in a given code snippet and recommend code to handle that exception. FuzzyCatch is implemented as a plugin for Android Studio. The empirical evaluation suggests that FuzzyCatch is highly effective. For example, it has top-1 accuracy of 77% on recommending what exception to catch in a try catch block and of 70% on recommending what method should be called when such an exception occurs. FuzzyCatch also achieves a high level of accuracy and outperforms baselines significantly on detecting and fixing real exception bugs.

eQual: informing early design decisions

When designing a software system, architects make a series of design decisions that directly impact the system's quality. The number of available design alternatives grows rapidly with system size, creating an enormous space of intertwined design concerns that renders manual exploration impractical. We present eQual, a model-driven technique for simulation-based assessment of architectural designs. While it is not possible to guarantee optimal decisions so early in the design process, eQual improves decision quality. eQual is effective in practice because it (1) limits the amount of information the architects have to provide and (2) adapts optimization algorithms to effectively explore massive spaces of design alternatives. We empirically demonstrate that eQual yields designs whose quality is comparable to a set of systems' known optimal designs. A user study shows that, compared to the state-of-the-art, engineers using eQual produce statistically significantly higher-quality designs with a large effect size, are statistically significantly more confident in their designs, and find eQual easier to use.

Recommending stack overflow posts for fixing runtime exceptions using failure scenario matching

Using online Q&A forums, such as Stack Overflow (SO), for guidance to resolve program bugs, among other development issues, is commonplace in modern software development practice. Runtime exceptions (RE) is one such important class of bugs that is actively discussed on SO. In this work we present a technique and prototype tool called MAESTRO that can automatically recommend an SO post that is most relevant to a given Java RE in a developer's code. MAESTRO compares the exception-generating program scenario in the developer's code with that discussed in an SO post and returns the post with the closest match. To extract and compare the exception scenario effectively, MAESTRO first uses the answer code snippets in a post to implicate a subset of lines in the post's question code snippet as responsible for the exception and then compares these lines with the developer's code in terms of their respective Abstract Program Graph (APG) representations. The APG is a simplified and abstracted derivative of an abstract syntax tree, proposed in this work, that allows an effective comparison of the functionality embodied in the high-level program structure, while discarding many of the low-level syntactic or semantic differences. We evaluate MAESTRO on a benchmark of 78 instances of Java REs extracted from the top 500 Java projects on GitHub and show that MAESTRO can return either a highly relevant or somewhat relevant SO post corresponding to the exception instance in 71% of the cases, compared to relevant posts returned in only 8% - 44% instances, by four competitor tools based on state-of-the-art techniques. We also conduct a user experience study of MAESTRO with 10 Java developers, where the participants judge MAESTRO reporting a highly relevant or somewhat relevant post in 80% of the instances. In some cases the post is judged to be even better than the one manually found by the participant.

Understanding the impact of GitHub suggested changes on recommendations between developers

Recommendations between colleagues are effective for encouraging developers to adopt better practices. Research shows these peer interactions are useful for improving developer behaviors, or the adoption of activities to help software engineers complete programming tasks. However, in-person recommendations between developers in the workplace are declining. One form of online recommendations between developers are pull requests, which allow users to propose code changes and provide feedback on contributions. GitHub, a popular code hosting platform, recently introduced the suggested changes feature, which allows users to recommend improvements for pull requests. To better understand this feature and its impact on recommendations between developers, we report an empirical study of this system, measuring usage, effectiveness, and perception. Our results show that suggested changes support code review activities and significantly impact the timing and communication between developers on pull requests. This work provides insight into the suggested changes feature and implications for improving future systems for automated developer recommendations, such as providing situated, concise, and actionable feedback.

SESSION: Security

An evaluation of methods to port legacy code to SGX enclaves

The Intel Security Guard Extensions (SGX) architecture enables the abstraction of enclaved execution, using which an application can protect its code and data from powerful adversaries, including system software that executes with the highest processor privilege. While the Intel SGX architecture exports an ISA with low-level instructions that enable applications to create enclaves, the task of writing applications using this ISA has been left to the software community.

We consider the problem of porting legacy applications to SGX enclaves. In the approximately four years to date since the Intel SGX became commercially available, the community has developed three different models to port applications to enclaves---the library OS, the library wrapper, and the instruction wrapper models.

In this paper, we conduct an empirical evaluation of the merits and costs of each model. We report on our attempt to port a handful of real-world application benchmarks (including OpenSSL, Memcached, a Web server and a Python interpreter) to SGX enclaves using prototypes that embody each of the above models. Our evaluation focuses on the merits and costs of each of these models from the perspective of the effort required to port code under each of these models, the effort to re-engineer an application to work with enclaves, the security offered by each model, and the runtime performance of the applications under these models.

Search-based adversarial testing and improvement of constrained credit scoring systems

Credit scoring systems are critical FinTech applications that concern the analysis of the creditworthiness of a person or organization. While decisions were previously based on human expertise, they are now increasingly relying on data analysis and machine learning. In this paper, we assess the ability of state-of-the-art adversarial machine learning to craft attacks on a real-world credit scoring system. Interestingly, we find that, while these techniques can generate large numbers of adversarial data, these are practically useless as they all violate domain-specific constraints. In other words, the generated examples are all false positives as they cannot occur in practice. To circumvent this limitation, we propose CoEvA2, a search-based method that generates valid adversarial examples (satisfying the domain constraints). CoEvA2 utilizes multi-objective search in order to simultaneously handle constraints, perform the attack and maximize the overdraft amount requested. We evaluate CoEvA2 on a major bank's real-world system by checking its ability to craft valid attacks. CoEvA2 generates thousands of valid adversarial examples, revealing a high risk for the banking system. Fortunately, by improving the system through adversarial training (based on the produced examples), we increase its robustness and make our attack fail.

SinkFinder: harvesting hundreds of unknown interesting function pairs with just one seed

Mastering the knowledge about security-sensitive functions that can potentially result in bugs is valuable to detect them. However, identifying this kind of functions is not a trivial task. Introducing machine learning-based techniques to do the task is a natural choice. Unfortunately, the approach also requires considerable prior knowledge, e.g., sufficient labelled training samples. In practice, the requirement is often hard to meet.

In this paper, to solve the problem, we propose a novel and practical method called SinkFinder to automatically discover function pairs that we are interested in, which only requires very limited prior knowledge. SinkFinder first takes just one pair of well-known interesting functions as the initial seed to infer enough positive and negative training samples by means of sub-word word embedding. By using these samples, a support vector machine classifier is trained to identify more interesting function pairs. Finally, checkers equipped with the obtained knowledge can be easily developed to detect bugs in target systems. The experiments demonstrate that SinkFinder can successfully discover hundreds of interesting functions and detect dozens of previously unknown bugs from large-scale systems, such as Linux, OpenSSL and PostgreSQL.

SESSION: Testing

Baital: an adaptive weighted sampling approach for improved t-wise coverage

The rise of highly configurable complex software and its widespread usage requires design of efficient testing methodology. t-wise coverage is a leading metric to measure the quality of the testing suite and the underlying test generation engine. While uniform sampling-based test generation is widely believed to be the state of the art approach to achieve t-wise coverage in presence of constraints on the set of configurations, such a scheme often fails to achieve high t-wise coverage in presence of complex constraints. In this work, we propose a novel approach Baital, based on adaptive weighted sampling using literal weighted functions, to generate test sets with high t-wise coverage. We demonstrate that our approach reaches significantly higher t-wise coverage than uniform sampling. The novel usage of literal weighted sampling leaves open several interesting directions, empirical as well as theoretical, for future research.

Cost measures matter for mutation testing study validity

Mutation testing research has often used the number of mutants as a surrogate measure for the true execution cost of generating and executing mutants. This poses a potential threat to the validity of the scientific findings reported in the literature. Out of 75 works surveyed in this paper, we found that 54 (72%) are vulnerable to this threat. To investigate the magnitude of the threat, we conducted an empirical evaluation using 10 real-world programs. The results reveal that: i) percentages of randomly sampled mutants differ from the true execution time, on average, by 44%, varying in difference from 19% to 91%; ii) errors arising from using the surrogate correlate with program size (ρ = 0.74) and number of mutants (ρ = 0.76), making the problem more pernicious for more realistic programs; iii) scientific findings concerning sampling strategies would have approximately 37% rank disagreement, indicating potentially dramatic impact on experiment validity. To investigate whether this threat matters in practice, we reproduced a seminal study on Selective Mutation (widely relied upon for more than two decades). The impact is stark: an inconclusive scientific finding using the surrogate is transformed to an unequivocal finding when using the true execution cost.

Detecting optimization bugs in database engines via non-optimizing reference engine construction

Database Management Systems (DBMS) are used ubiquitously. To efficiently access data, they apply sophisticated optimizations. Incorrect optimizations can result in logic bugs, which cause a query to compute an incorrect result set. We propose Non-Optimizing Reference Engine Construction (NoREC), a fully-automatic approach to detect optimization bugs in DBMS. Conceptually, this approach aims to evaluate a query by an optimizing and a non-optimizing version of a DBMS, to then detect differences in their returned result set, which would indicate a bug in the DBMS. Obtaining a non-optimizing version of a DBMS is challenging, because DBMS typically provide limited control over optimizations. Our core insight is that a given, potentially randomly-generated optimized query can be rewritten to one that the DBMS cannot optimize. Evaluating this unoptimized query effectively corresponds to a non-optimizing reference engine executing the original query. We evaluated NoREC in an extensive testing campaign on four widely-used DBMS, namely PostgreSQL, MariaDB, SQLite, and CockroachDB. We found 159 previously unknown bugs in the latest versions of these systems, 141 of which have been fixed by the developers. Of these, 51 were optimization bugs, while the remaining were error and crash bugs. Our results suggest that NoREC is effective, general and requires little implementation effort, which makes the technique widely applicable in practice.

Efficient binary-level coverage analysis

Code coverage analysis plays an important role in the software testing process. More recently, the remarkable effectiveness of coverage feedback has triggered a broad interest in feedback-guided fuzzing. In this work, we introduce bcov, a tool for binary-level coverage analysis. Our tool statically instruments x86-64 binaries in the ELF format without compiler support. We implement several techniques to improve efficiency and scale to large real-world software. First, we bring Agrawal’s probe pruning technique to binary-level instrumentation and effectively leverage its superblocks to reduce overhead. Second, we introduce sliced microexecution, a robust technique for jump table analysis which improves CFG precision and enables us to instrument jump table entries. Additionally, smaller instructions in x86-64 pose a challenge for inserting detours. To address this challenge, we aggressively exploit padding bytes and systematically host detours in neighboring basic blocks.

We evaluate bcov on a corpus of 95 binaries compiled from eight popular and well-tested packages like FFmpeg and LLVM. Two instrumentation policies, with different edge-level precision, are used to patch all functions in this corpus - over 1.6 million functions. Our precise policy has average performance and memory overheads of 14% and 22% respectively. Instrumented binaries do not introduce any test regressions. The reported coverage is highly accurate with an average F-score of 99.86%. Finally, our jump table analysis is comparable to that of IDA Pro on gcc binaries and outperforms it on clang binaries.

Efficiently finding higher-order mutants

Higher-order mutation has the potential for improving major drawbacks of traditional first-order mutation, such as by simulating more realistic faults or improving test-optimization techniques. Despite interest in studying promising higher-order mutants, such mutants are difficult to find due to the exponential search space of mutation combinations. State-of-the-art approaches rely on genetic search, which is often incomplete and expensive due to its stochastic nature. First, we propose a novel way of finding a complete set of higher-order mutants by using variational execution, a technique that can, in many cases, explore large search spaces completely and often efficiently. Second, we use the identified complete set of higher-order mutants to study their characteristics. Finally, we use the identified characteristics to design and evaluate a new search strategy, independent of variational execution, that is highly effective at finding higher-order mutants even in large codebases.

Evolutionary improvement of assertion oracles

Assertion oracles are executable boolean expressions placed inside the program that should pass (return true) for all correct executions and fail (return false) for all incorrect executions. Because designing perfect assertion oracles is difficult, assertions often fail to distinguish between correct and incorrect executions. In other words, they are prone to false positives and false negatives. In this paper, we propose GAssert (Genetic ASSERTion improvement), the first technique to automatically improve assertion oracles. Given an assertion oracle and evidence of false positives and false negatives, GAssert implements a novel co-evolutionary algorithm that explores the space of possible assertions to identify one with fewer false positives and false negatives. Our empirical evaluation on 34 Java methods from 7 different Java code bases shows that GAssert effectively improves assertion oracles. GAssert outperforms two baselines (random and invariant-based oracle improvement), and is comparable with and in some cases even outperformed human-improved assertions.

FrUITeR: a framework for evaluating UI test reuse

UI testing is tedious and time-consuming due to the manual effort required. Recent research has explored opportunities for reusing existing UI tests from an app to automatically generate new tests for other apps. However, the evaluation of such techniques currently remains manual, unscalable, and unreproducible, which can waste effort and impede progress in this emerging area. We introduce FrUITeR, a framework that automatically evaluates UI test reuse in a reproducible way. We apply FrUITeR to existing test-reuse techniques on a uniform benchmark we established, resulting in 11,917 test reuse cases from 20 apps. We report several key findings aimed at improving UI test reuse that are missed by existing work.

Object detection for graphical user interface: old fashioned or deep learning or a combination?

Detecting Graphical User Interface (GUI) elements in GUI images is a domain-specific object detection task. It supports many software engineering tasks, such as GUI animation and testing, GUI search and code generation. Existing studies for GUI element detection directly borrow the mature methods from computer vision (CV) domain, including old fashioned ones that rely on traditional image processing features (e.g., canny edge, contours), and deep learning models that learn to detect from large-scale GUI data. Unfortunately, these CV methods are not originally designed with the awareness of the unique characteristics of GUIs and GUI elements and the high localization accuracy of the GUI element detection task. We conduct the first large-scale empirical study of seven representative GUI element detection methods on over 50k GUI images to understand the capabilities, limitations and effective designs of these methods. This study not only sheds the light on the technical challenges to be addressed but also informs the design of new GUI element detection methods. We accordingly design a new GUI-specific old-fashioned method for non-text GUI element detection which adopts a novel top-down coarse-to-fine strategy, and incorporate it with the mature deep learning model for GUI text detection.Our evaluation on 25,000 GUI images shows that our method significantly advances the start-of-the-art performance in GUI element detection.

Understanding and automatically detecting conflicting interactions between smart home IoT applications

Smart home devices provide the convenience of remotely control-ling and automating home appliances. The most advanced smart home environments allow developers to write apps to make smart home devices work together to accomplish tasks, e.g., home security and energy conservation. A smart home app typically implements narrow functionality and thus to fully implement desired functionality homeowners may need to install multiple apps. These different apps can conflict with each other and these conflicts can result in undesired actions such as locking the door during a fire.

In this paper, we study conflicts between apps on Samsung SmartThings, the most popular platform for developing and deploying smart home IoT devices. By collecting and studying 198 official and 69 third-party apps, we found significant app conflicts in 3 categories: (1) close to 60% of app pairs that access the same device, (2) more than 90% of app pairs with physical interactions, and (3) around 11% of app pairs that access the same global variable. Our results suggest that the problem of conflicts between smart home apps is serious and can create potential safety risks. We then developed a conflict detection tool that uses model checking to automatically detect up to 96% of the conflicts.

When does my program do this? learning circumstances of software behavior

A program fails. Under which circumstances does the failure occur? Our Alhazenapproach starts with a run that exhibits a particular behavior and automatically determines input features associated with the behavior in question: (1) We use a grammar to parse the input into individual elements. (2) We use a decision tree learner to observe and learn which input elements are associated with the behavior in question. (3) We use the grammar to generate additional inputs to further strengthen or refute hypotheses as learned associations. (4) By repeating steps 2 and 3, we obtain a theory that explains and predicts the given behavior. In our evaluation using inputs for find, grep, NetHack, and a JavaScript transpiler, the theories produced by Alhazen predict and produce failures with high accuracy and allow developers to focus on a small set of input features: “grep fails whenever the --fixed-strings option is used in conjunction with an empty search string.”

SESSION: Industry Papers

A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo

Autonomous Driving System (ADS) is one of the most promising and valuable large-scale machine learning (ML) powered systems. Hence, ADS has attracted much attention from academia and practitioners in recent years. Despite extensive study on ML models, it still lacks a comprehensive empirical study towards understanding the ML model roles, peculiar architecture, and complexity of ADS (i.e., various ML models and their relationship with non-trivial code logic). In this paper, we conduct an in-depth case study on Apollo, which is one of the state-of-the-art ADS, widely adopted by major automakers worldwide. We took the first step to reveal the integration of the underlying ML models and code logic in Apollo. In particular, we study the Apollo source code and present the underlying ML model system architecture. We present our findings on how the ML models interact with each other, and how the ML models are integrated with code logic to form a complex system. Finally, we inspect Apollo in a dynamic view and notice the heavy use of model-relevant components and the lack of adequate tests in general. Our study reveals potential maintenance challenges of complex ML-powered systems and identifies future directions to improve the quality assurance of ADS and general ML systems.

Adapting bug prediction models to predict reverted commits at Wayfair

Researchers have proposed many algorithms to predict software bugs. Given a software entity (e.g., a file or method), these algorithms predict whether the entity is bug-prone. However, since these algorithms cannot identify specific bugs, this does not tend to be particularly useful in practice. In this work, we adapt this prior work to the related problem of predicting whether a commit is likely to be reverted. Given the batch nature of continuous integration deployment at scale, this allows developers to find time-sensitive bugs in production more quickly. The models in this paper are based on features extracted from the revision history of a codebase that are typically used in bug prediction. Our experiments, performed on the three main repositories for the Wayfair website, show that our models can rank reverted commits above 80% of non-reverted commits on average. Moreover, when given to Wayfair developers, our models reduce the amount of time needed to find certain kinds of bugs by 55%. Wayfair continues to use our findings and models today to help find bugs during software deployments.

Can microtask programming work in industry?

A critical issue in software development projects in IT service companies is finding the right people at the right time. By enabling assignments of tasks to people to be more fluid, the use of crowdsourcing approaches within a company offers a potential solution to this challenge. Inside a company, as multiple system development projects are ongoing separately, developers with slack time on one project might use this time to contribute to other projects. In this paper, we report on a case study of the application of crowdsourcing within an industrial web application system development project in a large telecommunications company. Developers worked with system specifications which were organized into a set of microtasks, offering a set of short and self-contained descriptions. When crowd workers in other projects had slack time, they fetched and completed microtasks. Our results offer initial evidence for the potential value of microtask programming in increasing the fluidity of team assignments within a company. Crowd contributors to the project were able to onboard and contribute to a new project in less than 2 hours. After onboarding, the crowd workers were together able to successfully implement a small program which contained only a small number of defects. Interview and survey data gathered from project participants revealed that crowd workers reported that they perceived onboarding costs to be reduced and did not experience issues with the reduced face to face communication, but experienced challenges with motivation.

Change impact analysis in Simulink designs of embedded systems

This paper presents and evaluates the Boundary Diagram Tool for change impact analysis of large Simulink designs of embedded systems. In our previous work, we developed the Reach/Coreach Tool for model slicing within a single Simulink model. The current work extends the Reach/Coreach Tool to trace the impact of model changes through multiple models comprising an embedded system, including network interfaces. The change impact analysis results are represented using various diagrams motivated by industrial needs. Several techniques are used to improve understanding of impact analyses of large industrial systems. The tool has been integrated into the software development process of a large automotive OEM (Original Equipment Manufacturer) to support the following activities: change request analysis and evaluation, implementation, verification and integration. The tool also aids impact analyses required for compliance with functional safety standards. The tool’s effectiveness has been demonstrated on production-scale models.

Clustering test steps in natural language toward automating test automation

For large industrial applications, system test cases are still often described in natural language (NL), and their number can reach thousands. Test automation is to automatically execute the test cases. Achieving test automation typically requires substantial manual effort for creating executable test scripts from these NL test cases. In particular, given that each NL test case consists of a sequence of NL test steps, testers first implement a test API method for each test step and then write a test script for invoking these test API methods sequentially for test automation. Across different test cases, multiple test steps can share semantic similarities, supposedly mapped to the same API method. However, due to numerous test steps in various NL forms under manual inspection, testers may not realize those semantically similar test steps and thus waste effort to implement duplicate test API methods for them. To address this issue, in this paper, we propose a new approach based on natural language processing to cluster similar NL test steps together such that the test steps in each cluster can be mapped to the same test API method. Our approach includes domain-specific word embedding training along with measurement based on Relaxed Word Mover’sDistance to analyze the similarity of test steps. Our approach also includes a technique to combine hierarchical agglomerative clustering and K-means clustering post-refinement to derive high-quality and manually-adjustable clustering results. The evaluation results of our approach on a large industrial mobile app, WeChat, show that our approach can cluster the test steps with high accuracy, substantially reducing the number of clusters and thus reducing the downstream manual effort. In particular, compared with the baseline approach, our approach achieves 79.8% improvement on cluster quality, reducing 65.9% number of clusters, i.e., the number of test API methods to be implemented.

Efficient customer incident triage via linking with system incidents

In cloud service systems, customers will report the service issues they have encountered to cloud service providers. Despite many issues can be handled by the support team, sometimes the customer issues can not be easily solved, thus raising customer incidents. Quick troubleshooting of a customer incident is critical. To this end, a customer incident should be assigned to its responsible team accurately in a timely manner.

Our industrial experiences show that linking customer incidents with detected system incidents can help the customer incident triage. In particular, our empirical study on 7 real cloud service systems shows that with the additional information about the system incidents (i.e., incident reports generated by system monitors), the triage time of customer incidents can be accelerated 13.1× on average. Based on this observation, in this paper, we propose LinkCM, a learning based approach to automatically link customer incidents to monitor reported system incidents. LinkCM incorporates a novel learning-based model that effectively extracts related information from two resources, and a transfer learning strategy is proposed to help LinkCM achieve better performance without huge amount of data. The experimental results indicate that LinkCM is able to achieve accurate link prediction. Furthermore, case studies are presented to demonstrate how LinkCM can help the customer incident triage procedure in real production cloud service systems.

Effort-aware just-in-time defect identification in practice: a case study at Alibaba

Effort-aware Just-in-Time (JIT) defect identification aims at identifying defect-introducing changes just-in-time with limited code inspection effort. Such identification has two benefits compared with traditional module-level defect identification, i.e., identifying defects in a more cost-effective and efficient manner. Recently, researchers have proposed various effort-aware JIT defect identification approaches, including supervised (e.g., CBS+, OneWay) and unsupervised approaches (e.g., LT and Code Churn). The comparison of the effectiveness between such supervised and unsupervised approaches has attracted a large amount of research interest. However, the effectiveness of the recently proposed approaches and the comparison among them have never been investigated in an industrial setting.

In this paper, we investigate the effectiveness of state-of-the-art effort-aware JIT defect identification approaches in an industrial setting. To that end, we conduct a case study on 14 Alibaba projects with 196,790 changes. In our case study, we investigate three aspects: (1) The effectiveness of state-of-the-art supervised (i.e., CBS+,OneWay, EALR) and unsupervised (i.e., LT and Code Churn) effortaware JIT defect identification approaches on Alibaba projects, (2) the importance of the features used in the effort-aware JIT defect identification approach, and (3) the association between projectspecific factors and the likelihood of a defective change. Moreover, we develop a tool based on the best performing approach and investigate the tool's effectiveness in a real-life setting at Alibaba.

Enhancing the interoperability between deep learning frameworks by model conversion

Deep learning (DL) has become one of the most successful machine learning techniques. To achieve the optimal development result, there are emerging requirements on the interoperability between DL frameworks that the trained model files and training/serving programs can be re-utilized. Faithful model conversion is a promising technology to enhance the framework interoperability in which a source model is transformed into the semantic equivalent in another target framework format. However, several major challenges need to be addressed. First, there are apparent discrepancies between DL frameworks. Second, understanding the semantics of a source model could be difficult due to the framework scheme and optimization. Lastly, there exist a large number of DL frameworks, bringing potential significant engineering efforts.

In this paper, we propose MMdnn, an open-sourced, comprehensive, and faithful model conversion tool for popular DL frameworks. MMdnn adopts a novel unified intermediate representation (IR)-based methodology to systematically handle the conversion challenges. The source model is first transformed into an intermediate computation graph represented by the simple graph-based IR of MMdnn and then to the target framework format, which greatly reduces the engineering complexity. Since the model structure expressed by developers may have been changed by DL frameworks (e.g., graph optimization), MMdnn tries to recover the original high-level neural network layers for better semantic comprehension via a pattern matching similar method. In the meantime, a piece of model construction code is generated to facilitate later retraining or serving. MMdnn implements an extensible conversion architecture from the compilation point of view, which eases contribution from the community to support new DL operators and frameworks. MMdnn has reached good maturity and quality, and is applied for converting production models.

Establishing key performance indicators for measuring software-development processes at a large organization

Developing software systems in large organizations requires the cooperation of various organizational units and stakeholders. As software-development processes are distributed among such organizational units; and are constantly transformed to fulfill new domain regulations, address changing customer requirements, or adopt new software-engineering methods; it is challenging to ensure, measure, and steer—essentially monitor—the quality of the resulting systems. One means to facilitate such monitoring throughout whole software-development processes are key performance indicators, which provide a consolidated analysis of an organizations’ performance. However, it is also challenging to introduce key performance indicators for the software development of a large organization, as they must be implemented at and accepted by all relevant organizational units. In this paper, we report our experiences of introducing new key performance indicators for software-development processes at Volkswagen Financial Services AG, a large organization in the financial sector. We describe i) our methodology; ii) how we customized and use key performance indicators; iii) benefits achieved, namely improved monitoring and comparability, which help to define quality-improving actions; iv) and six lessons learned. These insights are helpful for other practitioners, providing an overview of a methodology they can adopt to assess the feasibility of key performance indicators as well as their benefits. Moreover, we hope to motivate research to investigate methods for introducing and monitoring key performance indicators to facilitate their adoption.

Estimating GPU memory consumption of deep learning models

Deep learning (DL) has been increasingly adopted by a variety of software-intensive systems. Developers mainly use GPUs to accelerate the training, testing, and deployment of DL models. However, the GPU memory consumed by a DL model is often unknown to them before the DL job executes. Therefore, an improper choice of neural architecture or hyperparameters can cause such a job to run out of the limited GPU memory and fail. Our recent empirical study has found that many DL job failures are due to the exhaustion of GPU memory. This leads to a horrendous waste of computing resources and a significant reduction in development productivity. In this paper, we propose DNNMem, an accurate estimation tool for GPU memory consumption of DL models. DNNMem employs an analytic estimation approach to systematically calculate the memory consumption of both the computation graph and the DL framework runtime. We have evaluated DNNMem on 5 real-world representative models with different hyperparameters under 3 mainstream frameworks (TensorFlow, PyTorch, and MXNet). Our extensive experiments show that DNNMem is effective in estimating GPU memory consumption.

Exempla gratis (E.G.): code examples for free

Modern software engineering often involves using many existing APIs, both open source and – in industrial coding environments– proprietary. Programmers reference documentation and code search tools to remind themselves of proper common usage patterns of APIs. However, high-quality API usage examples are computationally expensive to curate and maintain, and API usage examples retrieved from company-wide code search can be tedious to review. We present a tool, EG, that mines codebases and shows the common, idiomatic us-age examples for API methods. EG was integrated into Facebook’s internal code search tool for the Hack language and evaluated on open-source GitHub projects written in Python. EG was also compared against code search results and hand-written examples from a popular programming website called ProgramCreek. Compared with these two baselines, examples generated by EG are more succinct and representative with less extraneous statements. In addition, a survey with Facebook developers shows that EG examples are preferred in 97% of cases.

Fireteam: a small-team development practice in industry

Software development is a collective undertaking, and the team’s efficiency is critical in development. In order to reduce project management overheads and improve productivity, a global information and communication technology enterprise institutionalizes an organization wide small-team practice, called fireteams, to tackle the problems arising from human and social aspects, such as amicability, talent, skill, and communications. This paper reports a mixed-method research, which combines archive analysis, interviews and survey, to empirically investigate the characteristics and impacts of fireteam in this industrial setting. We identify three categories of fireteam in terms of its demonstrated characteristics: ordinary agile team with extensions, single-function team, and entire life-cycle team; elaborate four key activities of fireteam, i.e. team formation, maintenance, communication, and meeting. Less communication and management overheads, higher agility & concurrency, and improved personal ability are the three important contributors that increase the productivity of fireteams. Whereas management & leadership effort, divergent understanding of fireteam, and self-organized team are discovered as the three major problems associated with fireteams. Although the benefits of fireteam can be observed from its adoption, this practice does not achieve the enterprise’s anticipations very well. Some considerations and recommendations are also discussed to improve this small-team practice.

FREPA: an automated and formal approach to requirement modeling and analysis in aircraft control domain

Formal methods are promising for modeling and analyzing system requirements. However, applying formal methods to large-scale industrial projects is a remaining challenge. The industrial engineers are suffering from the lack of automated engineering methodologies to effectively conduct precise requirement models, and rigorously validate and verify (V&V) the generated models. To tackle this challenge, in this paper, we present a systematic engineering approach, named Formal Requirement Engineering Platform in Aircraft (FREPA), for formal requirement modeling and V&V in the aerospace and aviation control domains. FREPA is an outcome of the seamless collaboration between the academy and industry over the last eight years. The main contributions of this paper include 1) an automated and systematic engineering approach FREPA to construct requirement models, validate and verify systems in the aerospace and aviation control domain, 2) a domain-specific modeling language AASRDL to describe the formal specification, and 3) a practical FREPA-based tool AeroReq which has been used by our industry partners. We have successfully adopted FREPA to seven real aerospace gesture control and two aviation engine control systems. The experimental results show that FREPA and the corresponding tool AeroReq significantly facilitate formal modeling and V&V in the industry. Moreover, we also discuss the experiences and lessons gained from using FREPA in aerospace and aviation projects.

Graph-based trace analysis for microservice architecture understanding and problem diagnosis

Microservice systems are highly dynamic and complex. For such systems, operation engineers and developers highly rely on trace analysis to understand architectures and diagnose various problems such as service failures and quality degradation. However, the huge number of traces produced at runtime makes it challenging to capture the required information in real-time. To address the faced challenges, in this paper, we propose a graph-based microservice trace analysis approach GMTA for understanding architecture and diagnosing various problems. Built on a graph-based representation, GMTA includes efficient processing of traces produced on the fly. It abstracts traces into different paths and further groups them into business flows. To support various analytical applications, GMTA includes an efficient storage and access mechanism by combining a graph database and a real-time analytics database and using a carefully designed storage structure. Based on GMTA, we construct analytical applications for architecture understanding and problem diagnosis, these applications support various needs such as visualizing service dependencies, making architectural decisions, analyzing the changes of services behaviors, detecting performance issues, and locating root causes. GMTA has been implemented and deployed in eBay. An experimental study based on trace data produced by eBay demonstrates GMTA's effectiveness and efficiency for architecture understanding and problem diagnosis. Case studies conducted in eBay's monitoring team and Site Reliability Engineering (SRE) team further confirm GMTA's substantial benefits in industrial-scale microservice systems.

Harvey: a greybox fuzzer for smart contracts

We present Harvey, an industrial greybox fuzzer for smart contracts, which are programs managing accounts on a blockchain.

Greybox fuzzing is a lightweight test-generation approach that effectively detects bugs and security vulnerabilities. However, greybox fuzzers randomly mutate program inputs to exercise new paths; this makes it challenging to cover code that is guarded by narrow checks. Moreover, most real-world smart contracts transition through many different states during their lifetime, e.g., for every bid in an auction. To explore these states and thereby detect deep vulnerabilities, a greybox fuzzer would need to generate sequences of contract transactions, e.g., by creating bids from multiple users, while keeping the search space and test suite tractable.

In this paper, we explain how Harvey alleviates both challenges with two key techniques. First, Harvey extends standard greybox fuzzing with a method for predicting new inputs that are more likely to cover new paths or reveal vulnerabilities in smart contracts. Second, it fuzzes transaction sequences in a targeted and demand-driven way. We have evaluated our approach on 27 real-world contracts. Our experiments show that our techniques significantly increase Harvey's effectiveness in achieving high coverage and detecting vulnerabilities, in most cases orders-of-magnitude faster.

How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems

In recent years, more and more traditional shrink-wrapped software is provided as 7x24 online services. Incidents (events that lead to service disruptions or outages) could affect service availability and cause great financial loss. Therefore, mitigating the incidents is important and time critical. In practice, a document describing a mitigation process, called a troubleshooting guide (TSG), is usually used to reduce the Time To Mitigate (TTM). To investigate the usage of TSGs in real-world online services, we conduct the first empirical study on 18 real-world, large-scale online service systems in Microsoft. We analyze the distribution and characteristics of TSGs among all incident records in the past two years. According to our study, 27.2% incidents have TSG records and 36.2% of them occurred at least twice. Besides, on average developers spend around 36.3% of the entire mitigation time on locating the desired TSGs.

Our study shows that incidents could occur repeatedly and TSGs could be reused to facilitate incident mitigation. Motivated by our empirical study, we propose an automated TSG recommendation approach, DeepRmd, by leveraging the textual similarity between incident description and its corresponding TSG using deep learning techniques. We evaluate the effectiveness of DeepRmd on 18 online service systems. The results show that DeepRmd can recommend the correct TSG as the Top 1 returned result for 80.3% incidents, which significantly outperforms two baseline approaches.

Improving cybersecurity hygiene through JIT patching

Vulnerability patch management remains one of the most complex issues facing modern enterprises; companies struggle to test and deploy new patches across their networks, often leaving myriad attack vectors vulnerable to exploits. This problem is exacerbated by enterprise server applications, which expose tremendous amounts of information about their security postures, greatly expediting attackers' reconnaissance incursions (e.g., knowledge gathering attacks). Unfortunately, current patching processes offer no insights into attacker activities, and prompt attack remediation is hindered by patch compatibility considerations and deployment cycles.

To reverse this asymmetry, a patch management model is proposed to facilitate the rapid injection of software patches into live, commodity applications without disruption of production workflows, and the transparent sandboxing of suspicious processes for counterreconnaissance and threat information gathering. Our techniques improve workload visibility and vulnerability management, and overcome perennial shortcomings of traditional patching methodologies, such as proneness to attacker fingerprinting, and the high cost of deployment. The approach enables a large variety of novel defense scenarios, including rapid security patch testing with prompt recovery from defective patches and the placement of exploit sensors inlined into production workloads. An implementation for six enterprise-grade server programs demonstrates that our approach is practical and incurs minimal runtime overheads. Moreover, four use cases are discussed, including a practical deployment on two public cloud environments.

IntelliCode compose: code generation using transformer

In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, majority of integrated development environments only support completion of methods and APIs, or arguments.

In this paper, we introduce IntelliCode Compose – a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages state-of-the-art generative transformer model trained on 1.2 billion lines of source code in Python, C#, JavaScript and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook.

Our best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for Python programming language.

Learning to extract transaction function from requirements: an industrial case on financial software

In practice, it is very important to determine the size of a proposed software system yet to be built based on its requirements, i.e., early in the development life cycle. The most widely used approach for size estimation is Function Point Analysis (FPA). However, since FPA involves human judgment, the estimation results are some degree of subjective, and the process is labor and cost intensive. In this paper, we propose a novel approach to identify transaction functions from textual requirements automatically by leveraging a set of natural language processing techniques and machine learning models. We evaluate our approach on 1,864 requirements and 104,691 transaction functions taken from 36 financial projects from one banking industry. The results show that the contents of the suggested transaction functions by our approach are high in quality, with low perplexity value of 8.5 and high BLEU score of 34 on average. The types of suggested transaction functions can also be accurately classified, with overall accuracy of 0.99 on average. Our approach can provide reasonable suggestions that assist industrial practitioners to identify transaction functions faster and easier.

Online sports betting through the prism of software engineering

Online sports betting is a $50B industry that is heavily driven by software. The domain imposes significant demands on developers: the resulting solutions are large, complex, distributed, concurrent software systems with strict availability, real-time performance, scalability, reliability, and security requirements. This paper describes our experience with EmpireBet, a family of online sports betting platforms built and deployed over the past 15 years. The initial solution, implemented by four developers in a start-up, catered to users who connected to the system intermittently, for limited periods, via dial-up connections. Today’s system, engineered and maintained in 27 programming and markup languages by a team of 20 developers, is deployed in over 30 countries, integrated with over 50 third-party systems, and processes tens of millions daily transactions by over 680,000 players who are continuously using the system. This was accomplished via an an explicit focus on EmpireBet’s critical non-functional requirements; a modular, extensible architecture; a set of novel abstractions we introduced into the system; and several reusable libraries developed in the process.

Reducing DNN labelling cost using surprise adequacy: an industrial case study for autonomous driving

Deep Neural Networks (DNNs) are rapidly being adopted by the automotive industry, due to their impressive performance in tasks that are essential for autonomous driving. Object segmentation is one such task: its aim is to precisely locate boundaries of objects and classify the identified objects, helping autonomous cars to recognise the road environment and the traffic situation. Not only is this task safety critical, but developing a DNN based object segmentation module presents a set of challenges that are significantly different from traditional development of safety critical software. The development process in use consists of multiple iterations of data collection, labelling, training, and evaluation. Among these stages, training and evaluation are computation intensive while data collection and labelling are manual labour intensive. This paper shows how development of DNN based object segmentation can be improved by exploiting the correlation between Surprise Adequacy (SA) and model performance. The correlation allows us to predict model performance for inputs without manually labelling them. This, in turn, enables understanding of model performance, more guided data collection, and informed decisions about further training. In our industrial case study the technique allows cost savings of up to 50% with negligible evaluation inaccuracy. Furthermore, engineers can trade off cost savings versus the tolerable level of inaccuracy depending on different development phases and scenarios.

Scaling static taint analysis to industrial SOA applications: a case study at Alibaba

In Alibaba, we have seen a growing demand for tracing data flow for scenarios such as data leak detection, change governance, and data consistency checking. Static taint analysis is a technique for such problems, and many approaches are proposed for high scalability and precision. This paper shares our experience in applying taint analysis in Alibaba. In particular, we find that the state-of-the-art taint analysis tool, FlowDroid, does not work well in our cases because our applications make heavy use of libraries, native methods and enterprise-specific frameworks, which impose two major challenges, scalability and implicit dependency, to FlowDroid. This paper presents ANTaint to address these problems. ANTaint improves scalability by expanding the call graph and applying taint propagation on demand for libraries, which account for majority of the program execution but only a small fraction propagates taints. To improve accuracy, we ensure to build a sound call graph with its core part having certain accuracy, and providing a more precise taint propagation model. The practice of applying ANTaint in the company workload validates the idea. According to an experiment on 60 production cases, ANTaint is correct for 95% of the cases (precision: 95%, recall: 98%) while FlowDroid is 13%. ANTaint takes 65% less time and none of the cases run out of memory with 32 GB limitation.

Towards intelligent incident management: why we need it and how we make it

The management of cloud service incidents (unplanned interruptions or outages of a service/product) greatly affects customer satisfaction and business revenue. After years of efforts, cloud enterprises are able to solve most incidents automatically and timely. However, in practice, we still observe critical service incidents that occurred in an unexpected manner and orchestrated diagnosis workflow failed to mitigate them. In order to accelerate the understanding of unprecedented incidents and provide actionable recommendations, modern incident management system employs the strategy of AIOps (Artificial Intelligence for IT Operations). In this paper, to provide a broad view of industrial incident management and understand the modern incident management system, we conduct a comprehensive empirical study spanning over two years of incident management practices at Microsoft. Particularly, we identify two critical challenges (namely, incomplete service/resource dependencies and imprecise resource health assessment) and investigate the underlying reasons from the perspective of cloud system design and operations. We also present IcM BRAIN, our AIOps framework towards intelligent incident management, and show its practical benefits conveyed to the cloud services of Microsoft.

WebRR: self-replay enhanced robust record/replay for web application testing

Record-and-replay tools are important for quality assurance of Web applications by capturing user case scenarios and executing them automatically when needed. However, the tests generated by existing techniques are brittle, and often lead to test breakages as the dynamic behavior and frequent updates of modern Web applications. In this paper, we propose WebRR, a self-replay enhanced robust record-and-replay technique for Web applications testing. The novelty of WebRR is that, it introduces a new self-replay mechanism in the recording phase, which checks the captured event from the record module online, and generates multiple locators (including DOM locators, visual locator and proximity locators) automatically, to improve the robustness of generated test cases. During the replay, it combines multiple locators and new local workflow repair technique to repair test breakages, and can improve the resilience of generated tests to frequent updates of the applications. We applied our approach to 3 enterprise Web applications, which are deployed in a large power grid company of China. The experimental results show that WebRR is effective, and substantially improve the robustness of end-to-end web tests that are generated using record-and-replay technique.

SESSION: Visions and Reflections

Beyond accuracy: assessing software documentation quality

Good software documentation encourages good software engineering, but the meaning of "good" documentation is vaguely defined in the software engineering literature. To clarify this ambiguity, we draw on work from the data and information quality community to propose a framework that decomposes documentation quality into ten dimensions of structure, content, and style. To demonstrate its application, we recruited technical editors to apply the framework when evaluating examples from several genres of software documentation. We summarise their assessments -- for example, reference documentation and README files excel in quality whereas blog articles have more problems -- and we describe our vision for reasoning about software documentation quality and for the expansion and potential of a unified quality framework.

Continuous experimentation on artificial intelligence software: a research agenda

Moving from experiments to industrial level AI software development requires a shift from understanding AI/ ML model attributes as a standalone experiment to know-how integrating and operating AI models in a large-scale software system. It is a growing demand for adopting state-of-the-art software engineering paradigms into AI development, so that the development efforts can be aligned with business strategies in a lean and fast-paced manner. We describe AI development as an “unknown unknown” problem where both business needs and AI models evolve over time. We describe a holistic view of an iterative, continuous approach to develop industrial AI software basing on business goals, requirements and Minimum Viable Products. From this, five areas of challenges are presented with the focus on experimentation. In the end, we propose a research agenda with seven questions for future studies.

Inferring and securing software configurations using automated reasoning

Software configurability opens the door to misconfiguration vulnerabilities, invalid settings that expose software weaknesses. Misconfiguration is one the top ten most critical security risks and the most common. This paper envisions a world without misconfiguration vulnerabilities through the use of automated reasoning techniques to infer and secure software configurations. Real-world software, however, often lacks an explicit specification of secure configurations, relying on hand-validation by users. Real-world systems comprise many individual highly-configurable software components, making the space of possible configurations for the whole system enormous. To realize our vision and overcome these challenges, we aim to create a rigorous definition of configuration specifications, use formal methods to mechanize the inference and generation of valid configurations, and develop algorithms to automatically secure against misconfiguration.

Next generation automated software evolution refactoring at scale

Despite progress in providing software engineers with tools that automate an increasing number of development tasks, complex activities like redesigning and reengineering existing software remain resource intensive or are supported by tools that are error prone. Complex, but common tasks in industry, like evolving large codebases (1M+ SLOC) to meet changing needs, still rely on costly manual efforts and incur significant technical risk. In one example, an organization that we work with estimated 14,000 hours of development work alone (excluding integration and testing) to isolate a feature from the underlying hardware platform. These examples are pervasive in industry. Software engineering research has taken providing effective tools for software evolution for granted for far too long. The time is right for research to take advantage of advances in search-based software engineering and create the next generation of industry-relevant automated software evolution tools. This paper lays out a vision for automated refactoring at scale towards this goal.

Revealing the complexity of automotive software

Software continues its procession into the core of the modern cars. Sophisticated functionalities, like connectivity and active safety, provide gratifying comfort to its users. With the sophisticated functionality, however, comes the underlying complexity that grows overwhelmingly year by year. But the invisibility of software hinders the practitioners and researchers grasping the magnitude of complexity thoroughly. Rather, from time to time, the consequences of complexity surface in forms of ultra-high design efforts, waves of defect reports, and explosions of warranty costs. This article reveals the complexity of software in four key areas of automotive software development. It points out that the existing practices are severely insufficient for systematic complexity management.

Software documentation and augmented reality: love or arranged marriage?

There is a significant rise in the availability, development and size of software projects in the present day. Many open source projects are reused or updated for various purposes that include fixing bugs in existing projects, development and maintenance of project extensions. Developers who interact with the projects might require documentation for better comprehension of the project and to develop extensions. Most of the software projects currently do not have sufficient documentation or it is not updated along with the project. If some projects have reasonably sufficient documentation, it is usually difficult to comprehend it either for maintenance or for reuse purposes. Considering the usefulness of Augmented Reality (AR) towards comprehension, we propose the vision of integrating the domains of augmented reality and software documentation, and specifically, visualization of software documentation using AR. In this paper, we present some of the directions that could be explored towards this vision and also present an example visualization scenario for API documentation using neural system metaphor. We see this paper as a basis for the future research direction of leveraging AR towards making documentation as a primary artifact in the software development process.

Testing machine learning code using polyhedral region

To date, although machine learning has been successful in various practical applications, generic methods of testing machine learning code have not been established yet. Here we present a new approach to test machine learning code using the possible input region obtained as a polyhedron. If an ML system generates different output for multiple input in the polyhedron, it is ensured that there exists a bug in the code. This property is known as one of theoretical fundamentals in statistical inference, for example, sparse regression models such as the lasso, and a wide range of machine learning algorithms satisfy this polyhedral condition, to which our testing procedure can be applied. We empirically show that the existence of bugs in lasso code can be effectively detected by our method in the mutation testing framework.

Towards learning visual semantics

We envision visual semantics learning (VSL), a novel methodology that derives high-level functional description of given software from its visual (graphical) outputs. By visual semantics, we mean the semantic description about the software’s behaviors that are exhibited in its visual outputs. VSL works by composing this description based on visual element labels extracted from these outputs through image/video understanding and natural language generation. The result of VSL can then support tasks that may benefit from the high-level functional description. Just like a developer relies on program understanding to conduct many of such tasks, automatically understanding software (i.e., by machine rather than by human developers) is necessary to eventually enable fully automated software engineering. Apparently, VSL only works with software that does produce visual outputs that meaningfully demonstrate the software’s behaviors. Nevertheless, learning visual semantics would be a useful first step towards automated software understanding. We outline the design of our approach to VSL and present early results demonstrating its merits.

SESSION: Tool Demonstrations

AlloyMC: Alloy meets model counting

Specifying and analyzing desired properties of software systems can play an important role in the development of more dependable systems. Alloy is a mature tool-set that provides a first-order, rela- tional logic with transitive closure for writing the specifications, and a fully automatic backend based on propositional satisfiability (SAT) solvers for analyzing them. Alloy’s intuitive notation and sup- port for modern solvers make it a particularly effective specification and analysis tool, which has been applied in several domains, including verification, security, and synthesis. This paper introduces a new backend for Alloy, which complements SAT solvers, and provides a new method to assist Alloy users to more effectively use the tool-set, specifically in scenarios where multiple solutions to the same formula are desired. We add to the Alloy backend support for model counting, i.e., computing the number of solutions to the given formula. We extend the Alloy grammar to add a new com- mand for model counting, and extend the Alloy GUI to customize it. Our implementation, called AlloyMC, supports two state-of-the-art model counters: the approximate model counter ApproxMC and the exact model counter ProjMC. AlloyMC runs on Linux, Mac, and Windows. To use AlloyMC, users just download and run its integrated JAR file with no need to install dependencies (e.g., model counters and their dependent libraries). The AlloyMC source code, the JAR file, and the data set are available publicly.

ARCADE: an extensible workbench for architecture recovery, change, and decay evaluation

This paper presents the design, implementation, and usage details of ARCADE, an extensible workbench for supporting the recovery of software systems' architectures, and for evaluating architectural change and decay. ARCADE has been developed and maintained over the past decade, and has been deployed in a number of research labs as well as within three large companies. ARCADE's implementation is available at and the video depicting its use at

BEE: a tool for structuring and analyzing bug reports

This paper introduces BEE, a tool that automatically analyzes user-written bug reports and provides feedback to reporters and developers about the system’s observed behavior (OB), expected behavior (EB), and the steps to reproduce the bug (S2R). BEE employs machine learning to (i) detect if an issue describes a bug, an enhancement, or a question; (ii) identify the structure of bug descriptions by automatically labeling the sentences that correspond to the OB, EB, or S2R; and (iii) detect when bug reports fail to provide these elements. BEE is integrated with GitHub and offers a public web API that researchers can use to investigate bug management tasks based on bug reports. We evaluated BEE’s underlying models on more than 5k existing bug reports and found they can correctly detect OB, EB, and S2R sentences as well as missing information in bug reports. BEE is an open-source project that can be found at <a></a>. A screencast showing the full capabilities of BEE can be found at <a></a>.

BugsInPy: a database of existing bugs in Python programs to enable controlled testing and debugging studies

The 2019 edition of Stack Overflow developer survey highlights that, for the first time, Python outperformed Java in terms of popularity. The gap between Python and Java further widened in the 2020 edition of the survey. Unfortunately, despite the rapid increase in Python's popularity, there are not many testing and debugging tools that are designed for Python. This is in stark contrast with the abundance of testing and debugging tools for Java. Thus, there is a need to push research on tools that can help Python developers.

One factor that contributed to the rapid growth of Java testing and debugging tools is the availability of benchmarks. A popular benchmark is the Defects4J benchmark; its initial version contained 357 real bugs from 5 real-world Java programs. Each bug comes with a test suite that can expose the bug. Defects4J has been used by hundreds of testing and debugging studies and has helped to push the frontier of research in these directions.

In this project, inspired by Defects4J, we create another benchmark database and tool that contain 493 real bugs from 17 real-world Python programs. We hope our benchmark can help catalyze future work on testing and debugging tools that work on Python programs.

CRSG: a serious game for teaching code review

The application of code review in a development environment is essential, but this skill is not taught very often in an educational context despite its wide usage. To streamline the teaching process of code review, we propose a browser based "Code Review Serious Game" (CRSG) with high accessibility, progressive level difficulty and an evolvable foundation for prospective improvements or changes. The application is built as a serious game to reinforce the learning experience of its users by immersing them in its story and theme, helping them learn while having fun. The effectiveness of the game components are measured with a case study of 132 students of 2 software engineering courses. The promising result of this case study suggests CRSG can indeed be used effectively to teach code review. The demo video for the game can be accessed at, and CRSG itself at:

Dads: dynamic slicing continuously-running distributed programs with budget constraints

We present Dads, the first distributed, online, scalable, and cost-effective dynamic slicer for continuously-running distributed programs with respect to user-specified budget constraints. Dads is distributed by design to exploit distributed and parallel computing resources. With an online analysis, it avoids tracing hence the associated time and space costs. Most importantly, Dads achieves and maintains practical scalability and cost-effectiveness tradeoffs according to a given budget on analysis time by continually and automatically adjusting the configuration of its analysis algorithm on the fly via reinforcement learning. Against eight real-world Java distributed systems, we empirically demonstrated the scalability and cost-effectiveness merits of Dads. The open-source tool package of Dads with a demo video is publicly available.

DeepCommenter: a deep code comment generation tool with hybrid lexical and syntactical information

As the scale of software projects increases, the code comments are more and more important for program comprehension. Unfortunately, many code comments are missing, mismatched or outdated due to tight development schedule or other reasons. Automatic code comment generation is of great help for developers to comprehend source code and reduce their workload. Thus, we propose a code comment generation tool (DeepCommenter) to generate descriptive comments for Java methods. DeepCommenter formulates the comment generation task as a machine translation problem and exploits a deep neural network that combines the lexical and structural information of Java methods. We implement DeepCommenter in the form of an Integrated Development Environment (i.e., Intellij IDEA) plug-in. Such plug-in is built upon a Client/Server architecture. The client formats the code selected by the user, sends request to the server and inserts the comment generated by the server above the selected code. The server listens for client’s request, analyzes the requested code using the pre-trained model and sends back the generated comment to the client. The pre-trained model learns both the lexical and syntactical information from source code tokens and Abstract Syntax Trees (AST) respectively and combines these two types of information together to generate comments. To evaluate DeepCommenter, we conduct experiments on a large corpus built from a large number of open source Java projects on GitHub. The experimental results on different metrics show that DeepCommenter outperforms the state-of-the-art approaches by a substantial margin.

DiffTech: a tool for differencing similar technologies from question-and-answer discussions

Developers can use different technologies for different software development tasks in their work. However, when faced with several technologies with comparable functionalities, it can be challenging for developers to select the most appropriate one, as trial and error comparisons among such technologies are time-consuming. Instead, developers resort to expert articles, read official documents or ask questions in Q&A sites for technology comparison. However, it is still very opportunistic whether they will get a comprehensive comparison, as online information is often fragmented, contradictory and biased. To overcome these limitations, we propose the DiffTech system that exploits the crowd sourced discussions from Stack Overflow, and assists technology comparison with an informative summary of different comparison aspects. We found 19,118 comparative sentences from 2,410 pairs of comparable technologies. We released our DiffTech website for public use. Our website attracts over 1800 users and we also receive some positive comments on social media. A walkthrough video of the tool demo: Website link:

Enhancing developer interactions with programming screencasts through accurate code extraction

Programming screencasts have become a pervasive resource on the Internet, which is favoured by many developers for learning new programming skills. For developers, the source code in screencasts is valuable and important. However, the streaming nature of screencasts limits the choice that they have for interacting with the code. Many studies apply the Optical Character Recognition (OCR) technique to convert screen images into text, which can be easily searched and indexed. However, we observe that the noise in the screen images significantly affects the quality of OCRed code.

In this paper, we develop a tool named psc2code, which has two components, denoising code extraction from screencasts and enhancing programming video interaction. Experiment results on 1142 programming screencasts from YouTube show psc2code can effectively identify frames containing valid code region with a F1-score of 0.88 and improve the quality of OCRed code by fixing 46% of the errors. We also conduct a user study to evaluate the applicability of psc2code in enhancing video interaction, which shows it helps participants learn the knowledge in tutorials more efficiently.

JITO: a tool for just-in-time defect identification and localization

In software development and maintenance, defect localization is necessary for software quality assurance. Current defect localization techniques mainly rely on defect symptoms (e.g., bug reports or program spectrum) when the defect has been exposed. One challenge task is: can we locate buggy program prior to the appearance of the defect symptom. Such kind of localization is conducted at an early stage (e.g., when buggy program elements are being checked-in) which can be an early step of continuous quality control.

In this paper, we propose a Just-In-Time defect identification and lOcalization tool, named JITO, which can help developers to locate defective lines at check-in time. In summary, JITO contains two phases: (i) identify if a new change is buggy and (ii) locate suspicious buggy code lines in the identified buggy changes. We implement JITO as a plugin in an integrated development environment (i.e., Intellij IDEA). When developers using our plugin, JITO loads the local Git repository to build the JIT defect identification model and localization model based on historical changes. After submitting a new change to the local repository, developers apply JITO to identify whether it is a buggy change. If a buggy change is identified, JITO leverages JIT defect localization model to locate its suspicious buggy lines and highlight them in Intellij IDEA. Experimental results show that JITO outperforms two baselines (i.e., random guess and a static bug finder (i.e., PMD)) by a substantial margin in terms of four ranking measures.

Demo URL: <a></a>

Plugin download: <a></a>

LibComp: an IntelliJ plugin for comparing Java libraries

Software developers heavily rely on third-party libraries to accomplish their programming tasks. Since many libraries offer similar functionality, it can be difficult and tedious for developers differentiate similar libraries in order to select the most suitable one. In our previous work, we proposed the idea of metric-based library comparisons that allow developers to compare various aspects of libraries within the same domain, empowering them with information to aid with their decision. In this paper we present an IntelliJ plugin, LibComp, that provides this library metric-based comparison technique right within the developer’s IDE. As soon as a developer adds a library dependency that LibComp has information about, LibComp will highlight this dependency to let the developer know that there are alternatives available. Once the user triggers the comparison for that library, they can view various metrics about the library and its alternatives and decide if they want to use one of the alternatives. In the process, LibComp also records the number of times the developer invokes the tool and any completed replacements. Such feedback, if optionally sent to us by the developer, provides us valuable insights into developers’replacement decisions as well as information on how we can improve the tool. A video demonstrating the usage of LibComp can be found at

MCBAT: a practical tool for model counting constraints on bounded integer arrays

Model counting procedures for data structures are crucial for advancing the field of automated quantitative program analysis. We present a tool for Model Counting for Bounded Array Theory (MCBAT). MCBAT works on quantified integer array constraints in which all arrays have a finite length. We employ reductions from the theory of arrays to uninterpreted functions and linear integer arithmetic (LIA). Once reduced to LIA, we leverage Barvinok's polynomial time integer lattice point enumeration algorithm. Finally, we present a case study demonstrating applicability to automated quantitative program analysis. MCBAT is available for immediate use as a Docker image and the source code is freely available in our Github repository.

ModCon: a model-based testing platform for smart contracts

Unlike those on public permissionless blockchains, smart contracts on enterprise permissioned blockchains are not limited by resource constraints, and therefore often larger and more complex. Current testing and analysis tools lack support for such contracts, which demonstrate stateful behaviors and require special treatment in quality assurance. In this paper, we present a model-based testing platform, called ModCon, relying on user-specified models to define test oracles, guide test generation, and measure test adequacy. ModCon is Web-based and supports both permissionless and permissioned blockchain platforms. We demonstrate the usage and key features of ModCon on real enterprise smart contract applications.

Mono2Micro: an AI-based toolchain for evolving monolithic enterprise applications to a microservice architecture

Mono2Micro is an AI-based toolchain that provides recommendations for decomposing legacy web applications into microservice partitions. Mono2Micro consists of a set of tools that collect static and runtime information from a monolithic application and process the information using an AI-based technique to generate recommendations for partitioning the application classes. Each partition represents a candidate microservice or a grouping of classes with similar business functionalities. Mono2Micro takes a temporo-spatial clustering approach to compute meaningful and explainable partitions. It generates two types of partition recommendations. First, it computes business-logic-seams-based partitions that represent a desired encapsulation of business functionalities. However, such a recommendation may cut across data dependencies between classes, accommodating which could require significant application updates. To address this, Mono2Micro computes natural-seams-based partitions, which respect data dependencies. We describe the set of tools that comprise Mono2Micro and illustrate them using a well-known open-source JEE application.

MutAPK 2.0: a tool for reducing mutation testing effort of Android apps

Mutation testing is a time consuming process because large sets of fault-injected-versions of an original app are generated and executed with the purpose of evaluating the quality of a given test suite. In the case of Android apps, recent studies even suggest that mutant generation and mutation testing effort could be greater when the mutants are generated at the APK level. To reduce that effort, useless (e.g., equivalent) mutants should be avoided and mutant selection techniques could be used to reduce the set of mutants used with mutation testing. However, despite the existence of mutation testing tools, none of those tools provides features for removing useless mutants and sampling mutant sets. In this paper, we present MutAPK 2.0, an improved version of our open source mutant generation tool (MutAPK) for Android apps at APK level. To the best of our knowledge, MutAPK 2.0 is the first tool that enables the removal of dead-code mutants, provides a set of mutant selection strategies, and removes automatically equivalent and duplicate mutants. MutAPK 2.0 is publicly available at GitHub: VIDEO:

PAClab: a program analysis collaboratory

We present a web-based Program Analysis Collaboratory (PAClab) tool that helps researchers to obtain realistic program benchmarks using user-defined selection criteria. Based on selection criteria, PAClab identifies relevant projects and its programs from open-source repositories, obtains those programs, and if necessary performs sound program transformations to adapt them to the targeted verification tool. PAClab makes the resulting program benchmarks available for download. PAClab is designed as a scalable, modular, and parametrizable tool that takes advantage of a computer cluster to handle multiple user requests.

PCA: memory leak detection using partial call-path analysis

Data dependence analysis underlies various applications in software quality assurance, yet existing frameworks/tools for this analysis commonly suffer scalability challenges. We present PCA, a static interprocedural data dependence analyzer for real-world C programs. PCA performs interprocedural points-to and data-flow analyses with a lightweight design. Most of all, it features a partial call-path (PCA) analysis that consists of optimization options to further speed up data dependence computation. As an example application of it, PCA readily supports memory leak detection, for which it helps achieve close or better performance and precision relative to the same application based on a state-of-the-art value flow analysis. In particular, it found four more memory leaks in an industry-scale system which have been fixed by the developers. Through the data dependence it computes, PCA can enable other applications (e.g., impact analysis and taint analysis).

PRF: a framework for building automatic program repair prototypes for JVM-based languages

PRF is a Java-based framework that allows researchers to build prototypes of test-based generate-and-validate automatic program repair techniques for JVM languages by simply extending it with their patch generation plugins. The framework also provides other useful components for constructing automatic program repair tools, e.g., a fault localization component that provides spectrum-based fault localization information at different levels of granularity, a configurable and safe patch validation component that is 11+X faster than vanilla testing, and a customizable post-processing component to generate fix reports. A demo video of PRF is available at

PRODeep: a platform for robustness verification of deep neural networks

Deep neural networks (DNNs) have been applied in safety-critical domains such as self driving cars, aircraft collision avoidance systems, malware detection, etc. In such scenarios, it is important to give a safety guarantee to the robustness property, namely that outputs are invariant under small perturbations on the inputs. For this purpose, several algorithms and tools have been developed recently. In this paper, we present PRODeep, a platform for robustness verification of DNNs. PRODeep incorporates constraint-based, abstraction-based, and optimisation-based robustness checking algorithms. It has a modular architecture, enabling easy comparison of different algorithms. With experimental results, we illustrate the use of the tool, and easy combination of those techniques.

SVMRanker: a general termination analysis framework of loop programs via SVM

Deciding termination of programs is probably the most famous problem in computer science. Synthesizing ranking functions for programs is a standard way to prove termination of programs. Currently, specific synthesis algorithms have to be developed for each specific type of programs. For instance, the synthesis of ranking functions for programs with linear variables updates is usually based on linear programming techniques and the like, while for programs with polynomial updates, it usually relies on semi-definite programming and the like. The same also applies to the synthesis of different types of ranking functions needed for proving program termination. Each time faced with a new type of programs and a new type of ranking functions, researchers have to spend a considerable amount of effort to develop specialized synthesis algorithms. In this paper, to save this extra effort, we present SVMRanker, a general framework for proving termination of programs, which is able to synthesize different types of ranking functions for programs with both linear and polynomial updates, based on Support-Vector Machines (SVM). We compare SVMRanker with the state-of-the-art tool LassoRanker on standard benchmarks. Empirical results show that SVMRanker is comparable with LassoRanker on programs with linear updates and can manage more programs with polynomial updates, making SVMRanker a valid complement to LassoRanker in proving program termination.

SWAN: a static analysis framework for swift

Swift is an open-source programming language and Apple's recommended choice for app development. Given the global widespread use of Apple devices, the ability to analyze Swift programs has significant impact on millions of users. Although static analysis frameworks exist for various computing platforms, there is a lack of comparable tools for Swift. While LLVM and Clang support some analyses for Swift, they are either primarily dynamic analyses or not suitable for deeper analyses of Swift programs such as taint tracking. Moreover, other existing tools for Swift only help enforce code styles and best practices.

In this paper, we present SWAN, an open-source framework that allows robust program analyses of Swift programs using IBM's T.J. Watson Libraries for Analysis (WALA). To provide a wide range of analyses for Swift, SWAN leverages the well-established libraries in WALA. SWAN is publicly available at We have also made a screencast available at

Threshy: supporting safe usage of intelligent web services

Increased popularity of ‘intelligent’ web services provides end-users with machine-learnt functionality at little effort to developers. However, these services require a decision threshold to be set which is dependent on problem-specific data. Developers lack a systematic approach for evaluating intelligent services and existing evaluation tools are predominantly targeted at data scientists for pre-development evaluation. This paper presents a workflow and supporting tool, Threshy, to help software developers select a decision threshold suited to their problem domain. Unlike existing tools, Threshy is designed to operate in multiple workflows including pre-development, pre-release, and support. Threshy is designed for tuning the confidence scores returned by intelligent web services and does not deal with hyper-parameter optimisation used in ML models. Additionally, it considers the financial impacts of false positives. Threshold configuration files exported by Threshy can be integrated into client applications and monitoring infrastructure. Demo: <a></a>.

tsDetect: an open source test smells detection tool

The test code, just like production source code, is subject to bad design and programming practices, also known as smells. The presence of test smells in a software project may affect the quality, maintainability, and extendability of test suites making them less effective in finding potential faults and quality issues in the project's production code. In this paper, we introduce tsDetect, an automated test smell detection tool for Java software systems that uses a set of detection rules to locate existing test smells in test code. We evaluate the effectiveness of tsDetect on a benchmark of 65 unit test files containing instances of 19 test smell types. Results show that tsDetect achieves a high detection accuracy with an average precision score of 96% and an average recall score of 97%. tsDetect is publicly available, with a demo video, at:

UIED: a hybrid tool for GUI element detection

Graphical User Interface (GUI) elements detection is critical for many GUI automation and GUI testing tasks. Acquiring the accurate positions and classes of GUI elements is also the very first step to conduct GUI reverse engineering or perform GUI testing. In this paper, we implement a User Iterface Element Detection (UIED), a toolkit designed to provide user with a simple and easy-to-use platform to achieve accurate GUI element detection. UIED integrates multiple detection methods including old-fashioned computer vision (CV) approaches and deep learning models to handle diverse and complicated GUI images. Besides, it equips with a novel customized GUI element detection methods to produce state-of-the-art detection results. Our tool enables the user to change and edit the detection result in an interactive dashboard. Finally, it exports the detected UI elements in the GUI image to design files that can be further edited in popular UI design tools such as Sketch and Photoshop. UIED is evaluated to be capable of accurate detection and useful for downstream works.

Tool URL: <a></a>

Github Link: <a></a>

UIScreens: extracting user interface screens from mobile programming video tutorials

Mobile apps are one of the most widely used types of software systems in existence today and more programmers and students learn how to develop them everyday. One of the most popular resources for learning mobile programming are videos hosted on social platforms such as YouTube. While useful, this type of resource has also its limitations, especially when developers are looking for user interface (UI) designs for mobile applications, since these are hard to search for and locate in videos. We propose UIScreens, a web-based analysis and search engine that analyzes the visual contents of mobile programming video tutorials, then identifies and extracts the UI screens displayed in the videos. Our tool offers features such as searching for UI screens in videos, displaying an overview of the UI screens identified in a video under each search result, and navigating to the part of a video where a particular UI screen is being displayed and discussed. In a user study, participants agreed that UIScreens is usable and useful to quickly skim through videos, while the UI screens it extracts can help developers further determine the relevance of videos to a search topic.

WebJShrink: a web service for debloating Java bytecode

As software projects grow in complexity, they come packaged with under-utilized libraries and therefore become bloated. Though several software debloating tools exist, none of them help developers gain insights into how under-utilized those libraries are nor help developers build confidence in the behavior preservation of software after debloating. To bridge this gap, we developed WebJShrink, a visual analytics tool for analyzing and pruning bloated software projects. WebJShrink is built on JShrink which uses static and dynamic reachability analysis to determine the extent of software bloat. WebJShrink provides rich visualizations of the bloat lurking within a target project's internal structure. It then removes unused features, and returns a safer, slimmer variant of the software project. To illustrate the target project's behavior preservation, WebJShrink examines the debloated software with its JUnit tests and visualizes the test results. In evaluating WebJShrink against 26 real world systems, we found WebJShrink could reduce software size by up to 42%, 11% on average, while still passing 100% of unit tests after debloating. We provide a video demonstrating WebJShrink at

SESSION: Doctoral Symposium

Assisting the elite-driven open source development through activity data

Elite developers, who own the administrative privileges for a project, maintain a diverse profile of contributing activities, and drive the development of open source software (OSS). To advance our understanding and further support the OSS community, I present a fresh approach to investigate developers’ public activities from the fine-grained event data provided by GitHub. Further, I develop this approach into an analysis framework for collecting, modeling, and analyzing elite developers’ online contributing activities. Employing this framework, I have conducted empirical studies on various OSS projects and ecosystems to characterize elite developers’ full-spectrum activities and their dynamics, and also unveil relationships between their effort allocation and projects’ technical outcomes. Finally, I propose to design and implement a toolset based on this framework and my results to date, which supports individual developers’ decision-making and assists their routine workflows with automation.

Enhancing developers’ support on pull requests activities with software bots

Software bots are employed to support developers' activities, serving as conduits between developers and other tools. Due to their focus on task automation, bots have become particularly relevant for Open Source Software (OSS) projects hosted on GitHub. While bots are adopted to save development cost, time, and effort, the bots' presence can be disruptive to the community. My research goal is two-fold: (i) identify problems caused by bots that interact in pull requests, and (ii) help bot designers enhance existing bots. Toward this end, we are interviewing maintainers, contributors, and bot developers to understand the problems in the human-bot interaction and how they affect the collaboration in a project. Afterward, we will employ Design Fiction to capture the developers' vision of bots' capabilities, in order to define guidelines for the design of bots on social coding platforms, and derive requirements for a meta-bot to deal with the problems. This work contributes more broadly to the design and use of software bots to enhance developers' collaboration and interaction.

Machine learning based test data generation for safety-critical software

Unit testing focused on Modified Condition/Decision Coverage (MC/DC) criterion is essential in development safety-critical systems. However, design of test data that meets the MC/DC criterion currently needs detailed manual analysis of branching conditions in units under test by test engineers. Multiple state-of-art approaches exist with proven usage even in industrial projects. However, these approaches have multiple shortcomings, one of them being the Path explosion problem which has not been fully solved yet. Machine learning methods as meta-heuristic approximations can model behaviour of programs that are hard to test using traditional approaches, where the Path explosion problem does occur and thus could solve the limitations of the current state-of-art approaches. I believe, motivated by an ongoing collaboration with an industrial partner, that the machine learning methods could be combined with existing approaches to produce an approach suitable for testing of safety-critical projects.

Reusing software engineering knowledge from developer communication

Software development requires many different types of knowledge, such as knowledge about software development processes, practices and techniques, and about the domain of an application. Software, developers often share knowledge in informal communication channels (e.g., instant messaging tools, e-mails, or online forums). Considering that this informal communication contains knowledge that may be potentially relevant for other developers and given that this knowledge is not necessarily captured and formally documented for reuse, in this work we propose (a) exploring whether developer communication (via instant messaging) is a suitable source of reusable software engineering knowledge; (b) investigating how to identify that knowledge using data mining; (c) and analysing through action research how to present it to developers in a useful way for reuse. The envisioned theories and solutions approaches will analyze existing software development data captured in communication, rather than data that were captured and stored specifically to be reused.

Towards transferring lean software startup practices in software engineering education

In the modern economy, software drives innovation and economic growth. Studies show how software increasingly influences all industry sectors. During the last five decades, software engineering has also changed significantly to advance the development of various types and scales of software products. In this context, Software Engineering Education plays an essential role in keeping students updated with software technologies, processes, and practices popular in industries. In this Ph.D. work, I want to answer the following research questions: (1) To what extent are SE Trends presented in SEE research? (2) What do we know about the Lean Startups paradigm? (3) What is the impact of Lean Startup practices to software engineering students and curriculum? I utilize (1) literature review and (2) Mixed-methods approaches (data and methods triangulation) in gathering empirical evidence. In the first phase of the research, I pinpoint the relevance of Lean Startup within the software engineering education throughout an extensive literature review. I gather empirical evidence on Lean Startup practices and their potential transfer in software engineering education during the second research phase. I demonstrate that Lean Startup is part of the emerging software engineering trends within software engineering education research. I identify the gap of growth phase Lean Startup research in present software paradigms. I demonstrate that students can acquire soft, hard, and project management skills in a more realistic context while introducing growth phase Lean Startup practices throughout external course activities. I expect that the present software engineering curricula can benefit from a model and framework, which I intend to propose, facilitating Lean Startup practice transfer within the software engineering curriculum.

SESSION: Student Research Competition

Attention tracking for developers

For improving speed and quality of development of IT projects it is essential to study how to increase the efficiency of developers. One way to improve the quality of developers' work is to help them concentrate better, timely detect a drop in concentration and brain fatigue, which could be done by controlling the level of attention during programming. This study focuses on the study of the level of attention of programmers using EEG (electroencephalogram), as the most promising and realistic for use in this environment. We found that the level of attention can be determined using the control of alpha and beta waves measured using EEG, as well as specific features of the functional state of the brain compared to another type of mental load - driving.

Impact of programming languages on energy consumption for mobile devices

Mobile devices are part of our life and their energy consumption poses significant limits to their further adoption and usage. In this work, we conduct a meta-analytical review of the impact of programming languages on the energy consumption of such devices. We consider as a null hypothesis that from the perspective of the consumption of energy, there is no difference in writing apps in C, C++, and Java. With the studies at hand, we conclude that we cannot reject such null hypotheses.

Recommender systems: metric suggestion mechanisms applied to adaptable software dashboards

Dashboards are software systems aiming to amplify cognition capitalizing on human perceptual abilities. As such, they have intrinsically a human-centric approach, meaning that their purpose is to support effective decision making. This has played a vital role in their success in business performance management, business intelligence, and internal control. However, as today's business requirements change rapidly and continuously, dashboards containing the same set of metrics throughout quickly become ineffective at conveying important information, especially when used by multiple users. This is one of the reasons for adopting the concept of "precooked" dashboards, i.e., building a default template that is useful to an average user.

Repairing confusion and bias errors for DNN-based image classifiers

Recent works in DNN testing show that DNN based image classifiers are susceptible to confusion and bias errors. A DNN model, even robust trained model can be highly confused between certain pair of objects or highly bias towards some object than others. In this paper, we propose a differentiable distance metric, which is highly correlated with confusion errors. We propose a repairing approach by increasing the distance between two classes during retraining the model to reduce the confusion errors. We evaluate our approaches on both single-label and multi-label classification models and datasets. Our results show that our approach effectively reduce confusion errors with very slight accuracy reduce.

Synthesizing correct code for machine learning programs

Success using machine learning (ML) in numerous fields has created a new class of users, who are not experts in the data science domain but want to use ML as a means to solve their inference problems. Various automatic machine learning (AutoML) approaches attempt to make ML solutions accessible to such users. In this work, we present a system that automatically synthesizes correct code within the context of the user’s data using sketching. In sketching, insight is determined through a partial program; a sketch expresses the high-level structure of implementation but leaves holes in place of the low-level details. We use meta-learning on meta-features to approximately solve holes. We observe that the sketch-based approach is more expressive, easier to implement, and easier to optimize than existing AutoML frameworks. Our initial results are very promising. Our approach uses fewer resources and still produces comparable results to existing techniques.