ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering

SESSION: Continuous integration

Learning-to-rank vs ranking-to-learn: strategies for regression testing in continuous integration

In Continuous Integration (CI), regression testing is constrained by the time between commits. This demands careful selection and/or prioritization of test cases within test suites too large to be run entirely. To this aim, some Machine Learning (ML) techniques have been proposed as an alternative to deterministic approaches. Two broad strategies for ML-based prioritization are learning-to-rank and what we call ranking-to-learn (i.e., reinforcement learning). Various ML algorithms can be applied in each strategy. In this paper we introduce ten such algorithms for adoption in CI practices, and perform a comprehensive study comparing them against each other using subjects from the Apache Commons project. We analyze the influence of several features of the code under test and of the test process. The results allow us to derive criteria that support testers in selecting and tuning the technique that best fits their context.
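To make the two strategies concrete, the following minimal Python sketch illustrates the ranking-to-learn (reinforcement-style) idea: test priorities are updated from the outcome of each CI cycle, so that tests which recently revealed failures rise in the ranking. The class and parameter names are hypothetical, and this is not one of the ten algorithms studied in the paper.

```python
# Minimal sketch of a "ranking-to-learn" (reinforcement-style) prioritizer.
# Hypothetical names; not the algorithms evaluated in the paper.
from collections import defaultdict

class RewardPrioritizer:
    def __init__(self, decay=0.8, reward=1.0):
        self.decay = decay          # how quickly old observations fade
        self.reward = reward        # credit for revealing a failure
        self.priority = defaultdict(float)

    def rank(self, test_ids):
        """Order tests by current priority, highest first."""
        return sorted(test_ids, key=lambda t: self.priority[t], reverse=True)

    def observe(self, results):
        """Update priorities from one CI cycle.

        results: dict mapping test id -> True if the test failed (revealed a bug).
        """
        for test, failed in results.items():
            self.priority[test] *= self.decay
            if failed:
                self.priority[test] += self.reward

# Usage: rank, run the top tests within the CI time budget, then feed back results.
prio = RewardPrioritizer()
order = prio.rank(["t1", "t2", "t3"])
prio.observe({"t1": False, "t2": True, "t3": False})   # t2 failed this cycle
```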

A cost-efficient approach to building in continuous integration

Continuous integration (CI) is a widely used practice in modern software engineering. Unfortunately, it is also an expensive practice --- Google and Mozilla estimate that their CI systems cost millions of dollars. In this paper, we propose a novel approach for reducing the cost of CI. The cost of CI lies in the computing power to run builds, and its value mostly lies in letting developers find bugs early, while they are still small. Thus, we aim to reduce the number of builds that CI executes while still executing as many failing builds as early as possible. To achieve this goal, we propose SmartBuildSkip, a technique which predicts the first builds in a sequence of build failures and the remaining build failures separately. SmartBuildSkip is customizable, allowing developers to select different preferred trade-offs between saving many builds and observing build failures early. We evaluate the motivating hypothesis of SmartBuildSkip, its prediction power, and its cost savings in a realistic scenario. In its most conservative configuration, SmartBuildSkip saved a median of 30% of builds while incurring a median delay of only one build in a median of 15% of failing builds.

Practical fault detection in puppet programs

Puppet is a popular computer system configuration management tool. By providing abstractions that model system resources, it allows administrators to set up computer systems in a reliable, predictable, and documented fashion. Its use suffers from two potential pitfalls. First, if ordering constraints are not correctly specified whenever a Puppet resource depends on another, the non-deterministic application of resources can lead to race conditions and consequent failures. Second, if a service is not tied to its resources (through the notification construct), the system may operate in a stale state whenever a resource gets modified. Such faults can degrade a computing infrastructure's availability and functionality.

We have developed an approach that identifies these issues through the analysis of a Puppet program and its system call trace. Specifically, a formal model for traces allows us to capture the interactions of Puppet resources with the file system. By analyzing these interactions we identify (1) resources that are related to each other (e.g., operate on the same file), and (2) resources that should act as notifiers so that changes are correctly propagated. We then check the relationships from the trace's analysis against the program's dependency graph: a representation containing all the ordering constraints and notifications declared in the program. If a mismatch is detected, our system reports a potential fault.

We have evaluated our method on a large set of popular Puppet modules, and discovered 92 previously unknown issues in 33 modules. Performance benchmarking shows that our approach can analyze, within seconds, real-world configurations comprising thousands of lines of code and millions of system calls.
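As a rough illustration of the mismatch check described above (not the authors' tool, and written in Python rather than against Puppet's own model), the sketch below flags pairs of resources that touch the same file in the system-call trace but have no ordering or notification relationship declared in the dependency graph; the resource and file names are hypothetical.

```python
# Illustrative sketch: report resource pairs that operate on the same file
# but lack a declared ordering/notification relationship.
from itertools import combinations

def missing_orderings(trace, declared_edges):
    """trace: dict mapping resource -> set of file paths it reads/writes.
    declared_edges: set of (before, after) ordering constraints or notifications."""
    declared = set(declared_edges) | {(b, a) for (a, b) in declared_edges}
    faults = []
    for r1, r2 in combinations(trace, 2):
        shared = trace[r1] & trace[r2]
        if shared and (r1, r2) not in declared:
            faults.append((r1, r2, shared))
    return faults

trace = {
    "File[/etc/app.conf]": {"/etc/app.conf"},
    "Service[app]":        {"/etc/app.conf"},   # the service reads the config on start
}
print(missing_orderings(trace, declared_edges=set()))
# -> reports that File[/etc/app.conf] and Service[app] need an ordering/notify relation
```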

Learning from, understanding, and supporting DevOps artifacts for docker

With the growing use of DevOps tools and frameworks, there is an increased need for tools and techniques that support more than code. The current state-of-the-art in static developer assistance for tools like Docker is limited to shallow syntactic validation. We identify three core challenges in the realm of learning from, understanding, and supporting developers writing DevOps artifacts: (i) nested languages in DevOps artifacts, (ii) rule mining, and (iii) the lack of semantic rule-based analysis. To address these challenges we introduce a toolset, binnacle, that enabled us to ingest 900,000 GitHub repositories.

Focusing on Docker, we extracted approximately 178,000 unique Dockerfiles, and also identified a Gold Set of Dockerfiles written by Docker experts. We addressed challenge (i) by reducing the number of effectively uninterpretable nodes in our ASTs by over 80% via a technique we call phased parsing. To address challenge (ii), we introduced a novel rule-mining technique capable of recovering two-thirds of the rules in a benchmark we curated. Through this automated mining, we were able to recover 16 new rules that were not found during manual rule collection. To address challenge (iii), we manually collected a set of rules for Dockerfiles from commits to the files in the Gold Set. These rules encapsulate best practices, avoid docker build failures, and improve image size and build latency. We created an analyzer that used these rules, and found that, on average, Dockerfiles on GitHub violated the rules five times more frequently than the Dockerfiles in our Gold Set. We also found that industrial Dockerfiles fared no better than those sourced from GitHub.

The learned rules and analyzer in binnacle can be used to aid developers in the IDE when creating Dockerfiles, and in a post-hoc fashion to identify issues in, and to improve, existing Dockerfiles.
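The flavor of such rule-based analysis can be sketched in a few lines of Python; the single rule shown (always pass --no-install-recommends to apt-get install to keep images small) is a hypothetical example of a Dockerfile best practice, not one of the rules mined by binnacle.

```python
# Sketch of a rule-based Dockerfile checker in the spirit described above.
# The rule is a hypothetical example, not a mined binnacle rule.
def check_dockerfile(text):
    violations = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.strip().upper().startswith("RUN") and "apt-get install" in line:
            if "--no-install-recommends" not in line:
                violations.append((lineno, "apt-get install without --no-install-recommends"))
    return violations

dockerfile = """FROM ubuntu:18.04
RUN apt-get update && apt-get install -y curl
"""
print(check_dockerfile(dockerfile))
# [(2, 'apt-get install without --no-install-recommends')]
```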

SESSION: Cyber-physical systems

Adapting requirements models to varying environments

The engineering of high-quality software requirements generally relies on properties and assumptions about the environment in which the software-to-be has to operate. Such properties and assumptions, referred to as environment conditions in this paper, are highly subject to change over time or from one software variant to another. As a consequence, the requirements engineered for a specific set of environment conditions may no longer be adequate, complete and consistent for another set.

The paper addresses this problem through a tool-supported requirements adaptation technique. A goal-oriented requirements modelling framework is considered to make requirements' refinements and dependencies on environment conditions explicit. When environment conditions change, an adapted goal model is computed that is correct with respect to the new environment conditions. The space of possible adaptations is not fixed a priori; the required changes are expected to meet one or more environment-independent goal(s) to be satisfied in any version of the system. The adapted goal model is generated using a new counterexample-guided learning procedure that ensures the correctness of the updated goal model, and prefers more local adaptations and more similar goal models.

Comparing formal tools for system design: a judgment study

Formal methods and tools have a long history of successful application in the design of safety-critical railway products. However, most of this experience has focused on applying a single method at a time, and little work has compared the applicability of the different available frameworks to the railway context. As a result, companies willing to introduce formal methods into their development process have little guidance on selecting tools that fit their needs. To address this gap, this paper presents a comparison of 9 different formal tools, namely Atelier B, CADP, FDR4, NuSMV, ProB, Simulink, SPIN, UMC, and UPPAAL SMC. We performed a judgment study involving 17 experts with experience in formal methods applied to railways. In the study, some of the experts were asked to model a railway signaling problem (a moving-block train distancing system) with the different tools and to provide feedback on their experience. The information produced was then synthesized, and the results were validated by the remaining experts. Based on the outcome of this process, we provide a synthesis that describes when to use a certain tool and what problems modelers may face. Our experience shows that the different tools serve different purposes, and multiple formal methods are required to fully cover the needs of the railway system design process.

SESSION: Debugging 1

Debugging inputs

When a program fails to process an input, it need not be the program code that is at fault. It can also be that the input data is faulty, for instance as a result of data corruption. To get the data processed, one then has to debug the input data---that is, (1) identify which parts of the input data prevent processing, and (2) recover as much of the (valuable) input data as possible. In this paper, we present a general-purpose algorithm called ddmax that addresses these problems automatically. ddmax maximizes the subset of the input that can still be processed by the program, thus recovering and repairing as much data as possible; the difference between the original failing input and the "maximized" passing input includes all input fragments that could not be processed. To the best of our knowledge, ddmax is the first approach that fixes faults in the input data without requiring program analysis. In our evaluation, ddmax repaired about 69% of input files and recovered about 78% of data within one minute per input.
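A greatly simplified sketch of the maximization idea, assuming the input can be split into fragments and that the program can cheaply be re-run on a candidate subset; this greedy loop is only an illustration, not the published ddmax algorithm.

```python
# Greedy, fragment-by-fragment input maximization (illustration only).
def maximize_input(fragments, still_processable):
    """Return a subset of `fragments` that the program can still process."""
    kept = []
    for fragment in fragments:
        candidate = kept + [fragment]
        if still_processable(candidate):
            kept = candidate      # fragment is harmless: keep it
        # otherwise the fragment prevents processing and is left out
    return kept

# Toy example: a "parser" that requires every CSV line to have exactly 3 fields.
def processable(lines):
    return all(len(line.split(",")) == 3 for line in lines)

corrupted = ["id,name,score", "1,alice,90", "2,bob", "3,carol,85"]  # "2,bob" is damaged
print(maximize_input(corrupted, processable))
# ['id,name,score', '1,alice,90', '3,carol,85'] -- everything except the damaged line
```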

Causal testing: understanding defects' root causes

Understanding the root cause of a defect is critical to isolating and repairing buggy behavior. We present Causal Testing, a new method of root-cause analysis that relies on the theory of counterfactual causality to identify a set of executions that likely hold key causal information necessary to understand and repair buggy behavior. Using the Defects4J benchmark, we find that Causal Testing could be applied to 71% of real-world defects, and for 77% of those, it can help developers identify the root cause of the defect. A controlled experiment with 37 developers shows that Causal Testing improves participants' ability to identify the cause of the defect from 80% of the time with standard testing tools to 86% of the time with Causal Testing. The participants report that Causal Testing provides useful information they cannot get using tools such as JUnit. Holmes, our prototype, open-source Eclipse plugin implementation of Causal Testing, is available at http://holmes.cs.umass.edu/.

SESSION: Ecosystems and evolution

Impact analysis of cross-project bugs on software ecosystems

Software projects are increasingly forming social-technical ecosystems within which individual projects rely on the infrastructures or functional components provided by other projects, leading to complex inter-dependencies. Through inter-project dependencies, a bug in an upstream project may have a profound impact on a large number of downstream projects, resulting in cross-project bugs. This emerging type of bug brings new challenges in bug fixing because of its unclear influence on downstream projects. In this paper, we present an approach to estimating the impact of a cross-project bug within its ecosystem by identifying the affected downstream modules (classes/methods). Note that a downstream project that uses a buggy upstream function may not be affected if the usage does not satisfy the failure-inducing preconditions. For a reported bug with a known root-cause function and failure-inducing preconditions, we first collect the candidate downstream modules that call the upstream function through an ecosystem-wide dependence analysis. Then, the paths to the call sites of the buggy upstream function are encoded as symbolic constraints. Solving the constraints, together with the failure-inducing preconditions, identifies the affected downstream modules. Our evaluation on 31 existing upstream bugs in the scientific Python ecosystem, covering 121 versions of 22 popular projects (with a total of 16 million LOC), shows that the approach is highly effective: from the 25,490 candidate downstream modules that invoke the buggy upstream functions, it identifies 1,132 modules where the upstream bugs can be triggered, pruning 95.6% of the candidates. The technique has no false negatives and an average false positive rate of 7.9%. Only 49 downstream modules (out of the 1,132 we found) had previously been reported as affected.
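The core check can be illustrated with a toy example using the z3 SMT solver: a downstream module is flagged as affected only if the path condition to the buggy call site is satisfiable together with the bug's failure-inducing precondition. The variable and the constraints below are made up for illustration; this is not the paper's implementation.

```python
# Toy illustration of conjoining a path condition with a failure-inducing
# precondition and checking satisfiability (hypothetical constraints).
from z3 import Int, Solver, And, sat

n = Int("n")   # symbolic value passed by the downstream module to the upstream function

# Path condition for reaching the call site in the downstream module
# (hypothetical: the call happens only when n > 10).
path_condition = n > 10

# Failure-inducing precondition of the upstream bug
# (hypothetical: the upstream function misbehaves only when n is negative).
failure_precondition = n < 0

solver = Solver()
solver.add(And(path_condition, failure_precondition))
affected = solver.check() == sat
print("module affected:", affected)   # False: this caller can never trigger the bug
```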

Taming behavioral backward incompatibilities via cross-project testing and analysis

In modern software development, software libraries play a crucial role in reducing software development effort and improving software quality. However, at the same time, the asynchronous upgrades of software libraries and client software projects often result in incompatibilities between different versions of libraries and client projects. When libraries evolve, it is often very challenging for library developers to maintain backward compatibility and keep all external behavior unchanged, and behavioral backward incompatibilities (BBIs) may occur. In practice, the regression test suites of library projects often fail to detect all BBIs. Therefore, in this paper, we propose DeBBI to detect BBIs via cross-project testing and analysis, i.e., using the test suites of various client projects to detect library BBIs. Since executing all the possible client projects can be extremely time consuming, DeBBI transforms the problem of cross-project BBI detection into a traditional information retrieval (IR) problem so as to execute the client projects with higher probability of detecting BBIs earlier. Furthermore, DeBBI considers project diversity and test relevance information for even faster BBI detection. The experimental results show that DeBBI can reduce the end-to-end testing time for detecting the first and average unique BBIs by 99.1% and 70.8% for JDK compared to naive cross-project BBI detection. Also, DeBBI has been applied to other popular third-party libraries. To date, DeBBI has detected 97 BBI bugs, with 19 already confirmed as previously unknown bugs.

Watchman: monitoring dependency conflicts for Python library ecosystem

The PyPI ecosystem has indexed millions of Python libraries to allow developers to automatically download and install dependencies of their projects based on the specified version constraints. Despite the convenience brought by automation, version constraints in Python projects can easily conflict, resulting in build failures. We refer to such conflicts as Dependency Conflict (DC) issues. Although DC issues are common in Python projects, developers lack tool support to gain comprehensive knowledge for diagnosing the root causes of these issues. In this paper, we conducted an empirical study on 235 real-world DC issues. We studied the manifestation patterns and fixing strategies of these issues and identified several key factors that can lead to DC issues and their regressions. Based on our findings, we designed and implemented Watchman, a technique to continuously monitor dependency conflicts for the PyPI ecosystem. In our evaluation, Watchman analyzed PyPI snapshots between 11 Jul 2019 and 16 Aug 2019, and found 117 potential DC issues. We reported these issues to the developers of the corresponding projects. So far, 63 issues have been confirmed, 38 of which have been quickly fixed by applying our suggested patches.
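The core of a DC issue can be reproduced in a few lines of Python using the `packaging` library that pip itself builds on: two requirements constrain the same package such that no available release satisfies both. The package names, version numbers, and constraints below are made up for illustration; this sketch is not Watchman itself.

```python
# Minimal dependency-conflict check with the `packaging` library (pip install packaging).
from packaging.specifiers import SpecifierSet

def find_conflict(constraints, available_versions):
    """constraints: version-specifier strings placed on the same package.
    available_versions: known releases of that package on PyPI."""
    combined = SpecifierSet("")
    for c in constraints:
        combined &= SpecifierSet(c)
    satisfying = list(combined.filter(available_versions))
    return None if satisfying else combined   # None means some release satisfies all

# Project depends on libA (requires requests>=2.20) and libB (requires requests<2.0).
conflict = find_conflict([">=2.20", "<2.0"], ["1.9.0", "2.19.1", "2.22.0"])
print("conflicting constraints:", conflict)   # no release satisfies both -> DC issue
```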

SESSION: Empirical studies for security

One size does not fit all: a grounded theory and online survey study of developer preferences for security warning types

A wide range of tools exist to assist developers in creating secure software. Many of these tools, such as static analysis engines or security checkers included in compilers, use warnings to communicate security issues to developers. The effectiveness of these tools relies on developers heeding these warnings, and there are many ways in which these warnings could be displayed. Johnson et al. [46] conducted qualitative research and found that warning presentation and integration are major issues. We built on Johnson et al.'s work and examined what developers want from security warnings, including what form they should take and how they should integrate into their workflow and work context. To this end, we conducted a Grounded Theory study with 14 professional software developers and 12 computer science students, as well as a focus group with 7 academic researchers, to gather qualitative insights. To back up the theory developed from the qualitative research, we ran a quantitative survey with 50 professional software developers. Our results show that there is significant heterogeneity amongst developers and that no one warning type is preferred over all others. The context in which the warnings are shown is also highly relevant, indicating that it is likely to be beneficial if IDEs and other development tools become more flexible in their warning interactions with developers. Based on our findings, we provide concrete recommendations both for future research and for how IDEs and other security tools can improve their interaction with developers.

Schrödinger's security: opening the box on app developers' security rationale

Research has established the wide variety of security failures in mobile apps, their consequences, and how app developers introduce or exacerbate them. What is not well known is why developers do so---what is the rationale underpinning the decisions they make which eventually strengthen or weaken app security? This is all the more complicated in modern app development's increasingly diverse demographic: growing numbers of independent, solo, or small team developers who do not have the organizational structures and support that larger software development houses enjoy.

Through two studies, we open the box on developer rationale, by performing a holistic analysis of the rationale underpinning various activities in which app developers engage when developing an app.

The first study does so through a task-based study with app developers (N=44) incorporating six distinct tasks for which this developer demographic must take responsibility: setting up a development environment, reviewing code, seeking help, seeking testers, selecting an advertisement SDK, and software licensing. We found that, while at first glance participants seemed to prioritize security in several activities, only in the code task was such prioritization underpinned by a security rationale, indicating that development behavior perceived to be secure may only be an illusion until the box is opened on its rationale.

The second study confirms these findings through a wider survey of app developers (N=274) investigating to what extent they find the activities of the task-based study to affect their app's security. In line with the task-based study, we found that developers perceived actively writing code and actively using external SDKs as the only security-relevant activities, while similarly disregarding the impact of other activities on app security.

Our results suggest the need for a stronger focus on the tasks and activities surrounding the coding task, all of which need to be underpinned by a security rationale. Without such a holistic focus, developers may write "secure code" but not produce "secure apps".

SESSION: Human practice

How software practitioners use informal local meetups to share software engineering knowledge

Informal technology 'meetups' have become an important aspect of the software development community, engaging many thousands of practitioners on a regular basis. However, although local technology meetups are well-attended by developers, little is known about their motivations for participating, the type or usefulness of information that they acquire, and how local meetups might differ from and complement other available communication channels for software engineering information. We interviewed the leaders of technology-oriented Meetup groups, and collected quantitative information via a survey distributed to participants in technology-oriented groups. Our findings suggest that participants in these groups are primarily experienced software practitioners, who use Meetup to stay abreast of new developments, build local networks, and exchange rich tacit knowledge with peers to improve their practice. We also suggest that face-to-face meetings are useful forums for exchanging tacit knowledge and contextual information needed for software engineering practice.

Predicting developers' negative feelings about code review

During code review, developers critically examine each others' code to improve its quality, share knowledge, and ensure conformance to coding standards. In the process, developers may have negative interpersonal interactions with their peers, which can lead to frustration and stress; these negative interactions may ultimately result in developers abandoning projects. In this mixed-methods study at one company, we surveyed 1,317 developers to characterize the negative experiences and cross-referenced the results with objective data from code review logs to predict these experiences. Our results suggest that such negative experiences, which we call "pushback", are relatively rare in practice, but have negative repercussions when they occur. Our metrics can predict feelings of pushback with high recall but low precision, making them potentially appropriate for highlighting interactions that may benefit from a self-intervention.

SESSION: Web testing

Near-duplicate detection in web app model inference

Automated web testing techniques infer models from a given web app, which are used for test generation. From a testing viewpoint, such an inferred model should contain the minimal set of states that are distinct, yet adequately cover the app's main functionalities. In practice, models inferred automatically are affected by near-duplicates, i.e., replicas of the same functional webpage differing only by small, insignificant changes. We present the first study of near-duplicate detection algorithms used in web app model inference. We first characterize functional near-duplicates by classifying a random sample of state pairs, from 493k pairs of webpages obtained from over 6,000 websites, into three categories, namely clone, near-duplicate, and distinct. We systematically compute thresholds that define the boundaries of these categories for each detection technique. We then use these thresholds to evaluate 10 near-duplicate detection techniques from three different domains, namely information retrieval, web testing, and computer vision, on nine open-source web apps. Our study highlights the challenges posed in automatically inferring a model for any given web app. Our findings show that even with the best thresholds, no algorithm is able to accurately detect all functional near-duplicates within apps without sacrificing coverage.
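To illustrate the threshold-based classification of state pairs, here is a small Python sketch that scores two page states by Jaccard similarity over their word sets and maps the score to clone / near-duplicate / distinct. The similarity function and the threshold values are made-up stand-ins, since the paper derives thresholds separately for each detection technique.

```python
# Threshold-based pair classification (illustrative similarity and thresholds).
def jaccard(page_a, page_b):
    wa, wb = set(page_a.split()), set(page_b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def classify(page_a, page_b, clone_t=0.95, near_dup_t=0.70):
    score = jaccard(page_a, page_b)
    if score >= clone_t:
        return "clone"
    if score >= near_dup_t:
        return "near-duplicate"
    return "distinct"

# Two states of the same functional page, differing only in the user name.
print(classify("welcome alice your cart has 2 items",
               "welcome bob your cart has 2 items"))   # -> near-duplicate (score 0.75)
```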

Extracting taint specifications for JavaScript libraries

Modern JavaScript applications extensively depend on third-party libraries. Especially for the Node.js platform, vulnerabilities can have severe consequences to the security of applications, resulting in, e.g., cross-site scripting and command injection attacks. Existing static analysis tools that have been developed to automatically detect such issues are either too coarse-grained, looking only at package dependency structure while ignoring dataflow, or rely on manually written taint specifications for the most popular libraries to ensure analysis scalability.

In this work, we propose a technique for automatically extracting taint specifications for JavaScript libraries, based on a dynamic analysis that leverages the existing test suites of the libraries and their available clients in the npm repository. Due to the dynamic nature of JavaScript, mapping observations from dynamic analysis to taint specifications that fit into a static analysis is non-trivial. Our main insight is that this challenge can be addressed by a combination of an access path mechanism that identifies entry and exit points, and the use of membranes around the libraries of interest.

We show that our approach is effective at inferring useful taint specifications at scale. Our prototype tool automatically extracts 146 additional taint sinks and 7,840 propagation summaries spanning 1,393 npm modules. By integrating the extracted specifications into a commercial, state-of-the-art static analysis, 136 new alerts are produced, many of which correspond to likely security vulnerabilities. Moreover, many important specifications that were originally manually written are among the ones that our tool can now extract automatically.

SLACC: simion-based language agnostic code clones

Successful cross-language clone detection could enable researchers and developers to create robust language migration tools, facilitate learning additional programming languages once one is mastered, and promote reuse of code snippets over a broader codebase. However, identifying cross-language clones presents special challenges to the clone detection problem. A lack of common underlying representation between arbitrary languages means detecting clones requires one of the following solutions: 1) a static analysis framework replicated across each targeted language with annotations matching language features across all languages, or 2) a dynamic analysis framework that detects clones based on runtime behavior.

In this work, we demonstrate the feasibility of the latter solution, a dynamic analysis approach called SLACC for cross-language clone detection. Like prior clone detection techniques, we use input/output behavior to match clones, though we overcome limitations of prior work by amplifying the number of inputs and covering more data types; as a result, we achieve better clusters than prior attempts. Since clusters are generated based on input/output behavior, SLACC supports cross-language clone detection. As an added challenge, we target a statically typed language, Java, and a dynamically typed language, Python. Compared to HitoshiIO, a recent clone detection tool for Java, SLACC retrieves 6 times as many clusters and has higher precision (86.7% vs. 30.7%).

This is the first work to perform clone detection for dynamically typed languages (precision = 87.3%) and the first to perform clone detection across languages that lack a common underlying representation (precision = 94.1%). It provides a first step towards the larger goal of scalable language migration tools.
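The input/output matching at the heart of the approach can be sketched within a single language; the cross-language machinery, input amplification, and type handling of SLACC are omitted, and the candidate functions below are hypothetical.

```python
# Single-language toy of behavioral clone clustering: functions that produce
# the same outputs on the same inputs are grouped together.
from collections import defaultdict

def cluster_by_io(functions, inputs):
    clusters = defaultdict(list)
    for name, fn in functions.items():
        signature = []
        for x in inputs:
            try:
                signature.append(fn(x))
            except Exception:
                signature.append("<error>")
        clusters[tuple(signature)].append(name)
    return list(clusters.values())

candidates = {
    "square_a": lambda n: n * n,
    "square_b": lambda n: n ** 2,       # behavioral clone of square_a
    "double":   lambda n: n + n,        # distinct behavior
}
print(cluster_by_io(candidates, inputs=[0, 1, 2, 5, -3]))
# [['square_a', 'square_b'], ['double']]
```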

Finding client-side business flow tampering vulnerabilities

The sheer complexity of web applications leaves open a large attack surface of business logic. In particular, in some scenarios, developers have to expose a portion of the logic to the client side in order to coordinate multiple parties (e.g., merchants, client users, and third-party payment services) involved in a business process. However, such client-side code can be tampered with on the fly, leading to business logic perturbations and financial loss. Although developers are familiar with the principle that the client should never be trusted, given the size and complexity of client-side code, which may even be incorporated from third parties, it is extremely challenging to understand and pinpoint such vulnerabilities. To this end, we investigate client-side business flow tampering vulnerabilities and develop a dynamic-analysis-based approach to automatically identify them. We evaluate our technique on 200 popular real-world websites. With negligible overhead, we have successfully identified 27 unique vulnerabilities on 23 websites, such as the New York Times, HBO, and YouTube, where an adversary can interrupt business logic to bypass paywalls, disable adblocker detection, earn reward points illicitly, etc.

SESSION: Analysis for security

Securing unsafe rust programs with XRust

Rust is a promising systems programming language that embraces both high-level memory safety and low-level resource manipulation. However, the dark side of Rust, unsafe Rust, leaves a large security hole as it bypasses the Rust type system in order to support low-level operations. Recently, several real-world memory corruption vulnerabilities have been discovered in Rust's standard libraries.

We present XRust, a new technique that mitigates the security threat of unsafe Rust by ensuring the integrity of data flow from unsafe Rust code to safe Rust code. The cornerstone of XRust is a novel heap allocator that isolates the memory of unsafe Rust from that accessed only in safe Rust, and prevents any cross-region memory corruption. Our design of XRust supports both single- and multi-threaded Rust programs. Our extensive experiments on real-world Rust applications and standard libraries show that XRust is both highly efficient and effective in practice.

Is rust used safely by software developers?

Rust, an emerging programming language with explosive growth, provides a robust type system that enables programmers to write memory-safe and data-race-free code. To allow access to a machine's hardware and to support low-level performance optimizations, a second language, Unsafe Rust, is embedded in Rust. It contains support for operations that are difficult to statically check, such as C-style pointers for access to arbitrary memory locations and mutable global variables. When a program uses these features, the compiler is unable to statically guarantee the safety properties Rust promotes. In this work, we perform a large-scale empirical study to explore how software developers are using Unsafe Rust in real-world Rust libraries and applications. Our results indicate that software engineers use the keyword unsafe in less than 30% of Rust libraries, but more than half cannot be entirely statically checked by the Rust compiler because of Unsafe Rust hidden somewhere in a library's call chain. We conclude that although the use of the keyword unsafe is limited, the propagation of unsafeness offers a challenge to the claim of Rust as a memory-safe language. Furthermore, we recommend changes to the Rust compiler and to the central Rust repository's interface to help Rust software developers be aware of when their Rust code is unsafe.

Burn after reading: a shadow stack with microsecond-level runtime rerandomization for protecting return addresses

Return-oriented programming (ROP) is an effective code-reuse attack in which short code sequences (i.e., gadgets) ending in a ret instruction are found within existing binaries and then executed by taking control of the call stack. The shadow stack, control flow integrity (CFI) and code (re)randomization are three popular techniques for protecting programs against return address overwrites. However, existing runtime rerandomization techniques operate on concrete return addresses, requiring expensive pointer tracking.

By adding one level of indirection, we introduce BarRA, the first shadow stack mechanism that applies continuous runtime rerandomization to abstract return addresses for protecting their corresponding concrete return addresses (protected also by CFI), thus avoiding expensive pointer tracking. As a nice side effect, BarRA naturally combines the shadow stack, CFI and runtime rerandomization in the same framework. The key novelty of BarRA, however, is that once some abstract return addresses are leaked, BarRA will enforce the burn-after-reading property by rerandomizing the mapping from the abstract to the concrete return address space on the order of microseconds, instead of the seconds required for rerandomizing a concrete return address space. As a result, BarRA can be used as a superior replacement for the shadow stack, as demonstrated by comparing both using the 19 C/C++ benchmarks in SPEC CPU2006 (totalling 2,047,447 LOC) and analyzing a proof-of-concept attack, provided that we can tolerate some slight binary code size increases (by an average of 29.44%) and are willing to use 8MB of dedicated memory for holding up to 2^20 return addresses (on a 64-bit platform). Under an information leakage attack (for some return addresses), the shadow stack is always vulnerable but BarRA is significantly more resilient (by reducing an attacker's success rate to 1/2^20 on average). In terms of the average performance overhead introduced, both are comparable: 6.09% (BarRA) vs. 5.38% (the shadow stack).

SAVER: scalable, precise, and safe memory-error repair

We present SAVER, a new memory-error repair technique for C programs. Memory errors such as memory leaks, double-frees, and use-after-frees are highly prevalent, and fixing them requires significant effort. Automated program repair techniques hold the promise of reducing this burden, but the state-of-the-art is still unsatisfactory. In particular, no existing techniques are able to fix those errors in a scalable, precise, and safe way, all of which are required for a truly practical tool. SAVER aims to address these shortcomings. To this end, we propose a method based on a novel representation of the program called the object flow graph, which summarizes the program's heap-related behavior using static analysis. We show that fixing memory errors can be formulated as a graph labeling problem over the object flow graph and present an efficient algorithm. We evaluated SAVER in combination with Infer, an industrial-strength static bug-finder, and show that 74% of the reported errors can be fixed automatically for a range of open-source C programs.

SESSION: Android and web application testing

Revealing injection vulnerabilities by leveraging existing tests

Code injection attacks, like the one used in the high-profile 2017 Equifax breach, have become increasingly common, now ranking #1 on OWASP's list of critical web application vulnerabilities. Static analyses for detecting these vulnerabilities can overwhelm developers with false positive reports. Meanwhile, most dynamic analyses rely on detecting vulnerabilities as they occur in the field, which can introduce a high performance overhead in production code. This paper describes a new approach for detecting injection vulnerabilities in applications by harnessing the combined power of human developers' test suites and automated dynamic analysis. Our new approach, Rivulet, monitors the execution of developer-written functional tests in order to detect information flows that may be vulnerable to attack. Then, Rivulet uses a white-box test generation technique to repurpose those functional tests to check if any vulnerable flow could be exploited. When applied to the version of Apache Struts exploited in the 2017 Equifax attack, Rivulet quickly identifies the vulnerability, leveraging only the tests that existed in Struts at that time. We compared Rivulet to the state-of-the-art static vulnerability detector Julia on benchmarks, finding that Rivulet outperformed Julia in both false positives and false negatives. We also used Rivulet to detect new vulnerabilities.

RoScript: a visual script driven truly non-intrusive robotic testing system for touch screen applications

Existing intrusive test automation techniques for touch screen applications (e.g., Appium and Sikuli) are difficult to apply to many closed or uncommon systems, such as a GoPro. Being non-intrusive can greatly extend the application scope of test automation techniques. To this end, this paper presents RoScript, a truly non-intrusive, test-script-driven robotic testing system for test automation of touch screen applications. RoScript leverages visual test scripts to express GUI actions on a touch screen application and uses a physical robot to drive automated test execution. To reduce the cost of test script creation, RoScript also introduces a non-intrusive, computer-vision-based technique to automatically record touch screen actions into test scripts from videos of human actions on the device under test. RoScript is applicable to touch screen applications running on almost arbitrary platforms, whatever the underlying operating system or GUI framework. We conducted experiments applying it to automate the testing of 21 touch screen applications on 6 different devices. The results show that RoScript is highly usable. In the experiments, it successfully automated 104 test scenarios containing over 650 different GUI actions on the subject applications. RoScript accurately performed GUI actions in over 90% of the test script executions and accurately recorded about 85% of human screen click actions into test code.

Translating video recordings of mobile app usages into replayable scenarios

Screen recordings of mobile applications are easy to obtain and capture a wealth of information pertinent to software developers (e.g., bugs or feature requests), making them a popular mechanism for crowdsourced app feedback. Thus, these videos are becoming a common artifact that developers must manage. In light of unique mobile development constraints, including swift release cycles and rapidly evolving platforms, automated techniques for analyzing all types of rich software artifacts provide benefit to mobile developers. Unfortunately, automatically analyzing screen recordings presents serious challenges, due to their graphical nature, compared to other types of (textual) artifacts. To address these challenges, this paper introduces V2S, a lightweight, automated approach for translating video recordings of Android app usages into replayable scenarios. V2S is based primarily on computer vision techniques and adapts recent solutions for object detection and image classification to detect and classify user actions captured in a video, and convert these into a replayable test scenario. We performed an extensive evaluation of V2S involving 175 videos depicting 3,534 GUI-based actions collected from users exercising features and reproducing bugs from over 80 popular Android apps. Our results illustrate that V2S can accurately replay scenarios from screen recordings, and is capable of reproducing ≈89% of our collected videos with minimal overhead. A case study with three industrial partners illustrates the potential usefulness of V2S from the viewpoint of developers.

Unblind your apps: predicting natural-language labels for mobile GUI components by deep learning

According to the World Health Organization (WHO), an estimated 1.3 billion people live with some form of vision impairment globally, of whom 36 million are blind. Due to their disability, engaging this minority in society is a challenging problem. The recent rise of smart mobile phones provides a new solution by giving blind users convenient access to information and services for understanding the world. Users with vision impairment can adopt the screen reader embedded in mobile operating systems to read the content of each screen within an app, and use gestures to interact with the phone. However, the prerequisite for using screen readers is that developers add natural-language labels to image-based components when they are developing the app. Unfortunately, more than 77% of apps have issues of missing labels, according to our analysis of 10,408 Android apps. Most of these issues are caused by developers' lack of awareness and knowledge in considering this minority. And even if developers want to add labels to UI components, they may not come up with concise and clear descriptions, as most of them have no visual impairment themselves. To overcome these challenges, we develop a deep-learning-based model, called LabelDroid, to automatically predict the labels of image-based buttons by learning from large-scale commercial apps in Google Play. The experimental results show that our model can make accurate predictions and that the generated labels are of higher quality than those from real Android developers.

SESSION: Autonomous driven system

SLEMI: equivalence modulo input (EMI) based mutation of CPS models for finding compiler bugs in Simulink

Finding bugs in commercial cyber-physical system development tools (or "model-based design" tools) such as MathWorks's Simulink is important in practice, as these tools are widely used to generate embedded code that gets deployed in safety-critical applications such as cars and planes. Equivalence Modulo Input (EMI) based mutation is a new twist on differential testing that promises lower use of computational resources and has already been successful at finding bugs in compilers for procedural languages. To provide EMI-based mutation for differential testing of cyber-physical system (CPS) development tools, this paper develops several novel mutation techniques. These techniques deal with CPS language features that are not found in procedural languages, such as an explicit notion of execution time and zombie code, which combines properties of live and dead procedural code. In our experiments the most closely related work (SLforge) found two bugs in the Simulink tool. In comparison, SLEMI found a super-set of issues, including 9 confirmed as bugs by MathWorks Support.

DeepBillboard: systematic physical-world testing of autonomous driving systems

Deep Neural Networks (DNNs) have been widely applied in autonomous systems such as self-driving vehicles. Recently, DNN testing has been intensively studied to automatically generate adversarial examples, which inject small-magnitude perturbations into inputs to test DNNs under extreme situations. While existing testing techniques prove to be effective, particularly for autonomous driving, they mostly focus on generating digital adversarial perturbations, e.g., changing image pixels, which may never happen in the physical world. Thus, there is a critical missing piece in the literature on autonomous driving testing: understanding and exploiting both digital and physical adversarial perturbation generation for impacting steering decisions. In this paper, we propose a systematic physical-world testing approach, named DeepBillboard, targeting a quite common and practical driving scenario: drive-by billboards. DeepBillboard is capable of generating a robust and resilient printable adversarial billboard test, which works under dynamically changing driving conditions including viewing angle, distance, and lighting. The objective is to maximize the possibility, degree, and duration of the steering-angle errors of an autonomous vehicle driving by our generated adversarial billboard. We have extensively evaluated the efficacy and robustness of DeepBillboard by conducting both experiments with digital perturbations and physical-world case studies. The digital experimental results show that DeepBillboard is effective for various steering models and scenes. Furthermore, the physical case studies demonstrate that DeepBillboard is sufficiently robust and resilient for generating physical-world adversarial billboard tests for real-world driving under various weather conditions, inducing average steering-angle errors of up to 26.44 degrees. To the best of our knowledge, this is the first study demonstrating the possibility of generating realistic and continuous physical-world tests for practical autonomous driving systems; moreover, DeepBillboard can be directly generalized to a variety of other physical entities/surfaces along the curbside, e.g., graffiti painted on a wall.

Misbehaviour prediction for autonomous driving systems

Deep Neural Networks (DNNs) are the core component of modern autonomous driving systems. To date, it is still unrealistic to expect that a DNN will generalize correctly to all driving conditions. Current testing techniques consist of offline solutions that identify adversarial or corner cases for improving the training phase.

In this paper, we address the problem of estimating the confidence of DNNs in response to unexpected execution contexts with the purpose of predicting potential safety-critical misbehaviours and enabling online healing of DNN-based vehicles. Our approach SelfOracle is based on a novel concept of self-assessment oracle, which monitors the DNN confidence at runtime, to predict unsupported driving scenarios in advance. SelfOracle uses autoencoder- and time-series-based anomaly detection to reconstruct the driving scenarios seen by the car, and to determine the confidence boundary between normal and unsupported conditions.

In our empirical assessment, we evaluated the effectiveness of different variants of SelfOracle at predicting injected anomalous driving contexts, using DNN models and a simulation environment from Udacity. Results show that, overall, SelfOracle can predict 77% of misbehaviours, up to six seconds in advance, outperforming the online input validation approach of DeepRoad.

Approximation-refinement testing of compute-intensive cyber-physical models: an approach based on system identification

Black-box testing has been extensively applied to test models of cyber-physical systems (CPS), since these models are often not amenable to static and symbolic testing and verification. Black-box testing, however, requires executing the model under test for a large number of candidate test inputs. This poses a challenge for a large and practically important category of CPS models, known as compute-intensive CPS (CI-CPS) models, where a single simulation may take hours to complete. We propose a novel approach, namely ARIsTEO, to enable effective and efficient testing of CI-CPS models. Our approach embeds black-box testing into an iterative approximation-refinement loop. At the start, some sampled inputs and outputs of the CI-CPS model under test are used to generate a surrogate model that is faster to execute and can be subjected to black-box testing. Any failure-revealing test identified for the surrogate model is checked on the original model. If spurious, the test results are used to refine the surrogate model, which is then tested again. Otherwise, the test reveals a valid failure. We evaluated ARIsTEO by comparing it with S-Taliro, an open-source and industry-strength tool for testing CPS models. Our results, obtained on five publicly available CPS models, show that, on average, ARIsTEO is able to find 24% more requirements violations than S-Taliro and is 31% faster than S-Taliro in finding those violations. We further assessed the effectiveness and efficiency of ARIsTEO on a large industrial case study from the satellite domain. In contrast to S-Taliro, ARIsTEO successfully tested two different versions of this model and could identify three requirements violations, requiring four hours, on average, for each violation.
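A schematic Python sketch of the approximation-refinement loop follows, with a plain polynomial fit standing in for the system-identification step and toy stand-ins for the compute-intensive model and the requirement. It illustrates the structure of the loop, not ARIsTEO itself.

```python
# Schematic approximation-refinement loop (illustration only).
import numpy as np

def approximation_refinement(expensive_model, requirement, candidates, degree=2, rounds=10):
    # initial samples of the compute-intensive model (the expensive simulations)
    xs = list(np.linspace(0.0, 10.0, 5))
    ys = [expensive_model(x) for x in xs]

    for _ in range(rounds):
        surrogate = np.poly1d(np.polyfit(xs, ys, degree))   # "system identification" step
        # search for a requirement violation on the cheap surrogate
        failing = next((x for x in candidates if not requirement(surrogate(x))), None)
        if failing is None:
            return None                                     # no violation found
        real_output = expensive_model(failing)              # replay on the real model
        if not requirement(real_output):
            return failing                                  # genuine violation
        xs.append(failing)                                  # spurious: refine the surrogate
        ys.append(real_output)
    return None

# Toy stand-ins: an "expensive" model and the requirement "output stays below 50".
model = lambda x: x ** 2 + 3 * x
violation = approximation_refinement(model, lambda y: y < 50, candidates=np.linspace(0, 10, 101))
print("violating input:", violation)
```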

A comprehensive study of autonomous vehicle bugs

Self-driving cars, or Autonomous Vehicles (AVs), are increasingly becoming an integral part of our daily life. About 50 corporations are actively working on AVs, including large companies such as Google, Ford, and Intel. Some AVs are already operating on public roads, with at least one unfortunate fatality recently on record. As a result, understanding bugs in AVs is critical for ensuring their security, safety, robustness, and correctness. While previous studies have focused on a variety of domains (e.g., numerical software; machine learning; and error-handling, concurrency, and performance bugs) to investigate bug characteristics, AVs have not been studied in a similar manner. Recently, two software systems for AVs, Baidu Apollo and Autoware, have emerged as frontrunners in the open-source community and have been used by large companies and governments (e.g., Lincoln, Volvo, Ford, Intel, Hitachi, LG, and the US Department of Transportation). From these two leading AV software systems, this paper describes our investigation of 16,851 commits and 499 AV bugs and introduces our classification of those bugs into 13 root causes, 20 bug symptoms, and 18 categories of software components those bugs often affect. We identify 16 major findings from our study and draw broader lessons from them to guide the research community towards future directions in software bug detection, localization, and repair.

SESSION: Debugging 2

Studying the use of Java logging utilities in the wild

Software logging is widely used in practice. Logs have been used for a variety of purposes such as debugging, monitoring, security compliance, and business analytics. Instead of directly invoking the standard output functions, developers usually prefer to use logging utilities (LUs) (e.g., SLF4J), which provide additional functionalities like thread safety and verbosity-level support, to instrument their source code. Much of the previous research on software logging has focused on the log printing code. Very few works study the use of LUs, although new LUs are constantly being introduced by companies and researchers. In this paper, we conducted a large-scale empirical study on the use of Java LUs in the wild. We analyzed the use of 3,856 LUs from 11,194 projects on GitHub and found that many projects have complex usage patterns for LUs. For example, 75.8% of the large-sized projects have implemented their own LUs. More than 50% of these projects use at least three LUs. We conducted further qualitative studies to better understand and characterize the complex use of LUs. Our findings show that different LUs are used for a variety of reasons (e.g., internationalization of log messages). Some projects develop their own LUs to satisfy project-specific logging needs (e.g., defining the logging format). Multiple uses of LUs in one project are quite common in large and very-large-sized projects, mainly for purposes like enabling and configuring the logging behavior of imported packages. Interviews with 13 industrial developers showed that our findings also generally hold for industrial projects and were considered very helpful for better configuring and managing the logging behavior of their projects. The findings and implications presented in this paper will be useful for developers and researchers who are interested in developing and maintaining LUs.

SESSION: Human aspects of software engineering 1

A study on the prevalence of human values in software engineering publications, 2015 -- 2018

Failure to account for human values in software (e.g., equality and fairness) can result in user dissatisfaction and negative socio-economic impact. Engineering these values in software, however, requires technical and methodological support throughout the development life cycle. This paper investigates to what extent top Software Engineering (SE) conferences and journals have included research on human values in SE. We investigate the prevalence of human values in recent (2015 -- 2018) publications in these top venues. We classify these publications, based on their relevance to different values, against a widely used value structure adopted from the social sciences. Our results show that: (a) only a small proportion of the publications directly consider values, classified as directly relevant publications; (b) for the majority of the values, very few or no directly relevant publications were found; and (c) the prevalence of directly relevant publications was higher in SE conferences compared to SE journals. This paper shares these and other insights that may motivate future research on human values in software engineering.

Explaining pair programming session dynamics from knowledge gaps

Background: Despite a lot of research on the effectiveness of Pair Programming (PP), the question when it is useful or less useful remains unsettled.

Method: We analyze recordings of many industrial PP sessions with Grounded Theory Methodology and build on prior work that identified various phenomena related to within-session knowledge build-up and transfer. We validate our findings with practitioners.

Result: We identify two fundamentally different types of required knowledge and explain how different constellations of knowledge gaps in these two respects lead to different session dynamics. Gaps in project-specific systems knowledge are more hampering than gaps in general programming knowledge and are dealt with first and foremost in a PP session.

Conclusion: Partner constellations with complementary knowledge make PP a particularly effective practice. In PP sessions, differences in system understanding are more important than differences in general software development knowledge.

Engineering gender-inclusivity into software: ten teams' tales from the trenches

Although the need for gender-inclusivity in software is gaining attention among SE researchers and SE practitioners, and at least one method (GenderMag) has been published to help, little has been reported on how to make such methods work in real-world settings. Real-world teams are ever-mindful of the practicalities of adding new methods on top of their existing processes. For example, how can they keep the time costs viable? How can they maximize impacts of using it? What about controversies that can arise in talking about gender? To find out how software teams "in the trenches" handle these and similar questions, we collected the GenderMag-based processes of 10 real-world software teams---more than 50 people---for periods ranging from 5 months to 3.5 years. We present these teams' insights and experiences in the form of 9 practices, 2 potential pitfalls, and 2 open issues, so as to provide their insights to other real-world software teams trying to engineer gender-inclusivity into their software products.

SESSION: Version control and programming practice

How has forking changed in the last 20 years?: a study of hard forks on GitHub

The notion of forking has changed with the rise of distributed version control systems and social coding environments like GitHub. Traditionally, forking refers to splitting off an independent development branch (which we call hard forks); research on hard forks, conducted mostly in pre-GitHub days, showed that hard forks were often viewed critically, as they may fragment a community. Today, in social coding environments, open-source developers are encouraged to fork a project in order to contribute to the community (which we call social forks), which may also have influenced perceptions and practices around hard forks. To revisit hard forks, we identify, study, and classify 15,306 hard forks on GitHub and interview 18 owners of hard forks or forked repositories. We find that, among other things, hard forks often evolve out of social forks rather than being planned deliberately, and that perceptions about hard forks have indeed changed dramatically: they are now often seen as a positive, noncompetitive alternative to the original project.

SESSION: Android application testing

Multiple-entry testing of Android applications by constructing activity launching contexts

Existing GUI testing approaches for Android apps usually test apps from a single entry. In this way, marginal activities far away from the default entry are difficult to cover. Marginal activities may fail to be launched because they require a great number of activity transitions or involve complex user operations, leading to uneven coverage of activity components. Besides, since the test space of GUI programs is infinite, it is difficult to test activities under complete launching contexts using single-entry testing approaches.

In this paper, we address these issues by constructing activity launching contexts and proposing a multiple-entry testing framework. We perform an inter-procedural, flow-, context-, and path-sensitive analysis to build activity launching models and generate complete launching contexts. By exposing activities and applying static analysis, we can launch activities directly under various contexts without performing long event sequences on the GUI. Besides, to achieve an in-depth exploration, we design an adaptive exploration framework which supports multiple-entry exploration and dynamically assigns weights to entries in each round.

Our approach is implemented in a tool called Fax, with an activity launching strategy Faxla and an exploration strategy Faxex. Experiments on 20 real-world apps show that Faxla can cover 96.4% of activities and successfully launch 60.6% of them, based on which Faxex further achieves a relative 19.7% improvement in method coverage compared with the most popular tool, Monkey. Our tool also performs well in revealing hidden bugs: Fax can trigger over seven hundred unique crashes, including 180 Errors and 539 Warnings, which is significantly more than other tools achieve. Among the 46 bugs reported to developers on GitHub, 33 have been fixed so far.

ComboDroid: generating high-quality test inputs for Android apps via use case combinations

Android apps demand high-quality test inputs, whose generation remains an open challenge. Existing techniques fall short in exploring complex app functionalities reachable only by a long, meaningful, and effective test input. Observing that such test inputs can usually be decomposed into relatively independent short use cases, this paper presents ComboDroid, a fundamentally different Android app testing framework. ComboDroid obtains use cases that manifest a specific app functionality (either manually provided or automatically extracted), and systematically enumerates combinations of use cases, yielding high-quality test inputs.
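As a rough illustration of the combination step, the following Python sketch enumerates ordered combinations of short use cases and concatenates them into longer candidate test inputs; the use-case names and event vocabulary are made up for the example, and ComboDroid's real enumeration is more sophisticated.

```python
from itertools import permutations

# Each use case is a short, relatively independent event sequence.
use_cases = {
    "login":       ["tap_login", "type_user", "type_pass", "tap_submit"],
    "create_note": ["tap_new", "type_text", "tap_save"],
    "share_note":  ["long_press_note", "tap_share", "pick_contact"],
}

def combine(use_cases, length):
    """Systematically enumerate ordered combinations of use cases and
    concatenate them into candidate test inputs."""
    for combo in permutations(use_cases, length):
        events = []
        for name in combo:
            events.extend(use_cases[name])
        yield combo, events

for names, events in combine(use_cases, 2):
    print(" + ".join(names), "->", len(events), "events")
```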

The evaluation results of ComboDroid on real-world apps are encouraging. Our fully automatic variant outperformed the best existing technique, APE, by covering 4.6% more code (APE only outperformed Monkey by 2.1%), and revealed four previously unknown bugs in extensively tested subjects. Our semi-automatic variant boosts manual use cases obtained with little manual labor, achieving coverage comparable to that of a white-box human testing expert (only 3.2% less).

Time-travel testing of Android apps

Android testing tools generate sequences of input events to exercise the state space of the app-under-test. Existing search-based techniques systematically evolve a population of event sequences so as to achieve certain objectives such as maximal code coverage. The hope is that the mutation of fit event sequences leads to the generation of even fitter sequences. However, the evolution of event sequences may be ineffective. Our key insight is that pertinent app states which contributed to the original sequence's fitness may not be reached by a mutated event sequence. The original path through the state space is truncated at the point of mutation.

In this paper, we propose instead to evolve a population of states which can be captured upon discovery and resumed when needed. The hope is that generating events on a fit program state leads to the transition to even fitter states. For instance, we can quickly deprioritize testing the main screen state which is visited by most event sequences, and instead focus our limited resources on testing more interesting states that are otherwise difficult to reach.
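A minimal sketch of the state-population idea, assuming hypothetical `restore`, event-sending, and fitness functions: saved snapshots are kept in a bounded pool ordered by fitness, and fuzzing always resumes from the fittest known state rather than replaying event sequences from scratch.

```python
import heapq
import random

class StatePool:
    """Keep the fittest discovered states; resume the fittest one and fuzz
    events from it, instead of replaying whole event sequences."""
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.heap = []            # (fitness, counter, snapshot), a min-heap
        self.counter = 0

    def add(self, snapshot, fitness):
        heapq.heappush(self.heap, (fitness, self.counter, snapshot))
        self.counter += 1
        if len(self.heap) > self.capacity:
            heapq.heappop(self.heap)   # drop the least fit state

    def fittest(self):
        return max(self.heap)[2] if self.heap else None

def time_travel_loop(pool, restore, send_random_event, fitness_of, budget):
    for _ in range(budget):
        snap = pool.fittest()
        if snap is None:
            break
        restore(snap)                       # travel back to a promising state
        new_snap = send_random_event()      # assumed: returns a new snapshot
        pool.add(new_snap, fitness_of(new_snap))

# toy usage with integers standing in for app-state snapshots
pool = StatePool()
pool.add(0, fitness=0.0)
time_travel_loop(pool,
                 restore=lambda s: None,
                 send_random_event=lambda: random.randint(0, 100),
                 fitness_of=lambda s: s / 100.0,
                 budget=25)
print("best state fitness:", max(pool.heap)[0])
```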

We call our approach time-travel testing because of this ability to travel back to any state observed in the past. We implemented time-travel testing in TimeMachine, a time-travel-enabled version of the successful automated Android testing tool Monkey. In our experiments on a large number of open- and closed-source Android apps, TimeMachine outperforms the state-of-the-art search-based and model-based Android testing tools Sapienz and Stoat, both in terms of coverage achieved and crashes found.

SESSION: Clones and changes

HeteroRefactor: refactoring for heterogeneous computing with FPGA

Heterogeneous computing with field-programmable gate arrays (FPGAs) has demonstrated orders-of-magnitude improvements in computing efficiency for many applications. However, the use of such platforms has so far been limited to a small subset of programmers with specialized hardware knowledge. High-level synthesis (HLS) tools have made significant progress in raising the level of programming abstraction from hardware programming languages to C/C++, but they usually cannot compile and generate accelerators for kernel programs with pointers, memory management, and recursion, and they require manual refactoring to make such programs HLS-compatible. In addition, experts need to provide heavily handcrafted optimizations to improve resource efficiency, which affects the maximum operating frequency, parallelization, and power efficiency.

We propose a new dynamic invariant analysis and automated refactoring technique called HeteroRefactor. First, HeteroRefactor monitors FPGA-specific dynamic invariants---the required bitwidth of integer and floating-point variables, and the size of recursive data structures and stacks. Second, using this knowledge of dynamic invariants, it refactors the kernel to make traditionally HLS-incompatible programs synthesizable and to further optimize the accelerator's resource usage and frequency. Third, to guarantee correctness, it selectively offloads the computation from CPU to FPGA only if an input falls within the dynamic invariant. On average, for a recursive program of 175 LOC, an expert FPGA programmer would need to write 185 additional LOC to implement an HLS-compatible version, whereas HeteroRefactor automates this transformation. Our results on a Xilinx FPGA show that HeteroRefactor minimizes BRAM by 83% and increases frequency by 42% for recursive programs; reduces BRAM by 41% through integer bitwidth reduction; and reduces DSP usage by 50% through floating-point precision tuning.
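As a small illustration of the first kind of dynamic invariant, the sketch below computes the smallest two's-complement bitwidth that covers all values a variable was observed to take during profiling; HeteroRefactor's actual monitoring, and any safety margin it applies, may differ.

```python
def required_bitwidth(values):
    """Smallest two's-complement width that can represent every observed value."""
    width = 1
    for v in values:
        if v >= 0:
            needed = v.bit_length() + 1      # +1 for the sign bit
        else:
            needed = (-v - 1).bit_length() + 1
        width = max(width, needed)
    return width

# Observed values of a kernel variable during profiling runs.
samples = [0, 17, 255, -3, 1023]
print(required_bitwidth(samples))   # -> 11 bits instead of a default 32-bit int
```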

HARP: holistic analysis for refactoring Python-based analytics programs

Modern machine learning programs are often written in Python, with the main computations specified through calls to highly optimized libraries (e.g., TensorFlow, PyTorch). How to maximize the computing efficiency of such programs is essential for many application domains and has drawn much recent attention. This work points out a common limitation in existing efforts: they focus only on the static computation graphs specified by library APIs, leaving the influence of the hosting Python code largely unconsidered. This limitation often causes them to miss the big picture and hence many important optimization opportunities. This work proposes a new approach named HARP to address the problem. HARP enables holistic analysis that spans computation graphs and their hosting Python code. HARP achieves this through a set of novel techniques: analytics-conscious speculative analysis to circumvent Python complexities, a unified representation of augmented computation graphs to capture all dimensions of knowledge related to the holistic analysis, and a conditioned feedback mechanism to allow risk-controlled aggressive analysis. Refactoring based on HARP gives 1.3--3X and 2.07X average speedups on a set of TensorFlow and PyTorch programs.

CC2Vec: distributed representations of code changes

Existing work on software patches often uses features specific to a single task. These approaches often rely on manually identified features, and human effort is required to identify such features for each task. In this work, we propose CC2Vec, a neural network model that learns a representation of code changes guided by their accompanying log messages, which express the semantic intent of the code changes. CC2Vec models the hierarchical structure of a code change with the help of an attention mechanism and uses multiple comparison functions to identify the differences between the removed and added code.

To evaluate if CC2Vec can produce a distributed representation of code changes that is general and useful for multiple tasks on software patches, we use the vectors produced by CC2Vec for three tasks: log message generation, bug fixing patch identification, and just-in-time defect prediction. In all tasks, the models using CC2Vec outperform the state-of-the-art techniques.

SESSION: Contracts

Empirical review of automated analysis tools on 47,587 Ethereum smart contracts

Over the last few years, there has been substantial research on automated analysis, testing, and debugging of Ethereum smart contracts. However, it is not trivial to compare and reproduce that research. To address this, we present an empirical evaluation of 9 state-of-the-art automated analysis tools using two new datasets: (i) a dataset of 69 annotated vulnerable smart contracts that can be used to evaluate the precision of analysis tools; and (ii) a dataset of all the smart contracts in the Ethereum blockchain that have Solidity source code available on Etherscan (a total of 47,518 contracts). The datasets are part of SmartBugs, a new extendable execution framework that we created to facilitate the integration and comparison of multiple analysis tools and the analysis of Ethereum smart contracts. We used SmartBugs to execute the 9 automated analysis tools on the two datasets. In total, we ran 428,337 analyses that took approximately 564 days and 3 hours, making this the largest experimental setup to date both in the number of tools and in execution time. We found that only 42% of the vulnerabilities from our annotated dataset are detected by all the tools, with the tool Mythril having the highest accuracy (27%). On the larger dataset, we observed that 97% of contracts are tagged as vulnerable, suggesting a considerable number of false positives. Indeed, only a small number of vulnerabilities (from only two categories) were detected simultaneously by four or more tools.

Gap between theory and practice: an empirical study of security patches in solidity

Ethereum, one of the most popular blockchain platforms, provides financial transactions like payments and auctions through smart contracts. Due to the immense interest in smart contracts in academia, the research community on smart contract security has made significant progress recently. Researchers have reported various security vulnerabilities in smart contracts and developed static analysis tools and verification frameworks to detect them. However, it is unclear whether these great efforts from academia have indeed enhanced the security of smart contracts in reality.

To understand the security level of smart contracts in the wild, we empirically studied 55,046 real-world Ethereum smart contracts written in Solidity, the most popular programming language used by Ethereum smart contract developers. We first examined how many well-known vulnerabilities the Solidity compiler has patched, and how frequently the Solidity team publishes compiler releases. Unfortunately, we observed that many known vulnerabilities are not yet patched, and some patches are not even sufficient to avoid their target vulnerabilities. Subsequently, we investigated whether smart contract developers use the most recent compiler with vulnerabilities patched. We found that developers of more than 98% of real-world Solidity contracts still use older compilers without vulnerability patches, and that more than 25% of the contracts are potentially vulnerable due to the missing security patches. To understand the actual impact of the missing patches, we manually investigated potentially vulnerable contracts detected by our static analyzer and identified common mistakes by Solidity developers that may cause serious security issues such as financial loss. We detected hundreds of vulnerable contracts, and about one fourth of them are used by thousands of people. We recommend that the Solidity team issue patches that correctly resolve known vulnerabilities, and that developers use the latest Solidity compiler to avoid missing security patches.
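One simple way to approximate the "outdated compiler" check is to scan the `pragma solidity` directive of each contract, as in the hedged Python sketch below; the version threshold is illustrative, and the study's static analyzer does considerably more than this.

```python
import re

# Illustrative threshold: treat anything older than 0.5.0 as missing many
# later security patches (the real cut-off depends on the vulnerability).
MIN_SAFE = (0, 5, 0)

PRAGMA = re.compile(r"pragma\s+solidity\s*[\^>=<~]*\s*(\d+)\.(\d+)\.(\d+)")

def pinned_version(source):
    """Return the compiler version pinned by the contract, if any."""
    m = PRAGMA.search(source)
    return tuple(int(g) for g in m.groups()) if m else None

def is_potentially_stale(source):
    v = pinned_version(source)
    return v is not None and v < MIN_SAFE

contract = "pragma solidity ^0.4.24;\ncontract Wallet { /* ... */ }"
print(pinned_version(contract), is_potentially_stale(contract))  # (0, 4, 24) True
```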

SESSION: Defect prediction

An investigation of cross-project learning in online just-in-time software defect prediction

Just-In-Time Software Defect Prediction (JIT-SDP) is concerned with predicting whether software changes are defect-inducing or clean based on machine learning classifiers. Building such classifiers requires a sufficient amount of training data, which is not available at the beginning of a software project. Cross-Project (CP) JIT-SDP can overcome this issue by using data from other projects to build the classifier, achieving similar (though not better) predictive performance to classifiers trained on Within-Project (WP) data. However, such approaches have never been investigated in realistic online learning scenarios, where WP software changes arrive continuously over time and can be used to update the classifiers. It is unknown to what extent CP data can be helpful in such a situation. In particular, it is unknown whether CP data are only useful during the very initial phase of the project, when there is little WP data, or whether they can be helpful for extended periods of time. This work thus provides the first investigation of when and to what extent CP data are useful for JIT-SDP in a realistic online learning scenario. For that, we develop three different CP JIT-SDP approaches that can operate in online mode and be updated with both incoming CP and WP training examples over time. We also collect 2,048 commits from three software repositories developed by a software company over the course of 9 to 10 months, and use 198,468 commits from 10 active open source GitHub projects developed over the course of 6 to 14 years. The study shows that training classifiers with incoming CP+WP data can lead to improvements in G-mean of up to 53.90% compared to classifiers using only WP data at the initial stage of the projects. For the open source projects, which have been running for longer periods of time, using CP data to supplement WP data also helped the classifiers to reduce or prevent large drops in predictive performance that may occur over time, leading to up to around 40% better G-mean during such periods. Such use of CP data was shown to be beneficial even after a large amount of WP data had been received, leading to overall G-means up to 18.5% better than those of WP classifiers.
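For reference, G-mean here is the geometric mean of the recalls on the defect-inducing and clean classes, a metric that is robust to the class imbalance typical of JIT-SDP streams; a small Python sketch with toy labels follows.

```python
from math import sqrt

def g_mean(y_true, y_pred):
    """Geometric mean of the recalls on the defect-inducing (positive) and
    clean (negative) classes."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall_pos = tp / (tp + fn) if tp + fn else 0.0
    recall_neg = tn / (tn + fp) if tn + fp else 0.0
    return sqrt(recall_pos * recall_neg)

# toy example: 3 defect-inducing and 3 clean changes
print(round(g_mean([1, 1, 0, 0, 0, 1], [1, 0, 0, 0, 1, 1]), 3))  # -> 0.667
```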

Understanding the automated parameter optimization on transfer learning for cross-project defect prediction: an empirical study

Data-driven defect prediction has become increasingly important in the software engineering process. Since it is not uncommon that data from a software project are insufficient for training a reliable defect prediction model, transfer learning that borrows data/knowledge from other projects to facilitate model building at the current project, namely cross-project defect prediction (CPDP), is naturally plausible. Most CPDP techniques involve two major steps, i.e., transfer learning and classification, each of which has at least one parameter to be tuned to achieve optimal performance. This practice fits well with the purpose of automated parameter optimization. However, there is a lack of thorough understanding of the impacts of automated parameter optimization on various CPDP techniques. In this paper, we present the first empirical study that looks into such impacts on 62 CPDP techniques, 13 of which are chosen from the existing CPDP literature while the other 49 have not been explored before. We build defect prediction models over 20 real-world software projects that are of different scales and characteristics. Our findings demonstrate that: (1) automated parameter optimization substantially improves the defect prediction performance of 77% of the CPDP techniques with a manageable computational cost, so more effort on this aspect is required in future CPDP studies; (2) transfer learning is of ultimate importance in CPDP: given a tight computational budget, it is more cost-effective to focus on optimizing the parameter configuration of transfer learning algorithms; (3) the research on CPDP is far from mature, as it is 'not difficult' to find a better alternative by combining existing transfer learning and classification techniques. This finding provides important insights for the future design of CPDP techniques.

Software visualization and deep transfer learning for effective software defect prediction

Software defect prediction aims to automatically locate defective code modules to better focus testing resources and human effort. Typically, software defect prediction pipelines comprise two parts: the first extracts program features, like abstract syntax trees, using external tools, and the second applies machine learning-based classification models to those features in order to predict defective modules. Since such approaches depend on specific feature extraction tools, the machine learning classifiers have to be custom-tailored to effectively build the most accurate models.

To bridge the gap between deep learning and defect prediction, we propose an end-to-end framework that can directly produce prediction results for programs without using feature-extraction tools. To that end, we first visualize programs as images, apply a self-attention mechanism to extract image features, use transfer learning to reduce the difference in sample distributions between projects, and finally feed the image files into a pre-trained deep learning model for defect prediction. Experiments with 10 open source projects from the PROMISE dataset show that our method can improve both cross-project and within-project defect prediction. Our code and data pointers are available at https://zenodo.org/record/3373409#.XV0Oy5Mza35.
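A minimal sketch of the "program as image" step under simple assumptions: source bytes are mapped to a fixed-size grayscale matrix so that every program produces the same input shape for the downstream model. The paper's actual visualization may encode programs differently.

```python
import numpy as np

def code_to_image(source, width=64, height=64):
    """Map source-code bytes to a fixed-size grayscale image (values 0-255),
    padding or truncating so every program yields the same input shape."""
    data = source.encode("utf-8")[: width * height]
    pixels = np.zeros(width * height, dtype=np.uint8)
    pixels[: len(data)] = list(data)
    return pixels.reshape(height, width)

img = code_to_image("public int add(int a, int b) { return a + b; }")
print(img.shape, img.dtype)   # (64, 64) uint8 -- ready for an image-based model
```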

SESSION: Human aspects of software engineering 2

Software documentation: the practitioners' perspective

In theory, (good) documentation is an invaluable asset to any software project, as it helps stakeholders to use, understand, maintain, and evolve a system. In practice, however, documentation is generally affected by numerous shortcomings and issues, such as insufficient and inadequate content and obsolete, ambiguous information. To counter this, researchers are investigating the development of advanced recommender systems that automatically suggest high-quality documentation, useful for a given task. A crucial first step is to understand what quality means for practitioners and what information is actually needed for specific tasks.

We present two surveys performed with 146 practitioners to investigate (i) the documentation issues they perceive as most relevant, together with the solutions they apply when these issues arise; and (ii) the types of documentation considered important in different tasks. Our findings can help researchers in designing the next generation of documentation recommender systems.

SESSION: Program repair

DLFix: context-based code transformation learning for automated program repair

Automated Program Repair (APR) is very useful in helping developers during software development and maintenance. Despite recent advances in deep learning (DL), DL-based APR approaches still have limitations in learning bug-fixing code changes and the context of the source code surrounding those changes. These limitations lead to incorrect fix locations or fixes. In this paper, we introduce DLFix, a two-tier DL model that treats APR as code transformation learning from prior bug fixes and the code contexts surrounding the fixes. The first layer is a tree-based RNN model that learns the contexts of bug fixes; its output is used as an additional weighting input for the second layer, which is designed to learn the bug-fixing code transformations.

We conducted several experiments to evaluate DLFix on two benchmarks, Defects4J and Bugs.jar, and a newly built bug dataset with a total of more than 20K real-world bugs in eight projects. We compared DLFix against a total of 13 state-of-the-art pattern-based APR tools. Our results show that DLFix can auto-fix more bugs than 11 of them, and is comparable and complementary to the top two pattern-based APR tools, fixing 7 and 11 unique bugs, respectively, that they cannot. Importantly, DLFix is fully automated and data-driven, and does not require hard-coding of bug-fixing patterns as those tools do. We also compared DLFix against 4 state-of-the-art deep-learning-based APR models; DLFix is able to fix 2.5 times more bugs than the best-performing baseline.

On the efficiency of test suite based program repair: a systematic assessment of 16 automated repair systems for Java programs

Test-based automated program repair has been a prolific field of research in software engineering in the last decade. Many approaches have indeed been proposed, which leverage test suites as a weak, but affordable, approximation of program specifications. Although the literature regularly sets new records on the number of benchmark bugs that can be fixed, several studies increasingly raise concerns about the limitations and biases of state-of-the-art approaches. For example, the correctness of generated patches has been questioned in a number of studies, while other researchers have pointed out that evaluation schemes may be misleading with respect to the processing of fault localization results. Nevertheless, there is little work addressing the efficiency of patch generation, with regard to the practicality of program repair. In this paper, we fill this gap in the literature by providing an extensive review of the efficiency of test suite based program repair. Our objective is to assess the number of generated patch candidates, since this information is correlated with (1) the strategy for traversing the search space efficiently in order to select sensible repair attempts, (2) the strategy for minimizing the test effort needed to identify a plausible patch, and (3) the strategy for prioritizing the generation of a correct patch. To that end, we perform a large-scale empirical study of the efficiency, in terms of the quantity of generated patch candidates, of 16 open-source repair tools for Java programs. The experiments are carefully conducted under the same fault localization configurations to limit biases. Among other findings, we note that: (1) many irrelevant patch candidates are generated by changing wrong code locations; (2) however, if the search space is carefully triaged, fault localization noise has little impact on patch generation efficiency; (3) yet, current template-based repair systems, which are known to be most effective in fixing a large number of bugs, are actually the least efficient, as they tend to generate mostly irrelevant patch candidates.

SESSION: Requirement discovery

Caspar: extracting and synthesizing user stories of problems from app reviews

A user's review of an app often describes the user's interactions with the app. These interactions, which we interpret as mini stories, are prominent in reviews with negative ratings. In general, a story in an app review contains at least two types of events: user actions and associated app behaviors. Being able to identify such stories would enable an app's developer to better maintain and improve the app's functionality and enhance the user experience.

We present Caspar, a method for extracting and synthesizing user-reported mini stories regarding app problems from reviews. By extending and applying natural language processing techniques, Caspar extracts ordered events from app reviews, classifies them as user actions or app problems, and synthesizes action-problem pairs. Our evaluation shows that Caspar is effective in finding action-problem pairs from reviews. First, Caspar classifies the events with an accuracy of 82.0% on manually labeled data. Second, relative to human evaluators, Caspar extracts event pairs with 92.9% precision and 34.2% recall. In addition, we train an inference model on the extracted action-problem pairs that automatically predicts possible app problems for different use cases. Preliminary evaluation shows that our method yields promising results. Caspar illustrates the potential for a deeper understanding of app reviews and possibly other natural language artifacts arising in software engineering.

Detection of hidden feature requests from massive chat messages via deep siamese network

Online chatting is gaining popularity and plays an increasingly significant role in software development. When discussing functionalities, developers might reveal their desired features to other developers. Automated mining techniques for retrieving feature requests from massive chat messages can benefit the requirements gathering process. But performing such mining is quite challenging, because detecting feature requests from dialogs requires a thorough understanding of the contextual information, and annotating feature-request dialogs for learning is extremely expensive. To bridge that gap, we recast the traditional text classification task of mapping a single dialog to its class into the task of determining whether two dialogs are similar, by incorporating few-shot learning. We propose a novel approach, named FRMiner, which can detect feature-request dialogs in chat messages via a deep Siamese network. We design a BiLSTM-based dialog model that can learn the contextual information of a dialog in both forward and reverse directions. Evaluation on real-world projects shows that our approach achieves average precision, recall and F1-score of 88.52%, 88.50% and 88.51%, which confirms that it can effectively detect hidden feature requests in chat messages and thus facilitate gathering comprehensive requirements from the crowd in an automated way.

SESSION: Cognition

A tale from the trenches: cognitive biases and software development

Cognitive biases are hard-wired behaviors that influence developer actions and can set them on an incorrect course of action, necessitating backtracking. While researchers have found that cognitive biases occur in development tasks in controlled lab studies, we still do not know how these biases affect developers' everyday behavior. Without such an understanding, development tools and practices remain inadequate. To close this gap, we conducted a 2-part field study to examine the extent to which cognitive biases occur, the consequences of these biases on developer behavior, and the practices and tools that developers use to deal with them. About 70% of observed actions that were reversed were associated with at least one cognitive bias. Further, even though developers recognized that biases frequently occur, they are routinely forced to deal with such issues through ad hoc processes and sub-optimal tool support. As one participant (IP12) lamented: "There is no salvation!"

Recognizing developers' emotions while programming

Developers experience a wide range of emotions during programming tasks, which may have an impact on job performance. In this paper, we present an empirical study aimed at (i) investigating the link between emotion and progress, (ii) understanding the triggers for developers' emotions and the strategies to deal with negative ones, (iii) identifying the minimal set of non-invasive biometric sensors for emotion recognition during programming tasks. Results confirm previous findings about the relation between emotions and perceived productivity. Furthermore, we show that developers' emotions can be reliably recognized using only a wristband capturing the electrodermal activity and heart-related metrics.

Neurological divide: an fMRI study of prose and code writing

Software engineering involves writing new code or editing existing code. Recent efforts have investigated the neural processes associated with reading and comprehending code --- however, we lack a thorough understanding of the human cognitive processes underlying code writing. While prose reading and writing have been studied thoroughly, that same scrutiny has not been applied to code writing. In this paper, we leverage functional brain imaging to investigate neural representations of code writing in comparison to prose writing. We present the first human study in which participants wrote code and prose while undergoing a functional magnetic resonance imaging (fMRI) brain scan, making use of a full-sized fMRI-safe QWERTY keyboard.

We find that code writing and prose writing are significantly dissimilar neural tasks. While prose writing entails significant left hemisphere activity associated with language, code writing involves more activations of the right hemisphere, including regions associated with attention control, working memory, planning and spatial cognition. These findings are unlike existing work in which code and prose comprehension were studied. By contrast, we present the first evidence suggesting that code and prose writing are quite dissimilar at the neural level.

Here we go again: why is it difficult for developers to learn another programming language?

Once a programmer knows one language, they can leverage concepts and knowledge already learned and easily pick up another programming language. But is that always the case? To understand whether programmers have difficulty learning additional programming languages, we conducted an empirical study of Stack Overflow questions across 18 different programming languages. We hypothesized that previous knowledge could potentially interfere with learning a new programming language. From our inspection of 450 Stack Overflow questions, we found 276 instances of interference that occurred due to faulty assumptions originating from knowledge about a different language. To understand why these difficulties occurred, we conducted semi-structured interviews with 16 professional programmers. The interviews revealed that programmers make failed attempts to relate a new programming language to what they already know. Our findings inform design implications for technical authors, toolsmiths, and language designers, such as designing documentation and automated tools that reduce interference, anticipating uncommon language transitions during language design, and welcoming programmers not just into a language, but into its entire ecosystem.

SESSION: Deep learning testing and debugging 1

Importance-driven deep learning system testing

Deep Learning (DL) systems are key enablers for engineering intelligent applications due to their ability to solve complex tasks such as image recognition and machine translation. Nevertheless, using DL systems in safety- and security-critical applications requires providing testing evidence for their dependable operation. Recent research in this direction focuses on adapting testing criteria from traditional software engineering as a means of increasing confidence in their correct behaviour. However, such criteria are inadequate in capturing the intrinsic properties exhibited by these systems. We bridge this gap by introducing DeepImportance, a systematic testing methodology accompanied by an Importance-Driven (IDC) test adequacy criterion for DL systems. Applying IDC enables us to establish a layer-wise functional understanding of the importance of DL system components and to use this information to assess the semantic diversity of a test set. Our empirical evaluation on several DL systems, across multiple DL datasets and with state-of-the-art adversarial generation techniques, demonstrates the usefulness and effectiveness of DeepImportance and its ability to support the engineering of more robust DL systems.

ReluDiff: differential verification of deep neural networks

As deep neural networks are increasingly deployed in practice, their efficiency has become an important issue. While there are compression techniques for reducing a network's size, energy consumption, and computational requirements, they only demonstrate empirically that there is no loss of accuracy and lack formal guarantees about the compressed network, e.g., in the presence of adversarial examples. Existing verification techniques such as Reluplex, ReluVal, and DeepPoly provide formal guarantees, but they are designed for analyzing a single network rather than the relationship between two networks. To fill this gap, we develop a new method for differential verification of two closely related networks. Our method consists of a fast but approximate forward interval analysis pass followed by a backward pass that iteratively refines the approximation until the desired property is verified. We have two main innovations. During the forward pass, we exploit structural and behavioral similarities of the two networks to more accurately bound the difference between their output neurons. Then, in the backward pass, we leverage the gradient differences to more accurately compute the most beneficial refinement. Our experiments show that, compared to state-of-the-art verification tools, our method can achieve orders-of-magnitude speedup and prove many more properties than existing tools.

Dissector: input validation for deep learning applications by crossing-layer dissection

Deep learning (DL) applications are becoming increasingly popular. Their reliability largely depends on the performance of the DL models integrated in these applications as a central classification module. Traditional techniques need to retrain the models or rebuild and redeploy the applications to cope with unexpected conditions beyond the models' handling capabilities. In this paper, we take a fault tolerance approach, Dissector, to distinguishing inputs that represent unexpected conditions (beyond-inputs) from normal inputs that are still within the models' handling capabilities (within-inputs), thus keeping the applications functioning with their expected reliability. The key insight of Dissector is that a DL model should interpret a within-input with increasing confidence, while a beyond-input would probably cause confused guesses in the prediction process. Dissector works in an application-specific way, adaptive to the DL models used in applications, and extremely efficiently, scaling to large datasets from complex scenarios. The experimental evaluation shows that Dissector outperformed state-of-the-art techniques in effectiveness (AUC: avg. 0.8935 and up to 0.9894) and efficiency (runtime overhead: only 3.3--5.8 milliseconds). Besides, it also exhibited encouraging usefulness in defending against adversarial inputs (AUC: avg. 0.9983) and in improving a DL model's actual accuracy in use (up to 16% for CIFAR-100 and 20% for ImageNet).
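The core intuition can be sketched as follows: attach probes to intermediate layers, track the confidence each probe assigns to the final predicted label, and score how consistently that confidence grows. The scoring function below is an illustrative stand-in, not Dissector's actual validity measure.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def confidence_profile(layer_logits, predicted_label):
    """Confidence assigned to the final predicted label by probes attached
    to successive intermediate layers (one logit vector per layer)."""
    return [softmax(l)[predicted_label] for l in layer_logits]

def within_input_score(profile):
    """Illustrative validity score: reward profiles that grow monotonically;
    confused, oscillating profiles suggest a beyond-input."""
    gains = [b - a for a, b in zip(profile, profile[1:])]
    return sum(g for g in gains if g > 0) / (sum(abs(g) for g in gains) + 1e-9)

# toy example: probes at 3 layers, 4 classes, predicted label 2
layers = [np.array([0.1, 0.2, 0.5, 0.1]),
          np.array([0.0, 0.1, 1.5, 0.2]),
          np.array([0.0, 0.0, 3.0, 0.1])]
profile = confidence_profile(layers, predicted_label=2)
print([round(p, 2) for p in profile], round(within_input_score(profile), 2))
```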

Towards characterizing adversarial defects of deep learning software from the lens of uncertainty

Over the past decade, deep learning (DL) has been successfully applied to many industrial, domain-specific tasks. However, current state-of-the-art DL software still suffers from quality issues, which raises great concern, especially in safety- and security-critical scenarios. Adversarial examples (AEs), on which a DL system makes incorrect decisions, represent a typical and important type of defect that needs to be urgently addressed. Such defects occur through either intentional attacks or physical-world noise perceived by input sensors, potentially hindering further industrial deployment. The intrinsic uncertainty of deep learning decisions can be a fundamental reason for such incorrect behavior. Although some testing, adversarial attack, and defense techniques have been proposed recently, a systematic study of the relationship between AEs and DL uncertainty is still lacking.

In this paper, we conduct a large-scale study towards bridging this gap. We first investigate the capability of multiple uncertainty metrics to differentiate benign examples (BEs) and AEs, which enables us to characterize the uncertainty patterns of input data. Then, we identify and categorize the uncertainty patterns of BEs and AEs, and find that while BEs and AEs generated by existing methods do follow common uncertainty patterns, some other uncertainty patterns are largely missed. Based on this, we propose an automated testing technique to generate multiple types of uncommon AEs and BEs that are largely missed by existing techniques. Our further evaluation reveals that the uncommon data generated by our method are hard to defend against with existing defense techniques, whose average defense success rate is reduced by 35%. Our results call for attention to, and the necessity of, generating more diverse data for evaluating quality assurance solutions for DL software.
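Two uncertainty metrics commonly used in this setting are the predictive entropy of a single softmax output and the variance of the predicted-class probability across Monte-Carlo dropout passes; the sketch below shows both on toy numbers and is only meant to illustrate the kind of metric studied, not the paper's exact metric set.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of a single softmax output; higher means more uncertain."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def mc_dropout_uncertainty(prob_samples):
    """Variance of the predicted-class probability across stochastic
    forward passes (Monte-Carlo dropout)."""
    samples = np.asarray(prob_samples)        # shape: (n_passes, n_classes)
    mean = samples.mean(axis=0)
    top = int(mean.argmax())
    return float(samples[:, top].var())

benign = np.array([0.96, 0.02, 0.02])
adv    = np.array([0.40, 0.35, 0.25])
print(predictive_entropy(benign), predictive_entropy(adv))   # low vs. high entropy

passes = [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05], [0.92, 0.05, 0.03]]
print(mc_dropout_uncertainty(passes))
```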

SESSION: Fuzzing 1

Gang of eight: a defect taxonomy for infrastructure as code scripts

Defects in infrastructure as code (IaC) scripts can have serious consequences, for example, causing large-scale system outages. A taxonomy of IaC defects can be useful for understanding the nature of defects and identifying activities needed to fix and prevent defects in IaC scripts. The goal of this paper is to help practitioners improve the quality of IaC scripts by developing a defect taxonomy for IaC scripts through qualitative analysis. We develop the taxonomy by applying qualitative analysis to 1,448 defect-related commits collected from open source software (OSS) repositories of the OpenStack organization. We conduct a survey with 66 practitioners to assess whether they agree with the defect categories included in our taxonomy. We quantify the frequency of the identified defect categories by analyzing 80,425 commits collected from 291 OSS repositories spanning 2005 to 2019.

Our defect taxonomy for IaC consists of eight categories, including a category specific to IaC called idempotency (i.e., defects that lead to incorrect system provisioning when the same IaC script is executed multiple times). The idempotency category is the one the 66 surveyed practitioners agree with most. The most frequent defect category is configuration data, i.e., providing erroneous configuration data in IaC scripts. Our taxonomy and the quantified frequency of the defect categories may help advance the science of IaC script quality.

MemLock: memory usage guided fuzzing

Uncontrolled memory consumption is a critical kind of software security weakness. It can become a security-critical vulnerability when attackers can control the input to consume a large amount of memory and launch a Denial-of-Service attack. However, detecting such vulnerabilities is challenging, as state-of-the-art fuzzing techniques focus on code coverage rather than memory consumption. To this end, we propose a memory-usage-guided fuzzing technique, named MemLock, to generate inputs with excessive memory consumption and trigger uncontrolled-memory-consumption bugs. The fuzzing process is guided by memory consumption information, so our approach is general and does not require any domain knowledge. We perform a thorough evaluation of MemLock on 14 widely used real-world programs. Our experimental results show that MemLock substantially outperforms state-of-the-art fuzzing techniques, including AFL, AFLfast, PerfFuzz, FairFuzz, Angora and QSYM, in discovering memory consumption bugs. During the experiments, we discovered many previously unknown memory consumption bugs and received 15 new CVEs.
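A hedged sketch of the guidance idea: in addition to coverage feedback, the fuzzer also keeps a mutated input whenever it raises the observed peak memory, so the corpus gradually drifts toward memory-hungry inputs. The seed-keeping rule and the `run_target` instrumentation interface below are assumptions for illustration, not MemLock's implementation.

```python
import random

def mutate(data):
    """Trivial byte-level mutation for illustration."""
    if not data:
        return bytes([random.randrange(256)])
    i = random.randrange(len(data))
    return data[:i] + bytes([random.randrange(256)]) + data[i + 1:]

def fuzz(initial_seeds, run_target, iterations=1000):
    """Greybox loop that keeps a seed if it covers new branches OR raises the
    observed peak memory consumption (the memory-usage guidance)."""
    seeds = list(initial_seeds)
    seen_branches = set()
    peak_memory = 0
    for _ in range(iterations):
        parent = random.choice(seeds)
        child = mutate(parent)
        branches, mem_used = run_target(child)   # assumed instrumentation output
        if branches - seen_branches or mem_used > peak_memory:
            seeds.append(child)
            seen_branches |= branches
            peak_memory = max(peak_memory, mem_used)
    return seeds, peak_memory
```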

sFuzz: an efficient adaptive fuzzer for solidity smart contracts

Smart contracts are Turing-complete programs that execute on the blockchain infrastructure and often manage valuable digital assets. Solidity is one of the most popular programming languages for writing smart contracts on the Ethereum platform. Like traditional programs, smart contracts may contain vulnerabilities. Unlike traditional programs, smart contracts cannot be easily patched once they are deployed. It is thus important that smart contracts are tested thoroughly before deployment. In this work, we present an adaptive fuzzer for smart contracts on the Ethereum platform called sFuzz. Compared to existing Solidity fuzzers, sFuzz combines the strategy of the AFL fuzzer with an efficient, lightweight, multi-objective adaptive strategy targeting hard-to-cover branches. sFuzz has been applied to more than 4,000 smart contracts, and the experimental results show that (1) sFuzz is efficient, e.g., two orders of magnitude faster than state-of-the-art tools; (2) sFuzz is effective in achieving high code coverage and discovering vulnerabilities; and (3) the different fuzzing strategies in sFuzz complement each other.

Targeted greybox fuzzing with static lookahead analysis

Automatic test generation typically aims to generate inputs that explore new paths in the program under test in order to find bugs. Existing work has, therefore, focused on guiding the exploration toward program parts that are more likely to contain bugs by using an offline static analysis.

In this paper, we introduce a novel technique for targeted greybox fuzzing that uses an online static analysis to guide the fuzzer toward a set of target locations, for instance those located in recently modified parts of the program. This is achieved by first semantically analyzing each program path explored by an input in the fuzzer's test suite. The results of this analysis are then used to control the fuzzer's specialized power schedule, which determines how often to fuzz inputs from the test suite. We implemented our technique by extending a state-of-the-art industrial fuzzer for Ethereum smart contracts and evaluated its effectiveness on 27 real-world benchmarks. Using an online analysis is particularly suitable for the domain of smart contracts, since it does not require any code instrumentation---adding instrumentation to contracts changes their semantics. Our experiments show that targeted fuzzing significantly outperforms standard greybox fuzzing for reaching 83% of the challenging target locations (up to 14x median speed-up).
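The power-schedule idea can be sketched as follows: each seed gets an energy budget (number of mutations per round) that grows as its statically computed distance to the target locations shrinks. The formula and constants below are illustrative assumptions, not the tool's actual schedule.

```python
def energy(seed_distance, min_d, max_d, base=16, boost=8):
    """Assign more fuzzing energy to seeds whose explored paths come closer
    to the target locations; distances come from the online static analysis."""
    if max_d == min_d:
        return base
    closeness = (max_d - seed_distance) / (max_d - min_d)   # 1.0 = closest seed
    return int(base * (1 + boost * closeness))

# seeds with (assumed) static distances to the nearest target location
distances = {"seed_a": 3, "seed_b": 40, "seed_c": 12}
lo, hi = min(distances.values()), max(distances.values())
for name, d in distances.items():
    print(name, "->", energy(d, lo, hi), "mutations this round")
```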

Planning for untangling: predicting the difficulty of merge conflicts

Merge conflicts are inevitable in collaborative software development and are disruptive. When they occur, developers have to stop their current work, understand the conflict and the surrounding code, and plan an appropriate resolution. However, not all conflicts are equally problematic---some can be easily fixed, while others may be complicated enough to require multiple people. Currently, there is little support to help developers plan their conflict resolution. In this work, we aim to predict the difficulty of a merge conflict so as to help developers plan its resolution. The ability to predict the difficulty of a merge conflict and to identify the underlying factors for its difficulty can also help tool builders improve their conflict detection tools to prioritize and warn developers of difficult conflicts. We investigate the characteristics of difficult merge conflicts and automatically classify them. We analyzed 6,380 conflicts across 128 Java projects and found that merge conflict difficulty can be accurately predicted (AUC of 0.76) with machine learning algorithms such as bagging.

SESSION: Static analysis 1

Conquering the extensional scalability problem for value-flow analysis frameworks

Modern static analyzers often need to check a few dozen or even hundreds of value-flow properties simultaneously, causing serious scalability issues when high precision is required. A major factor in this deficiency, as we observe, is that the core static analysis engine is oblivious of the mutual synergy among the properties being checked, thus inevitably losing many optimization opportunities. Our work leverages this inter-property awareness to capture redundancies and inconsistencies when many properties are considered at the same time. We have evaluated our approach by checking twenty value-flow properties in standard benchmark programs and ten real-world software systems. The results demonstrate that our approach is more than 8× faster than existing ones while consuming only 1/7 of the memory. This substantial improvement in analysis efficiency is not achieved by sacrificing effectiveness: at the time of writing, thirty-nine bugs found by our approach have been fixed by developers, and four of them have been assigned CVE IDs due to their security impact.

Tailoring programs for static analysis via program transformation

Static analysis is a proven technique for catching bugs during software development. However, analysis tooling must approximate, both theoretically and in the interest of practicality. False positives are a pervasive manifestation of such approximations---tool configuration and customization are therefore crucial for usability and for directing analysis behavior. To suppress false positives, developers readily disable bug checks or insert comments that suppress spurious bug reports. Existing work shows that these mechanisms fall short of developer needs and present a significant pain point for using or adopting analyses. We draw on the insight that an analysis user always has one notable ability to influence analysis behavior regardless of analyzer options and implementation: modifying their program. We present a new technique for automated, generic, and temporary code changes that are tailored to suppress spurious analysis errors. We adopt a rule-based approach in which simple, declarative templates describe general syntactic changes for code patterns that are known to be problematic for the analyzer. Our technique promotes program transformation as a general primitive for improving the fidelity of analysis reports (we treat any given analyzer as a black box). We evaluate our approach using five different static analyzers supporting three different languages (C, Java, and PHP) on large, real-world programs (up to 800KLOC). We show that our approach is effective in sidestepping long-standing and complex issues in analysis implementations.
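A toy version of the rule-based idea, with regexes standing in for real syntax-aware templates: each declarative rule rewrites a pattern assumed to confuse the analyzer, and the analyzer is then run on the transformed copy while the original program stays untouched.

```python
import re

# Declarative rules: (name, pattern, replacement). Real rules would be
# syntax-aware templates, not regexes; this only shows the shape of the idea.
RULES = [
    ("inline-trivial-getter",
     re.compile(r"(\w+)\.getInstance\(\)\.(\w+)\("),
     r"\1_\2("),
    ("drop-redundant-null-guard",
     re.compile(r"if \((\w+) != null\) \{\s*\1\.close\(\);\s*\}"),
     r"\1.close();"),
]

def tailor(source):
    """Apply each rewrite rule; the transformed copy is what gets analyzed,
    the original program is left unmodified."""
    applied = []
    for name, pattern, repl in RULES:
        source, n = pattern.subn(repl, source)
        if n:
            applied.append((name, n))
    return source, applied

code = "if (conn != null) { conn.close(); }"
print(tailor(code))   # ('conn.close();', [('drop-redundant-null-guard', 1)])
```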

Pipelining bottom-up data flow analysis

Bottom-up program analysis has traditionally been easy to parallelize because functions without caller-callee relations can be analyzed independently. However, such function-level parallelism is significantly limited by calling dependences: functions with caller-callee relations have to be analyzed sequentially, because the analysis of a function depends on the analysis results, a.k.a. function summaries, of its callees. We observe that the calling dependence can be relaxed in many cases and, as a result, the parallelism can be improved. In this paper, we present Coyote, a framework for bottom-up data flow analysis in which the analysis task of each function is carefully partitioned into multiple sub-tasks to generate pipelineable function summaries. These sub-tasks are pipelined and run in parallel, even though the calling dependence exists. We formalize our idea under the IFDS/IDE framework and have implemented an application for checking null-dereference bugs and taint issues in C/C++ programs. We evaluate Coyote on a series of standard benchmark programs and open-source software systems, demonstrating significant speedup over a conventional parallel design.

SESSION: Traceability

A novel approach to tracing safety requirements and state-based design models

Traceability plays an essential role in assuring that software and systems are safe to use. Automated requirements traceability faces a low-precision challenge, as a large number of false positives are returned and mingled with the true links. To overcome this challenge, we present a mutation-driven method built on the novel idea of proactively creating many seemingly correct tracing targets (i.e., mutants of a state machine diagram), and then exploiting model checking within process mining to automatically verify whether the safety requirement's properties hold in the mutants. A mutant is killed if its model checking fails; otherwise, it survives. We leverage the underlying killed-survived distinction and develop a correlation analysis procedure to identify the traceability links. Experimental evaluation on two automotive systems with 27 safety requirements shows considerable precision improvements compared with the state of the art.

Establishing multilevel test-to-code traceability links

Test-to-code traceability links model the relationships between test artefacts and code artefacts. When utilised during the development process, these links help developers to keep test code in sync with tested code, reducing the rate of test failures and missed faults. Test-to-code traceability links can also help developers to maintain an accurate mental model of the system, reducing the risk of architectural degradation when making changes. However, establishing and maintaining these links manually places an extra burden on developers and is error-prone. This paper presents TCtracer, an approach and implementation for the automatic establishment of test-to-code traceability links. Unlike existing work, TCtracer operates at both the method level and the class level, allowing us to establish links between tests and functions, as well as between test classes and tested classes. We improve over existing techniques by combining an ensemble of new and existing techniques and exploiting a synergistic flow of information between the method and class levels. An evaluation of TCtracer using four large, well-studied open source systems demonstrates that, on average, we can establish test-to-function links with a mean average precision (MAP) of 78% and test-class-to-class links with an MAP of 93%.

Improving the effectiveness of traceability link recovery using hierarchical bayesian networks

Traceability is a fundamental component of the modern software development process that helps to ensure properly functioning, secure programs. Due to the high cost of manually establishing trace links, researchers have developed automated approaches that draw relationships between pairs of textual software artifacts using similarity measures. However, the effectiveness of such techniques is often limited, as they only utilize a single measure of artifact similarity and cannot simultaneously model (implicit and explicit) relationships across groups of diverse development artifacts.

In this paper, we illustrate how these limitations can be overcome through the use of a tailored probabilistic model. To this end, we design and implement a HierarchiCal PrObabilistic Model for SoftwarE Traceability (Comet) that is able to infer candidate trace links. Comet is capable of modeling relationships between artifacts by combining the complementary observational prowess of multiple measures of textual similarity. Additionally, our model can holistically incorporate information from a diverse set of sources, including developer feedback and transitive (often implicit) relationships among groups of software artifacts, to improve inference accuracy. We conduct a comprehensive empirical evaluation of Comet that illustrates an improvement over a set of optimally configured baselines of ≈14% in the best case and ≈5% across all subjects in terms of average precision. The comparative effectiveness of Comet in practice, where optimal configuration is typically not possible, is likely to be higher. Finally, we illustrate Comet's potential for practical applicability in a survey with developers from Cisco Systems who used a prototype Comet Jenkins plugin.

SESSION: API

How Android developers handle evolution-induced API compatibility issues: a large-scale study

As the Android platform evolves at a fast pace, API-related compatibility issues have become a significant challenge for developers. To handle an incompatible API invocation, developers mainly have two choices: merely performing sufficient checks to avoid invoking incompatible APIs on platforms that do not support them, or gracefully providing replacement implementations on those incompatible platforms. Since it provides more consistent app behavior, the latter is more recommended but also more challenging to adopt. However, it is still unknown how these issues are handled in the real world, whether developers encounter difficulties, and what we can do to help them.

In light of this, this paper performs the first large-scale study of the current practice of handling evolution-induced API compatibility issues in about 300,000 Android market apps and, more importantly, of their solutions (if any). In general, it is very challenging to determine whether developers have put in counter-measures for a compatibility issue, as different APIs have diverse behaviors, resulting in various repairs. To facilitate a large-scale study, this paper proposes RAPID, an automated tool that determines whether a compatibility issue has been addressed or not by incorporating both static analysis and machine learning techniques. Results show that our trained classifier is quite effective, achieving F1-scores of 95.21% and 91.96% in the training and validation stages, respectively. With the help of RAPID, our study yields many interesting findings: e.g., developers are not willing to provide alternative implementations when handling incompatible API invocations (only 38.4% do); for incompatible APIs for which Google gives replacement recommendations, the ratio of providing alternative implementations is significantly higher than for those without recommendations; and developers find more ways to repair compatibility issues than Google's recommendations, so the knowledge acquired from these experienced developers would be extremely useful to novice developers and may significantly improve the current status of compatibility issue handling.

An empirical study on API parameter rules

Developers build programs based on software libraries to reduce coding effort. If a program sets an API parameter inappropriately, the program may exhibit unexpected runtime behaviors. To help developers correctly use library APIs, researchers have built tools to mine API parameter rules. However, it is still unknown (1) what types of parameter rules exist, and (2) how these rules are distributed across documents and source files. In this paper, we conducted an empirical study to investigate these questions. To analyze as many parameter rules as possible, we took a hybrid approach that combines automatic localization of constrained parameters with manual inspection. Our automatic approach---PaRu---locates parameters that have constraints either documented in Javadoc (i.e., document rules) or implied by source code (i.e., code rules). Our manual inspection (1) identifies and categorizes rules for the located parameters, and (2) establishes mappings between document and code rules. By applying PaRu to 9 widely used libraries, we located 5,334 parameters with either document or code rules. Interestingly, only 187 parameters have both types of rules, and 79 pairs of these parameter rules are unmatched. Additionally, PaRu extracted 1,688 rule sentences from Javadoc and code. We manually classified these sentences into six categories, two of which are overlooked by prior approaches. We found that 86.2% of parameters have only code rules, 10.3% have only document rules, and only 3.5% have both document and code rules. Our research reveals the challenges of automating parameter rule extraction. Based on our findings, we discuss the potential of prior approaches and present our insights for future tool design.

When APIs are intentionally bypassed: an exploratory study of API workarounds

Application programming interfaces (APIs) have become ubiquitous in software development. However, external APIs are not guaranteed to contain every desirable feature, nor are they immune to software defects. Therefore, API users will sometimes be faced with situations where a current API does not satisfy all of their requirements, but migrating to another API is costly. In these cases, due to the lack of communication channels between API developers and users, API users may intentionally bypass an existing API after inquiring into workarounds for their API problems with online communities. This mechanism takes the API developer out of the conversation, potentially leaving API defects unreported and desirable API features undiscovered. In this paper we explore API workaround inquiries from API users on Stack Overflow. We uncover general reasons why API users inquire about API workarounds, and general solutions to API workaround requests. Furthermore, using workaround implementations in Stack Overflow answers, we develop three API workaround implementation patterns. We identify instances of these patterns in real-life open source projects and determine their value for API developers from their responses to feature requests based on the identified API workarounds.

Demystify official API usage directives with crowdsourced API misuse scenarios, erroneous code examples and patches

API usage directives in official API documentation describe, in natural language, the contracts, constraints, and guidelines for using APIs. Through an investigation of API misuse scenarios on Stack Overflow, we identify three barriers that hinder the understanding of API usage directives: lack of specific usage context, indirect relationships to cooperative APIs, and confusing APIs with subtle differences. To overcome these barriers, we develop a text mining approach that discovers crowdsourced API misuse scenarios on Stack Overflow and extracts from these scenarios erroneous code examples and patches, as well as related APIs and confusing APIs, to construct demystification reports that help developers understand the official API usage directives described in natural language. We apply our approach to API usage directives in the official Android API documentation and android-tagged discussion threads on Stack Overflow. We extract 159,116 API misuse scenarios for 23,969 API usage directives covering 3,138 classes and 7,471 methods, from which we generate the demystification reports. Our manual examination confirms that the information in the generated demystification reports is highly accurate. Through a user study with 14 developers on 8 API-misuse-related error scenarios, we show that our demystification reports help developers understand and debug API-misuse-related program errors faster and more accurately than reading only plain API usage-directive sentences.

SESSION: Bug analysis

Simulee: detecting CUDA synchronization bugs via memory-access modeling

While CUDA has become a mainstream parallel computing platform and programming model for general-purpose GPU computing, how to effectively and efficiently detect CUDA synchronization bugs remains a challenging open problem. In this paper, we propose the first lightweight CUDA synchronization bug detection framework, namely Simulee, to model CUDA program execution by interpreting the corresponding LLVM bytecode and collecting the memory-access information for automatically detecting general CUDA synchronization bugs. To evaluate the effectiveness and efficiency of Simulee, we construct a benchmark with 7 popular CUDA-related projects from GitHub, upon which we conduct an extensive set of experiments. The experimental results suggest that Simulee can detect 21 out of the 24 manually identified bugs in our preliminary study and also 24 previously unknown bugs among all projects, 10 of which have already been confirmed by the developers. Furthermore, Simulee significantly outperforms state-of-the-art approaches for CUDA synchronization bug detection.
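
The memory-access modeling idea can be illustrated with a minimal, hypothetical sketch: each access records the issuing thread, the address, whether it writes, and the barrier epoch it falls into; two accesses to the same address from different threads conflict when at least one writes and no __syncthreads() separates them. This is our simplification for illustration, not Simulee's LLVM-level implementation.

    from collections import namedtuple

    # One recorded memory access: issuing thread, address, write flag, and the
    # barrier "epoch" it falls into (incremented at each __syncthreads()).
    Access = namedtuple("Access", "thread addr is_write epoch")

    def find_data_races(accesses):
        races, by_addr = [], {}
        for acc in accesses:
            for prev in by_addr.get(acc.addr, []):
                if (prev.thread != acc.thread
                        and prev.epoch == acc.epoch
                        and (prev.is_write or acc.is_write)):
                    races.append((prev, acc))
            by_addr.setdefault(acc.addr, []).append(acc)
        return races

    # Toy trace: two threads write the same address with no barrier in between.
    trace = [Access(0, 0x10, True, 0), Access(1, 0x10, True, 0)]
    print(find_data_races(trace))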

SESSION: Deep learning testing and debugging 2

White-box fairness testing through adversarial sampling

Although deep neural networks (DNNs) have demonstrated astonishing performance in many applications, there are still concerns about their dependability. One desirable property of DNNs for applications with societal impact is fairness (i.e., non-discrimination). In this work, we propose a scalable approach for searching for individual discriminatory instances of a DNN. Compared with state-of-the-art methods, our approach only employs lightweight procedures like gradient computation and clustering, which makes it significantly more scalable than existing methods. Experimental results show that our approach explores the search space more effectively (9 times) and generates many more individual discriminatory instances (25 times) in much less time (1/2 to 1/7 of the time).

Structure-invariant testing for machine translation

In recent years, machine translation software has increasingly been integrated into our daily lives. People routinely use machine translation for various applications, such as describing symptoms to a foreign doctor and reading political news in a foreign language. However, the complexity and intractability of neural machine translation (NMT) models that power modern machine translation make the robustness of these systems difficult to even assess, much less guarantee. Machine translation systems can return inferior results that lead to misunderstanding, medical misdiagnoses, threats to personal safety, or political conflicts. Despite its apparent importance, validating the robustness of machine translation systems is very difficult and has, therefore, been much under-explored.

To tackle this challenge, we introduce structure-invariant testing (SIT), a novel metamorphic testing approach for validating machine translation software. Our key insight is that the translation results of "similar" source sentences should typically exhibit similar sentence structures. Specifically, SIT (1) generates similar source sentences by substituting one word in a given sentence with semantically similar, syntactically equivalent words; (2) represents sentence structure by syntax parse trees (obtained via constituency or dependency parsing); (3) reports sentence pairs whose structures differ quantitatively by more than some threshold. To evaluate SIT, we use it to test Google Translate and Bing Microsoft Translator with 200 source sentences as input, which led to 64 and 70 buggy issues with 69.5% and 70% top-1 accuracy, respectively. The translation errors are diverse, including under-translation, over-translation, incorrect modification, word/phrase mistranslation, and unclear logic.
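
A minimal sketch of the SIT workflow appears below, with two stand-ins that are ours rather than the paper's: synonyms come from a hand-written dictionary instead of a language model, and "sentence structure" is approximated by the sequence of token lengths instead of a constituency or dependency parse tree.

    def levenshtein(a, b):
        # classic edit distance between two sequences
        dp = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, y in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
        return dp[-1]

    def structure(sentence):
        return [len(tok) for tok in sentence.split()]  # stand-in for a parse tree

    def sit(sentence, synonyms, translate, threshold=2):
        """Report mutants whose translations differ too much in structure."""
        suspicious, base = [], structure(translate(sentence))
        words = sentence.split()
        for i, tok in enumerate(words):
            for alt in synonyms.get(tok, []):
                mutant = " ".join(words[:i] + [alt] + words[i + 1:])
                if levenshtein(base, structure(translate(mutant))) > threshold:
                    suspicious.append(mutant)
        return suspicious

    # Toy translator: the mutant's translation collapses into a different structure.
    toy = {"the big cat sleeps": "le gros chat dort",
           "the large cat sleeps": "quel dommage"}
    print(sit("the big cat sleeps", {"big": ["large"]}, lambda s: toy.get(s, s)))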

Automatic testing and improvement of machine translation

This paper presents TransRepair, a fully automatic approach for testing and repairing the consistency of machine translation systems. TransRepair combines mutation with metamorphic testing to detect inconsistency bugs (without access to human oracles). It then adopts probability-reference or cross-reference to post-process the translations, in a grey-box or black-box manner, to repair the inconsistencies. Our evaluation on two state-of-the-art translators, Google Translate and Transformer, indicates that TransRepair has a high precision (99%) on generating input pairs with consistent translations. With these tests, using automatic consistency metrics and manual assessment, we find that Google Translate and Transformer have approximately 36% and 40% inconsistency bugs. Black-box repair fixes 28% and 19% bugs on average for Google Translate and Transformer. Grey-box repair fixes 30% bugs on average for Transformer. Manual inspection indicates that the translations repaired by our approach improve consistency in 87% of cases (degrading it in 2%), and that our repairs have better translation acceptability in 27% of the cases (worse in 8%).

TRADER: trace divergence analysis and embedding regulation for debugging recurrent neural networks

Recurrent Neural Networks (RNNs) can deal with (textual) input of various lengths and hence are widely used in software systems and software engineering applications. RNNs depend on word embeddings that are usually pre-trained by third parties to encode textual inputs to numerical values. It is well known that problematic word embeddings can lead to low model accuracy. In this paper, we propose a new technique to automatically diagnose how problematic embeddings impact model performance, by comparing model execution traces from correctly and incorrectly executed samples. We then leverage the diagnosis results as guidance to harden/repair the embeddings. Our experiments show that TRADER can consistently and effectively improve accuracy for real-world models and datasets by 5.37% on average, which is a substantial improvement in the context of RNN models reported in the literature.

SESSION: Fuzzing 2

Typestate-guided fuzzer for discovering use-after-free vulnerabilities

Existing coverage-based fuzzers usually use the individual control flow graph (CFG) edge coverage to guide the fuzzing process, which has shown great potential in finding vulnerabilities. However, CFG edge coverage is not effective in discovering vulnerabilities such as use-after-free (UaF). This is because, to trigger UaF vulnerabilities, one needs not only to cover individual edges, but also to traverse some (long) sequence of edges in a particular order, which is challenging for existing fuzzers. To this end, we propose to model UaF vulnerabilities as typestate properties, and develop a typestate-guided fuzzer, named UAFL, for discovering vulnerabilities violating typestate properties. Given a typestate property, we first perform a static typestate analysis to find operation sequences potentially violating the property. Our fuzzing process is then guided by the operation sequences in order to progressively generate test cases triggering property violations. In addition, we also employ an information flow analysis to improve the efficiency of the fuzzing process. We have performed a thorough evaluation of UAFL on 14 widely-used real-world programs. The experimental results show that UAFL substantially outperforms the state-of-the-art fuzzers, including AFL, AFLFast, FairFuzz, MOpt, Angora and QSYM, in terms of the time taken to discover vulnerabilities. We have discovered 10 previously unknown vulnerabilities, and received 5 new CVEs.
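
The typestate property itself is simple to state; the sketch below encodes the use-after-free/double-free automaton over a plain operation trace. It shows only the property check, not UAFL's static analysis or fuzzing guidance, and the trace format is hypothetical.

    def check_uaf(trace):
        """Typestate check over a trace of (op, pointer) events."""
        state, violations = {}, []           # pointer -> "live" | "freed"
        for op, ptr in trace:
            if op == "alloc":
                state[ptr] = "live"
            elif op == "free":
                if state.get(ptr) == "freed":
                    violations.append(("double-free", ptr))
                state[ptr] = "freed"
            elif op == "use" and state.get(ptr) == "freed":
                violations.append(("use-after-free", ptr))
        return violations

    print(check_uaf([("alloc", "p"), ("free", "p"), ("use", "p")]))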

JVM fuzzing for JIT-induced side-channel detection

Timing side channels arise in software when a program's execution time can be correlated with security-sensitive program input. Recent results on software side-channel detection focus on analysis of a program's source code. However, runtime behavior, in particular optimizations introduced during just-in-time (JIT) compilation, can impact or even introduce timing side channels in programs. In this paper, we present a technique for automatically detecting such JIT-induced timing side channels in Java programs. We first introduce patterns to detect partitions of secret input potentially separable by side channels. Then we present an automated approach for exploring behaviors of the Java Virtual Machine (JVM) to identify states where timing channels separating these partitions arise. We evaluate our technique on three datasets used in recent work on side-channel detection. We find that many code variants labeled "safe" with respect to side-channel vulnerabilities are in fact vulnerable to JIT-induced timing side channels. Our results directly contradict the conclusions of four separate state-of-the-art program analysis tools for side-channel detection and demonstrate that JIT-induced side channels are prevalent and can be detected automatically.

Ankou: guiding grey-box fuzzing towards combinatorial difference

Grey-box fuzzing is an evolutionary process, which maintains and evolves a population of test cases with the help of a fitness function. Fitness functions used by current grey-box fuzzers are not informative in that they cannot distinguish different program executions as long as those executions achieve the same coverage. The problem is that current fitness functions only consider a union of data, but not their combination. As such, fuzzers often get stuck in a local optimum during their search. In this paper, we introduce Ankou, the first grey-box fuzzer that recognizes different combinations of execution information, and present several scalability challenges encountered while designing and implementing Ankou. Our experimental results show that Ankou is 1.94× and 8.0× more effective in finding bugs than AFL and Angora, respectively.
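
A minimal sketch of a combination-aware fitness, as we read the idea above (not Ankou's exact algorithm): an execution is summarized by its branch hit-count vector, and an input is kept when that vector is far enough from every vector already in the population, even if it covers no new branch.

    import math

    def distance(u, v):
        """Euclidean distance between two sparse branch hit-count vectors."""
        keys = set(u) | set(v)
        return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))

    def is_interesting(hit_counts, population, min_dist=1.0):
        """Keep an input whose execution vector is far from all saved ones."""
        return all(distance(hit_counts, seen) >= min_dist for seen in population)

    population = [{"b1": 3, "b2": 1}]
    print(is_interesting({"b1": 3, "b2": 4}, population))  # same branches, new combination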

SESSION: Static analysis 2

BCFA: bespoke control flow analysis for CFA at scale

Many data-driven software engineering tasks such as discovering programming patterns, mining API specifications, etc., perform source code analysis over control flow graphs (CFGs) at scale. Analyzing millions of CFGs can be expensive and performance of the analysis heavily depends on the underlying CFG traversal strategy. State-of-the-art analysis frameworks use a fixed traversal strategy. We argue that a single traversal strategy does not fit all kinds of analyses and CFGs and propose bespoke control flow analysis (BCFA). Given a control flow analysis (CFA) and a large number of CFGs, BCFA selects the most efficient traversal strategy for each CFG. BCFA extracts a set of properties of the CFA by analyzing the code of the CFA and combines it with properties of the CFG, such as branching factor and cyclicity, for selecting the optimal traversal strategy. We have implemented BCFA in Boa, and evaluated BCFA using a set of representative static analyses that mainly involve traversing CFGs and two large datasets containing 287 thousand and 162 million CFGs. Our results show that BCFA can speed up the large-scale analyses by 1%-28%. Further, BCFA has low overhead (less than 0.2%) and a low misprediction rate (less than 0.01%).
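
The selection idea can be sketched as a small decision procedure over cheap properties of the analysis and the CFG. The property names and thresholds below are illustrative, not BCFA's actual selection algorithm.

    def pick_strategy(cfg_is_cyclic, branching_factor, analysis_needs_fixpoint):
        """Choose a CFG traversal strategy from cheap CFG/analysis properties."""
        if not analysis_needs_fixpoint:
            return "any-order single pass"
        if not cfg_is_cyclic:
            return "reverse post-order, single pass"  # no loops: one pass converges
        if branching_factor > 4:
            return "worklist"                         # revisit only affected nodes
        return "iterative post-order until fixpoint"

    print(pick_strategy(cfg_is_cyclic=True, branching_factor=2,
                        analysis_needs_fixpoint=True))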

On the recall of static call graph construction in practice

Static analyses have problems modelling dynamic language features soundly while retaining acceptable precision. The problem is well-understood in theory, but there is little evidence on how this impacts the analysis of real-world programs. We have studied this issue for call graph construction on a set of 31 real-world Java programs using an oracle of actual program behaviour recorded from executions of built-in and synthesised test cases with high coverage, have measured the recall that is being achieved by various static analysis algorithms and configurations, and investigated which language features lead to static analysis false negatives.

We report that (1) the median recall is 0.884, suggesting that standard static analyses have significant gaps with respect to the proportion of the program modelled; (2) built-in tests are significantly better at exposing dynamic program behaviour than synthesised tests; (3) adding precision to the static analysis has little impact on recall, indicating that those are separate concerns; (4) state-of-the-art support for dynamic language features can significantly improve recall (the median observed is 0.935), but it comes with a hefty performance penalty; and (5) the main sources of unsoundness are not reflective method invocations, but objects allocated or accessed via native methods, and invocations initiated by the JVM without matching call sites in the program under analysis.

These results provide some novel insights into the interaction between static and dynamic program analyses that can be used to assess the utility of static analysis results and to guide the development of future static and hybrid analyses.

Heaps'n leaks: how heap snapshots improve Android taint analysis

The assessment of information flows is an essential part of analyzing Android apps, and is frequently supported by static taint analysis. Its precision, however, can suffer from the analysis not being able to precisely determine what elements a pointer can (and cannot) point to. Recent advances in static analysis suggest that incorporating dynamic heap snapshots, taken at one point at runtime, can significantly improve general static analysis. In this paper, we investigate to what extent this also holds for taint analysis, and how various design decisions, such as when and how many snapshots are collected during execution, and how exactly they are used, impact soundness and precision. We have extended FlowDroid to incorporate heap snapshots, yielding our prototype Heapster, and evaluated it on DroidMacroBench, a novel benchmark comprising real-world Android apps that we also make available as an artifact. The results show (1) the use of heap snapshots lowers analysis time and memory consumption while increasing precision; (2) a very good trade-off between precision and recall is achieved by a mixed mode in which the analysis falls back to static points-to relations for objects for which no dynamic data was recorded; and (3) while a single heap snapshot (ideally taken at the end of the execution) suffices to improve performance and precision, a better trade-off can be obtained by using multiple snapshots.
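
The mixed mode described in finding (2) can be sketched as a simple fallback rule; the function below is our reconstruction for illustration, not Heapster's API.

    def points_to(var, snapshot_pts, static_pts):
        """Resolve a points-to query, preferring the recorded heap snapshot."""
        if var in snapshot_pts:                 # observed at runtime: precise set
            return snapshot_pts[var]
        return static_pts.get(var, set())       # otherwise fall back to static sets

    snapshot = {"x": {"obj1"}}
    static = {"x": {"obj1", "obj2"}, "y": {"obj3"}}
    print(points_to("x", snapshot, static), points_to("y", snapshot, static))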

SESSION: Big data

Big code != big vocabulary: open-vocabulary models for source code

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.

In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported.

All datasets, code, and trained models used in this work are publicly available.
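
One of the modelling choices such studies examine is how compound identifier names are split into subword units. The sketch below splits camelCase and snake_case identifiers, a much simpler scheme than the byte-pair encoding an open-vocabulary NLM would typically rely on.

    import re

    def subtokens(identifier):
        """Split an identifier into lower-cased subword units."""
        out = []
        for part in re.split(r"[_$]+", identifier):
            out += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
        return [t.lower() for t in out if t]

    print(subtokens("parseHTTPResponse_v2"))  # ['parse', 'http', 'response', 'v', '2']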

Improving data scientist efficiency with provenance

Data scientists frequently analyze data by writing scripts. We conducted a contextual inquiry with interdisciplinary researchers, which revealed that parameter tuning is a highly iterative process and that debugging is time-consuming. As analysis scripts evolve and become more complex, analysts have difficulty conceptualizing their workflow. In particular, after editing a script, it becomes difficult to determine precisely which code blocks depend on the edit. Consequently, scientists frequently re-run entire scripts instead of re-running only the necessary parts. We present ProvBuild, a tool that leverages language-level provenance to streamline the debugging process by reducing programmer cognitive load and decreasing subsequent runtimes, leading to an overall reduction in elapsed debugging time. ProvBuild uses provenance to track dependencies in a script. When an analyst debugs a script, ProvBuild generates a simplified script that contains only the information necessary to debug a particular problem. We demonstrate that debugging the simplified script lowers a programmer's cognitive load and permits faster re-execution when testing changes. The combination of reduced cognitive load and shorter runtime reduces the time necessary to debug a script. We quantitatively and qualitatively show that even though ProvBuild introduces overhead during a script's first execution, it is a more efficient way for users to debug and tune complex workflows. ProvBuild demonstrates a novel use of language-level provenance, in which it is used to proactively improve programmer productivity rather than merely providing a way to retroactively gain insight into a body of code.
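
The provenance idea can be sketched as a small dependency computation: record which variables each script block reads and writes, then after an edit re-run only the blocks whose inputs are (transitively) affected. The block representation below is a hypothetical simplification of language-level provenance.

    def blocks_to_rerun(blocks, edited):
        """blocks: ordered (name, reads, writes); return blocks affected by the edit."""
        affected, dirty = {edited}, set()
        for name, reads, writes in blocks:
            if name in affected or dirty & set(reads):
                affected.add(name)
                dirty |= set(writes)
        return affected

    script = [("load", [], ["df"]),
              ("clean", ["df"], ["df2"]),
              ("plot", ["df2"], []),
              ("summary", ["df"], [])]
    print(blocks_to_rerun(script, "clean"))  # {'clean', 'plot'}; 'summary' need not re-run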

Managing data constraints in database-backed web applications

Database-backed web applications manipulate large amounts of persistent data, and such applications often contain constraints that restrict data length, data value, and other data properties. Such constraints are critical in ensuring the reliability and usability of these applications. In this paper, we present a comprehensive study on where data constraints are expressed, what they are about, how often they evolve, and how their violations are handled. The results show that developers struggle with maintaining consistent data constraints and checking them across different components and versions of their web applications, leading to various problems. Guided by our study, we developed checking tools and API enhancements that can automatically detect such problems and improve the quality of such applications.

SESSION: Deep learning testing and debugging 3

Taxonomy of real faults in deep learning systems

The growing application of deep neural networks in safety-critical domains makes the analysis of faults that occur in such systems of enormous importance. In this paper we introduce a large taxonomy of faults in deep learning (DL) systems. We have manually analysed 1059 artefacts gathered from GitHub commits and issues of projects that use the most popular DL frameworks (TensorFlow, Keras and PyTorch) and from related Stack Overflow posts. Structured interviews with 20 researchers and practitioners describing the problems they have encountered in their experience have enriched our taxonomy with a variety of additional faults that did not emerge from the other two sources. Our final taxonomy was validated with a survey involving an additional set of 21 developers, confirming that almost all fault categories (13/15) were experienced by at least 50% of the survey participants.

Testing DNN image classifiers for confusion & bias errors

Image classifiers are an important component of today's software, from consumer and business applications to safety-critical domains. The advent of Deep Neural Networks (DNNs) is the key catalyst behind such wide-spread success. However, wide adoption comes with serious concerns about the robustness of software systems dependent on DNNs for image classification, as several severe erroneous behaviors have been reported under sensitive and critical circumstances. We argue that developers need to rigorously test their software's image classifiers and delay deployment until acceptable. We present an approach to testing image classifier robustness based on class property violations.

We found that many of the reported erroneous cases in popular DNN image classifiers occur because the trained models confuse one class with another or show biases towards some classes over others. These bugs usually violate some class properties of one or more of those classes. Most DNN testing techniques focus on per-image violations and so fail to detect class-level confusions or biases.

We developed a testing technique to automatically detect class-based confusion and bias errors in DNN-driven image classification software. We evaluated our implementation, DeepInspect, on several popular image classifiers with precision up to 100% (avg. 72.6%) for confusion errors, and up to 84.3% (avg. 66.8%) for bias errors. DeepInspect found hundreds of classification mistakes in widely-used models, many exposing errors indicating confusion or bias.
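
Class-level confusion can be illustrated with a minimal sketch that flags class pairs whose confusion count is far above the average off-diagonal count of the confusion matrix. The threshold and the absence of per-class normalization are our simplifications, not DeepInspect's metric.

    from collections import Counter

    def confused_pairs(labels, predictions, factor=5.0):
        """Flag (true, predicted) class pairs confused far more often than average."""
        pair_counts = Counter(zip(labels, predictions))
        off_diag = [(pair, n) for pair, n in pair_counts.items() if pair[0] != pair[1]]
        if not off_diag:
            return []
        avg = sum(n for _, n in off_diag) / len(off_diag)
        return [pair for pair, n in off_diag if n >= factor * avg]

    labels      = ["cat", "cat", "dog", "dog", "dog", "bird"]
    predictions = ["dog", "dog", "dog", "dog", "cat", "bird"]
    print(confused_pairs(labels, predictions, factor=1.2))  # [('cat', 'dog')]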

Repairing deep neural networks: fix patterns and challenges

Significant interest in applying Deep Neural Networks (DNNs) has fueled the need to support engineering of software that uses DNNs. Repairing software that uses DNNs is one such unmistakable SE need where automated tools could be beneficial; however, we do not fully understand challenges to repairing and patterns that are utilized when manually repairing DNNs. What challenges should automated repair tools address? What are the repair patterns whose automation could help developers? Which repair patterns should be assigned a higher priority for building automated bug repair tools? This work presents a comprehensive study of bug fix patterns to address these questions. We have studied 415 repairs from Stack Overflow and 555 repairs from GitHub for five popular deep learning libraries: Caffe, Keras, TensorFlow, Theano, and Torch, to understand challenges in repairs and bug repair patterns. Our key findings reveal that DNN bug fix patterns are distinctive compared to traditional bug fix patterns; the most common bug fix patterns are fixing data dimension and neural network connectivity; DNN bug fixes have the potential to introduce adversarial vulnerabilities; DNN bug fixes frequently introduce new bugs; and DNN bug localization, reuse of trained model, and coping with frequent releases are major challenges faced by developers when fixing bugs. We also contribute a benchmark of 667 DNN (bug, repair) instances.

Fuzz testing based data augmentation to improve robustness of deep neural networks

Deep neural networks (DNNs) have been shown to be notoriously brittle to small perturbations in their input data. This problem is analogous to the over-fitting problem in test-based program synthesis and automatic program repair, which is a consequence of the incomplete specification, i.e., the limited tests or training examples, that the program synthesis or repair algorithm has to learn from. Recently, test generation techniques have been successfully employed to augment existing specifications of intended program behavior, to improve the generalizability of program synthesis and repair. Inspired by these approaches, in this paper, we propose a technique that re-purposes software testing methods, specifically mutation-based fuzzing, to augment the training data of DNNs, with the objective of enhancing their robustness. Our technique casts the DNN data augmentation problem as an optimization problem. It uses genetic search to generate the most suitable variant of an input data to use for training the DNN, while simultaneously identifying opportunities to accelerate training by skipping augmentation in many instances. We instantiate this technique in two tools, Sensei and Sensei-SA, and evaluate them on 15 DNN models spanning 5 popular image data-sets. Our evaluation shows that Sensei can improve the robust accuracy of the DNN, compared to the state of the art, on each of the 15 models, by up to 11.9% and 5.5% on average. Further, Sensei-SA can reduce the average DNN training time by 25%, while still improving robust accuracy.
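
A minimal sketch of the augmentation loop, under toy stand-ins for the model loss and the mutation operator (a real instantiation would use the DNN's training loss and image transformations): for each training item, search a few random perturbations and keep the variant on which the current model does worst.

    import random

    def worst_variant(x, loss, mutate, rounds=8):
        """Evolve perturbed variants of x and return the one with the highest loss."""
        best, best_loss = x, loss(x)
        pool = [x]
        for _ in range(rounds):
            child = mutate(random.choice(pool))
            pool.append(child)
            child_loss = loss(child)
            if child_loss > best_loss:
                best, best_loss = child, child_loss
        return best

    # Toy demo: "inputs" are numbers and the "model" struggles far from zero.
    print(worst_variant(1.0,
                        loss=lambda v: abs(v),
                        mutate=lambda v: v + random.uniform(-0.5, 0.5)))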

An empirical study on program failures of deep learning jobs

Deep learning has made significant achievements in many application areas. To train and test models more efficiently, enterprise developers submit and run their deep learning programs on a shared, multi-tenant platform. However, some of the programs fail after a long execution time due to code/script defects, which reduces the development productivity and wastes expensive resources such as GPU, storage, and network I/O.

This paper presents the first comprehensive empirical study on program failures of deep learning jobs. 4,960 real failures are collected from a deep learning platform at Microsoft. We manually examine their failure messages and classify them into 20 categories. In addition, we identify the common root causes and bug-fix solutions on a sample of 400 failures. To better understand the current testing and debugging practices for deep learning, we also conduct developer interviews. Our major findings include: (1) 48.0% of the failures occur in the interaction with the platform rather than in the execution of code logic, mostly due to the discrepancies between local and platform execution environments; (2) Deep learning specific failures (13.5%) are mainly caused by inappropriate model parameters/structures and framework API misunderstanding; (3) Current debugging practices are not efficient for fault localization in many cases, and developers need more deep learning specific tools. Based on our findings, we further suggest possible research topics and tooling support that could facilitate future deep learning development.

SESSION: Natural language artifacts in software

Primers or reminders?: the effects of existing review comments on code review

In contemporary code review, the comments put by reviewers on a specific code change are immediately visible to the other reviewers involved. Could this visibility prime new reviewers' attention (due to humans' proneness to availability bias), thus biasing the code review outcome? In this study, we investigate this topic by conducting a controlled experiment with 85 developers who perform a code review and a psychological experiment. With the psychological experiment, we find that ≈70% of participants are prone to availability bias. However, when it comes to the code review, our experiment results show that participants are primed only when the existing code review comment is about a type of bug that is not normally considered; when this comment is visible, participants are more likely to find another occurrence of this type of bug. Moreover, this priming effect does not influence reviewers' likelihood of detecting other types of bugs. Our findings suggest that the current code review practice is effective because existing review comments about bugs in code changes are not negative primers, but rather positive reminders for bugs that would otherwise be overlooked during code review. Data and materials: https://doi.org/10.5281/zenodo.3653856

Mitigating turnover with code review recommendation: balancing expertise, workload, and knowledge distribution

Developer turnover is inevitable on software projects and leads to knowledge loss, a reduction in productivity, and an increase in defects. Mitigation strategies to deal with turnover tend to disrupt and increase workloads for developers. In this work, we suggest that through code review recommendation we can distribute knowledge and mitigate turnover with minimal impact on the development process. We evaluate review recommenders in the context of ensuring expertise during review, Expertise, reducing the review workload of the core team, CoreWorkload, and reducing the Files at Risk to turnover, FaR. We find that prior work that assigns reviewers based on file ownership concentrates knowledge on a small group of core developers, increasing the risk of knowledge loss from turnover by up to 65%. We propose learning and retention aware review recommenders that, when combined, are effective at reducing the risk of turnover by 29%, but they unacceptably reduce the overall expertise during reviews by 26%. We develop the Sofia recommender that suggests experts when none of the files under review are hoarded by developers, but distributes knowledge when files are at risk. In this way, we are able to simultaneously increase expertise during review with a ΔExpertise of 6%, with a negligible impact on workload of ΔCoreWorkload of 0.09%, and reduce the files at risk with a ΔFaR of -28%. Sofia is integrated into GitHub pull requests allowing developers to select an appropriate expert or "learner" based on the context of the review. We release the Sofia bot as well as the code and data for replication purposes.

SESSION: Open source software

How do companies collaborate in open source ecosystems?: an empirical study of OpenStack

Open Source Software (OSS) has come to play a critical role in the software industry. Some large ecosystems enjoy the participation of large numbers of companies, each of which has its own focus and goals. Indeed, companies that otherwise compete may become collaborators within the OSS ecosystem they participate in. Prior research has largely focused on commercial involvement in OSS projects, but there is a scarcity of research focusing on company collaborations within OSS ecosystems. Some of these ecosystems have become critical building blocks for organizations worldwide; hence, a clear understanding of how companies collaborate within large ecosystems is essential. This paper presents the results of an empirical study of the OpenStack ecosystem, in which hundreds of companies collaborate on thousands of project repositories to deliver cloud distributions. Based on a detailed analysis, we identify clusters of collaborations, and identify four strategies that companies adopt to engage with the OpenStack ecosystem. We also find that companies may engage in intentional or passive collaborations, or may work in an isolated fashion. Further, we find that a company's position in the collaboration network is positively associated with its productivity in OpenStack. Our study sheds light on how large OSS ecosystems work, and in particular on the patterns of collaboration within one such large ecosystem.

How to not get rich: an empirical study of donations in open source

Open source is ubiquitous and many projects act as critical infrastructure, yet funding and sustaining the whole ecosystem is challenging. While there are many different funding models for open source and concerted efforts through foundations, donation platforms like PayPal, Patreon, and OpenCollective are popular and low-bar platforms to raise funds for open-source development. With a mixed-method study, we investigate the emerging and largely unexplored phenomenon of donations in open source. Specifically, we quantify how commonly open-source projects ask for donations, statistically model characteristics of projects that ask for and receive donations, analyze for what the requested funds are needed and used, and assess whether the received donations achieve the intended outcomes. We find 25,885 projects asking for donations on GitHub, often to support engineering activities; however, we also find no clear evidence that donations influence the activity level of a project. In fact, we find that donations are used in a multitude of ways, raising new research questions about effective funding.

Scaling open source communities: an empirical study of the Linux kernel

Large-scale open source communities, such as the Linux kernel, have gone through decades of development, substantially growing in scale and complexity. In the traditional workflow, maintainers serve as "gatekeepers" for the subsystems that they maintain. As the number of patches and authors significantly increases, maintainers come under considerable pressure, which may hinder the operation and even the sustainability of the community. A few subsystems have begun to use new workflows to address these issues. However, it is unclear to what extent these new workflows are successful, or how to apply them. Therefore, we conduct an empirical study on the multiple-committer model (MCM) that has provoked extensive discussion in the Linux kernel community. We explore the effect of the model on the i915 subsystem with respect to four dimensions: pressure, latency, complexity, and quality assurance. We find that after this model was adopted, the burden of the i915 maintainers was significantly reduced. Also, the model scales well to allow more committers. After analyzing the online documents and interviewing the maintainers of i915, we propose that overloaded subsystems which have trustworthy candidate committers are suitable for adopting the model. We further suggest that the success of the model is closely related to a series of measures for risk mitigation---sufficient precommit testing, strict review process, and the use of tools to simplify work and reduce errors. We employ a network analysis approach to locate candidate committers for the target subsystems and validate this approach and contextual success factors through email interviews with their maintainers. To the best of our knowledge, this is the first study focusing on how to scale open source communities. We expect that our study will help the rapidly growing Linux kernel and other similar communities to adapt to changes and remain sustainable.

SESSION: Symbolic execution

SpecuSym: speculative symbolic execution for cache timing leak detection

CPU cache is a limited but crucial storage component in modern processors, whereas the cache timing side-channel may inadvertently leak information through the physically measurable timing variance. Speculative execution, an essential processor optimization and a source of such variances, can cause severe harm when branch mispredictions are deliberately induced. Although static analysis can qualitatively verify the timing-leakage-free property under speculative execution, it cannot produce concrete evidence, such as inputs and speculated flows, to diagnose leaks in depth. This work proposes a new symbolic execution based method, SpecuSym, for precisely detecting cache timing leaks introduced by speculative execution. Given a program (leakage-free in non-speculative execution), SpecuSym systematically explores the program state space, models speculative behavior at conditional branches, and accumulates the cache side effects along with subsequent path explorations. During the dynamic execution, SpecuSym constructs leak predicates for memory visits according to the specified cache model and conducts a constraint-solving based cache behavior analysis to inspect the new cache behaviors. We have implemented SpecuSym atop KLEE and evaluated it against 15 open-source benchmarks. Experimental results show that SpecuSym successfully detected from 2 to 61 leaks in 6 programs under 3 different cache settings and identified false positives in 2 programs reported by recent work.

Symbolic verification of message passing interface programs

Message passing is the standard paradigm of programming in high-performance computing. However, verifying Message Passing Interface (MPI) programs is challenging, due to the complex program features (such as non-determinism and non-blocking operations). In this work, we present MPI symbolic verifier (MPI-SV), the first symbolic execution based tool for automatically verifying MPI programs with non-blocking operations. MPI-SV combines symbolic execution and model checking in a synergistic way to tackle the challenges in MPI program verification. The synergy improves the scalability and enlarges the scope of verifiable properties. We have implemented MPI-SV and evaluated it with 111 real-world MPI verification tasks. The pure symbolic execution-based technique successfully verifies 61 out of the 111 tasks (55%) within one hour, while in comparison, MPI-SV verifies 100 tasks (90%). On average, compared with pure symbolic execution, MPI-SV achieves 19x speedups on verifying the satisfaction of the critical property and 5x speedups on finding violations.

Efficient generation of error-inducing floating-point inputs via symbolic execution

Floating point is widely used in software to emulate arithmetic over reals. Unfortunately, floating point leads to rounding errors that propagate and accumulate during execution. Generating inputs to maximize the numerical error is critical when evaluating the accuracy of floating-point code. In this paper, we formulate the problem of generating high error-inducing floating-point inputs as a code coverage maximization problem solved using symbolic execution. Specifically, we define inaccuracy checks to detect large precision loss and cancellation. We inject these checks at strategic program locations to construct specialized branches that, when covered by a given input, are likely to lead to large errors in the result. We apply symbolic execution to generate inputs that exercise these specialized branches, and describe optimizations that make our approach practical. We implement a tool named FPGen and present an evaluation on 21 numerical programs including matrix computation and statistics libraries. We show that FPGen exposes errors for 20 of these programs and triggers errors that are, on average, over 2 orders of magnitude larger than the state of the art.
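
The cancellation check can be sketched as a simple predicate on an addition's operands and result; injected as a branch, a coverage-maximizing engine is then steered toward inputs that make it fire. The bit threshold below is illustrative, not FPGen's exact formulation.

    def cancellation_check(x, y, bits=40):
        """True if x + y loses roughly `bits` significant bits to cancellation."""
        big = max(abs(x), abs(y))
        return big != 0.0 and abs(x + y) < big / (2.0 ** bits)

    print(cancellation_check(1.0, -1.0 + 1e-14))  # True: severe cancellation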

HyDiff: hybrid differential software analysis

Detecting regression bugs in software evolution, analyzing side-channels in programs and evaluating robustness in deep neural networks (DNNs) can all be seen as instances of differential software analysis, where the goal is to generate diverging executions of program paths. Two executions are said to be diverging if the observable program behavior differs, e.g., in terms of program output, execution time, or (DNN) classification. The key challenge of differential software analysis is to simultaneously reason about multiple program paths, often across program variants.

This paper presents HyDiff, the first hybrid approach for differential software analysis. HyDiff integrates and extends two very successful testing techniques: Feedback-directed greybox fuzzing for efficient program testing and shadow symbolic execution for systematic program exploration. HyDiff extends greybox fuzzing with divergence-driven feedback based on novel cost metrics that also take into account the control flow graph of the program. Furthermore HyDiff extends shadow symbolic execution by applying four-way forking in a systematic exploration and still having the ability to incorporate concrete inputs in the analysis. HyDiff applies divergence revealing heuristics based on resource consumption and control-flow information to efficiently guide the symbolic exploration, which allows its efficient usage beyond regression testing applications. We introduce differential metrics such as output, decision and cost difference, as well as patch distance, to assist the fuzzing and symbolic execution components in maximizing the execution divergence.

We implemented our approach on top of the fuzzer AFL and the symbolic execution framework Symbolic PathFinder. We illustrate HyDiff on regression and side-channel analysis for Java bytecode programs, and further show how to use HyDiff for robustness analysis of neural networks.

SESSION: Testing 1

Seenomaly: vision-based linting of GUI animation effects against design-don't guidelines

GUI animations, such as card movement, menu slide in/out, snackbar display, provide appealing user experience and enhance the usability of mobile applications. These GUI animations should not violate the platform's UI design guidelines (referred to as design-don't guidelines in this work) regarding component motion and interaction, content appearing and disappearing, and elevation and shadow changes. However, none of the existing static code analysis, functional GUI testing and GUI image comparison techniques can "see" the GUI animations on the screen, and thus they cannot support the linting of GUI animations against design-don't guidelines. In this work, we formulate this GUI animation linting problem as a multi-class screencast classification task, but we do not have sufficient labeled GUI animations to train the classifier. Instead, we propose an unsupervised, computer-vision based adversarial autoencoder to solve this linting problem. Our autoencoder learns to group similar GUI animations by "seeing" lots of unlabeled real-application GUI animations and learning to generate them. As the first work of its kind, we build the datasets of synthetic and real-world GUI animations. Through experiments on these datasets, we systematically investigate the learning capability of our model and its effectiveness and practicality for linting GUI animations, and identify the challenges in this linting problem for future work.

Low-overhead deadlock prediction

Multithreaded programs can have deadlocks, even after deployment, so users may want to run deadlock tools on deployed programs. However, current deadlock predictors such as MagicLock and UnDead have large overheads that make them impractical for end-user deployment and confine their use to development time. Such overhead stems from running an exponential-time algorithm on a large execution trace. In this paper, we present the first low-overhead deadlock predictor, called AirLock, that is fit for both in-house testing and deployed programs. AirLock maintains a small predictive lock reachability graph, searches the graph for cycles, and runs an exponential-time algorithm only for each cycle. This approach lets AirLock find the same deadlocks as MagicLock and UnDead but with much less overhead because the number of cycles is small in practice. Our experiments with real-world benchmarks show that the average time overhead of AirLock is 3.5%, which is three orders of magnitude less than that of MagicLock and UnDead. AirLock's low overhead makes it suitable for use with fuzz testers like AFL and on-the-fly after deployment.
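
The core data structure can be sketched as a lock-order graph: whenever a thread acquires lock b while holding lock a, add an edge a -> b; a cycle is a candidate deadlock that would then warrant the more expensive predictive analysis. This is a simplified reconstruction, not AirLock's algorithm.

    def lock_order_edges(events):
        """Build edges a -> b whenever a thread acquires b while holding a."""
        held, edges = {}, set()
        for thread, op, lock in events:          # op is "acquire" or "release"
            owned = held.setdefault(thread, [])
            if op == "acquire":
                edges.update((h, lock) for h in owned)
                owned.append(lock)
            elif op == "release" and lock in owned:
                owned.remove(lock)
        return edges

    def has_cycle(edges):
        graph = {}
        for a, b in edges:
            graph.setdefault(a, set()).add(b)
        def reachable(src, dst, seen=()):
            return src == dst or any(n not in seen and reachable(n, dst, seen + (n,))
                                     for n in graph.get(src, ()))
        return any(reachable(b, a) for a, b in edges)

    trace = [("t1", "acquire", "A"), ("t1", "acquire", "B"), ("t1", "release", "B"),
             ("t1", "release", "A"), ("t2", "acquire", "B"), ("t2", "acquire", "A")]
    print(has_cycle(lock_order_edges(trace)))  # True: A -> B and B -> A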

SESSION: Android

An empirical assessment of security risks of global Android banking apps

Mobile banking apps, belonging to the most security-critical app category, render massive and dynamic transactions susceptible to security risks. Despite the huge potential financial loss that vulnerabilities can cause, existing research lacks a comprehensive empirical study on the security risks of global banking apps that could provide useful insights and help improve the security of banking apps.

Since data-related weaknesses in banking apps are critical and may directly cause serious financial loss, this paper first revisits the state-of-the-art available tools and finds that they have limited capability in identifying data-related security weaknesses of banking apps. To complement the capability of existing tools in data-related weakness detection, we propose a three-phase automated security risk assessment system, named Ausera, which leverages static program analysis techniques and sensitive keyword identification. By leveraging Ausera, we collect 2,157 weaknesses in 693 real-world banking apps across 83 countries, which we use as a basis to conduct a comprehensive empirical study from different aspects, such as global distribution and weakness evolution during version updates. We find that apps owned by subsidiary banks are always less secure than or equivalent to those owned by parent banks. In addition, we also track the patching of weaknesses and receive much positive feedback from banking entities so as to improve the security of banking apps in practice. We further find that weaknesses derived from outdated versions of banking apps or third-party libraries are highly prone to being exploited by attackers. To date, we highlight that 21 banks have confirmed the weaknesses we reported (including 126 weaknesses in total). We also exchange insights with 7 banks, such as HSBC in the UK and OCBC in Singapore, via in-person or online meetings to help them improve their apps. We hope that the insights developed in this paper will inform the communities about the gaps among multiple stakeholders, including banks, academic researchers, and third-party security companies.

Accessibility issues in Android apps: state of affairs, sentiments, and ways forward

Mobile apps are an integral component of our daily life. Ability to use mobile apps is important for everyone, but arguably even more so for approximately 15% of the world population with disabilities. This paper presents the results of a large-scale empirical study aimed at understanding accessibility of Android apps from three complementary perspectives. First, we analyze the prevalence of accessibility issues in over 1,000 Android apps. We find that almost all apps are riddled with accessibility issues, hindering their use by disabled people. We then investigate the developer sentiments through a survey aimed at understanding the root causes of so many accessibility issues. We find that in large part developers are unaware of accessibility design principles and analysis tools, and the organizations in which they are employed do not place a premium on accessibility. We finally investigate user ratings and comments on app stores. We find that due to the disproportionately small number of users with disabilities, user ratings and app popularity are not indicative of the extent of accessibility issues in apps. We conclude the paper with several observations that form the foundation for future research and development.

Collaborative bug finding for Android apps

Many automated test generation techniques have been proposed for finding crashes in Android apps. Despite recent advancement in these approaches, a study shows that Android app developers prefer reading test cases written in natural language. Meanwhile, there exist redundancies in bug reports (written in natural language) across different apps that have not been previously reused. We propose collaborative bug finding, a novel approach that uses bugs in other similar apps to discover bugs in the app under test. We design three settings with varying degrees of interactions between programmers: (1) bugs from programmers who develop a different app, (2) bugs from manually searching for bug reports in GitHub repositories, (3) bugs from a bug recommendation system, Bugine. Our studies of the first two settings in a software testing course show that collaborative bug finding helps students who are novice Android app testers to discover 17 new bugs. As students admit that searching for relevant bug reports could be time-consuming, we introduce Bugine, an approach that automatically recommends relevant GitHub issues for a given app. Bugine uses (1) natural language processing to find GitHub issues that mention common UI components shared between the app under test and other apps in our database, and (2) a ranking algorithm to select GitHub issues that are of the best quality. Our results show that Bugine is able to find 34 new bugs. In total, collaborative bug finding helps us find 51 new bugs, of which eight have been confirmed and 11 have been fixed by the developers. These results confirm our intuition that our proposed technique is useful in discovering new bugs for Android apps.

SESSION: Code summarization and analysis

POSIT: simultaneously tagging natural and programming languages

Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems --- traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to apply to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens, part-of-speech tags for natural language words, and predicts the source language of a token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the CLANG compiler and part-of-speech tags from the standard Stanford part-of-speech tagger. POSIT improves the state-of-the-art on language identification by 10.6% and PoS/AST tagging by 23.7% in accuracy.

CPC: automatically classifying and propagating natural language comments via program analysis

Code comments provide abundant information that has been leveraged to help perform various software engineering tasks, such as bug detection, specification inference, and code synthesis. However, developers are less motivated to write and update comments, making it infeasible and error-prone to leverage comments to facilitate software engineering tasks. In this paper, we propose to leverage program analysis to systematically derive, refine, and propagate comments. For example, by propagation via program analysis, comments can be passed on to code entities that are not commented such that code bugs can be detected leveraging the propagated comments. Developers usually comment on different aspects of code elements like methods, and use comments to describe various contents, such as functionalities and properties. To more effectively utilize comments, a fine-grained and elaborated taxonomy of comments and a reliable classifier to automatically categorize a comment are needed. In this paper, we build a comprehensive taxonomy and propose using program analysis to propagate comments. We develop a prototype CPC, and evaluate it on 5 projects. The evaluation results demonstrate that 41,573 new comments can be derived by propagation from other code locations with 88% accuracy. Among them, we can derive precise functional comments for 87 native methods that have neither existing comments nor source code. Leveraging the propagated comments, we detect 37 new bugs in large open-source projects, 30 of which have been confirmed and fixed by developers, and 304 defects in existing comments (by looking at inconsistencies between existing and propagated comments), including 12 incomplete comments and 292 wrong comments. This demonstrates the effectiveness of our approach. Our user study confirms propagated comments align well with existing comments in terms of quality.
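
Propagation can be sketched very simply: given a relation between code entities computed by program analysis (here, a hypothetical "delegates to" map), copy a comment from a commented method to related methods that lack one. Real propagation would also refine the comment text for the new context.

    def propagate_comments(comments, delegates_to):
        """Copy comments to uncommented methods along a 'delegates to' relation."""
        derived = {}
        for method, target in delegates_to.items():
            if method not in comments and target in comments:
                derived[method] = comments[target]   # inherit the target's comment
        return derived

    comments = {"List.add": "Appends the element to the end of this list."}
    delegates_to = {"ArrayList.add": "List.add"}
    print(propagate_comments(comments, delegates_to))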

Suggesting natural method names to check name consistencies

Misleading names of the methods in a project or the APIs in a software library confuse developers about program functionality and API usages, leading to API misuses and defects. In this paper, we introduce MNire, a machine learning approach to check the consistency between the name of a given method and its implementation. MNire first generates a candidate name and compares the current name against it. If the two names are sufficiently similar, we consider the method as consistent. To generate the method name, we draw our ideas and intuition from an empirical study on the nature of method names in a large dataset. Our key finding is that high proportions of the tokens of method names can be found in the three contexts of a given method including its body, the interface (the method's parameter types and return type), and the enclosing class' name. Even when such tokens are not there, MNire uses the contexts to predict the tokens due to the high likelihoods of their co-occurrences. Our unique idea is to treat the name generation as an abstract summarization on the tokens collected from the names of the program entities in the three above contexts.

We conducted several experiments to evaluate MNire in method name consistency checking and in method name recommending on large datasets with +14M methods. In detecting inconsistent method names, MNire relatively improves over the state-of-the-art approach by 10.4% in recall and 11% in precision. In method name recommendation, MNire improves relatively over the state-of-the-art technique, code2vec, in both recall (18.2% higher) and precision (11.1% higher). To assess MNire's usefulness, we used it to detect inconsistent methods and suggest new names in several active GitHub projects. We made 50 pull requests (PRs) and received 42 responses. Among them, five PRs were merged into the main branch, and 13 were approved for later merging. In total, in 31/42 cases, the developer teams agree that our suggested names are more meaningful than the current names, showing MNire's usefulness.
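
The underlying consistency intuition can be sketched without any learned model: split the method name into sub-tokens and check how many of them occur among the sub-tokens of the method's body identifiers, parameter and return types, and enclosing class name. The overlap threshold is our illustrative stand-in for MNire's learned generation-and-comparison step.

    import re

    def split_name(name):
        """Split an identifier into lower-cased sub-tokens."""
        return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", name)]

    def name_is_consistent(method_name, context_names, min_overlap=0.5):
        """Check how many name sub-tokens also occur in the method's context."""
        name_tokens = set(split_name(method_name))
        context_tokens = {t for n in context_names for t in split_name(n)}
        if not name_tokens:
            return True
        return len(name_tokens & context_tokens) / len(name_tokens) >= min_overlap

    # 'getUserName' against a context mentioning 'user' and 'name': consistent.
    print(name_is_consistent("getUserName", ["user", "String", "name", "Account"]))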

Retrieval-based neural source code summarization

Source code summarization aims to automatically generate concise summaries of source code in natural language texts, in order to help developers better understand and maintain source code. Traditional work generates a source code summary by utilizing information retrieval techniques, which select terms from original source code or adapt summaries of similar code snippets. Recent studies adopt Neural Machine Translation techniques and generate summaries from code snippets using encoder-decoder neural networks. The neural-based approaches prefer the high-frequency words in the corpus and have trouble with the low-frequency ones. In this paper, we propose a retrieval-based neural source code summarization approach where we enhance the neural model with the most similar code snippets retrieved from the training set. Our approach can take advantage of both neural and retrieval-based techniques. Specifically, we first train an attentional encoder-decoder model based on the code snippets and the summaries in the training set; second, given one input code snippet for testing, we retrieve its two most similar code snippets in the training set from the aspects of syntax and semantics, respectively; third, we encode the input and the two retrieved code snippets, and predict the summary by fusing them during decoding. We conduct extensive experiments to evaluate our approach and the experimental results show that our proposed approach can improve the state-of-the-art methods.
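
The retrieval step can be sketched with a simple bag-of-tokens similarity; the paper's retrieval considers syntax and semantics separately and fuses the retrieved snippets inside the decoder, which this sketch does not attempt.

    import math
    from collections import Counter

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0) for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def retrieve_most_similar(query_code, corpus):
        """Return the training snippet most similar to the query (bag of tokens)."""
        q = Counter(query_code.split())
        return max(corpus, key=lambda snippet: cosine(q, Counter(snippet.split())))

    corpus = ["int add(int a, int b) { return a + b; }",
              "void log(String msg) { System.out.println(msg); }"]
    print(retrieve_most_similar("int sum(int x, int y) { return x + y; }", corpus))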

SESSION: Machine learning and models

On learning meaningful assert statements for unit test cases

Software testing is an essential part of the software lifecycle and requires a substantial amount of time and effort. It has been estimated that software developers spend close to 50% of their time on testing the code they write. For these reasons, a long-standing goal within the research community is to (partially) automate software testing. While several techniques and tools have been proposed to automatically generate test methods, recent work has criticized the quality and usefulness of the assert statements they generate. Therefore, we employ a Neural Machine Translation (NMT) based approach called Atlas (AuTomatic Learning of Assert Statements) to automatically generate meaningful assert statements for test methods. Given a test method and a focal method (i.e., the main method under test), Atlas can predict a meaningful assert statement to assess the correctness of the focal method. We applied Atlas to thousands of test methods from GitHub projects and it was able to predict the exact assert statement manually written by developers in 31% of the cases when only considering the top-1 predicted assert. When considering the top-5 predicted assert statements, Atlas is able to predict exact matches in 50% of the cases. These promising results hint to the potential usefulness of our approach as (i) a complement to automatic test case generation techniques, and (ii) a code completion support for developers, who can benefit from the recommended assert statements while writing test code.

Quickly generating diverse valid test inputs with reinforcement learning

Property-based testing is a popular approach for validating the logic of a program. An effective property-based test quickly generates many diverse valid test inputs and runs them through a parameterized test driver. However, when the test driver requires strict validity constraints on the inputs, completely random input generation fails to generate enough valid inputs. Existing approaches to solving this problem rely on whitebox or greybox information collected by instrumenting the input generator and/or test driver. However, collecting such information reduces the speed at which tests can be executed. In this paper, we propose and study a black-box approach for generating valid test inputs. We first formalize the problem of guiding random input generators towards producing a diverse set of valid inputs. This formalization highlights the role of a guide which governs the space of choices within a random input generator. We then propose a solution based on reinforcement learning (RL), using a tabular, on-policy RL approach to guide the generator. We evaluate this approach, RLCheck, against pure random input generation as well as a state-of-the-art greybox evolutionary algorithm, on four real-world benchmarks. We find that in the same time budget, RLCheck generates an order of magnitude more diverse valid inputs than the baselines.
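
A minimal, hypothetical sketch of a tabular guide: every generator choice point is identified by a state key, the guide keeps a value per (state, choice) pair, picks choices epsilon-greedily, and rewards the choices made on a trial whenever the generated input turns out to be valid and previously unseen. RLCheck's actual on-policy update differs in detail.

    import random

    class TabularGuide:
        """Epsilon-greedy tabular guide for the choices of a random generator."""
        def __init__(self, epsilon=0.2):
            self.q = {}            # (state, choice) -> estimated value
            self.epsilon = epsilon
            self.trail = []        # choices taken during the current trial

        def choose(self, state, options):
            if random.random() < self.epsilon:
                choice = random.choice(options)
            else:
                choice = max(options, key=lambda o: self.q.get((state, o), 0.0))
            self.trail.append((state, choice))
            return choice

        def finish(self, reward, lr=0.5):
            """Reward = 1.0 if the generated input was valid and new, else 0.0."""
            for key in self.trail:
                old = self.q.get(key, 0.0)
                self.q[key] = old + lr * (reward - old)
            self.trail = []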

SESSION: Meta studies

An evidence-based inquiry into the use of grey literature in software engineering

Context: Following other scientific disciplines, such as health sciences, the use of Grey Literature (GL) has become widespread in Software Engineering (SE) research. Whilst the number of papers incorporating GL in SE is increasing, little is empirically known about different aspects of the use of GL in SE research.

Method: We used a mixed-methods approach for this research. We carried out a Systematic Literature Review (SLR) of the use of GL in SE, and surveyed the authors of the papers included in the SLR (as GL users) and invited experts in the SE community on the use of GL in SE research.

Results: We systematically selected and reviewed 102 SE secondary studies that incorporate GL in SE research, from which we identified two groups based on their reporting: 1) 76 reviews only claim their use of GL; 2) 26 reviews report results that include GL. We also obtained 20 replies from the GL users and 24 replies from the invited SE experts.

Conclusion: There is no common understanding of the meaning of GL in SE; researchers define the scope and definition of GL in a variety of ways. We found five main reasons for using GL in SE research. The findings have enabled us to propose a conceptual model for how GL works in the SE research lifecycle. There is an apparent need for research to develop guidelines for using GL in SE and for assessing the quality of GL. The current work provides a panorama of the state of the art of using GL in SE for follow-up research, helping to establish the position of GL in SE research.

SESSION: Performance

Towards the use of the readily available tests from the release pipeline as performance tests: are we there yet?

Performance is one of the important aspects of software quality. Performance issues exist widely in software systems, and fixing them is an essential step in the release cycle of software systems. Although performance testing is widely adopted in practice, it is still expensive and time-consuming. In particular, performance testing is usually conducted in a dedicated testing environment after the system is built. These challenges make performance testing difficult to fit into the common DevOps process in software development. On the other hand, a large number of tests are readily available and executed regularly within the release pipeline during software development. In this paper, we perform an exploratory study to determine whether such readily available tests are capable of serving as performance tests. In particular, we would like to see whether the performance of these tests can demonstrate the performance improvements obtained from fixing real-life performance issues. We collect 127 performance issues from Hadoop and Cassandra and evaluate the performance of the readily available tests on the commits before and after the performance issue fixes. We find that most of the improvements from fixes to performance issues can be demonstrated using the readily available tests in the release pipeline; however, only a very small portion of the tests can be used to demonstrate the improvements. By manually examining the tests, we identify eight reasons why a test may not demonstrate a performance improvement even though it covers the changed source code of the issue fix. Finally, we build random forest classifiers to determine the important metrics that influence whether the readily available tests can demonstrate performance improvements from issue fixes. We find that the test code itself and the source code covered by the test are important factors, while factors related to the code changes in the performance issue fixes have low importance. Practitioners may therefore focus on designing and improving the tests, rather than fine-tuning tests for individual performance issue fixes. Our findings can serve as a guideline for practitioners to reduce the effort spent on leveraging and designing tests that run in the release pipeline for performance assurance activities.
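A minimal sketch of the kind of classifier described, using scikit-learn. The metric names and the toy rows are hypothetical placeholders, not the study's features or data; the point is only how feature importances are obtained from a random forest fit on test- and source-code metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical metrics per test: [test_LOC, covered_source_LOC, num_assertions, fix_churn]
X = np.array([[120, 800, 5, 30],
              [40, 150, 1, 10],
              [300, 2000, 12, 45],
              [60, 400, 3, 5]])
y = np.array([1, 0, 1, 0])  # 1 = test demonstrated the performance improvement

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(["test_LOC", "covered_source_LOC", "num_assertions", "fix_churn"],
                     clf.feature_importances_):
    print(name, round(imp, 3))
```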

SESSION: Software verification

Verifying object construction

In object-oriented languages, constructors often have a combination of required and optional formal parameters. It is tedious and inconvenient for programmers to write a constructor by hand for each combination. The multitude of constructors is error-prone for clients, and client code is difficult to read due to the large number of constructor arguments. Therefore, programmers often use design patterns that enable more flexible object construction---the builder pattern, dependency injection, or factory methods.

However, these design patterns can be too flexible: not all combinations of logical parameters lead to the construction of well-formed objects. When a client uses the builder pattern to construct an object, the compiler does not check that a valid set of values was provided. Incorrect use of builders can lead to security vulnerabilities, run-time crashes, and other problems.
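A minimal, hypothetical sketch of the problem in Python (the paper targets statically typed object-oriented code): a builder accepts an incomplete set of logical parameters, and the omission of a required one is only detected at run time. All names below are invented for illustration.

```python
# Hypothetical builder: "host" is a required logical parameter, "port" is optional.
class ServerConfigBuilder:
    def __init__(self):
        self._host = None
        self._port = 8080

    def host(self, value):
        self._host = value
        return self

    def port(self, value):
        self._port = value
        return self

    def build(self):
        # The check happens only at run time; a static accumulation analysis would
        # flag call sequences on which build() is reachable without host() first.
        if self._host is None:
            raise ValueError("required parameter 'host' was never set")
        return (self._host, self._port)

ok = ServerConfigBuilder().host("example.org").port(443).build()
bad = ServerConfigBuilder().port(443).build()  # accepted by the language, fails at run time
```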

This work shows how to statically verify uses of object construction, such as the builder pattern. Using a simple specification language, programmers specify which combinations of logical arguments are permitted. Our compile-time analysis detects client code that may construct objects unsafely. Our analysis is based on a novel special case of typestate checking, accumulation analysis, that modularly reasons about accumulations of method calls. Because accumulation analysis does not require precise aliasing information for soundness, our analysis scales to industrial programs. We evaluated it on over 9 million lines of code, discovering defects which included previously-unknown security vulnerabilities and potential null-pointer violations in heavily-used open-source codebases. Our analysis has a low false positive rate and low annotation burden.

Our implementation and experimental data are publicly available.

Automatically testing string solvers

SMT solvers form the basis of many applications, such as program verification, program synthesis, and test case generation. For all these applications to provide reliable results, SMT solvers must answer queries correctly. However, since they are complex, highly optimized software systems, ensuring their correctness is challenging. In particular, state-of-the-art testing techniques do not reliably detect when an SMT solver is unsound.

In this paper, we present an automatic approach for generating test cases that reveal soundness errors in the implementations of string solvers, as well as potential completeness and performance issues. We synthesize input formulas that are satisfiable or unsatisfiable by construction and use this ground truth as test oracle. We automatically apply satisfiability-preserving transformations to generate increasingly-complex formulas, which allows us to detect many errors with simple inputs and, thus, facilitates debugging.
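A minimal sketch of the satisfiable-by-construction idea using Z3's Python bindings: the witness is chosen first, the formula is built to hold for it, and the solver's answer is checked against that ground truth. The single length constraint added here is a simplified stand-in for the paper's satisfiability-preserving transformations, and the formula itself is a hypothetical example.

```python
from z3 import String, StringVal, Concat, Length, Solver, sat

# Pick the witness first, then build a formula that it satisfies by construction.
witness = "ab"
x = String("x")
base = Concat(StringVal("a"), x) == StringVal("a" + witness)

s = Solver()
s.add(base)
# A simple satisfiability-preserving strengthening that the witness still satisfies.
s.add(Length(x) == len(witness))

result = s.check()
# Ground-truth oracle: any answer other than "sat" indicates a soundness bug
# (and "unknown" or a timeout points to a completeness or performance issue).
assert result == sat, f"unexpected solver answer: {result}"
print(s.model())
```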

The experimental evaluation shows that our technique effectively reveals bugs in the implementation of widely-used SMT solvers and applies also to other types of solvers, such as automata-based solvers. We focus on strings here, but our approach carries over to other theories and their combinations.

SESSION: Testing 2

A study on the lifecycle of flaky tests

During regression testing, developers rely on the pass or fail outcomes of tests to check whether changes broke existing functionality. Thus, flaky tests, which nondeterministically pass or fail on the same code, are problematic because they provide misleading signals during regression testing. Although flaky tests are the focus of several existing studies, none of them study (1) the reoccurrence, runtimes, and time-before-fix of flaky tests, and (2) flaky tests in-depth on proprietary projects.

This paper fills this knowledge gap about flaky tests and investigates whether prior categorization work on flaky tests also applies to proprietary projects. Specifically, we study the lifecycle of flaky tests in six large-scale proprietary projects at Microsoft. We find, as in prior work, that asynchronous calls are the leading cause of flaky tests in these Microsoft projects. Therefore, we propose the first automated solution, called Flakiness and Time Balancer (FaTB), to reduce the frequency of flaky-test failures caused by asynchronous calls. Our evaluation of five such flaky tests shows that FaTB can reduce the running times of these tests by up to 78% without empirically affecting the frequency of their flaky-test failures. Lastly, our study finds several cases where developers claim they "fixed" a flaky test but our empirical experiments show that their changes do not fix or reduce these tests' frequency of flaky-test failures. Future studies should be more cautious when basing their results on changes that developers claim to be "fixes".
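For illustration only, a toy example of the kind of asynchronous-call flakiness at issue and of a bounded wait whose length can be tuned. This is not FaTB itself; the thread, timings, and assertions below are invented.

```python
import random
import threading
import time

result = {}

def async_work():
    time.sleep(random.uniform(0.0, 0.1))   # nondeterministic completion time
    result["value"] = 42

threading.Thread(target=async_work).start()

# Flaky variant: a fixed sleep races the asynchronous call; too short and the
# assertion fails, and padding it enough to be safe slows every run.
#   time.sleep(0.01); assert result.get("value") == 42

# Bounded polling waits only as long as needed, up to a deadline, which is the
# kind of wait whose length can be adjusted without changing how often the test flakes.
deadline = time.time() + 1.0
while "value" not in result and time.time() < deadline:
    time.sleep(0.005)
assert result.get("value") == 42
```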

Testing file system implementations on layered models

Generating high-quality system call sequences is not only important for testing file system implementations but also challenging due to the astronomically large input space. This paper introduces a new approach to the workload generation problem that builds layered models and refines abstract workloads. The approach is instantiated as a three-layer file system model for file system workload generation. In a short experimental run, the generated sequential workloads (system call sequences) manifested over a thousand crashes in mainline Linux kernel file systems, and 12 previously unknown bugs were reported. We also provide evidence that such workloads benefit other domain-specific testing techniques, including crash consistency testing and concurrency testing.
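A toy sketch of workload generation from a small abstract model: a fixed set of paths and operations from which sequential system-call workloads are drawn. The paths and operation set are assumptions for illustration; the paper's three-layer model and refinement of abstract workloads are far richer.

```python
import random

# Abstract model: a handful of paths and the syscalls that may act on them.
PATHS = ["/mnt/a", "/mnt/b", "/mnt/dir/c"]
OPS = ["creat", "write", "fsync", "rename", "unlink"]

def generate_workload(length=10, seed=None):
    """Produce one sequential workload: a list of (syscall, args) pairs."""
    rng = random.Random(seed)
    workload = []
    for _ in range(length):
        op = rng.choice(OPS)
        if op == "rename":
            src, dst = rng.sample(PATHS, 2)
            workload.append((op, (src, dst)))
        else:
            workload.append((op, (rng.choice(PATHS),)))
    return workload

for call, args in generate_workload(seed=0):
    print(call, *args)
```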

SESSION: Code generation

Co-evolving code with evolving metamodels

Metamodels play a significant role in describing and analyzing the relations between domain concepts. They are also the cornerstone for building a software language (SL) for a domain and its associated tooling. The metamodel definition generally drives code generation of a core API, which developers further enrich with additional code implementing advanced functionalities, e.g., checkers, recommenders, etc. When an SL is evolved to its next version, the metamodels are evolved as well before the core API code is re-generated. As a result, the code developers added to both the core API and the SL tooling may be impacted and thus may need to be co-evolved accordingly. Many approaches support the co-evolution of various artifacts when metamodels evolve; however, none supports the co-evolution of code. This paper fills this gap. We propose a semi-automatic co-evolution approach based on change propagation. The premise is that knowledge of the metamodel evolution changes can be propagated by means of resolutions that drive the code co-evolution. Our approach leverages the abstraction level of metamodels, where a given metamodel element often has different usages in the code, and it supports alternative co-evolutions to meet different developers' needs. Our work is evaluated on three Eclipse SL implementations, namely OCL, Modisco, and Papyrus, over several evolved versions of metamodels and code. In response to five different evolved metamodels, we co-evolved 976 impacts over 18 projects. A comparison of our co-evolved code with the versioned one shows the usefulness of our approach: it reached a weighted average precision of 87.4% and recall of 88.9%, while supporting useful alternative co-evolutions that developers had manually performed.

SESSION: Dependencies and configuration

Lazy product discovery in huge configuration spaces

Highly-configurable software systems can have thousands of interdependent configuration options across different subsystems. In the resulting configuration space, discovering a valid product configuration for some selected options can be complex and error prone. The configuration space can be organized using a feature model, fragmented into smaller interdependent feature models reflecting the configuration options of each subsystem.

We propose a method for lazy product discovery in large fragmented feature models with interdependent features. We formalize the method and prove its soundness and completeness. The evaluation explores an industrial-size configuration space. The results show that lazy product discovery has significant performance benefits compared to standard product discovery, which in contrast to our method requires all fragments to be composed to analyze the feature model. Furthermore, the method succeeds when more efficient, heuristics-based engines fail to find a valid configuration.
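A very small sketch of the lazy idea: only the feature-model fragments that mention a selected feature are composed before searching for a valid configuration. The fragments, features, and constraints are hypothetical, the search is brute force, and fragments needed only transitively are not pulled in here, all of which the real method handles properly.

```python
from itertools import product

# Hypothetical fragmented feature model: each fragment lists its features and a
# constraint over a configuration (dict of feature -> bool).
FRAGMENTS = {
    "kernel": (["smp", "numa"], lambda c: (not c["numa"]) or c["smp"]),      # numa requires smp
    "fs":     (["btrfs", "compress"], lambda c: (not c["compress"]) or c["btrfs"]),
    "net":    (["ipv6", "tso"], lambda c: True),
}

def discover(selected):
    """Compose only the fragments mentioning a selected feature, then search for a
    configuration that enables the selection and satisfies those fragments."""
    needed = {n: frag for n, frag in FRAGMENTS.items() if set(frag[0]) & set(selected)}
    features = sorted({f for feats, _ in needed.values() for f in feats})
    for values in product([True, False], repeat=len(features)):
        cfg = dict(zip(features, values))
        if all(cfg[f] for f in selected) and all(rule(cfg) for _, rule in needed.values()):
            return cfg
    return None

# Selecting "numa" composes only the "kernel" fragment, not "fs" or "net".
print(discover(["numa"]))
```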

Reducing run-time adaptation space via analysis of possible utility bounds

Self-adaptive systems often employ dynamic programming or similar techniques to select optimal adaptations at run time. These techniques suffer from the "curse of dimensionality", increasing the cost of run-time adaptation decisions. We propose a novel approach that improves upon state-of-the-art proactive self-adaptation techniques by reducing the number of possible adaptations that need to be considered for each run-time adaptation decision. The approach, realized in a tool called Thallium, employs a combination of automated formal modeling techniques to (i) analyze a structural model of the system showing which configurations are reachable from other configurations and (ii) compute the utility that can be generated by the optimal adaptation over a bounded horizon in both the best- and worst-case scenarios. It then constructs triangular possibility values using those optimized bounds to automatically compare adjacent adaptations for each configuration, keeping only the alternatives with the best range of potential results. The experimental results corroborate Thallium's ability to significantly reduce the number of states that need to be considered with each adaptation decision, freeing up vital resources at run time.
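A small sketch of the bound-based pruning idea only: an adaptation alternative whose best-case utility cannot beat some other alternative's worst case can never be the better choice and is dropped. The candidate names and utility bounds are hypothetical, and the triangular possibility values and formal models Thallium uses are not reproduced.

```python
# Each candidate adaptation carries precomputed utility bounds over a bounded
# horizon: (worst_case, best_case). Values below are hypothetical.
candidates = {
    "scale_out":  (10.0, 30.0),
    "scale_up":   (12.0, 25.0),
    "do_nothing": (2.0,  9.0),
    "migrate":    (8.0,  28.0),
}

def prune_dominated(cands):
    """Drop any alternative whose best-case utility does not exceed the worst-case
    utility of some other alternative (it can never be the better choice)."""
    kept = {}
    for name, (lo, hi) in cands.items():
        dominated = any(other_lo >= hi
                        for other, (other_lo, _) in cands.items() if other != name)
        if not dominated:
            kept[name] = (lo, hi)
    return kept

print(prune_dominated(candidates))  # "do_nothing" is pruned: its best case (9.0) <= scale_out's worst case (10.0)
```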

SESSION: Recommendations

Context-aware in-process crowdworker recommendation

Identifying and optimizing open participation is essential to the success of open software development. Existing studies have highlighted the importance of worker recommendation for crowdtesting tasks in order to detect more bugs with fewer workers. However, these studies mainly focus on one-time recommendations based on the initial context at the beginning of a new task. This paper argues for in-process crowdtesting worker recommendation. We motivate this study through a pilot study revealing the prevalence of long non-yielding windows, i.e., stretches of consecutive test reports in which no new bugs are revealed during a crowdtesting task. This indicates a potential opportunity for accelerating crowdtesting by recommending appropriate workers in a dynamic manner, so that the non-yielding windows can be shortened.

To that end, this paper proposes a context-aware in-process crowdworker recommendation approach, iRec, to detect more bugs earlier and potentially shorten the non-yielding windows. It consists of three main components: 1) modeling of the dynamic testing context, 2) a learning-based ranking component, and 3) a diversity-based re-ranking component. The evaluation is conducted on 636 crowdtesting tasks from one of the largest crowdtesting platforms, and the results show the potential of iRec to improve the cost-effectiveness of crowdtesting by reducing cost and shortening the testing process.
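A minimal sketch in the spirit of the diversity-based re-ranking component: a greedy, MMR-style pass trades each worker's relevance score (assumed to come from a ranking step) against expertise overlap with already selected workers. The worker names, scores, tags, and trade-off are assumptions for illustration; iRec's actual context model and features differ.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def rerank(workers, k=3, trade_off=0.7):
    """Greedy re-ranking: balance relevance against overlap with selected workers."""
    selected = []
    remaining = dict(workers)  # name -> (relevance score, expertise tags)
    while remaining and len(selected) < k:
        def gain(name):
            score, tags = remaining[name]
            overlap = max((jaccard(tags, workers[s][1]) for s in selected), default=0.0)
            return trade_off * score - (1 - trade_off) * overlap
        best = max(remaining, key=gain)
        selected.append(best)
        del remaining[best]
    return selected

# Hypothetical candidates.
workers = {
    "w1": (0.9, {"android", "ui"}),
    "w2": (0.85, {"android", "ui"}),
    "w3": (0.6, {"network"}),
}
print(rerank(workers))  # w2 is demoted because it duplicates w1's expertise
```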

SESSION: Security

A large-scale empirical study on vulnerability distribution within projects and the lessons learned

The number of vulnerabilities has increased rapidly in recent years, owing to advances in vulnerability discovery solutions. This enables a thorough analysis of vulnerability distribution and provides support for correlation analysis and prediction of vulnerabilities. Previous research either focuses on analyzing bugs rather than vulnerabilities, or studies only the general distribution of vulnerabilities among projects rather than the distribution within each project. In this paper, we collected a large vulnerability dataset, consisting of all known vulnerabilities associated with five representative open source projects, by utilizing automated crawlers and months of manual effort. We then analyzed the vulnerability distribution within each project over four dimensions: files, functions, vulnerability types, and responsible developers. Based on this analysis, we present 12 practical insights on the distribution of vulnerabilities. Finally, we applied these insights to several vulnerability discovery solutions (including static analysis and dynamic fuzzing) and helped them find 10 zero-day vulnerabilities in the target projects, showing that our insights are useful.

Unsuccessful story about few shot malware family classification and siamese network to the rescue

To battle the ever-increasing Android malware, malware family classification, which groups malware with common features into a malware family, has been proposed as an effective malware analysis method. Several machine-learning based approaches have been proposed for the task of malware family classification. Our study shows that malware families suffer from data imbalance, with many families containing only a small number of malware applications (referred to as few-shot malware families in this work). Unfortunately, this issue has been overlooked in existing approaches. Although existing approaches achieve high classification performance at the overall level and for large malware families, our experiments show that they suffer from poor performance and generalizability for few-shot malware families, and traditional downsampling methods cannot solve the problem. To address the challenge of few-shot malware family classification, we propose a novel siamese-network based learning method, which allows us to train an effective MultiLayer Perceptron (MLP) network for embedding malware applications into a real-valued, continuous vector space by contrasting malware applications from the same or different families. In the embedding space, the performance of malware family classification can be significantly improved for all scales of malware families, especially few-shot ones, which also leads to a significant performance improvement at the overall level.
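A minimal sketch of a siamese MLP trained with a contrastive loss, assuming malware applications are already represented as fixed-length feature vectors. The network shape, dimensions, margin, and the random toy batch are assumptions for illustration; the paper's feature extraction and training setup are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingMLP(nn.Module):
    """Maps an application's feature vector into a continuous embedding space."""
    def __init__(self, in_dim=128, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, same_family, margin=1.0):
    """Pull embeddings of same-family pairs together, push other pairs apart."""
    dist = F.pairwise_distance(z1, z2)
    pos = same_family * dist.pow(2)
    neg = (1 - same_family) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

model = EmbeddingMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: random feature vectors and pair labels (1 = same family, 0 = different).
x1, x2 = torch.randn(16, 128), torch.randn(16, 128)
same = torch.randint(0, 2, (16,)).float()

for _ in range(10):
    opt.zero_grad()
    loss = contrastive_loss(model(x1), model(x2), same)
    loss.backward()
    opt.step()
```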

How does misconfiguration of analytic services compromise mobile privacy?

Mobile application (app) developers commonly utilize analytic services to analyze their app users' behavior to support debugging, improve service quality, and facilitate advertising. Anonymization and aggregation can reduce the sensitivity of such behavioral data, and analytic services therefore often encourage the use of such protections. However, these protections are not directly enforced, so it is possible for developers to misconfigure the analytic services and expose personal information, which may cause greater privacy risks. Since people use apps in many aspects of their daily lives, such misconfigurations may lead to the leaking of sensitive personal information such as a user's real-time location, health data, or dating preferences. To study this issue and identify potential privacy risks due to such misconfigurations, we developed a semi-automated approach, Privacy-Aware Analytics Misconfiguration Detector (PAMDroid), which enables our empirical study of misconfigurations of analytic services. This paper describes a study of 1,000 popular apps using top analytic services, in which we found misconfigurations in 120 apps. In 52 of the 120 apps, the misconfigurations led to a violation of either the analytic service providers' terms of service or the app's own privacy policy.

SESSION: Stack overflow

Interpreting cloud computer vision pain-points: a mining study of stack overflow

Intelligent services are becoming increasingly pervasive; application developers want to leverage the latest advances in areas such as computer vision to provide new services and products to users, and large technology firms enable this via RESTful APIs. While such APIs promise easy-to-integrate, on-demand machine intelligence, their current design, documentation, and developer interface hide much of the underlying machine learning techniques that power them. Such APIs look and feel like conventional APIs but abstract away data-driven probabilistic behaviour; the implications of a developer treating these APIs in the same way as other, traditional cloud services, such as cloud storage, are of concern. The objective of this study is to determine the various pain-points developers face when implementing systems that rely on the most mature of these intelligent services, specifically those that provide computer vision. We use Stack Overflow to mine indications of the frustrations that developers appear to face when using computer vision services, classifying their questions against two recent classification taxonomies (documentation-related and general questions). We find that, unlike in mature fields such as mobile development, the types of questions developers ask indicate a shallow understanding of the underlying technology that empowers such systems. We discuss several implications of these findings through the lens of learning taxonomies, suggest how the software engineering community can improve these services, and comment on the way developers use them.